New Book Review: "Fast Data Architectures for Streaming Applications"
New book review for "Fast Data Architectures for Streaming Applications: Getting Answers Now from Data Sets that Never End", by Dean Wampler, PhD (O'Reilly, 2016), reposted here:
This brief 40-page read introduces the concepts of streaming, event logs and message queues, and working with infinite data sets, and offers considerations and recommendations for applying them in real-world systems. In his introduction, Wampler (co-organizer of the local Chicago-area Hadoop User Group) asks why the industry started with batch processing rather than stream processing of big data, and comments that streaming (fast data) architectures are much harder to build than batch-mode architectures when trying to gain control of ballooning data sets. Streaming goes far beyond simply making batch systems run faster or more frequently: it brings new semantics for analytics as well as new operational challenges.
Since there are so many streaming systems and streaming methods, Wampler narrows his focus to a representative sample of systems and a reference architecture that covers most of the essential features. He then walks the reader through an introductory overview of this architecture, diving into some of the details in later sections. Before doing so, however, he briefly addresses the Lambda Architecture, commenting that he sees it as an important transitional step toward streaming architectures that use the same infrastructure for the batch and stream layers, with batch processing actually a subset of stream processing (although I find it odd that the Kappa Architecture is never mentioned by name).
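To make the "batch is a subset of streaming" idea concrete, here is a minimal sketch of my own (not taken from the book, and not tied to any of the systems Wampler covers). It assumes events are simple `(timestamp, key)` pairs arriving in timestamp order; the names `windowed_counts` and `endless_events` are hypothetical. The same tumbling-window function is applied to a finite list (the batch case) and to a never-ending generator (the streaming case).

```python
from collections import Counter
from itertools import count, islice
from typing import Dict, Iterable, Iterator, Tuple

Event = Tuple[int, str]  # (timestamp, key); an assumed, simplified event shape


def windowed_counts(events: Iterable[Event],
                    window_size: int) -> Iterator[Tuple[int, Dict[str, int]]]:
    """Yield (window_start, counts_by_key) for each tumbling window.

    Assumes events arrive in timestamp order; real systems also have to
    deal with late and out-of-order data, which this sketch ignores.
    """
    current_start = None
    counts: Counter = Counter()
    for ts, key in events:
        start = ts - ts % window_size
        if current_start is None:
            current_start = start
        if start != current_start:
            # An event from a later window closes the current one.
            yield current_start, dict(counts)
            current_start, counts = start, Counter()
        counts[key] += 1
    if current_start is not None:
        # Only reached when the input ends, i.e. in the bounded (batch) case.
        yield current_start, dict(counts)


def endless_events() -> Iterator[Event]:
    """A hypothetical infinite source: one event per tick, alternating keys."""
    for ts in count():
        yield (ts, "a" if ts % 2 == 0 else "b")


# Batch: a bounded data set is just a stream that happens to end.
batch = [(0, "a"), (1, "b"), (5, "a"), (11, "a")]
batch_result = list(windowed_counts(batch, window_size=10))

# Streaming: the same function over an infinite source; take two windows.
stream_result = list(islice(windowed_counts(endless_events(), window_size=10), 2))

print(batch_result)
print(stream_result)
```

The only batch-specific behavior is the final flush when the input runs out, which is one way to read the claim that batch processing is a special case of stream processing.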
The author rightly comments that the tooling in this space is moving quickly, so keep in mind that this freely available text was published in 2016. That said, because the discussion stays at a high level, the overviews of event logs, message queues, and analysis of infinite data sets will likely remain relevant for some time to come. This text is recommended mainly to those new to this space, as all of the covered material can be found elsewhere (the author cites some relevant blog posts). Note that the author's reference architecture includes Apache Flink for stream processing; O'Reilly offers a good discussion of this tool in another freely available text, "Introduction to Apache Flink: Stream Processing for Real Time and Beyond" (see my review).