By Erik Gfesser — Mar 7, 2017

New Book Review: "Introduction to Apache Flink"

New book review for Introduction to Apache Flink: Stream Processing for Real Time and Beyond, by Ellen Friedman and Kostas Tzoumas, O'Reilly, 2016, reposted here:

Good effort on the first (and currently only) book available on Apache Flink. As the authors comment in the introductory pages, the purpose of this book is to investigate potential advantages of working with data streams in order to help readers determine whether a stream-based approach is an architecturally good fit for meeting business goals. Additionally, this book is intended to help its audience understand the technology behind Flink and how it tackles stream processing challenges.

For some readers, it is important to note that this book is conceptual in nature and does not provide any programmatic content. While I recently attended a Flink meetup event in which the presenter indicated they had significant difficulty figuring out how to use Flink in its early days over the past year or so, using the web documentation provided by the project should be considered the next logical step after understanding the underlying concepts and applicable use cases.

After discussing data streaming and the consequences of not streaming well, the authors present introductory material on the goals for processing continuous event data, the evolution of stream processing technologies, an overview of the advantages and limitations of Lambda architecture, and comparisons between Flink, Storm, and Spark Streaming, followed by discussions of the hows and whys behind Flink handling of both batch and stream processing via the DataSet API and DataStream API, as well as working with streaming data in general, regardless of chosen product.

The second chapter continues this discussion, delving deeper by taking a look at stream-first architectures in comparison to traditional architectures that attempt to maintain state across distributed systems, with the reminder that usage is not limited to low-latency use cases. The two main types of components, message transport and stream processor, are then explained, typically referring to Apache Kafka as the former and Flink as the latter, although the authors do later periodically mention MapR Streams when it offers functionality not currently provided by Flink (e.g. geo-distributed stream replication).

The focus of the third chapter is a discussion of the different types of correctness and what Flink provides in this context. One of the first questions the authors ask about is the level at which one's processing framework enables computational window fit for web activity analytics to actual user behavior. As explained, it is difficult to use micro-batches or fixed computational windows such as these do not overlap naturally occurring sessions. Flink enables more flexible definitions of these windows, for example, but taking inactivity into account. In addition, Flink handles event time in addition to traditional processing time. The authors provide a peek into the discussions of these topics in the following chapter, and explain how Flink use of checkpoints enable fault tolerance.

The fourth chapter turns its focus to handling time, and explain at the outset that one crucial difference between programming applications for a stream processor and programming applications for a batch processor (such as MapReduce) is the need to explicitly handle time in the former. Companies that use Hadoop typically have several pipelines running in their clusters which make use of a tool like Apache Flume and batch jobs scheduled by a scheduler for analyses. However, the authors explain that while this architecture can be made to work, there are several problems with it: too many moving parts, implicit treatment of time, inaccurate early alerts, out of order events, and unclear batch boundaries.

Use of a streaming architecture reduces complexity. An approach that uses Kafka and Flink treats the never-ending stream of incoming events as a stream rather than artificial segments, and encodes the definition of time in the application code rather than spreading this definition across ingestion, compuatation, and scheduling. While the authors discuss the concept of micro-matching and how this is implemented differently across tools, they explain that developers should not be concerned about whether this is being done, but whether out-of-order streams, sessions, and other misaligned windows can be handled, whether early alerts and accurate aggregates can be provided, and whether past data can be deterministically replayed.

Containing about 30% of the content, the fifth chapter is the longest. After explaining the differences between stateful and stateless computation, the authors explain that while the most interesting applications of stream processing are stateful, implementations are also much more challenging. The remainder of the chapter focuses on this aspect of these technologies, first by explaining the three different levels of consistency in the stream processing world with which readers will probably already be familiar from other readings, followed by a brief tangent about the history of earlier tooling, Flink use of checkpoints to provide "exactly once" consistency and savepoints to manage versions of state, an explanation of end-to-end consistency, and benchmarks.

The authors close the discussion with a chapter on batch processing, which they argue is really just a special case of streaming. Flink can process data both as a continuous unbounded stream or as bounded streams (i.e. batch), making use of the DataStream API or DataSet API with the same backend stream processing engine. The final use case that is presented compares processing time results using MapReduce 2.71, Tez 0.7.0, Spark 1.5.1, and Flink 0.9.1 for both TeraSort and HashJoin. Overall, a good presentation in a freely available report.

Subscribe to Erik on Software