New Book Review: "Building Real-Time Data Pipelines"
New book review for Building Real-Time Data Pipelines: Unifying Applications and Analytics with In-Memory Architectures, by Conor Doherty, Gary Orenstein, Steven Camina, and Kevin White, O'Reilly, 2015, reposted here:
Most readers are likely already aware of the benefits of soft real-time data, but the authors provide a reminder in their introduction that there are three basic ways to win: you have something or know something, you are more intelligent, or you can process information faster. This book concentrates on use cases for in-memory databases (IMDBMS) such as HTAP (Hybrid Transactional/Analytical Processing), which address the last of these categories. A number of solutions already exist for working with transactions and analytics independently of one another, such as stream processing frameworks for the former, and columnar databases or data warehouses for the latter, but HTAP provides a combination of low data latency and low query latency in a single solution, enabling new applications and real-time data pipelines.
As the authors explain, modern workloads require database systems to ingest and process data in real time, generate reports over changing datasets, detect anomalies as events take place, and respond with subsecond timing. HTAP-capable systems can meet all of these needs, and essentially bypass batched ETL processes. Although systems have traditionally been designed to focus on transactions as atomic units of work, a new model is emerging. This new model comes into play when the aggregate of all transactions is critical to understanding the shape of the business, and the authors argue that in-memory databases are needed to keep up with both the volume of real-time data and the interest in understanding this data in real time.
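To make the idea of bypassing batched ETL concrete, here is a minimal sketch that runs transactional writes and an analytical query against the same in-memory store. SQLite's in-memory mode is used only as a convenient stand-in and is not an HTAP system; the `orders` table, its columns, and the `run_htap_sketch` function are hypothetical names chosen for illustration, not taken from the book.

```python
import sqlite3


def run_htap_sketch():
    """Interleave committed writes with an aggregate query on one in-memory store."""
    # ":memory:" keeps the whole database in RAM (assumption: a toy stand-in only).
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE orders (id INTEGER PRIMARY KEY, region TEXT, amount REAL)"
    )

    incoming = [("east", 10.0), ("west", 25.0), ("east", 5.0)]
    running_totals = []
    for region, amount in incoming:
        # Transactional side: each order is committed as its own unit of work.
        with conn:
            conn.execute(
                "INSERT INTO orders (region, amount) VALUES (?, ?)",
                (region, amount),
            )
        # Analytical side: the aggregate sees the write immediately,
        # with no export or batch load into a separate warehouse.
        total = conn.execute(
            "SELECT SUM(amount) FROM orders WHERE region = 'east'"
        ).fetchone()[0]
        running_totals.append(total)

    conn.close()
    return running_totals


if __name__ == "__main__":
    print(run_htap_sketch())
```

The point of the sketch is only that the analytical query reflects each transaction as soon as it commits; a real HTAP system adds distribution, durability, and concurrency concerns that this example ignores.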
Architectural principles of modern in-memory databases include the ability to accept transactions directly into memory, the ability to easily add CPU horsepower and memory to a cluster, the ability to support interactive analytics and semi-structured data via relational and multimodel support, and the ability to use multiple types of storage media, such as integrated disk or flash, for longer term usage. The authors explain that in-memory approaches generally fit into three categories: memory after, memory before, and memory optimized. Each of these approaches designates where the database stores active data in its primary format. Memory-after architectures, for example, typically commit transactions directly to disk, then quickly stage transactions into memory afterward, and memory-optimized architectures follow the reverse sequence of events. Each of these approaches has its benefits and drawbacks.
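The difference in commit ordering described above can be sketched in a few lines. This is a toy model under loud assumptions: plain Python lists stand in for disk and memory, and the class names `MemoryAfterStore` and `MemoryOptimizedStore`, along with the explicit `flush` step, are hypothetical and do not correspond to any real database API.

```python
class MemoryAfterStore:
    """Commit to disk first, then stage the transaction into memory."""

    def __init__(self):
        self.disk = []      # stand-in for durable storage
        self.memory = []    # stand-in for the in-memory copy
        self.steps = []     # records the order in which tiers were written

    def commit(self, txn):
        self.disk.append(txn)
        self.steps.append("disk")
        self.memory.append(txn)
        self.steps.append("memory")


class MemoryOptimizedStore:
    """Commit to memory first, then persist to disk afterward."""

    def __init__(self):
        self.disk = []
        self.memory = []
        self.pending = []   # transactions not yet persisted (assumption)
        self.steps = []

    def commit(self, txn):
        self.memory.append(txn)
        self.steps.append("memory")
        self.pending.append(txn)

    def flush(self):
        # Persist everything accepted since the last flush.
        self.disk.extend(self.pending)
        self.pending.clear()
        self.steps.append("disk")


if __name__ == "__main__":
    after = MemoryAfterStore()
    after.commit("t1")
    print(after.steps)

    optimized = MemoryOptimizedStore()
    optimized.commit("t1")
    optimized.flush()
    print(optimized.steps)
```

In the second class a transaction is queryable from memory before it reaches disk, which is where the trade-off between ingest speed and durability that the authors discuss comes from.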
Leading up to their discussion on real-time pipelines and converged processing, the authors provide a practical example of real-time display ad optimization. In the scenario provided, both the transactional and analytical components of the application must complete in the time it takes a page to load, ideally taking recent data into account, and doing so is challenging as long as siloed data warehouses remain in use. During this discussion, about one-third of the way into the text, Apache Kafka, Apache Spark Streaming, and MemSQL (a combination dubbed the "Real-Time Trinity") are introduced as one example of a common real-time pipeline configuration.
The authors provide a helpful reminder toward the conclusion of this example that stream processing tools are intended to be temporary data stores, ingesting and holding data for some limited time period such as an hour or a day. Because of this, even if the system provides a query interface, access is limited to this window of data, without the ability to analyze the data within a broader, historical context. Within a pure stream processing system, there is essentially only one chance to analyze data as it flies by. And while distributed, high-throughput NoSQL data stores for complex event processing are used by some businesses, these stores do not provide common relational database features such as joins, and typically gain speed at the expense of data structure.
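The windowing limitation is easy to see in a small sketch. The `WindowedStream` class below is a hypothetical illustration, not the behavior of any particular stream processor; it assumes events arrive with non-decreasing timestamps expressed in seconds.

```python
from collections import deque


class WindowedStream:
    """Hold only the events that fall inside a fixed trailing time window."""

    def __init__(self, window_seconds):
        self.window_seconds = window_seconds
        self.events = deque()  # (timestamp, value) pairs, oldest first

    def ingest(self, timestamp, value):
        self.events.append((timestamp, value))
        self._evict(timestamp)

    def _evict(self, now):
        # Drop everything older than the window; it cannot be queried again.
        cutoff = now - self.window_seconds
        while self.events and self.events[0][0] < cutoff:
            self.events.popleft()

    def total(self):
        # Queries can only see what is still retained in the window.
        return sum(value for _, value in self.events)


if __name__ == "__main__":
    stream = WindowedStream(window_seconds=3600)  # one-hour window
    stream.ingest(0, 5)
    stream.ingest(1800, 7)
    stream.ingest(4000, 1)   # pushes the event at t=0 out of the window
    print(stream.total())
```

Once the first event is evicted, no query against the stream can include it, which is why a separate store is needed for historical context.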
After diving into slightly more detail on the requirements of converged processing (processing transactions and analytics in a single database), the authors discuss Apache Spark and how it compares to relational databases, and the trend in enterprise data architecture toward a simplified, multipurpose infrastructure, followed by brief discussions on multimodal and multimodel systems and tiered storage. Later chapters revisit some subjects covered earlier, albeit with some nice diagrams that summarize the material, as well as sufficient explanations of important topics such as data durability and data availability as these relate to in-memory databases.
Overall, a good introductory presentation in a freely available whitepaper-sized package. Just be aware that frequent mentions of MemSQL are no coincidence, as all four authors work at MemSQL. The discussion mentions a Spark-MemSQL Connector, made available about 6 months prior to this publication, as well as Spark Streaming (Apache Spark version 1.5 coincides with the September 2015 publication date, and this version was noted for several new Spark Streaming features which focus on operational stability for long-lived production streaming workloads). As with any technical publication, readers need to be discerning: this text does not provide a look at the complete solutions landscape, and other options continue to become available over time.