New Book Review: "Learning Spark (Second Edition)"

New book review for Learning Spark (Second Edition), by Jules S. Damji, Brooke Wenig, Tathagata Das, and Denny Lee, O'Reilly, 2020, reposted here:


Rating: 5/5 stars

Copy provided by Databricks at Spark + AI Summit 2020.

The foreword and preface to this book comment that an update to the first edition, published in 2015, was long overdue. After all, the first edition makes use of Apache Spark 1.3.0, whereas this update makes use of Apache Spark 3.0.0-preview2 (the latest version available at the time of writing). For the most part, I successfully ran all notebook code out of the box using Databricks Runtime 7.6 ML (includes Apache Spark 3.0.1, Scala 2.12); the minor issues I encountered are explained later in this review, alongside my resolutions. I was also able to successfully run all standalone PySpark applications from chapters #2 and #3 out of the box using Apache Spark 3.0.1 and Python 3.7.9. As the authors explain, the approach used here is intended to be conducive to hands-on learning, but with a focus on Spark's Structured APIs, so a few topics aren't covered, such as the following: the older low-level Resilient Distributed Dataset (RDD) APIs, GraphX (Spark's API for graphs and graph-parallel computation), how to extend Spark's Catalyst optimizer, how to implement your own catalog, and how to write your own DataSource V2 data sinks and sources.

Content is broken down into 12 chapters: (1) "Introduction to Apache Spark: A Unified Analytics Engine", (2) "Downloading Apache Spark and Getting Started", (3) "Apache Spark's Structured APIs", (4) "Spark SQL and DataFrames: Introduction to Built-in Data Sources", (5) "Spark SQL and DataFrames: Interacting with External Data Sources", (6) "Spark SQL and Datasets", (7) "Optimizing and Tuning Spark Applications", (8) "Structured Streaming", (9) "Building Reliable Data Lakes with Apache Spark", (10) "Machine Learning with MLlib", (11) "Managing, Deploying, and Scaling Machine Learning Pipelines with Apache Spark", and (12) "Epilogue: Apache Spark 3.0". The longest chapter is chapter #8, followed closely by chapters #3, #4, #5, and #10, and the most notebooks are provided for chapters #10 and #11, although this is largely due to individual notebooks dedicated to a variety of topics.

This book is the fourth of four related books I've worked through, a couple of years after the earlier three: "Spark: The Definitive Guide" (2018), "High Performance Spark: Best Practices for Scaling and Optimizing Apache Spark" (2017), and "Practical Hive: A Guide to Hadoop's Data Warehouse System" (2016). As I mentioned in an earlier review, if you are new to Apache Spark, these four texts will help point you in the right direction, although keep in mind that the related tech stack is evolving, so you will obviously need to supplement this material with web documentation and developer forums, as well as get hands-on with the tooling. Reading the earlier three books in reverse order of publication date exposed me to more current material sooner rather than later, but this was largely just a coincidence. Now that this new book is available, I recommend working through this one first. While I wouldn't discount "Spark: The Definitive Guide", because it provides content not in this new book and I personally think it flows better, use it very judiciously because it was created using the Spark 2.0.1 APIs.

The only notebooks I wasn't able to successfully run out of the box are constrained to chapter #11. In notebooks 11-3 ("Distributed Inference"), 11-5 ("Joblib"), and 11-7 ("Koalas"), FileNotFoundErrors were generated when attempting to use Pandas to read from CSV or Parquet files using "read_csv()" and "read_parquet()", respectively. In taking a look at what the community had to say, I discovered that this is a known issue, so I instead replaced these Pandas statements with "spark.read.option(…).csv("…")" and "spark.read.option(…).parquet("…")", respectively, subsequently converting to Pandas using "toPandas()". According to the documentation, Pandas 1.0.1 is installed on both the CPU and GPU clusters for the aforementioned Databricks Runtime (the latest non-beta currently available). In notebook 11-3 ("Distributed Inference"), the following PythonException was generated when attempting to execute a "mapInPandas()" statement that uses a mix of numeric data types in the schema argument: "pyarrow.lib.ArrowInvalid: Could not convert 3.0 with type str: tried to convert to double". In the absence of decent community guidance, and because this statement is solely used for display purposes, I simply converted all of these data types to "STRING". According to the documentation, PyArrow 1.0.1 is installed on both the CPU and GPU clusters for the aforementioned Databricks Runtime.
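For reference, here is a minimal sketch of the two workarounds described above, assuming a Databricks notebook where "spark" is predefined; the file paths and column names are hypothetical placeholders, not the book's actual notebook code:

```python
# Workaround 1: instead of pd.read_csv()/pd.read_parquet(), which raised
# FileNotFoundError, read with Spark and convert to Pandas afterwards.
pdf = (spark.read
       .option("header", "true")
       .option("inferSchema", "true")
       .csv("dbfs:/FileStore/example.csv")   # hypothetical path
       .toPandas())

# Workaround 2: for the mapInPandas() PythonException, declare every field in
# the schema argument as STRING and coerce the output to match.
def to_strings(iterator):
    for batch in iterator:       # each batch is a Pandas DataFrame
        yield batch.astype(str)  # all columns become strings

df = spark.read.parquet("dbfs:/FileStore/example.parquet")  # hypothetical path
df.select("id", "prediction") \
  .mapInPandas(to_strings, schema="id STRING, prediction STRING") \
  .show()
```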

I personally got the most value out of chapters #7 and #8. Chapter #7 covers optimizing and tuning Spark for efficiency, caching and persistence of data, Spark joins, and inspecting the Spark UI. Chapter #8 covers evolution of the Apache Spark stream processing engine, the programming model of Structured Streaming, the fundamentals of a Structured Streaming query, streaming data sources and sinks, data transformations, stateful streaming aggregations, streaming joins, arbitrary stateful computations, and performance tuning. In particular, I especially appreciated the sections on the two most common Spark join strategies (the broadcast hash join and shuffle sort merge join), the Spark UI, stateful streaming aggregations, and streaming joins. Well recommended for anyone making use of Spark.
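To make the join-strategy discussion concrete, here is a minimal sketch (not taken from the book) that contrasts the two strategies in PySpark, again assuming a notebook where "spark" is predefined:

```python
from pyspark.sql.functions import broadcast

# Two toy DataFrames: a large fact table and a small lookup table.
orders = spark.range(1_000_000).withColumnRenamed("id", "order_id")
lookup = spark.range(100).withColumnRenamed("id", "order_id")

# Broadcast hash join: the broadcast() hint ships the small table to every
# executor, avoiding a shuffle; the physical plan shows BroadcastHashJoin.
orders.join(broadcast(lookup), "order_id").explain()

# Shuffle sort merge join: with the auto-broadcast threshold disabled, Spark
# falls back to its default strategy for large equi-joins, and the plan
# shows SortMergeJoin instead.
spark.conf.set("spark.sql.autoBroadcastJoinThreshold", "-1")
orders.join(lookup, "order_id").explain()
```

The "explain()" output makes it easy to verify which strategy the Catalyst optimizer actually selected, which pairs nicely with the book's Spark UI coverage.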
