New Book Review: "Spark: The Definitive Guide"
New book review for "Spark: The Definitive Guide", by Bill Chambers and Matei Zaharia, O'Reilly 2018, reposted here:
Copy provided by O'Reilly.
Over the past few months, I've had the chance to work through three related books in the following order: "Spark: The Definitive Guide" (2018), "High Performance Spark: Best Practices for Scaling and Optimizing Apache Spark" (2017), and "Practical Hive: A Guide to Hadoop's Data Warehouse System" (2016). If you are new to Apache Spark, these three texts will help point you in the right direction, although keep in mind that the surrounding tech stack is evolving quickly, so you will need to supplement this material with online documentation and developer forums, and get hands-on with the tooling. Reading these books in reverse order of publication meant I was exposed to the most current material first, although that was largely a coincidence. While I hold this particular book in high regard, the other two helped fill in some of the gaps for me.
What distinguishes this Spark book from the others still available, beyond its recent publication date (which is not a trivial matter), is that notebooks are available for much of the content, allowing readers to follow along while issuing Spark requests. The one drawback for me is that the accompanying notebooks on GitHub contain no comments whatsoever. My mistake was to read through the material in the book first and then work through the notebooks; because the notebooks provide no context, I ended up revisiting the pertinent passages in the text. Sure, this may have helped solidify some of the material, but I wouldn't want anyone to repeat the mistake. If you decide to work through the material with an electronic copy of the text, I suggest copying the surrounding discussion of each code example into your notebooks so that everything is in one place, as in the sketch below.
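Here is a minimal sketch of what I mean, as a hypothetical PySpark notebook cell; the quoted explanation and the toy example are illustrative rather than taken verbatim from the book:

```python
# Pasted from the book's discussion (paraphrased for illustration):
# "Transformations such as select() build up a logical plan lazily;
#  nothing executes until an action like show() is called."
df = spark.range(500).toDF("number")  # `spark` is predefined in Databricks notebooks
df.select(df["number"] + 10).show()
```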
Apart from this matter, readers should be aware of at least two additional things. The first is that all of the code presented in the book and provided in the accompanying notebooks is in Python, Scala, and Spark SQL. While the authors had good reasons for this choice, readers should be aware that other languages, such as R, are simply not covered. Python has made great strides over the last few years, and many developers will simply prefer it. Starting in 2012, I used R for a few years because it had the momentum, but Python has since taken hold and arguably risen to dominance, thanks to its ease of use, its status as a general-purpose programming language, and the growing availability of relevant frameworks. As for Scala, Spark itself is written primarily in Scala, making it Spark's "default" language. And Spark SQL targets developers already familiar with SQL from other products, with many performance improvements made with these users in mind.
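To illustrate that last point, here is a minimal PySpark sketch showing the same aggregation expressed through the DataFrame API and through Spark SQL; the file path and column name are assumptions for illustration, loosely modeled on the flight data used early in the book. Both forms resolve to the same underlying plan, which is why SQL users benefit from the same optimizer improvements:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("df-vs-sql").getOrCreate()

# Hypothetical flight data; the path and column name are assumptions.
flights = spark.read.option("header", "true").csv("/data/flights.csv")
flights.createOrReplaceTempView("flights")

# The same aggregation via the DataFrame API...
by_dest_df = flights.groupBy("DEST_COUNTRY_NAME").count()

# ...and via Spark SQL.
by_dest_sql = spark.sql(
    "SELECT DEST_COUNTRY_NAME, count(*) AS count "
    "FROM flights GROUP BY DEST_COUNTRY_NAME"
)

by_dest_df.show(5)
by_dest_sql.show(5)
```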
This hefty 550-page text is broken down into 33 chapters across 7 parts: (1) "Gentle Overview of Big Data and Spark", (2) "Structured APIs – DataFrames, SQL, and Datasets", (3) "Low-Level APIs", (4) "Production Applications", (5) "Streaming", (6) "Advanced Analytics and Machine Learning", and (7) "Ecosystem". The part with the most pages is the second, followed closely by the sixth; together they comprise about 50% of the material. The authors have done a good job of distributing coverage relatively evenly across the 33 chapters, with chapter 6 ("Working with Different Types of Data") and chapter 25 ("Preprocessing and Feature Engineering") the lengthiest, and several chapters, such as the opening and closing chapters and a couple in part 5, the shortest, at 10 pages or less. While this book provides the best overall Spark coverage of any currently available text, one area that is a bit lackluster is performance tuning: chapter 19 ("Performance Tuning") provides an initial 15-page look, but you will need to go elsewhere for anything deeper, such as the aforementioned "High Performance Spark".
Beginning with the tail end of chapter 1 ("What Is Apache Spark?"), where the authors indicate that "the majority of this book was written using Spark 2.2", discussions are sprinkled with reminders that Spark 2.2.0 is the reference version, with Spark 2.3.0 mentioned periodically. Readers should also keep in mind that Spark 2.2.0 was released in July 2017, Spark 2.3.0 in February 2018, and as of November 2018 the current release is 2.4.0. The authors recommend either downloading Apache Spark or using Databricks Community Edition (a free version of the Databricks cloud service) to execute the GitHub examples. Having just recently implemented Databricks for a client, I can assure readers that the latest version of Spark makes its way into the product quickly: I found that Spark 2.4.0 was incorporated into Databricks Runtime 5.0 only about a week after its release.
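If you go the download route, a local session is sufficient for nearly all of the book's examples. A minimal sketch, assuming pyspark is installed (for example via pip):

```python
from pyspark.sql import SparkSession

# Local session using all available cores; on Databricks a `spark`
# session is already provided and this setup is unnecessary.
spark = (SparkSession.builder
         .master("local[*]")
         .appName("definitive-guide-examples")
         .getOrCreate())

print(spark.version)  # confirm which Spark release you are actually running
```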
Since Spark has so many capabilities, and this book attempts to address many of them, writing a review that does the book justice is challenging. In full disclosure, I chose not to work through the chapters in part 6 ("Advanced Analytics and Machine Learning") at this point in time: I already have a fair amount of machine learning training under my belt, and while my recent Databricks project involved executing machine learning models, the data science team I worked with used the R language, which this text does not cover. If and when I decide to revisit these chapters, I will loop back and amend this review to include discussion of them. For now, rest assured that until either a second edition of this text or a competing text is published, you are likely to be served well by what the authors provide here, coming as it does from Databricks, the company founded by the creators of Spark.