New Book Review: "High Performance Spark"

New book review for High Performance Spark: Best Practices for Scaling and Optimizing Apache Spark, by Holden Karau and Rachel Warren, O'Reilly, 2017, reposted here:

Rating: 4 out of 5 stars

The authors state in their preface that "this book is intended for those who have some working knowledge of Spark, and may be difficult to understand for those with little or no experience with Spark or distributed computing." They "expect this text will be most useful to those who care about optimizing repeated queries in production, rather than to those who are doing primarily exploratory work," and they "want to help our readers ask questions such as 'How is my data distributed?', 'Is it skewed?', 'What is the range of values in a column?', and 'How do we expect a given value to group?' and then apply the answers to those questions to the logic of their Spark queries."
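Those preface questions translate directly into a quick skew check. The sketch below is mine, not the book's: it uses a plain Python dictionary standing in for what Spark's `rdd.countByKey()` would return, so the idea is visible without a cluster; the key names, counts, and the "twice the mean" threshold are all illustrative assumptions.

```python
# Hypothetical per-key counts, shaped like the result of rdd.countByKey()
# in Spark (names, numbers, and threshold are illustrative, not from the book).
counts = {"a": 1_000_000, "b": 120, "c": 95}

# A simple heuristic: flag any key holding far more than the mean share,
# which would answer "Is my data skewed?" before tuning a join or aggregation.
mean = sum(counts.values()) / len(counts)
skewed_keys = {k for k, n in counts.items() if n > 2 * mean}

print("skewed keys:", sorted(skewed_keys))
```

In a real Spark job one would run the equivalent aggregation on a sample of the data first, since counting every key on a large dataset is itself a full pass.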

This book is the second of three related books that I've had the chance to work through over the past few months, in the following order: "Spark: The Definitive Guide" (2018), "High Performance Spark: Best Practices for Scaling and Optimizing Apache Spark" (2017), and "Practical Hive: A Guide to Hadoop's Data Warehouse System" (2016). If you are new to Apache Spark, these three texts will help point you in the right direction, although keep in mind that the related tech stack is evolving, so you will need to supplement this material with web documentation and developer forums, as well as get hands-on with the tooling. Reading these books in reverse order of publication date exposed me to more current material sooner rather than later, but this was largely just a coincidence.

Keep in mind the authors' early caveat that "this book was created using the Spark 2.0.1 APIs, but much of the code will work in earlier versions of Spark as well. In places where this is not the case we have attempted to call that out". As Spark 2.4.0 was just released in November 2018, you will find that some of the material provided here is either outdated or already covered in the newer of the Spark texts mentioned above. The unfortunate dilemma is that a book specifically focused on Spark performance simply isn't available beyond what the authors provide here, so you will need to account for differences across versions, especially in the several instances where the authors offer workarounds that they warn may not remain viable long term.

Unlike "Spark: The Definitive Guide", which provides Python, Scala, and Spark SQL code, readers should be aware that the bulk of the code provided in this book is in Scala, "simply in the interest of time and space", because "it is the belief of the authors that 'serious' performant Spark development is most easily achieved in Scala", and while "these reasons are very specific to using Spark with Scala, there are many more general arguments for (and against) Scala's applications in other contexts." In making their case, the authors also provide tips for learning Scala alongside additional arguments for picking up the language: "to be a Spark expert you have to learn a little Scala anyway", "the Spark Scala API is easier to use than the Java API", and "Scala is more performant than Python."

This densely written book of slightly over 300 pages is broken down into 10 chapters and an appendix: (1) "Introduction to High Performance Spark", (2) "How Spark Works", (3) "DataFrames, Datasets, and Spark SQL", (4) "Joins (SQL and Core)", (5) "Effective Transformations", (6) "Working with Key/Value Data", (7) "Going Beyond Scala", (8) "Testing and Validation", (9) "Spark MLlib and ML", (10) "Spark Components and Packages", and an appendix on "Tuning, Debugging, and Other Things Developers Like to Pretend Don't Exist". While the chapters aren't grouped into broader sections, chapters 1 and 2 are essentially an introduction, and chapters 3, 4, 5, 6, and 8 provide the bulk of the content (chapter 8 belongs with the other four, as testing and validation are a natural follow-up to much of what is discussed). As for the remaining three chapters, chapter 7 would work well as a final chapter, but chapters 9 and 10 seem a bit misplaced, with chapter 10 better suited as an appendix alongside the one provided.

The diagrams in chapters 3 through 6 are especially well done and supplement the discussions very well: for example, the diagram in chapter 3 on Spark SQL windowing (which personally helped supplement the cursory explanation in "Spark: The Definitive Guide"), the diagrams in chapter 4 on joins, the diagrams in chapter 5 on narrow versus wide dependencies between partitions and on caching versus checkpointing, and the diagrams in chapter 6 on GroupByKey (although I found one of several errors here) and SortByKey. While the diagrams in chapters 1 and 2 are beneficial, these can largely be found in the documentation (perhaps with the exception of those in the section entitled "The Anatomy of a Spark Job").
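Behind those chapter 6 diagrams sits the classic groupByKey-versus-reduceByKey contrast. A minimal stand-in using plain Python dictionaries (not Spark itself, and the toy key/value pairs are my own illustration) shows the two shapes of aggregation:

```python
from collections import defaultdict

pairs = [("x", 1), ("x", 2), ("y", 3)]

# groupByKey-style: materialize every value for a key before combining.
# In Spark, this ships all values for each key across the shuffle.
groups = defaultdict(list)
for k, v in pairs:
    groups[k].append(v)
grouped_sums = {k: sum(vs) for k, vs in groups.items()}

# reduceByKey-style: merge values incrementally, never holding the full
# group. In Spark, this combines map-side before the shuffle, moving less data.
reduced_sums = {}
for k, v in pairs:
    reduced_sums[k] = reduced_sums.get(k, 0) + v

assert grouped_sums == reduced_sums  # same answer, different shuffle cost
```

Both produce identical results; the book's diagrams make the point that the difference lies entirely in how much data crosses the network, which is why the grouped form can fail on skewed keys where the reduced form succeeds.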

The appendix is beneficial to the point that it could likely have been expanded and included in the body of the text, possibly following the introductory chapters, because the discussion here is all about what one can do outside one's application code (the application code itself being what the bulk of this book is about). Topics covered here are broken down into sections on "Spark Tuning and Cluster Sizing", "Basic Spark Core Settings: How Many Resources to Allocate to the Spark Application?", "Serialization Options", and "Some Additional Debugging Techniques". Highly recommended text for anyone looking to broaden their understanding of the hows and whys behind optimizing Spark.
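As a footnote on the appendix's resource-allocation questions: in practice these surface as spark-submit settings. A hedged sketch follows; the cluster manager, every numeric value, and the jar name are illustrative assumptions on my part, not recommendations from the book.

```shell
# Illustrative values only -- sizing depends entirely on your cluster
# and workload; the appendix walks through how to reason about them.
spark-submit \
  --master yarn \
  --num-executors 10 \
  --executor-cores 4 \
  --executor-memory 8g \
  --driver-memory 4g \
  --conf spark.serializer=org.apache.spark.serializer.KryoSerializer \
  my-app.jar
```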
