New Book Review: "The Hadoop Performance Myth"
New book review for The Hadoop Performance Myth: Why Best Practices Lead to Underutilized Clusters, and Which New Tools Can Help, by Courtney Webster, O'Reilly, 2016, reposted here:
Technologists have been complaining about Hadoop performance for years. This short, 15-page text explores the Hadoop performance "myth" (the idea that following current best practices actually leads to underutilized clusters) and highlights two tools that meet the author's recommendations, one of which, understandably, is offered by the firm backing this book.
The bulk of this presentation revolves around two challenges, achieving predictable performance and optimizing cluster performance, after briefly exploring the evolution of Hadoop with both in mind. The introduction of YARN, for example, allowed Hadoop to evolve from a MapReduce engine into an ecosystem that can run heterogeneous MapReduce and non-MapReduce applications simultaneously.
While this change produced larger clusters with more users and workloads than ever before, the traditional recommendations (provisioning, isolation, and tuning) typically result in highly underutilized clusters, even as they improve performance and avoid resource contention.
Much of the text focuses on these traditional recommendations, with an emphasis on tuning, followed by discussions of how resource managers affect performance and utilization and of the need for improved resource-prediction and real-time resource-allocation tools.
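To make the tuning recommendation concrete, here is a minimal Python sketch of the static, per-node settings this approach produces; the property names are standard YARN configuration keys, while the values and the container-budget helper are purely illustrative and not taken from the book.

```python
# Typical static tuning knobs set in yarn-site.xml (illustrative values only).
yarn_site = {
    "yarn.nodemanager.resource.memory-mb": 98304,   # memory YARN may hand out per node
    "yarn.nodemanager.resource.cpu-vcores": 24,     # vcores YARN may hand out per node
    "yarn.scheduler.minimum-allocation-mb": 4096,   # smallest container the scheduler grants
    "yarn.scheduler.maximum-allocation-mb": 16384,  # largest container the scheduler grants
}

def max_containers(conf: dict) -> int:
    """Rough upper bound on concurrent containers per node under static tuning."""
    return (conf["yarn.nodemanager.resource.memory-mb"]
            // conf["yarn.scheduler.minimum-allocation-mb"])

if __name__ == "__main__":
    # The budget is fixed at configuration time, regardless of what jobs actually use.
    print(f"Per-node container budget: {max_containers(yarn_site)}")
```

Because these settings are fixed up front, the cluster's capacity is carved up by allocation rather than by observed demand, which is exactly the pattern the author argues leads to underutilization.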
The need that the author cites here is based on a 2012 study of Google's cluster, which provides an example of a large cluster with heterogeneous workloads. While the references in this book curiously do not include a link to this study (unlike the other listed references), I was able to find it at this link on the Carnegie Mellon University website.
Webster's summary explains that better prediction of resource needs will eliminate over-allocation, and that resource managers must be able to dynamically adjust resources based on real-time usage (i.e., actual consumption rather than allocation). Two tools that address these recommendations are Quasar, which focuses on the former, and Pepperdata, which focuses on the latter.
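As a rough illustration of the allocation-versus-usage gap these recommendations target, the following toy Python sketch (with made-up numbers, not figures from the book) compares how full a node looks by allocation against how full it actually is:

```python
# Hypothetical containers on one node: (memory allocated in MB, peak memory used in MB).
containers = [
    (8192, 1900),
    (8192, 2300),
    (4096, 3800),
    (8192, 1200),
]

node_capacity_mb = 32768

allocated = sum(a for a, _ in containers)
used = sum(u for _, u in containers)

# Allocation-based view: the node looks nearly full, so the scheduler stops placing
# work, even though most of the memory sits idle in practice.
print(f"allocated: {allocated / node_capacity_mb:.0%} of node")
print(f"actually used: {used / node_capacity_mb:.0%} of node")
```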
Interestingly, the author comments that Quasar, a performance-constrained, workload-profiling cluster manager, is not open source or commercialized for immediate adoption. Apparently, some Quasar code was released in July 2015 as part of Mesos 0.23.0, but other features are expected to be reserved for the enterprise edition of Mesosphere's DCOS (Data Center Operating System).
Pepperdata, on the other hand, a real-time resource-allocation performance optimizer, is, of course, available in the marketplace. In the discussion of this software product, the author explains that Pepperdata installs an agent on every node in the cluster that collects over 200 metrics at 3-5 second intervals and uses this information to dynamically reshape hardware usage.
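The book includes no code, but the pattern described, sampling per-node metrics every few seconds and reshaping limits toward observed usage, can be sketched generically. The Python below is a hypothetical illustration of that loop, not Pepperdata's agent; every function name and value in it is invented for the example.

```python
import time

def sample_metrics() -> dict:
    """Placeholder for collecting per-node metrics (CPU, memory, disk, network)."""
    return {"memory_used_mb": 9500, "memory_allocated_mb": 28672}

def reshape(limits: dict, metrics: dict, headroom: float = 1.25) -> dict:
    """Move the enforced memory limit toward observed usage plus some headroom."""
    limits["memory_limit_mb"] = int(metrics["memory_used_mb"] * headroom)
    return limits

if __name__ == "__main__":
    limits = {"memory_limit_mb": 28672}
    for _ in range(3):           # a real agent would run continuously
        limits = reshape(limits, sample_metrics())
        print(limits)
        time.sleep(3)            # 3-5 second sampling interval, per the text
```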
In conclusion, the author does note that traditional best practices can improve performance and may be enough for clusters with single workloads and consistent needs, but that more complex clusters typically exhibit resources that are 88-94% dormant, due to the conservative actions of resource managers and the practice of over-provisioning. Readers are advised to check the original references to determine the accuracy and applicability of this statistic.