By Erik Gfesser — Mar 24, 2012

New Book Review: "Hadoop in Action"

New book review for Hadoop in Action, by Chuck Lam, Manning Publications, 2010, reposted here:

After checking out reviews of what O'Reilly and Apress had to offer with regard to Hadoop, I ended up purchasing this book based on positive reviews, my past positive experiences with the Manning "In Action" series of texts in general, such as "Spring in Action" and "Java Persistence with Hibernate", formerly "Hibernate in Action" (see my reviews), and the fact that this book was the most recently published on the subject. In short, this text is well organized and covers Hadoop well, but potential readers should be aware that about one-third of what Lam has to offer here is ancillary to Hadoop rather than about Hadoop itself. Including the larger ecosystem within which Hadoop sits makes sense to me, and I do not think this aspect of the book detracts from what the author provides in any way.

The author provides a good introduction to Hadoop in the first three chapters, which includes a discussion on differences between Hadoop and traditional technologies in this space, such as relational databases, as well as a tour of Hadoop building blocks, working with files in the Hadoop Distributed File System (HDFS), and the anatomy of a MapReduce program. The next three chapters contain the bulk of the text, which focuses on writing MapReduce programs, and includes segments on chaining MapReduce jobs, joining data from different sources, creating a Bloom filter, and monitoring, debugging, and tuning.

The next two chapters offer a short cookbook in which the author presents five general MapReduce techniques (Lam admits that specialized MapReduce techniques can be found rather easily by Googling, and that he does not intend this cookbook to be comprehensive in any way), as well as a chapter on managing Hadoop. These are followed by four chapters: one on running Hadoop in the cloud, brief introductions to programming with Pig (a Hadoop extension that provides a language called Pig Latin) and to using Hive (a package built on top of Hadoop that provides a SQL-like language called HiveQL), and a chapter that discusses four Hadoop case studies from the New York Times, China Mobile, StumbleUpon, and IBM (the case study from IBM takes up about 50% of the discussion, and the case study from the New York Times is less than a page).

Be aware that at the time of this review, this book was published over a year ago. One of the common complaints I read about what O'Reilly and Apress have to offer in this space is that their counterparts to this book cover older versions of Hadoop. In chapter 4, Lam mentions that "one of the main design goals driving toward Hadoop's major 1.0 release is a stable and extensible MapReduce API. As of this writing, version 0.20 is the latest release and is considered a bridge between the older API (that we use throughout this book) and this upcoming stable API. The 0.20 release supports the future API while maintaining backward-compatibility with the old one by marking it deprecated."

"Future releases after 0.20 will stop supporting the older API. As of this writing, we don't recommend jumping into the new API yet for a couple reasons: (1) Many of Hadoop's own library classes in 0.20 aren't written under the new API yet. You won't be able to use those classes if your MapReduce code uses the new API in 0.20. (2) Many still consider the most production-ready and stable version of Hadoop as of this writing to be 0.18.3. Some users are warming up to version 0.20, but we suggest you wait a little longer before going full production with it." The author follows up by writing that "by the time you read this the situation may be different. In this section we cover the changes the new API presents. Fortunately, almost all the changes affect only the basic MapReduce template. We rewrite the template under the new API to enable you to use it in the future."

Exactly two weeks ago today, Hadoop 1.0.1 was released after 6 years of development. Between the version that this book covers and this most recent version, several intermediate versions were released, which provide bug fixes, improvements, optimizations, and new features, as well as support for some of the offerings in the Hadoop ecosystem. More timely information on open source technologies that enjoy wide community support is always going to be more readily available on the internet, especially via blog posts, but in my opinion this fact does not detract from the value of this text, which still serves as a good introduction to the Hadoop ecosystem, especially for those more comfortable starting out with a published text. Just be aware that you will quickly need to turn to other materials after you make your way through this text.

The portions of what Lam has to offer that I especially appreciated include his presentations in chapter 5 on reduce-side joining and creating a Bloom filter, the cookbook that he provides in chapter 7 that includes segments on passing job-specific parameters to tasks, probing for task-specific information, partitioning into multiple output files, inputting from and outputting to a database, and keeping all output in sorted order, as well as chapters 9, 10, and 11, which discuss the larger Hadoop ecosystem, especially the introduction to Pig Latin. Recommended to anyone looking for an introduction to the Hadoop ecosystem of technologies who understands that published texts such as this one cannot contain information about the latest releases.
