By Erik Gfesser — Dec 15, 2018

New Book Review: "Practical Hive"

New book review for Practical Hive: A Guide to Hadoop's Data Warehouse System, by Scott Shaw, Andreas Francois Vermeulen, Ankur Gupta, and David Kjerrumgaard, Apress, 2016:

Shaw describes the moment when he first copied a file into HDFS and created a Hive table on top of it: "I was blown away by the simplicity of the solution yet by the far-reaching impact it would have on data analytics. Since that first simple beginning, I have seen data projects using Hive go from design to real analytic value built in weeks, which would take months with traditional approaches. Hive and the greater Hadoop ecosystem is truly a game-changer for data driven companies and for companies who need answers to critical business questions. The purpose of this book is the hope that it will provide to you the same 'ah-ha' moment I experienced. The purpose is to give you the foundation to explore and experience what Hive and Hadoop have to offer and to help you begin your journey into the technology that will drive innovation for the next decade or more. To survive in the technology field, you must constantly reinvent yourself. Technology is constantly traveling forward. Right now there is a train departing; welcome aboard."

As a long-time technologist, I was certainly aware of Hive, but it wasn't clear to me what the component had to offer, as it can be challenging to find satisfying explanations on the web. The liberal use of the term "Hive tables" by bloggers was especially confusing, not to mention the incomplete explanations that I was hearing from other technologists. This book gave me my own "ah-ha" moment and cleared everything up for me, and I am now ready to tackle Hive and Spark SQL on an upcoming project. While this book specifically addresses Hive, the authors explain that other Hive-compatible components such as Spark SQL exist, and additionally explain what provides this compatibility. They also explain why, because Hive was built to be used with HDFS, the HDFS compatibility of other file systems that are increasingly being made available extends the applicability of Hive. This book provided me with a springboard that has enabled me to delve into the documentation with confidence, as well as directly address some of the misinformation I was hearing from technologists in the workplace.

This book is broken down into 11 chapters and 2 appendices. Chapters 1-3 ("Setting the Stage for Hive: Hadoop", "Introducing Hive", and "Hive Architecture") provide the needed background for the rest of the book. Chapters 4-7 ("Hive Tables DDL", "Data Manipulation Language DML", "Loading Data Into Hive", and "Querying Semi-Structured Data") walk through common Hive operations. Chapter 8 ("Hive Analytics") covers Hive data warehousing and accounts for a whopping one-third of the book's content. Chapters 9-10 ("Performance Tuning: Hive" and "Hive Security") cover an introduction to tuning and security, respectively, and Chapter 11 ("The Future of Hive") covers, in the authors' words, "the near future of Hive", serving as a reminder to readers that more than 2 years have already gone by since this book was written. In my research to find Hive books that are still relevant, this book and "Hive Essentials (Second Edition)", published very recently in June 2018, were the only candidates I added to my reading list, and I chose this book because it is double the size and was made freely available on the web.

I personally found the first 3 chapters especially relevant for what I was trying to get from a book. The authors provide a very gentle introduction for those not familiar with Hadoop, and in less than 50 pages cover a sizable number of topics in preparation for the remainder of the book's content. While some readers who are not architects may not be interested in some of the technical detail, the authors also state the case for Hive along the way, emphasizing why Hive is such a game-changer for working with HDFS. Readers can rest assured that the information provided is accurate, as Shaw was working at Hortonworks at the time and had Hive committers at his disposal, while also keeping in mind one of the comments he made at the outset: "The struggle with writing a book on Hive is if you wait six months between writing then you’re writing a new book." I especially appreciated the authors' straightforward explanations of MapReduce, YARN, and Hadoop cluster architecture in preparation for their Hive architecture walkthrough, the diagrams of which prompted me to explore and share related documentation with colleagues.

The bulk of chapters 4 and 5 was not as relevant for me, as I've worked with databases for a long time, but because this content addresses DDL and DML specifically for Hive and not databases in general, I could not simply skip these chapters. The "Tables" section was especially pertinent, with its explanations of external and internal (or managed) tables, partitioning and bucketing, and file formats. As with all examples in these chapters and the remainder of the book, explanations provide all input and output. I read this book in a whirlwind of less than 3 days, so I did not have time to execute all examples as I have typically done with other texts. Because all input and output is complete, I felt that I did not need to execute the commands, and I realize that some portion is likely outdated and that I would need a bit more time to work through everything than my crash course permitted.

Readers should therefore take note of some comments about software versions made in chapter 2, beginning with the statement that "the time it takes this book to be published and reach your hands, the versions will have already changed". This statement refers to the Apache project versions listed in a table for three of the most recognized Hadoop distributions at the time (Cloudera CDH, MapR, and Hortonworks), and readers should also keep in mind the recently announced planned merger of Cloudera and Hortonworks. A few pages later, the authors remind the reader that "despite all the features and functionalities packed into a Hadoop distribution, our focus is on Hive", and follow up on this statement by writing that "as of the writing of this book, Hive 2.0.1 is the latest Apache version. Although the latest Apache version is 2.0.1 we will work exclusively with version 1.2.1 of Hive throughout subsequent chapters because it is the latest version tested and offered in a distribution. If you happen to already be using CDH 5.7, which uses Hive 1.1 with patches, most of the functionality should still work. Functionality involving the Tez engine will not be available because Cloudera does not support Tez as a SQL engine." According to the documentation, Hive 2.0.1 was released in May 2016, and Hive 1.2.1 was released in June 2015.

While the "Loading Data into HDFS" and "Accessing the Data in Hive" sections of chapter 6 were beneficial, the "Design Considerations Before Loading Data" section was especially helpful for me. It tied in nicely with the "High Performance Checklist", "Execution Engines", "Storage Formats", and "Query Execution Plan" performance tuning sections of chapter 9, as I have designed and tuned many databases in the past and these sections helped complete the picture. Although chapter 7 provides some decent explanations of creating, loading, and querying semi-structured data such as JSON, I personally had little interest here apart from its closing discussion on reading and writing data using a SerDe (serializer/deserializer). Additionally, as well put together as chapter 8 might be with its step-by-step walkthrough of building a data warehouse (although I haven't been sold on use of the "data vault" technique), much of this information was redundant for me because I have data warehousing experience. One exception here is the coverage of a technique used by some business analysts called "sun models", which are graphical representations of business query requirements.

Approaching 2019, chapter 11 on the future of Hive is outdated, but chapter 10 on Hive security is still relevant with its coverage of administration, authentication, authorization, auditing, and data protection. Much of this chapter consists of a screen-by-screen walkthrough of Hive authorization using Apache Ranger, but the discussion begins with brief walkthroughs of configuration file storage-based and SQL standards-based authorization, as well as managing access by making use of SQL. I recommend this text to anyone new to Hive or Hive-compatible components such as Spark SQL (which is still not 100% compatible with Hive). This book provides an excellent introduction to prepare readers for hands-on Hive development and the subsequent, inevitable delving into Hive documentation that will be needed as a result.
