New Book Review: "Architecting Data Lakes"
New book review for Architecting Data Lakes: Data Management Architectures for Advanced Business Use Cases, by Alice LaPlante and Ben Sharma, O'Reilly, 2016, reposted here:
This freely available book from O'Reilly is only the third published on the topic of data lakes, and all three have appeared within the last six months (an additional text is scheduled to be published a few months from now). The main complaints about the first two books fall in line with a gripe technologists frequently voice about technical books in general: the impracticality of the presented material. While there is a time and place for hands-on, step-by-step instruction, one big problem with too literal an approach in that direction is that technologies come and go. Many such as myself do not wish to become tool jockeys, learning how to do something with a tool without understanding the concepts behind what we are doing, and many have also grown averse to the risk of becoming locked into a particular commercial vendor's tool.
The authors present the problem space very well in their introductory chapter: "Organizations report success with these early endeavors in mainstream Hadoop deployments ranging from retail, healthcare, and financial services use cases. But currently Hadoop is primarily used as a tactical rather than a strategic tool, supplementing as opposed to replacing the EDW. That's because organizations question whether Hadoop can meet their enterprise service-level agreements (SLAs) for availability, scalability, performance, and security. One major challenge with traditional EDWs is their schema-on-write architecture, the foundation for the underlying extract, transform, and load (ETL) process required to get data into the EDW. With schema-on-write, enterprises must design the data model and articulate the analytic frameworks before loading any data. In other words, they need to know ahead of time how they plan to use that data. This is very limiting."
"In response, organizations are taking a middle ground. They are starting to extract and place data into a Hadoop-based repository without first transforming the data the way they would for a traditional EDW. After all, one of the chief advantages of Hadoop is that organizations can dip into the database for analysis as needed. All frameworks are created in an ad hoc manner, with little or no prep work required. Driven both by the enormous data volumes as well as cost – Hadoop can be 10 to 100 times less expensive to deploy than traditional data warehouse technologies – enterprises are starting to defer labor-intensive processes of cleaning up data and developing schema until they've identified a clear business need. In short, they are turning to data lakes. A data lake is a central location in which to store all your data, regardless of its source or format. It is typically, although not always, built using Hadoop."
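The schema-on-write versus schema-on-read contrast the authors draw can be sketched in a few lines of plain Python. This is my own toy illustration, not from the book; the record fields, the `EDW_SCHEMA` columns, and the "query" are all invented for the example:

```python
import json

# Schema-on-write (traditional EDW): the data model is designed ahead of
# time and enforced at load time, so records that do not fit are rejected.
EDW_SCHEMA = ("customer_id", "amount")  # hypothetical columns

def load_into_edw(raw_records):
    table = []
    for rec in raw_records:
        row = json.loads(rec)
        if set(row) != set(EDW_SCHEMA):  # strict, predefined model
            raise ValueError(f"record does not match schema: {row}")
        table.append(row)
    return table

# Schema-on-read (data lake): raw records land untouched, with little or
# no prep work; structure is imposed only when an analysis needs it.
def load_into_lake(raw_records):
    return list(raw_records)  # store as-is, no up-front transformation

def query_lake(lake, field):
    # Apply a schema lazily: extract one field, skipping records without it.
    return [json.loads(rec)[field] for rec in lake
            if field in json.loads(rec)]

raw = ['{"customer_id": 1, "amount": 9.99}',
       '{"customer_id": 2, "amount": 5.00, "coupon": "SAVE10"}']

lake = load_into_lake(raw)
print(query_lake(lake, "amount"))  # both records usable: [9.99, 5.0]
```

The second record, with its unanticipated `coupon` field, would be rejected by the schema-on-write loader but remains fully queryable in the lake, which is the "very limiting" point the authors make about needing to know ahead of time how the data will be used.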
After walking through the main benefits of a data lake, the authors explain the main differences between EDWs and data lakes, drawbacks of the traditional EDW, key attributes of a data lake, the business case for data lakes, and four options to implement data management and governance in the data lake: (1) address the challenge later, (2) adapt existing legacy tools, (3) write custom scripts, or (4) deploy a data lake management platform. This fourth option is the focus of this book. After defining data lakes and how they work, the authors provide a data lake reference architecture designed by Zaloni to represent best practices in building a data lake, followed by discussions of the challenges that companies face in building and managing data lakes. Ben Sharma is CEO and co-founder of Zaloni, which makes the Bedrock data lake management platform, so keep this in mind as you read through the presentation.
The data lake "architecture" diagram that opens the second chapter rubs me the wrong way, mainly because it reminds me of the enterprise architecture "frameworks" presented by individuals on past projects of mine, which never explained how the presented functionality would actually be built, only that execution would "obviously" be taken care of by other architects on the team. However, the authors redeem themselves by later providing example technology stacks for each of the areas covered in this reference architecture. For example, the four functions of a data lake are identified as ingestion, data storage, data processing, and data visualization and APIs, and the technology stack presented for ingestion comprises Apache Flume, Apache Kafka, Apache Storm, Apache Sqoop, and NFS Gateway, with descriptions provided for each of these products.
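To make the ingestion function a bit more concrete, here is a toy sketch (my own, not from the book) of the producer/broker decoupling that tools like Kafka and Flume provide, using only Python's standard library. The source names and record format are invented for the illustration:

```python
import json
import queue

# A stand-in for a message broker topic: sources publish raw events,
# and the lake's ingestion layer consumes them at its own pace.
topic = queue.Queue()

def produce(source_name, payload):
    # Sources push events without knowing how the lake will store them.
    topic.put(json.dumps({"source": source_name, "payload": payload}))

def ingest_batch(landing_zone, max_records=100):
    # The ingestion layer drains the topic into the lake's landing zone,
    # keeping records in their raw form (schema-on-read applies later).
    drained = 0
    while not topic.empty() and drained < max_records:
        landing_zone.append(topic.get())
        drained += 1
    return drained

produce("web_clickstream", {"page": "/home"})
produce("pos_terminal", {"sku": "A-100", "qty": 2})

landing = []
print(ingest_batch(landing))  # → 2
```

The real products in the stack differ widely in mechanics (Sqoop does bulk relational imports, Flume and Kafka move event streams, NFS Gateway exposes HDFS as a file system), but this decoupling of sources from storage is the common thread.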
The last chapter, entitled "Looking Ahead," summarizes some emerging trends well, including cloud-to-ground environments, logical data lakes, federated queries, and data discovery portals, but these discussions are brief and readers will need to look elsewhere for more detail on these areas of development. The most immediately practical section of the chapter, however, is the high-level checklist for success the authors provide in the book's concluding pages: (1) a business benefit priority list, (2) architectural oversight, (3) security strategy, (4) I/O and memory model, (5) workforce skillset evaluation, (6) operations plan, (7) communications plan, (8) disaster recovery plan, and (9) five-year vision. Although only about a paragraph of explanation is given for each of these items, I earmarked the last one as a personal reminder that planning is needed while keeping in mind that the data lake will continue to evolve just like any other area of technology.