By Erik Gfesser in media query — Dec 6, 2019

Media Query Source: Part 22 - TechTarget (US digital magazine); AWS data lake & data warehouse options for the cloud

TechTarget (US digital magazine)
AWS data lake & data warehouse options
Lake Formation, Glue & Redshift (Spectrum)
Use cases should drive product adoption

My responses ended up being included in an article at TechTarget (November 14, 2019). Extent of verbatim quote highlighted in orange, paraphrased quote highlighted in gray. Above image from cited article.

The responses I provided to a media outlet on October 18, 2019:

Media: What are the best ways to handle your data on AWS in terms of building and deploying data lakes and data warehouses with an emphasis on AWS services involved?

Media: How do the managed services Lake Formation and Redshift work and compare?

Gfesser: AWS Lake Formation and Amazon Redshift aren't directly comparable. Redshift can be integrated with Lake Formation, but the two services are not interchangeable.

Lake Formation is a relatively new AWS service that became generally available (GA) in August 2019. In addition to coordinating with existing services such as Redshift, it provides conveniences not previously available for setting up a secure data lake on Amazon S3. For example, Lake Formation offers "blueprints" that provide streamlined, out-of-the-box patterns for common use cases, such as bulk and incremental data extraction from data sources.

Lake Formation builds on the capabilities of AWS Glue, which became generally available two years earlier, in August 2017. Glue consists of four components: Glue Data Catalog, Glue Crawlers, Glue Jobs, and Glue Workflows. These components work together to catalog and process inbound data from source systems. Data Catalog is a Hive metastore-compatible catalog that stores metadata about data and can be used across multiple AWS and non-AWS services, while Crawlers gather this metadata automatically as data is ingested. The recently released Workflows can be used to stitch together Lake Formation-compatible services, and Glue Jobs, which process and load data, come in two variants: Python Shell Jobs and Spark Jobs. The former can be used for generic tasks as part of a Workflow, whereas the latter run in a serverless Apache Spark environment that will be familiar to many development shops.
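To make the Crawler and Job components a little more concrete, here is a minimal sketch in Python of how the two might be defined through the Glue API. The function names, role ARN, database name, and S3 paths are all hypothetical placeholders of mine; only the request field names and the boto3 calls `create_crawler` and `create_job` come from the Glue API. The request builders run without AWS access, and only `register` needs boto3 and credentials.

```python
from typing import Any, Dict


def crawler_request(name: str, role_arn: str, database: str, s3_path: str) -> Dict[str, Any]:
    """Build keyword arguments for the Glue CreateCrawler API.

    The crawler scans the S3 path and writes table metadata
    into the named Data Catalog database.
    """
    return {
        "Name": name,
        "Role": role_arn,
        "DatabaseName": database,
        "Targets": {"S3Targets": [{"Path": s3_path}]},
    }


def job_request(name: str, role_arn: str, script_s3_path: str, spark: bool) -> Dict[str, Any]:
    """Build keyword arguments for the Glue CreateJob API.

    The command name selects the job variant: "glueetl" for a
    Spark job, "pythonshell" for a Python Shell job.
    """
    return {
        "Name": name,
        "Role": role_arn,
        "Command": {
            "Name": "glueetl" if spark else "pythonshell",
            "ScriptLocation": script_s3_path,
            "PythonVersion": "3",
        },
    }


def register(crawler: Dict[str, Any], job: Dict[str, Any]) -> None:
    """Send both requests to AWS; requires boto3 and valid credentials."""
    import boto3  # imported lazily so the builders above work without it

    glue = boto3.client("glue")
    glue.create_crawler(**crawler)
    glue.create_job(**job)
```

Keeping the request dictionaries separate from the API calls is only a convenience for this sketch, since it lets the definitions be inspected or tested before anything is created in an account.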

Redshift is a relational database service intended as a data warehouse for analytical workloads involving petabyte volumes of data and read-intensive queries for reporting and visualizations against this data. Lake Formation can be used to load data into Redshift for these purposes. Redshift Spectrum, a feature of Redshift rather than a standalone service, can be used in a manner similar to Amazon Athena to query data in an S3 data lake, while also permitting this data to be joined with data stored in Redshift tables.
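As an illustration of the join described above, the following sketch generates the two SQL statements that would typically be involved: one exposing a Glue Data Catalog database to Redshift as an external schema, and one joining an external table over S3 with a local Redshift table. The schema, table, and column names and the role ARN are hypothetical; the statements would be run against a Redshift cluster with whatever SQL client is already in use.

```python
def external_schema_sql(schema: str, catalog_db: str, iam_role_arn: str) -> str:
    """SQL that maps a Glue Data Catalog database to a Redshift external schema."""
    return (
        f"CREATE EXTERNAL SCHEMA {schema}\n"
        f"FROM DATA CATALOG DATABASE '{catalog_db}'\n"
        f"IAM_ROLE '{iam_role_arn}';"
    )


def spectrum_join_sql(external_schema: str) -> str:
    """SQL joining S3-resident data (via Spectrum) with a local Redshift table.

    sales_events and customers are hypothetical tables used for illustration.
    """
    return (
        "SELECT c.customer_name, SUM(e.amount) AS total_amount\n"
        f"FROM {external_schema}.sales_events AS e  -- external table over S3\n"
        "JOIN public.customers AS c  -- table stored in Redshift\n"
        "  ON c.customer_id = e.customer_id\n"
        "GROUP BY c.customer_name;"
    )


if __name__ == "__main__":
    print(external_schema_sql(
        "lake", "raw_db", "arn:aws:iam::123456789012:role/ExampleSpectrumRole"))
    print(spectrum_join_sql("lake"))
```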

Media: When should an organization rely on a data lake vs. a data warehouse?

Gfesser: It depends on the organization's use cases. Use of a data warehouse such as Redshift is the more traditional approach. Redshift provides a familiar environment for many development shops, offers out-of-the-box compatibility with many commercial products, and enables more complex queries. But Redshift generally requires more administration, may require downtime to scale as data volume and query activity increase, and works with fewer data formats than Athena. Additionally, increased performance using Redshift might not be seen until large data volumes materialize, and even then query speed may disappoint when compared to Athena. Even though Redshift Spectrum functionality overlaps with Athena, its cost is relatively higher, because use of Redshift Spectrum requires a Redshift cluster, whereas the cost of Athena is based solely on the data scanned by queries.
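The cost point can be sketched with a simple model. This assumes that both Athena and Redshift Spectrum bill per terabyte scanned, and that Spectrum additionally carries the hourly cost of the cluster it runs on. The prices in the usage example are placeholders chosen for easy arithmetic, not current AWS rates, and the function names are my own.

```python
def athena_monthly_cost(tb_scanned: float, price_per_tb: float) -> float:
    """Athena: charged only for the data scanned by queries."""
    return tb_scanned * price_per_tb


def spectrum_monthly_cost(
    tb_scanned: float,
    price_per_tb: float,
    node_hourly_price: float,
    nodes: int,
    hours: float = 730.0,  # assumed average hours in a month
) -> float:
    """Redshift Spectrum: scan charges plus the always-on cluster."""
    cluster_cost = node_hourly_price * nodes * hours
    return cluster_cost + tb_scanned * price_per_tb


if __name__ == "__main__":
    # Placeholder prices, for illustration only.
    print(athena_monthly_cost(10, 5.0))
    print(spectrum_monthly_cost(10, 5.0, node_hourly_price=0.25, nodes=2))
```

Under these assumptions the gap between the two is the cluster cost, which is why Spectrum tends to make sense mainly when the cluster is already being paid for.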

I recently built an AWS data lake from scratch with my team for a client, and the way I explained the difference is that Redshift requires data to be loaded, whereas Athena does not, since it works with data already stored in S3. While Redshift Spectrum can do the same thing, it is not as fully featured and likely will not make sense to use unless data already exists in Redshift tables, which may be the case for existing customers, since Redshift is an older AWS service. When it comes down to it, use of a data lake and use of a data warehouse are not mutually exclusive, and many organizations will likely benefit from using both.
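The load-versus-query-in-place distinction can be shown side by side. In this sketch, the first function generates the Redshift COPY statement that loads Parquet files from S3 into a table, and the second builds the arguments for Athena's StartQueryExecution API, which queries the same files where they sit. Table names, paths, and the role ARN are hypothetical, and Parquet is assumed only for the sake of the example.

```python
from typing import Any, Dict


def redshift_copy_sql(table: str, s3_prefix: str, iam_role_arn: str) -> str:
    """Redshift path: data is loaded from S3 into a table before it is queried."""
    return (
        f"COPY {table}\n"
        f"FROM '{s3_prefix}'\n"
        f"IAM_ROLE '{iam_role_arn}'\n"
        "FORMAT AS PARQUET;"
    )


def athena_query_request(sql: str, database: str, results_s3_path: str) -> Dict[str, Any]:
    """Athena path: keyword arguments for StartQueryExecution, with no load step."""
    return {
        "QueryString": sql,
        "QueryExecutionContext": {"Database": database},
        "ResultConfiguration": {"OutputLocation": results_s3_path},
    }


if __name__ == "__main__":
    print(redshift_copy_sql(
        "public.sales_events",
        "s3://example-bucket/raw/sales_events/",
        "arn:aws:iam::123456789012:role/ExampleRedshiftRole",
    ))
    print(athena_query_request(
        "SELECT COUNT(*) FROM sales_events",
        "raw_db",
        "s3://example-bucket/athena-results/",
    ))
```

With boto3 installed and credentials configured, the Athena request could be submitted with `boto3.client("athena").start_query_execution(**request)`.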

Media: How can these two types of data repositories work together?

Gfesser: See comments under bullet #2 above.
