By Erik Gfesser — Aug 25, 2020

Media Query Source: Part 27

The responses I provided to a media outlet on August 25, 2020:

Media: The current state of Hadoop. Is it dead? Has its death been greatly exaggerated? Is it just for enterprises? How significant were/are Cloudera's 2019 stock woes? Where does Hadoop fit into the future of data engineering?

Gfesser: Oftentimes, the term "Hadoop" can be a bit of a misnomer these days, as it's typically the Hadoop ecosystem that's being discussed, and the Hadoop ecosystem provides quite a few components.

While the Hadoop ecosystem has seen better days, it isn't technically dead. That said, the key reason its usage is on the wane is its limitations, and, quite simply, more competing options have been rolled out over the last several years as a result. One such option that has grown considerably in terms of adoption and community support is Apache Spark.

The Hadoop ecosystem was compelling, permitting the processing of large amounts of data that previously either wasn't possible or was possible only at great expense. However, for a variety of reasons, including the fact that its early components limited processing to batch (which was increasingly reputed to be slow) and the need for niche specialists to develop on it and administer it, usage is decreasing relative to other technologies.

Spark is a component that joined other Hadoop ecosystem components, but the community has since increasingly used it standalone, with the file system role that was initially filled only by HDFS (Hadoop Distributed File System) now served by other storage as well. Spark operates on data in memory (in contrast to MapReduce), permitting more timely stream processing of data, as well as interactive processing by end users.

All of this said, it's also the case that the Hadoop ecosystem provides components with functionality not provided by Spark, for example, for orchestration involving discrete data processing steps. However, because the Hadoop ecosystem is commonly seen as being unnecessarily heavyweight, firms are increasingly willing to focus on Spark and look to either commercial or open source alternatives for functionality it doesn't provide. For example, many firms use Apache NiFi or Apache Airflow for orchestration.
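To make "orchestration involving discrete data processing steps" concrete, here is a minimal sketch in plain Python of the core idea: running steps only after the steps they depend on have finished. The step names and the `run_step` and `run_pipeline` helpers are hypothetical and for illustration only. This is not the API of Apache NiFi, Apache Airflow, or any Hadoop ecosystem component, all of which add scheduling, retries, and monitoring on top of this basic ordering.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Hypothetical pipeline: each step maps to the set of steps it depends on.
pipeline = {
    "extract": set(),
    "clean": {"extract"},
    "aggregate": {"clean"},
    "report": {"aggregate"},
}


def run_step(name, log):
    # Placeholder for real work, such as submitting a Spark job.
    log.append(name)


def run_pipeline(dependencies):
    """Run every step after all of its dependencies have run."""
    log = []
    for step in TopologicalSorter(dependencies).static_order():
        run_step(step, log)
    return log


order = run_pipeline(pipeline)
print(order)
```

In practice, each step would launch a job and report success or failure, and the orchestrator would decide whether the downstream steps should run.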

Most firms need to process data in batch, stream, and interactive modes, and while separate, specialized components can be used for each of these, it stands to reason that since Spark offers all three, using it can help reduce enterprise complexity. All technologies have tradeoffs, but for many firms the Hadoop ecosystem just doesn't offer enough compelling advantages to outweigh its drawbacks.

See all of my responses to media queries here.
