By Erik Gfesser — Aug 30, 2019

Media Query Source: Part 18

The responses I provided to a media outlet on August 30, 2019:

Media: Hadoop hasn't lived up to its early promise, and the fire sale of MapR to HPE earlier this month is the latest sign of that. What's next, then, for big data? Can Hadoop still find a place, or have other technologies like Spark and Kafka taken over?

Gfesser: There have actually been other, similar signs of Hadoop's demise recently: the completion in January 2019 of the merger between Cloudera and Hortonworks (both competitors of MapR in the Hadoop distribution market), and just a few short months later, in May 2019, the unexpected decision by Cloudera (the entity resulting from this merger) to follow a pure open source model going forward.

With respect to "big data", the question for quite some time now has been what this term is intended to mean. In many cases, especially when it was still written with capital letters ("Big Data"), it effectively meant the framework used to process the data: Hadoop MapReduce. Over time, however, the Hadoop ecosystem continued to expand, with migrations away from MapReduce to other frameworks built on top of Hadoop.

As the Hadoop ecosystem evolved, however, some frameworks were increasingly used in standalone deployments apart from Hadoop. One of the most common of these is Spark, along with its commercialized open source counterpart, Databricks, which is now available in Azure as a managed service and in AWS via the marketplace.

Even with managed Hadoop services such as Amazon EMR, which help alleviate some of the complexity of Hadoop implementation and management, it may not make sense to be encumbered by the larger Hadoop ecosystem if one is largely using Spark. And depending on what one actually needs to do, it may no longer make sense to use either Hadoop or standalone Spark.

The question raised for this media query mentions Kafka alongside Spark as a technology potentially taking over, but depending on one's requirements it might make sense to use one or the other, or a combination of the two. For example, for some use cases it makes sense to use Kafka to stream data to Spark for processing.
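To make the Kafka-to-Spark combination concrete, here is a minimal sketch using Spark Structured Streaming's built-in "kafka" source. The broker address, topic name, and helper function names are hypothetical and chosen only for illustration, and the sketch assumes PySpark is installed and the Spark Kafka connector package is available to the Spark session.

```python
from typing import Dict


def kafka_source_options(bootstrap_servers: str, topic: str) -> Dict[str, str]:
    """Build options for Spark's "kafka" streaming source (hypothetical helper)."""
    return {
        "kafka.bootstrap.servers": bootstrap_servers,  # e.g. "localhost:9092"
        "subscribe": topic,                            # topic to read from
        "startingOffsets": "latest",                   # only read new records
    }


def stream_from_kafka(spark, bootstrap_servers: str, topic: str):
    """Return a streaming DataFrame of Kafka records with string key and value."""
    raw = (
        spark.readStream.format("kafka")
        .options(**kafka_source_options(bootstrap_servers, topic))
        .load()
    )
    # Kafka delivers keys and values as bytes; cast them for downstream processing.
    return raw.selectExpr(
        "CAST(key AS STRING) AS key",
        "CAST(value AS STRING) AS value",
    )


def run_console_demo(bootstrap_servers: str = "localhost:9092",
                     topic: str = "events") -> None:
    """Print incoming records to the console (assumes a reachable Kafka broker)."""
    from pyspark.sql import SparkSession  # imported lazily; requires PySpark

    spark = SparkSession.builder.appName("kafka-to-spark-sketch").getOrCreate()
    events = stream_from_kafka(spark, bootstrap_servers, topic)
    query = events.writeStream.format("console").outputMode("append").start()
    query.awaitTermination()
```

Calling `run_console_demo()` against a running broker would print each micro-batch as it arrives. In practice the console sink would be replaced by whatever processing and storage the use case calls for.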

Commercial alternatives now exist for both of these technologies as well, such as Amazon Kinesis for Kafka, and the aforementioned Databricks product and AWS Glue for Spark. Kinesis, however, is not built on Kafka in the way that Databricks is built on Spark, which is why AWS now also offers Amazon Managed Streaming for Apache Kafka.

For greenfield technology projects not encumbered by legacy constraints related to existing installations or staff skillsets, Hadoop has largely fallen by the wayside. It's not going away, but unlike several years ago, there are now viable alternatives to Hadoop for many use cases.

See all of my responses to media queries here.
