By Erik Gfesser — Feb 26, 2024

Community Comment: Part 34 - Popular data engines use Java / the JVM because it ruled the enterprise for many years

Reasoning behind popular data engines & Java
Java / the JVM long ruled the enterprise
Spark, Flink, Presto & Trino all run on the JVM
Remember: Presto & Trino share some history

These are the comments I provided in reaction to a community discussion thread.

Head of Product at Data Lakehouse Product Firm:

🤷‍♂️ GitHub PRs opened per month for some of the top open source data lake compute engines. The first trend I randomly stumbled on was #ApacheFlink, and it surprised me, so I looked across a few more projects. Opened PRs as a metric alone is far from a comprehensive view of a project's true success and community growth. I'm personally very bullish on Flink, and I've seen a rise in success stories of organizations leveraging Flink alongside some of these other engines, especially when paired with a low-latency lakehouse table format like Apache Hudi.

What are your thoughts on these trends?

Lead Software Engineer at German Security as a Service Product Firm:

why all java? (except ray)

Head of Product at Data Lakehouse Product Firm:

[Meme image]

Lead Software Engineer at German Security as a Service Product Firm:

[Head of Product at Data Lakehouse Product Firm] nice meme, but seriously why?

Head of Product at Data Lakehouse Product Firm:

Why each of these projects chose to build on Java is a good question; I'm not sure. But if your question is why I chose to look at these particular projects, I just grabbed some of the ones top of mind that I see most used around data lakes.

Gfesser:

Because the Java language has the JVM, and historically speaking, Java ruled the enterprise for many years. Ray is very new relative to all of the others listed here, with its APIs taking advantage of the subsequent heightened popularity of Python. It's also worth pointing out that Presto and Trino share some history, so listing both of these is a bit redundant. All of this said, while it's clear that Spark shows the most consistency over time with respect to PRs, demonstrating stability, I share the sentiment of some of the other comments here in the sense that what arguably matters more are the changes actually being made to the code base. I always consider activity as one of the criteria to take into account for adoption, but activity shouldn't be considered in isolation (as is the case when determining the productivity of engineers).
