Community Comment: Part 6

The comments I provided in reaction to a community discussion thread:
https://www.linkedin.com/feed/update/urn:li:activity:6747585009990852609?commentUrn=urn%3Ali%3Acomment%3A%28activity%3A6747585009990852609%2C6747641413837631488%29&replyUrn=urn%3Ali%3Acomment%3A%28activity%3A6747585009990852609%2C6748184339915292672%29

Senior Software Engineer (ex-Facebook, ex-Netflix): A lot of data engineers think coding in SQL is “beneath” them. They’re like, “I only use the Spark data frame API because that’s what all the cool engineers are doing nowadays”. The problem with that line of thinking is that pure SQL pipelines require a less sophisticated skill set to maintain and often solve the data problems just as elegantly as data frames. Engineers often fail to think about what will happen with their code if they ever leave a company or change teams. Coding to the common denominator skills is how you write durable, high-longevity pipelines.
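To make that claim concrete, here is a minimal sketch, assuming PySpark and hypothetical paths and column names, of the same aggregation written first with the DataFrame API and then as SQL over a temp view:

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("sql-vs-dataframe").getOrCreate()

# Hypothetical source data; the path and columns are illustrative only.
events = spark.read.parquet("/data/events")

# DataFrame API version: filter, group, aggregate.
purchases_df = (
    events
    .where(F.col("event_type") == "purchase")
    .groupBy("event_date")
    .agg(F.count("*").alias("purchases"), F.sum("amount").alias("revenue"))
)

# Equivalent SQL version against the same data, registered as a temp view.
events.createOrReplaceTempView("events")
purchases_sql = spark.sql("""
    SELECT event_date,
           COUNT(*)    AS purchases,
           SUM(amount) AS revenue
    FROM events
    WHERE event_type = 'purchase'
    GROUP BY event_date
""")
```

Both routes go through the same Catalyst optimizer; the practical difference is which skill set a future maintainer needs in order to read the pipeline.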

Data Architect at Institutional Investment Firm: Coding to the common denominator? That's exactly what your example Spark engineers are doing. Garbage SQL and the "stupid developer" who doesn't understand SQL is such a common and stereotypical thing that it's basically a meme in the DBA community. SQL is hardly a common denominator on most dev teams. (And I mean something beyond basic inner joins or a where clause; even an anti-semijoin is a big ask for most devs.) It's simply not new or shiny enough to warrant consideration. And for most developers, even those with a lot of experience, it can be exceptionally difficult to shift from an iterative to a declarative mindset. It's not easier; it's harder. And all of that is fine. I'd rather not have to deal with more garbage in my life. Given the above, the important question becomes, is [your chosen pipeline technology] the most *appropriate* solution for a given problem? Usually, I hope, the answer is yes. But if, e.g., you refuse to write a join in SQL and so your pipeline has to pull 100x more rows across the network than it actually needs to work with, now you have an issue. But please, again, there's no point in SQL for the sake of SQL from developers who can't be bothered to properly invest the time and energy.
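For readers unfamiliar with the term, an anti-semijoin keeps only the rows on one side that have no match on the other. A minimal sketch, using hypothetical customers and orders data in PySpark:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("anti-semijoin").getOrCreate()

# Hypothetical data: three customers, two of whom have placed orders.
customers = spark.createDataFrame(
    [(1, "Alice"), (2, "Bob"), (3, "Carol")], ["customer_id", "name"])
orders = spark.createDataFrame(
    [(1, 100.0), (1, 25.0), (3, 40.0)], ["customer_id", "amount"])

customers.createOrReplaceTempView("customers")
orders.createOrReplaceTempView("orders")

# Anti-semijoin in SQL: customers with no orders (only Bob survives).
no_orders = spark.sql("""
    SELECT c.customer_id, c.name
    FROM customers c
    WHERE NOT EXISTS (
        SELECT 1 FROM orders o WHERE o.customer_id = c.customer_id
    )
""")

# The DataFrame API spells the same operation as a "left_anti" join:
# customers.join(orders, "customer_id", "left_anti")
```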

Senior Software Engineer (ex-Facebook, ex-Netflix): That's where "what is a data engineer" really makes a difference. In my experience at FB and Netflix, data engineers generally come from a DBA background and SQL is one of their strongest proficiencies. There are other data engineers who come from a more software engineering background as well but those are the minority among data engineers. These big companies screen very thoroughly for SQL in their data engineer roles. I agree that SQL is hardly a common denominator on most software dev teams but it definitely is a common denominator on most data eng teams. Whether you write your Spark in R, Python, Java, or Scala is much less of a common denominator although I see data engineers coalescing around PySpark in a lot of big companies. 

Senior Data Engineer at Large Insurance Firm: To [Data Architect at Institutional Investment Firm]'s point, choosing the most appropriate solution/software matters. You shouldn’t just ‘SQLize’ every data engineering problem because it’s a more common code base that people know. If my data, for instance, is already in a DB and I need to make some changes for the purpose of an application, I’ll lean towards a stored procedure. If, however, the data resides in a bunch of Hive-structured or Hadoop files and the application does not warrant the data in a DB, I’ll lean to Spark. Also, Spark does provide an easy extension to ‘SQLize’ RDDs as temp tables so you can run SQL queries, including joins, against them as if they were in a DB. Holding developers accountable to writing clean code is an entirely different beast which requires a great team, a great leader, and accountability, and should not be coupled with the coding language they are using.
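That extension is most commonly expressed today through DataFrames and temp views rather than raw RDDs. A minimal sketch, assuming hypothetical Parquet paths and columns, of registering file-based data as temp views and joining them with plain SQL:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("temp-view-sql").getOrCreate()

# Hypothetical file-based sources; no database involved.
claims = spark.read.parquet("/warehouse/claims")
policies = spark.read.parquet("/warehouse/policies")

claims.createOrReplaceTempView("claims")
policies.createOrReplaceTempView("policies")

# Plain SQL, joins included, as if the files were tables in a DB.
open_claims_by_product = spark.sql("""
    SELECT p.product_line,
           COUNT(*)       AS open_claims,
           SUM(c.reserve) AS total_reserve
    FROM claims c
    JOIN policies p ON p.policy_id = c.policy_id
    WHERE c.status = 'OPEN'
    GROUP BY p.product_line
""")
```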

Senior Software Engineer (ex-Facebook, ex-Netflix): I agree there are quite a few data eng use cases where SQL isn’t suitable (graph processing, tumbling interval evaluation, anything real-time, etc.). These use cases should never be done in SQL. Generally these are a minority of pipelines. A majority of pipelines, especially analytical ones, are just JOIN and Aggregation. SQL is so good at doing that.

Data Architect at Institutional Investment Firm: What's a tumbling interval?

Senior Software Engineer (ex-Facebook, ex-Netflix): It’s a way of sessionizing events based on their continuity, something that’s very difficult to express declaratively.

Gfesser: I'm not familiar with the term "tumbling interval", but based on context it sounds like you're referring to a tumbling window, in which case SQL can often be used. There's a reason the Spark project has invested so much in SQL over the past few years: it actually does tend to be more of a common denominator than the languages mentioned here (and remember that Spark support isn't equal across Spark-supported languages anyway), which is why support for it has increasingly been added to frameworks and products over the last several years, regardless of whether it fits the context of what developers traditionally expect. Also, be careful using the term "real-time", as it's loaded with assumptions depending on the audience. I've developed for both soft and hard real-time use cases (the latter for machinery), but because I've seen client interpretations of "real-time" all over the map with respect to time interval and the like, I rarely use this term anymore, preferring instead to explain that the processing is non-scheduled and non-batch, and then subsequently iron out the details of what this actually means for the use case.
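As an illustration of the point about tumbling windows, here is a minimal sketch, assuming hypothetical clickstream data with an event_time timestamp column, of a tumbling window expressed declaratively in Spark SQL via its built-in window() grouping function:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("tumbling-window").getOrCreate()

# Hypothetical clickstream data with an event_time timestamp column.
events = spark.read.parquet("/data/click_events")
events.createOrReplaceTempView("click_events")

# Each event falls into exactly one fixed, non-overlapping 10-minute bucket.
clicks_per_window = spark.sql("""
    SELECT window(event_time, '10 minutes') AS time_window,
           user_id,
           COUNT(*) AS clicks
    FROM click_events
    GROUP BY window(event_time, '10 minutes'), user_id
""")
```

The same grouping expression also works on a streaming DataFrame in Structured Streaming, which is part of why the batch versus "real-time" distinction is worth pinning down per use case rather than assuming.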
