By Erik Gfesser — Dec 19, 2023

Community Comment: Part 32 - Citizen users of SQL need multiple skills

Citizen users of SQL need multiple skills
Understand what tables & views needed
Generate correct output
Performantly generate correct output

The comments I provided in reaction to a community discussion thread.

Chief Product Officer at Analytics Product Firm:

I've met data analysts who can run circles around data engineers in terms of their ability to write SQL. Why? They don't have all the latest tools and tech to make it easier.

Eg: dbt let's you split up your queries into multiple tables in a DAG. In its absence, you get one shot, one monolithic query to rule then all. That takes mastery folks.

So data engineers, think twice before you block your analysts from creating views in the data warehouse/ lakehouse.

Gfesser:

SQL that generates correct output is different than SQL that performantly generates correct output. While I realize many know SQL to some extent from a functional standpoint, this doesn't necessarily mean they understand all of the implications of the SQL that they write. As such, I generally don't like seeing "citizen" users of a given data lakehouse creating their own tables or views in production. One of the things that "mastery" requires is understanding the tables and views that are really needed. My entry into working with data quite some time ago as a software engineer was performance tuning databases and the queries executed against these databases, for a wide variety of clientele who had big messes on their hands. It's not clear what is meant by "all the latest tools and tech to make it easier", but it's arguably best using industry standard methods such as SQL as the default rather than proprietary tooling.

LinkedIn Top Database Development Voice & Senior DBA at Large Insurer Subsidiary:

Spot on, Erik Gfesser

I've seen a lot of people that write views that think that views can be treated like tables. People don't understand that the code of a view simply becomes a part of the code that is accessing them and you can easily end up with the equivalent of "monster queries" than move like cold molasses, especially if the view create aggregates (index views not included here). Then they wonder why their view of views of views run slowly when the outer query look for aggregate columns that are not equal to zero.

Again, the key is that people don't understand that a view is a code vessel and not a table.

The same holds for cascading CTEs (cCTEs for short and not the same as recursive CTEs, which are frequently worse). They're not tables and shouldn't be treated as such (unless you know the magic of a blocking operator, which can't be used for everything). They can be as bad or worse than cascading views. Another name for a CTE is an "inline view".

Finally and to say it out loud, Set-Based <> All in 1 big query. A good data "person" (regardless of title) understands these things and the appropriate "Divide'n'Conquer" methods and proper DB modeling.

Subscribe to Erik on Software