Community Comment: Part 29 - Choice of normalized versus denormalized data storage depends on usage patterns

  • Normalized versus denormalized data
  • Data storage dependent on usage patterns
  • Denormalized data makes reads efficient
  • Normalized data makes writes efficient

These are comments I provided in reaction to a community discussion thread.

CEO at Business Intelligence Product Firm:

After 100s of conversations with analytics engineers, I'm convinced the way we build data pipelines needs to change.

We need a simpler, lower-effort way to build our data models. That's why I'm suggesting NERD modeling – Narrow, Entity-focused, Relational, and DRY:

The four pillars of NERD modeling:

Narrow:
Data teams are finally realizing they have too many models.

Recent…pricing discussions have been a trigger point for people to trim down. But there are other reasons: larger data pipelines have more places to break, are less maintainable, and are harder to understand.

Keep data pipelines small by keeping them short. Minimize the number of models between the raw ingested data and the BI tool. If you have 5 steps to prepare and stage, you're probably doing it wrong.

Entity-focused:
Maxime Beauchemin recently introduced ECM (link below).

The Entity-Centric idea is brilliant. It means aggregating fact features up into the dimension tables. For example, in your users table, you'd pre-calculate features such as 'logins_last_90_days.'

It's brilliant because:
(a) thinking about features by entity type is an intuitive way to consume data ('total five-star reviews by product'),
(b) this gives you a huge surface area to discover new insights. Every analytics project has inputs and outputs. What features drive which outcomes? Adding in entity features gives you a ton of new inputs to use.

You may have noticed this looks like Feature Engineering for machine learning. That's true. And since it works great for ML models, it's also very helpful for exploring data as a human.
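To make the entity-centric step concrete, here is a minimal pandas sketch; the table names, column names, and the "as of" date are hypothetical and only for illustration:

```python
import pandas as pd

# Hypothetical raw fact table of login events and a users dimension.
logins = pd.DataFrame({
    "user_id": [1, 1, 2, 3],
    "logged_in_at": pd.to_datetime(
        ["2024-05-01", "2024-06-15", "2024-03-01", "2024-06-20"]
    ),
})
users = pd.DataFrame({"user_id": [1, 2, 3], "plan": ["free", "pro", "pro"]})

# Entity-centric step: aggregate the fact up to the user grain...
as_of = pd.Timestamp("2024-06-30")  # assumed "as of" date
recent = logins[logins["logged_in_at"] >= as_of - pd.Timedelta(days=90)]
feature = (
    recent.groupby("user_id")
          .size()
          .rename("logins_last_90_days")
          .reset_index()
)

# ...and attach the pre-computed feature to the dimension table.
users_entity = users.merge(feature, on="user_id", how="left")
users_entity["logins_last_90_days"] = (
    users_entity["logins_last_90_days"].fillna(0).astype(int)
)
print(users_entity)
```

The fact table keeps its event grain; the dimension simply gains a pre-aggregated feature column.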

But now – one area where NERD differs: it is…

Relational:
Lately everyone has been ditching facts/dims in favor of One Big Table (OBT). This makes sense if you haven't set up a semantic layer yet.

But as more people switch to semantic layers, it makes more sense to leave the final 'Consumption' tables as relational facts and dimensions. Then, your semantic layer prepares more manageable tables dynamically with the metrics you need.

There are several reasons to keep your consumption-level data in facts and dimensions. It keeps your data models small, narrow (see N), and composable. It's intuitively understandable (especially when the fact tables are all Entities – see E).
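As a rough sketch of what that dynamic preparation could look like (this is not any particular semantic layer's API; the fact, dimension, and metric names are made up):

```python
import pandas as pd

# Hypothetical consumption-layer tables: a narrow fact plus a dimension.
fct_reviews = pd.DataFrame({
    "product_id": [10, 10, 20, 20, 20],
    "stars": [5, 4, 5, 5, 2],
})
dim_products = pd.DataFrame({
    "product_id": [10, 20],
    "category": ["books", "games"],
})

def metric_by(metric: str, dimension: str) -> pd.DataFrame:
    """Rough stand-in for a semantic layer: join fact to dimension at
    query time and aggregate only the metric that was asked for."""
    joined = fct_reviews.merge(dim_products, on="product_id", how="left")
    if metric == "five_star_reviews":
        joined["_value"] = (joined["stars"] == 5).astype(int)
        result = joined.groupby(dimension)["_value"].sum()
    elif metric == "avg_stars":
        result = joined.groupby(dimension)["stars"].mean()
    else:
        raise ValueError(f"unknown metric: {metric}")
    return result.rename(metric).reset_index()

# 'Total five-star reviews by category' is assembled on demand,
# not stored as yet another wide table.
print(metric_by("five_star_reviews", "category"))
```

The consumption layer stays as a narrow fact plus a dimension; the wide, metric-specific table only ever exists as a query result.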

And perhaps most importantly – relational data models are great because they're…

DRY!
Don't Repeat Yourself.

You never have just one One Big Table. You have dozens of them, with partially overlapping metrics. This is a recipe for chaos.

If your end users need to know churn, they don't know if they should use the users_behavior_new OBT, or the retention_calculation_v3 table. If you need to change your definition of churn, you'll do it repeatedly, in many poorly-referenced places.

By basing metrics directly off of the same fact table, you do it exactly once.
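For example, a churn definition kept in one place, computed directly off a single (hypothetical) subscription fact table, might look like this sketch:

```python
import pandas as pd

# Hypothetical fact table: one row per user per month with an activity flag.
fct_subscriptions = pd.DataFrame({
    "user_id":   [1, 1, 2, 2, 3, 3],
    "month":     ["2024-05", "2024-06", "2024-05", "2024-06", "2024-05", "2024-06"],
    "is_active": [True, False, True, True, True, False],
})

def churn_rate(month: str) -> float:
    """Churn is defined here, and only here: users active in the prior
    month who are no longer active in the given month."""
    prior = str(pd.Period(month, freq="M") - 1)
    f = fct_subscriptions
    active_prior = set(f.loc[(f["month"] == prior) & f["is_active"], "user_id"])
    active_now = set(f.loc[(f["month"] == month) & f["is_active"], "user_id"])
    return len(active_prior - active_now) / len(active_prior) if active_prior else 0.0

print(churn_rate("2024-06"))  # 2 of 3 previously active users churned
```

Every dashboard or report that needs churn calls the same definition, so changing it is a single edit rather than a hunt across overlapping OBTs.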

That's the NERD modeling philosophy. If people want a playbook on how to do it, lmk in comments and I'll write a blog post.

Gfesser:

The manner in which data is stored should be dependent on planned usage of that data. The concept of "one big table" is an old concept: it's called "denormalized" data. And in order to understand denormalized data, it helps to understand normalized data. I'm not sure I see these concepts explained here.

When data is normalized, it essentially means there are no relationships within a given table beyond the one that exists between the key for that table and the rest of the columns. This makes writes efficient. However, when data is denormalized it essentially means that data that would otherwise be normalized across multiple tables is stored in one table, meaning that, yes, there are additional relationships in that combined table. This makes reads efficient. Why? Because data is pre-joined: read queries don't need to waste time joining tables together at the time of consumption: the work to do this has already been done beforehand.

And this is why usage patterns matter. The data in this denormalized, one big table typically can't serve all needs: the manner in which this data has been combined has already taken into account what queries will be needed to run against it.
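A small sketch of that read/write trade-off, using hypothetical customer and order tables:

```python
import pandas as pd

# Normalized: each value lives in one place, so a write such as a customer
# moving region updates a single row in a single table.
customers = pd.DataFrame({"customer_id": [1, 2], "region": ["EU", "US"]})
orders = pd.DataFrame({
    "order_id": [10, 11, 12],
    "customer_id": [1, 1, 2],
    "amount": [50.0, 20.0, 75.0],
})

# Reading orders by region pays for the join at query time.
normalized_read = (
    orders.merge(customers, on="customer_id")
          .groupby("region")["amount"].sum()
)

# Denormalized ("one big table"): the join was done ahead of time, so the same
# read is a simple scan, but changing a customer's region now means rewriting
# every one of that customer's order rows.
orders_obt = orders.merge(customers, on="customer_id")
denormalized_read = orders_obt.groupby("region")["amount"].sum()

print(normalized_read.equals(denormalized_read))  # same answer, different cost profile
```

Both reads return the same answer; the difference is whether the join cost is paid ahead of time (denormalized) or at read time (normalized).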

CEO at Business Intelligence Product Firm:

Erik Gfesser Thanks Erik, and fun fact: I'd actually considered using Normalized for the N because I agree it's valuable.

My philosophy on this is that strict 3NF usually isn't necessary and you can store low-cardinality items directly in a dimensional table. But normalized is definitely my preference.

Yep, you'll absolutely need to do more dynamic joins with it (and some huge tables you may need to denormalize – e.g., events). But most of those joins are small and it unlocks a lot more flexibility/detail in the analytics.

And normalized is DRYer!

Gfesser:

One of the key takeaways for everyone reading this post: a "relational" table doesn't imply third normal form, because relational tables can exhibit any of several degrees of normalization. And yes, from another perspective one could instead choose to say several degrees of denormalization. All that "relational" means is that relationships exist: it doesn't say anything about the nature of these relationships.
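As an illustration of degrees of normalization within relational tables (the product and category tables here are hypothetical):

```python
import pandas as pd

# More normalized (snowflake-style): the low-cardinality category attribute
# lives in its own lookup table and is referenced by key.
dim_category = pd.DataFrame({
    "category_id": [1, 2],
    "category_name": ["books", "games"],
})
dim_product_3nf = pd.DataFrame({
    "product_id": [10, 20, 30],
    "category_id": [1, 1, 2],
})

# Less normalized (star-style): the same attribute is stored directly in the
# product dimension, as suggested above for low-cardinality items.
dim_product_star = (
    dim_product_3nf.merge(dim_category, on="category_id")
                   .drop(columns="category_id")
)

# Both variants are relational (rows related through keys); they simply sit at
# different degrees of normalization.
print(dim_product_star)
```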
