Community Comment: Part 26 - With respect to data normalization, performance is key

  • Normalized versus denormalized data
  • Discussion with purported training firm CEO
  • Performance is an important consideration
  • More factors involved than just record count

The following are comments I provided in reaction to a community discussion thread.

Founder & CEO of "Big Data" Training Firm:

Normalization vs Denormalization

Normalization is the process of dividing data into multiple smaller tables with the intent of reducing data redundancy and inconsistency.

Denormalization, however, is the opposite of the above idea: it is the technique of combining data into one large table, which inevitably leads to redundancy in the data.
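
To make the distinction concrete, here is a minimal sketch using Python's built-in sqlite3 module, with hypothetical customers, orders, and orders_flat tables (the names are illustrative, not part of the original post):

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Normalized: customer attributes live in one place; orders reference them by key.
conn.executescript("""
CREATE TABLE customers (
    customer_id INTEGER PRIMARY KEY,
    name        TEXT,
    email       TEXT
);
CREATE TABLE orders (
    order_id    INTEGER PRIMARY KEY,
    customer_id INTEGER REFERENCES customers(customer_id),
    amount      REAL
);
""")

# Denormalized: the same customer attributes are repeated on every order row.
conn.execute("""
CREATE TABLE orders_flat (
    order_id       INTEGER PRIMARY KEY,
    customer_name  TEXT,
    customer_email TEXT,
    amount         REAL
)
""")
```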

Note: redundancy causes inconsistency.
Consider the same data repeated in two places: when updating, you change it in one place and forget to change it in the second. This leads to an inconsistent state.
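
A small illustration of that update anomaly, again assuming a hypothetical denormalized orders_flat table in which a customer's email is repeated on every order row:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
CREATE TABLE orders_flat (
    order_id       INTEGER PRIMARY KEY,
    customer_name  TEXT,
    customer_email TEXT
)""")

# The same customer's email is stored twice because it repeats per order.
conn.executemany(
    "INSERT INTO orders_flat VALUES (?, ?, ?)",
    [(1, "Ada", "ada@old.example"), (2, "Ada", "ada@old.example")],
)

# Update in one place and forget the second place: only order 1 is touched.
conn.execute(
    "UPDATE orders_flat SET customer_email = ? WHERE order_id = ?",
    ("ada@new.example", 1),
)

# The table now holds two different emails for the same customer: inconsistency.
print(conn.execute(
    "SELECT order_id, customer_email FROM orders_flat").fetchall())
# [(1, 'ada@new.example'), (2, 'ada@old.example')]
```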

When retrieving data from normalized tables, we need to read many tables and perform joins, which are costly operations.
Reading data from denormalized tables, however, is quite fast, as no joins are required.
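
The difference in query shape can be sketched with the same hypothetical tables; the normalized read needs a join, while the denormalized read pulls everything from one table:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE customers (customer_id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE orders (order_id INTEGER PRIMARY KEY,
                     customer_id INTEGER REFERENCES customers(customer_id),
                     amount REAL);
CREATE TABLE orders_flat (order_id INTEGER PRIMARY KEY,
                          customer_name TEXT, amount REAL);
INSERT INTO customers VALUES (1, 'Ada');
INSERT INTO orders VALUES (10, 1, 25.0);
INSERT INTO orders_flat VALUES (10, 'Ada', 25.0);
""")

# Normalized read: a join is needed to place the customer name next to the order.
print(conn.execute("""
    SELECT c.name, o.amount
    FROM orders o JOIN customers c ON c.customer_id = o.customer_id
""").fetchall())

# Denormalized read: the same answer comes from a single table, no join required.
print(conn.execute(
    "SELECT customer_name, amount FROM orders_flat").fetchall())
```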

When should you use normalized tables versus denormalized ones?
For OLTP (Online Transaction Processing) systems, where you deal with lots of inserts, deletes, and updates, you should go with normalized tables.

For OLAP (Online Analytical Processing) systems, where you need to analyze historical data, denormalized tables are the best fit. Since you won't be updating the data there, the redundancy does not leave you in an inconsistent state.

Let's take a simple example: when you make a purchase on Amazon, an OLTP system (an RDBMS-style database) handles the transaction, so those tables should be normalized.

When Amazon analyzes its historical data, it creates denormalized tables so that the analysis is faster and costly joins are avoided.

I hope you liked the explanation; feel free to add more in the comments!

What other topics would you like me to cover?

PS ~ I teach big data and my students are leading big data teams in top companies. DM to know more!

#dataanalysis #dataengineering #dbms #normalization

What does this say to you?

Gfesser:


One overarching theme is missing here: performance. And performance for analytical operations typically needs to be handled differently than performance for transactional operations. Why? Because analytics often needs to scan many records to calculate aggregates and the like, unlike targeted queries that look for specific records. In my experience, it's still relatively common to hear development teams in the workplace talk about the number of records involved when it comes to performance, but there are many additional factors. How many fields does each record contain? What queries will be run against these records? What are the data types of the field values being queried? What are the cardinalities of those values? And what database product type is being used to store this data, or conversely, what query engine / compute layer is being used if storage is kept separate, as in data lakes? All of these are important considerations.
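
To illustrate a couple of those factors, here is a small sketch using Python's sqlite3 module and a hypothetical events table (sqlite3 stands in purely for illustration; a warehouse or a separate query engine over a data lake will differ in the details). A targeted lookup on a selective, indexed column can seek straight to its rows, while an analytical aggregate scans the table regardless of how few records any one key has:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE events (
    event_id INTEGER PRIMARY KEY,
    user_id  INTEGER,   -- high cardinality: many distinct values
    status   TEXT,      -- low cardinality: a handful of distinct values
    amount   REAL
);
CREATE INDEX idx_events_user ON events(user_id);
""")

# Targeted (OLTP-style) lookup: the planner can use the index to seek directly
# to the rows for one user.
for row in conn.execute(
        "EXPLAIN QUERY PLAN SELECT * FROM events WHERE user_id = 42"):
    print(row)

# Analytical (OLAP-style) aggregate: the planner has to scan the whole table,
# no matter how many records match any single key.
for row in conn.execute(
        "EXPLAIN QUERY PLAN "
        "SELECT status, SUM(amount) FROM events GROUP BY status"):
    print(row)
```

On a real system the analogous signal comes from the database's or query engine's explain output, and the factors listed above (field count, data types, cardinalities, product type) shape whether an index seek, a full scan, or something else wins.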
