Media Query Source: Part 44 - InformationWeek (US digital magazine); Top Tips for Weeding Out Bad Data
Best way to weed out bad data
"Bad data": low quality or inappropriate data
Acquire, cleanse, standardize & model data
My responses ended up being included in an article at InformationWeek (June 26, 2023). Extent of verbatim quote highlighted in orange, paraphrased quote highlighted in gray. Above image from cited article.
The responses I provided to a media outlet on May 25, 2023:
Media: Tips for weeding out bad data.
Media: 1. What's the best way to weed out bad data? 2. What makes this technique so effective? 3. How frequently should bad data be weeded out? 4. What can happen if bad data isn't weeded out? 5. Is there anything else you would like to add?
Media: What's the best way to weed out bad data?
Gfesser: It's important to first define what we mean by "bad" data. And perhaps just as important, we need to come to terms with what "weeding out" means.
Bad data often means low quality data, in which case "quality" would need to be defined. But bad data can also mean inappropriate data, in which case "appropriate" would need to be defined.
If it sounds like I'm being a stickler for definitions, it's because I am. If data is to provide any value for downstream usage, we need to help ensure that this data is reliable with respect to expected use cases.
To put this differently, not all data needs to adhere to the same strict standards because not all use cases are as demanding. And as such, judgment often needs to be applied to determine what is appropriate.
To simplify, I consider several factors when acquiring data. First of all, from where am I sourcing the data? Does the source represent a viewpoint that I want the data to reflect when it comes time to consume it downstream?
If so, what data do I actually need to acquire to meet my needs? While we can refer to selectively choosing data as "weeding out", data professionals typically refer to this step from the opposite point of view: what filtered subset of data is needed?
A common example of filtering is selecting the subset of data within a specific date range, but filtering can be applied to any field or set of fields. More general use cases typically involve filtering less, and more specific use cases typically involve filtering more.
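To make this filtering step concrete, here is a minimal sketch using pandas; the file name, column names, and date range are hypothetical.

```python
import pandas as pd

# Hypothetical example: land raw order data and keep only the subset needed downstream
orders = pd.read_csv("orders_raw.csv", parse_dates=["order_date"])

# Filter on a date range plus a region field; column names and values are illustrative
subset = orders[
    (orders["order_date"] >= "2023-01-01")
    & (orders["order_date"] < "2023-04-01")
    & (orders["region"] == "midwest")
].copy()
```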
After "landing" this data from a given source, it often needs to be cleaned to make sure it is readable by all downstream tooling, by resolving any data issues such as the presence of special characters that might get in the way.
After this cleansing, some data might need to be standardized so that field values are the same when representing the same thing. For example, address standardization is common.
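A minimal standardization sketch, assuming a simple lookup of street suffix variants; real address standardization typically relies on postal reference data or a dedicated service, and the address column here is hypothetical.

```python
# Hypothetical lookup of street suffix variants to standardized values
SUFFIXES = {"st": "street", "st.": "street", "ave": "avenue", "ave.": "avenue", "rd": "road", "rd.": "road"}

def standardize_address(address: str) -> str:
    """Lowercase the address and replace known suffix variants with standard values."""
    return " ".join(SUFFIXES.get(token, token) for token in address.lower().split())

# Assumes a hypothetical `address` column in the frame from the earlier sketches
subset["address"] = subset["address"].map(standardize_address)
```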
Outside of one-off situations, a common data model is typically designed to store the data so that it can serve a variety of use cases, and rules need to be applied to the data to ensure conformance to that model.
Conversely, the design of a given model may need to be revisited if it is discovered that sourced data would be more accurately represented with alternative data structures.
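One way to express such conformance rules is a lightweight check against the target model before loading; the required columns and rules below are hypothetical.

```python
import pandas as pd

# Hypothetical conformance rules for a target model
REQUIRED_COLUMNS = {"order_id", "order_date", "region", "amount"}

def check_conformance(df: pd.DataFrame) -> list[str]:
    """Return a list of violations against the target model; an empty list means conformant."""
    violations = []
    missing = REQUIRED_COLUMNS - set(df.columns)
    if missing:
        violations.append(f"missing columns: {sorted(missing)}")
        return violations
    if df["order_id"].duplicated().any():
        violations.append("duplicate order_id values")
    if (df["amount"] < 0).any():
        violations.append("negative amount values")
    return violations
```

Running a check like this at the boundary of the model keeps nonconforming records from silently landing in tables that downstream use cases depend on.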
In the world of a modern data lakehouse, all of these steps typically execute via data pipelines that acquire, cleanse, standardize, and structure the data.
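Pulled together, a minimal pipeline sketch might chain the earlier steps and persist the conformed output for downstream consumers; the paths are hypothetical, and the helper functions come from the sketches above.

```python
import pandas as pd

def run_pipeline(raw_path: str) -> pd.DataFrame:
    """Hypothetical pipeline: acquire, cleanse, standardize, and conform the data."""
    df = pd.read_csv(raw_path, parse_dates=["order_date"])   # acquire
    for col in df.select_dtypes(include="object").columns:   # cleanse
        df[col] = df[col].astype(str).map(clean_text)
    df["address"] = df["address"].map(standardize_address)   # standardize
    violations = check_conformance(df)                        # conform to the model
    if violations:
        raise ValueError(f"conformance violations: {violations}")
    df.to_parquet("orders_conformed.parquet", index=False)   # persist the stage output
    return df

conformed = run_pipeline("orders_raw.csv")
```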
Media: Is there anything else you would like to add?
Gfesser: As data traverses a given data pipeline and is modified or restructured to meet the needs of each step or stage, data that is output from each is typically stored for reference for the purposes of visibility or traceability, otherwise known as data "lineage".
Data lineage is a mechanism that keeps track of data fields that have been modified along the way, as well as the code that has been used to modify it, to enable the revisiting of upstream values when understanding is needed during consumption.
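A full lineage capability is usually provided by platform tooling, but a minimal sketch of the idea is to record, for each stage, what was written, which fields changed, and which code produced it; all names here are hypothetical.

```python
import json
from datetime import datetime, timezone

def record_lineage(stage: str, output_path: str, changed_fields: list[str], code_ref: str) -> None:
    """Append a simple lineage entry describing a pipeline stage's output."""
    entry = {
        "stage": stage,
        "output_path": output_path,
        "changed_fields": changed_fields,
        "code_ref": code_ref,  # e.g., a git commit hash for the code that ran the stage
        "recorded_at": datetime.now(timezone.utc).isoformat(),
    }
    with open("lineage_log.jsonl", "a") as log:
        log.write(json.dumps(entry) + "\n")

record_lineage("standardize", "orders_conformed.parquet", ["address"], "git:abc1234")
```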
Note that modern data lakehouses also provide the ability to store multiple stages of data in one location, via the versioning of records. As such, cleansed values of fields can be referenced, as can the "raw" values from the originally acquired data.
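For example, with a table format such as Delta Lake, earlier versions of a table can be read back via time travel; a hedged sketch, assuming a Spark session configured for Delta Lake and a hypothetical table path.

```python
from pyspark.sql import SparkSession

# Assumes a Spark session already configured with the Delta Lake extensions
spark = SparkSession.builder.getOrCreate()

# Current (cleansed and standardized) state of a hypothetical lakehouse table
current = spark.read.format("delta").load("/lake/orders")

# An earlier version of the same table, e.g., the originally acquired records
raw = spark.read.format("delta").option("versionAsOf", 0).load("/lake/orders")
```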
Additionally, multiple patterns of data pipelines can be constructed. For example, intermediary stages can be deleted and re-created later if the raw data and the original rules used to clean and restructure it are kept.
Similarly, intermediary stages don't even need to be created in some cases, for example with streaming data, where steps might best be executed in memory rather than persisted along the way because the data provides no downstream value beyond immediate consumption.
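As a minimal sketch of that in-memory pattern, a stream of events can be cleansed and handed straight to a consumer without persisting intermediate stages; the event shape and consumer are hypothetical, and clean_text comes from the earlier cleansing sketch.

```python
from typing import Iterable, Iterator

def cleanse_stream(events: Iterable[dict]) -> Iterator[dict]:
    """Cleanse each event in memory; nothing is persisted between steps."""
    for event in events:
        event["message"] = clean_text(event.get("message", ""))
        yield event

def consume(events: Iterable[dict]) -> None:
    """Hand each cleansed event directly to its immediate consumer."""
    for event in cleanse_stream(events):
        print(event)

consume([{"message": "  hello\tworld "}, {"message": "ok"}])
```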