By Erik Gfesser in media query — Nov 2, 2018

Media Query Source: Part 16 - InformationWeek (US digital magazine); Data silos: now and forever?

InformationWeek (US digital magazine)
Data silo consolidation challenges & risks
Tight coupling between apps & data
Differing definitions of seemingly same data

My responses ended up being included in an article at InformationWeek (November 13, 2018). Extent of verbatim quote highlighted in orange, paraphrased quote highlighted in gray. Above image from cited article.

The responses I provided to a media outlet on October 31, 2018:

Media: Why is it taking so long for many organizations to consolidate their data silos?

Gfesser: Consolidation is challenging and time consuming…

This is exactly the case if applications are built directly on top of the data being tackled with little or no abstraction, or business logic is distributed across both the application and data layers.

If business logic resides in the data layer, consolidation can be especially challenging when simultaneously migrating to a new database product, as stored procedures and functions are an area where database products tend to differ.
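To make the coupling point concrete, here is a minimal sketch in Python of one kind of abstraction between application and data. It is an illustration only, not code from any project discussed here: `CustomerStore`, `InMemoryCustomerStore`, and `is_overdrawn` are hypothetical names, and the "overdrawn" rule is an invented example of business logic.

```python
from typing import Dict, Protocol


class CustomerStore(Protocol):
    """Hypothetical interface: the only thing business logic knows about storage."""

    def get_balance(self, customer_id: str) -> float:
        ...


class InMemoryCustomerStore:
    """Stand-in backend; a real one would wrap a specific database product."""

    def __init__(self, balances: Dict[str, float]) -> None:
        self._balances = dict(balances)

    def get_balance(self, customer_id: str) -> float:
        return self._balances[customer_id]


def is_overdrawn(store: CustomerStore, customer_id: str) -> bool:
    # The business rule lives in application code rather than in a stored
    # procedure, so it does not depend on any one database product.
    return store.get_balance(customer_id) < 0


store = InMemoryCustomerStore({"c-1": 25.0, "c-2": -5.0})
print(is_overdrawn(store, "c-1"), is_overdrawn(store, "c-2"))
```

With this shape, consolidating or replacing a backend means writing another class that satisfies `CustomerStore`, while `is_overdrawn` stays untouched.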

Media: What are the drawbacks of having multiple data silos spread across an organization?

Gfesser: From a business perspective, data silos can simply get to the point where internal departments just aren't aware of other data existing in the organization.

Media: Are there any good reasons, such as security, for retaining data silos?

Gfesser: From my experience, security is the most cited reason for retaining data silos with respect to data owned or obtained from customers.

For example, several years ago I was brought on board by a client executive to help him get past years-long corporate infighting around data ownership…

Essentially, customers were advocating that their data be stored in separate databases for the same to-be-built software product that I later built for this client.

Customers were anxious that data they sourced and fed to this client would be available for other customers to read…

But I ended up designing a single multi-tenant database that centrally stored all data from all customers, with access to this data controlled by application and database security as well as by the way data structures were designed.
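As a rough illustration of how data structure design and application-level access can keep tenants apart in one database, the sketch below uses Python's built-in `sqlite3`. It is not the design built for the client: the `reading` table, its columns, and the `TenantScopedReadings` class are assumptions made for this example, and database-level security is not shown.

```python
import sqlite3
from typing import List, Tuple


def create_store() -> sqlite3.Connection:
    conn = sqlite3.connect(":memory:")
    # Every row carries its tenant, and the tenant is part of the key,
    # so two customers can reuse the same reading_id without colliding.
    conn.execute(
        "CREATE TABLE reading ("
        " tenant_id TEXT NOT NULL,"
        " reading_id INTEGER NOT NULL,"
        " payload TEXT NOT NULL,"
        " PRIMARY KEY (tenant_id, reading_id))"
    )
    return conn


class TenantScopedReadings:
    """Hypothetical data-access class: every statement is bound to one tenant."""

    def __init__(self, conn: sqlite3.Connection, tenant_id: str) -> None:
        self._conn = conn
        self._tenant_id = tenant_id

    def add(self, reading_id: int, payload: str) -> None:
        self._conn.execute(
            "INSERT INTO reading (tenant_id, reading_id, payload) VALUES (?, ?, ?)",
            (self._tenant_id, reading_id, payload),
        )

    def all(self) -> List[Tuple[int, str]]:
        rows = self._conn.execute(
            "SELECT reading_id, payload FROM reading"
            " WHERE tenant_id = ? ORDER BY reading_id",
            (self._tenant_id,),
        )
        return rows.fetchall()


conn = create_store()
acme = TenantScopedReadings(conn, "acme")
globex = TenantScopedReadings(conn, "globex")
acme.add(1, "acme-only")
globex.add(1, "globex-only")
print(acme.all())
print(globex.all())
```

Callers never write the tenant filter themselves, which is the point of the sketch: the scoping is built into the access path rather than left to each query.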

However, if I had agreed with the customers that data be stored separately, this wouldn't mean that the data would be stored in silos as it would be important to maintain consistent design across the databases.

But part of my answer to this question resides under #5 below: if the data across silos is completely different (apart from standard data structures), there may be no need to consolidate.

Media: What are the risks involved when consolidating data silos?

Gfesser: One big risk is incorrectly interpreting data during the consolidation process, leading to inconsistency.

The problem here is that each silo is likely to have at least some data that is seemingly similar but should really be interpreted differently.

One simple example is party data comprising either people or organizations. In some professional domains, an organization name might be represented by an individual.

Some more challenging examples involve reference data and temporal data…

Data representing processes, for example, might involve process states or steps that sound the same but are actually different, sound different but are actually the same, or take place in a different order.

And it's important to understand that it might make sense in some scenarios to make use of different interpretations depending on context and on how heavily terms are connected to the core business.

Unfortunately, many databases don't store data in a bitemporal manner (storing data in a manner which records the time period during which a given entry is valid, as well as the time period during which a given entry is considered correct by the business).

If some databases store such time periods and some do not, it will be difficult to consolidate them.
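For readers unfamiliar with bitemporal storage, the two time periods described above can be sketched as follows. This is a minimal Python illustration under stated assumptions: the `PriceFact` record, its field names, and the sample prices are invented for the example, and end dates are treated as exclusive.

```python
from dataclasses import dataclass
from datetime import date
from typing import Iterable, Optional

FOREVER = date.max  # assumption: open-ended periods use a sentinel end date


@dataclass(frozen=True)
class PriceFact:
    product: str
    price: float
    valid_from: date     # start of the period during which the entry is valid
    valid_to: date       # exclusive end of that period
    recorded_from: date  # start of the period during which the entry is considered correct
    recorded_to: date    # exclusive end; FOREVER means still considered correct


def price_as_of(
    facts: Iterable[PriceFact], product: str, valid_on: date, known_on: date
) -> Optional[float]:
    """Return the price valid on `valid_on`, as the business understood it on `known_on`."""
    for f in facts:
        if (
            f.product == product
            and f.valid_from <= valid_on < f.valid_to
            and f.recorded_from <= known_on < f.recorded_to
        ):
            return f.price
    return None


facts = [
    # Recorded on Jan 5: the price is 10.0 from Jan 1 onward.
    PriceFact("widget", 10.0, date(2018, 1, 1), FOREVER, date(2018, 1, 5), date(2018, 2, 1)),
    # Correction recorded on Feb 1: the price was really 12.0 from Jan 1 onward.
    PriceFact("widget", 12.0, date(2018, 1, 1), FOREVER, date(2018, 2, 1), FOREVER),
]

print(price_as_of(facts, "widget", date(2018, 1, 15), date(2018, 1, 20)))
print(price_as_of(facts, "widget", date(2018, 1, 15), date(2018, 3, 1)))
```

The same question ("what was the price on Jan 15?") yields different answers depending on when it is asked, which is exactly the information a non-bitemporal silo has already thrown away and cannot supply during consolidation.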

Also, when it is determined that common schemas need to be created to store all data across the enterprise, it is a given that some silos may have data elements that are unique, leading to many organizations making use of schemas at the time of reading (called "schema on read") rather than at the time of writing (called "schema on write") to provide more flexibility.

However, the need to design schemas is not really limited to the time of writing data, as the initial flexibility provided by document storage leads to similar issues at the application level when documents need to be processed.

From my experience, the reality is that all data structures need to be versioned, regardless of whether data needs to conform to specific schemas at the time writes or reads are performed, for both data at rest in databases and data in motion in message broker or streaming products.
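One common way to version data structures, shown here only as a sketch, is to stamp each document or message with a version number and upgrade older shapes when they are read. Everything specific in this Python example is an assumption: the `schema_version` field, the party document layout, and the rule that an unstamped document counts as version 1.

```python
import json
from typing import Callable, Dict

CURRENT_VERSION = 2


def _v1_to_v2(doc: dict) -> dict:
    # Hypothetical change between versions: "name" was renamed to
    # "display_name" and a "party_kind" field was introduced.
    upgraded = dict(doc)
    upgraded["display_name"] = upgraded.pop("name")
    upgraded["party_kind"] = "unknown"
    upgraded["schema_version"] = 2
    return upgraded


# One upgrade step per historical version, applied in sequence.
UPGRADES: Dict[int, Callable[[dict], dict]] = {1: _v1_to_v2}


def read_party(raw: str) -> dict:
    """Parse a stored document or message and bring it to the current version."""
    doc = json.loads(raw)
    version = doc.get("schema_version", 1)  # assumption: unstamped means version 1
    while version < CURRENT_VERSION:
        doc = UPGRADES[version](doc)
        version = doc["schema_version"]
    if version != CURRENT_VERSION:
        raise ValueError(f"unsupported schema_version: {version}")
    return doc


print(read_party('{"name": "Acme"}'))
print(read_party('{"schema_version": 2, "display_name": "Acme", "party_kind": "organization"}'))
```

The same reader works whether the payload came from a document store or off a message stream, so the version stamp travels with the data rather than with any one storage product.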

Media: What's the biggest mistake organizations make when consolidating data silos?

Gfesser: In my view, the biggest mistake isn't necessarily what is done during the actual consolidation process, but what is not done with respect to data governance prior to such an activity…

When it comes time to consolidate, if siloed work involved very little governance or documentation, there will be much more work to do when attempting to complete this activity after the fact.

Additionally, one thing I've repeatedly seen over the years is misuse of several similar-sounding phrases about a somewhat esoteric concept called "data truth"…

To help address this misuse, I wrote a series of articles on my blog between 2010 and 2017 entitled "Data Truth – Revisited" that walks through the differences between "Single Source of Data", "Single Version of the Truth", "Single Source of Truth", and something that technologists have recently been referring to as "Shared Source of Truth": https://www.erikgfesser.com/single-shared-source-version-of-truth-data-streaming-revisited/

As I commented at the time, it's important to clarify what is meant by any expressions that are used in order to help ensure miscommunication is minimized.
