By Erik Gfesser in media query — Oct 20, 2017

Media Query Source: Part 9 - InformationWeek (US digital magazine); Non-RDBMS emerges as an analytics data choice

InformationWeek (US digital magazine)
Growing adoption of non-RDBMS for analytics
Further database product hybridization likely
Product selection due diligence important

My responses ended up being included in an article at InformationWeek (October 31, 2017). Extent of verbatim quote highlighted in orange, paraphrased quote highlighted in gray. Above image from cited article.

The responses I provided to a media outlet on October 20, 2017:

Media: The Growing Popularity of Non-Relational Database Management Systems. A new study released by big data visual tools developer Zoomdata shows that data analytics has crossed the digital Rubicon, with non-relational database management system (RDBMS) technologies now comprising 70% of the data sources for performing analytics.

The Zoomdata report by Matthew D. Sarrel can be found at this link. Prior to my responses, I took the time to read the entire report published through O'Reilly entitled "The State of Data Analytics and Visualization Adoption: A Survey of Usage, Access Methods, Projects, and Skills".

My responses addressed the more detailed questions as follows:

Media: Why is there a drive toward non-relational database management system (RDBMS) technologies?

Gfesser: Most of this drive is related to the structure and volume of data being processed, as well as the relative performance of databases involved in this processing. I tend to prefer using the term "non-traditional database" these days rather than "non-relational database", because offerings across the database product spectrum have evolved over time to the degree that many databases originally designed as relational are no longer limited to the relational model.

That said, even though many database products are evolving and able to take on a greater variety of challenges, these benefits often come with additional vendor licensing costs in the case of commercial database products. For example, various commercial relational database products can scale significantly over time to meet the demands of additional data volume and usage (up to a limit), but licensing costs tend to increase as a result. And at some point, specialized hardware may be needed. With many non-traditional database products, commodity hardware can be used and the software is non-proprietary open source, with commercial support often available if it is needed, but not required.

During my computer science graduate school studies, one of my specializations was database systems. At the time, the variety of database products we see in the marketplace now simply didn't exist. Because of this situation, creative technologists devised ways of using relational databases to perform analytics, for example by preaggregating data during the data load process. They went this route because relational database products tend to attempt a balancing act, so to speak, between reading and writing data, but tend to prioritize writing. When data is aggregated before it is needed, the overhead of aggregation does not need to be incurred on the fly when users issue their queries.
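To make the preaggregation idea concrete, here is a minimal sketch using Python's built-in sqlite3 module. The table names, column names, and sample rows are hypothetical and chosen only for illustration; the point is that the summary table is computed once at load time, so the reporting query reads a precomputed total instead of aggregating detail rows on the fly.

```python
import sqlite3

# Hypothetical sales rows; table and column names are illustrative only.
rows = [
    ("2017-10-01", "east", 120.0),
    ("2017-10-01", "west", 80.0),
    ("2017-10-02", "east", 50.0),
    ("2017-10-02", "east", 25.0),
]

def load_with_preaggregation(conn, rows):
    """Load detail rows and build a daily summary table in the same step."""
    conn.execute("CREATE TABLE sales (day TEXT, region TEXT, amount REAL)")
    conn.execute(
        "CREATE TABLE sales_daily (day TEXT, region TEXT, total REAL, "
        "PRIMARY KEY (day, region))"
    )
    conn.executemany("INSERT INTO sales VALUES (?, ?, ?)", rows)
    # Aggregate once at load time so later queries do not pay for it.
    conn.execute(
        "INSERT INTO sales_daily "
        "SELECT day, region, SUM(amount) FROM sales GROUP BY day, region"
    )
    conn.commit()

def daily_total(conn, day, region):
    """Answer a reporting query from the preaggregated table."""
    cur = conn.execute(
        "SELECT total FROM sales_daily WHERE day = ? AND region = ?",
        (day, region),
    )
    row = cur.fetchone()
    return row[0] if row else 0.0

conn = sqlite3.connect(":memory:")
load_with_preaggregation(conn, rows)
print(daily_total(conn, "2017-10-02", "east"))
```

The trade-off is the one described above: the load step does more work and the summary must be kept in sync with the detail rows, in exchange for cheaper reads.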

The process of performing analytics is associated with the reading of data after the data has already been loaded from sources and is made available for reporting or visualizations. However, the data needs to be loaded first, and relational databases, while optimized for writing, still cannot typically keep up with the influx of large amounts of data. And once this data is loaded, it needs to be subsequently optimized for analytics – this is the dilemma.

Because of this dilemma, database products providing the ability to make use of "schema on read" rather than "schema on write" properties became popular so that both the writing and reading of data could be performed against the same data, with little to no needed changes to structure. In other words, the structure of data does not need to be enforced at the time of writing, and can be instead reinterpreted as necessary at the time of reading, providing significant flexibility for analytics which would otherwise involve significantly more upfront effort.
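As a rough illustration of "schema on read", the following sketch stores raw JSON lines exactly as written and applies structure only when the data is read. The field names and sample records are hypothetical, and the two reader functions are just one possible way to interpret the same stored data in two different ways.

```python
import json

# Raw events are stored as written, with no structure enforced up front.
# Field names and values here are hypothetical.
raw_events = [
    '{"user": "a", "amount": "12.50", "ts": "2017-10-20T10:00:00"}',
    '{"user": "b", "amount": 7, "channel": "web"}',
    '{"user": "a", "amount": "3.25"}',
]

def read_amounts(lines):
    """Apply one schema at read time: (user, amount as float)."""
    for line in lines:
        record = json.loads(line)
        yield record["user"], float(record["amount"])

def read_channels(lines):
    """Apply a different schema to the same data: (user, channel or None)."""
    for line in lines:
        record = json.loads(line)
        yield record["user"], record.get("channel")

# One analytic view: total amount per user.
totals = {}
for user, amount in read_amounts(raw_events):
    totals[user] = totals.get(user, 0.0) + amount

print(totals)
print(list(read_channels(raw_events)))
```

Note that the writer never had to agree on types or optional fields; that flexibility is gained at the cost of every reader having to handle inconsistencies itself.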

The Zoomdata report also mentions data streaming products such as open source Apache Kafka, and analytics can be performed on this data as well, but it is important to understand that Kafka is essentially a database. I sometimes refer to Kafka as a "data store" so as not to confuse it with relational database products, but depending on the audience this can be considered a nuance.

What is interesting is that products such as Kafka essentially turn the concept of the database inside-out in the sense that they act similarly to logs that exist under the covers within database products, providing the history of data as it works its way into the enterprise. Kafka also enables analytics to be performed on the fly, and enables consumers to keep track of what has been read from these streams rather than disposing of this data after being read like many traditional messaging products.

Because of these traits, Kafka enables going back in time and reprocessing the original data – unlike relational database products which tend to place the burden of being able to do so on database developers who need to spend significant amounts of time structuring data to be stored in a manner that provides this versioning. Additionally, Kafka can ingest vastly greater quantities of data due to the low overhead, and provides the ability to scale extensively with commodity hardware.
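The log-and-offset idea can be sketched with a toy in-memory structure. This is not Kafka's API; the class and method names below are hypothetical and only model the concept of an append-only log in which each consumer tracks its own read position, so that reading does not remove data and earlier records can be reprocessed.

```python
class AppendOnlyLog:
    """Toy in-memory log; not Kafka's API, only the underlying idea."""

    def __init__(self):
        self._records = []   # records are only ever appended
        self._offsets = {}   # consumer name -> next offset to read

    def append(self, record):
        """Add a record to the end of the log and return its offset."""
        self._records.append(record)
        return len(self._records) - 1

    def poll(self, consumer, max_records=10):
        """Return the next batch for a consumer and advance its offset."""
        start = self._offsets.get(consumer, 0)
        batch = self._records[start:start + max_records]
        self._offsets[consumer] = start + len(batch)
        return batch

    def seek(self, consumer, offset):
        """Move a consumer's position, e.g. back to 0 to reprocess."""
        self._offsets[consumer] = offset

log = AppendOnlyLog()
for event in ["order-1", "order-2", "order-3"]:
    log.append(event)

first_pass = log.poll("reporting")   # reads everything written so far
log.seek("reporting", 0)             # go back in time
replayed = log.poll("reporting")     # same records, reprocessed
late_consumer = log.poll("audit")    # a new consumer also sees full history
```

Because reading only moves a per-consumer offset, a second consumer or a replay sees the same original records, which is the contrast with messaging products that discard data once it has been read.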

Media: Is the trend toward non-RDBMS likely to gain even greater momentum in the years ahead?

Gfesser: The momentum that is likely to continue is further hybridization of database products, so that different types of processing spanning the transactional to analytical spectrum can be performed efficiently enough to lessen the need for separate database products.

Several years ago, it was very common for enterprises to start introducing polyglot database environments in which separate products were used, each specialized for a specific type of processing. While doing so provided success in many cases, when it came down to maintaining these environments, many enterprises came to realize that the staff needed to support these products also necessitated specialization.

Despite being a standard that dates back to the earliest days of commercial relational database products circa the 1980s, SQL is still very much favored by database users, partially because of historical familiarity that doesn't require learning proprietary means of working with the data. So much so that while relational database products have taken on properties of their non-relational counterparts over the last several years, non-relational database products have started introducing either SQL or SQL-like variants to meet this demand.

Over time, it isn't difficult to imagine the further merging of these product domains. Those paying attention to database product ranking websites such as db-engines.com, which currently lists 310 different database products, are witnesses to the competition that exists on the database battleground, with reigning traditional database champions working to retain their relevancy in light of changing customer demands. These database vendors are likely not going anywhere, but over time they will continue to evolve in order to survive.

Media: Any other thoughts?

Gfesser: The database product selection process should take into account how the product is really going to be used, as well as who will be expected to provide maintenance over the long term. Enterprises should be careful not to adopt technologies simply because they view them as being commonplace, or because a handful of individuals advocate their usage. Performing due diligence around product selection will pay dividends. As a consultant, I've seen many instances in which clients joined the bandwagon rather than first performing due diligence, and this typically doesn't end very well. As someone who periodically attends technology-focused meetups, I'm reminded of a Hadoop consultant who last year commented to the audience that "most Hadoop clusters out there are a mess, people do not know what they are doing".
