New Book Review: Evolving Data Infrastructure

New book review for Evolving Data Infrastructure: Tools and Best Practices for Advanced Analytics and AI, by Ben Lorica and Paco Nathan, O'Reilly, 2019:


Copy provided by O'Reilly.

This text presents O'Reilly's investigation into how firms are making use of the ABC components (artificial intelligence, big data, and cloud) as they work toward implementing analytics and automation over time. O'Reilly wanted to determine not only whether firms were building out these key components, but also to measure how sophisticated their usage is, and perhaps to uncover a roadmap for transitioning from legacy to modern practices.

At the same time, the authors state at the outset that while firms are moving key pieces of their data infrastructure to the cloud, a lack of data is also a barrier to entry with respect to making use of artificial intelligence (a term at which I continue to cringe, though at least the authors use lower case for the term "big data", and state in the next sentence that what they are really discussing is machine learning).

The catch here is that the manner in which firms carry out their roadmaps does not necessarily reflect best practices, and interestingly enough the term "best practices" appears only in this book's subtitle. What the authors discuss is really a survey carried out between late October and early November 2018, which drew 3,200 respondents (approximately 1,400 from North America, 900 from Western Europe, and 350 from South and East Asia, with the remainder from other regions).

For the purposes of interpreting survey results, the authors grouped respondents into the following categories: (1) exploring respondents (31%), who work for organizations "just beginning to use" cloud-based data infrastructure, (2) early adopter respondents (43%), who work for organizations using cloud-based data infrastructure in production for "one to three years", and (3) sophisticated respondents (26%), who work for organizations using cloud-based data infrastructure for "more than four years".

For a publication focused on data, these groupings are not contiguous: the stated ranges leave a gap between three and four years. Presumably, the first group has used cloud-based data infrastructure for less than one year, the second for at least one year but less than four years, and the third for four years or more. That said, the authors typically explain the numbers that they cite, and any deviation from this practice is largely a minor annoyance.
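To make my presumed reading concrete, here is a minimal Python sketch (my own, not from the book) of the contiguous grouping described above; the boundary handling is my interpretation, not something the authors state.

    # A minimal sketch of the contiguous grouping presumed above; the exact
    # boundary handling is my interpretation, not stated by the authors.
    def adoption_group(years_in_production: float) -> str:
        """Map years of cloud-based data infrastructure use to a survey group."""
        if years_in_production < 1:
            return "exploring"       # "just beginning to use"
        elif years_in_production < 4:
            return "early adopter"   # at least one year, less than four
        else:
            return "sophisticated"   # four years or more

    # Sanity check across the boundaries.
    for years in (0.5, 1, 3.9, 4, 7):
        print(years, "->", adoption_group(years))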

The following solutions were ranked by respondents in order of priority, with the first three categorized as "foundational data technologies": (1) data integration and ETL, (2) data science platform, (3) data preparation and cleaning, (4) data governance solutions, (5) anomaly detection, (6) metadata analysis and management, (7) data lineage and management, and (8) model transparency and explainability. While definitions are not provided for these solutions, the authors do clarify that the first three fall under the broad category of "data pipelines" (sketched briefly below).
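For readers less familiar with this grouping, here is a small, hypothetical Python example of the "data preparation and cleaning" stage of such a pipeline; the data, column names, and cleaning rules are invented for illustration and do not come from the book.

    import pandas as pd

    # Hypothetical "data preparation and cleaning" step; the data and the
    # rules are invented for illustration, not taken from the book.
    raw = pd.DataFrame({
        "customer_id": [101, 102, 102, None],
        "signup_date": ["2018-10-29", "2018-11-02", "2018-11-02", "not a date"],
        "region": ["north america", "Western Europe", "Western Europe", "south asia"],
    })

    clean = (
        raw.dropna(subset=["customer_id"])   # drop rows missing the key
           .drop_duplicates()                # remove exact duplicate rows
           .assign(
               # unparseable dates become NaT rather than raising an error
               signup_date=lambda df: pd.to_datetime(df["signup_date"], errors="coerce"),
               region=lambda df: df["region"].str.title(),  # normalize casing
           )
    )
    print(clean)

In practice these stages would run inside a scheduled pipeline rather than a one-off script, but the flavor is the same.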

The survey addresses specialized roles, but as one would expect given ongoing industry disagreement, these aren't well defined, and in this case they include overlapping roles such as "DevOps" and "DataOps". Interestingly, when the survey turns to the biggest organizational skills gaps, something called "GitOps" is additionally mentioned, albeit at the bottom of a list that includes (1) data science, (2) data engineering, (3) DevOps / SRE / platform engineering, (4) data visualization / storytelling, (5) DataOps, (6) security, (7) domain-specific expertise, (8) metadata analytics, (9) product management, (10) compliance, (11) ethics, bias, and fairness, and (12) GitOps.

With respect to specific technologies used by survey respondents, I'm always interested in seeing how my own skills overlap. Not surprisingly, AWS is used the most, followed by Azure and Google. However, I was surprised that Apache Kafka was cited so frequently among data processing tools and frameworks, comparable in cited frequency to Apache Spark and Hadoop, and to a slightly lesser extent Amazon Elastic MapReduce (EMR). The results are a bit challenging to interpret, though, since some responses in the lower half of the list overlap with the Hadoop ecosystem.
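Part of that difficulty, I suspect, is that Kafka is primarily a message transport while Spark is a compute engine, and the two are routinely wired together, so respondents could reasonably file either under "data processing". A hedged sketch of that common pairing follows; the broker address and topic name are placeholders, and the Kafka source requires Spark's spark-sql-kafka connector package.

    from pyspark.sql import SparkSession

    # Hypothetical pairing: Kafka as the transport layer, Spark as the
    # compute layer. Broker address and topic name are placeholders.
    spark = SparkSession.builder.appName("kafka-spark-sketch").getOrCreate()

    events = (
        spark.readStream
             .format("kafka")
             .option("kafka.bootstrap.servers", "localhost:9092")
             .option("subscribe", "events")
             .load()
    )

    # Kafka hands Spark raw bytes; the actual "processing" happens here.
    parsed = events.selectExpr("CAST(value AS STRING) AS payload")

    query = parsed.writeStream.format("console").start()
    query.awaitTermination()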

A colleague of mine recently remarked that AWS Glue doesn't seem to be used much in industry, for example, and while I agree that this might be the case, the same can likely be said about other products such as Databricks. Neither of these products is listed by survey respondents, but the text mentions on several occasions that respondents were asked to select from predefined lists for some questions; while it is unknown whether this was done for this part of the survey as well, it seems likely that some selection bias exists here.

While some individual Hadoop ecosystem components, such as Apache Pig, are listed here, it would be helpful to know how overlapping responses were treated. Did responses of "Apache Hadoop" include potential use of Apache Spark? And while the Hadoop distributions EMR and Azure HDInsight were specifically mentioned by respondents, could responses of "Apache Spark" include Hadoop, Databricks, and/or AWS Glue? I think it's very likely that this is the case, and as such these survey results are a bit misleading.
