By Erik Gfesser in media query — Sep 3, 2023

Media Query Source: Part 45 - Reworked (US digital magazine); OpenAI & the Databricks-Microsoft AI alliance

Reworked (US digital magazine)
OpenAI & the Databricks-Microsoft AI alliance
Databricks available on AWS, Azure & GCP
Microsoft providing AI options makes sense

My responses ended up being included in an article at Reworked (September 7, 2023). Extent of verbatim quote highlighted in orange, paraphrased quote highlighted in gray. Above image from cited article.

The responses I provided to a media outlet on August 29, 2023:

Media: What does Databricks do and what can it bring to Microsoft Azure?

Media: Recently, Microsoft announced that it is looking at offering Databricks AI through Azure. This would allow Microsoft to offer Databricks' machine learning (ML) and data analytics tools to Azure customers as part of its cloud services.

Databricks provides an AI-powered data analytics platform to help organizations build their own AI models.

However, this deal contrasts with OpenAI's approach of developing proprietary AI models and licensing them to partners like Microsoft. The two companies work closely on Microsoft 365, Windows, Bing and more.

By offering Databricks' software on Azure, Microsoft aims to meet the growing enterprise demand for customized AI tools. This allows companies to leverage AI in applications tailored to their specific business needs.

Gfesser: I have used Databricks since 2018, when it became generally available (GA) on Azure.

My first experience, while with SPR Consulting, involved resolving the issues an insurance client had run into during its unsuccessful attempts to adopt Databricks. This work included building a series of proofs of concept (PoCs) demonstrating how to do everything the client was looking to do, followed by migrating its machine learning (ML) models to Databricks to run a pilot, as well as providing recommendations on how team processes were being executed and how data pipelines were designed.

While with Deloitte these past two years, I've driven several initiatives that started with the roadmap I laid out for a phased implementation of a data lakehouse with Databricks. Although Databricks was already being used in production, few knew this because it had been implemented solely as a back-end engine to execute the workloads sent to it by a non-Azure third-party product that the team had previously adopted.

The phased implementation involved exposing application programming interfaces (APIs) to the custom application layer the team had built on top of Databricks, and re-architecting how data files were stored in the Azure Data Lake Storage (ADLS) data lake. It also involved implementing data catalogs using Apache Hive alongside Databricks SQL warehouses, so that this data could be queried in place in the data lake rather than first being loaded into another database product. Most recently, it has involved laying the groundwork to adopt Databricks Unity Catalog integrated with Microsoft Purview.

I've been an open source champion for the bulk of my career, having come to value the advantages that open source can provide, provided that appropriate due diligence is performed as part of the adoption process.

Databricks is a good example of commercialized open source. In other words, it is open source software (e.g. Apache Spark) bundled with commercially supported proprietary software, often offered by a vendor to provide additional features not otherwise available, such as infrastructure management that simplifies this work or eliminates the need for customers to do it themselves.

As I mentioned previously, AI (which includes all subsets, including ML) can already be executed on Databricks, alongside what has been traditionally called "big data" processing. But while Databricks is available on all three major public clouds (AWS, Azure, and GCP), only Azure provides Databricks as a native service. In other words, Databricks is only available on AWS and GCP via third-party marketplaces, unlike on Azure. Microsoft and Databricks worked together to offer Databricks as a first-party Azure service, and Microsoft continues to stand behind this offering.

Now, make no mistake, Microsoft already offers Azure services (e.g. Azure Synapse, and on its future roadmap, Microsoft Fabric) which overlap to some extent with Azure Databricks. While there are many reasons why Microsoft likely chooses to do so, two key reasons come to mind. First, Databricks was a first mover on AWS, bringing a competitive advantage that Microsoft sought to replicate, albeit with a native service. Second, Microsoft tends to cater to enterprise technology shops that are arguably often not as technically savvy as smaller firms or tech firms, and many engineers have become accustomed to the Microsoft ecosystem over the years. For example, Synapse provides C# compatibility for Apache Spark alongside Python, Scala, and R, which Microsoft thinks will lower the barrier for C# engineers who prefer not to learn another programming language. Of course, this means that Microsoft will always be behind on Apache Spark releases, because it must first convert Spark code (written in Scala) to C# and be satisfied with its test results before the corresponding Synapse release.

I'm not surprised that Microsoft is seemingly following a somewhat similar strategy for Databricks AI (if this is to be the name, as both Microsoft and Databricks have tended to change the names of their offerings over time). OpenAI was originally open source before Microsoft got involved, and the timing of this involvement coincided with its becoming a household name seemingly overnight, with hordes of people having a generally pleasant experience after experimenting with it. As a result, Azure OpenAI was born.

But OpenAI is a black box, not trained with customer data. At least, it is not trained with customer data unless a given customer opts in, and opting in means sharing data, something which many firms such as Deloitte seek to prevent in many use cases, such as those associated with heavily regulated domains. Of course, models trained with customer data are likely going to be the most accurate, provided that the training performed is on par with the training that a proprietary model such as OpenAI's can provide. But there are always tradeoffs to be made.

Databricks continues to follow the commercialized open source route, offering open source models that can be trained with customer data, all while keeping the data plane separate from the control plane, which essentially means that data stays in its current location. For the Databricks implementation on which my teams and I have been working at Deloitte, for example, this means that the data in the data lake doesn't move anywhere. Yes, training costs will be incurred, but again this is one of the tradeoffs that every firm looking to implement AI needs to consider. In many cases, this tradeoff may be worth the competitive advantage that is gained as a result. And based on the offerings that Databricks continues to churn out for Azure customers, the additional AI tooling that Databricks will be rolling out is expected to provide additional ease of use for its target markets.
