Implementing Databricks for Data Science and Engineering Teams: Part 2



This post is the second part of a soon-to-be-published case study I wrote about a recent client project in which my team and I implemented Azure Databricks for data science and engineering teams (the first part can be found here).


The analytics arm of a large midwestern U.S. insurance firm asked SPR to implement a Pilot phase as a follow-up to the initial proof of concept (PoC) phase that SPR had carried out during the prior months. As discussed in part 1 of this two-part series, the company had asked SPR to help by focusing on the portion of its architecture that used Azure HDInsight, and by building a PoC on the Azure Databricks Unified Analytics Platform for the company's future-state architecture.

While we took into account how Databricks usage might fit into the targeted production environment, production machine learning (ML) models and data workloads were not included in that initial effort because they were considered too sensitive at the time. As such, following the conclusion of our work on the PoC phase, SPR carried out a separate Pilot phase specifically to perform the following:

  • Migration of 3 production ML models (ideally with no changes to the original code)
  • Migration of production data used by these models
  • Recommendations based on what we saw during the Pilot phase

Using these directives, SPR first built out the components used in the initial PoC, albeit using the firm's existing Azure accounts.

Background on Databricks Proof of Concept (PoC)

As discussed in part 1 of this two-part series, the criteria for the PoC were threefold: (1) focus on the R language for ML models, (2) account for the data science team's current development processes, and (3) prefer Azure Data Factory for data pipelines, which the data engineering team was already looking to adopt.

As part of the PoC work, SPR assessed products associated with Databricks and the broader Azure ecosystem, and determined that some were either not production ready or were not an enterprise fit for the company. For example, while MLflow appeared to be a promising open source framework (created by Databricks) for standardizing the end-to-end ML life cycle, it was still in alpha and R language support had only just been added. Additionally, while we successfully set up Visual Studio 2017 to remotely connect to a Microsoft Machine Learning Server instance via Microsoft R Client, we did not recommend this tooling because it did not appear to be well positioned going forward.

After building the PoC, we additionally broke down the types of work that can be done into 12 tasks that we categorized into DevOps and data pipeline work. Components used to build all of this out included Azure Databricks, Azure Blob Storage, Azure Data Factory, RStudio Server, Azure DevOps, R, Python, Databricks CLI, PowerShell, RStudio Desktop, and Git.

Databricks Pilot

We ended up migrating a total of 6 production ML models (2x the initial goal) from HDInsight to Databricks, along with the production data used by these models. All of these models predict the level of consumer adoption that can be expected for various insurance products; some had been migrated from SAS a few years prior, when the company moved to HDInsight.

Our strategy for the first 3 models was to start with the simplest one and then migrate models of increasing complexity, the goal being to work out basic migration details before progressing to models that depend on them. Of the 14 specific technical implementation issues we encountered, we resolved or implemented workarounds for all 14, and all of them surfaced during our work on the first 3 models, which led to significantly increased team velocity for the latter 3.

Key challenges along the way can be categorized into 4 areas:

  1. poor data quality (biggest challenge)
  2. lack of comprehensive versioning
  3. tribal knowledge and incomplete documentation
  4. immature Databricks documentation and support with respect to R usage

To address these key challenges, we made recommendations in 5 key areas as a follow-up to our implementation work:

  1. big picture view
  2. automation
  3. configuration and versioning
  4. separation of data concerns
  5. documentation

Since the client's data, platform, and modeling teams are all involved with these models, solutions need to be approached, and issues analyzed and resolved, across all three teams. While it is common for silos to form within large organizations, it is especially critical for teams to stay in sync with respect to data platform usage, and as such we advised that these teams work together, leveraging their existing enterprise architecture group, to achieve alignment.

We also noticed that while work on CI / CD (continuous integration and delivery) automation had been started, what had been built was largely confined to a portion of the work being performed across these three teams. We advised that end-to-end pipelines be built across the artifacts managed by the data, platform, and modeling teams, because the work products of individual teams have an impact on the others. During our model migration work, we saw firsthand the mistakes that the lack of automation was causing, and the added cost that manual steps were incurring.

Comprehensive configuration and versioning will help fully enable such automation, reducing errors and supporting repeatable builds and deployments, and as such we included it as a separately delineated recommendation. For example, all table schemas, reference data, and R packages used by a model need to be versioned as one unit within the context of that model, so that results are consistent every time. Additionally, hard-coded values were pervasive in the code base and should be converted to easily maintained configurations.
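As a minimal sketch of what that could look like, assume a hypothetical per-model JSON configuration file versioned alongside the model code (the paths, keys, and model name below are placeholders, not the client's actual setup):

```python
# Minimal sketch: read per-model settings from a versioned config file instead
# of hard-coding them in the model code. Paths, keys, and the model name are
# illustrative placeholders.
import json

def load_model_config(model_name, config_root="/dbfs/mnt/config"):
    """Load the configuration bundle versioned with a single model."""
    with open(f"{config_root}/{model_name}/config.json") as f:
        return json.load(f)

cfg = load_model_config("adoption_model_a")
input_table = cfg["input_table"]     # replaces a hard-coded Hive / Spark SQL table name
r_packages  = cfg["r_packages"]      # pinned R package versions for this model
schema_ver  = cfg["schema_version"]  # reference data and schema versioned as one unit
```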

Code deployed as models contained content not constrained to the actual models, so we advised that data be managed as a separate concern, both to increase the visibility of data work and to prevent added complexity and ripple effects in the code base. Addressing data and schema issues separately from model work helps avoid diverting effort from value-added model development. For example, data preparation should be performed prior to model execution, so that model execution does not fail in production due to data-related issues.
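As a rough sketch of that idea, a separate PySpark validation step could run before a model executes; the table and column names below are hypothetical rather than taken from the client's environment:

```python
# Illustrative pre-execution data check (PySpark), run as its own step so that
# a model never fails in production because of data issues. Table and column
# names are hypothetical.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

def validate_inputs(table_name, required_cols, key_col):
    df = spark.table(table_name)
    missing = [c for c in required_cols if c not in df.columns]
    null_keys = df.filter(F.col(key_col).isNull()).count()
    if missing or null_keys:
        raise ValueError(f"{table_name}: missing columns {missing}, "
                         f"{null_keys} rows with null {key_col}")
    return df

features = validate_inputs("policy_features", ["policy_id", "premium", "region"], "policy_id")
```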

Lastly, we noted that the current documentation is neither comprehensive nor centralized, and that work in this direction would support maintainability as well as faster onboarding of new members to the data, platform, and modeling teams. For example: how can models as currently deployed be executed from the command line? Which Hive tables does a model use? What is the path and lineage of data as it moves from sources to consumers, including models? And where are the responsibilities of the data, platform, and modeling teams documented?

The Pilot implementation of Databricks that SPR built out included all of the technologies listed earlier for the PoC, save for Azure Data Factory, as DevOps tasks rather than data pipeline tasks remained our focus for the duration of the Pilot phase. That said, we did put together a high-level architecture document to show the firm how a data pipeline might be tackled following the Pilot phase. In addition to the aforementioned components, which included two subcomponents of Azure DevOps (Azure DevOps Repos and Azure DevOps Pipelines), Databricks Connect and MLflow were also included in this architecture.

Databricks Connect permits connections from IDEs (integrated development environments) and notebook servers to Databricks clusters in order to run Spark code; it was not released until after the conclusion of the PoC phase, during which time it was in private preview. MLflow, mentioned earlier, progressed to beta during the Pilot phase and was additionally made available as a managed service called Managed MLflow during the concluding weeks of the Pilot. With momentum building around this framework, we wanted to show the company how it might fit in with the rest of the architecture.
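As a rough illustration of the workflow Databricks Connect enables, assuming the library has already been installed and pointed at a workspace and cluster with `databricks-connect configure`:

```python
# Once databricks-connect is configured, ordinary PySpark code written in a
# local IDE executes on the remote Databricks cluster.
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()  # resolves to the configured remote cluster
spark.range(10).toDF("n").show()            # runs on the cluster; results return locally
```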

The Result

In addition to migrating 6 models (2x the initial goal) and the data they use from Apache Hive to Spark SQL (the data remained in Azure Blob Storage, which we also mounted to the Databricks File System), our key deliverables included dozens of annotated Databricks notebooks developed using PySpark and SparkR, cluster initialization ("init") scripts written to execute on Databricks cluster startup, detailed documentation explaining our work and how to perform it, and the recommendations outlined above.
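For reference, mounting a Blob Storage container to DBFS from a notebook looks roughly like the following; the storage account, container, and secret scope names are placeholders rather than the client's actual values:

```python
# Sketch of mounting an Azure Blob Storage container to the Databricks File
# System, run from a Databricks notebook (dbutils is only available there).
# Account, container, and secret scope names are placeholders.
dbutils.fs.mount(
    source="wasbs://model-data@examplestorage.blob.core.windows.net",
    mount_point="/mnt/model-data",
    extra_configs={
        "fs.azure.account.key.examplestorage.blob.core.windows.net":
            dbutils.secrets.get(scope="example-scope", key="storage-account-key")
    },
)
display(dbutils.fs.ls("/mnt/model-data"))  # confirm the mount is visible
```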

While the PoC and Pilot phases of our work were both executed successfully, subsequent work was not performed due to client budget constraints. That said, our work on these two phases convinced a very skeptical Chief Data Scientist that Databricks is the way to go, and he indicated during our final presentation to project stakeholders that the team's next goal following our departure was to migrate their remaining models to Databricks as quickly as possible!

Technologies used during this effort included Azure Databricks, RStudio Server, Azure Blob Storage, Azure HDInsight, Apache Spark, Azure DevOps, R, Python, Databricks CLI, PowerShell, RStudio Desktop, and Git.

Afternote: So Why Databricks?

A brief survey of the factors, which you may find of interest, that led our client's Chief Data Scientist to move from skepticism to adoption.

Databricks just works

At the outset, the Chief Data Scientist stated that they had already tried Databricks and that it hadn't worked out for them. Databricks had only become generally available in Azure in recent months, so the question was to what extent they had actually worked with the product. It turned out that staff lacked the needed skill sets and didn't have the bandwidth to pick them up, so the Director of Data Engineering brought my team and me on board to address this roadblock.

You might have noticed that the first client directive for the Pilot phase was to migrate a subset of ML models with no changes if at all possible to the original code. One of the primary reasons for not changing the code was to prevent downstream effects from introducing change to the code base. The only programmatic changes we ended up needing to make to the code were to address data quality issues, some of which were due to migrating from Apache Hive to Spark SQL as part of the process.

We rigorously compared model output from HDInsight and Databricks for full production datasets, and found only one output value out of dozens that differed, affecting the latter 3 models we migrated. In reviewing these differences with the Chief Data Scientist, we determined that they were due to changes in the Spark function "percentile_approx" over time, as HDInsight and Databricks were running different versions of Spark.

Note that we used a relative tolerance of 0.000000001 (1e-9) when comparing model output; when we relaxed this tolerance, the differences for this single column disappeared. Regardless, the minor deviation seen here was considered immaterial by the Chief Data Scientist.
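To make the tolerance concrete, here is a simplified version of that kind of check; the values below are made up to show the behavior and are not actual model output:

```python
# Compare a single output value from the HDInsight and Databricks runs using a
# relative tolerance. The numbers are illustrative, not real model output.
import math

hdinsight_value  = 0.823147900
databricks_value = 0.823147902

print(math.isclose(hdinsight_value, databricks_value, rel_tol=1e-9))  # False at 1e-9
print(math.isclose(hdinsight_value, databricks_value, rel_tol=1e-8))  # True once relaxed
```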

No needed changes to library packaging

While this second part of the case study mentions that we encountered and surmounted 14 specific technical implementation issues during the Pilot phase, keep in mind that we had already become familiar with Databricks while building the PoC, during which time we also addressed specific client obstacles from their initial trial runs in prior months.

One of the remaining stumbling blocks for the client team had been installing the proprietary R packages they had developed themselves, in contrast with R packages publicly available via CRAN (the Comprehensive R Archive Network). After demonstrating how to do so via notebooks, we additionally implemented the best practice of moving installation of all needed R packages into Databricks cluster initialization ("init") scripts, so that they would be available on all cluster nodes used by executing models.
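A minimal sketch of that pattern, written from a notebook, follows; the package names, paths, and init script location are placeholders, and the client's proprietary packages were installed from internal builds rather than CRAN:

```python
# Write a cluster init script that installs R packages on every node at cluster
# startup. All names and paths here are illustrative.
init_script = """#!/bin/bash
# Public dependencies from CRAN
R -e 'install.packages(c("data.table", "lubridate"), repos="https://cran.r-project.org")'
# Proprietary package from an internal build staged on DBFS
R CMD INSTALL /dbfs/mnt/r-packages/internalpkg_1.0.0.tar.gz
"""

# Store the script on DBFS, then attach this path to the cluster's init scripts.
dbutils.fs.put("dbfs:/databricks/init-scripts/install-r-packages.sh",
               init_script, True)  # True = overwrite
```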

Integration with chosen Git implementation

Databricks did not initially offer out-of-the-box support for integrating with Azure DevOps, a product which itself had only recently become generally available. But this lack of support was really a limitation of the notebook user interface, which offered minimal explicit support for Git-compatible products.

Since we did not want any unneeded user interface dependencies, we first showed the client team how to programmatically integrate with Azure DevOps. We later also implemented automated syncing of notebooks to Git, both to minimize the need for developers to do so while programming and to store all of the code in the same centralized location already used by client teams.
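Conceptually, such a sync can be built from the Databricks CLI's workspace export plus an ordinary Git push to the Azure DevOps repo; the workspace path and local clone below are placeholders, and this is a sketch rather than the exact automation we delivered:

```python
# Export notebooks from the Databricks workspace and commit them to the Git
# repo (Azure DevOps Repos) already used by the client teams. Paths are
# illustrative.
import subprocess

WORKSPACE_DIR = "/Production/Models"   # notebooks in the Databricks workspace
LOCAL_DIR = "./models-repo"            # local clone of the Azure DevOps repo

subprocess.run(["databricks", "workspace", "export_dir", "-o", WORKSPACE_DIR, LOCAL_DIR], check=True)
subprocess.run(["git", "-C", LOCAL_DIR, "add", "-A"], check=True)
subprocess.run(["git", "-C", LOCAL_DIR, "commit", "-m", "Sync notebooks from Databricks"], check=True)
subprocess.run(["git", "-C", LOCAL_DIR, "push"], check=True)
```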

Notebook alternative exists

While the Chief Data Scientist made it clear that modelers on his team use a variety of development tooling, he also made it clear that he was highly averse to using notebooks. While notebooks have become widely popular for data science in recent years, he was accustomed to programming via the command line and an IDE called RStudio Desktop, and did not like the idea of being constrained to notebooks for development, as he found some scenarios (such as debugging) challenging there.

Our solution made use of RStudio Server, whose installation we included in our cluster initialization ("init") scripts, to provide the option of a browser-based version of RStudio Desktop. Also, since the Azure implementation of Databricks (unlike the AWS implementation) does not permit SSH to clusters, RStudio Server additionally provides command line access similar to RStudio Desktop.

And as the notebooks that my team and I developed intermingled Python to execute pipeline code, R to execute models, and SQL for exploratory work, by the end of the Pilot phase the client teams had begun to see this as an advantageous feature.

Hadoop goes away

While noting again that this afternote is just a brief survey, several challenges were either expressed by the client teams or experienced by my team and me during our migration from HDInsight to Databricks. Because we needed to compare data output from models executed in HDInsight and Databricks, we first had to execute them on HDInsight, which in turn required us to work with Apache Hadoop YARN for resource management, job scheduling, and monitoring.

The HDInsight cluster was brought up at the beginning of each day and brought down at the end of it, and Hadoop processes were sometimes unavailable, preventing us from resuming our testing. Even when this issue did not surface, we still needed to be concerned with configuration, where an improper setting resulted in obscure error messages.

After migrating just the first model, the Chief Data Scientist was already expressing satisfaction with not needing to fiddle with YARN and the like. Databricks is not built on Hadoop, and therefore does not carry Hadoop's legacy constraints. And since it had been taking hours each day to bring up the HDInsight cluster used by client teams, bringing up Databricks clusters on an as-needed basis was seen as a significant benefit.

The client's focus was on the Apache Spark component of the Hadoop stack, so why be encumbered with unneeded legacy constraints? Databricks also offers a much wider selection of cluster node configurations.
