Community Comment: Part 2

The comments I provided in reaction to a community discussion thread on September 21, 2020:
https://www.linkedin.com/feed/update/urn:li:ugcPost:6712112282559758336?commentUrn=urn%3Ali%3Acomment%3A%28ugcPost%3A6712112282559758336%2C6712156655460601856%29&replyUrn=urn%3Ali%3Acomment%3A%28ugcPost%3A6712112282559758336%2C6713928878362025984%29

Director of Data Analytics and Application Development: Some best practices for implementing a cloud data lake:
Cloud Data Lake Best Practices
https://www.linkedin.com/pulse/cloud-data-lake-best-practices-jasmeet-singh/

Senior Solutions Architect at Databricks: It's interesting that you recommend using Parquet on the data lake, but Delta Lake only if needed. How would you perform the merge operation on Parquet between layers if you don't utilize Delta Lake?

Director of Data Analytics and Application Development: That is a very good question; my understanding is that it depends on the architecture. Let's say Databricks or Spark is not part of your architecture: you can just load Parquet files into time-partitioned folders and overwrite those files. You can then create external tables in Synapse DW to serve. Delta Lake has its own benefits, like time travel, so in some scenarios it can complement the DW layer.
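To make the "Parquet in time-partitioned folders" pattern concrete, here is a minimal sketch assuming pandas and pyarrow (i.e., no Spark or Databricks in the architecture); the lake path, mount point, and column names are hypothetical:

```python
# A minimal sketch of overwriting a time-partitioned Parquet folder without Delta Lake,
# assuming pandas and pyarrow are available. Paths and columns are hypothetical.
import shutil
from pathlib import Path

import pandas as pd

LAKE_ROOT = Path("/mnt/datalake/curated/sales")  # hypothetical mount of ADLS Gen2 storage

def write_daily_partition(df: pd.DataFrame, load_date: str) -> None:
    """Replace the Parquet files for a single load_date partition wholesale."""
    partition_dir = LAKE_ROOT / f"load_date={load_date}"
    if partition_dir.exists():
        shutil.rmtree(partition_dir)  # overwrite means: drop the old partition folder
    partition_dir.mkdir(parents=True, exist_ok=True)
    df.to_parquet(partition_dir / "part-000.parquet", index=False)

# Example: reload a single day's data after an upstream correction.
corrected = pd.DataFrame({"order_id": [1, 2], "amount": [10.0, 12.5]})
write_daily_partition(corrected, "2020-09-21")
```

A Synapse external table (or OPENROWSET query) could then be defined over the folder structure under LAKE_ROOT so the warehouse serves whatever partitions currently exist.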

Senior Solutions Architect at Databricks: Time Travel is a nice feature of Delta Lake, but some of the more beneficial features are around the management of data with common ETL patterns. The ability to insert, update, delete, and merge with simple SQL commands on cloud object storage is an amazing value-add of Delta Lake. Without Delta Lake, these types of operations on object storage or in data warehouse layers would be very expensive in the cloud, no? I've seen the most successful patterns emerge using Delta Lake on cheap cloud object storage to do ETL, and using a cloud data warehouse as a serving layer for use cases that require a high SLA and can absorb the higher cost.
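For readers unfamiliar with the merge operation being described, the following sketch shows an upsert into a Delta table, assuming Spark with the delta-spark package configured; the table paths and join key are hypothetical:

```python
# A minimal sketch of a Delta Lake MERGE (upsert) on cloud object storage,
# assuming a Spark session configured with delta-spark. Paths and columns are hypothetical.
from delta.tables import DeltaTable
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
    .config("spark.sql.catalog.spark_catalog",
            "org.apache.spark.sql.delta.catalog.DeltaCatalog")
    .getOrCreate()
)

updates = spark.read.parquet("/mnt/datalake/raw/customers/2020-09-21")  # newly arrived rows
target = DeltaTable.forPath(spark, "/mnt/datalake/curated/customers")   # existing Delta table

# Update matching customer rows and insert the rest -- the ACID operation on object
# storage that plain Parquet folders cannot express without rewriting files manually.
(
    target.alias("t")
    .merge(updates.alias("u"), "t.customer_id = u.customer_id")
    .whenMatchedUpdateAll()
    .whenNotMatchedInsertAll()
    .execute()
)
```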

Director of Data Analytics and Application Development: Yes, I agree, but not all organizations need to use Delta Lake as a serving layer; for many, an MPP DW with data lake storage on the back end will be sufficient, and you can have all the ACID compliance, insert, merge, and update in the DW layer. Please check how Azure Synapse is doing it: it uses Azure Storage rather than built-in storage, so you still get the benefits of lower-cost storage but do not have to maintain both Delta Lake and a DW. "With decoupled storage and compute, when using Synapse SQL pool one can: Independently size compute power irrespective of your storage needs. Grow or shrink compute power, within a SQL pool (data warehouse), without moving data. Pause compute capacity while leaving data intact, so you only pay for storage. Resume compute capacity during operational hours." Just to clarify, I am not against Delta Lake; I am trying to say please evaluate whether you need it, because in some cases it will have additional value.

Gfesser: So for best practice #3, I'm trying to reconcile your comments here with the following statement: "Implement Delta Lake only if needed, If you already have separate data warehouse layer as part of your architecture then use that instead". At a minimum, I suggest stating that Delta Lake is recommended if Spark (the minimally compatible version) is already being used, because this is the architectural dependency. Alternatively, multiple scenarios could be laid out, but since the purpose here is to explain best practices, perhaps you could just state whether you consider use of Spark a best practice. I don't think use of a "separate data warehouse layer" explains whether Delta Lake should be implemented.
