I have written a lot of blog posts by chance, after getting lost on the Internet. The one you're currently reading is no exception. I was looking for Delta Lake learning resources and found an interesting schema depicting the Unified Data Management patterns. Since the term was new to me, and I like everything with "pattern" in the name, I couldn't miss the opportunity to explore this topic!
The post will start by presenting all patterns included in this Unified Data Management concept. Later, I will complete this generic description with the corresponding features of Delta Lake, followed by a short demo.
The 8 patterns composing Unified Data Management are:
- Transactions - no rocket science here: transactional support means full ACID guarantees on the data, whether a single job or multiple concurrent jobs work on the same dataset.
- Schema metadata management - the dataset written to the storage is described by an external schema. Nothing complicated either; you certainly know it already from the Parquet format.
- Schema enforcement and evolution - in addition to the point above, the schema should not only describe the dataset but also be able to evolve with newly generated data. Evolution here covers concepts like backward and forward compatibility.
- Open format - this point is also very clear. The solution shouldn't involve any vendor lock-in, i.e. you should be able to freely access and manipulate your data.
- Fine grained updates and deletes - at the beginning of the Big Data era, performing in-place updates or deletes was far from obvious. Technically they are still not real in-place changes, because they always involve rewriting some part of the dataset in new files, but they are now possible!
- Simplified Change Data Capture - in other words, the ability to easily stream the changes written to the table.
- Data assert - the ability to control the quality of the data, e.g. by avoiding data corruption in case of invalid schema.
- Optimized query performance - does the storage technology provide any extra query optimization components, such as indexes?
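The remark that fine grained deletes are not real in-place changes can be illustrated with a small, storage-free sketch. It is a toy copy-on-write model in plain Python, not the implementation of any particular engine: the `delete_where` function, the `part-N` file names and the `.vN` suffix are all made up for the illustration. The point it tries to show is that only the files containing matching rows get rewritten.

```python
from typing import Callable, Dict, List, Tuple

Row = Dict[str, object]
Files = Dict[str, List[Row]]  # hypothetical layout: file name -> rows stored in it


def delete_where(
    files: Files, predicate: Callable[[Row], bool], version: int
) -> Tuple[Files, List[str]]:
    """Return the new file listing and the names of the files that were rewritten.

    Files without any matching row are reused as-is; files with matches are
    replaced by a new file holding only the surviving rows.
    """
    new_files: Files = {}
    rewritten: List[str] = []
    for name, rows in files.items():
        kept = [row for row in rows if not predicate(row)]
        if len(kept) == len(rows):
            new_files[name] = rows  # untouched file, no rewrite needed
            continue
        rewritten.append(name)
        if kept:  # a file that lost all of its rows is simply dropped
            new_files[f"{name}.v{version}"] = kept
    return new_files, rewritten


table = {
    "part-0": [{"id": 1, "country": "FR"}, {"id": 2, "country": "PL"}],
    "part-1": [{"id": 3, "country": "US"}],
}
after_delete, rewritten_files = delete_where(table, lambda row: row["id"] == 2, version=1)
print(rewritten_files)       # ['part-0']
print(sorted(after_delete))  # ['part-0.v1', 'part-1']
```

The original listing is left untouched, so keeping both listings side by side is a crude way to picture how a previous version of the dataset could remain readable after the delete.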
UDM and Delta Lake
Let's now see how Delta Lake, quoted in the original article as an example, supports all these features.
- Transactions - ACID guarantees are supported.
- Schema metadata management - Delta Lake is a format based on Apache Parquet, so it automatically inherits its schema capabilities. Moreover, you will find schema-related features like the one from the next point.
- Schema enforcement and evolution - as a writer, unless schema migration is explicitly allowed, your dataset cannot introduce unexpected schema changes such as an extra column, an existing column with a different type, or a column name that differs only by case. Schema evolution is supported with the mergeSchema option. The automatic evolution is limited to a handful of use cases, like new fields or some migrations between types. For other changes, mergeSchema alone is not enough and you'll need to set overwriteSchema to true, so that the dataset can be rewritten to fit the new schema.
- Open format - Delta Lake is an open source project hosted by the Linux Foundation.
- Fine grained updates and deletes - the update and delete operations with filtering conditions that you may have used in the past with an RDBMS are now also available with Delta Lake.
- Simplified Change Data Capture - Delta Lake supports the CDC pattern as a data sink, for example with the append write mode, and as a data source thanks to its Structured Streaming integration.
- Data assert - this can be related to the data corruption protection mechanism based on schema consistency.
- Optimized query performance - Apache Parquet and columnar storage are good query accelerators. However, the Databricks version of Delta Lake has another accelerator, Z-Order clustering, which is very useful for skipping the blocks that are irrelevant to the data processing logic.
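To make the difference between schema enforcement and schema evolution more concrete, here is a toy model in plain Python. It does not call Delta Lake: `resolve_schema` and its `merge_schema` flag are hypothetical names that only mimic the behavior described above for the mergeSchema option. One loud assumption: unlike the real thing, which accepts some migrations between types, this sketch rejects every type change.

```python
from typing import Dict

Schema = Dict[str, str]  # column name -> type name, a simplified schema


def resolve_schema(table: Schema, incoming: Schema, merge_schema: bool = False) -> Schema:
    """Return the schema the table would have after the write, or raise ValueError."""
    known = {name.lower(): name for name in table}
    result = dict(table)
    for name, dtype in incoming.items():
        existing = known.get(name.lower())
        if existing is None:
            if not merge_schema:
                raise ValueError(f"unexpected column: {name}")
            result[name] = dtype  # evolution: append the new column
        elif existing != name:
            raise ValueError(f"column differs only by case: {name} vs {existing}")
        elif table[existing] != dtype:
            # Assumption of this sketch: every type change is rejected.
            raise ValueError(f"type mismatch for {name}: {table[existing]} vs {dtype}")
    return result


table_schema = {"id": "long", "name": "string"}
batch_schema = {"id": "long", "name": "string", "city": "string"}

try:
    resolve_schema(table_schema, batch_schema)
except ValueError as error:
    print(error)  # unexpected column: city

print(resolve_schema(table_schema, batch_schema, merge_schema=True))
# {'id': 'long', 'name': 'string', 'city': 'string'}
```

Without the flag, the write with the extra `city` column is refused; with it, the column is appended to the table schema. Type and case conflicts are refused in both modes, which in this simplified picture corresponds to the cases where the dataset would have to be rewritten.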
Below you can find a quick demo of some of these features:
To sum up, the idea of the Unified Data Management concept is to provide a single storage layer able to fulfill all of the required patterns listed in the 2 previous sections. Delta Lake, demonstrated in the video just above, is one of the solutions supporting a major part of them. And now, after writing this article, I have a feeling that Unified Data Management is another name for the lakehouse architecture. Does it make sense?