My name is Bartosz Konieczny. I'm a freelance data engineer and author of the Data Engineering Design Patterns (O'Reilly) book. When I'm not helping clients solve data engineering challenges to drive business value, I enjoy sharing what I've learned here.
We all agree: data quality is essential for building trustworthy dashboards and ML algorithms. For a long time, the only way to validate data before writing it out to file formats was inside the data processing jobs. Thankfully, Delta Lake constraints make this validation possible at the data storage layer (technically it's still a compute layer, but at a very high level of abstraction).
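To make the idea concrete, here is a minimal sketch of such a constraint. It assumes a Delta table called orders with an amount column (both names are hypothetical); once the CHECK constraint is in place, any write violating it is rejected by the table itself rather than by job-level checks:

```python
# Minimal sketch, assuming a Delta table called orders with an amount column
# (hypothetical names). The constraint is enforced on every subsequent write.
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Declare the rule once, at the storage (table) level.
spark.sql("ALTER TABLE orders ADD CONSTRAINT positive_amount CHECK (amount > 0)")

# This append violates the constraint, so Delta Lake rejects the whole write.
spark.createDataFrame([(-5,)], ["amount"]) \
    .write.format("delta").mode("append").saveAsTable("orders")
```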
When you start a Structured Streaming job, the Spark UI gets a new tab where you can follow the progress of the running jobs. At first this part may look a bit complex, but there are some visual detection patterns that can help you understand what's going on.
Last time I wrote about a special - but logical - behavior of NULLs in joins. Today it's time to look at other queries where NULLs behave differently from columns with values.
If you know it, lucky you. If not, I bet you'll spend some time figuring out why two apparently identical rows don't match in your full outer join statement.
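As a quick, hypothetical illustration (the data and column names are made up), the root cause is that NULL = NULL does not evaluate to TRUE, so two rows that look identical never satisfy a plain equality join condition:

```python
# Hypothetical example: rows that look identical but carry NULLs do not match
# in a regular equality join, because NULL = NULL is not TRUE in SQL semantics.
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

left = spark.createDataFrame([(1, None, "left"), (2, "b", "left")],
                             ["id", "category", "side_l"])
right = spark.createDataFrame([(1, None, "right"), (2, "b", "right")],
                              ["id", "category", "side_r"])

# The (1, NULL) rows stay unmatched: the full outer join returns them twice,
# each time with NULLs on the other side.
left.join(right, ["id", "category"], "full_outer").show()

# A null-safe equality (eqNullSafe, <=> in SQL) treats the NULLs as equal.
condition = (left["id"] == right["id"]) & \
            (left["category"].eqNullSafe(right["category"]))
left.join(right, condition, "full_outer").show()
```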
This blog post complements the coverage of the data duplication problem in my recent Data Engineering Design Patterns book by approaching the issue from a different angle.
Dual writes - backend engineers have been facing this challenge for many years. If you are a data engineer with some projects running in production, you have certainly faced it too. If not, I hope this blog post will shed some light on the issue and give you a few solutions!
To close the topic of the new arbitrary stateful processing API in Apache Spark Structured Streaming, let's focus on its... batch counterpart!
Last week we discovered the new way to write arbitrary stateful transformations in Apache Spark 4 with the transformWithState API. Today it's time to delve into the implementation details and try to understand the internal logic a bit better.
Arbitrary stateful processing has been evolving a lot in Apache Spark. The initial version with updateStateByKey evolved to mapWithState in Apache Spark 2. When Structured Streaming was released, the framework got mapGroupsWithState and flatMapGroupsWithState. Now, Apache Spark 4 introduces a completely new way to interact with the arbitrary stateful processing logic, the Arbitrary state API v2!
While I was writing about agnostic data quality alerts with ydata-profiling a few weeks ago, I had an idea for another blog post, one that can generally be summarized as "what do alerts do in data engineering projects?". Since the answer is "it depends", let me share my thoughts on that.
Defining data quality rules and alerts is not an easy task. Thankfully, there are various ways to automate the work. One of them is data profiling, which we're going to focus on in this blog post!
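As a small, hedged sketch (the input file and report title are hypothetical), ydata-profiling scans a pandas DataFrame and produces a report whose alerts, such as missing values, high cardinality or skewness, can seed your data quality rules:

```python
# Minimal sketch with hypothetical file and title names.
import pandas as pd
from ydata_profiling import ProfileReport

df = pd.read_csv("orders.csv")  # hypothetical input dataset

# Profile the dataset and write an HTML report listing the detected alerts.
profile = ProfileReport(df, title="Orders profiling")
profile.to_file("orders_profile.html")
```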
One of the recommended ways of sharing a library on Databricks is to use Unity Catalog volumes to store the packages. That's the theory, but the question is: how do you connect the dots between the release preparation and the release process? I'll try to answer this in the blog post.
MERGE, aka UPSERT, is a useful operation for combining two datasets when record identity is preserved. It therefore appears to be a natural candidate for idempotent operations. Although that's true, there are some challenges when things go wrong and you need to reprocess the data.
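For reference, here is a minimal MERGE sketch on hypothetical Delta tables (orders as the target, orders_updates as the source, with made-up columns); rerunning it with the same source leaves the target unchanged, which is where the idempotency intuition comes from:

```python
# Minimal sketch; table and column names (orders, orders_updates, order_id,
# amount) are hypothetical.
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

spark.sql("""
    MERGE INTO orders AS target
    USING orders_updates AS source
        ON target.order_id = source.order_id
    WHEN MATCHED THEN
        UPDATE SET target.amount = source.amount
    WHEN NOT MATCHED THEN
        INSERT (order_id, amount) VALUES (source.order_id, source.amount)
""")
```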
Even though data engineers enjoy discussing table file formats, distributed data processing, or more recently, small data, they still need to deal with legacy systems. By "legacy," I mean not only the code you or your colleagues wrote five years ago but also data formats that have been around for a long time. Despite being challenging for data engineers, these formats remain popular among business users. One of them is Excel.
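A common, if unglamorous, way to bridge that gap is to read the spreadsheet with pandas and hand the result over to Spark; here is a minimal sketch with hypothetical file and sheet names (reading .xlsx requires the openpyxl package):

```python
# Minimal sketch; file and sheet names are hypothetical.
import pandas as pd
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# pandas handles the Excel parsing (openpyxl engine for .xlsx files)...
pdf = pd.read_excel("monthly_report.xlsx", sheet_name="Sales")

# ...and Spark takes over for the rest of the pipeline.
sales = spark.createDataFrame(pdf)
sales.show()
```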
Timely and accurate data is the Holy Grail for every data practitioner. To make it a reality, data engineers have to be careful about the transformations they apply before exposing the dataset to consumers, but they also need to understand the timeline of the data.