Welcome to the blog!

My name is Bartosz Konieczny. I'm a freelance data engineer and the author of the Data Engineering Design Patterns book (O'Reilly). When I'm not helping clients solve data engineering challenges to drive business value, I enjoy sharing what I've learned here.


Data quality on Databricks - Spark Expectations

Previously we learned how to control data quality with Delta Live Tables. Now it's time to see an open source library in action: Spark Expectations.

Continue Reading β†’

Data quality on Databricks - Delta Live Tables

Data quality is one of the key factors of a successful data project. Without good quality, even the most advanced engineering or analytics work will not be trusted and, therefore, not used. Unfortunately, data quality controls are very often considered a work item to implement at the end, which sometimes translates to never.

Continue Reading β†’

Data contracts and Bitol project

Data contracts were a hot topic in the data space before LLMs and GenAI came along. They promised a better world with fewer communication issues between teams, leading to more reliable and trustworthy data. Unfortunately, the promise has been too hard to put into practice. Has been, or should I write "was"?

Continue Reading β†’

Apache Airflow XCom in Databricks with task values

If you have been working with Apache Airflow, you have certainly met XComs at some point. You know, those variables that you can "exchange" between tasks within the same DAG. If, after switching to Databricks Workflows for data orchestration, you're wondering how to do the same, there is good news: Databricks supports this exchange capability natively with Task values.
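To give you a taste of the shape of that exchange: inside a Databricks job, the upstream task publishes a value with `dbutils.jobs.taskValues.set(...)` and a downstream task reads it with `dbutils.jobs.taskValues.get(...)`. Since `dbutils` only exists on Databricks, the snippet below is a minimal local stand-in (a hypothetical helper, not the real implementation) that mimics the same key/value contract:

```python
# Inside a Databricks job task, the real calls would look like:
#   dbutils.jobs.taskValues.set(key="row_count", value=42)            # upstream task
#   dbutils.jobs.taskValues.get(taskKey="ingest", key="row_count",
#                               default=0)                            # downstream task
# Local sketch of that contract (hypothetical class, not the Databricks API):

class TaskValuesSketch:
    """Per-task key/value store mirroring the set/get exchange shape."""

    def __init__(self, current_task):
        self.current_task = current_task
        self._store = {}  # (task_key, key) -> value

    def set(self, key, value):
        # A value is always written under the task that produced it.
        self._store[(self.current_task, key)] = value

    def get(self, taskKey, key, default=None):
        # A downstream task reads by upstream task name + key,
        # falling back to the default when nothing was published.
        return self._store.get((taskKey, key), default)


# The "ingest" task publishes a row count...
ingest = TaskValuesSketch("ingest")
ingest.set(key="row_count", value=42)
# ...and a downstream task reads it back by task name and key.
print(ingest.get(taskKey="ingest", key="row_count", default=0))  # prints 42
```

On Databricks itself, the store is scoped to a single job run, so values flow only between tasks of the same run.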

Continue Reading β†’

File trigger in Databricks

For over two years now, you have been able to leverage file triggers in Databricks Jobs to start processing as soon as a new file gets written to your storage. The feature looks amazing but hides some implementation challenges that we're going to explore in this blog post.

Continue Reading β†’

The saveAsTable in Apache Spark SQL, alternative to insertInto

Is there an easier way to avoid the position-based data writing of insertInto in Apache Spark SQL? Absolutely, if you use a column-based method such as saveAsTable in append mode.

Continue Reading β†’

Dealing with quotas and limits - Apache Spark Structured Streaming for Amazon Kinesis Data Streams

Using cloud managed services is often a love-and-hate story. On one hand, they abstract away a lot of tedious administrative work to let you focus on the essentials. On the other, they often come with quotas and limits that you, as a data engineer, have to take into account in your daily work. These limits become even more serious when you operate in a latency-sensitive context, such as stream processing.

Continue Reading β†’

Overwriting partitioned tables in Apache Spark SQL

After publishing my blog post about the insertInto trap, I got an intriguing question in the comments. The alternative to insertInto, the saveAsTable method, doesn't work well on partitioned data in overwrite mode, while insertInto does. True, but is there an alternative that doesn't require using this position-based function?

Continue Reading β†’

The insertInto trap in Apache Spark SQL

Even though Apache Spark SQL provides an API for structured data, the framework sometimes behaves unexpectedly. That's the case with the insertInto operation, which can even lead to data quality issues. Why? Let's try to understand in this short article.
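The trap in a nutshell: insertInto matches the DataFrame's columns to the target table by position, not by name. The plain-Python sketch below (made-up column names, not actual Spark calls) illustrates why a reordered DataFrame silently corrupts the written rows, and why a name-based method like saveAsTable in append mode doesn't:

```python
# insertInto resolves columns by POSITION, not by name. If the DataFrame's
# column order differs from the table's, values land in the wrong columns.
# Plain-Python illustration (hypothetical columns, not real Spark code):

table_columns = ["id", "login", "age"]                   # target table definition
dataframe_row = {"login": "user1", "id": 1, "age": 20}   # columns got reordered

# Position-based write, as insertInto does: zip values to columns by order.
positional = dict(zip(table_columns, dataframe_row.values()))
print(positional)  # {'id': 'user1', 'login': 1, 'age': 20} -> id/login swapped!

# Name-based write, as saveAsTable in append mode does: match on column names.
by_name = {column: dataframe_row[column] for column in table_columns}
print(by_name)     # {'id': 1, 'login': 'user1', 'age': 20} -> correct
```

Notice that the positional write raises no error; the swap only shows up later, as a data quality issue in the table.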

Continue Reading β†’

Event time skew and global watermark in Apache Spark Structured Streaming

A few months ago I wrote a blog post about event time skew and how dangerous it is for a stateful streaming job. Since it was a high-level explanation, I didn't cover Apache Spark Structured Streaming in depth at that time. Now the watermark topic is back in my learning backlog, and it's a good opportunity to return to event time skew and see the dangers it brings to Structured Streaming stateful jobs.

Continue Reading β†’

Delta Lake and restore - traveling in time differently

Time travel is quite a popular Delta Lake feature. But did you know it's not the only one you can use to interact with past versions? An alternative is the RESTORE command, and it'll be the topic of this blog post.

Continue Reading β†’

2024 retrospective on waitingforcode.com

Even though I was blogging less in the second half of the previous year, the retrospective is still the blog post I look forward to each year. Every year I summarize what happened in the past 12 months and share my future plans. It's time for the 2024 edition!

Continue Reading β†’

DAIS 2024: Unit tests - configuration and declaration

Code organization and assertion flow are both important, but even they can't guarantee your colleagues' adherence to unit tests. There are other user-facing attributes to consider as well.

Continue Reading β†’

DAIS 2024: Orchestrating and scoping assertions in Apache Spark Structured Streaming

Testing batch jobs is not the same as testing streaming ones. Although the transformation (the WHAT from the previous article) is similar in both cases, more complete validation tests of the job logic are not. After all, streaming jobs often build the final outcome iteratively, while batch ones generate it in a single pass.

Continue Reading β†’

Data+AI Summit 2024 - Retrospective - Apache Spark

Welcome to the second blog post dedicated to the previous Data+AI Summit. This time I'm going to share with you a summary of Apache Spark talks.

Continue Reading β†’