Articles about Data engineering patterns on waitingforcode.com - articles for the pleasure of learning and discovery

September 1, 2025 • Data engineering patterns

Get it once, few words on data deduplication patterns in data engineering

This blog post complements the coverage of the data duplication problem in my recent Data Engineering Design Patterns book by approaching the issue from a different angle.

Continue Reading →

July 10, 2025 • Data engineering patterns

Good to know if you merge - reprocessing challenges

MERGE, aka UPSERT, is a useful operation for combining two datasets as long as record identity is preserved. It therefore appears as a natural candidate for idempotent operations. Although that's true, there will be some challenges when things go wrong and you need to reprocess the data.
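
To make the idempotency claim concrete, here is a minimal in-memory sketch. It is not taken from the article: the `merge` helper, the `id` key and the sample rows are hypothetical, and a real MERGE would run against a table format or a database rather than a dictionary.

```python
def merge(target, updates, key="id"):
    """Upsert `updates` into `target` (a dict keyed by record identity).

    Matching records are overwritten field by field, new ones are inserted.
    Returns a new dict and leaves `target` untouched.
    """
    merged = dict(target)
    for row in updates:
        identity = row[key]
        merged[identity] = {**merged.get(identity, {}), **row}
    return merged


# Hypothetical data: one existing record and a batch that updates it and adds another.
target = {1: {"id": 1, "amount": 10}}
batch = [{"id": 1, "amount": 15}, {"id": 2, "amount": 7}]

once = merge(target, batch)
twice = merge(once, batch)  # replaying the same batch changes nothing
```

Replaying the same batch leaves the result unchanged because every record keeps its identity; the sketch only covers that happy path, not the reprocessing challenges the post is about.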

Continue Reading →

January 3, 2024 • Data engineering patterns

Stream processing models

If you're interested in stream processing, I bet your thinking is technology-based. That's not wrong; after all, the ability to use a tool gives you and me a job. However, in the long term it's better to reason in terms of patterns or models. Being aware of a more general vision helps you assimilate new tools.

Continue Reading →

April 11, 2021 • Data engineering patterns

Right to be forgotten patterns: vertical partitioning

In my previous post I shared with you an approach called crypto-shredding that can serve as a solution for the "right to be forgotten" point of GDPR. One of its drawbacks was performance degradation due to the need to fetch and decrypt every sensitive value. To overcome it, I first thought about a cache but ended up understanding that the answer is not a cache but something else! And I will explain this in the blog post.

Continue Reading →

April 4, 2021 • Data engineering patterns

Right to be forgotten patterns: crypto-shredding

Thanks to the most recent data regulation policies, we can ask a service to delete our personal data. Even though it seems relatively easy in a Small Data context, it's a bit more challenging for Big Data systems. Fortunately, subject to the approval of your legal department, there is a smart solution to that problem called crypto-shredding.

Continue Reading →

March 13, 2021 • Data engineering patterns

Unified Data Management patterns

I have written a lot of blog posts by chance, after getting lost on the Internet. That's also the case for the one you're currently reading. I was looking for Delta Lake learning resources and found an interesting diagram depicting the Unified Data Management patterns. Since this term was new to me, and I like everything with "pattern" in the name, I couldn't miss the opportunity to explore this topic!

Continue Reading →

November 22, 2020 • Data engineering patterns

Design patterns applied to the data

GoF Design Patterns are pretty easy to understand if you are a programmer. You can read one of many books or articles, and analyze their implementation in the programming language of your choice. But it can be less obvious for data people with a weaker software engineering background. If you are in this group and wondering what these GoF Design Patterns are about, I hope this article will help a bit.

Continue Reading →

October 4, 2020 • Data engineering patterns

Data deduplication with an intermediate data store

Last year I wrote a blog post about a batch layer in streaming-first architectures like Kappa. There I presented a few approaches to synchronize the streaming broker with an object store or a distributed file system without introducing duplicates. Some months ago I found another architectural design that I would like to share with you here.

Continue Reading →

July 19, 2020 • Data engineering patterns

Landing zone or direct writes?

I don't know whether it's a good sign or not, but I'm starting to have some convictions about building data systems. Of course, building an architecture will always be a story of trade-offs, but there are some practices that I tend to prefer over others. In this article I will share my thoughts on one of them.

Continue Reading →

June 13, 2020 • Data engineering patterns

Duplicates in data engineering reprocessing - problems and solutions

Poor data quality comes in different forms. Incomplete datasets, inconsistent schemas, and the same attribute represented in multiple formats are only some of them. Another one, which I would like to address in this post, is duplicates.

Continue Reading →

April 11, 2020 • Data engineering patterns

Slowly changing dimensions types and Apache Spark SQL examples

Some time ago I got an interesting question in a comment about slowly changing dimensions. Shame on me, but it was the first time I had encountered this term. After a quick search, I found some basic information and decided to document it in this blog post.

Continue Reading →

April 4, 2020 • Data engineering patterns

Output invalidation pattern with time travel

Some time ago I wrote a blog post about the output invalidation pattern using immutable time-based tables. Today, even though I planned to start exploring the new ACID-compliant file formats only at the end of this year, I decided to cheat a little (curiosity got the better of me), adapt the pattern to one of these formats, and use its time travel feature to guarantee data consistency.

Continue Reading →

November 24, 2019 • Data engineering patterns

Output invalidation pattern

My last slides at Spark Summit 2019 were dedicated to an output invalidation pattern that is very useful for building maintainable data pipelines. In this post I will delve deeper into it.

Continue Reading →

September 29, 2019 • Data engineering patterns

Big Data patterns implemented - processing abstraction

Can you imagine a world where everybody speaks the same language? It's difficult. Fortunately, it's much easier in data engineering, where a single API can apply to both batch and streaming processing.

Continue Reading →

August 18, 2019 • Data engineering patterns

Big Data patterns implemented - dataset decomposition

This next post in the series about implemented data engineering patterns came to my mind when I saw a question about applying custom partitioning to a non-pair RDD. In case you didn't know, it's not supported and, IMO, one of the reasons for that comes from the dataset decomposition pattern implementation in Apache Spark.

Continue Reading →

August 4, 2019 • Data engineering patterns

ETL data patterns with Apache Airflow

Some time ago I found an article presenting ETL patterns. It's quite interesting (link in the "Read more" section) but it doesn't provide code examples. That's why I will try to complement it with implementations of the presented patterns in Apache Airflow.

Continue Reading →

July 4, 2019 • Data engineering patterns

Idempotent consumer with AWS DynamoDB streams

In my previous post I presented an implementation of the idempotent consumer pattern with Apache Cassandra CDC. One of the drawbacks of that solution was the need to produce the messages with slower lightweight transactions. In this post I will show you how to do the same with AWS DynamoDB streams, without that constraint.

Continue Reading →

July 3, 2019 • Data engineering patterns

Big Data patterns implemented - Complex Logic Decomposition

The KISS principle is valid not only for software engineering but also for data pipelines. The pattern called Complex Logic Decomposition illustrates this pretty well.

Continue Reading →

June 27, 2019 • Data engineering patterns

Change Data Capture and Apache Cassandra idempotent consumer

Recently I wrote posts about the idempotent consumer pattern, analyzing its Apache Camel implementation, and about CDC applied to NoSQL stores. After that I had an idea: what would happen if we mixed both of them?

Continue Reading →

June 26, 2019 • Data engineering patterns

Big Data patterns implemented - data size reduction

After several weeks of inactivity, the series about data engineering patterns is back. In this comeback article, I will present a pattern called dataset reduction.

Continue Reading →
