Data engineering articles

Apache Airflow and sequential execution

One of the patterns you may implement in batch ETL is sequential execution: the output of one job execution is part of the input for the next one. Even though Apache Airflow comes with 3 properties to deal with concurrency, you may need another one to avoid bad surprises.

Continue Reading →

Skewed data

An even data distribution is one of the prerequisites for performant data processing. However, it is not always given, and sometimes you will encounter an uneven distribution, called skew.

Continue Reading →
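
To make the idea of skew concrete, here is a minimal sketch that hash-partitions a list of keys and compares the largest partition with the average one. The function names (`partition_sizes`, `skew_ratio`), the CRC32-based partitioner, and the sample keys are assumptions made for illustration; they are not taken from the post.

```python
import zlib
from collections import Counter


def partition_sizes(keys, num_partitions):
    """Count how many records land in each hash partition."""
    sizes = Counter()
    for key in keys:
        # CRC32 is used instead of hash() so the result is stable across runs.
        partition = zlib.crc32(key.encode("utf-8")) % num_partitions
        sizes[partition] += 1
    return [sizes.get(p, 0) for p in range(num_partitions)]


def skew_ratio(sizes):
    """Largest partition divided by the mean size; 1.0 means perfectly even."""
    mean = sum(sizes) / len(sizes)
    return max(sizes) / mean if mean else 0.0


if __name__ == "__main__":
    # Hypothetical dataset: one "hot" key dominates the others.
    keys = ["hot"] * 90 + [f"user-{i}" for i in range(10)]
    sizes = partition_sizes(keys, num_partitions=4)
    print(sizes, skew_ratio(sizes))
```

A ratio well above 1.0 means that one task receives far more records than the others, which is the situation the post calls skew.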

CASE - SQL if-else

The CASE operator is probably one of the least known among SQL beginners. Often, when I see a question about how to write an if-else condition in a SQL query, some people advise writing a UDF and using if-else directly inside it. As you will see in this post, that solution is overkill.

Continue Reading →
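
As a quick illustration of CASE used as an if-else, the sketch below runs a query against an in-memory SQLite database through Python's `sqlite3` module. The `orders` table, the thresholds, and the `label_orders` helper are hypothetical and only serve the example.

```python
import sqlite3


def label_orders(amounts):
    """Return (amount, size) pairs where size is computed by a CASE expression."""
    conn = sqlite3.connect(":memory:")
    try:
        conn.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, amount REAL)")
        conn.executemany(
            "INSERT INTO orders (amount) VALUES (?)", [(a,) for a in amounts]
        )
        # CASE evaluates its WHEN branches in order, like an if / elif / else chain.
        return conn.execute(
            """
            SELECT amount,
                   CASE
                       WHEN amount >= 100 THEN 'big'
                       WHEN amount >= 10 THEN 'medium'
                       ELSE 'small'
                   END AS size
            FROM orders
            ORDER BY id
            """
        ).fetchall()
    finally:
        conn.close()


if __name__ == "__main__":
    print(label_orders([5, 50, 500]))
```

No UDF is involved: the branching lives entirely in the query.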

EXISTS operator in SQL

Years ago, when I started working as a software engineer, I overused the IN/NOT IN operators. One day, one of my colleagues suggested that I replace them in some queries with EXISTS/NOT EXISTS, and it helped improve the performance of those queries. If some of you are like "me years ago", I prepared this short post introducing the EXISTS/NOT EXISTS operators by comparing them to IN/NOT IN.

Continue Reading →
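
The sketch below puts the two families of operators side by side on an in-memory SQLite database. It does not measure performance, which depends on the engine and the data. It only shows that the queries are interchangeable for matching rows and that they differ once the subquery returns a NULL. The `customers` and `orders` tables and the `compare_in_and_exists` helper are hypothetical.

```python
import sqlite3


def compare_in_and_exists():
    """Run the same lookup with IN, EXISTS, NOT IN and NOT EXISTS."""
    conn = sqlite3.connect(":memory:")
    try:
        conn.executescript(
            """
            CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT);
            CREATE TABLE orders (customer_id INTEGER);
            INSERT INTO customers VALUES (1, 'Ann'), (2, 'Bob'), (3, 'Cid');
            -- One order has no customer: the subquery will return a NULL.
            INSERT INTO orders VALUES (1), (NULL);
            """
        )
        queries = {
            "in": "SELECT name FROM customers c WHERE c.id IN "
                  "(SELECT customer_id FROM orders) ORDER BY c.id",
            "exists": "SELECT name FROM customers c WHERE EXISTS "
                      "(SELECT 1 FROM orders o WHERE o.customer_id = c.id) ORDER BY c.id",
            "not_in": "SELECT name FROM customers c WHERE c.id NOT IN "
                      "(SELECT customer_id FROM orders) ORDER BY c.id",
            "not_exists": "SELECT name FROM customers c WHERE NOT EXISTS "
                          "(SELECT 1 FROM orders o WHERE o.customer_id = c.id) ORDER BY c.id",
        }
        return {
            label: [row[0] for row in conn.execute(sql)]
            for label, sql in queries.items()
        }
    finally:
        conn.close()


if __name__ == "__main__":
    for label, names in compare_in_and_exists().items():
        print(label, names)
```

With the NULL in `orders`, NOT IN returns no rows at all because the comparison evaluates to unknown, while NOT EXISTS still returns the customers without orders.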

Big Data patterns implemented - dataset decomposition

This next post about implemented data engineering patterns came to my mind when I saw a question about applying custom partitioning to a non-pair RDD. If you don't know, it's not supported, and IMO one of the reasons comes from how the dataset decomposition pattern is implemented in Apache Spark.

Continue Reading →

Testing sensors in Apache Airflow

Unit tests are the backbone of any software, data-oriented included. However, testing some parts that way may be difficult, especially when they interact with the external world. An Apache Airflow sensor is an example from that category. Fortunately, thanks to Python's dynamic nature, testing sensors can be simplified a lot.

Continue Reading →

ETL data patterns with Apache Airflow

Some time ago I found an article presenting ETL patterns. It's quite interesting (link in the "Read more" section) but it doesn't provide code examples. That's why I will try to complement it with implementations of the presented patterns in Apache Airflow.

Continue Reading →

Externally triggered DAGs in Apache Airflow

In one of my previous posts, I described orchestration and coordination in the data context. At the end, I promised to provide some code to back up the theory and architecture described there. That moment of truth has now come.

Continue Reading →

Idempotent consumer with AWS DynamoDB streams

In my previous post I presented an implementation of the idempotent consumer pattern with Apache Cassandra CDC. One of the drawbacks of that solution was the need to produce the messages with slower lightweight transactions. In this post I will show you how to do the same with AWS DynamoDB streams, without that constraint.

Continue Reading →

Big Data patterns implemented - Complex Logic Decomposition

The KISS principle is valid not only for software engineering but also for data pipelines. The pattern called Complex Logic Decomposition illustrates this well.

Continue Reading →

Change Data Capture and Apache Cassandra idempotent consumer

Recently I wrote posts about the idempotent consumer pattern, analyzing its Apache Camel implementation, and about CDC applied to NoSQL stores. After that I had an idea: what would happen if we mixed both of them?

Continue Reading →

Big Data patterns implemented - data size reduction

After several weeks of inactivity, the series about data engineering patterns is back. In this comeback article, I will present a pattern called dataset reduction.

Continue Reading →

Idempotent consumer pattern

Idempotence is maybe what I appreciate the most in data engineering. If you write idempotent logic, you don't need to worry when it is reprocessed: it will not generate duplicates or inconsistent results between runs. However, achieving it is not always easy, and I'm actively looking for all patterns related to it. This time I will focus on the idempotent consumer implementation in Apache Camel. Even though it may sound old-school next to modern streaming and messaging solutions, it's a good solution to know.

Continue Reading →
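
The core of the pattern fits in a few lines. The sketch below is plain Python and does not use or mimic the Apache Camel API. It assumes every message carries a unique identifier and keeps the already processed identifiers in memory, whereas a real consumer would persist them. The `IdempotentConsumer` class and its methods are hypothetical names.

```python
class IdempotentConsumer:
    """Wrap a handler so that each message id is processed at most once."""

    def __init__(self, handler):
        self._handler = handler
        # In-memory store of processed ids; an assumption made for brevity.
        self._seen_ids = set()

    def consume(self, message_id, payload):
        """Process the payload unless its id was already seen.

        Returns True when the handler ran and False for a duplicate.
        """
        if message_id in self._seen_ids:
            return False
        self._handler(payload)
        # The id is recorded only after the handler succeeded, so a failed
        # message can be redelivered and retried.
        self._seen_ids.add(message_id)
        return True


if __name__ == "__main__":
    processed = []
    consumer = IdempotentConsumer(processed.append)
    consumer.consume("msg-1", "first")
    consumer.consume("msg-1", "first")  # redelivery, ignored
    consumer.consume("msg-2", "second")
    print(processed)
```

Redelivering `msg-1` leaves the result unchanged, which is what makes reprocessing safe.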

Data curation concept

There are a lot of data engineering ideas starting with "data", and sometimes they may be confusing. In this post I will focus on the data curation concept and, among other things, show how it differs from other "data-like" terms.

Continue Reading →

Message queues and streaming brokers

Before I came to data engineering, I worked a lot with web services and messaging technologies like RabbitMQ and Spring Integration. The day I started with streaming brokers, I was a little confused, since everything seemed the same but was in reality slightly different. There were, and still are, some subtle differences between queues and streaming brokers. In this post, I will focus on them and try to give a better definition of queues and streams.

Continue Reading →

Reservoir sampling for bounded and unbounded data

Every time I see something new, I try to note it somewhere and come back to it later. The "later" is driven by how many times I encounter that thing: the more often I see it in books or at conferences, the sooner I dig into it. That's the story of this post about the reservoir sampling algorithm, which I encountered twice during the last month.

Continue Reading →
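
For readers who have not met the algorithm yet, here is a minimal sketch of its classic form (often called Algorithm R), which keeps a uniform sample of `k` items from a stream of unknown length in a single pass. The post may cover other variants. The function name and the optional `rng` parameter are assumptions made for the example.

```python
import random


def reservoir_sample(stream, k, rng=None):
    """Return up to k items sampled uniformly from an iterable of unknown size."""
    if rng is None:
        rng = random.Random()
    reservoir = []
    for index, item in enumerate(stream):
        if index < k:
            # Fill the reservoir with the first k items.
            reservoir.append(item)
        else:
            # Pick a slot in [0, index]; the new item replaces an existing one
            # only when the slot falls inside the reservoir, which happens
            # with probability k / (index + 1).
            slot = rng.randint(0, index)
            if slot < k:
                reservoir[slot] = item
    return reservoir


if __name__ == "__main__":
    # Works on a generator, so the input never has to fit in memory.
    events = (f"event-{i}" for i in range(1_000))
    print(reservoir_sample(events, k=5, rng=random.Random(42)))
```

Because only the reservoir is kept in memory, the same function works for a bounded file and for a window over an unbounded stream.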

Change Data Capture and NoSQL

Change Data Capture (CDC) is a technique that helps to move smoothly from a classical, static data warehouse solution to a modern streaming-centric architecture. To do that, you can use solutions like Debezium, which connects an RDBMS or MongoDB to Apache Kafka. In this post, I will try to check whether CDC can also apply to other data stores like Apache Cassandra, Elasticsearch and AWS DynamoDB.

Continue Reading →

Bipartite graph recommendation example

When I was analyzing the API of Gelly, I was quite surprised by its support for bipartite graphs. First, because I didn't know that data structure, and second, because it wasn't supported in the other frameworks I analyzed. Hence, I added that graph structure to my backlog and some time later wrote a post to explain it better.

Continue Reading →

Big Data patterns implemented - fan-out ingress in Apache Spark Structured Streaming

In the previous post from the Big Data patterns implemented series, I wrote about a pattern called fan-in ingress. The idea was to consolidate the data coming from different sources. This time I will cover its companion, called fan-out ingress, which does exactly the opposite.

Continue Reading →

Data pipelines: orchestration, choreography or both?

Some time ago I found an interesting article describing two approaches to synchronizing data pipelines: orchestration and choreography. The article ended with an interesting proposal to use both of them as a hybrid solution. In this post, I will try to implement that idea.

Continue Reading →