Data engineering articles

Looking for something else? Check the categories of Data engineering:

Apache Airflow Big Data algorithms Big Data problems - solutions Data engineering patterns General Big Data General data engineering Graphs SQL

If not, below you can find all articles belonging to Data engineering.

Externally triggered DAGs in Apache Airflow

In one of my previous posts, I described orchestration and coordination in the data context. At the end I promised to provide some code proofs to the theory and architecture described there. And that moment of truth is just coming.

Continue Reading →

Idempotent consumer with AWS DynamoDB streams

In my previous post I presented an implementation of idempotent consumer pattern with Apache Cassandra CDC. One of drawbacks of that solution was the necessity of producing the messages with slower lightweight transactions. In this post I will show you how to do the same with AWS DynamoDB streams and without that constraint.

Continue Reading →

Big Data patterns implemented - Complex Logic Decomposition

KISS principle is valid not only for software engineering but also for data pipelines. The pattern called Complex Logic Decomposition illustrates this pretty well.

Continue Reading →

Change Data Capture and Apache Cassandra idempotent consumer

Recently I wrote posts about idempotent consumer pattern analyzing Apache Camel implementation and CDC applied on NoSQL stores. After that I had an idea, what happened if we would mix both of them?

Continue Reading →

Big Data patterns implemented - data size reduction

After several weeks of inactivity, the series about data engineering patterns is back. In this resume's article, I will present a pattern called dataset reduction.

Continue Reading →

Idempotent consumer pattern

Idempotence is something I appreciate, maybe the most, in data engineering. If you write an idempotent logic you don't need to worry when your logic is reprocessed. You don't need to worry that it will generate duplicates or inconsistent results between runs. However, using it is not always easy and I'm actively looking for all related patterns to it. This time I will focus on idempotent consumer implementation in Apache Camel. Even though it may sounds old-school with modern streaming and messaging solutions, it's a good solution to know.

Continue Reading →

Data curation concept

There are a lot of data engineering ideas starting with "data" and sometimes they may be confusing. In this post I will focus on the data curation concept and, among others, show some differences with other "data-like" terms.

Continue Reading →

Message queues and streaming brokers

Before I came to data engineering, I was working a lot with web services and messaging technologies like RabbitMQ and Spring Integration. The day when I started with streaming brokers I was a little bit confused since everything seemed the same but in reality, was slightly different. There were and still are some subtle differences between queues and streaming brokers. In this post, I will focus on them and try to give a better definition for queues and streams.

Continue Reading →

Reservoir sampling for bounded and unbounded data

Every time when I see a new thing, I try to note it somewhere and come back later. The "later" is driven by how many times I will meet that thing. More often I see it in the books or conferences, earlier I deep delve into it. And that's the story of this post about reservoir sampling algorithm I met twice during last month.

Continue Reading →

Change Data Capture and NoSQL

Change Data Capture (CDC) is a technique helping to smoothly pass from classical and static data warehouse solution to modern streaming-centric architecture. To do that you can use solutions like Debezium which connects RDBMS or MongoDB to Apache Kafka. In this post, I will try to check whether CDC can also apply to other data stores like Apache Cassandra, Elasticsearch and AWS DynamoDB.

Continue Reading →

Bipartite graph recommendation example

When I was analyzing the API of Gelly, I was quite surprised for its support of bipartite graphs. First, because I didn't know that data structure and second because it wasn't supported in other analyzed frameworks. Hence, I added that graph structure to my backlog and sometime later wrote a post to explain it better.

Continue Reading →

Big Data patterns implemented - fan-out ingress in Apache Spark Structured Streaming

In the previous post from Big Data patterns implemented series, I wrote about a pattern called fan-in ingress. The idea was to consolidate the data coming from different sources. This time I will cover its companion called fan-out ingress, doing exactly the opposite.

Continue Reading →

Data pipelines: orchestration, choreography or both?

Some time ago I found an interesting article describing 2 faces of synchronizing the data pipelines - orchestration and choreography. The article ended with an interesting proposal to use both of them as a hybrid solution. In this post, I will try to implement that idea.

Continue Reading →

Batch layer in streaming-based architectures - approaches

Streaming processing is great because it guarantees low latency and quite fresh insight. But on the other side, we won't always need such latency and for these situations, a batch processing will often be a better fit because of apparently simpler semantics. In data architectures, batch layer is perceived differently. Kappa, which is a streaming-based model, makes it optional when the streaming broker can guarantee long data retention. But if it's not the case, the data must be copied into some more persistent storage like a distributed file system.

Continue Reading →

Big Data patterns implemented - fan-in ingress

The series about the implementation of Big Data patterns continues. This time I will focus on a streaming pattern called fan-in ingress.

Continue Reading →

Big Data patterns implemented - automated processing metadata insertion

Sometimes metadata is disregarded but very often it helps to retrieve the information easier and faster. One of such use cases are the headers of Apache Parquet where the stats about the column's content are stored. The reader can, without parsing all the lines, know whether what is he looking for is in the file or not. The metadata is also a part of one of Big Data patterns called automated processing metadata insertion.

Continue Reading →

Big Data patterns implemented - automated dataset execution

Some time ago I found a site listing Big Data patterns (link in "Read also" section). However, that site describes them from a very general point of view and it's not always obvious to figure out the what, why and how. That's why I decided to start a new series of posts where I will try to describe these patterns and give some more technical context.

Continue Reading →

Introduction to horizontal scalability

Two great features whose I experienced when I have been working with Dataflow were the serverless character and the auto-scalability. That's why when I first saw the Apache Spark on Kubernetes initiative, I was more than happy to write one day the pipelines automatically adapting to the workload. That also encouraged me to discover the horizontal scalability and this post is the first result of my recent research on that topic.

Continue Reading →

SQL GROUPING SETS operator

I have already described grouping sets feature in the context of Apache Spark. But natively they are a part of SQL standard and that's why I would like to extend the previous post here. After all, you don't need Big Data to use them - even though nowadays it's difficult to not to deal with it.

Continue Reading →

Minus/except operator in SQL

Last time we've discovered the INTERSECT operator. To recall it quickly, it returns all rows that are defined in the combined datasets. Today we'll discover another operator, doing the opposite and called depending on the vendor: MINUS or EXCEPT.

Continue Reading →