Data engineering articles

Looking for something else? Check the categories of Data engineering:

Apache Airflow, Big Data algorithms, Big Data problems - solutions, Data engineering patterns, General Big Data, General data engineering, Graphs, SQL

If not, below you can find all articles belonging to Data engineering.

Becoming a data engineer - a feedback of my journey

Recently a reader asked me in a PM about the things to know and to learn before starting to work as a data engineer. Since I think that my point of view may be interesting for more than one person (if not, I'm really sorry), I decided to write a few words about it.

Continue Reading →

Duplicates in data engineering reprocessing - problems and solutions

Poor data quality comes in different forms. Incomplete datasets, inconsistent schemas, and the same attribute represented in multiple formats are only some of them. Another one, which I would like to address in this post, is duplicates.

Continue Reading →

Dealing with time delta in Apache Airflow

In batch processing we often give the pipeline some time to catch up with late data, i.e. the pipeline for 9 will be executed only at 11. One of the methods to do so in Airflow is to compute the delta in the tasks, but there is a more "native" way with TimeDeltaSensor.

Continue Reading →

DAG evolution - using start_date and end_date?

One of the greatest properties in data engineering is idempotency. No matter how many times you run your pipeline, you will always end up with the same outcome (= 1 file, 1 new table, ...). However, this property may be easily broken when you need to evolve your pipeline. In this blog post, I will verify one possible way to manage it in Apache Airflow.

Continue Reading →
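To make the idempotency property from the teaser above concrete, here is a minimal sketch in plain Python. It is not taken from the linked article: the `run_pipeline` function, the file naming scheme, and the `2020-01-01` execution date are all hypothetical. It only illustrates that a run which overwrites its output, instead of appending to it, leaves exactly one file no matter how many times it is repeated.

```python
import tempfile
from pathlib import Path


def run_pipeline(execution_date: str, output_dir: Path) -> Path:
    """Write the output for one execution date, replacing any previous output.

    Hypothetical pipeline: the output path depends only on the execution date,
    so a rerun targets the same file.
    """
    output_file = output_dir / f"output_{execution_date}.txt"
    # write_text truncates the file first, so a rerun overwrites, never appends
    output_file.write_text(f"processed data for {execution_date}\n")
    return output_file


with tempfile.TemporaryDirectory() as tmp:
    out_dir = Path(tmp)
    # Run the same logical execution three times
    for _ in range(3):
        run_pipeline("2020-01-01", out_dir)
    print(sorted(p.name for p in out_dir.iterdir()))  # ['output_2020-01-01.txt']
```

If the function appended to the file or generated a new file name on every run, each rerun would change the outcome and the property would be lost.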

Optimistic concurrency control - a little bit of theory and a little bit more examples

It has been a while since I last wrote about general distributed systems topics. That's the reason for this article, where I will focus on optimistic concurrency control.

Continue Reading →
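As a rough illustration of the idea behind optimistic concurrency control, the sketch below uses a version number that every writer must present. It is an assumption-laden toy, not the implementation discussed in the linked article: `VersionedStore`, `VersionConflict`, and `increment_with_retry` are hypothetical names, and the store lives in memory.

```python
import threading
from dataclasses import dataclass


class VersionConflict(Exception):
    """Raised when the record changed after the caller read it."""


@dataclass(frozen=True)
class Versioned:
    value: int
    version: int


class VersionedStore:
    """Hypothetical in-memory store holding a single versioned record."""

    def __init__(self, initial_value: int) -> None:
        self._record = Versioned(initial_value, 0)
        # The lock only makes the short compare-and-set step atomic;
        # readers and the caller's computation never hold it.
        self._lock = threading.Lock()

    def read(self) -> Versioned:
        return self._record

    def write(self, new_value: int, expected_version: int) -> Versioned:
        with self._lock:
            if self._record.version != expected_version:
                raise VersionConflict(
                    f"expected version {expected_version}, "
                    f"found {self._record.version}"
                )
            self._record = Versioned(new_value, expected_version + 1)
            return self._record


def increment_with_retry(store: VersionedStore, max_attempts: int = 5) -> Versioned:
    """Read, compute, and write; start over if another writer got there first."""
    for _ in range(max_attempts):
        snapshot = store.read()
        try:
            return store.write(snapshot.value + 1, snapshot.version)
        except VersionConflict:
            continue
    raise VersionConflict(f"gave up after {max_attempts} attempts")
```

The writer never blocks others while it computes the new value. It only pays a cost, the retry, when a conflict actually happens, which is why the approach tends to suit workloads where conflicts are rare.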

Managing task dependencies - data or triggering?

One of the most powerful features of an orchestration system is the ability to ... yes, orchestrate different and apparently unrelated pipelines. But how do you do so? By directly triggering a task or by using the data?

Continue Reading →

Dark data and data discovery in Apache Spark SQL

Preparing for an AWS exam is not only a good way to discover AWS services but also to learn more general concepts. It happened to me when I first heard about dark data during a talk presenting AWS Glue.

Continue Reading →

Slowly changing dimensions types and Apache Spark SQL examples

Some time ago I got an interesting question in a comment about slowly changing dimensions. Shame on me, but it was the first time I had encountered this term. After a quick search, I found some basic information and decided to document it in this blog post.

Continue Reading →

Output invalidation pattern with time travel

Some time ago I wrote a blog post about the output invalidation pattern using immutable time-based tables. Today, even though I planned to start exploring the new ACID-compliant file formats only by the end of this year, I decided to cheat a little (curiosity beat me), adapt the pattern to one of these formats, and use its time travel feature to guarantee data consistency.

Continue Reading →

Big Data and data removal - truncate or delete?

When I started to work with data on my very first PHP and Java projects, I used only the DELETE statement to remove data. When I switched to (big) data engineering, I found more efficient ways to deal with this operation through TRUNCATE or DROP.

Continue Reading →

Data validation frameworks - Deequ and Apache Griffin overview

Poor data quality is a major pain for data workers. Data engineers often need to deal with inconsistent JSON schemas, data analysts have to figure out dataset issues to avoid biased reports, and data scientists have to spend a lot of time preparing data for training instead of dedicating this time to model optimization. That's why having a good tool to control data quality is very important.

Continue Reading →

Extended JSON validation with Cerberus - error definition and normalization

Last November I spoke at the Paris.py meetup about integrating Cerberus with PySpark to enhance JSON validation. During the talk, I covered some points that I would like to share with you in this blog post, mostly about error definition and normalized validation.

Continue Reading →

Apache Airflow gotchas

From time to time I try to help other people on StackOverflow, and one of my tagged topics is Apache Airflow. In this blog post I'll try to show you some of the problems I have seen there over the last few months.

Continue Reading →

Output invalidation pattern

My last slides at Spark Summit 2019 were dedicated to an output invalidation pattern that is very useful for building maintainable data pipelines. In this post I will delve deeper into it.

Continue Reading →

Big Data patterns implemented - processing abstraction

Can you imagine a world where everybody speaks the same language? It's difficult. Fortunately, it's much easier to do in data engineering, where a single API can apply to both batch and streaming processing.

Continue Reading →

Apache Airflow and sequential execution

One of the patterns that you may implement in batch ETL is sequential execution. It means that the output of one job execution is a part of the input for the next job execution. Even though Apache Airflow comes with 3 properties to deal with concurrency, you may need another one to avoid bad surprises.

Continue Reading →

Skewed data

An even data distribution is one of the guarantees of performant data processing. However, it's not a given, and sometimes you can encounter an uneven distribution, called skew.

Continue Reading →
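To give a feeling of what skew looks like, the sketch below assigns records to partitions by hashing their key, a common but here purely assumed partitioning strategy. The key names, the number of partitions, and the 90/10 split are made up for illustration, and `zlib.crc32` stands in for whatever hash function a real engine would use.

```python
import zlib
from collections import Counter
from typing import Iterable, List


def partition_for(key: str, partitions: int) -> int:
    """Deterministic hash partitioner (crc32 is only a stand-in hash)."""
    return zlib.crc32(key.encode("utf-8")) % partitions


def partition_sizes(keys: Iterable[str], partitions: int) -> List[int]:
    """Count how many records land in each partition."""
    sizes = Counter(partition_for(key, partitions) for key in keys)
    return [sizes.get(partition, 0) for partition in range(partitions)]


# Hypothetical dataset: one "hot" key produces 90 of the 100 records
keys = ["user_a"] * 90 + [f"user_{i}" for i in range(10)]
sizes = partition_sizes(keys, partitions=4)
print(sizes)
```

Whatever the hash function, all records of the hot key end up in the same partition, so one task processes at least 90 records while the others share the remaining 10.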

CASE - SQL if-else

The CASE operator is perhaps one of the least known among beginner SQL users. Often when I see a question about how to write an if-else condition in a SQL query, some people advise writing a UDF and using if-else directly inside it. As you will see in this post, this solution is a bit of an overkill.

Continue Reading →
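As a quick illustration of CASE acting as an if-else inside a query, here is a small sketch run against an in-memory SQLite database through Python's standard `sqlite3` module. The `orders` table, its rows, and the size thresholds are invented for the example and do not come from the linked article.

```python
import sqlite3


def label_orders(conn: sqlite3.Connection) -> list:
    """Label each order with a size computed by a CASE expression."""
    query = """
        SELECT id,
               CASE
                   WHEN amount < 10 THEN 'small'
                   WHEN amount < 100 THEN 'medium'
                   ELSE 'large'
               END AS size
        FROM orders
        ORDER BY id
    """
    return conn.execute(query).fetchall()


conn = sqlite3.connect(":memory:")
# Hypothetical table and data, only there to feed the CASE expression
conn.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, amount INTEGER)")
conn.executemany(
    "INSERT INTO orders (id, amount) VALUES (?, ?)",
    [(1, 5), (2, 50), (3, 500)],
)
print(label_orders(conn))  # [(1, 'small'), (2, 'medium'), (3, 'large')]
```

The branches are evaluated in order and the first matching WHEN wins, exactly like an if / else if / else chain, with no UDF involved.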

EXISTS operator in SQL

Years ago, when I started to work as a software engineer, I was overusing the IN/NOT IN operator. One day, one of my colleagues suggested that I replace it in some queries with EXISTS/NOT EXISTS, and it helped to improve the performance of these queries. If some of you are like "me years ago", I prepared this short post introducing the EXISTS/NOT EXISTS operator by comparing it to the IN/NOT IN one.

Continue Reading →
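The sketch below puts the two forms side by side on an in-memory SQLite database, using Python's standard `sqlite3` module. It does not measure performance, which depends on the engine and the data. It only shows the equivalent syntax and one well-known behavioral difference: NOT IN returns nothing when the subquery yields a NULL, whereas NOT EXISTS is not affected. The `customers` and `orders` tables and their rows are hypothetical.

```python
import sqlite3


def build_db() -> sqlite3.Connection:
    """Create a hypothetical dataset; order 4 has no customer (NULL)."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT)")
    conn.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, customer_id INTEGER)")
    conn.executemany(
        "INSERT INTO customers VALUES (?, ?)",
        [(1, "Ann"), (2, "Bob"), (3, "Cid")],
    )
    conn.executemany(
        "INSERT INTO orders VALUES (?, ?)",
        [(1, 1), (2, 1), (3, 3), (4, None)],
    )
    return conn


def names(conn: sqlite3.Connection, where_clause: str) -> list:
    """Return customer names matching the given WHERE clause, ordered by id."""
    query = f"SELECT c.name FROM customers c WHERE {where_clause} ORDER BY c.id"
    return [row[0] for row in conn.execute(query)]


IN_CLAUSE = "c.id IN (SELECT customer_id FROM orders)"
EXISTS_CLAUSE = "EXISTS (SELECT 1 FROM orders o WHERE o.customer_id = c.id)"
NOT_IN_CLAUSE = "c.id NOT IN (SELECT customer_id FROM orders)"
NOT_EXISTS_CLAUSE = "NOT EXISTS (SELECT 1 FROM orders o WHERE o.customer_id = c.id)"

conn = build_db()
print(names(conn, IN_CLAUSE))          # ['Ann', 'Cid']
print(names(conn, EXISTS_CLAUSE))      # ['Ann', 'Cid']
print(names(conn, NOT_IN_CLAUSE))      # [] because of the NULL customer_id
print(names(conn, NOT_EXISTS_CLAUSE))  # ['Bob']
```

The empty NOT IN result comes from SQL's three-valued logic: comparing an id with NULL is neither true nor false, so no row can be proven to be absent from the list.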

Big Data patterns implemented - dataset decomposition

This next post about implemented data engineering patterns came to my mind when I saw a question about applying custom partitioning to a non-pair RDD. If you don't know, it's not supported, and IMO one of the reasons for that comes from the implementation of the dataset decomposition pattern in Apache Spark.

Continue Reading →