Data engineering articles

Looking for something else? Check the categories of Data engineering:

Apache Airflow Big Data algorithms Big Data problems - solutions Data engineering patterns General Big Data General data engineering Graphs SQL

If not, below you can find all articles belonging to Data engineering.

Data Vault 2.0 and Big Data

In the previous blog post you discovered the first version of Data Vault methodology. But since the very first iteration, the specification evolved and a few years ago a version 2 was proposed. More adapted to the Big Data world, with several deprecation notes, and more examples adapted to the constantly evolving data world.

Continue Reading →

Good books to read for data engineers

Few weeks ago I got a comment asking me about the recommended data engineering books. I mentioned few of them in Becoming a data engineer - a feedback of my journey blog post but without explaining why. I will try to complete that in this blog post then.

Continue Reading →

Data modeling with Data Vault - part 1

If you hear "agile", "adapted to the changes", you certainly think about Scrum, Kanban and generally the Agile methodology. And you're correct but it's worth knowing that the agile term also applies to the data. More exactly, to the data modeling with the approach called Data Vault.

Continue Reading →

Design patterns applied to the data

GoF Design Patterns are pretty easy to understand if you are a programmer. You can read one of many books or articles, and analyze their implementation in the programming language of your choice. But it can be less obvious for data people with a weaker software engineering background. If you are in this group and wondering what these GoF Design Patterns are about, I hope this article will help a bit.

Continue Reading →

Project Oryx - Lambda architecture for data science

Lambda architecture is one of the first officially defined Big Data architectures. However, after few time it was replaced by simpler approaches like Kappa. But despite that, you can still find the projects on Lambda and one of them which grabbed my attention is Project Oryx.

Continue Reading →

Data deduplication with an intermediate data store

Last year I wrote a blog post about a batch layer in streaming-first architectures like Kappa. I presented there a few approaches to synchronize the streaming broker with an object or distributed file systems store, without introducing the duplicates. Some months ago I found another architectural design that I would like to share with you here.

Continue Reading →

Data validation frameworks - Great Expectations and orchestration

So far I played with Great Expectations and discovered the main classes. Today it's time to see how to automate our data validation pipeline.

Continue Reading →

Data validation frameworks - Great Expectations classes

In my previous post I presented a very simplified version of a Great Expectations data validation pipeline. Today, before going further and integrating the pipeline with a data orchestration tool, it's a good moment to see what's inside the framework.

Continue Reading →

Data validation frameworks - introduction to Great Expectations

When I published my blog post about Deequ and Apache Griffin in March 2020, I thought that there was nothing more to do with data validation frameworks. Hopefully, Alexander Wagner pointed me out another framework, Great Expectations that I will discover in the series of 3 blog posts.

Continue Reading →

Landing zone or direct writes?

I don't know whether it's a good sign or not, but I start having some convictions about building data systems. Of course, building an architecture will always be the story of trade-offs but there are some practices that I tend to prefer than the others. And in this article I will share my thoughts on one of them.

Continue Reading →

Becoming a data engineer - a feedback of my journey

Recently a reader asked me in a PM about the things to know and to learn before starting to work as a data engineer. Since I think that my point of view may be interesting for more than 1 person (if not, I'm really sorry), I decided to write a few words about it.

Continue Reading →

Duplicates in data engineering reprocessing - problems and solutions

Poor quality of data comes out in different forms. The incomplete datasets, inconsistent schemas, the same attribute represented in multiple formats are only some of the characteristics. Another point that I would like to address in this post, are duplicates.

Continue Reading →

Dealing with time delta in Apache Airflow

Often in batch processing we give the pipeline some time to catch up late data, ie. the pipeline for 9 will be executed only at 11. One of methods to do so in Airflow is to compute delta on the tasks but there is a more "native" way with TimeDeltaSensor.

Continue Reading →

DAG evolution - using start_date and end_date?

One of the greatest properties in data engineering is idempotency. No matters how many times you will run your pipeline, you will always end up with the same outcome (= 1 file, 1 new table, ...). However, this property may be easily broken when you need to evolve your pipeline. In this blog post, I will verify one possible way to manage it in Apache Airflow.

Continue Reading →

Optimistic concurrency control - a little bit of theory and a little bit more examples

It has been a while since I didn't write about general distributed systems topics. That's the reason for this article where I will focus on the topic of an optimistic concurrency control.

Continue Reading →

Managing task dependencies - data or triggering?

One of the most powerful features of an orchestration system is the ability to ... yes, orchestrate different and apparently unrelated pipelines. But how to do so? By directly triggering a task or by using the data?

Continue Reading →

Dark data and data discovery in Apache Spark SQL

Preparing an AWS exam is not only a good way to discover AWS services but also more general concepts. It happened to me when I first heard about dark data during a talk presenting AWS Glue.

Continue Reading →

Slowly changing dimensions types and Apache Spark SQL examples

Few times ago I got an interesting question in the comment about slowly changing dimensions data. Shame on me, but I encountered this term for the first time. After a quick search, I found some basic information and made a decision to document it in this blog post.

Continue Reading →

Output invalidation pattern with time travel

Some time ago I wrote a blog post about output invalidation pattern using immutable time-based tables. Today, even though I planned to start to explore new ACID-compliant file formats only by the end of this year, I decided to cheat a little (curiosity beat me) and try to adapt the pattern to one of these formats and use time travel feature to guarantee data consistency.

Continue Reading →

Big Data and data removal - truncate or delete?

When I started to work with data on my very first PHP and Java projects, I used only DELETE operator to remove the data. When I switched to (big) data engineering, I found more efficient ways to deal with this operation through TRUNCATE or DROP operations.

Continue Reading →