Data engineering articles


Looking for something else? Check the categories of Data engineering:

Apache Airflow, Big Data algorithms, Big Data problems - solutions, Data engineering patterns, Databricks, General Big Data, General data engineering, Graphs, SQL

If not, below you can find all articles belonging to Data engineering.

April 4, 2021 • Data engineering patterns

Right to be forgotten patterns: crypto-shredding

Thanks to the most recent data regulation policies, we can ask a service to delete our personal data. Even though it seems relatively easy in a Small Data context, it's a bit more challenging for Big Data systems. Fortunately, and subject to the authorization of your legal department, there is a smart solution to that problem called crypto-shredding.
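To make the idea a bit more concrete, here is a minimal sketch of crypto-shredding, not the implementation from the article. It assumes one key per user, kept in a hypothetical in-memory `KeyStore`, and it uses a one-time-pad XOR from the standard library as a toy stand-in for a real cipher such as AES-GCM. All names (`KeyStore`, `encrypt_record`, `decrypt_record`) are made up for the illustration.

```python
import secrets


class KeyStore:
    """Hypothetical in-memory key store holding one key per user."""

    def __init__(self):
        self._keys = {}

    def create_key(self, user_id, length):
        # Toy assumption: one random key as long as the record (one-time pad).
        key = secrets.token_bytes(length)
        self._keys[user_id] = key
        return key

    def get_key(self, user_id):
        return self._keys.get(user_id)

    def shred(self, user_id):
        # Deleting the key is the "forget" operation: the encrypted data
        # can stay where it is, but nobody can read it anymore.
        self._keys.pop(user_id, None)


def xor_bytes(data, key):
    # Toy cipher for illustration only; use a vetted cipher in real systems.
    return bytes(d ^ k for d, k in zip(data, key))


def encrypt_record(store, user_id, plaintext):
    key = store.create_key(user_id, len(plaintext))
    return xor_bytes(plaintext, key)


def decrypt_record(store, user_id, ciphertext):
    key = store.get_key(user_id)
    if key is None:
        return None  # key was shredded, the record is unreadable
    return xor_bytes(ciphertext, key)
```

The point of the design is that the deletion request touches only the small key store, while the large and possibly immutable datasets holding the ciphertext do not need to be rewritten.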


March 28, 2021 • General data engineering

ML for data engineers - what I learned when preparing GCP Data Engineer certification

I wrote this blog post a week before taking the GCP Data Engineer exam, hoping it would help me organize a few things in my head (it did!). I also hope that it will help you understand ML from a data engineering perspective!


March 13, 2021 • Data engineering patterns

Unified Data Management patterns

I wrote a lot of blog posts by chance, after losing myself on the Internet. That's also the case for the one you're currently reading. I was looking for Delta Lake learning resources and found an interesting diagram depicting the Unified Data Management patterns. Since this term was new to me, and I like everything with "pattern" in the name, I couldn't miss the opportunity to explore this topic!


March 7, 2021 • General data engineering

DataOps - good and bad points

After introducing DataOps concepts, it's a good time to share my feelings on them.


February 14, 2021 • General data engineering

DataOps - DevOps in the data world or a bit more than that?

The term "DataOps" has been in my backlog for a while already, and I have postponed it multiple times. But I finally found some time to learn more about it and share my thoughts with you.


January 3, 2021 • General Big Data

Data Vault 2.0 and Big Data

In the previous blog post you discovered the first version of the Data Vault methodology. Since that very first iteration, the specification has evolved, and a few years ago version 2 was proposed. It is better adapted to the Big Data world, comes with several deprecation notes, and includes more examples suited to the constantly evolving data world.


December 13, 2020 • General data engineering

Good books to read for data engineers

A few weeks ago I got a comment asking me about recommended data engineering books. I mentioned a few of them in the "Becoming a data engineer - a feedback of my journey" blog post, but without explaining why. I will try to fill that gap in this blog post.


December 6, 2020 • General Big Data

Data modeling with Data Vault - part 1

If you hear "agile" or "adapted to changes", you certainly think about Scrum, Kanban and the Agile methodology in general. And you're correct, but it's worth knowing that the term "agile" also applies to data, and more exactly to data modeling with the approach called Data Vault.


November 22, 2020 • Data engineering patterns

Design patterns applied to the data

GoF Design Patterns are pretty easy to understand if you are a programmer. You can read one of many books or articles, and analyze their implementation in the programming language of your choice. But it can be less obvious for data people with a weaker software engineering background. If you are in this group and wondering what these GoF Design Patterns are about, I hope this article will help a bit.


November 8, 2020 • Big Data problems - solutions

Project Oryx - Lambda architecture for data science

Lambda architecture is one of the first officially defined Big Data architectures. However, after some time it was replaced by simpler approaches like Kappa. Despite that, you can still find projects built on Lambda, and one of them that grabbed my attention is Project Oryx.


October 4, 2020 • Data engineering patterns

Data deduplication with an intermediate data store

Last year I wrote a blog post about a batch layer in streaming-first architectures like Kappa. There I presented a few approaches to synchronizing the streaming broker with an object store or a distributed file system without introducing duplicates. Some months ago I found another architectural design that I would like to share with you here.


August 23, 2020 • Big Data problems - solutions

Data validation frameworks - Great Expectations and orchestration

So far I have played with Great Expectations and discovered its main classes. Today it's time to see how to automate our data validation pipeline.


August 9, 2020 • Big Data problems - solutions

Data validation frameworks - Great Expectations classes

In my previous post I presented a very simplified version of a Great Expectations data validation pipeline. Today, before going further and integrating the pipeline with a data orchestration tool, it's a good moment to see what's inside the framework.


July 26, 2020 • Big Data problems - solutions

Data validation frameworks - introduction to Great Expectations

When I published my blog post about Deequ and Apache Griffin in March 2020, I thought that there was nothing more to do with data validation frameworks. Fortunately, Alexander Wagner pointed me to another framework, Great Expectations, which I will explore in a series of 3 blog posts.


July 19, 2020 • Data engineering patterns

Landing zone or direct writes?

I don't know whether it's a good sign or not, but I'm starting to have some convictions about building data systems. Of course, building an architecture will always be a story of trade-offs, but there are some practices that I tend to prefer over others. In this article I will share my thoughts on one of them.


June 28, 2020 • General data engineering

Becoming a data engineer - a feedback of my journey

Recently a reader asked me in a PM about the things to know and to learn before starting to work as a data engineer. Since I think that my point of view may be interesting for more than 1 person (if not, I'm really sorry), I decided to write a few words about it.


June 13, 2020 • Data engineering patterns

Duplicates in data engineering reprocessing - problems and solutions

Poor data quality comes in different forms. Incomplete datasets, inconsistent schemas, and the same attribute represented in multiple formats are only some of them. Another one that I would like to address in this post is duplicates.


May 31, 2020 • Apache Airflow

Dealing with time delta in Apache Airflow

Often in batch processing we give the pipeline some time to catch up with late data, i.e. the pipeline for 9 will be executed only at 11. One way to do so in Airflow is to compute the delta in the tasks, but there is a more "native" way with TimeDeltaSensor.
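The arithmetic behind the "9 executed at 11" example can be sketched in plain Python. This is not Airflow code and not the sensor's implementation; it only assumes an hourly schedule (not stated above) and that the run waits for the end of its interval plus an extra delta. The helper names `earliest_start` and `can_run` are hypothetical.

```python
from datetime import datetime, timedelta


def earliest_start(execution_date, schedule_interval, delta):
    # The run for `execution_date` covers one schedule interval, so it can
    # start at the interval's end, plus the extra time left for late data.
    return execution_date + schedule_interval + delta


def can_run(now, execution_date, schedule_interval, delta):
    # The kind of condition a time-based sensor would keep checking.
    return now >= earliest_start(execution_date, schedule_interval, delta)


run_for_9 = datetime(2020, 5, 31, 9, 0)
start = earliest_start(run_for_9, timedelta(hours=1), timedelta(hours=1))
print(start.isoformat())  # 2020-05-31T11:00:00
```

With a 1-hour schedule and a 1-hour delta, the run for 9:00 becomes eligible at 11:00, which matches the example in the paragraph above.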


May 24, 2020 • Apache Airflow

DAG evolution - using start_date and end_date?

One of the greatest properties in data engineering is idempotency. No matter how many times you run your pipeline, you always end up with the same outcome (= 1 file, 1 new table, ...). However, this property can easily be broken when you need to evolve your pipeline. In this blog post, I will verify one possible way to manage it in Apache Airflow.
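As a quick illustration of the "same outcome" property, here is a minimal sketch in plain Python, independent of Airflow. It assumes the output location is derived only from the run date and that every run overwrites it. The function name `write_partition` and the `date=.../data.csv` layout are hypothetical choices made for the example.

```python
import os
from datetime import date


def write_partition(base_dir, run_date, rows):
    # The path depends only on the run date, so a rerun targets the same file.
    partition_dir = os.path.join(base_dir, f"date={run_date.isoformat()}")
    os.makedirs(partition_dir, exist_ok=True)
    output_path = os.path.join(partition_dir, "data.csv")
    # Mode "w" truncates the previous output instead of appending to it.
    with open(output_path, "w", encoding="utf-8") as output_file:
        for row in rows:
            output_file.write(",".join(str(value) for value in row) + "\n")
    return output_path
```

Running it twice for the same date leaves exactly one file with the same content. Appending to the file, or adding a timestamp to its name, would break the property.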


May 9, 2020 • General Big Data

Optimistic concurrency control - a little bit of theory and a little bit more examples

It has been a while since I last wrote about general distributed systems topics. That's the reason for this article, where I will focus on optimistic concurrency control.
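For readers who have never met the term, here is a minimal sketch of the idea, assuming a version-number check on write. It is an in-memory toy, not the examples from the article, and the names `VersionedStore` and `VersionConflict` are hypothetical.

```python
class VersionConflict(Exception):
    """Raised when the record changed since the writer last read it."""


class VersionedStore:
    """Toy in-memory store keeping a (value, version) pair per key."""

    def __init__(self):
        self._data = {}

    def read(self, key):
        # Unknown keys start at version 0 with no value.
        return self._data.get(key, (None, 0))

    def write(self, key, value, expected_version):
        _, current_version = self.read(key)
        # No lock is taken up front; the conflict is detected at write time.
        if current_version != expected_version:
            raise VersionConflict(
                f"{key}: expected v{expected_version}, found v{current_version}"
            )
        new_version = current_version + 1
        self._data[key] = (value, new_version)
        return new_version
```

Two writers that both read version 0 can both try to write, but only the first one succeeds. The second one gets a `VersionConflict` and has to read again and retry with the new version.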

