My journey to Azure Data Engineer Associate

I'm happy to have completed my quest for data engineering certifications across the 3 major cloud providers. Last year I became AWS Big Data certified, in January a GCP Data Engineer, and more recently I passed DP-200 and DP-201 to become an Azure Data Engineer Associate. Although DP-203 will soon replace these 2 exams, I hope this article will help you prepare for it!

Continue Reading →

Checkpoint file manager - FileSystem and FileContext

If you read my blog, you've certainly noticed that I often get lost on the internet. Fortunately, that often helps me write blog posts. But the internet is not the only place where I get lost. It also happens with Apache Spark code, and one of my most recent confusions was about the FileSystem and FileContext classes.
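
To give you a first taste of the confusion, here is a minimal sketch (mine, not from the post) of reaching both classes from a PySpark shell through the py4j gateway; it assumes an existing `spark` session and only shows the two entry points:

```python
# Both classes come from Hadoop, so in PySpark we go through the JVM bridge.
conf = spark._jsc.hadoopConfiguration()

# FileSystem: the older, per-scheme API with instance caching.
fs = spark._jvm.org.apache.hadoop.fs.FileSystem.get(conf)

# FileContext: the newer facade, created per call and delegating to an
# AbstractFileSystem implementation.
fc = spark._jvm.org.apache.hadoop.fs.FileContext.getFileContext(conf)

print(fs.getUri())
print(fc.getDefaultFileSystem().getUri())
```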

Continue Reading →

Serverless streaming on AWS - an overview

If you've already worked on AWS and tried to implement streaming applications, you've certainly noticed one thing: there is no single way to do it! And if you haven't noticed it yet, I hope this blog post will convince you and, along the way, give you a better understanding of the available solutions.

Continue Reading →

What's new in Apache Spark 3.1.1 - new built-in functions

Every Apache Spark release brings not only completely new components but also new native functions. The 3.1.1 release is no exception, and it too comes with some new built-in functions!
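
I won't spoil the whole list here, but as a single illustration, here is a sketch using regexp_extract_all, which is, as far as I can tell, one of the SQL functions added in this release; it returns all the matches of a pattern instead of only the first one:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.master("local[*]").getOrCreate()

# Extract the first capturing group of every match, not just of the first one.
spark.sql(r"""
SELECT regexp_extract_all('100-200, 300-400', '(\\d+)-(\\d+)', 1) AS lower_bounds
""").show(truncate=False)
# Expected: [100, 300]
```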

Continue Reading →

What's new on the cloud for data engineers - part 3 (02-04.2021)

It's time for the 3rd part of the "What's new on the cloud for data engineers" series. This time I will cover the changes made between February and April 2021.

Continue Reading →

What's new in Apache Spark 3.1 - JDBC (WIP) and DataSource V2 API

Even though the change I describe in this blog post is still in progress, it's worth your attention, especially since I missed the DataSource V2 evolution in my previous blog posts.
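
As a strongly hedged sketch of where this is heading (the feature was work-in-progress, so the class and option names below come from the Spark source at the time and may still change), the JDBC V2 path is exposed through a catalog rather than the classical spark.read.jdbc:

```python
from pyspark.sql import SparkSession

# Hypothetical catalog name "mydb" and a placeholder PostgreSQL instance.
spark = (SparkSession.builder
    .config("spark.sql.catalog.mydb",
            "org.apache.spark.sql.execution.datasources.v2.jdbc.JDBCTableCatalog")
    .config("spark.sql.catalog.mydb.url", "jdbc:postgresql://localhost:5432/test")
    .config("spark.sql.catalog.mydb.driver", "org.postgresql.Driver")
    .getOrCreate())

# Tables become addressable with the catalog.schema.table syntax:
spark.sql("SELECT * FROM mydb.public.users").show()
```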

Continue Reading →

Make your data disappear on the cloud

Even though storage is cheap and virtually unlimited, that doesn't mean we have to keep all the data forever. To deal with this lifecycle requirement, we can either write a pipeline that removes obsolete records or rely on the data management features of cloud services. I propose a short overview of them in this blog post.
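
To make the cloud-managed option concrete, here is a minimal sketch of an S3 lifecycle rule set with boto3; the bucket and prefix are made up and 30 days is an arbitrary retention period:

```python
import boto3

s3 = boto3.client("s3")
# Expire every object under the raw/events/ prefix 30 days after its creation.
s3.put_bucket_lifecycle_configuration(
    Bucket="my-analytics-bucket",
    LifecycleConfiguration={
        "Rules": [{
            "ID": "expire-raw-events",
            "Filter": {"Prefix": "raw/events/"},
            "Status": "Enabled",
            "Expiration": {"Days": 30},
        }]
    },
)
```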

Continue Reading →

What's new in Apache Spark 3.1 - Kubernetes Generally Available!

After several months spent as an "experimental" feature in Apache Spark, Kubernetes was officially promoted to a Generally Available scheduler in the 3.1 release! In this blog post, we'll discover the last changes made before this promotion.
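
If you want to see what targeting the scheduler looks like, here is a minimal client-mode sketch from PySpark; the API server URL, image and namespace are placeholders, and a real setup also needs the right service account and, usually, to run from inside the cluster:

```python
from pyspark.sql import SparkSession

spark = (SparkSession.builder
    .master("k8s://https://kubernetes.default.svc:443")
    .config("spark.kubernetes.container.image", "my-registry/spark-py:3.1.1")
    .config("spark.kubernetes.namespace", "spark-jobs")
    .config("spark.executor.instances", "2")
    .getOrCreate())

spark.range(10).count()  # executors run as pods scheduled by Kubernetes
```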

Continue Reading →

GCP Dataflow by an Apache Spark guy

Some months ago I wrote a blog post presenting BigQuery from the perspective of an Apache Spark user. Today I will do the same exercise, but this time for a framework in the same category as Apache Spark: data processing. In other words, I will try to understand GCP Dataflow thanks to my Apache Spark knowledge!
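
To set the scene, Dataflow executes Apache Beam pipelines, so the mental mapping starts with Beam's vocabulary; roughly, a PCollection plays the role of an RDD/DataFrame and ParDo/Map the role of Spark's map-like transformations. A toy word count, runnable locally on the DirectRunner:

```python
import apache_beam as beam

with beam.Pipeline() as p:
    (p
     | beam.Create(["a b", "a"])          # ~ spark.sparkContext.parallelize
     | beam.FlatMap(str.split)            # ~ rdd.flatMap
     | beam.combiners.Count.PerElement()  # ~ rdd.map((_, 1)).reduceByKey(_ + _)
     | beam.Map(print))
```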

Continue Reading →

What's new in Apache Spark 3.1 - nodes decommissioning

I have a feeling that a lot of scalability-related things happened in the 3.1 release. The General Availability of Kubernetes, which I will cover next week, is only one of them. The second is node decommissioning!
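
As a teaser, enabling the feature boils down to a few configuration entries; the names below are the ones I found in the 3.1 configuration, so double-check them against your version:

```python
from pyspark.sql import SparkSession

spark = (SparkSession.builder
    # Ask executors on a decommissioning node to exit gracefully...
    .config("spark.decommission.enabled", "true")
    # ...and migrate their cached RDD blocks and shuffle files elsewhere first.
    .config("spark.storage.decommission.enabled", "true")
    .config("spark.storage.decommission.rddBlocks.enabled", "true")
    .config("spark.storage.decommission.shuffleBlocks.enabled", "true")
    .getOrCreate())
```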

Continue Reading →

Right to be forgotten patterns: vertical partitioning

In my previous post I shared an approach called crypto-shredding that can serve as a solution for the GDPR "right to be forgotten" requirement. One of its drawbacks was performance degradation due to the need to fetch and decrypt every sensitive value. To overcome it, I first thought about a cache, but I ended up understanding that the answer is not a cache but something else! I will explain it in this blog post.
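
Since the title already gives the pattern away, here is a toy sketch of the idea, with plain Python dictionaries standing in for real storage: the personal attributes live in their own partition, so forgetting a user touches one place instead of every dataset:

```python
# Vertical partitioning: PII on one side, anonymous facts on the other.
pii_store = {"user-123": {"email": "jane@example.com", "name": "Jane"}}
events = [{"user_id": "user-123", "page": "/checkout", "ts": "2021-04-02T10:00:00"}]

def forget(user_id: str) -> None:
    # One delete in the PII partition; the events keep only an opaque key
    # and never need to be rewritten.
    pii_store.pop(user_id, None)

forget("user-123")
assert "user-123" not in pii_store and len(events) == 1
```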

Continue Reading →

What's new in Apache Spark 3.1 - predicate pushdown for JSON, CSV and Apache Avro

Predicate pushdown is a data processing technique that takes user-defined filters and executes them while reading the data. Apache Spark already supported it for Apache Parquet and RDBMS sources. Starting from Apache Spark 3.1.1, you can also use it for the Apache Avro, JSON and CSV formats!
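
To see whether the pushdown applies to your query, a quick check with explain is enough; the path and schema below are placeholders:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.master("local[*]").getOrCreate()

df = (spark.read.schema("id INT, amount DOUBLE")
      .json("/tmp/orders.json")
      .filter("amount > 100.0"))

# When the pushdown applies, the predicate appears in the scan node as
# PushedFilters instead of only in a post-scan Filter.
df.explain(True)
```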

Continue Reading →

Right to be forgotten patterns: crypto-shredding

Thanks to the most recent data regulation policies, we can ask a service to delete our personal data. Even though that seems relatively easy in a Small Data context, it's a bit more challenging for Big Data systems. Fortunately - subject to your legal department's approval - there is a smart solution to that problem called crypto-shredding.
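
The gist fits in a few lines; here is a toy sketch with the Python cryptography package, with one key per user and an in-memory dictionary standing in for a real key store:

```python
from cryptography.fernet import Fernet

keys = {"user-123": Fernet.generate_key()}

def encrypt(user_id: str, value: str) -> bytes:
    return Fernet(keys[user_id]).encrypt(value.encode())

stored = encrypt("user-123", "jane@example.com")

# "Right to be forgotten" = destroying the key, not scanning the datasets:
del keys["user-123"]
# From now on, `stored` is irrecoverable noise, wherever it was replicated.
```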

Continue Reading →

What's new in Apache Spark 3.1 - Project Zen

I mentioned it briefly in my first ever blog post about PySpark. Thanks to the Project Zen initiative, the Python part of Apache Spark will become more Pythonic and user-friendly. How? Let's check that in this blog post!
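
One deliverable I can already illustrate is the type hints shipped with PySpark 3.1 (previously the separate pyspark-stubs project, if I recall correctly); annotations like the ones below are now understood by mypy and IDE autocompletion. The function itself is just a made-up example:

```python
from pyspark.sql import DataFrame, SparkSession
from pyspark.sql import functions as F

def add_ingestion_date(df: DataFrame) -> DataFrame:
    # The DataFrame annotations are resolved from PySpark's bundled type hints.
    return df.withColumn("ingestion_date", F.current_date())

spark = SparkSession.builder.master("local[*]").getOrCreate()
add_ingestion_date(spark.range(3)).printSchema()
```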

Continue Reading →

ML for data engineers - what I learned when preparing GCP Data Engineer certification

I wrote this blog post a week before passing the GCP Data Engineer exam, hoping it would help me organize a few things in my head (it did!). I hope it'll help you too to understand ML from a data engineering perspective!

Continue Reading →