Object stores on the cloud

Object stores are the next step of my multi-cloud exploration. In this article I will look for similarities between S3, Storage Account and GCS.

Shuffle writers: BypassMergeSortShuffleWriter

In the previous blog post we discovered the SortShuffleWriter. However, the SortShuffleManager's first choice is BypassMergeSortShuffleWriter, presented in this article.

Dead-letter pattern on the cloud

Data is not always as clean as we would like it to be. This is even more true for semi-structured formats like JSON, where we feel like we are working with a structure but, unfortunately, it's not enforced. Hence, from time to time, our code can fail unexpectedly. To handle this problem, as for many others, there is a pattern. It's called dead-letter and I will describe it below in the context of cloud services.
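
To make the idea concrete, here is a minimal, cloud-agnostic sketch of the pattern in plain Python. The function name `route_records`, the required fields `id` and `value`, and the in-memory lists are all hypothetical and chosen only for illustration; on a real cloud setup the dead-letter destination would typically be a queue, a topic or an object store location rather than a list.

```python
import json


def route_records(raw_records, required_fields=("id", "value")):
    """Split raw JSON strings into valid records and dead letters.

    Hypothetical helper: the field names and the in-memory "dead-letter
    store" are assumptions made for this sketch only.
    """
    valid, dead_letters = [], []
    for raw in raw_records:
        try:
            record = json.loads(raw)
            if not isinstance(record, dict):
                raise ValueError("expected a JSON object")
            missing = [field for field in required_fields if field not in record]
            if missing:
                raise ValueError(f"missing fields: {missing}")
            valid.append(record)
        except ValueError as error:
            # json.JSONDecodeError is a subclass of ValueError, so both
            # malformed JSON and failed validation end up here. The raw
            # payload is kept next to the error so it can be replayed later.
            dead_letters.append({"raw": raw, "error": str(error)})
    return valid, dead_letters


if __name__ == "__main__":
    records = ['{"id": 1, "value": "a"}', '{"id": 2', '{"value": "b"}']
    ok, failed = route_records(records)
    print(len(ok), "valid,", len(failed), "dead-lettered")
```

The important design choice is that a bad record never stops the processing of the good ones: it is set aside together with the reason for the failure, so that it can be inspected and reprocessed once the problem is understood.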

Shuffle writers: SortShuffleWriter

In the beginning I thought that the mappers sent shuffle files to the reducers. After understanding that it was the opposite, I thought that a part of the shuffle data was kept in memory for performance purposes... Once I corrected all these misbeliefs about shuffle, I noted a few points to explore. One of these points is the shuffle writers, which I will present in the next 3 blog posts.

Azure Data Factory control flows in Apache Airflow

How do you orchestrate your data pipelines on the cloud? Often, you will be able to use managed Open Source tools like Cloud Composer on GCP or Amazon Managed Workflows for Apache Airflow on AWS. Sometimes, you will need to use a cloud-specific service instead, as on Azure with its Data Factory orchestrator. Is it complicated to create Data Factory pipelines with Apache Airflow knowledge? We'll see that in this blog post.

Under-the-hood: repartition

Previously we discovered what happens when you coalesce a dataset. As a reminder, it doesn't involve a shuffle operation. It's therefore the opposite of the repartition operation, which is a first-class shuffle citizen.

Streaming data sources on the cloud

A streaming broker is one of the most common entry points for modern data systems. Since these systems run on the cloud, and one of my goals for this year is to acquire a multi-cloud vision, it's time to see what AWS, Azure and GCP offer in this field!

State store metrics

The state store is a critical part of any stateful Structured Streaming application. It's important to know what happens when your business logic and input data interact with it. State store metrics will give you some key insights into this interaction. If you don't know them yet, no worries, they're the topic of this blog post!

My journey to Azure Data Engineer Associate

I'm happy to complete my quest for a data engineering certification on each of the 3 major cloud providers. Last year I became AWS Big Data certified, in January a GCP Data Engineer, and more recently, I passed DP-200 and DP-201 and became an Azure Data Engineer Associate. Although DP-203 will soon replace the 2 exams, I hope this article will help you prepare for it!

Checkpoint file manager - FileSystem and FileContext

If you read my blog posts, you have certainly noticed that I very often get lost on the internet. Fortunately, it very often helps me write blog posts. But the internet is not the only place where I can get lost. It also happens to me in the Apache Spark code, and one of my most recent confusions was about the FileSystem and FileContext classes.

Serverless streaming on AWS - an overview

If you have already worked on AWS and tried to implement streaming applications, you have certainly noticed one thing. There is no single way to do it! And if you didn't notice that, I hope that this blog post will convince you and, by the way, help you get a better understanding of the available solutions.

What's new in Apache Spark 3.1.1 - new built-in functions

Every Apache Spark release brings not only completely new components but also new native functions. Version 3.1.1 is no exception and it also comes with some new built-in functions!

What's new on the cloud for data engineers - part 3 (02-04.2021)

It's time for the 3rd part of the "What's new on the cloud for data engineers" series. This time I will cover the changes between February and April.

What's new in Apache Spark 3.1 - JDBC (WIP) and DataSource V2 API

Even though the change I will describe in this blog post is still in progress, it's worth attention, especially since I missed the DataSource V2 evolution in my previous blog posts.

Make your data disappear on the cloud

Even though storage is cheap and virtually unlimited, it doesn't mean we have to store all the data all the time. To deal with this lifecycle requirement, we can either write a pipeline that will remove obsolete records or rely on the data management features offered by cloud services. I propose a short overview of them in this blog post.
