Azure Durable Functions

When I first heard about Durable Functions, my reaction was: "So cool! We can now build fully serverless streaming stateful pipelines!". Indeed, we can, but that's not their only feature!

Stage level scheduling

The idea of writing this blog post came to me when I was analyzing Kubernetes changes in Apache Spark 3.1.1. Starting with this version, Kubernetes supports stage level scheduling, which so far was available only for YARN. Even though it's probably a very low-level feature, it intrigued me enough to write a few words here!

What's new on the cloud for data engineers - part 4 (05-08.2021)

It's time for the 4th part of the "What's new on the cloud for data engineers" series. This time I will cover the changes between May and August.

Iterators in Apache Spark

I had this "aha moment" while I was preparing the blog posts about the shuffle readers. Apache Spark uses iterators a lot! In this blog post, you will see the places where I have come across them in the last few months.

Cloud networking aspects for data engineers

Guess what topic I was afraid of at the beginning of my cloud journey as a data engineer? Networking! VPC, VPN, firewalls, ... I thought I would be able to live without the network lessons from school, but how wrong I was! IMO, as a data engineer, you should know a bit about networking since it's often related to the security part of the architectures you'll design. In this article, I'll share with you some networking points I wish I had known before starting to work on the cloud.

Shuffle reading in Apache Spark SQL - wrapping iterators and beyond

It's time for the 2nd blog post about the shuffle readers. Recently, we discovered how Apache Spark fetches the shuffle blocks from local and remote hosts. Today, I would like to present the wrapping iterators. Sounds mysterious? It won't be if we start by looking at the iterators participating in the processing of shuffle block files.

Data pipeline patterns with Azure Data Factory

Almost 2 years ago (already!), I wrote a blog post about data pipeline patterns in Apache Airflow (link in the "Read also" section). Since then I have worked with other data orchestrators. That's why I would like to repeat the same exercise but for Azure Data Factory.

Shuffle reading in Apache Spark SQL

So far I've covered the writing part of the shuffle files. You've learned about 3 different shuffle writers, but what happens with the files they generate? Who reads them, and how? Is the reading an in-memory operation? I will try to answer these and some other questions in this blog post.

Costs management on the cloud

The easiest way to learn is by doing, but what if it involves leaving your credit card number beforehand? I've never been comfortable with that, but there is no other way to get some hands-on experience on the cloud. Fortunately, it doesn't mean you can't control your expenses. In this article, we'll see how.

Apache Spark can be eagerly evaluated too - Commands

Some time ago I participated in an interesting meetup about the MERGE operation in Delta Lake (link in the Further reading section). Jacek Laskowski presented the operation internals and asked an interesting question about the difference between commands and execs. Since I didn't know the answer right away, I decided to explore the concept of commands in this blog post.

Serverless streaming processing on the cloud: Azure Stream Analytics vs AWS Kinesis Data Analytics

I wrote this blog post while preparing for Azure's DP-200 and DP-201 certifications. Why? To clear my head, organize what I learned about Azure Stream Analytics, and compare it with what I knew about AWS Kinesis Data Analytics.

Structured Streaming and Apache Kafka Schema Registry

The topic of this post comes from Luan Carvalho, who shared with me an Open Source project connecting Apache Spark to Apache Kafka Schema Registry. Initially, I wanted to focus exclusively on the project, but along the way I discovered some other interesting points.

Data architectures on the cloud

I still haven't fully understood why the story of data architectures is a story of Greek letters. With time, their context changed and they had to adapt from an on-premises environment, often sharing the same main services, to the cloud. In this blog post, I will briefly present data architectures and try to map them to cloud data services on AWS, Azure and GCP. Spoiler alert, there will be more pictures than usual!

Join hints in Apache Spark SQL

With the Adaptive Query Execution module, you may get the feeling that Apache Spark will optimize the job for you. In part, yes, because it'll be able to optimize the job based on the runtime parameters you don't necessarily know. However, you can also control the execution yourself, and hints are one of the tools for that.

Windows to the clouds

Guess what? My time-consuming learning mode based on reading the documentation paid off again! This time it was on Azure: while reading about Stream Analytics windows, I discovered that I had missed some of them in the past. And since today is the day of the cloud, I will see if the same types of windows exist on AWS and GCP streaming services. And if not, what the differences are.
