Data processing articles

Looking for something else? Check the categories of Data processing:

Apache Beam Apache Spark Apache Spark GraphFrames Apache Spark GraphX Apache Spark SQL Apache Spark Streaming Apache Spark Structured Streaming PySpark

If not, below you can find all articles belonging to Data processing.

What's new in Apache Spark 3.2.0 - Data Source V2

Even though Data Source V2 is present in the API for a while, every release brings something new to it. This time too and we'll see what through this blog post!

Continue Reading →

What's new in Apache Spark 3.2.0 - push-based shuffle

In the previous Apache Spark releases you could see many shuffle evolutions such as shuffle files tracking or pluggable storage interface. And the things don't change for 3.2.0 which comes with the push-based merge shuffle.

Continue Reading →

What's new in Apache Spark 3.2.0 - SQL changes

Apache Spark SQL evolves and with each new release, it gets closer to the ANSI standard. The 3.2.0 release is not different and you can find many ANSI-related changes. But not only and hopefully, you'll discover all this in this blog post which has an unusual form because this time, I won't focus on the implementation details.

Continue Reading →

What's new in Apache Spark 3.2.0 - Structured Streaming

After previous blog posts focusing on 2 specific Structured Streaming features, it's time to complete them with a list of other changes made in the 3.2.0 version!

Continue Reading →

What's new in Apache Spark 3.2.0 - session windows

Initially I wanted to include the session windows in the blog post about Structured Streaming changes. But I changed my mind when I saw how many things it involves!

Continue Reading →

What's new in Apache Spark 3.2.0 - RocksDB state store

It's big news for Apache Spark Structured Streaming users. RocksDB is now available as a Vanilla Spark-backed state store backend!

Continue Reading →

Stage level scheduling

The idea of writing this blog post came to me when I was analyzing Kubernetes changes in Apache Spark 3.1.1. Starting from this version we can use stage level scheduling, so far available only for YARN. Even though it's probably a very low level feature, it intrigued me enough to write a few words here!

Continue Reading →

Iterators in Apache Spark

I had this "aha moment" while I was preparing the blog posts about the shuffle readers. Apache Spark uses iterators a lot! In this blog post you will see the places where I had met them the last months.

Continue Reading →

Shuffle reading in Apache Spark SQL - wrapping iterators and beyond

It's time for the 2nd blog post about the shuffle readers. Recently, we discovered how Apache Spark fetches the shuffle blocks from local and remote hosts. Today, I would like to share with you the wrapping iterators. Sounds mysterious? It won't be if we start by looking at the iterators participating in the processing of shuffle block files.

Continue Reading →

Shuffle reading in Apache Spark SQL

So far I've covered the writing part of the shuffle files. You've learned about 3 different shuffle writers, but what happens with their generated files? Who and how reads them? Is the reading an in-memory operation? I will try to answer this and some other questions in this blog post.

Continue Reading →

Apache Spark can be eagerly evaluated too - Commands

Some time ago I participated in an interesting meetup about the MERGE operation in Delta Lake (link in the Further reading section). Jacek Laskowski presented the operation internals and asked an interesting question about the difference between commands and execs. Since I didn't know the answer right away, I decided to explore the commands concepts in this blog post.

Continue Reading →

Structured Streaming and Apache Kafka Schema Registry

The topic of this post brought Luan Carvalho who shared with me an Open Source project connecting Apache Spark to Apache Kafka Schema Registry. Initially, I wanted to exclusively focus on the project but on my way I discovered some other interesting points.

Continue Reading →

Join hints in Apache Spark SQL

With the Adaptive Query Execution module, you can have a feeling that Apache Spark will optimize the job for you. In part, yes, because it'll be able to optimize the job based on the runtime parameters you don't necessarily know. However, you also can master the execution, and ones of these mastery tools are hints.

Continue Reading →

Collect action and determinism

Even though nowadays RDD tends to be a low level abstraction and we should use SQL API, some of its methods are still used under-the-hood by the framework. During one of my explorations I wanted to analyze the task responsible for listing the files to process. Initially, my thought was "oh,it uses collect() and the results will be different every time". However, after looking at the code I saw how wrong I was!

Continue Reading →

Arbitrary stateful processing: update and put dependency

At first glance, the update operation in an arbitrary stateful application looks just like another map's put function. However, it has an impact on what happens later with the state store. In this blog post, you will see an example that can eventually help you to reduce an I/O pressure of the updates.

Continue Reading →

Does maxOffsetsPerTrigger guarantee idempotent processing?

If you've used Apache Kafka source in Structured Streaming, you undoubtedly noticed a property called maxOffsetsPerTrigger. According to the documentation, it helps to "limit on maximum number of offsets processed per trigger interval". My initial reaction to this property was, "Cool! We can enforce idempotent processing". I was not wrong, but the blog post will show you that I wasn't entirely right either!

Continue Reading →

Apache Kafka transactional writer with foreach sink, is it possible?

Even though Apache Kafka supports transactional producers, they're not present in Apache Spark Kafka sink. But despite that, is it possible to implement a transactional producer in Apache Spark Structured Streaming? You should see that at the end of this article.

Continue Reading →

Shuffle writers: UnsafeShuffleWriter

It's the last part of the shuffle writers series. The picture so far composed of SortShuffleWriter and BypassMergeSortShuffleWriter, will be completed today with UnsafeShuffleWriter.

Continue Reading →

Shuffle writers: BypassMergeSortShuffleWriter

In the previous blog post we discovered the SortShuffleWriter. However, the SortShuffleManager's first choice is BypassMergeSortShuffleWriter, presented in this article.

Continue Reading →

Shuffle writers: SortShuffleWriter

In the beginning I thought that the mappers sent shuffle files to the reducers. After understanding that it was the opposite, I was thinking that a part of the shuffle data is kept in memory for the performance purposes... Once I corrected all these misbeliefs about shuffle, I noted a few points to explore. One of these points are shuffle writers that I will present in the next 3 blog posts.

Continue Reading →