Spark shuffle articles

December 3, 2022 • PySpark

Shuffle in PySpark

Shuffle is for me a never-ending story. Last year I spent long weeks analyzing the readers and writers and was hoping for some rest in 2022. However, it didn't happen. My recent PySpark investigation led me to the shuffle.py file and my first reaction was "Oh, so PySpark has its own shuffle mechanism?". Let's check this out!

Continue Reading →

June 18, 2022 • Apache Spark SQL

Radix and Tim sort

The topic of this blog post is not new because the discussed sort algorithms are there from Apache Spark 2. But it happens that I've never had a chance to present them and today I'll try to do it now.

Continue Reading →

March 19, 2022 • Apache Spark

Shuffle configuration demystified - part 2

It's time for the 2 of 3 parts dedicated to the shuffle configuration in Apache Spark.

Continue Reading →

March 12, 2022 • Apache Spark

Shuffle configuration demystified - part 1

Probably the most popular configuration entry related to the shuffle is the number of shuffle partitions. But it's not the only one and you will see it in this new blog post series!

Continue Reading →

August 21, 2021 • Apache Spark SQL

Shuffle reading in Apache Spark SQL - wrapping iterators and beyond

It's time for the 2nd blog post about the shuffle readers. Recently, we discovered how Apache Spark fetches the shuffle blocks from local and remote hosts. Today, I would like to share with you the wrapping iterators. Sounds mysterious? It won't be if we start by looking at the iterators participating in the processing of shuffle block files.

Continue Reading →

August 14, 2021 • Apache Spark SQL

Shuffle reading in Apache Spark SQL

So far I've covered the writing part of the shuffle files. You've learned about 3 different shuffle writers, but what happens with their generated files? Who and how reads them? Is the reading an in-memory operation? I will try to answer this and some other questions in this blog post.

Continue Reading →

June 19, 2021 • Apache Spark

Shuffle writers: UnsafeShuffleWriter

It's the last part of the shuffle writers series. The picture so far composed of SortShuffleWriter and BypassMergeSortShuffleWriter, will be completed today with UnsafeShuffleWriter.

Continue Reading →

June 12, 2021 • Apache Spark

Shuffle writers: BypassMergeSortShuffleWriter

In the previous blog post we discovered the SortShuffleWriter. However, the SortShuffleManager's first choice is BypassMergeSortShuffleWriter, presented in this article.

Continue Reading →

June 5, 2021 • Apache Spark

Shuffle writers: SortShuffleWriter

In the beginning I thought that the mappers sent shuffle files to the reducers. After understanding that it was the opposite, I was thinking that a part of the shuffle data is kept in memory for the performance purposes... Once I corrected all these misbeliefs about shuffle, I noted a few points to explore. One of these points are shuffle writers that I will present in the next 3 blog posts.

Continue Reading →

December 5, 2020 • Apache Spark

Shuffle in Apache Spark, back to the basics

If you are a newcomer in the distributed world, someone certainly told you that shuffle is bad and will slow down your processing. But what does it mean? What happens when this infamous shuffle exists in your code? In this article you should find some answers for the shuffle in Apache Spark.

Continue Reading →

August 1, 2020 • Apache Spark

What's new in Apache Spark 3.0 - shuffle service changes

One of Apache Spark's components making it hard to scale is shuffle. Fortunately, the community is on a good way to overcome this limitation and the new release of the framework brings some important improvements on this field.

Continue Reading →

July 7, 2018 • Apache Spark

External shuffle service in Apache Spark

To scale Spark applications automatically we need to enable dynamic resource allocation. But to make it work we need another feature called external shuffle service that will be covered here.

Continue Reading →

August 12, 2017 • Apache Spark SQL

Shuffle join in Spark SQL

Shuffle consists on moving data with the same key to the one executor in order to execute some specific processing on it. We could think that it concerns only *ByKey operations but it's not necessarily true.

Continue Reading →

February 19, 2017 • Apache Spark

Spark shuffle - complementary notes

This small post is the complement for previous article describing big lines of shuffle. It focuses more in details on writing part.

Continue Reading →

February 4, 2017 • Apache Spark

Shuffling in Spark

As already told in one of previous posts about Spark, shuffle is a process which moves data between nodes. It's orchestrated by a specific manager and it will be the topic of this post.

Continue Reading →

1