Shuffle is for me a never-ending story. Last year I spent long weeks analyzing the readers and writers and was hoping for some rest in 2022. However, it didn't happen. My recent PySpark investigation led me to the shuffle.py file and my first reaction was "Oh, so PySpark has its own shuffle mechanism?". Let's check this out!
The topic of this blog post is not new because the discussed sort algorithms are there from Apache Spark 2. But it happens that I've never had a chance to present them and today I'll try to do it now.
It's time for the 2 of 3 parts dedicated to the shuffle configuration in Apache Spark.
Probably the most popular configuration entry related to the shuffle is the number of shuffle partitions. But it's not the only one and you will see it in this new blog post series!
It's time for the 2nd blog post about the shuffle readers. Recently, we discovered how Apache Spark fetches the shuffle blocks from local and remote hosts. Today, I would like to share with you the wrapping iterators. Sounds mysterious? It won't be if we start by looking at the iterators participating in the processing of shuffle block files.
So far I've covered the writing part of the shuffle files. You've learned about 3 different shuffle writers, but what happens with their generated files? Who and how reads them? Is the reading an in-memory operation? I will try to answer this and some other questions in this blog post.
It's the last part of the shuffle writers series. The picture so far composed of SortShuffleWriter and BypassMergeSortShuffleWriter, will be completed today with UnsafeShuffleWriter.
In the previous blog post we discovered the SortShuffleWriter. However, the SortShuffleManager's first choice is BypassMergeSortShuffleWriter, presented in this article.
In the beginning I thought that the mappers sent shuffle files to the reducers. After understanding that it was the opposite, I was thinking that a part of the shuffle data is kept in memory for the performance purposes... Once I corrected all these misbeliefs about shuffle, I noted a few points to explore. One of these points are shuffle writers that I will present in the next 3 blog posts.
If you are a newcomer in the distributed world, someone certainly told you that shuffle is bad and will slow down your processing. But what does it mean? What happens when this infamous shuffle exists in your code? In this article you should find some answers for the shuffle in Apache Spark.
One of Apache Spark's components making it hard to scale is shuffle. Fortunately, the community is on a good way to overcome this limitation and the new release of the framework brings some important improvements on this field.
To scale Spark applications automatically we need to enable dynamic resource allocation. But to make it work we need another feature called external shuffle service that will be covered here.
Shuffle consists on moving data with the same key to the one executor in order to execute some specific processing on it. We could think that it concerns only *ByKey operations but it's not necessarily true.
This small post is the complement for previous article describing big lines of shuffle. It focuses more in details on writing part.
As already told in one of previous posts about Spark, shuffle is a process which moves data between nodes. It's orchestrated by a specific manager and it will be the topic of this post.