Spark shuffle articles

Shuffle in Apache Spark, back to the basics

If you are new to distributed computing, someone has certainly told you that shuffle is bad and will slow down your processing. But what does that mean? What happens when this infamous shuffle appears in your code? This article answers those questions for shuffle in Apache Spark.


What's new in Apache Spark 3.0 - shuffle service changes

Shuffle is one of the components that makes Apache Spark hard to scale. Fortunately, the community is well on its way to overcoming this limitation, and the 3.0 release brings some important improvements in this area.


External shuffle service in Apache Spark

To scale Spark applications automatically, we need to enable dynamic resource allocation. To make it work, though, we need another feature called the external shuffle service, which is covered here.


Shuffle join in Spark SQL

Shuffle consists of moving data with the same key to the same executor in order to run some specific processing on it. We might think that it concerns only *ByKey operations, but that is not necessarily true.


Spark shuffle - complementary notes

This short post complements the previous article, which described shuffle in broad strokes. It looks in more detail at the writing part.


Shuffling in Spark

As mentioned in one of the previous posts about Spark, shuffle is a process that moves data between nodes. It is orchestrated by a specific manager, which is the topic of this post.
