Apache Spark articles

Collect action and determinism

Even though nowadays RDD tends to be a low level abstraction and we should use SQL API, some of its methods are still used under-the-hood by the framework. During one of my explorations I wanted to analyze the task responsible for listing the files to process. Initially, my thought was "oh,it uses collect() and the results will be different every time". However, after looking at the code I saw how wrong I was!

Continue Reading →

Shuffle writers: UnsafeShuffleWriter

It's the last part of the shuffle writers series. The picture so far composed of SortShuffleWriter and BypassMergeSortShuffleWriter, will be completed today with UnsafeShuffleWriter.

Continue Reading →

Shuffle writers: BypassMergeSortShuffleWriter

In the previous blog post we discovered the SortShuffleWriter. However, the SortShuffleManager's first choice is BypassMergeSortShuffleWriter, presented in this article.

Continue Reading →

Shuffle writers: SortShuffleWriter

In the beginning I thought that the mappers sent shuffle files to the reducers. After understanding that it was the opposite, I was thinking that a part of the shuffle data is kept in memory for the performance purposes... Once I corrected all these misbeliefs about shuffle, I noted a few points to explore. One of these points are shuffle writers that I will present in the next 3 blog posts.

Continue Reading →

What's new in Apache Spark 3.1 - Kubernetes Generally Available!

After several months spent as an "experimental" feature in Apache Spark, Kubernetes was officially promoted to a Generally Available scheduler in the 3.1 release! In this blog post, we'll discover the last changes made before this promotion.

Continue Reading →

What's new in Apache Spark 3.1 - nodes decommissioning

I have a feeling that a lot of things related to the scalability happened in the 3.1 release. General Availability of Kubernetes that I will cover next week is only one of them. The second one is the nodes decommissioning!

Continue Reading →

Performance optimization lessons from Spark+AI and Data+AI Summits

I wish I could say once day: "I optimized Apache Spark pipelines in all possible ways". But I'm aware of the realty and that can be very hard to achieve. That's why I decided to rely on the experience shared by experienced Spark users in Spark+AI and, recently, Data+AI Summit, and write a summary list of interesting optimization tips from the past talks.

Continue Reading →

Apache Spark and shuffle management - external services

Shuffle accompanies distributed data processing from the very beginning. Apache Spark is not an exception, and one of the prominent features targeted for 3.1 release is the full support for the pluggable shuffle backend. But it's not the single effort made these days by the community to handle shuffle drawbacks. And you will see it in this blog post.

Continue Reading →

Shuffle in Apache Spark, back to the basics

If you are a newcomer in the distributed world, someone certainly told you that shuffle is bad and will slow down your processing. But what does it mean? What happens when this infamous shuffle exists in your code? In this article you should find some answers for the shuffle in Apache Spark.

Continue Reading →

What's new in Apache Spark 3.0 - Kubernetes

I believe Kubernetes is the next big step in the framework after proposing Catalyst Optimizer, modernizing streaming processing with Structured Streaming, and introducing Adaptive Query Execution. Especially that Apache Spark 3 brings a lot of changes in this part!

Continue Reading →

What's new in Apache Spark 3.0 - GPU-aware scheduling

GPU-awareness was one of the topics I postponed the most in my Apache Spark 3.0 exploration. But its time has come and in this blog post you will discover what changed in the version 3 of the framework regarding the GPU-based computation.

Continue Reading →

What's new in Apache Spark 3.0 - UI changes

Apart from data processing-related changes, Apache Spark 3.0 also brings some changes at the UI level. The interface is supposed to be more intuitive and should help you understand processing logic better!

Continue Reading →

What's new in Apache Spark 3.0 - shuffle service changes

One of Apache Spark's components making it hard to scale is shuffle. Fortunately, the community is on a good way to overcome this limitation and the new release of the framework brings some important improvements on this field.

Continue Reading →

Docker images and Apache Spark applications

Containers are with us, data engineers, for several years. The concept was already introduced on YARN but the technology that really made them popular was Docker. In this post I will focus on its recommended practices to make our Apache Spark images better.

Continue Reading →

Setting up Apache Spark on Kubernetes with microk8s

When I discovered microk8s I was delighted! An easy installation in very few steps and you can start to play with Kubernetes locally (tried on Ubuntu 16). However, running Apache Spark 2.4.4 on top of microk8s is not an easy piece of cake. In this post I will show you 4 different problems you may encounter, and propose possible solutions.

Continue Reading →

Troubleshooting 'System memory must be at least' error

When the unit tests work on "your machine" but fail on your colleague's, you know you did something wrong. When the failures are not about test assertions but technical reasons, the "something wrong" transforms into "something strange". And it may happen with Apache Spark as well.

Continue Reading →

Validating JSON with Apache Spark and Cerberus

In one of recent Meetups I heard that one of the most difficult data engineering tasks is ensuring good data quality. I'm more than agree with that statement and that's the reason why in this post I will share one of solutions to detect data issues with PySpark (my first PySpark code !) and Python library called Cerberus.

Continue Reading →

FAIR jobs scheduling in Apache Spark

During my exploration of Apache Spark configuration options, I found an entry called spark.scheduler.mode. After looking for its possible values, I ended up with a pretty intriguing concept called FAIR scheduling that I will detail in this post.

Continue Reading →

Bzip2 compression in Apache Spark

Compression has a lot of benefits in the data context. It reduces the size of stored data, so you will save some space and also have less data to transfer across the network in the case of a data processing pipeline. And if you use Bzip2, you can process the compressed data in parallel. In this post, I will try to explain how does it happen.

Continue Reading →

Memory and Apache Spark classes

In previous posts about memory in Apache Spark, I've been exploring memory behavior of Apache Spark when the input files are much bigger than the allocated memory. After that it's a good moment to sum up that in the post dedicated to classes involved in memory using tasks.

Continue Reading →