If you need to go back in time and analyze your past Apache Spark applications, you can use the native Apache Spark History server. However, it can also be an infrastructure problem because of the continuously increasing historical logs for streaming applications. In this blog post we'll try to understand this component and to see different configuration options.
It's difficult to see all the use cases of a framework. Back in time, when I was a backend engineer, I never succeeded to see all applications of Spring framework. Now, when I'm a data engineer, I feel the same for Apache Spark. Fortunately, the community is there to show me some outstanding features!
Message bus is a common architectural design in the Enterprise Design Patterns. But it's also present at a lower level to enable the event-driven behavior. Apache Spark is not an exception. It uses a publish/subscribe approach in various places.
I've written my first Kubernetes on Apache Spark blog post in 2018 with a try to answer the question, what Kubernetes can bring to Apache Spark? Four years later this resource manager is a mature Spark component, but a new question has arisen in my head. Should I stay on YARN or switch to Kubernetes?
It's time for the last part of the shuffle configuration overview. This time you'll see the properties related to the shuffle service, reducer, I/O, and a few others.
It's time for the 2 of 3 parts dedicated to the shuffle configuration in Apache Spark.
Probably the most popular configuration entry related to the shuffle is the number of shuffle partitions. But it's not the only one and you will see it in this new blog post series!
A good CI/CD process avoids many pitfalls related to manual operations. Apache Spark also has one based on Github Actions. Since this part of the project has been a small mystery for me, I wanted to spend some time exploring it.
I had the idea for this blog post when I was preparing the "What's new in Apache Spark..." series. At that time, I was writing about Kubernetes in the context of Apache Spark but needed to "google" a lot of things aside - mostly the Kubernetes API terms.
My Apache Spark 3.2.0 comes to its end. Today I'll focus on the miscellaneous changes, so all the improvements I couldn't categorize in the previous blog posts.
In the previous Apache Spark releases you could see many shuffle evolutions such as shuffle files tracking or pluggable storage interface. And the things don't change for 3.2.0 which comes with the push-based merge shuffle.
The idea of writing this blog post came to me when I was analyzing Kubernetes changes in Apache Spark 3.1.1. Starting from this version we can use stage level scheduling, so far available only for YARN. Even though it's probably a very low level feature, it intrigued me enough to write a few words here!
I had this "aha moment" while I was preparing the blog posts about the shuffle readers. Apache Spark uses iterators a lot! In this blog post you will see the places where I had met them the last months.
Even though nowadays RDD tends to be a low level abstraction and we should use SQL API, some of its methods are still used under-the-hood by the framework. During one of my explorations I wanted to analyze the task responsible for listing the files to process. Initially, my thought was "oh,it uses collect() and the results will be different every time". However, after looking at the code I saw how wrong I was!
It's the last part of the shuffle writers series. The picture so far composed of SortShuffleWriter and BypassMergeSortShuffleWriter, will be completed today with UnsafeShuffleWriter.
In the previous blog post we discovered the SortShuffleWriter. However, the SortShuffleManager's first choice is BypassMergeSortShuffleWriter, presented in this article.
In the beginning I thought that the mappers sent shuffle files to the reducers. After understanding that it was the opposite, I was thinking that a part of the shuffle data is kept in memory for the performance purposes... Once I corrected all these misbeliefs about shuffle, I noted a few points to explore. One of these points are shuffle writers that I will present in the next 3 blog posts.
After several months spent as an "experimental" feature in Apache Spark, Kubernetes was officially promoted to a Generally Available scheduler in the 3.1 release! In this blog post, we'll discover the last changes made before this promotion.
I have a feeling that a lot of things related to the scalability happened in the 3.1 release. General Availability of Kubernetes that I will cover next week is only one of them. The second one is the nodes decommissioning!
I wish I could say once day: "I optimized Apache Spark pipelines in all possible ways". But I'm aware of the realty and that can be very hard to achieve. That's why I decided to rely on the experience shared by experienced Spark users in Spark+AI and, recently, Data+AI Summit, and write a summary list of interesting optimization tips from the past talks.