YARN or Kubernetes for Apache Spark?

Versions: Apache Spark 3.3.0

I've written my first Kubernetes on Apache Spark blog post in 2018 with a try to answer the question, what Kubernetes can bring to Apache Spark? Four years later this resource manager is a mature Spark component, but a new question has arisen in my head. Should I stay on YARN or switch to Kubernetes?

New ebook 🔥

Learn 84 ways to solve common data engineering problems with cloud services.

👉 I want my Early Access edition

Spoiler alert, in the blog post you won't find a clear recommendation because it always depends on the context. You might have a team familiar with YARN and not feeling the positive impact of switching to Kubernetes. On the other hand, you might be supported by a DevOps team experienced with Kubernetes and making them work on YARN would kill their productivity. So instead of answering the question directly I'm going to say "It depends", and invite you to compare YARN and Kubernetes features, so that you can make this decision on your own!

Technology

At a high level YARN and Kubernetes use the same concept to run Apache Spark jobs, the containers (although, Kubernetes hide them behind pods that can run one or multiple containers, but overly simplified, let's consider them as same). It's the result of the resource allocating executing the Apache Spark code. But it couldn't live alone. In both technologies they're a part of a more sophisticated infrastructure.

Apache Spark on YARN relies on 2 YARN components called the Resource Manager and Node Manager. The former is responsible for allocating the submitted job resources on the cluster nodes. Its responsibility doesn't stop with the allocation, though! Each node of the cluster has a Node Manager that monitors the activity of the created containers and sends this information to the Resource Manager. To close the communication loop, there is an Application Master component to communicate with the Resource Manager directly about the compute resources initialization. How does it relate toKubernetes? Quite similar actually!

The Kubernetes workflow starts with the driver pod creation. After, the driver asks Kubernetes to schedule executor pods and once the allocation is finished, it starts submitting the tasks for execution.

The workflow is not the single similarity for both solutions. Another one that might be the most impacting (I bet you know the errors like Container killed by YARN for exceeding memory limits?!) is the memory allocation. Apache Spark is not the single memory consumer because both, YARN and Kubernetes, require a definition of the memory overhead. This overhead represents the memory part used by the OS or the off-heap storage. It defaults to 10% meaning that 10% of the executor memory will be taken for not-job related activity.

Scalability

An Apache Spark job can scale at 2 different levels, locally within the job and externally, on the cluster. The former approach uses the existing compute power to launch more tasks simultaneously. The feature you're looking for here is the Dynamic Resource Allocation and despite this apparent simplicity [after all, it's "just" a conf], it has some difficult aspects from a resource manager standpoint. The shuffle files. Any downscaling action shouldn't lose the executors storing the shuffle data needed in the job.

To mitigate that risk, YARN relies on an external shuffle service which is a process running on other nodes of the cluster. Kubernetes doesn't support this service yet but it does support dynamic allocation. Instead of keeping the shuffle files alive with the external service, it relies on the shuffle files tracking feature, so that the executors storing the shuffle files will be alive as long as the job needing the shuffle data.

That was the first scaling level. The second one is the external, so the cluster-based. What if the Dynamic Resource Allocation works great but the cluster doesn't have room anymore to handle new tasks? It's where the cluster-based scaling goes into action. Kubernetes has an extra component called cluster-autoscaler. It's the tool that adjusts the size of the cluster if scheduled pods can't be run. It would obviously happen if the Dynamic Resource Allocation scheduled an executor pod on the cluster lacking the free space.

The cluster-based scaling part in YARN mostly relies on the cloud provider's auto scaling capabilities, such as EMR Auto Scaling on AWS, or Dataproc auto scaling policy on GCP.

Isolation

Another important point is the workload isolation. If we talk about a multi-tenant environment here, the isolation is much harder to achieve on YARN. YARN is a single cluster with already some dependencies pre-installed. It may pretty easily lead to dependency hell problem where the class loader loads bad versions of the libraries from the cluster and provokes the method not found or class not found exceptions. There are various solutions to this issue and the most common one is Shading. However, it adds an extra overhead to the build process by rewriting the conflicting libraries.

Kubernetes doesn't have this dependency hell problem. Each Apache Spark job is a packaged container, so it has its own isolated scope without the risk of the conflict. On the other hand, the cost of building the package can be higher because you'll need to include all the dependencies in the image. While for YARN, some of them might be already installed on the cluster.

Besides the dependency hell, Kubernetes isolation helps testing new versions of Apache Spark easier. It's just a dependency bump in the build files while for YARN - if we consider the cloud-based scenario - it's often the synonym of waiting for the cloud provider to upgrade the version of the service.

To know more about the isolation aspect, I invite to the great Spark+AI Summit 2020 talk of Jean-Yves Stephan and Julien Dumazert, Running Apache Spark on Kubernetes: Best Practices and Pitfalls. The screenshot above comes from the talk and summarizes greatly the differences between YARN and Kubernetes:

Other factors

Although scalability is one of the main important attributes for a resource manager, it's not the single one. There are many others that are important but probably less visible:

The topic of resource managers is not easy, exactly as answering the question on which one from YARN and Kubernetes, you should use for your future Apache Spark workloads. Hopefully, the elements presented in this article will help you make this decision!