Apache Spark on Kubernetes - global overview

Versions: Apache Spark 2.3.1

Last years are the symbol of popularization of Kubernetes. Thanks to its replication and scalability properties it's more and more often used in distributed architectures. Apache Spark, through a special group of work, integrates Kubernetes steadily. In current (2.3.1) version this new method to schedule jobs is integrated in the project as experimental feature.

Looking for a better data engineering position and skills?

You have been working as a data engineer but feel stuck? You don't have any new challenges and are still writing the same jobs all over again? You have now different options. You can try to look for a new job, now or later, or learn from the others! "Become a Better Data Engineer" initiative is one of these places where you can find online learning resources where the theory meets the practice. They will help you prepare maybe for the next job, or at least, improve your current skillset without looking for something else.

👉 I'm interested in improving my data engineering skillset

See you there, Bartosz

This post shows a global overview of Apache Spark workloads executed by Kubernetes. Even though it will list some of Kubernetes concepts, it won't present them precisely. They will be developed more in subsequent parts of Spark on Kubernetes posts. In the first section we'll discover main components of the project with their short description. The second part will show what happens with Apache Spark programs when they're submitted on Kubernetes.

Spark on Kubernetes components

Among the resources involved in running Spark applications on Kubernetes we can retrieve the resources belonging to Apache Spark and to Kubernetes. For the former ones we can include:

Kubernetes resource manager implementation in official Spark project (2.3.1) differs a little from the proposal of community group. Some of concepts were not taken to the official version but it's worth to mention them since they give an extra input about implementation process:

Please note that despite the lack of implementation of above 2 concepts, they'll be presented in further posts just to give practical examples of Kubernetes components in the context of Apache Spark.

Job submission

The magic of Spark application execution on Kubernetes happens thanks to spark-submit tool. It translates Spark program to the format schedulable by Kubernetes. That said it analyzes execution options (memory, CPU and so forth) and uses them to build driver and executor pods with the help of io.fabric8.kubernetes Java library for Kubernetes. The pods are created by org.apache.spark.scheduler.cluster.k8s.ExecutorPodFactory and the output looks like:

That it's not a good moment to show and analyze whole template file. Thus, only a part is presented here. But it's a good occasion to mention that Kubernetes deploys the application defined in a container image as Docker's one. Spark provides default image that can be build from kubernetes/dockerfiles/Dockerfile with bin/docker-image-tool.sh command.

Once all mandatory components built, Spark asks Kubernetes scheduler to physically deploy just created pods.

To summarize, deploying Spark application with Kubernetes is quite easy. We need obviously Kubernetes installed, Spark image published and, as in the case of other deployment strategies, JAR package with the code to execute. After the shipping of the package is handled by Kubernetes mechanism.