The last few years have seen Kubernetes gain a lot of popularity. Thanks to its replication and scalability properties it's more and more often used in distributed architectures. Apache Spark, through a dedicated working group, has been steadily integrating Kubernetes as well. In the current (2.3.1) version this new way of scheduling jobs is part of the project as an experimental feature.
This post gives a global overview of Apache Spark workloads executed on Kubernetes. Even though it lists some Kubernetes concepts, it doesn't present them in detail; they will be developed further in subsequent Spark on Kubernetes posts. The first section introduces the main components of the project with a short description of each. The second part shows what happens to Apache Spark programs when they're submitted to Kubernetes.
Spark on Kubernetes components
Among the resources involved in running Spark applications on Kubernetes we can distinguish the ones belonging to Apache Spark and the ones belonging to Kubernetes. The former include:
- scheduler - the project comes with its own scheduler backend, KubernetesClusterSchedulerBackend. It's responsible for the executors' lifecycle management (allocation, removal, replacement).
- driver - obviously, no Spark pipeline could exist without a driver, so it's also present with the Kubernetes scheduler. The driver is created as a Kubernetes pod and keeps its usual responsibilities (governing tasks execution and executors creation, serving as the starting point of program execution).
- executor - like the driver, executors are also represented as pods and keep their usual role of physically executing tasks; both kinds of pods can be inspected with standard kubectl commands, as sketched right after this list.
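A minimal sketch of that inspection, assuming a cluster reachable from kubectl, the default namespace and the spark-role label Spark sets on the pods it creates (visible in the executor manifest further down in this post):

# list the driver and executor pods running in the default namespace
kubectl get pods -n default -l spark-role=driver
kubectl get pods -n default -l spark-role=executor

# follow the driver output exactly as for any other pod
# (<driver-pod-name> is a placeholder for the name returned by the previous command)
kubectl logs -f <driver-pod-name> -n default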
The Kubernetes resource manager implementation in the official Spark project (2.3.1) differs a little from the community group's proposal. Some concepts were not taken into the official version, but it's worth mentioning them since they give extra insight into the implementation process:
- external shuffle service - it stores shuffle files beyond the lifecycle of the executors, i.e. they remain available even when some executors go down.
- resource staging server - any files stored locally and used by the executors are compressed into a tarball and uploaded to the resource staging server, a daemon used to store and serve application dependencies.
Please note that even though these 2 concepts are not implemented in the official version, they'll be presented in further posts to give practical examples of Kubernetes components in the context of Apache Spark.
Job submission
The magic of Spark application execution on Kubernetes happens thanks to the spark-submit tool. It translates the Spark program into a format schedulable by Kubernetes. Concretely, it analyzes the execution options (memory, CPU and so forth) and uses them to build the driver and executor pods with the help of the io.fabric8.kubernetes Java library for Kubernetes (the executor pods, for instance, are created by org.apache.spark.scheduler.cluster.k8s.ExecutorPodFactory). The generated pod specifications look like:
- for a driver
apiVersion: v1
kind: Pod
metadata:
  annotations:
    spark-app-name: spark-pi
  name: spark-pi-c4f8d5d3239a3dcabdb93d6c4f27347a-driver
  namespace: default
spec:
  containers:
  - args:
    - driver
    image: spark:latest
    imagePullPolicy: IfNotPresent
    name: spark-kubernetes-driver
- for an executor
apiVersion: v1
kind: Pod
metadata:
  labels:
    spark-app-selector: spark-application-1528459049572
    spark-exec-id: "1"
    spark-role: executor
spec:
  containers:
  - args:
    - executor
    image: spark:latest
    imagePullPolicy: IfNotPresent
    name: executor
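The excerpts above are trimmed down on purpose. If you have a cluster at hand, the complete manifests generated by Spark can be dumped from the API server; the driver pod name and the namespace below are the ones from the driver excerpt, and the label selector is the one visible in the executor manifest:

# print the full specification of the generated driver pod
kubectl get pod spark-pi-c4f8d5d3239a3dcabdb93d6c4f27347a-driver -n default -o yaml

# print the full specification of the executor pods of the application
kubectl get pods -n default -l spark-role=executor -o yaml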
Analyzing the whole template file is out of this post's scope, hence the trimmed excerpts. But it's a good occasion to mention that Kubernetes deploys the application defined in a container image, a Docker one in this case. Spark provides a default image that can be built from kubernetes/dockerfiles/Dockerfile with the bin/docker-image-tool.sh command, as illustrated below.
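A minimal sketch of that image build; the registry address and the tag are placeholders to adapt to your environment, and the script should produce an image named like my-registry.example.com/spark:v2.3.1:

# build the Spark image from the Dockerfile shipped with the distribution
./bin/docker-image-tool.sh -r my-registry.example.com -t v2.3.1 build

# push it to the registry so that the Kubernetes nodes can pull it
./bin/docker-image-tool.sh -r my-registry.example.com -t v2.3.1 push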
Once all mandatory components are built, Spark asks the Kubernetes scheduler to physically deploy the newly created pods.
To summarize, deploying a Spark application with Kubernetes is quite easy. We obviously need Kubernetes installed, a Spark image published and, as with other deployment strategies, a JAR package with the code to execute. Afterwards, shipping the package is handled by the Kubernetes machinery.
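To make it concrete, a submission could look like the sketch below; the API server address is a placeholder, the image is the hypothetical one built above and the JAR is assumed to be already present inside the image (hence the local:// scheme):

# submit the application in cluster mode against the Kubernetes API server
./bin/spark-submit \
  --master k8s://https://<k8s-apiserver-host>:<k8s-apiserver-port> \
  --deploy-mode cluster \
  --name spark-pi \
  --class org.apache.spark.examples.SparkPi \
  --conf spark.executor.instances=2 \
  --conf spark.kubernetes.container.image=my-registry.example.com/spark:v2.3.1 \
  local:///opt/spark/examples/jars/spark-examples_2.11-2.3.1.jar

After that, the driver and executor pods described earlier appear in the cluster and can be monitored with kubectl.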