The last few years have seen Kubernetes gain a lot of popularity. Thanks to its replication and scalability properties it's more and more often used in distributed architectures. Apache Spark, through a dedicated working group, has been steadily integrating Kubernetes as well. In the current (2.3.1) version this new way of scheduling jobs is part of the project as an experimental feature.
This post gives a global overview of Apache Spark workloads executed on Kubernetes. Even though it lists some Kubernetes concepts, it doesn't present them in detail; they will be developed further in subsequent Spark on Kubernetes posts. The first section introduces the main components of the project with a short description of each. The second part shows what happens to Apache Spark programs when they're submitted to Kubernetes.
Spark on Kubernetes components
Among the resources involved in running Spark applications on Kubernetes we can distinguish the ones belonging to Apache Spark and the ones belonging to Kubernetes. The former include:
- scheduler - the project comes with its own scheduler backend, KubernetesClusterSchedulerBackend. It's responsible for the executors' lifecycle management (allocation, removal, replacement).
- driver - obviously, no Spark pipeline could exist without a driver, so it's also present with the Kubernetes scheduler. The driver is created as a Kubernetes pod and keeps its usual responsibilities (governing tasks execution and executors creation, serving as the starting point of program execution).
- executor - like the driver, executors are also represented as pods and keep their usual role of physically executing tasks; both kinds of pods can be inspected with standard kubectl commands, as sketched right after this list.
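A minimal sketch of that inspection, assuming a cluster reachable from kubectl, the default namespace and the spark-role label Spark sets on the pods it creates (visible in the executor manifest further down in this post):

# list the driver and executor pods running in the default namespace
kubectl get pods -n default -l spark-role=driver
kubectl get pods -n default -l spark-role=executor

# follow the driver output exactly as for any other pod
# (<driver-pod-name> is a placeholder for the name returned by the previous command)
kubectl logs -f <driver-pod-name> -n default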
The Kubernetes resource manager implementation in the official Spark project (2.3.1) differs a little from the community group's proposal. Some concepts were not taken into the official version, but it's worth mentioning them since they give extra insight into the implementation process:
- external shuffle service - it stores shuffle files beyond the lifecycle of the executors, i.e. they remain available even when some executors go down.
- resource staging server - any files stored locally and used by the executors are compressed into a tarball and uploaded to the resource staging server, a daemon used to store and serve application dependencies.
Please note that even though these 2 concepts are not implemented in the official version, they'll be presented in further posts to give practical examples of Kubernetes components in the context of Apache Spark.
Job submission
The magic of Spark application execution on Kubernetes happens thanks to the spark-submit tool. It translates the Spark program into a format schedulable by Kubernetes. Concretely, it analyzes the execution options (memory, CPU and so forth) and uses them to build the driver and executor pods with the help of the io.fabric8.kubernetes Java library for Kubernetes (the executor pods, for instance, are created by org.apache.spark.scheduler.cluster.k8s.ExecutorPodFactory). The generated pod specifications look like:
- for a driver
apiVersion: v1
kind: Pod
metadata:
  annotations:
    spark-app-name: spark-pi
  name: spark-pi-c4f8d5d3239a3dcabdb93d6c4f27347a-driver
  namespace: default
spec:
  containers:
  - args:
    - driver
    image: spark:latest
    imagePullPolicy: IfNotPresent
    name: spark-kubernetes-driver
- for an executor
apiVersion: v1
kind: Pod
metadata:
  labels:
    spark-app-selector: spark-application-1528459049572
    spark-exec-id: "1"
    spark-role: executor
spec:
  containers:
  - args:
    - executor
    image: spark:latest
    imagePullPolicy: IfNotPresent
    name: executor
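The excerpts above are trimmed down on purpose. If you have a cluster at hand, the complete manifests generated by Spark can be dumped from the API server; the driver pod name and the namespace below are the ones from the driver excerpt, and the label selector is the one visible in the executor manifest:

# print the full specification of the generated driver pod
kubectl get pod spark-pi-c4f8d5d3239a3dcabdb93d6c4f27347a-driver -n default -o yaml

# print the full specification of the executor pods of the application
kubectl get pods -n default -l spark-role=executor -o yaml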
Analyzing the whole template file is out of this post's scope, hence the trimmed excerpts. But it's a good occasion to mention that Kubernetes deploys the application defined in a container image, a Docker one in this case. Spark provides a default image that can be built from kubernetes/dockerfiles/Dockerfile with the bin/docker-image-tool.sh command, as illustrated below.
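A minimal sketch of that image build; the registry address and the tag are placeholders to adapt to your environment, and the script should produce an image named like my-registry.example.com/spark:v2.3.1:

# build the Spark image from the Dockerfile shipped with the distribution
./bin/docker-image-tool.sh -r my-registry.example.com -t v2.3.1 build

# push it to the registry so that the Kubernetes nodes can pull it
./bin/docker-image-tool.sh -r my-registry.example.com -t v2.3.1 push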
Once all mandatory components are built, Spark asks the Kubernetes scheduler to physically deploy the newly created pods.
To summarize, deploying a Spark application with Kubernetes is quite easy. We obviously need Kubernetes installed, a Spark image published and, as with other deployment strategies, a JAR package with the code to execute. Afterwards, shipping the package is handled by the Kubernetes machinery.
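To make it concrete, a submission could look like the sketch below; the API server address is a placeholder, the image is the hypothetical one built above and the JAR is assumed to be already present inside the image (hence the local:// scheme):

# submit the application in cluster mode against the Kubernetes API server
./bin/spark-submit \
  --master k8s://https://<k8s-apiserver-host>:<k8s-apiserver-port> \
  --deploy-mode cluster \
  --name spark-pi \
  --class org.apache.spark.examples.SparkPi \
  --conf spark.executor.instances=2 \
  --conf spark.kubernetes.container.image=my-registry.example.com/spark:v2.3.1 \
  local:///opt/spark/examples/jars/spark-examples_2.11-2.3.1.jar

After that, the driver and executor pods described earlier appear in the cluster and can be monitored with kubectl.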