Apache Spark on Kubernetes - global overview

The last years have been marked by the popularization of Kubernetes. Thanks to its replication and scalability properties it's more and more often used in distributed architectures. Apache Spark, through a dedicated working group, integrates Kubernetes steadily. In the current (2.3.1) version this new method of scheduling jobs is included in the project as an experimental feature.

This post gives a global overview of Apache Spark workloads executed on Kubernetes. Even though it lists some Kubernetes concepts, it won't present them in detail. They will be developed further in subsequent parts of the Spark on Kubernetes posts. The first section covers the main components of the project with a short description of each. The second part shows what happens to Apache Spark programs when they're submitted to Kubernetes.

Spark on Kubernetes components

Among the resources involved in running Spark applications on Kubernetes we can distinguish the ones belonging to Apache Spark and the ones belonging to Kubernetes. The former include:

  • scheduler - the project comes with its own scheduler, KubernetesClusterSchedulerBackend. It's responsible for the executors lifecycle management (allocation, removal, replacement).
  • driver - obviously, no Spark pipeline could exist without a driver, so it's also present with the Kubernetes scheduler. The driver is created as a Kubernetes pod and keeps its usual responsibilities (governing tasks execution and executors creation, being the starting point of the program execution).
  • executor - like the driver, executors are also represented as pods and keep their initial role of physical tasks executors. A quick way to observe both kinds of pods is sketched just after this list.
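
As a sketch, and assuming the default namespace plus the spark-role label that Spark puts on the pods it creates (visible in the executor template shown later in this post), the driver and executor pods of a running application can be listed directly with kubectl:

    # List the driver and executor pods of Spark applications; the spark-role
    # label values are the ones used by the Kubernetes scheduler backend
    kubectl get pods -l spark-role=driver
    kubectl get pods -l spark-role=executor

    # Follow the driver logs - the pod name below is the generated one reused
    # later in this post, so adapt it to your own application
    kubectl logs -f spark-pi-c4f8d5d3239a3dcabdb93d6c4f27347a-driver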

The Kubernetes resource manager implementation in the official Spark project (2.3.1) differs a little from the community group's proposal. Some of the concepts were not taken into the official version but it's worth mentioning them since they give extra insight into the implementation process:

  • external shuffle service - it stores shuffle files beyond the lifecycle of the executors, i.e. they remain available even when some of the executors go down.
  • resource staging server - a daemon used to store and serve application dependencies. Any files stored locally and used by the executors are compressed into a tarball and uploaded to it.

Please note that despite the lack of implementation of these 2 concepts, they'll be presented in further posts to give practical examples of Kubernetes components in the context of Apache Spark.

Job submission

The magic of Spark application execution on Kubernetes happens thanks to the spark-submit tool. It translates the Spark program into a format schedulable by Kubernetes. That is, it analyzes the execution options (memory, CPU and so forth) and uses them to build the driver and executor pods with the help of the io.fabric8.kubernetes Java client library for Kubernetes. The executor pods, for instance, are built by org.apache.spark.scheduler.cluster.k8s.ExecutorPodFactory and the output looks like:

  • for a driver
    apiVersion: v1
    kind: Pod
    metadata:
      annotations:
        spark-app-name: spark-pi
      name: spark-pi-c4f8d5d3239a3dcabdb93d6c4f27347a-driver
      namespace: default
    spec:
      containers:
      - args:
        - driver
        image: spark:latest
        imagePullPolicy: IfNotPresent
        name: spark-kubernetes-driver
      
  • for an executor
    apiVersion: v1
    kind: Pod
    metadata: 
      labels:
        spark-app-selector: spark-application-1528459049572
        spark-exec-id: "1"
        spark-role: executor
    spec:
      containers:
      - args:
        - executor
        image: spark:latest
        imagePullPolicy: IfNotPresent
        name: executor
      

It's not the best moment to show and analyze the whole template file, so only a part is presented here. But it's a good occasion to mention that Kubernetes deploys the application defined in a container image, such as a Docker one. Spark provides a default image that can be built from kubernetes/dockerfiles/Dockerfile with the bin/docker-image-tool.sh command.
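
As a sketch, building and publishing that default image could look like the commands below, run from the root of the Spark distribution; the registry address and the tag are placeholders to adapt:

    # Build the default Spark image from kubernetes/dockerfiles/Dockerfile
    # (my-registry.example.com and the 2.3.1 tag are example values)
    ./bin/docker-image-tool.sh -r my-registry.example.com -t 2.3.1 build

    # Push it to the registry so that the Kubernetes nodes can pull it
    ./bin/docker-image-tool.sh -r my-registry.example.com -t 2.3.1 push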

Once all mandatory components are built, Spark asks the Kubernetes scheduler to physically deploy the just created pods.

To summarize, deploying a Spark application with Kubernetes is quite easy. We obviously need Kubernetes installed, the Spark image published and, as in the case of other deployment strategies, a JAR package with the code to execute. Afterwards, the shipping of the package is handled by the Kubernetes mechanism.
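
To give an idea of the submission itself, here is a minimal sketch of a spark-submit invocation, assuming cluster deploy mode, the SparkPi example shipped in the default image and placeholder values for the API server address, the image name and the examples JAR path:

    # Run the SparkPi example on Kubernetes; the master URL, the image name and
    # the JAR location inside the image are assumptions to adapt to your setup
    ./bin/spark-submit \
      --master k8s://https://<k8s-apiserver-host>:<k8s-apiserver-port> \
      --deploy-mode cluster \
      --name spark-pi \
      --class org.apache.spark.examples.SparkPi \
      --conf spark.executor.instances=2 \
      --conf spark.kubernetes.container.image=my-registry.example.com/spark:2.3.1 \
      local:///opt/spark/examples/jars/spark-examples_2.11-2.3.1.jar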

Read also about Apache Spark on Kubernetes - global overview here: Running Spark on Kubernetes (official), Running Spark on Kubernetes (community group).
