I had the idea for this blog post when I was preparing the "What's new in Apache Spark..." series. At that time, I was writing about Kubernetes in the context of Apache Spark but had to "google" a lot of things on the side, mostly the Kubernetes API terms.
Apache Spark interacts with the Kubernetes API via the Fabric8 library, which provides a user-friendly way to communicate with the cluster. The first term you may encounter in the code base is a Watcher. It's a listener-oriented interface implemented in Apache Spark by the following classes:
- ExecutorPodsWatcher - responsible for updating the executor pod snapshot store.
- LoggingPodStatusWatcher - listens for pod state updates and logs them as "State changed, new state: ..." messages.
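To make the listener idea more concrete, here is a minimal sketch of a watcher that turns every pod update into a log line. It does not use Fabric8; the Action values, SimplePod and LoggingWatcher are hypothetical stand-ins, loosely modeled on the callback style described above.

```scala
// Simplified stand-ins for illustration only, NOT the Fabric8 or Spark classes.
object WatcherSketch {
  sealed trait Action
  case object Added extends Action
  case object Modified extends Action
  case object Deleted extends Action

  // Hypothetical, reduced view of a pod: just a name and a phase.
  final case class SimplePod(name: String, phase: String)

  // Listener-style contract: the cluster client calls back on every change.
  trait PodWatcher {
    def eventReceived(action: Action, pod: SimplePod): Unit
  }

  // Same idea as a logging watcher: each update becomes a log message.
  final class LoggingWatcher extends PodWatcher {
    val logged = scala.collection.mutable.ArrayBuffer.empty[String]

    override def eventReceived(action: Action, pod: SimplePod): Unit =
      logged += s"State changed, new state: ${pod.name} is ${pod.phase} ($action)"
  }

  def main(args: Array[String]): Unit = {
    val watcher = new LoggingWatcher
    watcher.eventReceived(Added, SimplePod("exec-1", "Pending"))
    watcher.eventReceived(Modified, SimplePod("exec-1", "Running"))
    watcher.logged.foreach(println)
  }
}
```

The watcher never polls; it only reacts when the caller pushes an event to it, which is the point of the listener-oriented design.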
Executor pod snapshot store
One of the watchers updates the executor pod snapshot store. It's represented by the ExecutorPodsSnapshotsStore interface and so far, it has a single implementation in the code base, the ExecutorPodsSnapshotsStoreImpl. It follows the event-driven paradigm: the listeners read the information about the executor pods and trigger the corresponding actions to update the store.
There are 2 snapshot events:
- incremental - the ExecutorPodsSnapshotsStore exposes an updatePod(updatedPod: Pod) method called by ExecutorPodsWatcher whenever it receives a pod update.
- full - managed by ExecutorPodsSnapshotsStore's replaceSnapshot(newSnapshot: Seq[Pod]) method, it replaces the state of all pods retrieved in the request. The class responsible for reading the complete information about the executor pods is ExecutorPodsPollingSnapshotSource. The update frequency is defined in the spark.kubernetes.executor.apiPollingInterval property.
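The difference between the two events can be sketched with a toy in-memory store. This is an assumption-heavy simplification: SimplePod and InMemorySnapshotStore are hypothetical names, and only the updatePod and replaceSnapshot method names come from the text above.

```scala
// Toy model of the two snapshot events; not the ExecutorPodsSnapshotsStoreImpl code.
object SnapshotStoreSketch {
  // Hypothetical, reduced view of a pod.
  final case class SimplePod(name: String, phase: String)

  final class InMemorySnapshotStore {
    private var current: Map[String, SimplePod] = Map.empty

    // Incremental event: only the pod carried by the update is touched.
    def updatePod(updatedPod: SimplePod): Unit =
      current = current + (updatedPod.name -> updatedPod)

    // Full event: the whole state is rebuilt from the polled list of pods.
    def replaceSnapshot(newSnapshot: Seq[SimplePod]): Unit =
      current = newSnapshot.map(pod => pod.name -> pod).toMap

    def snapshot: Map[String, SimplePod] = current
  }

  def main(args: Array[String]): Unit = {
    val store = new InMemorySnapshotStore
    store.updatePod(SimplePod("exec-1", "Pending"))
    store.updatePod(SimplePod("exec-2", "Running"))
    store.updatePod(SimplePod("exec-1", "Running"))
    println(store.snapshot)

    // After a full replacement, pods missing from the poll result disappear.
    store.replaceSnapshot(Seq(SimplePod("exec-2", "Succeeded")))
    println(store.snapshot)
  }
}
```

An incremental update can only add or overwrite one entry, whereas a full replacement is the only event that can drop pods the store has not heard about for a while.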
The information kept in the store comes from Fabric8's Pod class which brings, among others, the pod's metadata, specification and status. Each Pod instance is wrapped into Apache Spark's ExecutorPodState, being one of:
case class PodRunning(pod: Pod) extends ExecutorPodState
case class PodPending(pod: Pod) extends ExecutorPodState
sealed trait FinalPodState extends ExecutorPodState
case class PodSucceeded(pod: Pod) extends FinalPodState
case class PodFailed(pod: Pod) extends FinalPodState
case class PodDeleted(pod: Pod) extends FinalPodState
case class PodTerminating(pod: Pod) extends FinalPodState
case class PodUnknown(pod: Pod) extends ExecutorPodState
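The sealed hierarchy makes it easy to classify a pod and to tell final states apart with pattern matching. The sketch below assumes a reduced SimplePod with a phase string and a deleted flag; the toState mapping is an illustrative guess at how a phase could be translated, not a copy of Apache Spark's logic.

```scala
import java.util.Locale

// Illustrative only: SimplePod, toState and isFinal are hypothetical helpers.
object PodStateSketch {
  final case class SimplePod(name: String, phase: String, deleted: Boolean = false)

  sealed trait ExecutorPodState { def pod: SimplePod }
  case class PodRunning(pod: SimplePod) extends ExecutorPodState
  case class PodPending(pod: SimplePod) extends ExecutorPodState
  sealed trait FinalPodState extends ExecutorPodState
  case class PodSucceeded(pod: SimplePod) extends FinalPodState
  case class PodFailed(pod: SimplePod) extends FinalPodState
  case class PodDeleted(pod: SimplePod) extends FinalPodState
  case class PodTerminating(pod: SimplePod) extends FinalPodState
  case class PodUnknown(pod: SimplePod) extends ExecutorPodState

  // Assumed mapping: a deleted pod wins over its phase, unknown phases fall back to PodUnknown.
  def toState(pod: SimplePod): ExecutorPodState =
    if (pod.deleted) PodDeleted(pod)
    else pod.phase.toLowerCase(Locale.ROOT) match {
      case "pending"     => PodPending(pod)
      case "running"     => PodRunning(pod)
      case "succeeded"   => PodSucceeded(pod)
      case "failed"      => PodFailed(pod)
      case "terminating" => PodTerminating(pod)
      case _             => PodUnknown(pod)
    }

  // Every state extending FinalPodState means the executor will not come back.
  def isFinal(state: ExecutorPodState): Boolean = state match {
    case _: FinalPodState => true
    case _                => false
  }

  def main(args: Array[String]): Unit = {
    val states = Seq(
      SimplePod("exec-1", "Running"),
      SimplePod("exec-2", "Failed"),
      SimplePod("exec-3", "Running", deleted = true)
    ).map(toState)
    states.foreach(state => println(s"$state final=${isFinal(state)}"))
  }
}
```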
The next item is a pure Kubernetes feature. A pod template is a way to generalize pod definitions. The feature has been available since Apache Spark 3.0 to respond to the increased need for customization. Before this change, every new customization required adding a new configuration option to Apache Spark, meaning more code to maintain and a bigger gap with Kubernetes' declarative model.
A pod template is a YAML definition (template) that can be reused by different jobs. One of the most self-explanatory examples I've found in the context of Apache Spark is the sidecar container pattern, running the executor alongside a monitoring container. You can find the full code example in this EMR documentation page.
Not all pod template properties can be overwritten in the Apache Spark context. Some of them, like restartPolicy (always "never"), volumes or service account, are always managed by Apache Spark. The full list is available in the Pod Template properties documentation.
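As a rough illustration of how a job points to its templates, the sketch below builds the two podTemplateFile properties and renders them as spark-submit --conf arguments. The file paths and the helper names are hypothetical; only the property keys are Apache Spark's.

```scala
// Minimal sketch: helper names and file paths are made up for the example.
object PodTemplateConfSketch {
  def podTemplateConf(driverTemplate: String, executorTemplate: String): Map[String, String] =
    Map(
      "spark.kubernetes.driver.podTemplateFile" -> driverTemplate,
      "spark.kubernetes.executor.podTemplateFile" -> executorTemplate
    )

  // Renders the properties as "--conf key=value" pairs, sorted by key for stable output.
  def toSubmitArgs(conf: Map[String, String]): Seq[String] =
    conf.toSeq.sortBy(_._1).flatMap { case (key, value) => Seq("--conf", s"$key=$value") }

  def main(args: Array[String]): Unit = {
    val conf = podTemplateConf("/opt/templates/driver.yaml", "/opt/templates/executor.yaml")
    println(toSubmitArgs(conf).mkString(" "))
  }
}
```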
The pod storage is volatile because it disappears when the pod goes offline. To bring some persistence to the data, Kubernetes uses volumes that are also available for Apache Spark jobs. The framework supports the following Kubernetes volumes:
- Persistent Volume Claim (PVC) - a Persistent Volume is a type of storage with a separate lifecycle from the pod's. It's defined by a capacity, type (azure, awsElasticBlockStore, gcePersistentDisk) and access mode. However, it's an administrator who manages this PersistentVolume, not the developer. The developer creates a claim requesting some Persistent Volume capacity.
Starting from Apache Spark 3.1.1, the library supports dynamic PVC creation, which is useful for dynamic workloads where the number of executors, hence the number of claims, is unknown when the job starts (e.g., Dynamic Resource Allocation).
- Network File System (NFS) - mounts an existing NFS into a pod. It's not temporary storage, i.e., Kubernetes doesn't remove it when the pod stops. Hence, it's a good candidate to pre-populate it with some data and share it between various pods.
- host path - mounts the node's directory into the pod. Even though it's technically doable, Kubernetes recommends avoiding it whenever possible due to the security risks. And if you don't have a choice, it should be scoped to a specific place (directory, file) and limited to read-only access.
- empty dir - volume scoped to the Pod's lifespan but shareable between the Pod's containers. Even though this volume dies when the pod stops, it can survive container restarts caused by health check failures.
Apache Spark supports these volumes via configuration entries starting with spark.kubernetes.driver.volumes or spark.kubernetes.executor.volumes.
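The entries follow a volumes.[type].[name] pattern with mount and options suffixes. Here is a small sketch building them for an executor PVC; the volumeConf helper, the volume name "data", the mount path and the storage class are hypothetical values picked for the example, while OnDemand is the claim name used to request dynamic PVC creation.

```scala
// Sketch of the volume configuration keys; helper and sample values are assumptions.
object VolumeConfSketch {
  def volumeConf(
      role: String,        // "driver" or "executor"
      volumeType: String,  // e.g. "persistentVolumeClaim", "nfs", "hostPath", "emptyDir"
      name: String,
      mountPath: String,
      readOnly: Boolean,
      options: Map[String, String]): Map[String, String] = {
    val prefix = s"spark.kubernetes.$role.volumes.$volumeType.$name"
    Map(
      s"$prefix.mount.path" -> mountPath,
      s"$prefix.mount.readOnly" -> readOnly.toString
    ) ++ options.map { case (key, value) => s"$prefix.options.$key" -> value }
  }

  def main(args: Array[String]): Unit = {
    val conf = volumeConf(
      role = "executor",
      volumeType = "persistentVolumeClaim",
      name = "data",
      mountPath = "/data",
      readOnly = false,
      // OnDemand asks for a claim created per executor; the other values are sample ones.
      options = Map("claimName" -> "OnDemand", "storageClass" -> "standard", "sizeLimit" -> "10Gi")
    )
    conf.toSeq.sortBy(_._1).foreach { case (key, value) => println(s"$key=$value") }
  }
}
```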
The last interesting fact is about the service account. Why would an Apache Spark job ever need a service account? If you think about managed Apache Spark cloud services like EMR, you'll immediately understand the reason. An EMR cluster also needs a set of permissions. It needs them to get EC2 instances and create a new cluster, or to access the data processing resources (S3, DynamoDB, Kinesis Data Streams, ...) used by the job.
Well, the service account in Apache Spark on Kubernetes is similar. It defines what the given pod can do. A driver pod will need the permissions to create and watch executor pods, services and config maps.
A service account alone won't be enough to authorize the driver to create executor pods. In addition to creating it, you will have to create a corresponding role or cluster role and bind it to the service account. Roles and cluster roles have different scopes. The latter apply cluster-wide, whereas the former work only within a namespace.
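The scoping difference can be shown with a toy authorization check. This is not how Kubernetes implements RBAC; Role, ClusterRole, Binding and allowed are simplified, hypothetical models. On the Apache Spark side, the driver's account is set with the spark.kubernetes.authenticate.driver.serviceAccountName property.

```scala
// Toy RBAC model for illustration; it only captures namespace vs cluster-wide scope.
object RbacSketch {
  sealed trait Grant {
    def verbs: Set[String]
    def resources: Set[String]
  }
  final case class Role(namespace: String, verbs: Set[String], resources: Set[String]) extends Grant
  final case class ClusterRole(verbs: Set[String], resources: Set[String]) extends Grant

  // A binding attaches a role or a cluster role to a service account.
  final case class Binding(serviceAccount: String, grant: Grant)

  def allowed(bindings: Seq[Binding], account: String, namespace: String,
              verb: String, resource: String): Boolean =
    bindings.exists { binding =>
      binding.serviceAccount == account &&
      binding.grant.verbs(verb) &&
      binding.grant.resources(resource) &&
      (binding.grant match {
        case Role(roleNamespace, _, _) => roleNamespace == namespace // namespace-scoped
        case _: ClusterRole            => true                       // cluster-wide
      })
    }

  def main(args: Array[String]): Unit = {
    val driverRole = Role("spark-jobs", Set("create", "watch"), Set("pods", "services", "configmaps"))
    val bindings = Seq(Binding("spark-driver", driverRole))
    println(allowed(bindings, "spark-driver", "spark-jobs", "create", "pods"))
    println(allowed(bindings, "spark-driver", "other-namespace", "create", "pods"))
    println(allowed(Nil, "spark-driver", "spark-jobs", "create", "pods"))
  }
}
```

The last call shows why the account alone is not enough: without a binding, nothing is granted.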
You will find the presented concepts in the video below:
The article started with 2 low-level implementation details of the Kubernetes resource manager. You discovered that Apache Spark uses listeners to keep track of all the cluster changes. In the next 3 parts you saw more high-level concepts: the pod templates, the various volume types and the service accounts. I hope that altogether, it helped you get a better understanding of the Apache Spark on Kubernetes integration!