What's new in Apache Spark 3.0 - Kubernetes

Versions: Apache Spark 3.0.0

I believe Kubernetes is the next big step in the framework after proposing Catalyst Optimizer, modernizing streaming processing with Structured Streaming, and introducing Adaptive Query Execution. Especially that Apache Spark 3 brings a lot of changes in this part!

A virtual conference at the intersection of Data and AI. This is not a conference for the hype. Its real users talking about real experiences.
- 40+ speakers with the likes of Hannes from Duck DB, Sol Rashidi, Joe Reis, Sadie St. Lawrence, Ryan Wolf from nvidia, Rebecca from lidl
- 12th September 2024
- Three simultaneous tracks
- Panels, Lighting Talks, Keynotes, Booth crawls, Roundtables and Entertainment.
- Topics include (ingestion, finops for data, data for inference (feature platforms), data for ML observability
- 100% virtual and 100% free

👉 Register here

This post is divided into 5 sections with one category of features each. You will see first the configuration changes followed by runtime environment and security enhancement. After that, you will learn about one of the major evolutions of Kubernetes - pod templates. By the end of the post, you will have learned what changes from the contributor's perspective.

Configuration

The first important change from this category is the...configuration validation. As of Apache Spark 3.0, Kubernetes attributes (= prefixed with spark.executorEnv) that do not match the pattern [-._a-zA-Z][-._a-zA-Z0-9]* will be ignored with a warning message as follows

Invalid key: ... a valid environment variable name must consist of alphabetic characters, digits, '_', '-', or '.', and must not start with a digit.

With the 3.0 release you can request a specific number of cores for the driver with spark.kubernetes.driver.request.cores property. It ensures the features parity with the executor whose property (spark.kubernetes.executor.request.cores) is available since Spark 2.4.0.

Apart from that, a few time-control properties were also added. With spark.kubernetes.submission.connectionTimeout and spark.kubernetes.submission.requestTimeout you can control the connection and request timeouts for starting the driver in client mode. Two similar properties also exist to handle the same timeouts for requesting the executors (spark.kubernetes.driver.requestTimeout, spark.kubernetes.driver.connectionTimeout).

In the end, let me point out an important bug regarding not resolved pod prefixes. The prefix set in spark.kubernetes.executor.podNamePrefix was not added to your pods, it shouldn't be the case anymore in Apache Spark 3.0.

Runtime environment

Another category of features is related to the runtime environments. First, for the PySpark's users, Python 3 is the default binding in the Docker images (the version 2 is still supported)

// Default change in the config
  val PYSPARK_MAJOR_PYTHON_VERSION =
    ConfigBuilder("spark.kubernetes.pyspark.pythonVersion")
      .doc("This sets the major Python version. Either 2 or 3. (Python2 or Python3)")
      .version("2.4.0")
      .stringConf
      .checkValue(pv => List("2", "3").contains(pv),
        "Ensure that major Python version is either Python2 or Python3")
      .createWithDefault("3") // was "2" in Spark 2.4.0

And if you curious about the binding resolution, it looks like that:

# /resource-managers/kubernetes/docker/src/main/dockerfiles/spark/entrypoint.sh
if [ "$PYSPARK_MAJOR_PYTHON_VERSION" == "2" ]; then
    pyv="$(python -V 2>&1)"
    export PYTHON_VERSION="${pyv:7}"
    export PYSPARK_PYTHON="python"
    export PYSPARK_DRIVER_PYTHON="python"
elif [ "$PYSPARK_MAJOR_PYTHON_VERSION" == "3" ]; then
    pyv3="$(python3 -V 2>&1)"
    export PYTHON_VERSION="${pyv3:7}"
    export PYSPARK_PYTHON="python3"
    export PYSPARK_DRIVER_PYTHON="python3"
fi

Regarding Scala-Spark API, the first runtime evolution concerns the JDK version. You may already know that Apache Spark 3.0 added a support for JDK version 11 (SPARK-24417) after skipping directly the compatibility work on the versions 9 and 10 due to their end of life (what?!). This support was materialized in the Kubernetes part with an example of tagging the built Spark Docker image with Java 11 from docker-image-tool.sh:

$0 -r docker.io/myrepo -t v3.0.0 -b java_image_tag=11-jre-slim build

Under-the-hood, this tag will be passed to the image to pull the correct version of the Java image:

ARG java_image_tag=8-jre-slim

FROM openjdk:${java_image_tag}
# Rest of the Spark image...

Finally, to optimize test executions, Apache Spark 3.0 test images use JRE instead of JDK.

Security enhancements

In the next category you can find all important security enhancements. If you remember my blog post about Docker images and Apache Spark applications, one of the best practices was not using the root user in the image. In the previous release, the images built from the project's docker-image-tool.sh used root. It changed in the 3.0 because the script sets the default user that can be overridden with -u parameter.

The second important evolution is about secret redaction. Before Apache Spark 3.0, only the values of the configuration attributes with the words like "secret" or "password" weren't displayed on the UI and logs. In 3.0, a new item was added to this list, the "token", targeted to spark.kubernetes.authenticate.submission.oauthToken property used in Kubernetes server authentication.

Finally, Apache Spark 3.0 also added the support for Kerberos in the cluster and client mode. To recall, Kerberos is a ticket-based network authentication protocol solving identity problems in non-secure networks. In the Apache Spark context, this feature helps to safely communicate with HDFS or any other kerberized service.

Extendibility

The next important feature is the extendibility. Kubernetes comes with a YAML specification called pod templates that you can use to create and run new pods. What is the link with Apache Spark? The community chose it as an alternative for the customization requests. Before that, any new Kubernetes option had to be added to the Apache Spark config file. In the long term, it could increase the gap between Kubernetes and Spark, and also make the project difficult to maintain.

One flexible approach to extend the Apache Spark pods executed on Kubernetes consists on using then the pod templates from these properties, the spark.kubernetes.driver.podTemplateFile for the driver and spark.kubernetes.executor.podTemplateFile for the executors. It's important to notice that some of the template attributes won't never overwrite the values managed by Apache Spark configuration, like for example the one for the namespace or pods restart policy.

Testing facility

Finally, the new version of Apache Spark also brought some changes for the contributors. Broken and quite heavy Ceph dependency used in tests was replaced by smaller and more robust Minio.

Also, a better error message handling for the secret tests were added. Also, the concurrency control with unpredictable Thread.sleep were replaced with the listener and CountDownLatch-based orchestration (read more about this class in CountDownLatch in Java article).

In addition to that, some extra integration tests were added for the features introduced in Apache Spark 2.4. One of them covers persistent volumes.

As of writing this blog post, the community shared an initial list of features targeted for Apache Spark 3.1.0. According to it, Kubernetes resource manager status should change from experimental to General Availability, with even more new features added!