Docker images and Apache Spark applications

Versions: Apache Spark 2.4.4

Containers have been with us, data engineers, for several years now. The concept was already present in YARN, but the technology that really made them popular was Docker. In this post I will focus on recommended Docker practices to make our Apache Spark images better.

The idea for this post was born on Twitter, where AmRajdip asked me if I had already written about Docker and Spark best practices. I hadn't, but the topic was so interesting that I gave it a try. Thank you @AmRajdip for the idea!
I will start this article by describing some Docker image best practices that you can find in a lot of places (meetups, conferences, blog posts, ...). I put all the references below this article. By doing so, I wanted to gather them in a single place and try to find counter-examples. From time to time I succeeded, so you will also get some negative feedback about a few of them. After that introduction, I will analyze the official Spark image and try to create a custom image to run on Kubernetes.

Docker images best practices

I gathered here all the best practices that I will try to put into practice in the next section:
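As a quick illustration of two practices that come back later in this post (pinning a precise base tag instead of latest, and ordering the layers so that the rarely changing ones come first and stay in the build cache), here is a minimal Dockerfile sketch; the base image and the copied files are only placeholders for the example:

# pin an exact base version instead of the moving "latest" tag
FROM openjdk:8-jre-slim

# rarely changing layers first, so they stay in the build cache
COPY libs/ /opt/app/libs/

# frequently changing layers last, so only they get rebuilt
COPY conf/ /opt/app/conf/
COPY my-spark-app.jar /opt/app/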

Official image analysis

In order not to reinvent the wheel, I will start here by analyzing the Docker image provided with Apache Spark. Its 2.4.4 version respects pretty much all of the above best practices:

Custom image lifecycle

Initially I wanted to create a custom Docker image and I tried to imagine an application requiring all of the best practices listed in the first section. Unfortunately, it didn't happen. By forcing myself to do that, I invented a lot of anti-patterns, so to avoid any confusion, I decided to analyze the official image instead and show in this section how to manage it with some smart customizations.

To start working with the Apache Spark Docker image, you have to build it from the official Spark GitHub repository with the docker-image-tool.sh script. Normally all official images are stored on Docker Hub and you can extend them directly, without downloading and building anything from scratch. I didn't find a Spark image there though, hence this quite cumbersome process. Why "cumbersome"? First, you have to build the Spark JARs with ./build/mvn -Pkubernetes -DskipTests clean package and it can fail; it failed on my Ubuntu because of a "file too long" exception. Afterwards, you can still get the JARs and configure the image to use them, but it's also much slower than simply putting this in your Dockerfile:

FROM official_spark_image:spark_version_I_want_to_extend

Anyway, to build the image locally, you have to:

cd ~/workspace/spark-2.4.4
# The build below is required to avoid this error:
# COPY failed: stat 
# /var/lib/docker/tmp/docker-builder182148378/assembly/target/scala-2.12/jars: no such
# file or directory
./build/mvn -Pkubernetes -DskipTests clean package
./bin/docker-image-tool.sh -m -t v2.4.4 build

With these instructions, I built an image called "spark" tagged with "v2.4.4" (see the best practices about using tags and avoiding latest!). Let's run docker images to check if the images were correctly built:

bartosz:/tmp/spark$ docker images
REPOSITORY                                  TAG                 IMAGE ID            CREATED             SIZE
spark-r                                     v2.4.4              505c16a44002        2 days ago          720MB
spark-py                                    v2.4.4              9127a215f778        2 days ago          428MB
spark                                       v2.4.4              a221ab704368        2 days ago          337MB

As you can see, yes. It's now time to build the image for our custom application. The Dockerfile is very simple:

FROM spark:v2.4.4

COPY settings /opt/spark/app/

ENV EXECUTION_ENVIRONMENT 'local'
ENV ENABLED_TIME_REPORT 'no'

And to build and tag it, use the following commands:

bartosz:~/workspace/dockerized-spark/docker$ docker build -f ./Dockerfile . -t 'waitingforcode_spark:v0.1_spark2.4.4'
Sending build context to Docker daemon  4.096kB
Step 1/4 : FROM spark:v2.4.4
 ---> a221ab704368
Step 2/4 : COPY settings /opt/spark/app/
 ---> d9cb5174d876
Step 3/4 : ENV EXECUTION_ENVIRONMENT 'local'
 ---> Running in 38525d8811a2
Removing intermediate container 38525d8811a2
 ---> 2950f09eecd9
Step 4/4 : ENV ENABLED_TIME_REPORT 'no'
 ---> Running in 9271fb1b369e
Removing intermediate container 9271fb1b369e
 ---> d7a67b415fef
Successfully built d7a67b415fef
Successfully tagged waitingforcode_spark:v0.1_spark2.4.4
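As a side note, the two ENV instructions only define default values and can be overridden when the container starts. A quick way to check it with the tag we just built (I bypass the image's entrypoint with --entrypoint here just to print the environment; adapt the pattern to your own variables):

docker run --rm --entrypoint env \
  -e EXECUTION_ENVIRONMENT=production \
  waitingforcode_spark:v0.1_spark2.4.4 | grep -E 'EXECUTION_ENVIRONMENT|ENABLED_TIME_REPORT'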

Let's now change the EXECUTION_ENVIRONMENT to an empty string ('') and rebuild the image with a new tag version:

bartosz:~/workspace/dockerized-spark/docker$ docker build -f ./Dockerfile . -t 'waitingforcode_spark:v0.2_spark2.4.4'
Sending build context to Docker daemon  4.096kB
Step 1/4 : FROM spark:v2.4.4
 ---> a221ab704368
Step 2/4 : COPY settings /opt/spark/app/
 ---> Using cache
 ---> d9cb5174d876
Step 3/4 : ENV EXECUTION_ENVIRONMENT ''
 ---> Running in 631698607d49
Removing intermediate container 631698607d49
 ---> c6a006134cb2
Step 4/4 : ENV ENABLED_TIME_REPORT 'no'
 ---> Running in 31d9943ec419
Removing intermediate container 31d9943ec419
 ---> dbd42ad9b5fb
Successfully built dbd42ad9b5fb
Successfully tagged waitingforcode_spark:v0.2_spark2.4.4
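If you want to double-check which layers the two tags share, docker history lists the layers of every image, so you can compare both outputs side by side (the IDs will of course differ on your machine):

docker history waitingforcode_spark:v0.1_spark2.4.4
docker history waitingforcode_spark:v0.2_spark2.4.4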

As you can see in the build outputs, the layers prior to the change were kept unchanged (note the "Using cache" message on the COPY step), whereas the ones after it were rebuilt. Before terminating, let's launch our custom container on Kubernetes. I'm using here the microk8s project (check Setting up Apache Spark on Kubernetes with microk8s):

./bin/spark-submit \
  --master k8s://127.0.0.1:16443 \
  --deploy-mode cluster \
  --name spark-pi \
  --class org.apache.spark.examples.SparkPi \
  --conf spark.executor.instances=2 \
  --conf spark.kubernetes.container.image=waitingforcode_spark:v0.2_spark2.4.4 \
  --conf spark.kubernetes.authenticate.submission.caCertFile=/var/snap/microk8s/current/certs/ca.crt \
  --conf spark.kubernetes.authenticate.submission.oauthToken=${MY_TOKEN} \
  --conf spark.app.name=spark-pi \
  --conf spark.kubernetes.authenticate.driver.serviceAccountName=spark \
  local:///opt/spark/examples/target/original-spark-examples_2.11-2.4.4.jar

How to find the ${MY_TOKEN} value?

To retrieve your token, first get the default token secret by looking for a name starting with default-token-:

bartosz:/data/spark-2.4.4$ microk8s.kubectl -n kube-system get secret
NAME                                             TYPE                                  DATA   AGE
attachdetach-controller-token-2462t              kubernetes.io/service-account-token   3      2d3h
certificate-controller-token-d2msj               kubernetes.io/service-account-token   3      2d3h
clusterrole-aggregation-controller-token-kgkh5   kubernetes.io/service-account-token   3      2d3h
coredns-token-5xwj5                              kubernetes.io/service-account-token   3      52m
cronjob-controller-token-v9lj5                   kubernetes.io/service-account-token   3      2d3h
daemon-set-controller-token-mswf4                kubernetes.io/service-account-token   3      2d3h

default-token-cndqq                              kubernetes.io/service-account-token   3      2d3h

After that, issue microk8s.kubectl -n kube-system describe secret default-token-cndqq:

bartosz:/data/spark-2.4.4$ microk8s.kubectl -n kube-system describe secret default-token-cndqq
Name:         default-token-cndqq
Namespace:    kube-system
Labels:       <none>
Annotations:  kubernetes.io/service-account.name: default
              kubernetes.io/service-account.uid: 099279d3-1e07-4656-a525-638bae57f5b6

Type:  kubernetes.io/service-account-token

Data
====
ca.crt:     1103 bytes
namespace:  11 bytes
token:      MY_TOKEN
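If you prefer to export the value directly into an environment variable instead of copy-pasting it, you can extract it with a jsonpath query; the token is stored base64-encoded in the secret, hence the decoding step (adapt the secret name to your own default-token-* entry):

export MY_TOKEN=$(microk8s.kubectl -n kube-system get secret default-token-cndqq \
  -o jsonpath='{.data.token}' | base64 --decode)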

The code should compute the Pi number, as in the following screenshot taken from the microk8s dashboard:
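If you don't use the dashboard, you can also verify the run from the command line by checking the driver pod logs once the job completes; the driver pod name below is a placeholder, pick the one created for the spark-pi application:

# list the pods to find the driver created for spark-pi
microk8s.kubectl get pods
# then inspect its logs; SparkPi prints the Pi approximation in the driver output
microk8s.kubectl logs <spark-pi-driver-pod-name>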

To be honest, writing this article took a pretty long time. I'm still not fully comfortable with DevOps concepts and had to challenge all my findings by searching for counter-examples. If you have an opinion on the best practices from the first section, please share it in the comments. I'm a YARN user who is trying by all means to jump to the Kubernetes scheduler, for the time being still locally. However, I hope that the points listed in this article can give you some value and initial guidance for more advanced use than a simple local cluster execution.