Dockerize Spark on YARN - lessons learned on waitingforcode.com

Versions: Spark 2.1.0

Even though a lot of Docker images already exist for Apache Spark, it's always a good exercise to build one on your own. It helps to understand some new concepts and to improve your skills in building Docker images.


In this post we'll build a Docker image that lets us run Spark 2.1.0 in cluster mode, with Hadoop YARN as the resource manager. The first part explains some steps from the Dockerfile. The second part describes the first non-working version and zooms in on the cause of its problems. The third part shows the second non-working version and the solutions applied.

Spark and YARN installation

Spark installation consists mainly of unpacking the downloaded binary distribution and defining some environment variables.

Installing Hadoop with YARN is also pretty straightforward. The first step consists of downloading the Hadoop binary distribution and unpacking it somewhere on disk. After that, some configuration must be customized. The yarn-site.xml file must be enriched with the yarn.resourcemanager.hostname entry, which defines the hostname of the ResourceManager. It also needs some entries defining logging behavior: yarn.log-aggregation-enable enables log aggregation and yarn.nodemanager.remote-app-log-dir specifies the place where the logs will be written. Changes must also be made in the core-site.xml file, where the entry defining the filesystem to use (fs.default.name) is added.
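The configuration entries listed above can be sketched as follows. This is only an illustration, not the content of the project's files: the CONF_DIR variable, the /tmp/yarn-logs path and the hdfs://spark-master:9000 value are assumptions, and the script writes to a scratch directory unless CONF_DIR is set.

```shell
#!/bin/bash
# Assumption: CONF_DIR is a hypothetical variable naming the target directory;
# by default the files land in a scratch directory, not in a real Hadoop install.
CONF_DIR="${CONF_DIR:-$(mktemp -d)}"

cat > "$CONF_DIR/yarn-site.xml" <<'EOF'
<?xml version="1.0"?>
<configuration>
  <property>
    <name>yarn.resourcemanager.hostname</name>
    <value>spark-master</value>
  </property>
  <property>
    <name>yarn.log-aggregation-enable</name>
    <value>true</value>
  </property>
  <property>
    <name>yarn.nodemanager.remote-app-log-dir</name>
    <!-- illustrative path, not taken from the project -->
    <value>/tmp/yarn-logs</value>
  </property>
</configuration>
EOF

cat > "$CONF_DIR/core-site.xml" <<'EOF'
<?xml version="1.0"?>
<configuration>
  <property>
    <name>fs.default.name</name>
    <!-- assumed address of the master; adapt it to your setup -->
    <value>hdfs://spark-master:9000</value>
  </property>
</configuration>
EOF

echo "Configuration written to $CONF_DIR"
```

In a Dockerfile, files like these would typically be copied into Hadoop's configuration directory rather than generated at runtime.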

YARN ResourceManager

The ResourceManager is YARN's master. It manages the available resources and schedules applications on them. Its activity can be tracked in the web UI, exposed by default on port 8088.

The Dockerfile mainly consists of downloading packages, exposing environment variables and ports, and defining an ENTRYPOINT. All implementation details are included in the GitHub project of Spark Docker.

Docker ENTRYPOINT

The ENTRYPOINT instruction in Docker defines the command executed when the container starts.

First version (not working)

The first attempt was a basic version built only on the unpacked Hadoop and Spark binary distributions and the start script below:

#!/bin/bash

echo "Starting YARN resource manager"
exec ~/hadoop-2.7.3/sbin/start-yarn.sh > ./start.log &

echo "Starting Spark master..."
exec start-master.sh -h spark-master -p 7077 &

After building the image, the container was started with the docker run -p 127.0.0.1:7077:7077 -p 127.0.0.1:8088:8088 --hostname spark-master --name $IMAGE_NAME -i -t $CONTAINER_NAME command. The above script was included in the Dockerfile and executed by the ENTRYPOINT instruction (thus at the container's start). The problem was that the container exited as soon as it was launched, without any critical error:

Step 30/30 : ENTRYPOINT ./run.sh
 ---> Running in 9b7469cd51cc
 ---> 4dea2b4b7136
Removing intermediate container 9b7469cd51cc
Successfully built 4dea2b4b7136
Creating new runnable container
Starting YARN resource manager
Starting Spark master...
bartosz:~/projects/spark_docker/v1/master$ docker ps
CONTAINER ID        IMAGE               COMMAND             CREATED             STATUS              PORTS               NAMES
bartosz:~/projects/spark_docker/v1/master$ docker ps -a
CONTAINER ID        IMAGE               COMMAND                  CREATED             STATUS                      PORTS               NAMES
25cf6ea2a716        spark_yarn_master   "/bin/sh -c ./run.sh"    12 seconds ago      Exited (0) 10 seconds ago                       spar_yarn_master_image

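So what was wrong with the entrypoint script? Normally a Docker container exits when its main process (here the entrypoint script) finishes. Since both commands in the script were sent to the background with &, the script itself ended right away, and the container with it. To prevent the container from exiting immediately after start, the startup script was changed and an infinite loop was added at the end: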

while true; do sleep 1000; done

After that the container ran correctly.
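Put together, the revised run.sh could look like the sketch below. The paths and options are copied from the first version. The heredoc wrapper and the final syntax check are only there so that the snippet can be tried without starting anything; they are not part of the original project.

```shell
#!/bin/bash
# Writes a revised run.sh into the current directory (nothing is started here).
cat > run.sh <<'EOF'
#!/bin/bash

echo "Starting YARN resource manager"
# Sent to the background, as in the first version
~/hadoop-2.7.3/sbin/start-yarn.sh > ./start.log &

echo "Starting Spark master..."
start-master.sh -h spark-master -p 7077 &

# Keep the main process alive; without it the script ends and the container exits
while true; do sleep 1000; done
EOF
chmod +x run.sh

# Syntax check only
bash -n run.sh && echo "run.sh parsed without syntax errors"
```

The loop does nothing useful by itself; its only role is to keep the container's main process running while the daemons work in the background.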

Second (and still not working) version

After starting the master and slave containers correctly, I tested job submission with the JARs shipped with the Spark binary distribution. As expected, the executed command (spark-submit --class org.apache.spark.examples.SparkPi --master yarn --deploy-mode cluster --driver-memory 2g --executor-memory 1g --executor-cores 1 ~/spark-2.1.0-bin-hadoop2.7/examples/jars/spark-examples*.jar 10) successfully registered a new job in YARN. The problem was that the job remained stuck in the ACCEPTED state.

The problem was related to host discovery. After adding a mapping between the master's hostname and its IP to the slave's /etc/hosts file, the submitted job got moving. But this change was made manually after starting the container, which wasn't optimal. To improve that, I decided to create a Docker network and assign static IPs to the containers with Docker's --ip argument:

# created Docker network
docker network create --subnet=172.18.0.0/16 spark-network

# - assigns IP (--ip) 
# - defines mapping in /etc/hosts (--add-host)
docker run -p 127.0.0.1:7077:7077 -p 127.0.0.1:8088:8088 --net spark-network --hostname spark-master --ip 172.18.0.20  --add-host="spark-slave:172.18.0.21" --name $IMAGE_NAME -i -t
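To verify that the mapping is really in place, a small check like the one below can be run inside each container. It's a minimal sketch, assuming that getent is available in the image; check_host is a hypothetical helper name, and the hostnames are the ones used in this post.

```shell
#!/bin/bash
# Prints the resolved entry and returns 0 when the name resolves, 1 otherwise.
check_host() {
  local host="$1"
  if getent hosts "$host"; then
    return 0
  fi
  echo "cannot resolve $host" >&2
  return 1
}

# Hostnames from this post; a failure is only reported, the loop goes on
for host in spark-master spark-slave; do
  check_host "$host" || true
done
```

If one of the names cannot be resolved, the corresponding --add-host argument (or /etc/hosts entry) is most likely missing in that container.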

Other encountered problems

But the described visibility problem was not the only one encountered during the construction of the Spark Docker image. The following list contains 3 others that, being easier to fix, aren't described in detail in this post:

long running tasks - certain tasks, such as yarn resourcemanager, print their output to the console. With tasks running this way it's hard to add extra commands to the run.sh file defined in ENTRYPOINT. One possible fix was redirecting the output to a file with the > $file 2>&1 instruction.
connection issue to NameNode - the slave needs to be able to connect to the HDFS NameNode. Initially, the value defined in fs.default.name was localhost:9000. However, it's the IP assigned to the master that must be defined in this property.
SSH prompt - the ssh connection prompted for confirmation every time. Adding commands such as ssh -oStrictHostKeyChecking=no spark-master uptime to the entrypoint scripts helped to fix the issue.
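The output redirection from the first point can be illustrated with the sketch below. The noisy_task function is only a stand-in for a chatty command such as yarn resourcemanager, and LOG_FILE is a hypothetical log location.

```shell
#!/bin/bash
# Stand-in for a command that writes to both stdout and stderr
noisy_task() {
  echo "regular output"
  echo "error output" >&2
}

LOG_FILE="$(mktemp)"   # hypothetical log location

# stdout goes to the file, stderr follows it (2>&1), and & frees the console
noisy_task > "$LOG_FILE" 2>&1 &
wait $!   # only needed here, to be sure the log is complete before reading it

echo "script continues; the log was written to $LOG_FILE"
```

The order matters: 2>&1 must come after > "$LOG_FILE", otherwise stderr would still point to the console.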

With a Docker network, there is no longer any need to edit /etc/hosts files manually. As shown in the snippet above, Docker provides another argument, --add-host, which specifies the mapping between a host and its IP, separated by ":". Thus, after adding it to my 2 docker run commands (master and slave), my 2-node cluster became operational.

After these iterative steps, a basic Spark cluster can be used to launch sample applications. The code itself is not optimal. It could be refactored in several places, and Docker Compose could probably help to avoid several duplications. However, this post gave some insight into methods that can be used to solve common issues with dockerized master-slave containers. The first part showed what was physically (archives, steps) needed to build both images. The second part presented the first encountered problem, the container exiting right after its start. The last part described the visibility problem between containers and its solution, which consists of creating a Docker network and defining IPs within the range of this network.

