Even if a lot of Docker images already exist for Apache Spark, it's always a good exercise to build one on your own. It helps to understand some new concepts and to improve your skills in building Docker images.
Through this post we'll build a Docker image allowing us to run Spark 2.1.0 in cluster mode, with Hadoop YARN as the resource manager. The first part explains some steps of the Dockerfile. The second part describes the first, not working, version and zooms in on the reason for the problem. The third part shows the second, still not working, version and the solutions applied.
Spark and YARN installation
The Spark installation consists mainly of unpacking the downloaded binary distribution and defining some environment variables.
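As an illustration, these steps could look like the sketch below, for instance inside a Dockerfile RUN instruction; the download mirror and target paths are assumptions:

# download and unpack the Spark 2.1.0 binary distribution built for Hadoop 2.7
wget -q https://archive.apache.org/dist/spark/spark-2.1.0/spark-2.1.0-bin-hadoop2.7.tgz
tar -xzf spark-2.1.0-bin-hadoop2.7.tgz -C ~/
# make Spark's binaries visible through environment variables
export SPARK_HOME=~/spark-2.1.0-bin-hadoop2.7
export PATH=$PATH:$SPARK_HOME/bin:$SPARK_HOME/sbin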
Installing Hadoop with YARN is also pretty straightforward. The first step consists of downloading the Hadoop binary distribution and unpacking it somewhere on disk. After that, some configuration customization must be done. The yarn-site.xml file must be enriched with the yarn.resourcemanager.hostname entry, which defines the hostname of the ResourceManager. It also needs some entries defining logging behavior: yarn.log-aggregation-enable enables log aggregation and yarn.nodemanager.remote-app-log-dir specifies the place where logs will be written. A change must also be made in the core-site.xml file, where the entry defining the used filesystem (fs.default.name) is added.
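As a rough sketch, this customization could be scripted as below; only the property names come from the description above, while the hostnames, paths and values are assumptions to adapt to your setup:

# hypothetical helper writing the described configuration entries
HADOOP_CONF=~/hadoop-2.7.3/etc/hadoop

cat > $HADOOP_CONF/yarn-site.xml <<'EOF'
<configuration>
  <property><name>yarn.resourcemanager.hostname</name><value>spark-master</value></property>
  <property><name>yarn.log-aggregation-enable</name><value>true</value></property>
  <property><name>yarn.nodemanager.remote-app-log-dir</name><value>/tmp/yarn-logs</value></property>
</configuration>
EOF

cat > $HADOOP_CONF/core-site.xml <<'EOF'
<configuration>
  <property><name>fs.default.name</name><value>hdfs://spark-master:9000</value></property>
</configuration>
EOF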
YARN ResourceManager
The ResourceManager is the master node. It manages available resources and schedules submitted applications on them. Its activity can be tracked in a UI, exposed by default on port 8088.
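Once the container runs, one simple way to verify that the ResourceManager answers is to query its REST API, assuming port 8088 is mapped to localhost as in the docker run commands used later:

# returns basic cluster information as JSON if the ResourceManager is up
curl -s http://localhost:8088/ws/v1/cluster/info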
The Dockerfile mainly consists of downloading packages, defining environment variables, exposing ports, and defining an ENTRYPOINT. All implementation details are included in the GitHub project of Spark Docker.
Docker ENTRYPOINT
The ENTRYPOINT command in Docker defines the action executed when the container starts.
First version (not working)
The first try was a basic version built only on the unpacked Hadoop and Spark binary distributions and the start script below:
#!/bin/bash
echo "Starting YARN resource manager"
exec ~/hadoop-2.7.3/sbin/start-yarn.sh > ./start.log &
echo "Starting Spark master..."
exec start-master.sh -h spark-master -p 7077 &
After building the image, the container was started with the docker run -p 127.0.0.1:7077:7077 -p 127.0.0.1:8088:8088 --hostname spark-master --name $IMAGE_NAME -i -t $CONTAINER_NAME command. The above script was included in the Dockerfile and executed by the ENTRYPOINT instruction (thus at the container's startup). The problem was that as soon as the container was launched, it exited without any critical error:
Step 30/30 : ENTRYPOINT ./run.sh
 ---> Running in 9b7469cd51cc
 ---> 4dea2b4b7136
Removing intermediate container 9b7469cd51cc
Successfully built 4dea2b4b7136
Creating new runnable container
Starting YARN resource manager
Starting Spark master...
bartosz:~/projects/spark_docker/v1/master$ docker ps
CONTAINER ID    IMAGE    COMMAND    CREATED    STATUS    PORTS    NAMES
bartosz:~/projects/spark_docker/v1/master$ docker ps -a
CONTAINER ID    IMAGE               COMMAND                 CREATED          STATUS                      PORTS    NAMES
25cf6ea2a716    spark_yarn_master   "/bin/sh -c ./run.sh"   12 seconds ago   Exited (0) 10 seconds ago            spar_yarn_master_image
So what was wrong with the entrypoint script? A Docker container exits when its main process (here the entrypoint script) finishes. To prevent the container from exiting immediately after start, the startup script was changed and an infinite loop was added at the end:
while true; do sleep 1000; done
After that, the container ran correctly.
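Put together, a sketch of the working entrypoint script could then look as follows; the log file names are illustrative:

#!/bin/bash
# start the YARN daemons and the Spark master in the background
~/hadoop-2.7.3/sbin/start-yarn.sh > ./yarn-start.log 2>&1 &
start-master.sh -h spark-master -p 7077 > ./spark-master.log 2>&1 &

# keep the main process alive so that Docker doesn't stop the container
while true; do sleep 1000; done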
Second (and still not working) version
After starting the master and slave containers correctly, I tested job submission with the JARs provided in the Spark binary distribution. As expected, the executed command (spark-submit --class org.apache.spark.examples.SparkPi --master yarn --deploy-mode cluster --driver-memory 2g --executor-memory 1g --executor-cores 1 ~/spark-2.1.0-bin-hadoop2.7/examples/jars/spark-examples*.jar 10) successfully registered a new job in YARN. The problem was that the job stayed stuck in the ACCEPTED state.
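Before digging into the cause, the stuck application can be confirmed from inside the master container, for instance with one of the commands below (both only read the state):

# lists the applications currently waiting in the ACCEPTED state
yarn application -list -appStates ACCEPTED
# the same information through the ResourceManager REST API
curl -s "http://localhost:8088/ws/v1/cluster/apps?states=ACCEPTED"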
The problem was related to host discovery. After adding a mapping between the master host and its IP in the slave's /etc/hosts file, the submitted job got moving. But this change was made manually after starting the container, which wasn't optimal. To improve that, I decided to create a Docker network and assign static IPs to the containers with Docker's --ip argument:
# create a Docker network
docker network create --subnet=172.18.0.0/16 spark-network

# - assigns IP (--ip)
# - defines mapping in /etc/hosts (--add-host)
docker run -p 127.0.0.1:7077:7077 -p 127.0.0.1:8088:8088 --net spark-network \
  --hostname spark-master --ip 172.18.0.20 --add-host="spark-slave:172.18.0.21" \
  --name $IMAGE_NAME -i -t
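For completeness, a hypothetical counterpart for the slave container could look like this; the $SLAVE_IMAGE_NAME and $SLAVE_CONTAINER_NAME variables are placeholders mirroring the master command:

# slave side: same network, the symmetric static IP and host mapping
docker run --net spark-network --hostname spark-slave --ip 172.18.0.21 \
  --add-host="spark-master:172.18.0.20" --name $SLAVE_IMAGE_NAME -i -t $SLAVE_CONTAINER_NAME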
Other encountered problems
But the described visibility problem was not the only one encountered during the construction of the Spark Docker image. The following list contains 3 others that, because their fixes were easier, aren't described in as much detail in this post:
- long running tasks - certain tasks, such as yarn resourcemanager, print their output to the console. With such tasks running in the foreground it's hard to add extra commands to the run.sh file defined in ENTRYPOINT. One possible fix was to redirect the output to a file with a > $file 2>&1 instruction.
- connection issue to the NameNode - the slave needs to be able to connect to the HDFS NameNode. Initially, the value defined in fs.default.name was localhost:9000. However, it's the IP assigned to the master that must be defined in this property.
- SSH prompt - the ssh connection prompted for confirmation every time. Adding commands such as ssh -oStrictHostKeyChecking=no spark-master uptime in the entrypoint scripts helped to fix the issue (an alternative sketch follows this list).
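As an alternative sketch for the SSH prompt issue, assuming the spark-master and spark-slave hostnames used throughout this post, strict host key checking can be disabled once in the ssh client configuration instead of passing the flag on every call:

# append a per-host rule to the ssh client configuration of the container's user
cat >> ~/.ssh/config <<'EOF'
Host spark-master spark-slave
    StrictHostKeyChecking no
    UserKnownHostsFile /dev/null
EOF
chmod 600 ~/.ssh/config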
With the Docker network, there is no more need to manipulate /etc/hosts files manually. As shown in the above snippet, Docker provides another argument, --add-host, allowing to specify the mapping between a host and its IP, both separated by ":". Thus, after adding it to my 2 docker run commands (master and slave), my 2-node cluster became operational.
After these iterative steps, a basic Spark cluster can be used to launch sample applications. The code itself is not optimal. It could be refactored in several places and maybe the use of Docker Compose could help to avoid some duplication. However, this post gave some insight into methods that can be used to solve common issues in master-slave dockerized containers. The first part showed what was physically (archives, steps) needed to build both images. The second part presented the first encountered problem, related to container visibility. The last part gave a solution for it, consisting of creating a Docker network and assigning IPs within the range of that network.