https://github.com/bartosz25/spark-docker
Some months ago I wrote some notes about my experience of building a Docker image for Spark on a YARN cluster. Recently I decided to improve the project and transform it to the Docker-compose format.
This post summarizes the experience of that transformation. It's composed of several small sections, each describing a problem I encountered during the task and the chosen solution. Along the way, the post explains some Docker-compose features, such as networking, replicas or configuration.
First thoughts
With Docker-compose we can define and run multi-container Docker applications. Thus, it fits pretty well the fake-clustered Spark on YARN setup executed in standalone mode. It works by declaring multiple services and their images inside a YAML file, just like here:
version: '3'
services:
  master:
    image: "spark_compose_master:latest"
  slave:
    image: "spark_compose_slave:latest"
As you can see, the way of working looks like the Kubernetes one. The differences are of course the available options which, in the case of Docker-compose, are globally the same as during container execution with Docker's CLI: environment, volumes, ports mapping and so on. Thanks to that, "translating" the Bash-scripted version to Docker-compose shouldn't be a big deal.
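To illustrate that mapping, here is a hypothetical service definition; the SPARK_MODE variable and the volume path are made up for the example, only the image name and the 8088 port come from the project:

services:
  master:
    image: "spark_compose_master:latest"
    environment:            # docker run -e SPARK_MODE=master
      - SPARK_MODE=master
    volumes:                # docker run -v ./conf:/opt/spark/conf
      - ./conf:/opt/spark/conf
    ports:                  # docker run -p 8088:8088
      - "8088:8088"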
Network
The first translation step was the creation of spark-network, previously done with the docker network create --subnet=172.18.0.0/16 spark-network command. A static network makes the discovery between slave and master possible. In Docker-compose it's declared outside of the services scope, at the root of the document:
networks:
  spark-network:
    driver: bridge
    ipam:
      driver: default
      config:
        - subnet: 172.18.0.0/16
The declaration looks easy but two properties were mysterious at the moment of writing: driver and ipam. The former option represents the network driver used by the Docker network. The bridge type uses a software bridge that allows the containers connected to it to communicate. It's suited to containers running on the same host, as in our case. The second concept concerns IP Address Management (IPAM). IPAM is software to plan, manage and track the IP addresses used in a network. One of the IPAM driver's responsibilities is controlling IP address assignment.
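Once the stack is up, the resulting IPAM configuration can be checked with docker network inspect. A quick sanity check, assuming the Compose project is named spark-docker (Compose prefixes the network name with the project name):

# print the IPAM settings (driver + 172.18.0.0/16 subnet) of the Compose-managed network
docker network inspect --format '{{json .IPAM}}' spark-docker_spark-network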
Networks are used in services pretty intuitively with the same networks keyword:
services:
  master:
    networks:
      spark-network:
        ipv4_address: 172.18.0.20
    hostname: spark-master
We've just assigned the IP 172.18.0.20 and the spark-master hostname to the master node. Thanks to that we can add this host (the equivalent of Docker CLI's --add-host option) to the slaves:
services:
  slave:
    extra_hosts:
      - "spark-master:172.18.0.20"
At this "network" moment I also wondered what is the difference between ports exposed directly from Dockerfile and the ones defined in docker-compose file. In fact the ports from docker-compose file are officially published while the ones in Dockerfile are only a kind of information about the ports that can be published for the person using the image.
Slave instances
After a pretty quick setup of the master node, it's time to work on the slaves. Their configuration is for now pretty straightforward and one of the first complicated things to do consists of running 2 or more slave instances. After some quick research, the number of slave instances appears to be the first parameter that can't be configured inside the Docker-compose template.
The working command creating x slave containers is the following:
docker-compose up --scale slave=x
Please note that Docker-compose provides another method to scale the number of containers, the docker-compose scale slave=x command. However, it's deprecated and, moreover, the two solutions don't behave identically. The --scale flag inherits the behavior of docker-compose up and thus will recreate already started containers if their configuration changed or their image was replaced with a new one. The deprecated scale command doesn't behave that way. To prevent the recreation of already started containers, the --no-recreate flag should be used.
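Putting both flags together, starting the stack with 2 slaves while keeping already running containers untouched could look like this (the -d detached flag is optional):

# bring the stack up with 2 slave containers, without recreating running containers
docker-compose up -d --no-recreate --scale slave=2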
Refactoring
After all these steps the Docker-composed Spark on YARN was almost finished. But before claiming victory I noticed that the Dockerfiles for master and slave were mostly the same, except for some subtle differences in the entrypoints and configuration files. Both could be handled with appropriate configuration files but there is also another approach: Dockerfile composition.
The composition led to one main image where all the Spark and Hadoop common stuff (downloading packages, installing libraries...) is done. Later the master and slave images are built on top of that base image, with some extra operations. For instance, the slave's Dockerfile ended up with the following content:
FROM spark_base_image:latest

# Copy start script
COPY ./scripts/run_slave.sh ./run.sh
USER root
RUN chmod +x /home/sparker/run.sh
RUN echo "sparker ALL=(ALL) NOPASSWD: ALL" >> /etc/sudoers
USER sparker
COPY ./conf-slave/yarn-site.xml ./hadoop-$HADOOP_VERSION/etc/hadoop
COPY ./conf-slave/core-site.xml ./hadoop-$HADOOP_VERSION/etc/hadoop
ENTRYPOINT ./run.sh

# node manager ports
EXPOSE 8040
EXPOSE 8042
EXPOSE 22
EXPOSE 8030
EXPOSE 8031
EXPOSE 8032
EXPOSE 8033
EXPOSE 8088
EXPOSE 10020
EXPOSE 19888
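For completeness, the shared spark_base_image referenced by the FROM instruction could look roughly like the sketch below. The base image choice, the versions, the download URLs and the packages are assumptions; only the overall idea (common downloads and the sparker user factored out of master and slave) reflects the refactoring described here:

FROM ubuntu:16.04

# versions are placeholders - adapt them to the real project
ENV HADOOP_VERSION 2.7.3
ENV SPARK_VERSION 2.2.0

# common packages needed by both master and slave images
# (Java 8 installation omitted; the project relies on /usr/lib/jvm/java-8-oracle)
RUN apt-get update && apt-get install -y curl ca-certificates openssh-server \
    && rm -rf /var/lib/apt/lists/*

# unprivileged user shared by the derived images
RUN useradd -m sparker
USER sparker
WORKDIR /home/sparker

# download and unpack Hadoop and Spark once, in the base image
RUN curl -fsSL https://archive.apache.org/dist/hadoop/common/hadoop-${HADOOP_VERSION}/hadoop-${HADOOP_VERSION}.tar.gz | tar -xzf - \
    && curl -fsSL https://archive.apache.org/dist/spark/spark-${SPARK_VERSION}/spark-${SPARK_VERSION}-bin-hadoop2.7.tgz | tar -xzf -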
Refactoring problem
Unfortunately, the refactoring brought a new problem: the command creating the HDFS directories used by Spark History didn't work anymore. After some digging, the reason turned out to be an undefined JAVA_HOME... which had been defined as an environment variable in the original Dockerfile and wasn't carried over during the refactoring.
My first thought was a lack of inheritance for environment variables defined in the parent image. However, after some research, I found this was not a valid supposition. Moreover, echo $JAVA_HOME executed in an already running container correctly displayed "/usr/lib/jvm/java-8-oracle/".
Some minutes later I found another solution consisting of adding export JAVA_HOME=/usr/lib/jvm/java-8-oracle/ to the /etc/hadoop/hadoop-env.sh file. After recreating the containers, the error on HDFS directory creation disappeared.
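That change could also be applied at image build time. A minimal sketch, assuming the file in question is the hadoop-env.sh shipped in the unpacked Hadoop distribution (hadoop-$HADOOP_VERSION/etc/hadoop/hadoop-env.sh, as in the COPY instructions above):

# make JAVA_HOME visible to the Hadoop scripts, not only to the container's shell environment
RUN echo "export JAVA_HOME=/usr/lib/jvm/java-8-oracle/" >> ./hadoop-$HADOOP_VERSION/etc/hadoop/hadoop-env.sh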
This post groups the points I learned while refactoring the Docker image for the Spark on YARN project. After considering docker-compose as a templated form of Docker's CLI in the first section, the subsequent parts described what I learned about networking, scalability and image composition. We could see that a lot of concepts from the CLI can be easily translated to YAML Docker-compose files. With this more declarative form we let Docker-compose create and bring up our services. It appears then as a declarative way to manage containers, covering pretty much everything we can do with Docker CLI commands.