Docker-composing Apache Spark on YARN image on waitingforcode.com

Versions: Apache Spark 2.3.0 https://github.com/bartosz25/spark-docker

Some months ago I wrote notes about my experience building a Docker image for a Spark on YARN cluster. Recently I decided to improve the project and convert it to the Docker-compose format.

This post summarizes that conversion. It's composed of several short sections, each describing a problem I encountered during the task and the solution I chose. Along the way, the post explains some Docker-compose features, such as networking, replicas and configuration.

First thoughts

With Docker-compose we can define and run multi-container Docker applications. Thus, it fits pretty well a fake-clustered version of Spark on YARN, where all the nodes run as containers on a single host. It works by defining multiple services, each backed by an image, inside a YAML file, just like here:

version: '3'
services:
  master:
    image: "spark_compose_master:latest"
  slave:
    image: "spark_compose_slave:latest"

As you can see, the approach resembles the Kubernetes one. The difference lies of course in the available options which, in the case of Docker-compose, are broadly the same as the ones used to run containers with Docker's CLI: environment, volumes, port mappings and so on. Thanks to that, "translating" the Bash-scripted version to Docker-compose shouldn't be a big deal.
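
To make this mapping more concrete, here is a minimal sketch of how typical docker run flags could translate to a service definition. The environment variable and the shared directory are hypothetical and only serve the illustration; they don't come from the project:

```yaml
version: '3'
services:
  master:
    image: "spark_compose_master:latest"
    # docker run -e SPARK_ENV=dev (hypothetical variable)
    environment:
      - SPARK_ENV=dev
    # docker run -v "$(pwd)/shared:/home/sparker/shared" (hypothetical directory)
    volumes:
      - ./shared:/home/sparker/shared
    # docker run -p 8088:8088
    ports:
      - "8088:8088"
```

Each key corresponds to one CLI flag, so a Bash script full of docker run calls can be moved to the YAML file almost option by option.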

Network

The first translation step was the creation of spark-network, previously done with the docker network create --subnet=172.18.0.0/16 spark-network command. A network with a fixed subnet lets us assign known addresses, which makes the discovery between slave and master possible. In Docker-compose it's declared outside of the services scope, at the root of the document:

networks:
  spark-network:
    driver: bridge
    ipam:
      driver: default
      config:
      - subnet: 172.18.0.0/16

The declaration looks easy but 2 properties were mysterious to me at the time of writing: driver and ipam. The former sets the network driver; bridge is the default one for Docker networks. It uses a software bridge that lets the containers connected to it communicate with each other, and it's suited for containers running on the same host, as in our case. The latter concerns IP Address Management (IPAM), that is the planning, managing and tracking of the IP addresses used in a network. One of the IPAM driver's responsibilities is controlling the assignment of IP addresses.

Services then reference the network pretty intuitively, with the same networks keyword:

services:
  master:
    networks:
      spark-network:
        ipv4_address: 172.18.0.20
    hostname: spark-master

We've just assigned the IP 172.18.0.20 and the spark-master hostname to the master node. Thanks to that we can add this host to the slaves (the equivalent of Docker CLI's --add-host option):

services:
  slave:
    extra_hosts:
      - "spark-master:172.18.0.20"
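
Put together, the fragments shown so far could be assembled into a single file like the sketch below. Attaching the slave to spark-network without a fixed address is my assumption here; the other values come from the snippets above:

```yaml
version: '3'
services:
  master:
    image: "spark_compose_master:latest"
    hostname: spark-master
    networks:
      spark-network:
        # static address, so that the slaves can resolve the master
        ipv4_address: 172.18.0.20
  slave:
    image: "spark_compose_slave:latest"
    # assumption: the slave joins the same network with a dynamic address
    networks:
      - spark-network
    extra_hosts:
      - "spark-master:172.18.0.20"
networks:
  spark-network:
    driver: bridge
    ipam:
      driver: default
      config:
        - subnet: 172.18.0.0/16
```

Leaving the slave without a static address keeps the service definition reusable when several slave containers are started from it.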

While working on the network I also wondered what the difference is between the ports exposed in a Dockerfile and the ones defined in the docker-compose file. In fact, the ports listed under the ports key of the docker-compose file are actually published on the host, while EXPOSE in a Dockerfile is only a kind of documentation telling the person using the image which ports can be published.
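
The docker-compose file has a similar distinction between its ports and expose keys. The sketch below is only an illustration; the choice of which port goes where is mine, not the project's:

```yaml
version: '3'
services:
  master:
    image: "spark_compose_master:latest"
    # published: mapped to the host, reachable at localhost:8088
    ports:
      - "8088:8088"
    # exposed only: reachable from other containers on the same network,
    # but not published to the host machine
    expose:
      - "8032"
```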

Slave instances

After the pretty quick setup of the master node it's time to work on the slaves. Their configuration is pretty straightforward, and the first complicated thing consists of starting 2 or more slave instances. After some quick research, the number of slave instances turned out to be the first parameter that I couldn't configure inside the Docker-compose template.

The working command creating x slave containers is the following:

docker-compose up --scale slave=x

Please note that Docker-compose provides another way to scale the number of containers, the docker-compose scale slave=x command. However, it's deprecated and, moreover, the two solutions don't behave the same way. The --scale flag inherits the behavior of docker-compose up, so it recreates already started containers if their configuration changed or their image was updated. The deprecated scale command doesn't behave that way. To prevent the recreation of already started containers with docker-compose up, the --no-recreate flag should be used.

Refactoring

After all these steps, the Docker-composed Spark on YARN was almost finished. But before claiming victory I noticed that the Dockerfiles for master and slave were nearly the same, apart from some subtle differences in entrypoints and configuration files. Both could be handled with appropriate configuration files but there is also another approach: Dockerfile composition.

The composition led to one base image where all the common Spark and Hadoop stuff (downloading packages, installing libraries...) is done. The master and slave images are then built on top of that base image, with some extra operations. For instance, the slave's Dockerfile ended up with the following content:

FROM spark_base_image:latest

# Copy start script 
COPY ./scripts/run_slave.sh ./run.sh
USER root 
RUN chmod +x /home/sparker/run.sh 
RUN echo "sparker ALL=(ALL) NOPASSWD: ALL" >> /etc/sudoers

USER sparker

COPY ./conf-slave/yarn-site.xml ./hadoop-$HADOOP_VERSION/etc/hadoop
COPY ./conf-slave/core-site.xml ./hadoop-$HADOOP_VERSION/etc/hadoop

ENTRYPOINT ./run.sh

# node manager ports
EXPOSE 8040 
EXPOSE 8042
EXPOSE 22
EXPOSE 8030
EXPOSE 8031
EXPOSE 8032
EXPOSE 8033
EXPOSE 8088
EXPOSE 10020
EXPOSE 19888

Refactoring problem

Unfortunately, the refactoring brought a new problem: the command creating the HDFS directories used by Spark History didn't work anymore. After some digging, the reason turned out to be an undefined JAVA_HOME... even though it was defined as an environment variable in the Dockerfile and was still there after the refactoring.

My first thought was that environment variables defined in a parent image aren't inherited. However, after some research I found that this supposition was wrong. Moreover, echo $JAVA_HOME executed in an already running container correctly displayed "/usr/lib/jvm/java-8-oracle/".

Some minutes later I found a solution that consists of adding export JAVA_HOME=/usr/lib/jvm/java-8-oracle/ to the /etc/hadoop/hadoop-env.sh file. After recreating the containers, the error in the HDFS directories creation disappeared.

This post lists the points I've learned during the refactoring of the Docker image for the Spark on YARN project. After presenting docker-compose as a templated form of Docker's CLI in the first section, the subsequent parts covered networking, scalability and image composition. We could see that a lot of concepts from the CLI can be easily translated to YAML Docker-compose files. Docker-compose thus appears as a declarative way to create, start and manage containers, pretty similar to what we can do with Docker CLI commands.

TAGS: #dockerizing Big Data #Spark Docker #YARN Docker