Spark architecture members on waitingforcode.com

Versions: Spark 2.0.0

The knowledge of Spark's API is not a single useful thing. It's also so important to know when and by who programs are executed.

Data Engineering Design Patterns

Looking for a book that defines and solves most common data engineering problems? I wrote one on that topic! You can read it online on the O'Reilly platform, or get a print copy on Amazon.

I also help solve your data engineering problems 👉 contact@waitingforcode.com 📩

This article talks about architecture in Spark. It presents the concepts of driver, executor and cluster manager. Each of them is described in separate part. Global architecture looks like in below image:

Driver in Spark

Spark's driver a Java process in charge of: context and RDD creation, tasks scheduling or hosting web UI. To see it simpler, it's a main() method of Java program where all computations are defined.

The driver is the first step of Spark's code execution flow. This flow can be resumed in these points:

User's program starts on driver
Driver computes physical execution of program computation
Driver connects to cluster manager and asks for available executors able to handle computed physical tasks
Driver asks executors to execute tasks
Tasks are executed by executors and their results send back to driver
When all results are collected on SparkContext is explicitly closed, driver program ends

Executor in Spark

Executors compose Spark application together with driver. They can not only execute tasks defined by the driver but also store cached RDDs.

To tell to the driver that they're alive, the executors send heartbeats by respecting the interval time defined in spark.executor.heartbeatInterval configuration entry. If one of executors doesn't send signal within defined delay, driver considers it as failed. After this detection, the driver tries next to trigger the task of failed executor in another executor.

But there is another case when driver can decide to launch a task on different executor. When executed task seems to be slow, driver can start its second execution on other executor. Triggered task is called a speculative copy. Finally, the result computed by faster executor will be used.

Cluster manager in Spark

As already told, it's a middleman between driver and executors. This pluggable component accepts the demand of driver and executes it by triggering executors in charge of tasks computation. Executors are chosen according to data location to optimize network communication.

Several types of cluster managers exist:

standalone - the simplest, based on master/worker architecture. Each application has at most 1 executor by worker node. By default, Spark tries to use the maximum number of different nodes. If for example the cluster has 4 workers and we configured the use of 4 cores and 4GB RAM, Spark will take 1 core and 1GB RAM of each node to execute the program.
Hadoop YARN - cluster manager coming from Hadoop and based on HDFS. Thanks to that it can read data stored in HDFS quicker.
Apache Mesos - general cluster manager which can be used in 2 modes: fine-grained (one node can share its CPU among different executors running on it) and coarse-grained (fixed number of CPUs reserved to executors in advance).

This post shows 3 main actors of Spark. The first one, driver, is the one from who all begins. It stored main program to execute. Next, it contacts the second actor, cluster manager, to get a list of executors able to compute planned tasks. These executors are the 3rd actors. Their responsibility consists on really executing tasks scheduled by driver.

Consulting

With nearly 16 years of experience, including 8 as data engineer, I offer expert consulting to design and optimize scalable data solutions. As an O’Reilly author, Data+AI Summit speaker, and blogger, I bring cutting-edge insights to modernize infrastructure, build robust pipelines, and drive data-driven decision-making. Let's transform your data challenges into opportunities—reach out to elevate your data engineering game today!

👉 contact@waitingforcode.com
🔗 past projects