The knowledge of Spark's API is not a single useful thing. It's also so important to know when and by who programs are executed.
This article talks about architecture in Spark. It presents the concepts of driver, executor and cluster manager. Each of them is described in separate part. Global architecture looks like in below image:
Driver in Spark
Spark's driver a Java process in charge of: context and RDD creation, tasks scheduling or hosting web UI. To see it simpler, it's a main() method of Java program where all computations are defined.
The driver is the first step of Spark's code execution flow. This flow can be resumed in these points:
- User's program starts on driver
- Driver computes physical execution of program computation
- Driver connects to cluster manager and asks for available executors able to handle computed physical tasks
- Driver asks executors to execute tasks
- Tasks are executed by executors and their results send back to driver
- When all results are collected on SparkContext is explicitly closed, driver program ends
Executor in Spark
Executors compose Spark application together with driver. They can not only execute tasks defined by the driver but also store cached RDDs.
To tell to the driver that they're alive, the executors send heartbeats by respecting the interval time defined in spark.executor.heartbeatInterval configuration entry. If one of executors doesn't send signal within defined delay, driver considers it as failed. After this detection, the driver tries next to trigger the task of failed executor in another executor.
But there is another case when driver can decide to launch a task on different executor. When executed task seems to be slow, driver can start its second execution on other executor. Triggered task is called a speculative copy. Finally, the result computed by faster executor will be used.
Cluster manager in Spark
As already told, it's a middleman between driver and executors. This pluggable component accepts the demand of driver and executes it by triggering executors in charge of tasks computation. Executors are chosen according to data location to optimize network communication.
Several types of cluster managers exist:
- standalone - the simplest, based on master/worker architecture. Each application has at most 1 executor by worker node. By default, Spark tries to use the maximum number of different nodes. If for example the cluster has 4 workers and we configured the use of 4 cores and 4GB RAM, Spark will take 1 core and 1GB RAM of each node to execute the program.
- Hadoop YARN - cluster manager coming from Hadoop and based on HDFS. Thanks to that it can read data stored in HDFS quicker.
Apache Mesos - general cluster manager which can be used in 2 modes: fine-grained (one node can share its CPU among different executors running on it) and coarse-grained (fixed number of CPUs reserved to executors in advance).
This post shows 3 main actors of Spark. The first one, driver, is the one from who all begins. It stored main program to execute. Next, it contacts the second actor, cluster manager, to get a list of executors able to compute planned tasks. These executors are the 3rd actors. Their responsibility consists on really executing tasks scheduled by driver.