Spark architecture members

Versions: Spark 2.0.0

The knowledge of Spark's API is not a single useful thing. It's also so important to know when and by who programs are executed.

New ebook 🔥

Learn 84 ways to solve common data engineering problems with cloud services.

👉 I want my copy

This article talks about architecture in Spark. It presents the concepts of driver, executor and cluster manager. Each of them is described in separate part. Global architecture looks like in below image:

Driver in Spark

Spark's driver a Java process in charge of: context and RDD creation, tasks scheduling or hosting web UI. To see it simpler, it's a main() method of Java program where all computations are defined.

The driver is the first step of Spark's code execution flow. This flow can be resumed in these points:

  1. User's program starts on driver
  2. Driver computes physical execution of program computation
  3. Driver connects to cluster manager and asks for available executors able to handle computed physical tasks
  4. Driver asks executors to execute tasks
  5. Tasks are executed by executors and their results send back to driver
  6. When all results are collected on SparkContext is explicitly closed, driver program ends

Executor in Spark

Executors compose Spark application together with driver. They can not only execute tasks defined by the driver but also store cached RDDs.

To tell to the driver that they're alive, the executors send heartbeats by respecting the interval time defined in spark.executor.heartbeatInterval configuration entry. If one of executors doesn't send signal within defined delay, driver considers it as failed. After this detection, the driver tries next to trigger the task of failed executor in another executor.

But there is another case when driver can decide to launch a task on different executor. When executed task seems to be slow, driver can start its second execution on other executor. Triggered task is called a speculative copy. Finally, the result computed by faster executor will be used.

Cluster manager in Spark

As already told, it's a middleman between driver and executors. This pluggable component accepts the demand of driver and executes it by triggering executors in charge of tasks computation. Executors are chosen according to data location to optimize network communication.

Several types of cluster managers exist:

If you liked it, you should read:

The comments are moderated. I publish them when I answer, so don't worry if you don't see yours immediately :)

📚 Newsletter Get new posts, recommended reading and other exclusive information every week. SPAM free - no 3rd party ads, only the information about waitingforcode!