Spark architecture members

The knowledge of Spark's API is not a single useful thing. It's also so important to know when and by who programs are executed.

4-day workshop · In-person or online

What would it take for you to trust your Databricks pipelines in production?

A 3-day bug hunt on a 3-person team costs up to €7,200 in lost engineering time. This workshop teaches you to prevent that — unit tests, data tests, and integration tests for PySpark and Databricks Lakeflow, including Spark Declarative Pipelines.

Unit, data & integration tests
Medallion architecture & Lakeflow SDP
Max 10 participants · production-ready templates
See the full curriculum → €7,000 flat fee · cohort of up to 10
Bartosz Konieczny
Bartosz
Konieczny

This article talks about architecture in Spark. It presents the concepts of driver, executor and cluster manager. Each of them is described in separate part. Global architecture looks like in below image:

Driver in Spark

Spark's driver a Java process in charge of: context and RDD creation, tasks scheduling or hosting web UI. To see it simpler, it's a main() method of Java program where all computations are defined.

The driver is the first step of Spark's code execution flow. This flow can be resumed in these points:

  1. User's program starts on driver
  2. Driver computes physical execution of program computation
  3. Driver connects to cluster manager and asks for available executors able to handle computed physical tasks
  4. Driver asks executors to execute tasks
  5. Tasks are executed by executors and their results send back to driver
  6. When all results are collected on SparkContext is explicitly closed, driver program ends

Executor in Spark

Executors compose Spark application together with driver. They can not only execute tasks defined by the driver but also store cached RDDs.

To tell to the driver that they're alive, the executors send heartbeats by respecting the interval time defined in spark.executor.heartbeatInterval configuration entry. If one of executors doesn't send signal within defined delay, driver considers it as failed. After this detection, the driver tries next to trigger the task of failed executor in another executor.

But there is another case when driver can decide to launch a task on different executor. When executed task seems to be slow, driver can start its second execution on other executor. Triggered task is called a speculative copy. Finally, the result computed by faster executor will be used.

Cluster manager in Spark

As already told, it's a middleman between driver and executors. This pluggable component accepts the demand of driver and executes it by triggering executors in charge of tasks computation. Executors are chosen according to data location to optimize network communication.

Several types of cluster managers exist: