A 3-day bug hunt on a 3-person team costs up to β¬7,200 in lost engineering time. This workshop teaches you to prevent that β unit tests, data tests, and integration tests for PySpark and Databricks Lakeflow, including Spark Declarative Pipelines.
Often a misconfiguration is the reason of all kinds of issues - performance, security or functional. Spark isn't an exception for this rule and it's the reason why this article focuses on configuration properties available for driver and executors.
Keeping in mind which parts of Spark code are executed on driver and which ones on workers is important and can help to avoid some of annoying errors, as the ones related to serialization.
Often making errors helps to progress. It was my case with spark-submit and local/remote JAR pair. They helped me to understand the role of driver, closures, serialization and some configuration properties.
Defining the universal workload and associating corresponding resources is always difficult. Even if most of time expected resources will support the load, there always will be some interval in the year when data activity will grow (e.g. Black Friday). One of Spark's mechanisms helping to prevent processing failures in such situations is dynamic resource allocation.
One of problems in distributed computing is the failure detection. How a master node can know that some of its workers went down just a minute ? A popular and quite simple solution uses heartbeats sent at regular interval by the workers. Spark also implements this technique.
The communication in distributed systems is an important element. The cluster members rarely share the hardware components and the single solution to communicate is the exchange of messages in the client-server model.