A 3-day bug hunt on a 3-person team costs up to β¬7,200 in lost engineering time. This workshop teaches you to prevent that β unit tests, data tests, and integration tests for PySpark and Databricks Lakeflow, including Spark Declarative Pipelines.
Serialization frameworks are intrinsic part of Big Data systems. Spark is not an exception for this rule and it offers some different possibilities to manage serialization.
Keeping in mind which parts of Spark code are executed on driver and which ones on workers is important and can help to avoid some of annoying errors, as the ones related to serialization.
Issues with not serializable objects are maybe the most painful when we start to work with Spark. But hopefully there are several solutions to them.
Some of previous posts (Serialization issues - part 1) presented some of solutions for serialization problems. This post is its continuation.
Some time ago I was wondering why an object created once in the driver is recreated every time with new stage on executors - even if this object is sent through a broadcast variable. After some code digging, the response related to Java serialization appeared.
Often making errors helps to progress. It was my case with spark-submit and local/remote JAR pair. They helped me to understand the role of driver, closures, serialization and some configuration properties.