A 3-day bug hunt on a 3-person team costs up to β¬7,200 in lost engineering time. This workshop teaches you to prevent that β unit tests, data tests, and integration tests for PySpark and Databricks Lakeflow, including Spark Declarative Pipelines.
Even if it's always better to explicit things, in programming we have often the possibility to let the computer to guess. Spark SQL also has this level of intelligence, for example during schema resolving.
Structured data processing takes more and more place in Apache Spark project. Structured streaming is one of the proofs. But how does Spark SQL work - and particularly, how does it load data from sources of structured data as RDMBS ?
Without any explicit definition, Spark SQL won't partition any data, i.e. all rows will be processed by one executor. It's not optimal since Spark was designed to parallel and distributed processing.
Some months ago I presented save modes in Spark SQL. However, this post was limited to their use in files. I was quite surprised to observe some specific behavior of them for RDBMS sinks. Especially for SaveMode.Overwrite.
Some recent posts covered important Spark SQL options for RDBMS: partitioning and write modes. However they're not the only ones available for this data storage.