A 3-day bug hunt on a 3-person team costs up to β¬7,200 in lost engineering time. This workshop teaches you to prevent that β unit tests, data tests, and integration tests for PySpark and Databricks Lakeflow, including Spark Declarative Pipelines.
The use of Dataset abstraction is not a single difference between structured and unstructured data processing in Spark. Apart of that, Spark SQL uses a technique helping to get results faster.
Even if Project Tungsten was started in Spark 1.5 and Spark's current version is 2.1 at the time of writing, it's good to know what precious this Project brought to Spark.
One of powerful features of Spark SQL is dynamic generation of code. Several different layers are generated and this post explains some of them.
The optimizer in Spark SQL helps to improve the performance of processing pipelines. One of its techniques is predicate pushdown.
Pushdown predicate is one of the most popular optimizations in Spark SQL. But it's not the single one and their main list is defined in org.apache.spark.sql.catalyst.optimizer.Optimizer abstract class.
Prior to Spark 2.2.0 release, the data processing was based on a set of heuristic rules ignoring the typology of the data. But the most recent release brought a tool well known from the RDBMS world that is a Cost-Based Optimizer.
One of the reasons why I like my blogging activity is that from time to time the exchange is bidirectional. It happens mostly on Github but also on the comments under the post and I appreciate the situation when I don't know the answer and must dig a little to explain it in a blog post :) I wrote this one thanks to bithw1 issue created on my Spark playground repository (thank you for another interesting question btw :)).
In my previous post I explained how Apache Spark can reorder JOINs based on the logical plan. Today I'll focus on another aspect of reordering which uses cost estimation for the proposed plans.
I didn't know that join reordering is quite interesting, though complex, topic in Apache Spark SQL. The queries not only can be transformed into the ones using JOIN ... ON clauses. They can also be reordered accordingly to the star schema which we'll try to see in this post.