Spark SQL reorder join articles

4-day workshop Β· In-person or online

What would it take for you to trust your Databricks pipelines in production?

A 3-day bug hunt on a 3-person team costs up to €7,200 in lost engineering time. This workshop teaches you to prevent that β€” unit tests, data tests, and integration tests for PySpark and Databricks Lakeflow, including Spark Declarative Pipelines.

Unit, data & integration tests
Medallion architecture & Lakeflow SDP
Max 10 participants Β· production-ready templates
See the full curriculum β†’ €7,000 flat fee Β· cohort of up to 10
Bartosz Konieczny
Bartosz
Konieczny

Reorder JOIN optimizer

One of the reasons why I like my blogging activity is that from time to time the exchange is bidirectional. It happens mostly on Github but also on the comments under the post and I appreciate the situation when I don't know the answer and must dig a little to explain it in a blog post :) I wrote this one thanks to bithw1 issue created on my Spark playground repository (thank you for another interesting question btw :)).

Continue Reading β†’

Reorder JOIN optimizer - cost-based optimization

In my previous post I explained how Apache Spark can reorder JOINs based on the logical plan. Today I'll focus on another aspect of reordering which uses cost estimation for the proposed plans.

Continue Reading β†’

Reorder JOIN optimizer - star schema

I didn't know that join reordering is quite interesting, though complex, topic in Apache Spark SQL. The queries not only can be transformed into the ones using JOIN ... ON clauses. They can also be reordered accordingly to the star schema which we'll try to see in this post.

Continue Reading β†’