Query optimizers are impressive. They profile your data, estimate costs, and pick execution strategies faster than any human could. But they don't know everything and when they guess wrong, the consequences show up in your bill and your SLAs. Hints are the escape hatch. A small instruction you embed directly in SQL that tells the engine: I know something you don't
What would it take for you to trust your Databricks pipelines in production?
A 3-day bug hunt on a 3-person team costs up to €7,200 in lost engineering time. This workshop teaches you to prevent that — unit tests, data tests, and integration tests for PySpark and Databricks Lakeflow, including Spark Declarative Pipelines.
Konieczny
Hints 101
To understand hints, you need to understand the problem they were built to solve. Early relational databases used rule-based optimizers. They were choosing execution plans by following fixed priority rules such as always preferring an index over a full table scan. This was predictable, but rigid because the optimizer had no awareness of actual data distribution, table sizes, or statistics. It couldn't distinguish between an index on a 100-row table and one on a billion-row table.
Things have changed with Oracle who in the early 1990s proposed not only a new cost-based optimization engine but also a way to influence its decision, the hints! It simplified the migration process from the rule-based to the cost-based world because users could still suggest better execution plans to the engine when they knew better.
That being said, modern query engines ultimately decide whether hints are relevant. If they aren't, the engine may simply ignore them and, eventually, notify you after the fact. The notification is not required and some databases silently ignore your suggestions.
When it comes to the technical implementation, hints are often expressed as inline /*+ ... */ comments in the query.
Hints and Databricks
On Databricks, hints are written as SQL comments and concern:
- Partitioning. Similarly to Apache Spark, you can drive the number of partitions processing the selected rows.
- Joins. Besides data organization at reading, you can also suggest a better join execution strategy.
Historically Databricks supported skew join hint but it was retired in favor of the Adaptive Query Execution.
Partitioning hints
The partitioning hints let you suggest a better partitioning strategy for the query execution. If you are familiar with Apache Spark, you'll find in this category different data reorganization actions, such as:
- Coalesce. The first hint reduces the number of partitions without data shuffling. Put differently, it simply combines existing partitions into bigger ones.
- Repartition. It changes the number of partitions with a data shuffle. The new partitions are either created from a simple number of partitions, a column, or a column with a number.
- Repartition by range. It's similar to the Repartition hint except for the execution mode. The by range repartitioner samples the data to repartition and creates adequate ranges with the samples. It's not the case of the Repartition hint where all the rows are analyzed for repartitioning.
- Rebalance. This hint is helpful when you want a more balanced size of the output partitions. To use it, you have to enable the Adaptive Query Execution.
To see them in action, let's create a skewed events table:
Join hints
Joins are the second category where you can apply hints. Here too if you are familiar with Apache Spark you should find some familiar terms:
- Broadcast. This hint broadcasts the annotated table to other partitions. The hint ignores the autoBroadcastJoinThreshold parameter that may lead you into runtime issues if by mistake you broadcasted a too big table.
- Merge. Here you ask the query engine to use the sort merge join strategy where both sides are first sorted and later merged to the common row. You'll find more details on this execution mode in my blog post from 2017 about Sort-merge join in Spark SQL.
- Range join. This hint suggests usage of the range join strategy for non-equi joins with range predicates like e.event_time BETWEEN range.start AND range.end. Without the hint the execution engine would probably fall into a costly Cartesian product generation. With the range join hint, the time values from the range expression are transformed into fixed-size bins which makes possible using the hash-based lookup instead of nested loops on replicated records.
- Shuffle hash. With this you ask the query engine to apply the shuffle hash join strategy where rows sharing the same hash are moved to the same executor. You also find more details for this mode in this 2017's blog post, Shuffle join in Spark SQL.
- Shuffle replicate nested loop. It's a hint for a more optimized Cartesian product generation, i.e. when you need to compare all rows from one table with all rows from another table. How does it optimize this complex operation? First, the hint shuffles rows from both tables to the configured shuffle partitions. Next, it replicates one side - usually the smaller one - fully (replication) so that each row from this side is present on each shuffle partition. In the end, the query engine performs a nested loop inside each shuffle partition and compares the uniquely shuffled rows with the replicated ones. Consequently, only one side is fully shuffled here which is not the case of a dummy Cartesian product implementation.
Curious to see the join hints in action? Let's take a look at the demo:
Even though modern compute engines know a lot about the processed data and optimized execution strategies, hints remain are a good way to say "I know this query will work better if you execute it in that particular way". As you saw, it's also possible to do on Databricks - at least, as long as you know what you are doing!
Data Engineering Design Patterns
Looking for a book that defines and solves most common data engineering problems? I wrote
one on that topic! You can read it online
on the O'Reilly platform,
or get a print copy on Amazon.
I also help solve your data engineering problems contact@waitingforcode.com đź“©
Read also about Hints on Databricks here:
- Optimizer Hints in Database Systems: A Survey Hints Range join optimization Spark SQL join on waitingforcode.com
Related blog posts:
- ASOF Join in Apache Spark SQL
- Broadcast join - complementary notes for local shuffle reader
- Regression tests with Apache Spark SQL joins
- Sort-merge join in Spark SQL
- Shuffle join in Spark SQL
