Hints on Databricks

Query optimizers are impressive. They profile your data, estimate costs, and pick execution strategies faster than any human could. But they don't know everything and when they guess wrong, the consequences show up in your bill and your SLAs. Hints are the escape hatch. A small instruction you embed directly in SQL that tells the engine: I know something you don't

4-day workshop · In-person or online

What would it take for you to trust your Databricks pipelines in production?

A 3-day bug hunt on a 3-person team costs up to €7,200 in lost engineering time. This workshop teaches you to prevent that — unit tests, data tests, and integration tests for PySpark and Databricks Lakeflow, including Spark Declarative Pipelines.

Unit, data & integration tests
Medallion architecture & Lakeflow SDP
Max 10 participants · production-ready templates
See the full curriculum → €7,000 flat fee · cohort of up to 10
Bartosz Konieczny
Bartosz
Konieczny

Hints 101

To understand hints, you need to understand the problem they were built to solve. Early relational databases used rule-based optimizers. They were choosing execution plans by following fixed priority rules such as always preferring an index over a full table scan. This was predictable, but rigid because the optimizer had no awareness of actual data distribution, table sizes, or statistics. It couldn't distinguish between an index on a 100-row table and one on a billion-row table.

Things have changed with Oracle who in the early 1990s proposed not only a new cost-based optimization engine but also a way to influence its decision, the hints! It simplified the migration process from the rule-based to the cost-based world because users could still suggest better execution plans to the engine when they knew better.

That being said, modern query engines ultimately decide whether hints are relevant. If they aren't, the engine may simply ignore them and, eventually, notify you after the fact. The notification is not required and some databases silently ignore your suggestions.

When it comes to the technical implementation, hints are often expressed as inline /*+ ... */ comments in the query.

Hints and Databricks

On Databricks, hints are written as SQL comments and concern:

Historically Databricks supported skew join hint but it was retired in favor of the Adaptive Query Execution.

Partitioning hints

The partitioning hints let you suggest a better partitioning strategy for the query execution. If you are familiar with Apache Spark, you'll find in this category different data reorganization actions, such as:

To see them in action, let's create a skewed events table:

Join hints

Joins are the second category where you can apply hints. Here too if you are familiar with Apache Spark you should find some familiar terms:

Curious to see the join hints in action? Let's take a look at the demo:

Even though modern compute engines know a lot about the processed data and optimized execution strategies, hints remain are a good way to say "I know this query will work better if you execute it in that particular way". As you saw, it's also possible to do on Databricks - at least, as long as you know what you are doing!

Data Engineering Design Patterns

Looking for a book that defines and solves most common data engineering problems? I wrote one on that topic! You can read it online on the O'Reilly platform, or get a print copy on Amazon.

I also help solve your data engineering problems contact@waitingforcode.com đź“©