Variables in Databricks Asset Bundles

https://github.com/bartosz25/databricks-playground/tree/main/env-variables-dab

Variables are an essential part of any deployment process. You don't want to write a dedicated YAML file or Python script for every environment, do you? Databricks Asset Bundles (DAB) are no exception, as their variable handling is designed to significantly simplify your workflow.


Variables referencing other variables

DAB variables exist to avoid repetition, exactly like variables in Python. However, what if many variables share the same root? You can imagine different jobs in the bundle processing different directories inside the same volume. Without any optimization, it gives a variables declaration like this:

variables:
  items_catalog_path:
    default: /Volumes/wfc/default/items
  orders_path:
    default: /Volumes/wfc/default/orders

It works, but you'll notice the volume name is duplicated. Since all variables are defined in the same file (databricks.yml), updating them manually is straightforward. However, to minimize the risk of missing a declaration during an update, you can reference variables within other variables like this:

variables:
  volume_for_data:
    default: /Volumes/wfc/default
  items_catalog_path:
    default: ${var.volume_for_data}/items/
  orders_path:
    default: ${var.volume_for_data}/orders/
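Either variable can then be referenced anywhere in the bundle. As a sketch, a hypothetical job task reading the items directory could look like this (the job name, task key, and notebook path are illustrative, not from the original bundle):

```yaml
# Illustrative resource definition; names and paths are assumptions.
resources:
  jobs:
    items_loader:
      name: items_loader
      tasks:
        - task_key: load_items
          notebook_task:
            notebook_path: ../src/load_items
            base_parameters:
              input_path: ${var.items_catalog_path}
```

If the volume ever moves, only volume_for_data changes and every dependent path follows.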

Lookup

Another great feature of variables is the lookup capability. Quite often, you or your DevOps colleagues will declare Databricks resources directly within the Terraform provider. If you're lucky, they will use consistent naming across environments (workspaces), making your life as an end-user much easier. With consistent naming, referencing those variables in your bundle is as simple as this:

variables:
  warehouse_id_sql_queries:
    lookup:
      warehouse: "Data processing warehouse endpoint"

With that, your bundle deploys seamlessly to development, staging, or production workspaces. You don't need any magical scripts to retrieve the SQL warehouse id or, even worse, to hardcode those ids directly in the bundle definition.
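The resolved id can then be plugged into any task that needs a warehouse, for example a hypothetical SQL task (the task key and query file path below are assumptions for illustration):

```yaml
# Illustrative task; the query file path is an assumption.
tasks:
  - task_key: refresh_report
    sql_task:
      warehouse_id: ${var.warehouse_id_sql_queries}
      file:
        path: ../queries/refresh_report.sql
```

At deployment time, DAB resolves the lookup against the target workspace and substitutes the matching warehouse's id.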

Complex variables

Another convenient feature of variables is their support for complex types. A complex type is anything other than a scalar. For example, it could be a struct used to define cluster requirements, or a list used to store users who should receive notifications after a job's execution:

variables:
  failure_extra_users:
    type: complex
    default:
        - user1@myorganization.com
        - user2@myorganization.com
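The whole list can then be passed as-is to a job's notification settings, for example (the job skeleton below is illustrative, only email_notifications matters here):

```yaml
# Illustrative job; the list variable expands into on_failure.
resources:
  jobs:
    orders_job:
      name: orders_job
      email_notifications:
        on_failure: ${var.failure_extra_users}
```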

Mailing lists

Ideally, you should use mailing lists for notifications. That way you manage the group in a single place and can reuse the distribution list in places other than your bundle.

Variables substitution

Remember my words about luck and consistent naming conventions? If you aren't so lucky, or if your resource type is not yet supported by the lookup, you'll have to fetch the underlying values for your Databricks resources manually and inject them as variables from the CI/CD environment.

Thankfully, DAB also supports variable injection from the outside, either with --var="my_variable_1=value_1,my_variable_2=value_2", or with environment variables prefixed by BUNDLE_VAR_. If we wanted to reference a variable dynamically, we could declare it in the bundle as:

variables:
  a_variable:
    description: Something coming from the env.

...and with the injection:

export BUNDLE_VAR_a_variable=a-b-c
databricks bundle deploy --target dev
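The same deployment works without environment variables, passing the value through the --var flag instead:

```shell
databricks bundle deploy --target dev --var="a_variable=a-b-c"
```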

Parameters vs. named parameters

The final section covers dynamic value references and named parameters. Named parameters are an excellent way to improve the readability of your Lakeflow jobs. Unlike standard parameters, named parameters appear directly on the job page. Furthermore, declaring them in the bundle is much cleaner and more intuitive. There is no longer a need to define a complex array of strings. Instead, you simply declare a map:

python_wheel_task:
  package_name: wfc
  entry_point: run_parameters
  named_parameters:
    volume_path: ${var.volume}
    max_files_per_trigger: 10
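On the Python side of the wheel, named parameters arrive as regular command-line flags, so the entry point can parse them with plain argparse. A minimal sketch, assuming the run_parameters entry point from the task above (everything inside the function body is an illustrative assumption):

```python
import argparse
import sys


def run_parameters(argv=None):
    # Named parameters are delivered as --key=value command-line flags.
    parser = argparse.ArgumentParser()
    parser.add_argument("--volume_path", required=True)
    parser.add_argument("--max_files_per_trigger", type=int, default=1)
    args = parser.parse_args(argv if argv is not None else sys.argv[1:])
    print(f"Reading from {args.volume_path}, "
          f"max {args.max_files_per_trigger} files per trigger")
    return args


if __name__ == "__main__":
    run_parameters()
```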

However, there is one tricky part. If you want to use a dynamic value reference, such as the job name, and declare it this way:

job_name: {{ job.name }}

...you'll run into trouble:

Error: failed to load /Users/bartosz/wfc/dab/test/resources/sample.job.yml: yaml (/Users/bartosz/wfc/dab/test/resources/sample.job.yml:11:25): key is not a scalar

To fix this error, you need to wrap the value in quotes and declare it as:

job_name: "{{job.name}}"
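The error comes from YAML itself, not from DAB: an unquoted {{ ... }} is parsed as a flow mapping whose key is itself a mapping, and mapping keys must be scalars (or at least hashable). A quick check with PyYAML (assuming the pyyaml package is available) shows both behaviors:

```python
import yaml

# Unquoted: {{ job.name }} is a flow mapping keyed by another mapping,
# which the YAML constructor rejects.
try:
    yaml.safe_load("job_name: {{ job.name }}")
except yaml.YAMLError as error:
    print(f"rejected: {type(error).__name__}")

# Quoted: just a plain string, substituted later by DAB.
print(yaml.safe_load('job_name: "{{job.name}}"'))
```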

DABs are no different: as with Python, variables are your best friend for avoiding maintenance headaches. A single variable declaration can be referenced throughout your project, making future changes seamless and far less risky.

Consulting

With nearly 17 years of experience, including 9 as data engineer, I offer expert consulting to design and optimize scalable data solutions. As an O’Reilly author, Data+AI Summit speaker, and blogger, I bring cutting-edge insights to modernize infrastructure, build robust pipelines, and drive data-driven decision-making. Let's transform your data challenges into opportunities—reach out to elevate your data engineering game today!

👉 contact@waitingforcode.com
🔗 past projects