Pre-commit hook and Declarative Automation Bundles example on waitingforcode.com

https://github.com/bartosz25/databricks-playground/tree/main/pre-commit-hook

Enforcing team conventions is never an easy task. In my experience, I've moved from clearly defined collaboration conventions on a wiki page to code review checklists, but neither has made my life easier. The only tools that have truly helped are the Git hooks I've been using for over a decade.

Bartosz Konieczny

My personal use of hooks evolved from manually applied instructions, later automated with a Make task, to scripts installed automatically from dev dependencies. But before I show you the latest and greatest applied to a Declarative Automation Bundles use case, I owe you a few words of introduction.

Git hooks 101

A hook is a specific point in a program's execution where the software lets you intercept the process and trigger custom code without modifying the original program. In Git, it translates to defining custom behavior that may block your local commit or your push to the remote branch. The full list of hooks is available in the hooks documentation, but to better illustrate what is going on, let's focus on the hooks applied to probably the two most popular Git operations, commit and push:

As you can see in the diagram, whenever you execute a Git operation, the underlying setup invokes some additional code that may prevent your changes from being committed or pushed (the pre- hooks).

Hooks setup

Technically, hooks live in a dedicated directory of your Git setup. Here is an example of a directory without any custom hook, containing only the default samples, for the code snippets used in my Data Engineering Design Patterns book:

.git/hooks
|-- applypatch-msg.sample
|-- commit-msg.sample
|-- fsmonitor-watchman.sample
|-- post-update.sample
|-- pre-applypatch.sample
|-- pre-commit.sample
|-- pre-merge-commit.sample
|-- pre-push.sample
|-- pre-rebase.sample
|-- pre-receive.sample
|-- prepare-commit-msg.sample
|-- push-to-checkout.sample
`-- update.sample

When you want to add a new hook, you simply put an executable file in the .git/hooks directory. The file name must correspond to the hook you are about to set up, i.e. for a pre-commit hook you will create a .git/hooks/pre-commit file. Consequently, if you need to run multiple scripts for one hook, you need to orchestrate them within the corresponding hook file. Or, if you are a Python user, you can install the pre-commit package and delegate the orchestration to this framework.
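
To make the manual orchestration more concrete, here is a minimal sketch of a .git/hooks/pre-commit file that runs several scripts one after another. The pre-commit.d directory and the run_hook_scripts function are assumptions of this sketch, not Git conventions; Git itself only looks for the .git/hooks/pre-commit file.

```shell
#!/usr/bin/env bash
# Hypothetical .git/hooks/pre-commit delegating to several scripts.
# The pre-commit.d directory name is an assumption made for this sketch.

run_hook_scripts() {
    local scripts_dir="$1"
    local script
    # Nothing to run when the directory is missing.
    [ -d "$scripts_dir" ] || return 0
    for script in "$scripts_dir"/*; do
        # Skip anything that is not an executable regular file.
        [ -f "$script" ] && [ -x "$script" ] || continue
        if ! "$script"; then
            echo "Hook script failed: $script" 1>&2
            return 1
        fi
    done
    return 0
}

# A non-zero exit status makes Git abort the commit.
run_hook_scripts "$(dirname "$0")/pre-commit.d" || exit 1
```

The scripts run in the order of the shell glob expansion, so prefixes such as 01- and 02- are one way to control the sequence. The first failing script stops the loop, and the commit with it.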

Pre-commit and Python

The pre-commit library is an additional abstraction for Git hooks that makes their declaration familiar to Python users. Instead of creating and copying hooks manually, you need to create a .pre-commit-config.yaml file with the list of all hooks to run. Next you execute the pre-commit install command to generate a new pre-commit hook file. The hook file simply invokes the steps defined in the YAML file. Below is an example of the generated hook content:

#!/usr/bin/env bash
# File generated by pre-commit: https://pre-commit.com
# ID: 138fd403232d2ddd5efb44317e38bf03

# start templated
INSTALL_PYTHON=/Users/bartosz/workspace/wfc/python-examples/.venv/bin/python
ARGS=(hook-impl --config=.pre-commit-config.yaml --hook-type=pre-commit)
# end templated

HERE="$(cd "$(dirname "$0")" && pwd)"
ARGS+=(--hook-dir "$HERE" -- "$@")

if [ -x "$INSTALL_PYTHON" ]; then
    exec "$INSTALL_PYTHON" -mpre_commit "${ARGS[@]}"
elif command -v pre-commit > /dev/null; then
    exec pre-commit "${ARGS[@]}"
else
    echo '`pre-commit` not found.  Did you forget to activate your virtualenv?' 1>&2
    exit 1
fi

Hooks and Declarative Automation Bundles

Since Git hooks are pretty flexible constructs, only your imagination - and sometimes technical constraints - will limit you. That's why you should consider my example as a simple illustration and not as a demonstration of everything you can use hooks for in a Databricks context.

One of the DAB commands is validate:

databricks bundle validate --target dev --profile personal_free_edition
Name: poe_demo
Target: dev
Workspace:
  Host: https://XXXXXX.cloud.databricks.com
  User: contact@waitingforcode.com
  Path: /Workspace/Users/contact@waitingforcode.com/.bundle/poe_demo/dev

Validation OK!

When your DAB has some issues, such as invalid syntax, the validate command reports them. Below is the warning you get if you use an array of strings instead of an array of key-value pairs for task dependencies:

databricks bundle validate --target dev --profile personal_free_edition
Warning: expected map, found string
  at resources.jobs.sample_job.tasks[0].depends_on[0]
  in resources/sample_job.job.yml:8:15

Name: poe_demo
Target: dev
Workspace:
  Host: https://XXXXXX.cloud.databricks.com
  User: contact@waitingforcode.com
  Path: /Workspace/Users/contact@waitingforcode.com/.bundle/poe_demo/dev

Found 1 warning

You see my point: this mistake will hurt you only when you deploy the bundle. If you can deploy only from your CI/CD pipeline, this syntax error possibly means one additional fix commit before you can actually test the deployed version. Of course, you can also run the validation manually and fix the error, but you need to remember to do it. On the other hand, if you include the validate command in the pre-commit hook, Git will do it for you (as long as you remember to install the hook 🙃).

Let's see how to include this validation step in the development process. First, we need to add an additional dev dependency:

[dependency-groups]
dev = [
    # ...
    "pre-commit==4.5.1"
]

The next step is to create the .pre-commit-config.yaml file with the DAB validation step:

repos:
  - repo: local
    hooks:
      - id: dab-validation
        verbose: true
        name: Validate Dab
        entry: poe validate_dab
        language: system
        fail_fast: true
        pass_filenames: false

The hook invokes a Poe task executing this shell command:

[tool.poe.tasks.validate_dab]
help = "Validates Declarative Automation Bundle, including warnings detection."
shell = """
    # Capture the validation report so it can be scanned for warnings
    bundle_outcome=$(databricks bundle validate -t dev --profile personal_free_edition)
    if echo "$bundle_outcome" | grep -q 'warning'; then
      echo "A warning detected in the bundle: "
      # Quoting keeps the line breaks of the original report
      echo "$bundle_outcome"
      exit 1
    else
      echo "Bundle is valid"
    fi
"""
interpreter = 'bash'
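
The warning detection from the task above can also be isolated and tried without calling the Databricks CLI. The sketch below is only an illustration: the bundle_has_warnings function name is hypothetical, and the sample report is copied from the validation output shown earlier.

```shell
#!/usr/bin/env bash
# Returns 0 (success) when the validation report passed as the first argument
# mentions a warning, and 1 otherwise. The function name is hypothetical.
bundle_has_warnings() {
    local report="$1"
    # -q: print nothing, only the exit status matters
    # -i: match both "Warning" and "warning"
    printf '%s\n' "$report" | grep -qi 'warning'
}

# Report fragment taken from the failing validation shown earlier.
sample_report='Warning: expected map, found string
  at resources.jobs.sample_job.tasks[0].depends_on[0]
  in resources/sample_job.job.yml:8:15

Found 1 warning'

if bundle_has_warnings "$sample_report"; then
    echo "A warning detected in the bundle"
else
    echo "Bundle is valid"
fi
```

Keep in mind that matching on the word itself is a simple heuristic: a bundle or workspace path containing "warning" would also trigger it.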

The next step is to install the hook with uv run pre-commit install:

uv run pre-commit install
      Built pre-commit-hook @ file:///Users/bartosz/workspace/wfc/databricks-playground/pre-commit-hook
Uninstalled 1 package in 0.97ms
Installed 1 package in 1ms
pre-commit installed at .git/hooks/pre-commit

Now, if you try to commit our invalid DAB definition, you'll be blocked by the hook.

Git hooks are a well-established extension in software engineering projects, and they absolutely have a place in a data engineering context. After all, we also write code, make style mistakes, and break unit tests. However, hooks shouldn't replace a CI/CD pipeline. They are designed to make the development process easier by reducing feedback loops. But since they can be skipped, all your code quality requirements should still be enforced at the final gate to production: your CI/CD pipeline.
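
To illustrate how easily a hook can be skipped, the sketch below builds a throwaway repository with a pre-commit hook that always fails, and then commits twice: once normally and once with the --no-verify flag of git commit. The repository, file name, and user identity are illustrative assumptions only.

```shell
#!/usr/bin/env bash
# Throwaway repository showing that a failing pre-commit hook blocks a regular
# commit but not a commit made with --no-verify. All names are illustrative.
repo=$(mktemp -d)
cd "$repo" || exit 1
git init --quiet .
mkdir -p .git/hooks

# A hook that always rejects the commit.
printf '#!/bin/sh\necho "blocked by pre-commit"\nexit 1\n' > .git/hooks/pre-commit
chmod +x .git/hooks/pre-commit

echo "demo" > file.txt
git add file.txt

# Helper passing a demo identity so the sketch does not depend on global config.
commit() {
    git -c user.name=demo -c user.email=demo@example.com commit --quiet -m "demo" "$@"
}

if commit; then regular="committed"; else regular="blocked"; fi
if commit --no-verify; then skipped="committed"; else skipped="blocked"; fi

echo "regular commit: $regular"
echo "--no-verify commit: $skipped"
```

The second commit goes through because --no-verify bypasses the pre-commit hook, which is exactly why the same validation should also run in the CI/CD pipeline.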
