Semantic versioning with a Databricks volume-based package

Versions: Python Semantic Release 9.21.1

One of the recommended ways of sharing a library on Databricks is to use Unity Catalog and store the packages in volumes. That's the theory, but the question remains: how do you connect the dots between the release preparation and the release process? I'll try to answer this in this blog post.


Semantic versioning

Before you start writing any code for the released package, you need to define a versioning strategy. A popular standard is semantic versioning, which implies a three-part MAJOR.MINOR.PATCH version number where:

- the MAJOR version increments when you make backward-incompatible API changes;
- the MINOR version increments when you add functionality in a backward-compatible manner;
- the PATCH version increments when you make backward-compatible bug fixes.

The full specification is available in the semantic versioning 2.0.0 documentation.
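
This convention becomes actionable once you tie it to commit messages. Below is a hedged sketch of how conventional commits typically map to version bumps (the messages are made up for illustration):

fix: handle empty partitions in the writer        # patch: 1.2.3 -> 1.2.4
feat: add a Delta Lake output format              # minor: 1.2.3 -> 1.3.0
feat: rework the writer API
BREAKING CHANGE: write() now requires a schema    # major: 1.2.3 -> 2.0.0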

Semantic release package

Good news is, you don't need to manage the release lifecycle on your own, at least not in Python. There is an open source library called python-semantic-release that you can use to release the major, minor, and patch versions of a library. The python-semantic-release library will:

- parse the commit messages since the last release to determine the next version number;
- update the version declared in your project files, such as pyproject.toml;
- commit and tag the new version in the Git repository;
- optionally generate a changelog and publish the release, for example on Github.

Overall, the minimal workflow without the automatic push to the repository looks like this:
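
As a hedged sketch, the flow without the automatic push could look like this (--no-push is a documented flag of the version command in Python Semantic Release 9.x):

poetry run semantic-release version --no-push  # compute the next version from the commits, update pyproject.toml, commit and tag locally
git push origin main --follow-tags             # push the release commit and its tag once you validated the result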

We can leverage this workflow to implement a library release workflow on top of Databricks Unity Catalog.

Databricks CI/CD workflow

Unity Catalog volumes are one of many ways to install libraries on Databricks. Compared to another popular solution, a package repository such as Artifactory or Sonatype Nexus, volumes offer a native and relatively easy way to access the libraries and include them in notebooks or jobs without any extra tool to manage (including credentials management, tool maintenance, etc.). However, you're fully responsible for organizing the artifacts and managing their upload to the volumes. As you can see, the solution covered in this blog post has some drawbacks, and you might also consider it as a temporary solution while you're waiting for the package repository setup.
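
On the consumer side, a library stored in a volume installs with a single line in a notebook; the path below is a hypothetical example:

%pip install /Volumes/wfc/lib/artifacts/final/my_lib-1.2.0-py3-none-any.whl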

Anyway, integrating this volume-based artifacts management with the semantic release consists of two deployment jobs:

- an alpha (prerelease) job, run from a feature branch, that uploads a prerelease version to a dedicated volume directory;
- a final release job, run from the main branch, that uploads the stable version to another directory.

The code for both parts is quite similar. The alpha release starts by calling these two commands:

poetry run semantic-release version --prerelease
poetry run semantic-release publish

After executing them on your feature branch, the semantic release library will increment the alpha version in your pyproject.toml and publish the release on Github. The Github publication step is not required and can be disabled, but it's a good protection against releasing the same prerelease version twice, whether by retrying yourself or by a different person working at the same time.
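
Worth noting, the default prerelease token of the library is rc. Getting alpha versions like here presumably requires a branch configuration similar to this sketch (the group name and the match pattern are assumptions):

[tool.semantic_release.branches.alpha]
match = "feature/.*"   # hypothetical pattern for the feature branches
prerelease = true
prerelease_token = "alpha"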

Once the prerelease is ready, it's time to upload the built library to a Databricks volume. A few lines of bash should do the job:

# Resolve the version and the wheel produced by the prerelease step
VERSION_TO_RELEASE=$(poetry version --short)
FILE_TO_COPY_PATH=$(ls dist/*.whl)
FILE_TO_COPY_NAME=$(basename "$FILE_TO_COPY_PATH")
# Target directory in the Unity Catalog volume
DATABRICKS_VOLUME="dbfs:/Volumes/wfc/lib/artifacts/alpha"
databricks fs mkdir "${DATABRICKS_VOLUME}"
databricks fs cp "$FILE_TO_COPY_PATH" "${DATABRICKS_VOLUME}/${FILE_TO_COPY_NAME}"
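
To verify the copy, listing the target directory should show the freshly uploaded wheel:

databricks fs ls "${DATABRICKS_VOLUME}"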

An important thing to keep in mind is to run both steps separately. Otherwise, if for whatever reason the upload fails, you will need to retry the prerelease step too, which might lead to an increasing number of releases without the copied artifact. The next picture shows the separated steps vs. the integrated one that multiplies prerelease versions in case of errors:
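
Expressed as a pipeline skeleton, the separation could look like the snippet below; the two script names are hypothetical, and in a real CI/CD tool they would be two independently retriable steps:

./run_prerelease.sh    # step 1: semantic-release version --prerelease + publish
./upload_to_volume.sh  # step 2: databricks fs cp; safe to retry alone if the upload fails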

The commands to build the final release are...:

poetry run semantic-release version
poetry run semantic-release publish

...while the upload code only differs in the target directory:

DATABRICKS_VOLUME="dbfs:/Volumes/wfc/lib/artifacts/final"

If you need to customize the release process, the Python Semantic Release library can have its own section in the pyproject.toml. You can find a customization example below:

[tool.semantic_release]
commit_message = "[RELEASE] {version}" # custom message for the commit created by the `version` command

# Since v8, the library no longer uploads to PyPI itself, so there is nothing
# to disable there; we manage the artifacts on Databricks instead.
[tool.semantic_release.publish]
upload_to_vcs_release = true # attach the built distributions to the Github Release

[tool.semantic_release.remote]
token = { env = "CICD_GITHUB_TOKEN" } # custom environment variable with the Github token
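
The token is then expected in the environment when the release commands run, for example (the value is a placeholder):

export CICD_GITHUB_TOKEN="<your-github-token>"
poetry run semantic-release version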

Circular releases

An important thing to keep in mind is that the semantic release plugin pushes a commit with the new version to your Git repository. Consequently, if you trigger your CI/CD pipelines on any push, you may end up with indefinitely triggering releases or, at best, a step to cancel if your release process requires manual authorization.

One way to avoid this issue is to put a specific keyword in the semantic release commit message. For example, if you use Github Actions, you can use one of the [skip ci], [ci skip], [no ci], [skip actions], or [actions skip] keywords (cf. https://docs.github.com/en/actions/how-tos/managing-workflow-runs-and-deployments/managing-workflow-runs/skipping-workflow-runs).

The semantic release plugin supports commit message customization under the commit_message entry:

[tool.semantic_release]
# ...
commit_message = "[skip ci] chore(release): release {version}"

Semantic release is one of the ways to release a package that can be fully automated, with the help of the Python Semantic Release plugin and conventional commits. It's not the only way; you can still opt for a more manual approach where you will have to write more code but will get a chance to customize the deployment workflow a bit better.
