Ops practices in the Apache Spark project

A good CI/CD process avoids many pitfalls related to manual operations. Apache Spark has one too, based on GitHub Actions. Since this part of the project has been a bit of a mystery to me, I wanted to spend some time exploring it.

Bartosz Konieczny

Workflows

I'll start with the GitHub Actions workflows. The first one builds the project and runs the tests against the build. The workflow is defined in the Apache Spark GitHub project but can also run in any forked repository. The Developer tools documentation even recommends running the Build and test workflow in your fork before creating the Pull Request, to avoid burdening the limited GitHub Actions resources allocated to the Apache Spark project. Besides this manual execution, the workflow also runs after pushing changes to any branch. You'll find its definition in the build_and_test.yml file.

The Build and test workflow triggers another one called Report test results, which pushes the test results to the Pull Request as a summary and a PR check. Below you can find an example from one of Hyukjin Kwon's contributions:

These two PR-related automations are not the only ones that run before new changes are merged into the main branch. There is also a workflow responsible for automatically labeling the Pull Request. I've always wondered how it works, and it turns out to be quite simple. The labeling uses the GitHub Labeler action, which matches the modified files against the path patterns defined in .github/labeler.yml and attaches the corresponding labels to the Pull Request.
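
To make the matching idea concrete, here is a minimal Python sketch of the principle. It is not the labeler's actual implementation: the label names and patterns are hypothetical stand-ins for the content of .github/labeler.yml, and `labels_for` is a helper invented for this illustration. It also uses Python's `fnmatch`, where `*` crosses directory separators, which may differ from the glob semantics of the real tool.

```python
from fnmatch import fnmatch

# Hypothetical label-to-pattern mapping; the real rules live in .github/labeler.yml.
LABEL_PATTERNS = {
    "SQL": ["sql/*"],
    "PYTHON": ["python/*", "bin/pyspark*"],
    "BUILD": [".github/*", "dev/*", "pom.xml"],
}


def labels_for(changed_files):
    """Return the sorted labels whose patterns match at least one changed file."""
    matched = set()
    for label, patterns in LABEL_PATTERNS.items():
        # A label applies as soon as one changed path matches one of its patterns.
        if any(fnmatch(path, pattern) for path in changed_files for pattern in patterns):
            matched.add(label)
    return sorted(matched)


print(labels_for(["python/pyspark/sql/functions.py", "dev/lint-python"]))
```

With these assumed patterns, a change touching a file under python/ and another under dev/ receives both the PYTHON and BUILD labels, while a file matching no pattern receives none.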

Besides the aforementioned workflows, Apache Spark has a few others, listed here in alphabetical order:

Others

But Apache Spark ops is not only about workflows. I discovered a few other interesting concepts:

When it comes to the code itself, the project leverages some external tools to ensure code quality. PySpark uses Mypy for type checking and Coverage.py for code coverage reports, whereas the Scala side ensures code consistency with the Scalastyle module.
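
As a small illustration of what type checking brings, here is a sketch of an annotated function. The function name and its default value are hypothetical and do not come from the PySpark code base. The annotations change nothing at runtime, but they let a checker such as Mypy flag incorrect calls before the code is executed.

```python
from typing import Optional


def parse_partition_count(raw: Optional[str], default: int = 200) -> int:
    """Return the partition count encoded in `raw`, or `default` when it is missing."""
    if raw is None:
        return default
    return int(raw)


# A type checker would reject the call below, because a list is not Optional[str]:
# parse_partition_count(["8"])
print(parse_partition_count("8"))
```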

Apache Spark is an established open source project, and proper ops tooling has been a part of its success. I hope this article showed you something new that you might want to integrate into your internal projects!
