Ops practices in the Apache Spark project

Versions: Apache Spark 3.2.0

A good CI/CD process avoids many pitfalls related to manual operations. Apache Spark has one too, based on GitHub Actions. Since this part of the project has been a small mystery to me, I wanted to spend some time exploring it.


Workflows

I'll start with the GitHub Actions workflows. The first one builds the project and runs the tests on top of it. The workflow lives in the Apache Spark GitHub repository but can also run in any forked repository. The Developer tools documentation even recommends running the Build and test workflow in the forked repository before creating the Pull Request, to avoid burdening the limited GitHub Actions resources allocated to the Apache Spark project. Besides this manual execution, the workflow also runs after pushing changes to any branch. You'll find its definition in the build_and_test.yml file.
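To make this more concrete, here is a minimal, hypothetical sketch of how such triggers can be declared. It's not the real build_and_test.yml, which additionally defines the build matrix, the caching, and many dedicated test jobs:

```yaml
# Hypothetical, simplified sketch - not the real build_and_test.yml, which
# defines many more jobs (build matrix, caching, linters, ...).
name: Build and test

on:
  workflow_dispatch:   # manual run, e.g. from a fork before opening the Pull Request
  push:                # automatic run after pushing changes to any branch

jobs:
  build-and-test:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v2
      - name: Build the project and run the tests
        # Illustrative command only; the real workflow relies on dedicated dev/ scripts
        # and splits the tests across several jobs.
        run: ./dev/run-tests
```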

The Build and test workflow triggers another one called Report test results, which pushes the test results to the Pull Request as a summary and a PR check. Below you can find an example coming from one of Hyukjin Kwon's contributions:
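Under the hood, this chaining relies on GitHub Actions' workflow_run trigger, which starts one workflow when another one completes. The snippet below is a simplified, hypothetical illustration rather than the real definition, which also retrieves the test report artifacts produced by the Build and test run before publishing them:

```yaml
# Hypothetical sketch of the workflow chaining; the real "Report test results"
# workflow additionally downloads the test report artifacts and publishes them
# as the PR summary and check.
name: Report test results

on:
  workflow_run:
    workflows: ["Build and test"]
    types: [completed]

jobs:
  report:
    runs-on: ubuntu-latest
    steps:
      - name: Publish test results to the Pull Request
        run: echo "Download the test report artifacts and publish them as a PR check"
```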

These two PR-related automations are not the only ones applied to changes heading for the main branch. There is also a workflow responsible for automatically labeling the Pull Request. I've always wondered how it works, and it turns out to be quite simple. The labeling relies on the GitHub Labeler action, which matches the modified files against the path patterns defined in .github/labeler.yml and, from that, automatically applies the corresponding labels to the Pull Request.
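To give you an idea of the format, here is an illustrative extract in the spirit of that file; the label names and path globs below are examples, not a verbatim copy of Spark's configuration:

```yaml
# Illustrative .github/labeler.yml-style mapping; the labels and globs are
# examples, not the exact content of the Spark file.
SQL:
  - "sql/**/*"
PYTHON:
  - "python/**/*"
DOCS:
  - "docs/**/*"
  - "**/*.md"
```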

Besides the aforementioned workflows, Apache Spark has a few others, listed here in alphabetical order:

Others

But Apache Spark ops is not only about workflows. I discovered a few other interesting concepts:

When it comes to the code itself, the project leverages some external tools to ensure code quality. PySpark uses Mypy for type checking and Coverage.py for code coverage generation, whereas Scala-Spark ensures code consistency with the Scalastyle module.
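These checks are also easy to wire into the CI. The fragment below is a hypothetical illustration of how such a lint job could look; the dev/scalastyle and dev/lint-python scripts referenced here come from the repository's dev/ directory, but the job definition itself is not a copy of Spark's workflow:

```yaml
# Hypothetical workflow fragment showing how the linters could run in the CI;
# an illustration only, not the exact job definition from the Spark project.
jobs:
  lint:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v2
      - name: Scala style checks with Scalastyle
        run: ./dev/scalastyle
      - name: Python linters and Mypy type checks
        run: ./dev/lint-python
```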

Apache Spark is an established open source project, and proper ops tooling has been part of its success. I hope this article showed you something new that you'll find interesting to integrate into your internal projects!


