DAIS 2024: Unit tests - configuration and declaration

Versions: Apache Spark 3.5.0

Code organization and assertions flow are both important, but even they can't guarantee that your colleagues will stick to unit tests. There are other user-facing attributes to consider as well.


This blog post introduces 3 important factors to make unit tests user friendly.


One of the important aspects that can accelerate your unit tests is the Apache Spark configuration. Concretely, I mean two properties that are often forgotten when creating a SparkSession for unit tests.

The first of them is the number of shuffle partitions, set with spark.sql.shuffle.partitions. Remember, your unit tests operate on a smaller dataset than your production jobs. As a result, you don't need the same high parallelism, which by the way impacts various operations, such as a state store that has one instance for each shuffle partition. It also impacts the I/O, as each instance has its dedicated checkpoint storage, so dedicated files to write and manage. Finally, fewer shuffle partitions mean fewer tasks to schedule and run. Although this last argument may not be that visible in production, where you have a single workload, believe me, you'll see the difference if your unit test layer validates various inputs in different jobs.

The second crucial property is spark.ui.enabled, which is true by default. If you run your unit tests locally and you're struggling with one of them, keeping the Spark UI on as a debugging aid is fine. However, leaving it on for a CI/CD job or a unit test suite executed from a Git hook doesn't make much sense, as such a run is only there to validate what you want to promote further to the deployment branch.
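To make both properties concrete, here is a minimal sketch of a test session factory. The names TEST_SPARK_CONF, resolve_test_conf and build_test_spark_session, as well as the values "2" and "local[2]", are assumptions made for the illustration and not recommendations from the talk.

```python
# Hypothetical defaults for unit tests; tune the values to your own suite.
TEST_SPARK_CONF = {
    "spark.sql.shuffle.partitions": "2",  # small datasets don't need the default of 200
    "spark.ui.enabled": "false",          # nobody looks at the UI in a CI/CD run
}


def resolve_test_conf(extra_conf=None):
    """Merges the test defaults with optional overrides, e.g. for local debugging."""
    return {**TEST_SPARK_CONF, **(extra_conf or {})}


def build_test_spark_session(app_name="unit-tests", extra_conf=None):
    """Creates (or reuses) a local SparkSession configured for unit tests."""
    # Imported lazily so that the configuration helpers stay usable without Spark.
    from pyspark.sql import SparkSession

    builder = SparkSession.builder.master("local[2]").appName(app_name)
    for key, value in resolve_test_conf(extra_conf).items():
        builder = builder.config(key, value)
    return builder.getOrCreate()
```

When you debug a single test locally, you could call build_test_spark_session(extra_conf={"spark.ui.enabled": "true"}). Keep in mind that spark.ui.enabled is read when the SparkContext starts, so the override has no effect on a session that already exists.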

Code latency factors

Latency isn't impacted only by Apache Spark or any other framework you use. Your code impacts the execution time as well. The first thing to look for is any blocking invocation made through a sleep function. You may be surprised, but it's sometimes present in 3rd party libraries (see Don't sleep when you code...about sleep issue in KPL) and can considerably slow down the test suite despite your configuration-related efforts.
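One possible mitigation, which is my assumption and not something covered in the talk, is to replace the sleep function with a mock for the duration of a test. The sketch below uses only the standard library. flush_with_backoff is a hypothetical stand-in for third-party code that sleeps between retries.

```python
import time
from unittest import mock


def flush_with_backoff(retries=3, backoff_seconds=5.0):
    """Hypothetical stand-in for library code that blocks between retries."""
    for _ in range(retries):
        time.sleep(backoff_seconds)
    return retries


def run_without_sleeping(func, *args, **kwargs):
    """Runs func with time.sleep mocked; returns the result and the requested pauses."""
    with mock.patch("time.sleep") as fake_sleep:
        result = func(*args, **kwargs)
    requested_pauses = [call.args[0] for call in fake_sleep.call_args_list]
    return result, requested_pauses
```

This only works when the code under test calls time.sleep through the module. If a library does from time import sleep, you would need to patch the name in the module that imported it instead.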

Another reason for slowness might be an I/O-related operation, such as interacting with a real database, manipulating files, or communicating with a remote API. That's why, whenever you can avoid them, prefer the fastest access possible, typically the in-memory one. Apache Spark Structured Streaming comes with two easy ways to gather things in memory. The first is foreachBatch, where you can simply .collect(...) the generated DataFrame and do anything you want after that. The second is the in-memory sink that you can later query directly from SQL. You can learn about both approaches in the previous follow-up blog post and my Data+AI Summit 2024 talk.
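Both in-memory approaches can be sketched as two small helpers. The function names and the test_output table name are hypothetical. The sketch also assumes a classic (non Spark Connect) session, so that the foreachBatch function runs in the driver process, and a source that works with the availableNow trigger, such as files.

```python
def collect_micro_batches(streaming_df):
    """Runs the query over the available input and returns the rows seen by foreachBatch."""
    collected_rows = []

    def remember_rows(batch_df, batch_id):
        # Each micro-batch is a regular DataFrame, so collect() brings it to the driver.
        collected_rows.extend(batch_df.collect())

    query = (streaming_df.writeStream
             .foreachBatch(remember_rows)
             .trigger(availableNow=True)
             .start())
    query.awaitTermination()
    return collected_rows


def read_from_memory_sink(spark, streaming_df, table_name="test_output"):
    """Writes the stream to the in-memory sink and reads it back with SQL."""
    query = (streaming_df.writeStream
             .format("memory")
             .queryName(table_name)  # the query name becomes the queryable table name
             .trigger(availableNow=True)
             .start())
    query.awaitTermination()
    return spark.sql(f"SELECT * FROM {table_name}").collect()
```

In both cases the assertions run on plain Python rows, without any database, file, or remote API involved in the validation step.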

Adoption - test definitions

Latency is indeed an important aspect of test practice adoption. But even the fastest tests, if they're difficult to maintain, extend, and declare, won't be widely adopted by teams. They need something else. This "something" is an easy method to define datasets.

No need to define your own Domain Specific Language (DSL). A simple builder function should be enough in most cases. You only need to keep the following in mind:

An example of such a builder is present in the repo:

# Visit, VisitContext, user_context() and technical_context() are defined in the repo.
# Note: the default contexts are created once, when the function is defined.
def visit(visit_id='visit_1', event_time='2024-01-05T10:00:00.000Z', user_id='user A id',
          page='page1.html', referral='search', ad_id='ad 1',
          user_cxt=user_context(), technical_cxt=technical_context()) -> Visit:
    generated_visit = Visit(
        visit_id=visit_id, event_time=event_time,
        user_id=user_id, page=page, context=VisitContext(
            referral=referral, ad_id=ad_id, user=user_cxt, technical=technical_cxt
        )
    )
    return generated_visit
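To show why such a builder helps, here is a self-contained sketch of how a test could use it. The Visit and VisitContext dataclasses below are simplified stand-ins I made up for the illustration (they skip the user and technical contexts). The real classes live in the repo.

```python
from dataclasses import dataclass


@dataclass
class VisitContext:
    """Simplified, hypothetical version of the context from the repo."""
    referral: str
    ad_id: str


@dataclass
class Visit:
    """Simplified, hypothetical version of the visit from the repo."""
    visit_id: str
    event_time: str
    user_id: str
    page: str
    context: VisitContext


def visit(visit_id="visit_1", event_time="2024-01-05T10:00:00.000Z", user_id="user A id",
          page="page1.html", referral="search", ad_id="ad 1") -> Visit:
    """Builds a valid visit; callers override only the attributes their test cares about."""
    return Visit(visit_id=visit_id, event_time=event_time, user_id=user_id, page=page,
                 context=VisitContext(referral=referral, ad_id=ad_id))


# A test about late events only changes the id and the event time...
late_visit = visit(visit_id="visit_2", event_time="2024-01-05T09:00:00.000Z")
# ...while a test about navigation only changes the page.
same_user_other_page = visit(page="page2.html")
```

Every attribute that is not overridden keeps a valid default, so the test body shows only what is relevant for the scenario being validated.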

In this blog post you saw three factors that can help your team adopt unit tests in their daily work. Although we've already solved many unit testing challenges, there are still some issues to address, which you're going to discover next time.
