Code organization and assertion flow are both important, but even they can't guarantee your colleagues' adherence to unit tests. There are other user-facing attributes to consider as well.
This blog post introduces three important factors that make unit tests user-friendly.
Configuration
One of the aspects that can accelerate your unit tests is the Apache Spark configuration. Concretely, I mean two properties that are often forgotten when creating a SparkSession for unit tests.
The first of them is the number of shuffle partitions, i.e. spark.sql.shuffle.partitions. Remember, your unit tests operate on a smaller dataset than your production jobs. As a result, you don't need the same high parallelism, which by the way impacts various operations, such as the state store that has one instance per shuffle partition. Besides, it also impacts the I/O, as each instance has its dedicated checkpoint storage, so a dedicated file to write and manage. Finally, fewer shuffle partitions means fewer tasks to schedule and run. Although this last argument may not be that visible in production where you have a single workload, believe me, you'll see the difference if your unit tests layer validates various inputs across different jobs.
The second crucial property is spark.ui.enabled. If you run your unit tests locally and you're struggling with one of them, enabling the Spark UI as a debugging aid is fine. However, doing that for a CI/CD job or a unit test suite executed from a Git hook doesn't make a lot of sense, as it's there only to validate what you want to promote further in the deployment branch.
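To give you an idea, here is a minimal sketch of a test-friendly SparkSession, assuming pytest and local execution; the fixture name and values are hypothetical and should be adapted to your project:

import pytest
from pyspark.sql import SparkSession

@pytest.fixture(scope='session')
def spark_session() -> SparkSession:
    return (SparkSession.builder
        .master('local[2]')
        .appName('unit-tests')
        # test datasets are small; the default 200 shuffle partitions are overkill
        .config('spark.sql.shuffle.partitions', '2')
        # no UI needed for automated runs (CI/CD, Git hooks)
        .config('spark.ui.enabled', 'false')
        .getOrCreate())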
Code latency factors
Latency won't always be impacted only by Apache Spark or any other framework you use. Your code impacts the execution time as well. The first thing to look for is any blocking invocation in the code through a sleep function. You may be surprised, but it's sometimes present in 3rd party libraries (Don't sleep when you code...about sleep issue in KPL) and can considerably slow down the test suite despite your configuration-related efforts.
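If you can't remove the sleep itself, you can often neutralize it in the test. A minimal sketch with the standard library's mocking tools; run_job_under_test is a hypothetical stand-in for your tested code:

from unittest.mock import patch

def test_completes_without_waiting():
    # replace time.sleep so every call returns immediately; if the tested
    # code does `from time import sleep`, patch that module's name instead
    with patch('time.sleep', return_value=None) as mocked_sleep:
        run_job_under_test()
        assert mocked_sleep.called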
Another source of slowness might be an I/O-related operation, such as interaction with a real database, manipulating files, or communicating with a remote API. That's why, if you can avoid them, prefer the fastest access possible, typically the in-memory one. Apache Spark Structured Streaming comes with two easy ways to gather results in-memory. The first is foreachBatch, where you can simply .collect(...) the generated DataFrame and do anything you want after that. The second way is the in-memory sink that you can later query directly from SQL. You can learn about both approaches in my previous blog post and my Data+AI Summit 2024 talk.
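A minimal sketch of both approaches; input_stream stands for any streaming DataFrame built in the test and spark for the test session, both hypothetical here:

collected_rows = []

def collect_batch(batch_df, batch_id):
    # .collect() brings the micro-batch to the driver; fine for small test datasets
    collected_rows.extend(batch_df.collect())

query = input_stream.writeStream.foreachBatch(collect_batch).start()
query.processAllAvailable()  # block until all available input is processed

# Alternative: the in-memory sink, queryable with plain SQL
memory_query = (input_stream.writeStream
    .format('memory')
    .queryName('test_output')
    .start())
memory_query.processAllAvailable()
results = spark.sql('SELECT * FROM test_output').collect()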
Adoption - test definitions
Latency is an important aspect of test practice adoption, indeed. But even the fastest tests, if they're difficult to maintain, extend, and declare, won't be widely adopted by the teams. They need something else. This "something" is an easy method to define datasets.
There's no need to define your own Domain Specific Language (DSL). A simple builder function should be enough for most cases. You only need to keep the following in mind:
- Predefine all irrelevant attributes so that your colleagues only have to override the properties that are relevant for the tested code unit.
- Use high-level and the most user-friendly abstractions, and do all the conversions in the builder. That's the case for a timestamp field that you might accept as a string and convert to the appropriate type inside the function.
An example of such a builder is present in the repo:
def visit(visit_id='visit_1', event_time='2024-01-05T10:00:00.000Z',
          user_id='user A id', page='page1.html', referral='search',
          ad_id='ad 1', user_cxt=user_context(),
          technical_cxt=technical_context()) -> Visit:
    generated_visit = Visit(
        visit_id=visit_id,
        event_time=event_time,
        user_id=user_id,
        page=page,
        context=VisitContext(
            referral=referral,
            ad_id=ad_id,
            user=user_cxt,
            technical=technical_cxt
        )
    )
    DataGeneratorWrapper.add_visit(generated_visit)
    return generated_visit
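A test then overrides only what matters for the verified behavior; a hypothetical usage example:

def test_should_keep_ad_id_in_visit_context():
    # everything except ad_id keeps its predefined default
    ad_visit = visit(ad_id='black_friday_campaign')
    assert ad_visit.context.ad_id == 'black_friday_campaign'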
In this blog post you saw three factors that can help your team adopt unit tests in their daily work. Although we've already solved many unit testing challenges, there are still some issues to address that you're going to discover next time.