Code organization and assertion flow are both important, but even they can't guarantee your colleagues' adherence to unit tests. There are other user-facing attributes to consider as well.
This blog post introduces three important factors that make unit tests user-friendly.
Configuration
One of the aspects that can accelerate your unit tests is the Apache Spark configuration. Concretely, I mean two properties that are often forgotten when creating a SparkSession for unit tests.
The first of them is the number of shuffle partitions, i.e. spark.sql.shuffle.partitions. Remember, your unit tests operate on a smaller dataset than your production jobs. As a result, you don't need the same high parallelism, which by the way impacts various operations, such as the state store that has one instance per shuffle partition. Besides, it also impacts the I/O, as each instance has its dedicated checkpoint storage, so a dedicated file to write and manage. Finally, fewer shuffle partitions means fewer tasks to schedule and run. Although this last argument may not be that visible in production where you have a single workload, believe me, you'll see the difference if your unit tests layer validates various inputs across different jobs.
The second crucial property is spark.ui.enabled. If you run your unit tests locally and you're struggling with one of them, enabling the Spark UI as a debugging aid is fine. However, doing that for a CI/CD job or a unit test suite executed from a Git hook doesn't make a lot of sense, as it's there only to validate what you want to promote further in the deployment branch.
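To give you an idea, here is a minimal sketch of a test-friendly SparkSession, assuming pytest and local execution; the fixture name and values are hypothetical and should be adapted to your project:

import pytest
from pyspark.sql import SparkSession

@pytest.fixture(scope='session')
def spark_session() -> SparkSession:
    return (SparkSession.builder
        .master('local[2]')
        .appName('unit-tests')
        # test datasets are small; the default 200 shuffle partitions are overkill
        .config('spark.sql.shuffle.partitions', '2')
        # no UI needed for automated runs (CI/CD, Git hooks)
        .config('spark.ui.enabled', 'false')
        .getOrCreate())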
Code latency factors
Latency won't always be impacted only by Apache Spark or any other framework you use. Your code impacts the execution time as well. The first thing to look for is any blocking invocation in the code through a sleep function. You may be surprised, but it's sometimes present in 3rd party libraries (Don't sleep when you code...about sleep issue in KPL) and can considerably slow down the test suite despite your configuration-related efforts.
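If you can't remove the sleep itself, you can often neutralize it in the test. A minimal sketch with the standard library's mocking tools; run_job_under_test is a hypothetical stand-in for your tested code:

from unittest.mock import patch

def test_completes_without_waiting():
    # replace time.sleep so every call returns immediately; if the tested
    # code does `from time import sleep`, patch that module's name instead
    with patch('time.sleep', return_value=None) as mocked_sleep:
        run_job_under_test()
        assert mocked_sleep.called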
Another source of slowness might be an I/O-related operation, such as interaction with a real database, manipulating files, or communicating with a remote API. That's why, if you can avoid them, prefer the fastest access possible, typically the in-memory one. Apache Spark Structured Streaming comes with two easy ways to gather results in-memory. The first is foreachBatch, where you can simply .collect(...) the generated DataFrame and do anything you want after that. The second way is the in-memory sink that you can later query directly from SQL. You can learn about both approaches in my previous blog post and my Data+AI Summit 2024 talk.
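A minimal sketch of both approaches; input_stream stands for any streaming DataFrame built in the test and spark for the test session, both hypothetical here:

collected_rows = []

def collect_batch(batch_df, batch_id):
    # .collect() brings the micro-batch to the driver; fine for small test datasets
    collected_rows.extend(batch_df.collect())

query = input_stream.writeStream.foreachBatch(collect_batch).start()
query.processAllAvailable()  # block until all available input is processed

# Alternative: the in-memory sink, queryable with plain SQL
memory_query = (input_stream.writeStream
    .format('memory')
    .queryName('test_output')
    .start())
memory_query.processAllAvailable()
results = spark.sql('SELECT * FROM test_output').collect()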
Adoption - test definitions
Latency is an important aspect of test practice adoption, indeed. But even the fastest tests, if they're difficult to maintain, extend, and declare, won't be widely adopted by the teams. They need something else. This "something" is an easy method to define datasets.
There's no need to define your own Domain Specific Language (DSL). A simple builder function should be enough for most cases. You only need to keep the following in mind:
- Predefine all irrelevant attributes so that your colleagues only have to override the properties that are relevant for the tested code unit.
- Use high-level and the most user-friendly abstractions, and do all the conversions in the builder. That's the case for a timestamp field that you might accept as a string and convert to the appropriate type inside the function.
An example of such a builder is present in the repo:
def visit(visit_id='visit_1', event_time='2024-01-05T10:00:00.000Z',
          user_id='user A id', page='page1.html', referral='search',
          ad_id='ad 1', user_cxt=user_context(),
          technical_cxt=technical_context()) -> Visit:
    generated_visit = Visit(
        visit_id=visit_id,
        event_time=event_time,
        user_id=user_id,
        page=page,
        context=VisitContext(
            referral=referral,
            ad_id=ad_id,
            user=user_cxt,
            technical=technical_cxt
        )
    )
    DataGeneratorWrapper.add_visit(generated_visit)
    return generated_visit
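A test then overrides only what matters for the verified behavior; a hypothetical usage example:

def test_should_keep_ad_id_in_visit_context():
    # everything except ad_id keeps its predefined default
    ad_visit = visit(ad_id='black_friday_campaign')
    assert ad_visit.context.ad_id == 'black_friday_campaign'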
In this blog post you saw three factors that can help your team adopt unit tests in their daily work. Although we've already solved many unit testing challenges, there are still some issues to address that you're going to discover next time.