Testing strategies in Spark

Versions: Spark 2.1.0

After writing a post about testing Spark applications, I decided to take a look at Spark project tests and see which patterns they use to verify framework features.

A virtual conference at the intersection of Data and AI. This is not a conference for the hype. Its real users talking about real experiences.
- 40+ speakers with the likes of Hannes from Duck DB, Sol Rashidi, Joe Reis, Sadie St. Lawrence, Ryan Wolf from nvidia, Rebecca from lidl
- 12th September 2024
- Three simultaneous tracks
- Panels, Lighting Talks, Keynotes, Booth crawls, Roundtables and Entertainment.
- Topics include (ingestion, finops for data, data for inference (feature platforms), data for ML observability
- 100% virtual and 100% free

👉 Register here

This post summarizes the analysis. Unlike usual, there are not distinguish parts. Instead of them, each new paragraph describes shortly new testing pattern. The pattern itself is written in bold. The analysis was made against 4 projects: batch, streaming, SQL and external.

The first pattern is the use of helper methods and assertions. Among them we can distinguish for example StreamTest with methods used to add tested data, StateStoreMetricsTest defining some shared assertions or SQLTestData defining data used in tests of Spark SQL module.

Another commonly retrieved testing pattern concerns data used in tests. Most of time data used to build RDDs, DataFrames or DStreams are the simple integers. Very often we can encounter methods creating data through parallelize(Seq) method, as for example sparkContext.parallelize(1 to 10). As the examples of that we could observe: FilteredScanSuite or PairRDDFunctionsSuite.

To deal with context creation overhead mentioned in previous sections, Spark uses a shared context. For batch part this shared context is represented with SharedSparkContext. Spark SQL in its turns works with SharedSQLContext. A little exception is streaming based on DStream that mostly uses TestSuiteBase trait containing methods wrapping StreamingContext: withStreamingContext(StreamingContext).

In tests Spark uses often listeners that are helpful to track jobs execution in more fine grained way. For example TestSuiteBase uses StreamingListener to count the number of started and completed batches, and to make assertions on them in tests. AccumulatorSuite employs SparkListener implementation to keep somewhere completed stages and tasks, used further in test cases.

An important part in Spark tests concerns concurrency. Several solutions for concurrency testing exists but the most popular in browsed tests seems to be based on CountDownLatch use, exactly as shown in the post about Tests in multi-threading environment with JUnit. As an example we could show ExecutorSuite that synchronizes main and worker threads through CountDownLatch. Sometimes we can also meet the code containing Thread.sleep(Long) calls. An example can be found in StreamingContextSuite.

We could also think that testing programming library doesn't necessarily needs the use of mocking frameworks. But it's not true for Spark since in some places it mocks real objects with Mockito library. It does that for example with ExecutorAllocationClient object in ExecutorAllocationManagerSuite. Mockito is also employed in tests for Spark Kinesis project. Both parts are a little bit tricky to be tested with real/embedded servers. Kinesis doesn't provide such server while ExecutorAllocationManagerSuite verifies executors allocation during processing.

Beside mocks, also fake objects are used in tests (see post about Test doubles - mocks, stubs and the others to discover different test objects). PipedRDDSuite uses such type of objects to simulate Iterator. Another example of the use of fake objects is PairRDDFunctionsSuite. It works with fake objects to check saving data as Hadoop files.

But even if mocks are used in Kinesis testing, there are also tests against real instances. The only difference with usual tests is that they're conditioned, i.e. they are executed only if the environment variable RUN_KINESIS_TESTS is set to 1. In external project, beside Kinesis connector, are also other connectors, such a Kafka. In its case tests are more commonly ran against embedded instances of Kafka and ZooKeeper.

Tests, especially in streaming part, make a special use of ManualClock instance. Its specificity is the capability to manipulate streaming time programatically with appropriated setters. We can see it for example in BlockGeneratorSuite (streaming case) or TaskSetManagerSuite (core part). In both places ManualClock is used to advance the time in tested methods.

Regarding to other tests from external project, an interesting part concerns integration tests executed against Docker containers representing, for example: relational databases (PostgreSQL, MySQL, Oracle). We can find there also images used to launch master/worker dockerized infrastructure.

This post lists some main points from Spark project tests. We can learn that the tests are done in simple datasets (usually integers), with helper methods and often shared Spark contexts. We can also see how Spark checks the code written for external systems, as Kafka and Kinesis. For this case it uses both mocks and tests against real servers. The analysis of tests also made an insight on use of Docker images helping to verify Spark SQL integration.


If you liked it, you should read:

đź“š Newsletter Get new posts, recommended reading and other exclusive information every week. SPAM free - no 3rd party ads, only the information about waitingforcode!