Testing strategies in Spark

Versions: Spark 2.1.0

After writing a post about testing Spark applications, I decided to take a look at the tests of the Spark project itself and see which patterns they use to verify the framework's features.

This post summarizes that analysis. Unlike usual posts, it is not divided into distinct parts. Instead, each paragraph briefly describes a new testing pattern, with the pattern itself written in bold. The analysis was made against 4 projects: batch, streaming, SQL and external.

The first pattern is the use of helper methods and assertions. Among them we can distinguish, for example, StreamTest with methods used to add test data, StateStoreMetricsTest defining some shared assertions, or SQLTestData defining the data used in the tests of the Spark SQL module.
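
To illustrate the idea, here is a minimal sketch of such a helper trait. The names SharedAssertions, assertRows and SomeFeatureSuite are hypothetical; only the pattern itself mirrors what StreamTest, StateStoreMetricsTest or SQLTestData do:

```scala
import org.apache.spark.sql.{DataFrame, Row}
import org.scalatest.FunSuite

// Hypothetical helper trait grouping shared test data and assertions,
// so that individual test suites stay short and focused
trait SharedAssertions { self: FunSuite =>

  // shared, deliberately simple test data, as SQLTestData does for Spark SQL suites
  val numbers: Seq[Int] = 1 to 10

  // reusable assertion comparing a DataFrame's content with the expected rows
  def assertRows(result: DataFrame, expectedRows: Seq[Row]): Unit = {
    assert(result.collect().toSeq == expectedRows)
  }
}

// a concrete suite simply mixes the trait in and reuses `numbers` and `assertRows`
class SomeFeatureSuite extends FunSuite with SharedAssertions
```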

Another commonly encountered testing pattern concerns the data used in tests. Most of the time, the data used to build RDDs, DataFrames or DStreams are simple integers. Very often the data is created through the parallelize(Seq) method, as for example sparkContext.parallelize(1 to 10). Examples of that can be observed in FilteredScanSuite or PairRDDFunctionsSuite.
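
A minimal illustration of this pattern could look as follows; the suite name is hypothetical and only the parallelize(1 to 10) call comes from the observed tests:

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.scalatest.FunSuite

// Minimal sketch of the "simple integers" pattern: the tested RDD is built
// from a small range instead of realistic domain data
class SimpleDataSuite extends FunSuite {

  test("sum of a small integer RDD") {
    val sparkContext = new SparkContext(
      new SparkConf().setMaster("local[2]").setAppName("simple-data-test"))
    try {
      val numbers = sparkContext.parallelize(1 to 10)
      assert(numbers.reduce(_ + _) == 55)
    } finally {
      sparkContext.stop()
    }
  }
}
```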

To deal with the context creation overhead mentioned previously, Spark uses a shared context. For the batch part this shared context is represented by SharedSparkContext. Spark SQL in its turn works with SharedSQLContext. A small exception is DStream-based streaming, which mostly uses the TestSuiteBase trait containing methods wrapping StreamingContext, such as withStreamingContext(StreamingContext).
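
A simplified sketch of the shared-context idea, assuming a local master and ScalaTest's BeforeAndAfterAll; the trait SharedLocalSparkContext and the suite below are hypothetical and only mimic what SharedSparkContext provides:

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.scalatest.{BeforeAndAfterAll, FunSuite, Suite}

// Hypothetical equivalent of SharedSparkContext: the context is created once
// per suite and reused by all test cases to avoid the startup overhead
trait SharedLocalSparkContext extends BeforeAndAfterAll { self: Suite =>

  @transient private var _sc: SparkContext = _
  def sc: SparkContext = _sc

  override def beforeAll(): Unit = {
    super.beforeAll()
    _sc = new SparkContext(
      new SparkConf().setMaster("local[2]").setAppName(suiteName))
  }

  override def afterAll(): Unit = {
    try {
      if (_sc != null) _sc.stop()
    } finally {
      super.afterAll()
    }
  }
}

class EvenNumbersSuite extends FunSuite with SharedLocalSparkContext {
  test("filter keeps even numbers") {
    assert(sc.parallelize(1 to 6).filter(_ % 2 == 0).collect().toSeq == Seq(2, 4, 6))
  }
}
```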

In tests Spark often uses listeners, which help to track job execution in a more fine-grained way. For example, TestSuiteBase uses a StreamingListener to count the number of started and completed batches and to make assertions on them in tests. AccumulatorSuite employs a SparkListener implementation to store completed stages and tasks, used later in test cases.
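
As an illustration, a hypothetical listener-based test could look like the sketch below, loosely inspired by what AccumulatorSuite does; it assumes that stopping the context flushes the asynchronous listener bus before the assertions run:

```scala
import java.util.concurrent.atomic.AtomicInteger

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.scheduler.{SparkListener, SparkListenerStageCompleted, SparkListenerTaskEnd}
import org.scalatest.FunSuite

// The listener accumulates completed stages and tasks so that the test
// can make assertions on them once the job has finished
class CountingListener extends SparkListener {
  val completedStages = new AtomicInteger(0)
  val completedTasks = new AtomicInteger(0)

  override def onStageCompleted(stageCompleted: SparkListenerStageCompleted): Unit = {
    completedStages.incrementAndGet()
  }

  override def onTaskEnd(taskEnd: SparkListenerTaskEnd): Unit = {
    completedTasks.incrementAndGet()
  }
}

class ListenerBasedSuite extends FunSuite {

  test("a count() job completes at least one stage") {
    val sparkContext = new SparkContext(
      new SparkConf().setMaster("local[2]").setAppName("listener-test"))
    val listener = new CountingListener
    sparkContext.addSparkListener(listener)

    sparkContext.parallelize(1 to 10).count()
    // stopping the context flushes the asynchronous listener bus before asserting
    sparkContext.stop()

    assert(listener.completedStages.get() >= 1)
    assert(listener.completedTasks.get() >= 1)
  }
}
```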

An important part of Spark tests concerns concurrency. Several solutions for concurrency testing exist, but the most popular in the browsed tests seems to be based on CountDownLatch, exactly as shown in the post about Tests in multi-threading environment with JUnit. As an example we can cite ExecutorSuite, which synchronizes the main and worker threads through a CountDownLatch. Sometimes we can also find code containing Thread.sleep(Long) calls; an example can be found in StreamingContextSuite.
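
The CountDownLatch pattern can be sketched as follows; this is a hypothetical, self-contained example rather than the actual ExecutorSuite code:

```scala
import java.util.concurrent.{CountDownLatch, TimeUnit}

import org.scalatest.FunSuite

// The test thread blocks until the worker thread signals that it reached the
// interesting point, instead of sleeping for an arbitrary duration
class LatchSynchronizationSuite extends FunSuite {

  test("main thread waits for the worker's signal") {
    val workerDone = new CountDownLatch(1)
    var computedValue = 0

    val worker = new Thread(new Runnable {
      override def run(): Unit = {
        computedValue = 42
        workerDone.countDown() // signal the main (test) thread
      }
    })
    worker.start()

    // block at most 5 seconds; fail instead of hanging forever
    assert(workerDone.await(5, TimeUnit.SECONDS))
    assert(computedValue == 42)
    worker.join()
  }
}
```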

We could think that testing a programming library doesn't necessarily need the use of mocking frameworks. But it's not true for Spark, since in some places it mocks real objects with the Mockito library. It does that, for example, with the ExecutorAllocationClient object in ExecutorAllocationManagerSuite. Mockito is also employed in the tests of the Spark Kinesis project. Both parts are a little bit tricky to test with real or embedded servers: Kinesis doesn't provide such a server, while ExecutorAllocationManagerSuite verifies executor allocation during processing.
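
A minimal sketch of the mocking pattern with Mockito could look like this; AllocationClient and AllocationManager are hypothetical simplifications standing in for ExecutorAllocationClient and the manager tested in ExecutorAllocationManagerSuite:

```scala
import org.mockito.Mockito.{mock, verify, when}
import org.scalatest.FunSuite

// hypothetical collaborator that would normally talk to a real cluster
trait AllocationClient {
  def requestExecutors(count: Int): Boolean
}

// hypothetical code under test, delegating to the collaborator
class AllocationManager(client: AllocationClient) {
  def scaleUp(count: Int): Boolean = client.requestExecutors(count)
}

class AllocationManagerSuite extends FunSuite {

  test("scaling up delegates to the allocation client") {
    // the collaborator is a Mockito mock, so no real cluster is needed
    val client = mock(classOf[AllocationClient])
    when(client.requestExecutors(2)).thenReturn(true)

    val manager = new AllocationManager(client)

    assert(manager.scaleUp(2))
    verify(client).requestExecutors(2)
  }
}
```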

Besides mocks, fake objects are also used in the tests (see the post about Test doubles - mocks, stubs and the others to discover the different kinds of test objects). PipedRDDSuite uses this type of object to simulate an Iterator. Another example of the use of fake objects is PairRDDFunctionsSuite, which works with fake objects to check saving data as Hadoop files.
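
A hypothetical example of a fake object, loosely inspired by the iterator simulation in PipedRDDSuite, could be a hand-written iterator recording how it was consumed:

```scala
import org.scalatest.FunSuite

// Fake object: a fully working, hand-written implementation of the dependency
// (here an Iterator) that additionally records how it was used
class RecordingIterator(values: Seq[Int]) extends Iterator[Int] {
  private val underlying = values.iterator
  var consumedCount = 0

  override def hasNext: Boolean = underlying.hasNext

  override def next(): Int = {
    consumedCount += 1
    underlying.next()
  }
}

class FakeObjectSuite extends FunSuite {

  test("the code under test consumes the whole iterator") {
    val fakeIterator = new RecordingIterator(Seq(1, 2, 3))

    // the "code under test" here is simply a sum over the iterator
    val sum = fakeIterator.foldLeft(0)(_ + _)

    assert(sum == 6)
    assert(fakeIterator.consumedCount == 3)
  }
}
```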

But even if mocks are used in Kinesis testing, there are also tests against real instances. The only difference from the usual tests is that they're conditional, i.e. they are executed only if the environment variable RUN_KINESIS_TESTS is set to 1. The external project contains, besides the Kinesis connector, other connectors, such as Kafka. In its case tests are more commonly run against embedded instances of Kafka and ZooKeeper.
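
The conditional execution can be sketched with ScalaTest's assume, which cancels the test case when the condition doesn't hold; the suite below is hypothetical, only the RUN_KINESIS_TESTS variable comes from Spark:

```scala
import org.scalatest.FunSuite

// The real integration scenario runs only when the environment variable is set,
// otherwise the test case is reported as canceled instead of failed
class ConditionalIntegrationSuite extends FunSuite {

  private val shouldRunRealTests = sys.env.get("RUN_KINESIS_TESTS").contains("1")

  test("writing to the real service") {
    assume(shouldRunRealTests, "RUN_KINESIS_TESTS is not set to 1, skipping")
    // ... the scenario against the real instance would go here ...
  }
}
```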

Tests, especially in the streaming part, make special use of a ManualClock instance. Its specificity is the ability to manipulate time programmatically with appropriate setters. We can see it, for example, in BlockGeneratorSuite (streaming case) or TaskSetManagerSuite (core part). In both places ManualClock is used to advance the time in the tested methods.
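
Since ManualClock is internal to Spark, the sketch below uses a simplified, hypothetical clock and scheduler to show the idea: the test advances the time explicitly instead of sleeping:

```scala
import org.scalatest.FunSuite

// hypothetical clock abstraction injected into the code under test
trait Clock {
  def currentTime(): Long
}

// manual clock: time moves only when the test says so
class ManualTestClock(private var time: Long = 0L) extends Clock {
  override def currentTime(): Long = time
  def advance(millis: Long): Unit = time += millis
}

// hypothetical time-dependent component, standing in for the real tested classes
class BatchScheduler(clock: Clock, batchIntervalMs: Long) {
  private var lastBatchTime = clock.currentTime()

  def shouldTriggerBatch(): Boolean = {
    if (clock.currentTime() - lastBatchTime >= batchIntervalMs) {
      lastBatchTime = clock.currentTime()
      true
    } else {
      false
    }
  }
}

class ManualClockSuite extends FunSuite {

  test("a batch is triggered only after the interval has elapsed") {
    val clock = new ManualTestClock()
    val scheduler = new BatchScheduler(clock, batchIntervalMs = 1000)

    assert(!scheduler.shouldTriggerBatch())
    clock.advance(1000) // deterministic, no Thread.sleep needed
    assert(scheduler.shouldTriggerBatch())
  }
}
```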

Regarding other tests from the external project, an interesting part concerns integration tests executed against Docker containers representing, for example, relational databases (PostgreSQL, MySQL, Oracle). We can also find there images used to launch a dockerized master/worker infrastructure.

This post lists some of the main points from the Spark project tests. We can learn that the tests are done on simple datasets (usually integers), with helper methods and often shared Spark contexts. We can also see how Spark checks the code written for external systems, such as Kafka and Kinesis: in this case it uses both mocks and tests against real servers. The analysis also gave an insight into the use of Docker images that help to verify Spark SQL integrations.

