We all have our habits, and as programmers, libraries and frameworks are definitely part of them. In this blog post I'll share a list of Java and Scala classes I use in almost every data engineering project. The Python part will follow next week!
Unit testing - Diffx
It's one of my favorites for adding extra context to test failures. Diffx is a Scala library that spots the differing values in case classes. So instead of this:
Letter(1,a,AA) did not equal Letter(1,a,A)
ScalaTestFailureLocation: com.waitingforcode.DiffxExample at (DiffxExample.scala:16)
Expected :Letter(1,a,A)
Actual   :Letter(1,a,AA)
You'll get the exact difference for the mismatched fields:
Matching error:
Letter(
  id: 1,
  lower: a,
  upper: A[A])
Additionally, you can customize the output by defining your own ShowConfig class.
Maps.difference
Unfortunately, Diffx shines only for case class comparisons. What if you have an arbitrary map instead? The Open Source community takes care of us here too. The Google Guava project - which you certainly know well if you came to data engineering from software engineering - brings a method called Maps.difference. As the name suggests, it returns the difference between two maps, including:
- entriesDiffering() to list the keys present in both maps but with different values
- entriesOnlyOnLeft() to return the extra entries present only in the left map
- entriesOnlyOnRight() to return the extra entries present only in the right map
- areEqual() to return a boolean flag indicating whether the maps are equal
The class is really powerful because it can even detect differences for nested maps!
val mapLeft = new java.util.HashMap[String, Any]()
mapLeft.put("common_equal", 1)
mapLeft.put("common_different", 1)
mapLeft.put("extra_in_left", "22222")
val nestedMapLeft = new java.util.HashMap[String, Int]()
nestedMapLeft.put("extra_left", 1)
nestedMapLeft.put("common_equal", 2)
mapLeft.put("different_nested_map", nestedMapLeft)

val mapRight = new java.util.HashMap[String, Any]()
mapRight.put("common_equal", 1)
mapRight.put("common_different", 11)
mapRight.put("extra_in_right", "33333")
val nestedMapRight = new java.util.HashMap[String, Int]()
nestedMapRight.put("extra_right", 11)
nestedMapRight.put("common_equal", 2)
mapRight.put("different_nested_map", nestedMapRight)

val diff = Maps.difference[String, Any](mapLeft, mapRight)
diff.areEqual() shouldEqual false
// prints common_different=(1, 11) and different_nested_map
// with the (left, right) pair of nested maps
println(diff.entriesDiffering())
// {extra_in_left=22222}
println(diff.entriesOnlyOnLeft())
// {extra_in_right=33333}
println(diff.entriesOnlyOnRight())
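If Guava is not on the classpath, the same views are easy to approximate with plain JDK collections. The sketch below (class and method names are mine, not Guava's) mimics entriesDiffering and entriesOnlyOnLeft for flat maps:

```java
import java.util.HashMap;
import java.util.Map;

// Plain-JDK sketch of the views exposed by Guava's Maps.difference
// (illustrative only, not Guava's actual implementation).
public class MapDiffSketch {

    // Keys present in both maps but mapped to different values,
    // each associated with its (left, right) pair rendered as text.
    static Map<String, String> entriesDiffering(Map<String, ?> left, Map<String, ?> right) {
        Map<String, String> differing = new HashMap<>();
        for (Map.Entry<String, ?> e : left.entrySet()) {
            Object rightValue = right.get(e.getKey());
            if (right.containsKey(e.getKey()) && !e.getValue().equals(rightValue)) {
                differing.put(e.getKey(), "(" + e.getValue() + ", " + rightValue + ")");
            }
        }
        return differing;
    }

    // Entries whose keys exist only in the first map; swap the
    // arguments to get the "only on right" view.
    static Map<String, Object> entriesOnlyOnLeft(Map<String, ?> left, Map<String, ?> right) {
        Map<String, Object> only = new HashMap<>(left);
        only.keySet().removeAll(right.keySet());
        return only;
    }

    public static void main(String[] args) {
        Map<String, Object> left = new HashMap<>();
        left.put("common_equal", 1);
        left.put("common_different", 1);
        left.put("extra_in_left", "22222");
        Map<String, Object> right = new HashMap<>();
        right.put("common_equal", 1);
        right.put("common_different", 11);
        right.put("extra_in_right", "33333");

        System.out.println(entriesDiffering(left, right));  // {common_different=(1, 11)}
        System.out.println(entriesOnlyOnLeft(left, right)); // {extra_in_left=22222}
        System.out.println(entriesOnlyOnLeft(right, left)); // {extra_in_right=33333}
    }
}
```

It does the job for flat maps, but Guava's version also gives you areEqual() and value-based equality for nested structures out of the box, so prefer the library when you can.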
TimeUnit
That's probably the most useful class if you deal with time. It simplifies the code a lot because instead of converting time units with multiplications or divisions, you simply name the unit of the input and the unit you want out, as below:
val inputSeconds = 120
TimeUnit.SECONDS.toMinutes(inputSeconds) shouldEqual 2
TimeUnit.SECONDS.toMillis(inputSeconds) shouldEqual 120000
Beautiful, isn't it?
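One subtlety worth knowing: converting to a coarser unit truncates rather than rounds, and converting to a finer unit saturates at Long.MAX_VALUE on overflow. A small JDK-only illustration (the class name is mine):

```java
import java.util.concurrent.TimeUnit;

public class TimeUnitTruncation {
    public static void main(String[] args) {
        // Conversions to a coarser unit truncate, they do not round
        System.out.println(TimeUnit.SECONDS.toMinutes(119)); // 1
        System.out.println(TimeUnit.SECONDS.toMinutes(120)); // 2
        // Conversions to a finer unit are exact, unless the result
        // overflows a long, in which case it saturates at Long.MAX_VALUE
        System.out.println(TimeUnit.DAYS.toMillis(1)); // 86400000
    }
}
```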
FileUtils
A common requirement is to serialize a class and store it in a file as text. Even though Apache Spark fulfills it without any specific code on your side, you may not always use Apache Spark. That's especially true for testing or small context files, where the FileUtils.writeStringToFile method shines:
val textToWrite =
  """
    |line#1
    |line#2
    |line#3
    |""".stripMargin
FileUtils.writeStringToFile(new File("/tmp/test.txt"), textToWrite, "UTF-8")
FileUtils.readFileToString(new File("/tmp/test.txt"), "UTF-8") shouldEqual textToWrite
Besides the write, the snippet also shows the opposite method, readFileToString, which reads a file back into a string.
ObjectMapper
Sometimes even FileUtils is insufficient for writing an object as text. That's especially true for JSON, where building a JSON string manually is cumbersome and error-prone. One way to address the issue is a dedicated JSON serialization library, and the one that has worked best for me for several years is Jackson.
Saving a case class as JSON with Jackson is easy. You simply initialize an ObjectMapper with the required modules (DefaultScalaModule in the example) and use one of the existing write and read methods:
val scalaJsonMapper = new ObjectMapper()
scalaJsonMapper.registerModule(DefaultScalaModule)
val personToSave = Person("Save", "Me")
val personJson = scalaJsonMapper.writeValueAsString(personToSave)
FileUtils.writeStringToFile(new File("/tmp/test.txt"), personJson, "UTF-8")
scalaJsonMapper.readValue(new File("/tmp/test.txt"), classOf[Person]) shouldEqual personToSave
CountDownLatch
Lastly, a class that should help you coordinate asynchronous code. I like using it to start a background process, let the main thread continue, and make the main thread finish only after the background process completes. There are certainly other ways to achieve this, but for me an explicit blocker expresses the intent much better than, for example, a Future.
The class in question is CountDownLatch. It's a counter-based lock where you define the number of processes that must decrease the counter (countDown()) before the execution resumes from the blocking point (await()). The example below shows a background process generating a file while the main thread does other things:
val textToWrite =
  """
    |line#1
    |line#2
    |line#3
    |""".stripMargin
val countDownLatch = new CountDownLatch(1)
new Thread(new Runnable() {
  override def run(): Unit = {
    // Give some time to see the synchronization
    Thread.sleep(3000L)
    try {
      FileUtils.writeStringToFile(new File("/tmp/test.txt"), textToWrite, "UTF-8")
    } finally {
      countDownLatch.countDown()
    }
  }
}).start()
println("Doing some other, more important work here")
countDownLatch.await(10, TimeUnit.SECONDS)
FileUtils.readFileToString(new File("/tmp/test.txt"), "UTF-8") shouldEqual textToWrite
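One detail the snippet glosses over: the timed await returns a boolean telling you whether the counter actually reached zero before the timeout, and it's worth checking. A minimal JDK-only sketch (the runAndAwait helper is mine, not a standard API):

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;

public class LatchSketch {

    // Runs `work` in a background thread and waits up to `timeoutSeconds`
    // for it to finish; returns false if the timeout expired first.
    static boolean runAndAwait(Runnable work, long timeoutSeconds) throws InterruptedException {
        CountDownLatch latch = new CountDownLatch(1);
        new Thread(() -> {
            try {
                work.run();
            } finally {
                latch.countDown(); // always release the waiting thread
            }
        }).start();
        return latch.await(timeoutSeconds, TimeUnit.SECONDS);
    }

    public static void main(String[] args) throws InterruptedException {
        boolean completed = runAndAwait(() -> {
            try {
                Thread.sleep(200L); // simulate slow background work
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        }, 5);
        // Tells us whether the background work really finished in time
        System.out.println(completed); // true
    }
}
```

Counting down in a finally block, as in the Scala example above, is the important part: without it, a failure in the background work would leave the main thread blocked until the timeout.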
Hope you discovered something new here. Until last year I was not aware of Diffx and Maps.difference, but they turned out to be a better way to compare objects than visual checks or custom comparison code! What about your favorite Java or Scala libraries?