Shading as solution for dependency hell in Spark

Versions: Spark 2.1.0

Using Spark in an AWS environment can sometimes be problematic, especially when the dependency hell problem appears. Fortunately, it can be resolved pretty easily with shading.


The first part of the post shows the problem of mismatched dependencies. The second part presents one possible solution, based on dependency shading in Maven.

The java.lang.NoSuchMethodError error

The application hitting the error mentioned in the header is a simple Kinesis consumer using Spark's native external module in version 2.1.0. The code is similar to the one advised on the official Spark documentation page:

 val kinesisStream = KinesisUtils.createStream(
     streamingContext, "My Kinesis consumer", "Test stream", [endpoint URL],
     [region name], [initial position], [checkpoint interval], StorageLevel.MEMORY_AND_DISK_2)

At first glance, nothing here could make the processing fail. However, the problems appeared after compiling and executing on an EMR instance with YARN:

java.lang.RuntimeException: java.util.concurrent.ExecutionException: java.lang.NoSuchMethodError:;
        at org.apache.spark.streaming.kinesis.KinesisReceiver$$anon$1...
Caused by: java.util.concurrent.ExecutionException: java.lang.NoSuchMethodError:;
        at java.util.concurrent.FutureTask.get(...)
        ... 3 more
Caused by: java.lang.NoSuchMethodError:;
        at java.util.concurrent.ThreadPoolExecutor.runWorker(...)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(...)

What happened here? Simply speaking, the loaded JAR containing the classes was not the same as the one used to generate$AggregatedRecord. The failing line (1749) contains the following code:

// ...
} finally {
  if (((mutable_bitField0_ & 0x00000001) == 0x00000001)) {
    partitionKeyTable_ = partitionKeyTable_.getUnmodifiableView();
  }
  if (((mutable_bitField0_ & 0x00000002) == 0x00000002)) {
    explicitHashKeyTable_ = explicitHashKeyTable_.getUnmodifiableView();
  }
  if (((mutable_bitField0_ & 0x00000004) == 0x00000004)) {
    records_ = java.util.Collections.unmodifiableList(records_);
  }
  this.unknownFields = unknownFields.build();
  // ...
}

A quick look at the documentation tells more about the getUnmodifiableView() method: it was added only in the 2.6.0 release of Protobuf. The version of this library installed on EMR comes from Hadoop, which uses 2.5.0. Obviously, this leads to an ineluctable dependency hell.

Note that this error occurs only when the producer writes the messages in an aggregated form, because deaggregating such records relies on the Protobuf-generated classes shown above.
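One way to fail fast instead of hitting a NoSuchMethodError deep inside a receiver thread is to probe the classpath at startup. The sketch below is illustrative (the class and method names in main are only examples, not code from the original application); it uses reflection to check whether a given method exists on whichever version of a class was actually loaded:

```java
import java.lang.reflect.Method;

public class MethodProbe {

    // Returns true if the class is on the classpath and exposes a public
    // method with the given name, regardless of which JAR it came from.
    public static boolean hasMethod(String className, String methodName) {
        try {
            for (Method m : Class.forName(className).getMethods()) {
                if (m.getName().equals(methodName)) {
                    return true;
                }
            }
            return false;
        } catch (ClassNotFoundException e) {
            return false; // class not present at all
        }
    }

    public static void main(String[] args) {
        // In the EMR scenario, probing "" for
        // "getUnmodifiableView" at driver startup would reveal the 2.5.0
        // problem before any Kinesis record is processed.
        System.out.println(hasMethod("java.util.ArrayList", "add"));
    }
}
```

Running such a check in the driver turns a late, confusing executor crash into an immediate and explicit startup failure.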

Shading as solution

The most straightforward solution, at least superficially, would be to replace the installed Protobuf version. But it's not that simple and, after some research, another fix seemed easier to implement: a technique called dependency shading.

Let's go back to the beginning. Usually the application code is compiled with all used dependencies into a single JAR called a fatjar. This JAR is later deployed. The problem is that even if we add a dependency in a particular version, we have no guarantee that this version will be used at runtime. This occurs when 2 or more versions of the same dependency are available on the execution classpath. To resolve the issue we could ensure that a particular library is present in only one place. But that's not always possible, especially on managed cloud services such as AWS EMR. To make it work every time, shading can be used.
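To find out which of the competing JARs actually won on the classpath, we can ask the classloader where a class came from. The helper below is a hypothetical diagnostic, not code from the original post:

```java
public class JarLocator {

    // Returns the URL of the JAR (or directory) a class was loaded from,
    // or a marker string for classes loaded by the bootstrap classloader
    // (which have no CodeSource).
    public static String locationOf(Class<?> clazz) {
        java.security.CodeSource source =
            clazz.getProtectionDomain().getCodeSource();
        return source == null ? "bootstrap" : source.getLocation().toString();
    }

    public static void main(String[] args) {
        // Called on a Protobuf class inside a Spark job, this would print
        // which JAR the executor actually loaded it from.
        System.out.println(locationOf(JarLocator.class));
    }
}
```

Logging this once per executor is often enough to confirm that Hadoop's Protobuf 2.5.0 shadows the version shipped with the application.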

So what's the difference with a shaded dependency? First of all, it's also included in the fatjar. The difference is that it's not included under its original name. Instead, the shaded dependency is relocated, that is, renamed. It means 2 things. First, the included dependency changes its package name (e.g. becomes Secondly, the classes importing the shaded dependency change their import statements accordingly (e.g. import becomes import The bytecode of the compiled classes is rewritten and assembled later into the final fatjar.

To use shading in Maven we can add the Maven Shade Plugin and define the shaded dependencies under its relocations section. The following snippet shows a simple use case:
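A minimal configuration could look like the following sketch; the plugin version and the shaded.protobuf prefix are illustrative choices, not values from the original post:

```xml
<plugin>
  <groupId>org.apache.maven.plugins</groupId>
  <artifactId>maven-shade-plugin</artifactId>
  <version>3.2.4</version>
  <executions>
    <execution>
      <phase>package</phase>
      <goals>
        <goal>shade</goal>
      </goals>
      <configuration>
        <relocations>
          <relocation>
            <!-- Rename Protobuf packages and rewrite every import of them
                 in the application bytecode. -->
            <pattern></pattern>
            <shadedPattern>shaded.protobuf</shadedPattern>
          </relocation>
        </relocations>
      </configuration>
    </execution>
  </executions>
</plugin>
```

With this in place, the fatjar carries its own copy of Protobuf under shaded.protobuf, so the 2.5.0 version provided by Hadoop on EMR can no longer shadow it.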


Dependency shading is one of the possible solutions for dependency hell. To recall, this problem occurs when 2 different and incompatible versions of one library are present and only one of them is selected at runtime (thus the code can still compile correctly). It's for instance the case of Protobuf on EMR with the Spark Kinesis client. Plugins making shading possible exist in most build tools: besides the Maven plugin described here, we can find for example shadow for Gradle or sbt-assembly for SBT.