Using Spark in an AWS environment can sometimes be problematic, especially when the dependency hell problem appears. Fortunately, it can be resolved fairly easily with shading.
The first part of the post shows the problem of mismatched dependencies. The second part presents one of the possible solutions, based on dependency shading in Maven.
java.lang.NoSuchMethodError: com.google.protobuf.LazyStringList.getUnmodifiableView()Lcom/google/protobuf/LazyStringList error
The application hitting the problem named in the heading is a simple Kinesis consumer using Spark's native external module in version 2.1.0. The code is similar to the example given in the official Spark documentation:
val kinesisStream = KinesisUtils.createStream(
  streamingContext,
  "My Kinesis consumer",
  "Test stream",
  [endpoint URL],
  [region name],
  [initial position],
  [checkpoint interval],
  StorageLevel.MEMORY_AND_DISK_2)
At first glance, nothing here should make the processing fail. However, problems appeared after compiling the application and running it on an EMR instance with YARN:
java.lang.RuntimeException: java.util.concurrent.ExecutionException: java.lang.NoSuchMethodError: com.google.protobuf.LazyStringList.getUnmodifiableView()Lcom/google/protobuf/LazyStringList;
    at com.amazonaws.services.kinesis.clientlibrary.lib.worker.ShardConsumer.checkAndSubmitNextTask(ShardConsumer.java:157)
    at com.amazonaws.services.kinesis.clientlibrary.lib.worker.ShardConsumer.consumeShard(ShardConsumer.java:126)
    at com.amazonaws.services.kinesis.clientlibrary.lib.worker.Worker.run(Worker.java:347)
    at org.apache.spark.streaming.kinesis.KinesisReceiver$$anon$1.run(KinesisReceiver.scala:174)
Caused by: java.util.concurrent.ExecutionException: java.lang.NoSuchMethodError: com.google.protobuf.LazyStringList.getUnmodifiableView()Lcom/google/protobuf/LazyStringList;
    at java.util.concurrent.FutureTask.report(FutureTask.java:122)
    at java.util.concurrent.FutureTask.get(FutureTask.java:192)
    at com.amazonaws.services.kinesis.clientlibrary.lib.worker.ShardConsumer.checkAndSubmitNextTask(ShardConsumer.java:137)
    ... 3 more
Caused by: java.lang.NoSuchMethodError: com.google.protobuf.LazyStringList.getUnmodifiableView()Lcom/google/protobuf/LazyStringList;
    at com.amazonaws.services.kinesis.clientlibrary.types.Messages$AggregatedRecord.<init>(Messages.java:1749)
    at com.amazonaws.services.kinesis.clientlibrary.types.Messages$AggregatedRecord.<init>(Messages.java:1665)
    at com.amazonaws.services.kinesis.clientlibrary.types.Messages$AggregatedRecord$1.parsePartialFrom(Messages.java:1779)
    at com.amazonaws.services.kinesis.clientlibrary.types.Messages$AggregatedRecord$1.parsePartialFrom(Messages.java:1774)
    at com.google.protobuf.AbstractParser.parsePartialFrom(AbstractParser.java:141)
    at com.google.protobuf.AbstractParser.parseFrom(AbstractParser.java:176)
    at com.google.protobuf.AbstractParser.parseFrom(AbstractParser.java:188)
    at com.google.protobuf.AbstractParser.parseFrom(AbstractParser.java:193)
    at com.google.protobuf.AbstractParser.parseFrom(AbstractParser.java:49)
    at com.amazonaws.services.kinesis.clientlibrary.types.Messages$AggregatedRecord.parseFrom(Messages.java:1970)
    at com.amazonaws.services.kinesis.clientlibrary.types.UserRecord.deaggregate(UserRecord.java:235)
    at com.amazonaws.services.kinesis.clientlibrary.lib.worker.ProcessTask.call(ProcessTask.java:146)
    at com.amazonaws.services.kinesis.clientlibrary.lib.worker.MetricsCollectingTaskDecorator.call(MetricsCollectingTaskDecorator.java:49)
    at com.amazonaws.services.kinesis.clientlibrary.lib.worker.MetricsCollectingTaskDecorator.call(MetricsCollectingTaskDecorator.java:24)
    at java.util.concurrent.FutureTask.run(FutureTask.java:266)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
    at java.lang.Thread.run(Thread.java:745)
What happened here? Simply put, the JAR providing the com.google.protobuf classes at runtime was not the one that com.amazonaws.services.kinesis.clientlibrary.types.Messages$AggregatedRecord was compiled against. The failing line (1749) belongs to the following code:
// ...
} finally {
  if (((mutable_bitField0_ & 0x00000001) == 0x00000001)) {
    partitionKeyTable_ = partitionKeyTable_.getUnmodifiableView();
  }
  if (((mutable_bitField0_ & 0x00000002) == 0x00000002)) {
    explicitHashKeyTable_ = explicitHashKeyTable_.getUnmodifiableView();
  }
  if (((mutable_bitField0_ & 0x00000004) == 0x00000004)) {
    records_ = java.util.Collections.unmodifiableList(records_);
  }
  this.unknownFields = unknownFields.build();
  makeExtensionsImmutable();
}
A quick look at the documentation tells more about the getUnmodifiableView() method: it was only added in the 2.6.0 release of Protobuf. The version of this library installed on EMR comes from Hadoop, which uses 2.5.0. This mismatch inevitably leads to dependency hell.
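Note that this error occurs because the consumed messages are stored in aggregated form: as the stack trace shows, the failing call is made while UserRecord.deaggregate parses an AggregatedRecord.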
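When this kind of mismatch is suspected, it can help to ask the JVM which JAR a class really comes from and whether the expected method is there. The sketch below is only an illustration: ClasspathProbe and its two methods are hypothetical names, not part of Spark, the Kinesis Client Library or Protobuf, and it relies solely on standard JDK reflection.

```java
import java.security.CodeSource;
import java.util.Arrays;

/** Hypothetical helper (not part of Spark, the KCL or Protobuf) to inspect a class at runtime. */
final class ClasspathProbe {

    private ClasspathProbe() {}

    /** Returns the location the class was loaded from, or "bootstrap/unknown" when the JVM exposes none. */
    static String locationOf(String className) {
        CodeSource source = load(className).getProtectionDomain().getCodeSource();
        return (source == null || source.getLocation() == null)
                ? "bootstrap/unknown"
                : source.getLocation().toString();
    }

    /** Tells whether the loaded class exposes a public method with the given name, whatever its signature. */
    static boolean hasPublicMethod(String className, String methodName) {
        return Arrays.stream(load(className).getMethods())
                .anyMatch(method -> method.getName().equals(methodName));
    }

    /** Loads the class with the current class loader; a missing class is reported as an unchecked error. */
    private static Class<?> load(String className) {
        try {
            return Class.forName(className);
        } catch (ClassNotFoundException e) {
            throw new IllegalStateException(className + " is not on the classpath", e);
        }
    }
}
```

Called from the code running on the cluster with "com.google.protobuf.LazyStringList" and "getUnmodifiableView", such a probe should indicate whether the Protobuf classes come from the application JAR or from the Hadoop installation, and whether the method required by the Kinesis client is available.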
Shading as solution
The most straightforward solution, at least on the surface, would be to replace the installed Protobuf version. But that's not so simple, and after some research another fix turned out to be easier to implement: a technique called dependency shading.
Let's go back to the beginning. Usually the application code is packaged together with all its dependencies into a single JAR called a fatjar, which is later deployed. The problem is that even if we declare a dependency in a particular version, we have no guarantee that this version will be used at runtime. That happens when 2 or more versions of the same dependency are available on the execution classpath. To resolve the issue, we could make sure that each library is present in only 1 place. But that's not always possible, especially on managed cloud services such as AWS EMR. Shading is a way to make it work every time.
So what's different about a shaded dependency? First of all, it's also included in the fatjar. The difference is that it's not included under its original name. Instead, the shaded dependency is relocated or, more exactly, its packages are renamed. This means 2 things. First, the included dependency changes its package name (e.g. com.google.protobuf becomes com.google.protobuf.shaded). Second, the classes using the shaded dependency have their references rewritten accordingly (e.g. import com.google.protobuf.LazyStringList becomes import com.google.protobuf.shaded.LazyStringList). The change is applied to the bytecode of the compiled classes, which are then assembled into the final fatjar.
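The renaming rule described above can be sketched as a simple mapping on class names. This is only an illustration under stated assumptions: RelocationSketch and relocate are hypothetical names, the package-boundary check is a simplification chosen for the example, and real shading tools rewrite the bytecode of the classes rather than strings.

```java
/** Hypothetical illustration of the renaming rule; real shading tools rewrite bytecode, not strings. */
final class RelocationSketch {

    private RelocationSketch() {}

    /**
     * Moves className from the package tree named by pattern to shadedPattern.
     * Classes outside that package tree are returned unchanged.
     */
    static String relocate(String className, String pattern, String shadedPattern) {
        boolean inPackageTree = className.equals(pattern) || className.startsWith(pattern + ".");
        return inPackageTree
                ? shadedPattern + className.substring(pattern.length())
                : className;
    }
}
```

With the names used in this post, relocating com.google.protobuf.LazyStringList from com.google.protobuf to com.google.protobuf.shaded gives com.google.protobuf.shaded.LazyStringList, while a class such as org.apache.spark.SparkContext keeps its name. Since the Protobuf classes provided by Hadoop still live in com.google.protobuf, both versions can then coexist on the same classpath.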
To use shading in Maven, we can add the Maven Shade Plugin and define the shaded dependencies in the relocations section. The following snippet shows a simple use case:
<plugin>
  <groupId>org.apache.maven.plugins</groupId>
  <artifactId>maven-shade-plugin</artifactId>
  <version>3.0.0</version>
  <executions>
    <execution>
      <phase>package</phase>
      <goals>
        <goal>shade</goal>
      </goals>
      <configuration>
        <shadedArtifactAttached>true</shadedArtifactAttached>
        <shadedClassifierName>allinone</shadedClassifierName>
        <artifactSet>
          <includes>
            <include>*:*</include>
          </includes>
        </artifactSet>
        <relocations>
          <relocation>
            <pattern>com.google.protobuf</pattern>
            <shadedPattern>com.google.protobufv2_6_1</shadedPattern>
          </relocation>
        </relocations>
      </configuration>
    </execution>
  </executions>
</plugin>
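Once the package is built, one way to check that the relocation was applied is to list the content of the produced JAR and look for classes under the shaded package. The following is a minimal sketch using only the JDK: ShadedJarCheck and countClassesInPackage are hypothetical names, and the path of the JAR to inspect is an assumption that depends on the project.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.util.jar.JarFile;

/** Hypothetical helper that checks which packages are really present in a built JAR. */
final class ShadedJarCheck {

    private ShadedJarCheck() {}

    /** Counts the .class entries of the JAR stored under the given package (dot notation) or its subpackages. */
    static long countClassesInPackage(String jarPath, String packageName) {
        String entryPrefix = packageName.replace('.', '/') + "/";
        try (JarFile jar = new JarFile(jarPath)) {
            return jar.stream()
                    .map(entry -> entry.getName())
                    .filter(name -> name.startsWith(entryPrefix) && name.endsWith(".class"))
                    .count();
        } catch (IOException e) {
            throw new UncheckedIOException("Cannot read " + jarPath, e);
        }
    }
}
```

With the configuration shown in this post, the JAR carrying the allinone classifier would be expected to contain classes under com.google.protobufv2_6_1 and none directly under com.google.protobuf.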
Dependency shading is one of the possible solutions to dependency hell. As a reminder, this problem occurs when 2 different and incompatible versions of the same library are present and only one of them is selected at runtime (which is why the code can still compile correctly). This is for instance the case of Protobuf on EMR and the Spark Kinesis client. Plugins that make shading possible exist for most build tools. Besides the Maven plugin described here, we can find for example Shadow for Gradle or sbt-assembly for SBT.