Generated method too long to be JIT compiled

Versions: Apache Spark 3.3.0 https://github.com/bartosz25/spark-playground/tree/master/spark-sql-jit-compilation

There are days like that. You inherit some code and it doesn't really work as expected. While digging into the issues you find the usual weird warnings but also several new things. For me, one of these things was the "Generated method too long to be JIT compiled..." info message.


JIT-compiled?

First, what does it mean that a method (or code in general) is JIT-compiled? To understand this term we need to recall some basics about the JVM. Each JVM-based (Scala, Java, ...) application is first compiled into a bytecode representation. This representation is later used by the JVM in 2 different ways:

- it can be interpreted, meaning that each bytecode instruction is translated and executed on the fly;
- the most frequently executed methods (the hot ones) can be compiled by the JIT compiler into native machine code that is called directly afterwards.

This magic happens at runtime and depends on several factors, such as the number of method calls. As you can see, the JIT compiler monitors what happens inside the JVM during the execution, which explains why JIT stands for Just-In-Time compilation!
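
You can observe the JIT at work by asking a HotSpot JVM to log every compilation it performs. Below is a minimal sketch (the HotLoop class and the loop size are illustrative, not taken from a real project); compile it with scalac and run it with scala -J-XX:+PrintCompilation HotLoop to see the square method show up in the compilation log once it becomes hot:

object HotLoop {
  // a tiny method called often enough to cross the JIT invocation threshold
  def square(i: Int): Int = i * i

  def main(args: Array[String]): Unit = {
    var sum = 0L
    // the repeated calls make square "hot"; with -XX:+PrintCompilation
    // enabled, the JVM should log its compilation to native code
    for (i <- 0 until 1000000) sum += square(i)
    println(sum)
  }
}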

In addition to the native code compilation and direct calls, the JIT provides several other optimizations, such as code reorganization or memory tuning. I'm explaining some of them in the JVM JIT optimizations blog post.

Why too long?

That's the theory but, as usual, the practice might be different. Sometimes a method won't be JIT-compiled, even if the JIT is enabled at the JVM level! Why? One of the reasons is the method size. Compiling big methods at runtime involves the following:

- spending a significant amount of CPU time on the compilation itself, with no guarantee that the native version will ever pay off;
- storing a bigger chunk of native code in the code cache, which is a limited memory area.

That's why methods bigger than 8000 bytes of bytecode are skipped in the compilation step. This behavior is not strict, though. You can turn it off by disabling the DontCompileHugeMethods JVM flag (-XX:-DontCompileHugeMethods) but it might have serious global implications.
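
For an Apache Spark job, this flag can be passed to the executor JVMs at launch time. A minimal sketch (the application name is illustrative, and since the flag impacts every method executed on the executors, treat it as a last resort rather than a recommendation):

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("dont-compile-huge-methods")
  // asks HotSpot on the executors to also JIT-compile the methods bigger
  // than the 8000-byte limit; it applies to every method, not only the
  // Spark-generated ones
  .config("spark.executor.extraJavaOptions", "-XX:-DontCompileHugeMethods")
  .getOrCreate()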

CodeGenerator in Apache Spark

Apache Spark relies on the JIT compilation in the CodeGenerator class that compiles the code generated by the WholeStageCodegenExec physical operator. Let's take a look at what happens when it has to compile a simple if-else statement, like the one produced by the query below.
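
A minimal sketch (not necessarily the snippet from the repository linked above): a when/otherwise column compiles down to an if-else block in the generated Java code, and Spark's debugCodegen helper prints the Java source that CodeGenerator compiles later on:

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, when}
import org.apache.spark.sql.execution.debug._

object IfElseCodegen extends App {
  val spark = SparkSession.builder().master("local[*]")
    .appName("if-else-codegen").getOrCreate()
  import spark.implicits._

  val numbers = (1 to 10).toDF("nr")
    // when/otherwise becomes an if-else statement in the generated code
    .withColumn("parity", when(col("nr") % 2 === 0, "even").otherwise("odd"))

  // prints the Java source generated by the whole-stage code generation
  numbers.debugCodegen()
}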

CodeGenerator passes this generated source to the Janino compiler to create the Java bytecode and returns the bytecode statistics. It's inside this statistics-related task that the first warning appears:

object CodeGenerator extends Logging {
// ...
  private def updateAndGetCompilationStats(evaluator: ClassBodyEvaluator): ByteCodeStats = {
// ...
            if (byteCodeSize > DEFAULT_JVM_HUGE_METHOD_LIMIT) {
              logInfo("Generated method too long to be JIT compiled: " +
                s"${cf.getThisClassName}.${method.getName} is $byteCodeSize bytes")
            }
// ...

But it's not the only bytecode protection in the whole-stage code generation. The second one can invalidate the whole generated code if the bytecode size of the biggest compiled method is larger than spark.sql.codegen.hugeMethodLimit (65535 by default; 8000 is the recommended value if the job runs on a HotSpot JVM). When it happens, you'll find another message in the logs:

case class WholeStageCodegenExec(child: SparkPlan)(val codegenStageId: Int)
    extends UnaryExecNode with CodegenSupport {
// ...
  override def doExecute(): RDD[InternalRow] = {
// ...
    if (compiledCodeStats.maxMethodCodeSize > conf.hugeMethodLimit) {
      logInfo(s"Found too long generated codes and JIT optimization might not work: " +
        s"the bytecode size (${compiledCodeStats.maxMethodCodeSize}) is above the limit " +
        s"${conf.hugeMethodLimit}, and the whole-stage codegen was disabled " +
        s"for this plan (id=$codegenStageId). To avoid this, you can raise the limit " +
        s"`${SQLConf.WHOLESTAGE_HUGE_METHOD_LIMIT.key}`:\n$treeString")
      return child.execute()
    }
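
Rather than raising the limit as the log message suggests, you can also lower it to the 8000 bytes recommended for HotSpot, so that the plans the JIT would skip fall back to the interpreted execution path early. A minimal sketch of that configuration (the application name is illustrative):

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("huge-method-limit")
  // align the whole-stage codegen fallback with the HotSpot JIT threshold
  .config("spark.sql.codegen.hugeMethodLimit", "8000")
  .getOrCreate()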

When the whole-stage codegen is disabled, each of the defined transformations is executed separately. I blogged about this topic in the why of code generation in Apache Spark SQL.
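
You can provoke this fallback yourself for testing purposes. A minimal sketch (the threshold of 1 is artificially low, only there to trigger the code path): with such a limit every generated method exceeds spark.sql.codegen.hugeMethodLimit, so with INFO logging enabled you should see the "Found too long generated codes..." message and the operators will run separately:

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.col

object HugeMethodLimitFallback extends App {
  val spark = SparkSession.builder().master("local[*]")
    .appName("fallback-demo")
    // artificially low threshold: every compiled method is "too long"
    .config("spark.sql.codegen.hugeMethodLimit", "1")
    .getOrCreate()
  import spark.implicits._

  // any whole-stage codegen plan now triggers the fallback message
  (1 to 10).toDF("nr").filter(col("nr") > 5).show()
}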

A skipped JIT compilation or a disabled whole-stage code generation isn't the end of the world and your job may still run correctly. However, the performance will very probably suffer and you might need to refactor your code to take full advantage of these JVM optimizations.