Type specialization in Scala

Versions: Scala 2.12.1

When I was analyzing one of the Apache Spark GraphX functions for the first time, I came across a class annotated with @specialized. I decided to learn more about this annotation and share my findings in this post.

In the first section of the post I will explain the basic points about @specialized. In the next one, I will show how to use it. In the final part, I'll run a micro-benchmark to analyze the real impact of specialization on the code.

Definition

Scala uses the @specialized annotation to apply type specialization to the compiled classes. The specialization occurs at compile time and consists of generating versions of generic classes for specific types.

It was quite abstract for me too at the beginning, but an example helped me to follow. Let's say we have a generic class like CustomSequence[T]. If we apply the Long type specialization to it, the compiler will generate a Long-specific variant of the class (its name carries a $mcJ$sp suffix, as the compiled classes listed later in this post show) and use it everywhere we write val longSeq = new CustomSequence[Long]. I chose Long on purpose, because the type specialization applies only to the primitive types.
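To make the idea more concrete, here is a minimal sketch of such a class. CustomSequence, its two constructor parameters and the CustomSequenceDemo object are hypothetical names used only for illustration, and the comments about class names assume a Scala 2 compiler such as the 2.12.1 used in this post.

```scala
// Hypothetical class, specialized only for Long.
class CustomSequence[@specialized(Long) T](first: T, second: T) {
  def head: T = first
  def last: T = second
}

object CustomSequenceDemo {
  // The static type is known to be Long here, so the compiler can pick the Long variant.
  val longSeq = new CustomSequence[Long](1L, 2L)
  // String is a reference type, so the plain generic class is used.
  val stringSeq = new CustomSequence[String]("a", "b")

  def main(args: Array[String]): Unit = {
    // With Scala 2 the name should end with CustomSequence$mcJ$sp.
    println(longSeq.getClass.getName)
    // The generic class keeps its regular name, without any $mc...$sp suffix.
    println(stringSeq.getClass.getName)
  }
}
```

Printing the runtime class names is only a quick way to see which variant was instantiated. The compiled bytecode remains the authoritative source.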

Of course, the type specialization doesn't come for free. It slows down the compilation since the compiler has some extra work to do and extra class files to write. The negative impact on the compilation can be really big. If you take a generic class with 3 specialized type parameters, like BigGeneric[T1, T2, T3], the compiler will need to generate a variant for every combination of primitive types, so the number of generated classes multiplies with each additional type parameter.
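The growth can be estimated with a bit of arithmetic. The sketch below assumes one generated variant per combination of specialized primitives and nine primitive targets (Byte, Short, Char, Int, Long, Float, Double, Boolean and Unit). The SpecializationCount object is a hypothetical helper, and the real number of class files may differ between compiler versions, so treat the result as an order of magnitude.

```scala
object SpecializationCount {
  // Assumption: one variant per combination of specialized primitive types.
  def variants(typesPerParameter: Int, typeParameters: Int): Long =
    BigInt(typesPerParameter).pow(typeParameters).toLong

  def main(args: Array[String]): Unit = {
    println(variants(9, 1)) // 9: one type parameter, all primitives
    println(variants(9, 3)) // 729: BigGeneric[T1, T2, T3], all primitives
    println(variants(2, 3)) // 8: the same class restricted to 2 types per parameter
  }
}
```

The last line hints at the usual mitigation: listing only the primitive types that really matter keeps the number of variants small.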

On the other hand, the specialization may have a positive impact on the runtime because it helps to avoid boxing and unboxing during the code execution. If you don't take my word for it, I will try to convince you in the last section.

Use in Scala

At first glance the type specialization looks easy. Let's see now how to use it in Scala with the already mentioned @specialized annotation. Since it's difficult to illustrate the type specialization with the usual learning tests, we'll do it by analyzing the compiled classes.

@specialized can be used in 2 different ways: without or with a list of the specialized types. You can see both in the following examples:

class GlobalSpecialization[@specialized T] {

  def get(item: T) = item

}

class ReducedSpecialization[@specialized(Long) T] {
  def get(item: T) = item
}

If we take a look at the compiled classes, we should see:

GlobalSpecialization.class
GlobalSpecialization$mcB$sp.class
GlobalSpecialization$mcC$sp.class
GlobalSpecialization$mcD$sp.class
GlobalSpecialization$mcF$sp.class
GlobalSpecialization$mcI$sp.class
GlobalSpecialization$mcJ$sp.class
GlobalSpecialization$mcS$sp.class
GlobalSpecialization$mcV$sp.class
GlobalSpecialization$mcZ$sp.class
ReducedSpecialization.class
ReducedSpecialization$mcJ$sp.class
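The single letter between $mc and $sp is the JVM type descriptor of the primitive the variant was generated for, for instance J for Long and Z for Boolean. The small helper below decodes such file names. SpecializedNameDecoder is a hypothetical name introduced for this sketch, and it assumes that variants with several specialized type parameters simply concatenate the letters.

```scala
object SpecializedNameDecoder {
  // JVM primitive descriptors and the matching Scala types (V stands for Unit).
  private val descriptorToType: Map[Char, String] = Map(
    'B' -> "Byte", 'C' -> "Char", 'D' -> "Double", 'F' -> "Float", 'I' -> "Int",
    'J' -> "Long", 'S' -> "Short", 'V' -> "Unit", 'Z' -> "Boolean"
  )

  // Matches names like GlobalSpecialization$mcJ$sp, with an optional .class suffix.
  private val SpecializedName = """^(.+)\$mc([BCDFIJSVZ]+)\$sp(?:\.class)?$""".r

  // Returns the base class name and the specialized types, or None for a generic class.
  def decode(fileName: String): Option[(String, List[String])] = fileName match {
    case SpecializedName(base, codes) => Some((base, codes.toList.map(descriptorToType)))
    case _ => None
  }

  def main(args: Array[String]): Unit = {
    println(decode("GlobalSpecialization$mcJ$sp.class")) // Some((GlobalSpecialization,List(Long)))
    println(decode("GlobalSpecialization.class"))        // None
  }
}
```

Read this way, the listing shows 9 variants for GlobalSpecialization, one per primitive, and a single Long variant for ReducedSpecialization.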

A not specialized class would generate only 1 compiled file. Let's add one to our test package to see what the compiler does when it encounters specialized and not specialized code:

class NotSpecialized[T] {

  def get(item: T) = item

}

class Tests {

  val longNotSpecialized = new NotSpecialized[Long]()
  longNotSpecialized.get(3L) + 4L
  val longReducedSpecialized = new ReducedSpecialization[Long]()
  longReducedSpecialized.get(4L) + 5L

}

If we take a look at Tests.class with the javap -v Tests.class command, we should see that the compiler adds boxing and unboxing calls for the not specialized type and doesn't do that for the specialized one:

# Not specialized class
9: invokespecial #29                 // Method com/waitingforcode/specialization/NotSpecialized."<init>":()V
12: putfield      #17                 // Field longNotSpecialized:Lcom/waitingforcode/specialization/NotSpecialized;
15: aload_0
16: invokevirtual #31                 // Method longNotSpecialized:()Lcom/waitingforcode/specialization/NotSpecialized;
19: ldc2_w        #32                 // long 3l
22: invokestatic  #39                 // Method scala/runtime/BoxesRunTime.boxToLong:(J)Ljava/lang/Long;
25: invokevirtual #43                 // Method com/waitingforcode/specialization/NotSpecialized.get:(Ljava/lang/Object;)Ljava/lang/Object;
28: invokestatic  #47                 // Method scala/runtime/BoxesRunTime.unboxToLong:(Ljava/lang/Object;)J
31: ldc2_w        #48                 // long 4l
34: ladd

# Specialized class
37: new           #51                 // class com/waitingforcode/specialization/ReducedSpecialization$mcJ$sp
40: dup
41: invokespecial #52                 // Method com/waitingforcode/specialization/ReducedSpecialization$mcJ$sp."<init>":()V
44: putfield      #22                 // Field longReducedSpecialized:Lcom/waitingforcode/specialization/ReducedSpecialization;
47: aload_0
48: invokevirtual #54                 // Method longReducedSpecialized:()Lcom/waitingforcode/specialization/ReducedSpecialization;
51: ldc2_w        #48                 // long 4l
54: invokevirtual #60                 // Method com/waitingforcode/specialization/ReducedSpecialization.get$mcJ$sp:(J)J
57: ldc2_w        #61                 // long 5l

Just to show you that I didn't hide the boxing in the get method of the ReducedSpecialization class, you can find its bytecode in the next snippet:

  public long get$mcJ$sp(long);
    descriptor: (J)J
    flags: ACC_PUBLIC
    Code:
      stack=2, locals=3, args_size=2
         0: lload_1
         1: lreturn
      LocalVariableTable:
        Start  Length  Slot  Name   Signature
            0       2     0  this   Lcom/waitingforcode/specialization/ReducedSpecialization$mcJ$sp;
            0       2     1  item   J
      LineNumberTable:
        line 5: 0
    MethodParameters:
      Name                           Flags
      item                           final
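The boxing performed on the not specialized path can also be observed from plain Scala code, without reading the bytecode. In the sketch below, NotSpecializedProbe and BoxingProbe are hypothetical names. Because the type parameter is erased to Object, the Long argument has to be wrapped before it reaches the method.

```scala
class NotSpecializedProbe[T] {
  // T is erased to Object, so a primitive argument arrives here already boxed.
  def runtimeClassOf(item: T): String = item.asInstanceOf[AnyRef].getClass.getName
}

object BoxingProbe {
  def main(args: Array[String]): Unit = {
    // Prints java.lang.Long: 3L was boxed at the call site.
    println(new NotSpecializedProbe[Long].runtimeClassOf(3L))
  }
}
```

This is the runtime counterpart of the BoxesRunTime.boxToLong call visible in the not specialized bytecode.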

Specialized type impact on runtime

To check the impact of the specialized types on the Scala runtime I'll use JMH, exactly like in the post about structural types. Since the build.sbt is the same, I'll omit it here for brevity. Let's focus instead on the tested classes:

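import java.util.concurrent.TimeUnit

import org.openjdk.jmh.annotations.{Benchmark, BenchmarkMode, Mode, OutputTimeUnit}

@OutputTimeUnit(TimeUnit.MILLISECONDS)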
@BenchmarkMode(Array(Mode.All))
class SpecializedTypeMicroBenchmark {

  @Benchmark
  def verify_specialized: Unit = {
    val specialized = new SpecializedType[Int]
    (0 to 1000000).map(nr => specialized.item(nr))
  }

  @Benchmark
  def verify_not_specialized: Unit = {
    val notSpecialized = new NotSpecializedType[Int]
    (0 to 1000000).map(nr => notSpecialized.item(nr))
  }

}

class NotSpecializedType[T] {
  def item(item: T) = item
}

class SpecializedType[@specialized T] {
  def item(item: T) = item
}

After executing the code with sbt jmh:run -i 20 -wi 10 -f1 -t1 -rf text, I got the following results:

Benchmark                                                                              Mode   Cnt   Score   Error   Units
SpecializedTypeMicroBenchmark.verify_not_specialized                                  thrpt    20   0.079 ± 0.010  ops/ms
SpecializedTypeMicroBenchmark.verify_specialized                                      thrpt    20   0.095 ± 0.015  ops/ms
SpecializedTypeMicroBenchmark.verify_not_specialized                                   avgt    20  21.024 ± 8.833   ms/op
SpecializedTypeMicroBenchmark.verify_specialized                                       avgt    20  12.423 ± 1.765   ms/op
SpecializedTypeMicroBenchmark.verify_not_specialized                                 sample  1394  14.427 ± 0.458   ms/op
SpecializedTypeMicroBenchmark.verify_not_specialized:verify_not_specialized·p0.00    sample         8.569           ms/op
SpecializedTypeMicroBenchmark.verify_not_specialized:verify_not_specialized·p0.50    sample        13.058           ms/op
SpecializedTypeMicroBenchmark.verify_not_specialized:verify_not_specialized·p0.90    sample        21.332           ms/op
SpecializedTypeMicroBenchmark.verify_not_specialized:verify_not_specialized·p0.95    sample        25.059           ms/op
SpecializedTypeMicroBenchmark.verify_not_specialized:verify_not_specialized·p0.99    sample        32.775           ms/op
SpecializedTypeMicroBenchmark.verify_not_specialized:verify_not_specialized·p0.999   sample        43.424           ms/op
SpecializedTypeMicroBenchmark.verify_not_specialized:verify_not_specialized·p0.9999  sample        43.450           ms/op
SpecializedTypeMicroBenchmark.verify_not_specialized:verify_not_specialized·p1.00    sample        43.450           ms/op
SpecializedTypeMicroBenchmark.verify_specialized                                     sample  1876  10.736 ± 0.431   ms/op
SpecializedTypeMicroBenchmark.verify_specialized:verify_specialized·p0.00            sample         6.111           ms/op
SpecializedTypeMicroBenchmark.verify_specialized:verify_specialized·p0.50            sample         9.052           ms/op
SpecializedTypeMicroBenchmark.verify_specialized:verify_specialized·p0.90            sample        15.188           ms/op
SpecializedTypeMicroBenchmark.verify_specialized:verify_specialized·p0.95            sample        20.064           ms/op
SpecializedTypeMicroBenchmark.verify_specialized:verify_specialized·p0.99            sample        38.676           ms/op
SpecializedTypeMicroBenchmark.verify_specialized:verify_specialized·p0.999           sample        55.613           ms/op
SpecializedTypeMicroBenchmark.verify_specialized:verify_specialized·p0.9999          sample        74.580           ms/op
SpecializedTypeMicroBenchmark.verify_specialized:verify_specialized·p1.00            sample        74.580           ms/op
SpecializedTypeMicroBenchmark.verify_not_specialized                                     ss    20  26.855 ± 9.885   ms/op
SpecializedTypeMicroBenchmark.verify_specialized                                         ss    20  17.305 ± 5.425   ms/op

On average, the specialized version performs better than the not specialized one. We can notice that already in the throughput metric (thrpt), where the former reaches almost 0.1 operations per ms while the latter stays close to 0.08, although the error margins of both measurements overlap. The average time (avgt) tells the same story: the not specialized code needs about 21 ms per operation against about 12.4 ms for the specialized one, so roughly 1.7 times more. Quite surprising is the result for the sample time (sample) measure, where the worst case (p1.00) of the specialized version, 74.58 ms, is worse than the 43.45 ms of the not specialized code, even though its mean and median are lower. It doesn't mean that the specialization is bad though. It also does well on the cold start (ss), with 17.3 ms against 26.9 ms.

Maybe you won't use the type specialization frequently. In this article, I didn't try to convince you to change your code and put the @specialized annotation everywhere. Doing so would probably slow down the compilation without bringing a lot of advantages at runtime. However, if your application starts to slow down and primitive type boxing is the reason, the specialization is one of the possible solutions. As shown in the second section, the mechanism is quite easy to use: it comes down to adding the @specialized annotation with, optionally, the list of specialized types.