How to generate a schema from a case class in Apache Spark SQL?

A schema helps ensure good data quality and makes data exploration more efficient. However, in some cases defining a schema can be hard. That's especially true when your dataset has many nested levels, which are often used to speed up processing by avoiding joins with other datasets. It's also tedious when the schema has a lot of fields and you have to define them all manually.

Fortunately, Apache Spark provides a pretty smart solution to simplify the schema definition. The only requirement is to have a case class that models the schema. You can then use the ScalaReflection object from the org.apache.spark.sql.catalyst package this way:

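import org.apache.spark.sql.catalyst.ScalaReflection
import org.apache.spark.sql.types.{ArrayType, StringType, StructField, StructType}
import org.scalatest.{FlatSpec, Matchers}

class AutomaticSchemaResolutionTest extends FlatSpec with Matchers {

  "schema" should "be resolved automatically through reflection" in {
    // schemaFor returns a (dataType, nullable) pair; for a case class the dataType is a StructType
    val schema = ScalaReflection.schemaFor[Alphabet].dataType.asInstanceOf[StructType]

    schema.fields should have size 3
    schema.fields should contain allOf(StructField("letter1", StringType, true),
      StructField("letter2", StringType, true), StructField("otherLetters", ArrayType(StringType, true), true))
  }

}

case class Alphabet(letter1: String, letter2: String, otherLetters: Seq[String])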
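
Once the schema is resolved, it can be passed to a reader like any hand-written StructType. The snippet below is a minimal sketch rather than a reference implementation. It assumes Spark 2.2 or later (for reading JSON from a Dataset[String]) and a local master. The object name SchemaFromCaseClassExample, the helper readAlphabets and the sample JSON line are hypothetical names and data introduced only for illustration.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.catalyst.ScalaReflection
import org.apache.spark.sql.types.StructType

case class Alphabet(letter1: String, letter2: String, otherLetters: Seq[String])

// Hypothetical example object, not part of Spark
object SchemaFromCaseClassExample {

  // Schema derived from the case class through reflection, as in the test above
  val alphabetSchema: StructType =
    ScalaReflection.schemaFor[Alphabet].dataType.asInstanceOf[StructType]

  // Hypothetical helper: reads JSON lines with the derived schema instead of schema inference
  def readAlphabets(spark: SparkSession, jsonLines: Seq[String]): DataFrame = {
    import spark.implicits._
    spark.read.schema(alphabetSchema).json(jsonLines.toDS())
  }

  def main(args: Array[String]): Unit = {
    // Assumption: a local session is enough for this illustration
    val spark = SparkSession.builder()
      .appName("schema-from-case-class")
      .master("local[*]")
      .getOrCreate()

    // Made-up sample record matching the Alphabet case class
    val sample = Seq("""{"letter1": "a", "letter2": "b", "otherLetters": ["c", "d"]}""")
    val alphabets = readAlphabets(spark, sample)

    alphabets.printSchema()
    alphabets.show(truncate = false)

    spark.stop()
  }
}
```

Note that ScalaReflection lives in the catalyst package, which Spark treats as internal, so its behavior may change between versions.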