Data types in Apache Parquet

Versions: Parquet 1.9.0

Data in Apache Parquet files is written against a specific schema, and the schema automatically determines the data types of the fields composing it.
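To give a concrete idea, here is a minimal sketch (the message and field names are invented for the illustration) showing how such a schema, with a data type attached to every field, can be declared textually and parsed with org.apache.parquet.schema.MessageTypeParser:

import org.apache.parquet.schema.MessageType;
import org.apache.parquet.schema.MessageTypeParser;

public class UserSchemaExample {
    public static void main(String[] args) {
        // every field declares a repetition, a data type and, optionally, a logical annotation
        MessageType schema = MessageTypeParser.parseMessageType(
            "message user {\n" +
            "  required int64 id;\n" +
            "  required binary email (UTF8);\n" +
            "}");
        System.out.println(schema);
    }
}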

Through this post we'll discover what data types are stored in Apache Parquet files. The first part describes the basic types provided natively by the framework. The second section explains, from the data types point of view, the interoperability between Parquet and serialization frameworks such as Avro or Protobuf.

Primitive and logical data types

In Parquet we can distinguish 2 families of types: primitive and logical. The latter are an abstraction over the former. The difference between them is the "friendliness" of the definition. For instance, instead of defining a text field as an array of bytes, we can simply annotate it with the appropriate logical type. The annotation is made through the org.apache.parquet.schema.Types.Builder#as(OriginalType) method.

Let's first see what logical types are available. Among others, the org.apache.parquet.schema.OriginalType enum defines:

- UTF8 - a UTF-8 encoded character string
- ENUM - an enumeration
- DECIMAL - a decimal number with a given precision and scale
- DATE, TIME_MILLIS and TIMESTAMP_MILLIS - temporal values
- INT_8, INT_16, INT_32 and INT_64, with their unsigned counterparts UINT_8, UINT_16, UINT_32 and UINT_64 - integers of an explicit width
- JSON and BSON - embedded documents
- INTERVAL - a time interval
- MAP, MAP_KEY_VALUE and LIST - annotations for the nested types described later

All of the values listed above are backed by one of the following primitive types:

- BOOLEAN - a 1-bit boolean
- INT32 - a 32-bit signed integer
- INT64 - a 64-bit signed integer
- INT96 - a 96-bit signed integer, historically used to store timestamps
- FLOAT - a 32-bit IEEE 754 floating point number
- DOUBLE - a 64-bit IEEE 754 floating point number
- BINARY - an arbitrarily long byte array
- FIXED_LEN_BYTE_ARRAY - a byte array of a fixed, declared length

Besides primitives, Apache Parquet also provides nested types. They're handled by org.apache.parquet.schema.GroupType, combined with the repetition that can take 1 of 3 values: required (exactly 1 occurrence, typically used for primitive fields), optional (0 or 1 occurrence) or repeated (0, 1 or more occurrences). Among the nested types, illustrated by the sketch after this list, we can distinguish:

- lists - groups annotated with LIST, holding a repeated field
- maps - groups annotated with MAP, holding repeated key-value pairs
- plain groups - unannotated structures of named fields, comparable to the objects of object-oriented languages
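As an illustration of the repetitions, here is a minimal sketch (the user fields are invented for the example) building a group that mixes the 3 of them:

import static org.apache.parquet.schema.PrimitiveType.PrimitiveTypeName.BINARY;
import static org.apache.parquet.schema.PrimitiveType.PrimitiveTypeName.INT64;
import static org.apache.parquet.schema.Type.Repetition.REQUIRED;

import org.apache.parquet.schema.GroupType;
import org.apache.parquet.schema.OriginalType;
import org.apache.parquet.schema.Types;

public class RepetitionsExample {
    public static void main(String[] args) {
        GroupType user = Types.buildGroup(REQUIRED)
            // required = exactly 1 occurrence
            .required(INT64).named("id")
            // optional = 0 or 1 occurrence
            .optional(BINARY).as(OriginalType.UTF8).named("nickname")
            // repeated = 0, 1 or more occurrences
            .repeated(BINARY).as(OriginalType.UTF8).named("emails")
            .named("user");
        System.out.println(user);
    }
}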

Parquet types interoperability

As you know from the introduction to Apache Parquet, the framework provides integrations with a lot of other Open Source projects, such as Avro, Hive, Protobuf or Arrow. You correctly deduce that none of these systems was written expressly against Parquet's data types. But somehow they communicate together pretty well.

The communication between 2 systems having different data types is made through the intermediary of converters. For instance, in the case of Parquet - Avro, the interoperability is provided by the org.apache.parquet.avro.AvroSchemaConverter#convert(org.apache.avro.Schema) method. The same approach is used for Parquet - Protobuf compatibility, where an org.apache.parquet.proto.ProtoSchemaConverter is defined.
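To illustrate the Avro case, here is a minimal sketch (the User record is invented for the example; it assumes parquet-avro and avro on the classpath):

import org.apache.avro.Schema;
import org.apache.avro.SchemaBuilder;
import org.apache.parquet.avro.AvroSchemaConverter;
import org.apache.parquet.schema.MessageType;

public class AvroConversionExample {
    public static void main(String[] args) {
        // an Avro record with a string and an int field
        Schema avroSchema = SchemaBuilder.record("User")
            .fields()
            .requiredString("login")
            .requiredInt("age")
            .endRecord();

        // the converter maps every Avro type to its Parquet equivalent,
        // e.g. string becomes binary (UTF8) and int becomes int32
        MessageType parquetSchema = new AvroSchemaConverter().convert(avroSchema);
        System.out.println(parquetSchema);
    }
}

Running it should print a schema close to: message User { required binary login (UTF8); required int32 age; } - the fields are required because Avro declares them as non-null.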

However, Parquet doesn't work only with serialization libraries. It can also be used by query engines, such as Hive. The integration model doesn't change, though: Hive also uses converters to map its data types to the ones supported by Parquet.

Parquet types examples

The learning tests below show some use cases of data types in Parquet:

import static org.apache.parquet.schema.OriginalType.DECIMAL;
import static org.apache.parquet.schema.OriginalType.MAP;
import static org.apache.parquet.schema.OriginalType.UTF8;
import static org.apache.parquet.schema.PrimitiveType.PrimitiveTypeName.BINARY;
import static org.apache.parquet.schema.PrimitiveType.PrimitiveTypeName.BOOLEAN;
import static org.apache.parquet.schema.PrimitiveType.PrimitiveTypeName.FIXED_LEN_BYTE_ARRAY;
import static org.apache.parquet.schema.PrimitiveType.PrimitiveTypeName.INT32;
import static org.apache.parquet.schema.PrimitiveType.PrimitiveTypeName.INT64;
import static org.apache.parquet.schema.PrimitiveType.PrimitiveTypeName.INT96;
import static org.apache.parquet.schema.Type.Repetition.REPEATED;
import static org.apache.parquet.schema.Type.Repetition.REQUIRED;
import static org.assertj.core.api.Assertions.assertThat;
import static org.assertj.core.api.Assertions.assertThatExceptionOfType;

import org.apache.parquet.schema.GroupType;
import org.apache.parquet.schema.OriginalType;
import org.apache.parquet.schema.Type;
import org.apache.parquet.schema.Types;
import org.junit.Test;

@Test
public void should_create_map_of_integer_string_pairs() {
    Type letter = Types.required(BINARY).as(UTF8).named("letter");
    Type number = Types.required(INT32).named("number");
    GroupType map = Types.buildGroup(REQUIRED).as(OriginalType.MAP).addFields(letter, number).named("numbers_letters");

    String stringRepresentation = getStringRepresentation(map);

    assertThat(stringRepresentation).isEqualTo("required group numbers_letters (MAP) {\n" +
            "  required binary letter (UTF8);\n" +
            "  required int32 number;\n" +
            "}");
}

@Test
public void should_create_a_list() {
    Type letterField = Types.required(BINARY).as(UTF8).named("letter");
    GroupType lettersList = Types.buildGroup(REPEATED).as(OriginalType.LIST).addField(letterField).named("letters");

    String stringRepresentation = getStringRepresentation(lettersList);

    assertThat(stringRepresentation).isEqualTo("repeated group letters (LIST) {\n" +
            "  required binary letter (UTF8);\n" +
            "}");
}

@Test
public void should_create_int96_type() {
    Type bigNumberField = Types.required(INT96).named("big_number");

    String stringRepresentation = getStringRepresentation(bigNumberField);

    assertThat(stringRepresentation).isEqualTo("required int96 big_number");
}

@Test
public void should_create_boolean_type() {
    Type isPairFlagField = Types.required(BOOLEAN).named("is_pair");

    String stringRepresentation = getStringRepresentation(isPairFlagField);

    assertThat(stringRepresentation).isEqualTo("required boolean is_pair");
}

@Test
public void should_fail_on_applying_complex_type_to_primitive_type() {
    assertThatExceptionOfType(IllegalStateException.class).isThrownBy(() -> {
        Types.optional(FIXED_LEN_BYTE_ARRAY).length(10).as(MAP).named("letters");
    }).withMessageContaining("MAP can not be applied to a primitive type");
}

@Test
public void should_create_fixed_length_array_type() {
    Type salary = Types.optional(FIXED_LEN_BYTE_ARRAY).length(10).precision(4).as(DECIMAL).named("salary");

    String stringRepresentation = getStringRepresentation(salary);

    assertThat(stringRepresentation).isEqualTo("optional fixed_len_byte_array(10) salary (DECIMAL(4,0))");
}

@Test
public void should_create_simple_string_type() {
    Type textType = Types.required(BINARY).as(UTF8).named("text");

    String stringRepresentation = getStringRepresentation(textType);

    assertThat(stringRepresentation).isEqualTo("required binary text (UTF8)");
}

@Test
public void should_create_complex_type() {
    // Parquet also allows the creation of "complex" (nested) types that are
    // similar to objects from object-oriented languages
    Type userType = Types.requiredGroup()
        .required(INT64).named("id")
        .required(BINARY).as(UTF8).named("email")
        .named("User");

    String stringRepresentation = getStringRepresentation(userType);

    assertThat(stringRepresentation).isEqualTo("required group User {\n" +
            "  required int64 id;\n" +
            "  required binary email (UTF8);\n" +
            "}");
}

private static String getStringRepresentation(Type type) {
    // writeToStringBuilder appends the textual representation of the type,
    // with the given indentation prefix, to the builder
    StringBuilder stringBuilder = new StringBuilder();
    type.writeToStringBuilder(stringBuilder, "");
    return stringBuilder.toString();
}

Data types are an inherent part of Apache Parquet. They are used not only to define the schema but also come with specific associated optimization techniques, such as encoding or compression. As we saw in the first section, Parquet provides the main primitive types, which can be aliased with more user-friendly logical types. The second part presented the converters that are widely used in the project to integrate external serialization libraries and query engines such as Hive.