Data in Apache Parquet files is written against a specific schema, and a schema automatically implies data types for the fields that compose it.
Through this post we'll discover what data types are stored in Apache Parquet files. The first part describes the basic types provided natively by the framework. The second section explains the interoperability between Parquet and serialization frameworks such as Avro or Protobuf from the data types point of view.
Primitive and logical data types
In Parquet we can distinguish 2 families of types: primitive and logical. The latter are an abstraction over the former. The difference between them is the "friendliness" of the definition. For instance, instead of defining a text field as a plain array of bytes, we can simply annotate it with the appropriate logical type. The annotation is made through the org.apache.parquet.schema.Types.Builder#as(OriginalType) method.
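As a quick, hedged illustration (the class and field names below are arbitrary), here is a minimal sketch of that annotation: the same BINARY primitive once left as raw bytes and once annotated with the UTF8 logical type:

import org.apache.parquet.schema.OriginalType;
import org.apache.parquet.schema.PrimitiveType.PrimitiveTypeName;
import org.apache.parquet.schema.Type;
import org.apache.parquet.schema.Types;

public class LogicalTypeAnnotationExample {
  public static void main(String[] args) {
    // a raw array of bytes - readers only see opaque binary data
    Type rawBytes = Types.required(PrimitiveTypeName.BINARY).named("raw_content");
    // the same primitive annotated as UTF8 - readers interpret it as text
    Type text = Types.required(PrimitiveTypeName.BINARY).as(OriginalType.UTF8).named("text_content");

    System.out.println(rawBytes); // required binary raw_content
    System.out.println(text);     // required binary text_content (UTF8)
  }
}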
Let's first see what logical types are available:
- text (aka UTF8)
- enumeration
- UUID
- signed and unsigned integer
- decimals
- date
- time with milliseconds precision
- timestamp with milliseconds or microseconds precision
- interval of time
- JSON and BSON
- list
- map
All of the values listed above are backed by one of the following primitive types (a short sketch of some int32- and int64-backed annotations follows the list):
- boolean - used to represent true/false values
- binary - stores data as an array of bytes. The array can have a fixed or variable length. As you can deduce, it backs the following logical types:
- UTF8 - the byte array is interpreted as an array of UTF-8 encoded characters
- enumerations - most often they have the same representation as a UTF8 string. For instance it's the case of Avro, and we can discover that through org.apache.parquet.schema.PrimitiveType.PrimitiveTypeNameConverter#convertBINARY(PrimitiveTypeName):
public Schema convertBINARY(PrimitiveTypeName primitiveTypeName) {
  if (annotation == OriginalType.UTF8 || annotation == OriginalType.ENUM) {
    return Schema.create(Schema.Type.STRING);
  } else {
    return Schema.create(Schema.Type.BYTES);
  }
}
- UUID - since a UUID always takes 16 bytes, it's stored as a fixed-length array of bytes.
- decimals with custom precision
- JSON and BSON format
- interval - represented as a fixed-length array of 12 bytes, storing the number of months, days and milliseconds, each on 4 bytes
- 32-bit integer - same as Java's Integer, it stores numeric values on 32 bits. It backs the following logical types:
- signed and unsigned integers (8, 16 and 32 bits)
- decimals with a maximal precision of 9 digits
- date
- time in milliseconds
- 64-bit integer - same as the previous one, except that the values are stored on 64 bits. It backs similar logical types:
- signed and unsigned integers (64 bits)
- decimals with a maximal precision of 18 digits
- timestamp with milliseconds or microseconds precision
- 96-bit integer - a legacy type, historically used to store timestamps with nanosecond precision
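Before moving to nested types, here is a minimal, hedged sketch (the class and field names are arbitrary) of logical types backed by the 32-bit and 64-bit integer primitives:

import org.apache.parquet.schema.OriginalType;
import org.apache.parquet.schema.PrimitiveType.PrimitiveTypeName;
import org.apache.parquet.schema.Type;
import org.apache.parquet.schema.Types;

public class IntBackedLogicalTypesExample {
  public static void main(String[] args) {
    // DATE is backed by int32 (number of days since the Unix epoch)
    Type birthDate = Types.required(PrimitiveTypeName.INT32).as(OriginalType.DATE).named("birth_date");
    // TIMESTAMP_MILLIS is backed by int64 (number of milliseconds since the Unix epoch)
    Type eventTime = Types.required(PrimitiveTypeName.INT64).as(OriginalType.TIMESTAMP_MILLIS).named("event_time");
    // INT_8 is a signed 8-bit integer, still physically stored on 32 bits
    Type smallNumber = Types.optional(PrimitiveTypeName.INT32).as(OriginalType.INT_8).named("small_number");

    System.out.println(birthDate);   // required int32 birth_date (DATE)
    System.out.println(eventTime);   // required int64 event_time (TIMESTAMP_MILLIS)
    System.out.println(smallNumber); // optional int32 small_number (INT_8)
  }
}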
Besides primitives, Apache Parquet also provides nested types. They're handled by org.apache.parquet.schema.GroupType thanks to the manipulation of repetition levels that can take 1 of 3 values: required (exactly 1 occurrence, typical for primitive types), optional (0 or 1 occurrence) or repeated (0, 1 or more occurrences). For nested types we can distinguish the following (a short repetition sketch comes right after the list):
- lists - considered as a repetition of values of the same type.
- maps - constructed as a group type containing repeated key/value pairs.
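To make the repetition levels more concrete, here is a small, hedged sketch (the group and field names are arbitrary) mixing the 3 repetitions inside a single group type:

import org.apache.parquet.schema.GroupType;
import org.apache.parquet.schema.OriginalType;
import org.apache.parquet.schema.PrimitiveType.PrimitiveTypeName;
import org.apache.parquet.schema.Types;

public class RepetitionLevelsExample {
  public static void main(String[] args) {
    GroupType contact = Types.requiredGroup()
        // required = exactly 1 occurrence
        .required(PrimitiveTypeName.INT64).named("id")
        // optional = 0 or 1 occurrence
        .optional(PrimitiveTypeName.BINARY).as(OriginalType.UTF8).named("nickname")
        // repeated = 0, 1 or more occurrences
        .repeated(PrimitiveTypeName.BINARY).as(OriginalType.UTF8).named("emails")
        .named("Contact");

    System.out.println(contact);
    // required group Contact {
    //   required int64 id;
    //   optional binary nickname (UTF8);
    //   repeated binary emails (UTF8);
    // }
  }
}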
Parquet types interoperability
As you know from the introduction to Apache Parquet, the framework provides integrations with a lot of other open source projects, such as Avro, Hive, Protobuf or Arrow. As you can correctly deduce, none of these systems was written expressly against Parquet's data types. Yet somehow they communicate together pretty well.
The communication between 2 systems having different data types goes through converters. For instance, Parquet - Avro interoperability is provided by the org.apache.parquet.avro.AvroSchemaConverter#convert(org.apache.avro.Schema) method. The same approach is used for Parquet - Protobuf compatibility, where an org.apache.parquet.proto.ProtoSchemaConverter is defined.
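As a hedged illustration of the Avro converter mentioned above (the record below is an arbitrary example, and the exact output may vary between Parquet versions), a minimal sketch could look like this:

import org.apache.avro.Schema;
import org.apache.parquet.avro.AvroSchemaConverter;
import org.apache.parquet.schema.MessageType;

public class AvroToParquetSchemaExample {
  public static void main(String[] args) {
    // a simple Avro record with a long and a string field
    Schema avroSchema = new Schema.Parser().parse(
        "{\"type\": \"record\", \"name\": \"User\", \"fields\": ["
            + "{\"name\": \"id\", \"type\": \"long\"},"
            + "{\"name\": \"email\", \"type\": \"string\"}]}");

    // the converter maps Avro types to Parquet primitive types and their logical annotations
    MessageType parquetSchema = new AvroSchemaConverter().convert(avroSchema);
    System.out.println(parquetSchema);
    // expected, roughly: message User { required int64 id; required binary email (UTF8); }
  }
}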
However, Parquet doesn't work only with serialization libraries. It can also be used by query engines, such as Hive. The integration model doesn't change though: Hive also uses converters to map its data types to the ones supported by Parquet.
Parquet types examples
The learning tests below show some use cases of data types in Parquet:
import static org.apache.parquet.schema.OriginalType.DECIMAL;
import static org.apache.parquet.schema.OriginalType.MAP;
import static org.apache.parquet.schema.OriginalType.UTF8;
import static org.apache.parquet.schema.PrimitiveType.PrimitiveTypeName.BINARY;
import static org.apache.parquet.schema.PrimitiveType.PrimitiveTypeName.BOOLEAN;
import static org.apache.parquet.schema.PrimitiveType.PrimitiveTypeName.FIXED_LEN_BYTE_ARRAY;
import static org.apache.parquet.schema.PrimitiveType.PrimitiveTypeName.INT32;
import static org.apache.parquet.schema.PrimitiveType.PrimitiveTypeName.INT64;
import static org.apache.parquet.schema.PrimitiveType.PrimitiveTypeName.INT96;
import static org.apache.parquet.schema.Type.Repetition.REPEATED;
import static org.apache.parquet.schema.Type.Repetition.REQUIRED;
import static org.assertj.core.api.Assertions.assertThat;
import static org.assertj.core.api.Assertions.assertThatExceptionOfType;

import org.apache.parquet.schema.GroupType;
import org.apache.parquet.schema.OriginalType;
import org.apache.parquet.schema.Type;
import org.apache.parquet.schema.Types;
import org.junit.Test;

// JUnit 4 + AssertJ learning tests for Parquet schema types
public class ParquetDataTypesTest {

  @Test
  public void should_create_map_of_integer_string_pairs() {
    Type letter = Types.required(BINARY).as(UTF8).named("letter");
    Type number = Types.required(INT32).named("number");
    GroupType map = Types.buildGroup(REQUIRED).as(OriginalType.MAP).addFields(letter, number).named("numbers_letters");

    String stringRepresentation = getStringRepresentation(map);

    assertThat(stringRepresentation).isEqualTo("required group numbers_letters (MAP) {\n" +
        "  required binary letter (UTF8);\n" +
        "  required int32 number;\n" +
        "}");
  }

  @Test
  public void should_create_a_list() {
    Type letterField = Types.required(BINARY).as(UTF8).named("letter");
    GroupType lettersList = Types.buildGroup(REPEATED).as(OriginalType.LIST).addField(letterField).named("letters");

    String stringRepresentation = getStringRepresentation(lettersList);

    assertThat(stringRepresentation).isEqualTo("repeated group letters (LIST) {\n" +
        "  required binary letter (UTF8);\n" +
        "}");
  }

  @Test
  public void should_create_int96_type() {
    Type bigNumberField = Types.required(INT96).named("big_number");

    String stringRepresentation = getStringRepresentation(bigNumberField);

    assertThat(stringRepresentation).isEqualTo("required int96 big_number");
  }

  @Test
  public void should_create_boolean_type() {
    Type isPairFlagField = Types.required(BOOLEAN).named("is_pair");

    String stringRepresentation = getStringRepresentation(isPairFlagField);

    assertThat(stringRepresentation).isEqualTo("required boolean is_pair");
  }

  @Test
  public void should_fail_on_applying_complex_type_to_primitive_type() {
    assertThatExceptionOfType(IllegalStateException.class).isThrownBy(() -> {
      Types.optional(FIXED_LEN_BYTE_ARRAY).length(10).as(MAP).named("letters");
    }).withMessageContaining("MAP can not be applied to a primitive type");
  }

  @Test
  public void should_create_fixed_length_array_type() {
    Type salary = Types.optional(FIXED_LEN_BYTE_ARRAY).length(10).precision(4).as(DECIMAL).named("salary");

    String stringRepresentation = getStringRepresentation(salary);

    assertThat(stringRepresentation).isEqualTo("optional fixed_len_byte_array(10) salary (DECIMAL(4,0))");
  }

  @Test
  public void should_create_simple_string_type() {
    Type textType = Types.required(BINARY).as(UTF8).named("text");

    String stringRepresentation = getStringRepresentation(textType);

    assertThat(stringRepresentation).isEqualTo("required binary text (UTF8)");
  }

  @Test
  public void should_create_complex_type() {
    // Parquet also allows the creation of "complex" (nested) types that are
    // similar to objects from object-oriented languages
    Type userType = Types.requiredGroup()
        .required(INT64).named("id")
        .required(BINARY).as(UTF8).named("email")
        .named("User");

    String stringRepresentation = getStringRepresentation(userType);

    assertThat(stringRepresentation).isEqualTo("required group User {\n" +
        "  required int64 id;\n" +
        "  required binary email (UTF8);\n" +
        "}");
  }

  // writes the type's schema representation into a string
  private static String getStringRepresentation(Type type) {
    StringBuilder stringBuilder = new StringBuilder();
    type.writeToStringBuilder(stringBuilder, "");
    return stringBuilder.toString();
  }
}
Data types are an inherent part of Apache Parquet. Not only do they define the schema, but they also come with specific optimization techniques such as encoding or compression. As we could see in the first section, Parquet brings the main primitive types that can be mapped (aliased) to more user-friendly logical types. The second part presented the converters that are widely used in the project to integrate external serialization libraries and query engines such as Hive.