Previously we learned why serialization frameworks facilitate work in distributed systems, where data comes from several different sources. Now it's a good time to discover a real tool used in the serialization step. As announced, the chosen tool is Apache Avro.
This is an introduction to Apache Avro, showing its main features in the first part. The second part browses the Java API by presenting the main classes involved in the serialization and deserialization process. A complete example of use is described in the next article, serialization and deserialization with schemas in Apache Avro.
Apache Avro features
We can discover Apache Avro features by explaining why it's a good serialization framework. We base the analysis on the following criteria (the list comes from "Hadoop in Practice" by Alex Holmes):
- code generation - exists but is not mandatory to make Avro work.
- versioning - since Avro files carry their serialization schemas themselves, versioning can be easily achieved by correctly defining the reader schema.
- language support - there are a lot of implementations, notably in C, C++, C#, Java, PHP, Python and Ruby.
- transparent compression - there is no need to compress Avro files manually. The framework applies some size optimizations itself.
- splittability - achieved thanks to synchronization markers placed between blocks.
- native support in MapReduce - according to the Apache Avro documentation, Avro files can be used in each step of MapReduce jobs (as input, output or intermediate files). All useful classes, such as readers or reducers, live in the org.apache.avro.mapred package.
Fine, Avro seems to be a good serialization choice, widely adopted and well suited to batch processing tasks. But how does it work? Its main components are schemas. Schemas are descriptors of serialized and deserialized objects. They can be expressed in JSON or in the IDL format. After establishing a schema, we can define a corresponding object in one of the supported programming languages, let's say Java. The corresponding class can be written manually or generated automatically, for example with the Maven plugin (avro-maven-plugin). But an object definition is not mandatory: we can simply read the fields stored in the serialized file, still respecting the writer schema.
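As an illustration, here is the JSON form of the schema describing the Club record stored in the example file shown a bit further below:

{
  "type": "record",
  "name": "Club",
  "namespace": "com.waitingforcode.model",
  "fields": [
    {"name": "full_name", "type": "string", "aliases": ["fullName"]},
    {"name": "foundation_year", "type": "int", "aliases": ["foundationYear"]},
    {"name": "disappearance_year", "type": ["null", "int"], "aliases": ["disappearanceYear"]}
  ]
}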
How does an Avro file look? Mainly, it consists of 2 parts. The first part, introduced by the avro.schema marker, contains the schema used to write the objects. The second part is composed of the actual saved data. To save space, the saved data doesn't contain field names and types - they are defined only once, at the beginning of the file. This is why Avro generates smaller files than, for example, a JSON file containing multiple lines of objects of the same type. Below you can find an example of an Avro file storing 6 French football clubs:
Objavro.schema�{"type":"record","name":"Club","namespace":"com.waitingforcode.model","fields":[{"name":"full_name","type":"string","aliases":["fullName"]},{"name":"foundation_year","type":"int","aliases":["foundationYear"]},{"name":"disappearance_year","type":["null","int"],"aliases":["disappearanceYear"]}]} ����Wepy������e�RC Lens� Lille OSC� Matra Racing��US Valenciennes� Paris-SG� Red Star 93� ����Wepy������e
Apache Avro Java API
Before starting to use Avro, we are going to discover the basic classes used in the serialization and deserialization process. When we don't have a class corresponding to the serialized data, we can use an implementation of the org.apache.avro.generic.GenericRecord interface. Thanks to it, we can access serialized fields by their names or by their indexes. To produce readable objects we must use one of the available datum readers, represented by org.apache.avro.io.DatumReader implementations. For GenericRecord the implementation to use is GenericDatumReader. The datum reader is typically wrapped in a DataFileReader, which implements the FileReader interface, itself extending the Iterable and Iterator interfaces. That's why data can be accessed with the well-known hasNext() and next() methods.
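As a minimal sketch of this generic reading (the clubs.avro file name is only assumed for illustration):

import java.io.File;

import org.apache.avro.file.DataFileReader;
import org.apache.avro.generic.GenericDatumReader;
import org.apache.avro.generic.GenericRecord;

public class GenericReadingExample {
    public static void main(String[] args) throws Exception {
        // no schema passed here: the reader schema is taken from the file itself
        GenericDatumReader<GenericRecord> datumReader = new GenericDatumReader<>();
        try (DataFileReader<GenericRecord> fileReader =
                 new DataFileReader<>(new File("clubs.avro"), datumReader)) {
            while (fileReader.hasNext()) {
                GenericRecord club = fileReader.next();
                // fields are accessed by name (get(String)) or by index (get(int))
                System.out.println(club.get("full_name") + " founded in " + club.get("foundation_year"));
            }
        }
    }
}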
By the way, thanks to these methods we can use the reading pattern advised in the Avro documentation. It consists of reusing the same object across deserializations, to avoid extra garbage collection work, such as:
// comes from http://avro.apache.org/docs/current/gettingstartedjava.html
User user = null;
while (dataFileReader.hasNext()) {
    // Reuse user object by passing it to next(). This saves us from
    // allocating and garbage collecting many objects for files with
    // many items.
    user = dataFileReader.next(user);
    System.out.println(user);
}
Avro can also convert serialized data to a specific object. It's achieved with different datum readers, such as org.apache.avro.reflect.ReflectDatumReader. As the name indicates, Java reflection is used to construct an object of the appropriate type.
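A sketch of such reflection-based reading could look like the following, assuming a plain Club class whose reflect schema is compatible with the writer schema (matching record name and field names - this class is an assumption of the sketch, not generated code):

import java.io.File;

import org.apache.avro.file.DataFileReader;
import org.apache.avro.reflect.ReflectDatumReader;

// Plain Java class assumed to be resolvable against the writer schema;
// writer fields absent from this class are simply skipped during reading.
class Club {
    String full_name;
    int foundation_year;
}

public class ReflectReadingExample {
    public static void main(String[] args) throws Exception {
        // reflection is used to instantiate and populate Club objects
        ReflectDatumReader<Club> datumReader = new ReflectDatumReader<>(Club.class);
        try (DataFileReader<Club> fileReader =
                 new DataFileReader<>(new File("clubs.avro"), datumReader)) {
            for (Club club : fileReader) {
                System.out.println(club.full_name + " (" + club.foundation_year + ")");
            }
        }
    }
}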
We can find a similar mechanism in the serialization step. Datum writers are used to serialize data. In Java, they are based on the org.apache.avro.io.DatumWriter interface. It defines only 2 methods: setSchema(Schema) and write(D datum, Encoder out). In the Java API we find writers specific to the saved type - if it's a plain Java object, we can use ReflectDatumWriter; if it's a generic record, we can use GenericDatumWriter.
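As a sketch of the writing side with a generic record (the clubs.avro output file and the sample values are only illustrative; the Club schema from the earlier example is inlined, with aliases omitted for brevity):

import java.io.File;

import org.apache.avro.Schema;
import org.apache.avro.file.DataFileWriter;
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericDatumWriter;
import org.apache.avro.generic.GenericRecord;

public class GenericWritingExample {
    public static void main(String[] args) throws Exception {
        Schema schema = new Schema.Parser().parse(
            "{\"type\":\"record\",\"name\":\"Club\",\"namespace\":\"com.waitingforcode.model\","
            + "\"fields\":[{\"name\":\"full_name\",\"type\":\"string\"},"
            + "{\"name\":\"foundation_year\",\"type\":\"int\"},"
            + "{\"name\":\"disappearance_year\",\"type\":[\"null\",\"int\"]}]}");

        GenericRecord club = new GenericData.Record(schema);
        club.put("full_name", "RC Lens");
        club.put("foundation_year", 1906);
        club.put("disappearance_year", null);

        GenericDatumWriter<GenericRecord> datumWriter = new GenericDatumWriter<>(schema);
        try (DataFileWriter<GenericRecord> fileWriter = new DataFileWriter<>(datumWriter)) {
            // create() writes the schema once at the beginning of the file, as described above
            fileWriter.create(schema, new File("clubs.avro"));
            fileWriter.append(club);
        }
    }
}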
An interesting point about serialized data is that text is not stored as Java's String but as Avro's org.apache.avro.util.Utf8 objects. The content of Utf8 objects can be compared directly, without decoding the UTF-8 bytes. Unlike String, they are also mutable.
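A small sketch of this Utf8 behavior (the values are only illustrative):

import org.apache.avro.util.Utf8;

public class Utf8Example {
    public static void main(String[] args) {
        Utf8 name = new Utf8("RC Lens");
        // comparison works directly on the underlying UTF-8 bytes, without decoding to String
        System.out.println(name.compareTo(new Utf8("Lille OSC")));
        // unlike String, the same instance can be reused by mutating its content
        name.set("Paris-SG");
        System.out.println(name);
    }
}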
Apache Avro meets the characteristics expected from a serialization framework well. With its Java API, based on the concept of datum writers and readers, it's quite easy to serialize and deserialize objects - even if we don't want to map them to specific classes in Java code.