Introduction to Apache Avro

Previously we learned why serialization frameworks facilitate work in distributed systems, where data comes from several different sources. Now it's a good time to discover a real tool used in the serialization step. As announced, the chosen tool is Apache Avro.


This article is an introduction to Apache Avro. The first part shows its main features. The second part browses the Java API by presenting the main classes involved in the serialization and deserialization process. A usage example is covered in the next article, serialization and deserialization with schemas in Apache Avro.

Apache Avro features

We can discover Apache Avro features by explaining why it's a good serialization framework. We base the analysis on the following criteria (the list comes from "Hadoop in Practice" by Alex Holmes):

Judging by these criteria, Avro seems to be a good serialization choice, widely adopted and well suited to batch processing tasks. However, how does it work? Its main component is the schema. Schemas are descriptors of serialized and deserialized objects. They can be expressed in JSON or IDL format. After establishing a schema, we can define an object in one of the supported programming languages, let's say Java. The corresponding class can be defined manually or generated automatically, for example with a Maven plugin (avro-maven-plugin). But an object definition is not mandatory; we can simply read the fields stored in a serialized file, still respecting the writer schema.
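Loading a JSON schema into a Java program goes through the Schema.Parser class. Below is a minimal sketch using a simplified version of the Club schema; the inlined JSON string and class name are illustrative, not taken from a real project:

```java
import org.apache.avro.Schema;

public class SchemaParsing {
  // Simplified Club schema, inlined for illustration
  static final String CLUB_SCHEMA_JSON =
      "{\"type\":\"record\",\"name\":\"Club\",\"namespace\":\"com.waitingforcode.model\","
    + "\"fields\":[{\"name\":\"full_name\",\"type\":\"string\"},"
    + "{\"name\":\"foundation_year\",\"type\":\"int\"},"
    + "{\"name\":\"disappearance_year\",\"type\":[\"null\",\"int\"]}]}";

  public static void main(String[] args) {
    // Schema.Parser turns the JSON definition into a Schema object
    Schema schema = new Schema.Parser().parse(CLUB_SCHEMA_JSON);
    System.out.println(schema.getName());          // Club
    System.out.println(schema.getFields().size()); // 3
  }
}
```

The same Schema instance is later passed to the datum readers and writers described below.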

What does an Avro schema look like? Mainly, the file consists of 2 parts. The first part, introduced by the avro.schema marker, contains the schema used to write the objects. The second part is composed of the real saved data. To save space, the saved data doesn't repeat field names and types - they are defined only once, at the beginning of the file. That's why Avro will generate smaller files than, for example, a JSON file containing multiple lines of objects of the same type. Below you can find an example of an Avro file storing 6 French football clubs:

Objavro.schema�{"type":"record","name":"Club","namespace":"com.waitingforcode.model","fields":[{"name":"full_name","type":"string","aliases":["fullName"]},{"name":"foundation_year","type":"int","aliases":["foundationYear"]},{"name":"disappearance_year","type":["null","int"],"aliases":["disappearanceYear"]}]} ����Wepy������e�RC Lens� Lille OSC� Matra Racing��US Valenciennes� Paris-SG� Red Star 93� ����Wepy������e

Apache Avro Java API

Before starting to use Avro, we are going to discover the basic classes used in the serialization and deserialization process. When we don't have a class corresponding to the serialized data, we can use an implementation of the org.apache.avro.generic.GenericRecord interface. Thanks to it we can access serialized fields by their names or by their indexes. To produce readable objects we must use one of the available datum readers. For GenericRecord the implementation to use is GenericDatumReader. The records are read through DataFileReader, which implements the FileReader interface, itself extending the Iterable and Iterator interfaces. That's why the data can be accessed through the well-known hasNext() and next() methods.
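A minimal sketch of this generic reading path is shown below. The schema is simplified to a single field, and the writing part at the top is only setup so the example is self-contained; the emphasis is on accessing fields by name and by index through GenericRecord:

```java
import java.io.File;
import org.apache.avro.Schema;
import org.apache.avro.file.DataFileReader;
import org.apache.avro.file.DataFileWriter;
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericDatumReader;
import org.apache.avro.generic.GenericDatumWriter;
import org.apache.avro.generic.GenericRecord;

public class GenericRecordReading {
  public static void main(String[] args) throws Exception {
    Schema schema = new Schema.Parser().parse(
        "{\"type\":\"record\",\"name\":\"Club\",\"fields\":"
      + "[{\"name\":\"full_name\",\"type\":\"string\"}]}");

    // Setup only: write one record so the reading code has data to work on
    File file = File.createTempFile("clubs", ".avro");
    try (DataFileWriter<GenericRecord> writer =
             new DataFileWriter<>(new GenericDatumWriter<GenericRecord>(schema))) {
      writer.create(schema, file);
      GenericRecord club = new GenericData.Record(schema);
      club.put("full_name", "RC Lens");
      writer.append(club);
    }

    // Reading: DataFileReader exposes hasNext()/next(); the fields of a
    // GenericRecord are accessible by name or by index
    try (DataFileReader<GenericRecord> reader =
             new DataFileReader<>(file, new GenericDatumReader<GenericRecord>())) {
      while (reader.hasNext()) {
        GenericRecord record = reader.next();
        System.out.println(record.get("full_name")); // by name
        System.out.println(record.get(0));           // by index
      }
    }
  }
}
```

Note that no Club class exists anywhere in this sketch - the schema stored in the file is enough to read the data back.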

By the way, thanks to them we can use the reading pattern advised in the Avro documentation. It consists of reusing the same object across deserializations, to reduce garbage collection work, such as:

// comes from the Avro documentation
User user = null;
while (dataFileReader.hasNext()) {
  // Reuse user object by passing it to next(). This saves us from
  // allocating and garbage collecting many objects for files with
  // many items.
  user = dataFileReader.next(user);
}

Avro can also convert serialized data to a specific object. It's achieved with different datum readers, such as org.apache.avro.reflect.ReflectDatumReader. As the name indicates, Java reflection is used to construct an object of the appropriate type.
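A sketch of the reflection-based path could look as follows. The Club class here is a plain hypothetical Java class, not generated code; ReflectData derives the schema from it and ReflectDatumReader rebuilds the instances:

```java
import java.io.File;
import org.apache.avro.Schema;
import org.apache.avro.file.DataFileReader;
import org.apache.avro.file.DataFileWriter;
import org.apache.avro.reflect.ReflectData;
import org.apache.avro.reflect.ReflectDatumReader;
import org.apache.avro.reflect.ReflectDatumWriter;

public class ReflectionBasedReading {
  // Plain Java class mapped by reflection; no code generation needed
  public static class Club {
    String fullName;
    int foundationYear;
  }

  public static void main(String[] args) throws Exception {
    // ReflectData derives the Avro schema from the class definition
    Schema schema = ReflectData.get().getSchema(Club.class);

    // Setup only: serialize one Club instance
    File file = File.createTempFile("clubs", ".avro");
    Club lens = new Club();
    lens.fullName = "RC Lens";
    lens.foundationYear = 1906;
    try (DataFileWriter<Club> writer =
             new DataFileWriter<>(new ReflectDatumWriter<>(Club.class))) {
      writer.create(schema, file);
      writer.append(lens);
    }

    // ReflectDatumReader reconstructs Club objects through reflection
    try (DataFileReader<Club> reader =
             new DataFileReader<>(file, new ReflectDatumReader<>(Club.class))) {
      Club deserialized = reader.next();
      System.out.println(deserialized.fullName + " " + deserialized.foundationYear);
    }
  }
}
```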

We can find a similar mechanism in the serialization step. Datum writers are used to serialize data. In Java, they are based on the org.apache.avro.io.DatumWriter interface. It defines only 2 methods: setSchema(Schema) and write(D datum, Encoder out). In the Java API we find writers specific to the saved type - if it's a Java object, we can use ReflectDatumWriter; if it's a generic record, we can use GenericDatumWriter.
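The write(D datum, Encoder out) method can also be called directly, without the object container file used in the previous snippets. A minimal sketch with GenericDatumWriter and a binary encoder targeting an in-memory buffer (the single-field schema is again simplified for illustration):

```java
import java.io.ByteArrayOutputStream;
import org.apache.avro.Schema;
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericDatumWriter;
import org.apache.avro.generic.GenericRecord;
import org.apache.avro.io.BinaryEncoder;
import org.apache.avro.io.DatumWriter;
import org.apache.avro.io.EncoderFactory;

public class DatumWriterSketch {
  public static void main(String[] args) throws Exception {
    Schema schema = new Schema.Parser().parse(
        "{\"type\":\"record\",\"name\":\"Club\",\"fields\":"
      + "[{\"name\":\"full_name\",\"type\":\"string\"}]}");

    GenericRecord club = new GenericData.Record(schema);
    club.put("full_name", "Lille OSC");

    // write(datum, encoder) serializes the record into the stream;
    // here the bytes end up in an in-memory buffer
    ByteArrayOutputStream out = new ByteArrayOutputStream();
    DatumWriter<GenericRecord> writer = new GenericDatumWriter<>(schema);
    BinaryEncoder encoder = EncoderFactory.get().binaryEncoder(out, null);
    writer.write(club, encoder);
    encoder.flush();

    System.out.println(out.size() + " bytes written");
  }
}
```

Note how small the output is: since the schema is not embedded, only the field values are encoded.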

An interesting point about serialized data is that text is not stored as Java's String but as Avro's org.apache.avro.util.Utf8 objects. The content of Utf8 objects can be compared directly, without decoding the UTF-8 bytes. Unlike String, they are also mutable.
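Both properties can be seen in a short sketch:

```java
import org.apache.avro.util.Utf8;

public class Utf8Demo {
  public static void main(String[] args) {
    Utf8 first = new Utf8("Paris-SG");
    Utf8 second = new Utf8("Paris-SG");

    // equals() and compareTo() work on the raw UTF-8 bytes,
    // without first decoding them into Java characters
    System.out.println(first.equals(second));    // true
    System.out.println(first.compareTo(second)); // 0

    // unlike String, the same instance can be reused with new content
    first.set("Red Star 93");
    System.out.println(first); // Red Star 93
  }
}
```

This mutability is what makes the object-reuse reading pattern shown earlier effective for string fields too.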

Apache Avro fulfills the points expected from serialization frameworks well. With its Java API, based on the concept of datum writers and readers, it's quite easy to serialize and deserialize objects - even if we don't want to map them to specific classes in Java code.
