NoSQL solutions are very often associated with the word schemaless. However, the absence of a schema can lead to maintenance or backward compatibility problems. One solution to these issues in Big Data systems is serialization frameworks.
This purely theoretical article explains the role and the importance of serialization in Big Data systems.
Serialization definition
First, let's define what serialization is. In Java, serialization is linked to the java.io.Serializable interface and the possibility to convert an object to a byte stream and back. But in Big Data systems, where data can come from different sources written in different languages, this solution has some drawbacks, such as a lack of portability and maintenance difficulties.
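As an illustration, here is a minimal sketch of Java's built-in serialization; the User class and its fields are invented for the example:

```java
import java.io.*;

public class JavaSerializationExample {

    // Implementing java.io.Serializable is enough to make the class
    // convertible to a byte stream and back
    static class User implements Serializable {
        private static final long serialVersionUID = 1L;
        String login;
        int age;

        User(String login, int age) {
            this.login = login;
            this.age = age;
        }
    }

    public static void main(String[] args) throws IOException, ClassNotFoundException {
        // Serialize the object to a byte array
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (ObjectOutputStream out = new ObjectOutputStream(bytes)) {
            out.writeObject(new User("user_1", 30));
        }

        // Deserialize it back; this only works reliably for JVM consumers,
        // which illustrates the portability limits mentioned above
        try (ObjectInputStream in = new ObjectInputStream(
                new ByteArrayInputStream(bytes.toByteArray()))) {
            User restored = (User) in.readObject();
            System.out.println(restored.login + " / " + restored.age);
        }
    }
}
```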
In Big Data, serialization also refers to converting data into a portable structure such as a byte stream. But it has another goal, which is schema control. Thanks to a schema describing the data structure, data can be validated at write time. It avoids surprises when data is read and, for example, a mandatory field is missing or has the wrong type (an int instead of an array).
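Apache Avro, which comes back later in this post, can illustrate this write-time validation. The sketch below uses an invented two-field schema; writing a record whose mandatory field was never set fails immediately instead of producing data that breaks at read time:

```java
import org.apache.avro.Schema;
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericDatumWriter;
import org.apache.avro.generic.GenericRecord;
import org.apache.avro.io.EncoderFactory;

import java.io.ByteArrayOutputStream;

public class WriteTimeValidationExample {

    // Hypothetical schema with a mandatory "age" field of type int
    private static final String SCHEMA_JSON =
        "{\"type\": \"record\", \"name\": \"User\", \"fields\": ["
      + " {\"name\": \"login\", \"type\": \"string\"},"
      + " {\"name\": \"age\", \"type\": \"int\"}"
      + "]}";

    public static void main(String[] args) throws Exception {
        Schema schema = new Schema.Parser().parse(SCHEMA_JSON);

        GenericRecord user = new GenericData.Record(schema);
        user.put("login", "user_1");
        // The mandatory "age" field is never set

        GenericDatumWriter<GenericRecord> writer = new GenericDatumWriter<>(schema);
        ByteArrayOutputStream output = new ByteArrayOutputStream();
        // Writing fails here because "age" is missing and the schema does not
        // allow null for it; the error surfaces at write time, not at read time
        writer.write(user, EncoderFactory.get().binaryEncoder(output, null));
    }
}
```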
Additionally, serialization helps to execute Big Data tasks efficiently. Unlike popular formats such as JSON or XML, serialized data is easier to split. And splittability can be used, for example, by MapReduce to process input data divided into input splits.
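A hedged sketch of what this looks like in practice, again with Avro: its container file format groups records into blocks separated by sync markers, and those markers are what lets a framework like MapReduce cut the file into input splits (the file name and schema are invented here):

```java
import org.apache.avro.Schema;
import org.apache.avro.file.CodecFactory;
import org.apache.avro.file.DataFileWriter;
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericDatumWriter;
import org.apache.avro.generic.GenericRecord;

import java.io.File;

public class SplittableFileExample {

    private static final String SCHEMA_JSON =
        "{\"type\": \"record\", \"name\": \"User\", \"fields\": ["
      + " {\"name\": \"login\", \"type\": \"string\"}"
      + "]}";

    public static void main(String[] args) throws Exception {
        Schema schema = new Schema.Parser().parse(SCHEMA_JSON);

        // The container file stores records in blocks separated by sync markers,
        // so it can be split without reading it from the beginning
        try (DataFileWriter<GenericRecord> fileWriter =
                 new DataFileWriter<>(new GenericDatumWriter<GenericRecord>(schema))) {
            // Compression is applied per block, so the file stays splittable
            fileWriter.setCodec(CodecFactory.deflateCodec(6));
            fileWriter.create(schema, new File("users.avro"));

            for (int i = 0; i < 1000; i++) {
                GenericRecord user = new GenericData.Record(schema);
                user.put("login", "user_" + i);
                fileWriter.append(user);
            }
        }
    }
}
```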
The previously mentioned schema control brings another advantage of serialization frameworks - versioning. Let's imagine that one day our object has 5 fields and one week later it already has 10 fields. To handle the change we could, for example, map the new fields to default values or use the old schema file for data deserialization.
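To make this versioning point concrete, here is a sketch, still based on Avro and an invented schema, where data written with the old schema version is read with a newer one that adds a field carrying a default value:

```java
import org.apache.avro.Schema;
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericDatumReader;
import org.apache.avro.generic.GenericDatumWriter;
import org.apache.avro.generic.GenericRecord;
import org.apache.avro.io.BinaryEncoder;
import org.apache.avro.io.DecoderFactory;
import org.apache.avro.io.EncoderFactory;

import java.io.ByteArrayOutputStream;

public class SchemaEvolutionExample {

    // Version 1 of the schema: a single field
    private static final String WRITER_SCHEMA =
        "{\"type\": \"record\", \"name\": \"User\", \"fields\": ["
      + " {\"name\": \"login\", \"type\": \"string\"}"
      + "]}";

    // Version 2 adds a new field with a default value
    private static final String READER_SCHEMA =
        "{\"type\": \"record\", \"name\": \"User\", \"fields\": ["
      + " {\"name\": \"login\", \"type\": \"string\"},"
      + " {\"name\": \"country\", \"type\": \"string\", \"default\": \"unknown\"}"
      + "]}";

    public static void main(String[] args) throws Exception {
        Schema writerSchema = new Schema.Parser().parse(WRITER_SCHEMA);
        Schema readerSchema = new Schema.Parser().parse(READER_SCHEMA);

        // Write a record with the old schema
        GenericRecord oldRecord = new GenericData.Record(writerSchema);
        oldRecord.put("login", "user_1");
        ByteArrayOutputStream output = new ByteArrayOutputStream();
        BinaryEncoder encoder = EncoderFactory.get().binaryEncoder(output, null);
        new GenericDatumWriter<GenericRecord>(writerSchema).write(oldRecord, encoder);
        encoder.flush();

        // Read it back with the new schema; the missing field gets its default value
        GenericDatumReader<GenericRecord> reader =
            new GenericDatumReader<>(writerSchema, readerSchema);
        GenericRecord newRecord = reader.read(null,
            DecoderFactory.get().binaryDecoder(output.toByteArray(), null));
        System.out.println(newRecord.get("country")); // prints "unknown"
    }
}
```

The reader simply fills the missing field with its default value, so data written against the old schema stays readable without any rewrite.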
Also, in some cases, serialized data takes less space than standard JSON or XML files.
To sum up, we can list the following points to characterize serialization in Big Data systems:
- splittability - it's easier to achieve splits on byte streams than on JSON or XML files
- portability - the schema can be consumed by different languages
- versioning - flexibility to define new fields with default values or to continue using the old schema version
- data integrity - serialization schemas enforce data correctness. Thanks to them, errors can be detected earlier, when data is written.
Avro as use case
There are several main serialization frameworks available, among others: Avro, Thrift and Protocol Buffers. After analyzing some of their features, I decided to play a little with Apache Avro.
This choice was dictated by its flexibility of use (optional code generation), language support, splittability and the fact that in most of the Big Data books I studied, only Apache Thrift was widely presented. But Apache Avro will be the subject of another post.
In this article we learned some points about the use of serialization in Big Data systems. We saw that it helps to keep data consistent, but also that it improves data processing thanks to its splittability and compression features.