Introduction to serialization in Big Data

NoSQL solutions are very often associated with the word "schemaless". Sometimes the absence of a schema can lead to maintenance or backward compatibility problems. One of the solutions to these issues in Big Data systems is serialization frameworks.

Bartosz Konieczny

This purely theoretical article explains the role and the importance of serialization in Big Data systems.

Serialization definition

First, let's define what serialization is. In Java, serialization is linked to the java.io.Serializable interface and the ability to convert an object to a byte stream and back. But in Big Data systems, where data can come from different sources and be written in different languages, this solution has some drawbacks, such as a lack of portability and maintenance difficulty.
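To illustrate the Java-native mechanism mentioned above, here is a minimal round-trip sketch. The `User` class and its field are hypothetical examples, not taken from any real system:

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;

public class RoundTrip {

    // A hypothetical class made serializable through the marker interface
    static class User implements Serializable {
        private static final long serialVersionUID = 1L;
        final String name;
        User(String name) { this.name = name; }
    }

    // Serializes a User to an in-memory byte stream and deserializes it
    // back, returning the name read from the reconstructed object
    static String roundTrip(String name) throws Exception {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (ObjectOutputStream out = new ObjectOutputStream(bytes)) {
            out.writeObject(new User(name));
        }
        try (ObjectInputStream in = new ObjectInputStream(
                new ByteArrayInputStream(bytes.toByteArray()))) {
            return ((User) in.readObject()).name;
        }
    }

    public static void main(String[] args) throws Exception {
        System.out.println(roundTrip("alice"));
    }
}
```

Note that the byte stream produced here only makes sense to another JVM that has the same class on its classpath, which is exactly the portability problem described above.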

In Big Data, serialization also refers to converting data into a portable structure such as a byte stream. But it has another goal, which is schema control. Thanks to a schema describing the data structure, data can be validated at the writing phase. It avoids surprises when data is read and, for example, a mandatory field is missing or has the wrong type (an int instead of an array).
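As an illustration of such a schema, here is a sketch of what it could look like in Avro's JSON notation (the record and field names are invented for the example, not from a real dataset):

```json
{
  "type": "record",
  "name": "User",
  "fields": [
    {"name": "id", "type": "int"},
    {"name": "emails", "type": {"type": "array", "items": "string"}}
  ]
}
```

A writer validating against this schema would reject a record with a missing `id` or with `emails` given as a single string instead of an array, so the error surfaces at write time rather than at read time.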

Additionally, serialization helps to execute Big Data tasks efficiently. Unlike popular formats such as JSON or XML, serialized data is easier to split. And splittability can be used, for example, by MapReduce to process input data divided into input splits.

The previously mentioned schema control brings another advantage of serialization frameworks - versioning. Let's imagine that one day our object has 5 fields and one week later it already has 10 fields. To handle the change we could, for example, map the new fields to default values or use the old schema file for data deserialization.
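The default-value strategy can be sketched with an evolved version of a schema. Assuming a hypothetical `User` record that originally had only an `id` field, a reader schema adding a `country` field with a default can still read the old data:

```json
{
  "type": "record",
  "name": "User",
  "fields": [
    {"name": "id", "type": "int"},
    {"name": "country", "type": "string", "default": "unknown"}
  ]
}
```

When records written before `country` existed are deserialized with this newer schema, the missing field is filled with `"unknown"` instead of failing the read.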

Also, in some cases, serialized data takes less space than standard JSON or XML files.

To sum up, we can list the following points to characterize serialization in Big Data systems:

- schema control, allowing data to be validated at write time
- splittability, enabling parallel processing of input splits
- versioning, thanks to schema evolution with default values or old schema files
- in some cases, more compact storage than text formats such as JSON or XML

Avro as use case

There are several main serialization frameworks available, among others: Avro, Thrift and Protocol Buffers. After analyzing some of their features, I decided to play a little with Apache Avro.

This choice was dictated by its flexibility of use (optional code generation), language support, splittability, and the fact that in most of the Big Data books I studied, only Apache Thrift was widely presented. But Apache Avro will be the subject of another post.

In this article we learned some points about the use of serialization in Big Data systems. We saw that it helps to keep data consistent, but also improves data processing thanks to its splittability and compression features.
