Introduction to Apache ZooKeeper

Apache ZooKeeper usually works in the shadow of more visible Big Data tools, such as Apache Spark or Apache Kafka. However, it plays a very important role in system architecture.

The goal of this article is to answer some basic questions about Apache ZooKeeper. The first part describes the theoretical aspects of Apache ZooKeeper. The next part describes some vocabulary specific to this Apache project.

What is Apache ZooKeeper?

A very common way to explain Apache ZooKeeper is to compare it to a distributed file system. Like a local file system, ZooKeeper's is composed of a root (/), below which we can find things that look like files. These "things" are called zNodes. They can act as files or as directories. In the first case, they store binary data. In the second case, they contain other sub-zNodes (just like subdirectories). A zNode is also allowed to hold both data and sub-zNodes. We could use zNodes, for example, to keep configuration information in a centralized service. In addition, this service would be automatically replicated across all the servers that make up the ZooKeeper cluster.
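
The tree structure described above can be illustrated with a small in-memory model. This is only a sketch to make the data model concrete: the `ZNode` and `ZNodeTree` classes and their method names are hypothetical and are not the ZooKeeper client API, and nothing here is replicated or persisted.

```python
class ZNode:
    """Toy zNode: it may hold data and child zNodes at the same time."""

    def __init__(self, data=b""):
        self.data = data
        self.children = {}


class ZNodeTree:
    """Toy in-memory tree rooted at "/" (hypothetical, not the real client API)."""

    def __init__(self):
        self.root = ZNode()

    def _walk(self, path):
        # Paths are absolute, like in a file system.
        if not path.startswith("/"):
            raise ValueError("paths are absolute and start with '/'")
        node = self.root
        for part in [p for p in path.split("/") if p]:
            if part not in node.children:
                raise KeyError(f"{path} does not exist")
            node = node.children[part]
        return node

    def create(self, path, data=b""):
        if not path.startswith("/"):
            raise ValueError("paths are absolute and start with '/'")
        parent_path, _, name = path.rpartition("/")
        if not name:
            raise ValueError("cannot create the root or an empty name")
        # The parent must already exist, as with directories.
        parent = self._walk(parent_path or "/")
        if name in parent.children:
            raise KeyError(f"{path} already exists")
        parent.children[name] = ZNode(data)

    def get_data(self, path):
        return self._walk(path).data

    def get_children(self, path):
        return sorted(self._walk(path).children)


tree = ZNodeTree()
tree.create("/config", b"version=1")          # behaves like a file...
tree.create("/config/db", b"host=db1:5432")   # ...and like a directory
print(tree.get_data("/config"))
print(tree.get_children("/config"))
```

Here `/config` stores its own bytes and also has the child `/config/db`, which is the "both data and subdirectories" case mentioned in the paragraph.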

Data in ZooKeeper is stored in memory and in persistent logs. The in-memory storage helps to achieve high throughput and low latency. The persistent store holds transaction logs and fuzzy snapshots. As the name indicates, snapshots represent the data tree at a given moment. They're called fuzzy because they may not reflect some of the changes made while they were being taken. So, if a zNode is removed while a snapshot is being taken, the snapshot may refer to something that doesn't exist anymore. Transaction logs, on the other hand, help ZooKeeper ensure that no operation is lost. Every time a zNode update is planned, it is first written to the persistent log files.
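
The "log first, then apply in memory" idea can be sketched as follows. This is a simplified illustration and not ZooKeeper's actual storage code: the `LoggedStore` class, the JSON line format and the flat dictionary standing in for the data tree are all assumptions made for the example. It also assumes that each logged operation can be safely applied more than once, which is what lets the log be replayed on top of a fuzzy snapshot.

```python
import json
import os
import tempfile


class LoggedStore:
    """Toy store: every update is appended to a log file before it touches memory."""

    def __init__(self, log_path):
        self.log_path = log_path
        self.tree = {}  # flat dict standing in for the in-memory data tree

    def _apply(self, txn):
        # Both operations give the same result if applied twice.
        if txn["op"] == "set":
            self.tree[txn["path"]] = txn["data"]
        elif txn["op"] == "delete":
            self.tree.pop(txn["path"], None)

    def submit(self, op, path, data=None):
        txn = {"op": op, "path": path, "data": data}
        # 1. Persist the transaction first.
        with open(self.log_path, "a", encoding="utf-8") as log:
            log.write(json.dumps(txn) + "\n")
            log.flush()
            os.fsync(log.fileno())
        # 2. Only then update the in-memory state.
        self._apply(txn)

    @classmethod
    def recover(cls, log_path, snapshot=None):
        # Start from a (possibly fuzzy) snapshot and replay the whole log on top.
        store = cls(log_path)
        store.tree = dict(snapshot or {})
        if os.path.exists(log_path):
            with open(log_path, encoding="utf-8") as log:
                for line in log:
                    store._apply(json.loads(line))
        return store


log_path = os.path.join(tempfile.mkdtemp(), "txn.log")
store = LoggedStore(log_path)
store.submit("set", "/config", "v1")
store.submit("set", "/tmp", "x")
fuzzy_snapshot = dict(store.tree)   # taken while "/tmp" still exists
store.submit("delete", "/tmp")      # the snapshot now refers to a removed node

recovered = LoggedStore.recover(log_path, fuzzy_snapshot)
print(recovered.tree)
```

After recovery the stale `/tmp` entry from the snapshot is gone, because the logged delete is replayed over it.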

Data stored in ZooKeeper should be relatively small. Storing big objects can have a negative impact on latency, because network operations take longer to complete. One solution is to store big files in a bulk storage system such as HDFS. In this case, ZooKeeper is only used to keep the location path of these files. When we try to save content bigger than 1 MB, the client connection will be closed.
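
The "store only the location" approach could look like the sketch below. The `value_for_znode` helper, the constant name and the `hdfs://` path are hypothetical, the limit simply reuses the 1 MB figure mentioned above, and the example assumes the large payload has already been uploaded to the bulk storage by the caller.

```python
MAX_ZNODE_BYTES = 1024 * 1024  # the 1 MB limit mentioned in the text


def value_for_znode(payload: bytes, bulk_location: str) -> bytes:
    """Return what should be written to the zNode.

    Small payloads are stored directly. For large ones, only the location
    of the file in the bulk storage system is stored.
    """
    if len(payload) < MAX_ZNODE_BYTES:
        return payload
    return bulk_location.encode("utf-8")


small = value_for_znode(b"timeout=30", "hdfs://namenode/conf/app.properties")
big = value_for_znode(b"x" * (2 * 1024 * 1024), "hdfs://namenode/data/big.bin")
print(len(small), len(big))
```

In practice a threshold well below the hard limit may be preferable, since the latency concern applies before the limit is reached.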

What vocabulary is used in ZooKeeper?

To work efficiently with Apache ZooKeeper, it's important to understand the vocabulary it uses. Below you can find a list of the concepts that are, subjectively, the most important to learn:

This article describes some basic information that should help you understand and start working with Apache ZooKeeper. The first part describes in general terms what ZooKeeper is. It mentions one of several important terms used by ZooKeeper: zNodes. The rest of them, subjectively judged as important, are presented in the next part.

