Introduction to Apache ZooKeeper

Apache ZooKeeper usually works in the shadow of more visible Big Data tools, such as Apache Spark or Apache Kafka. However, it plays a very important role in system architecture.

The goal of this article is to answer some basic questions about Apache ZooKeeper. The first part covers the theoretical aspects of Apache ZooKeeper. The second part describes some of the vocabulary specific to this Apache project.

What is Apache ZooKeeper?

A very common explanation of Apache ZooKeeper compares it to a distributed file system. Like a local file system, ZooKeeper's is organized under a root (/), below which we find things that look like files. These "things" are called zNodes. They can act as files or as directories. In the first case, they store binary data. In the second case, they contain other zNodes (just like subdirectories). A zNode is also allowed to hold both data and children. We could use zNodes, for example, to keep configuration information in a centralized service. In addition, this data would be automatically replicated across all the servers that make up the ZooKeeper cluster.
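To make the file system analogy concrete, here is a minimal in-memory sketch of a zNode tree. It is not ZooKeeper's client API: the `ZNode` and `ZNodeTree` classes, their methods, and the `/config` paths are hypothetical names chosen for illustration only. It simply shows that a node can carry data and children at the same time.

```python
class ZNode:
    """Toy zNode: holds binary data and named children at the same time."""

    def __init__(self, data=b""):
        self.data = data
        self.children = {}


class ZNodeTree:
    """Toy hierarchical namespace rooted at "/" (illustration only)."""

    def __init__(self):
        self.root = ZNode()

    def _walk(self, path):
        # Follow each path segment from the root; raises KeyError if one is missing.
        node = self.root
        for part in [p for p in path.split("/") if p]:
            node = node.children[part]
        return node

    def create(self, path, data=b""):
        # The parent must already exist, as with directories in a file system.
        parent_path, _, name = path.rpartition("/")
        parent = self._walk(parent_path)
        if name in parent.children:
            raise ValueError(f"node already exists: {path}")
        parent.children[name] = ZNode(data)

    def get(self, path):
        return self._walk(path).data

    def get_children(self, path):
        return sorted(self._walk(path).children)


tree = ZNodeTree()
# "/config" behaves like a file (it has data) and like a directory (it has children).
tree.create("/config", b"version=1")
tree.create("/config/db_url", b"jdbc:postgresql://db:5432/app")
tree.create("/config/timeout_ms", b"3000")

print(tree.get("/config"))
print(tree.get_children("/config"))
```

In a real deployment the same layout would be created through a ZooKeeper client, and every server in the cluster would hold a replica of the tree.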

Data in ZooKeeper is stored in memory and in persistent logs. The in-memory storage helps to achieve high throughput and low latency. The persistent store holds transaction logs and fuzzy snapshots. As the name indicates, snapshots represent the data tree at a given moment. They're called fuzzy because they may miss some of the changes made while they are being taken. So, if a zNode is removed while a snapshot is in progress, the snapshot may refer to something that doesn't exist anymore. Transaction logs, on the other hand, help ZooKeeper to ensure that no operation is lost. Every time a zNode update is about to be applied, it is first written to the persistent log files.
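The interplay between the transaction log and a fuzzy snapshot can be sketched with a toy model. This is an assumption-laden simplification, not ZooKeeper's on-disk format or code: `ToyStore`, `recover`, and the operation names are hypothetical. The idea it illustrates is that every update is logged before it is applied in memory, and that replaying the log on top of a fuzzy snapshot brings the state back in line.

```python
class ToyStore:
    """Toy model: a list stands in for the persistent log, a dict for the in-memory tree."""

    def __init__(self):
        self.log = []
        self.tree = {}

    def apply(self, op, path, data=None):
        self.log.append((op, path, data))           # 1. write to the log first
        self.apply_to(self.tree, op, path, data)    # 2. then update memory

    @staticmethod
    def apply_to(tree, op, path, data):
        # Both operations are idempotent: replaying them twice gives the same result.
        if op == "set":
            tree[path] = data
        elif op == "delete":
            tree.pop(path, None)


def recover(snapshot, log, start_index):
    """Rebuild the tree from a snapshot plus the log entries written since it started."""
    tree = dict(snapshot)
    for op, path, data in log[start_index:]:
        ToyStore.apply_to(tree, op, path, data)
    return tree


store = ToyStore()
store.apply("set", "/a", b"1")
store.apply("set", "/b", b"2")

# Start a snapshot and copy nodes one by one while updates keep arriving.
snapshot_start = len(store.log)
snapshot = {"/a": store.tree["/a"]}
store.apply("delete", "/a")      # change made while the snapshot is in progress
store.apply("set", "/c", b"3")   # another change the snapshot never sees
snapshot["/b"] = store.tree["/b"]

# The snapshot is fuzzy: it still contains "/a" and lacks "/c".
recovered = recover(snapshot, store.log, snapshot_start)
print(sorted(snapshot), sorted(recovered))
```

Because the replayed operations are idempotent in this sketch, it does not matter whether the snapshot had already captured a given change or not.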

Data stored in ZooKeeper should be relatively small. Storing big objects could have a negative impact on latency, because network operations would take more time to complete. One solution is to store big files in a bulk storage system such as HDFS. In this case, ZooKeeper would only be used to keep the path to these files. When we try to save content bigger than 1 MB, the client connection will be closed.
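The "store the location, not the file" idea can be sketched as a small guard applied before writing to a zNode. Everything here is hypothetical: the `znode_payload` helper, the `MAX_ZNODE_BYTES` constant (set to the roughly 1 MB limit mentioned above), and the HDFS path are illustrative assumptions, not part of any ZooKeeper API.

```python
MAX_ZNODE_BYTES = 1024 * 1024  # assumed limit, taken from the ~1 MB figure in the text


def znode_payload(content: bytes, bulk_location: str):
    """Decide what to write to a zNode.

    Returns ("inline", content) when the content fits under the limit,
    otherwise ("pointer", encoded location of the file in bulk storage).
    """
    if len(content) <= MAX_ZNODE_BYTES:
        return ("inline", content)
    return ("pointer", bulk_location.encode("utf-8"))


small = b"timeout_ms=3000"
big = b"x" * (2 * 1024 * 1024)
location = "hdfs://namenode/data/big_file.bin"  # hypothetical path

print(znode_payload(small, location)[0])
print(znode_payload(big, location)[0])
```

The caller would upload the large content to the bulk store separately and write only the returned payload to ZooKeeper.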

What vocabulary is used in ZooKeeper?

To work efficiently with Apache ZooKeeper, it's important to understand the vocabulary it uses. Below you can find a list of what are, subjectively, the most important concepts to become familiar with:

This article covers some basic information that should help you understand and start working with Apache ZooKeeper. The first part describes in general terms what ZooKeeper is. It introduces one of several important terms used by ZooKeeper: zNodes. The others, subjectively judged to be important, are presented in the second part.
