Introduction to Apache ZooKeeper

Apache ZooKeeper usually works in the shadow of more visible Big Data tools such as Apache Spark or Apache Kafka. However, it plays a very important role in system architecture.

Bartosz Konieczny

The goal of this article is to answer some basic questions about Apache ZooKeeper. The first part covers the theory behind Apache ZooKeeper. The second part describes some vocabulary specific to this Apache project.

What is Apache ZooKeeper?

A very common way to explain Apache ZooKeeper is to compare it to a distributed file system. Like a local file system, ZooKeeper's tree starts at a root (/), below which we find things that look like files. These "things" are called zNodes. They can act as files or as directories. In the first case, they store binary data. In the second case, they contain child zNodes (just like subdirectories). A single zNode can also hold both data and children. We could use zNodes, for example, to keep configuration information in a centralized service. In addition, this data is automatically replicated across all the servers that make up the ZooKeeper cluster.
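The file system analogy can be sketched with a small in-memory model. This is only a toy illustration, not ZooKeeper's implementation or client API: the ZNode and ZNodeTree classes, their method names, and the example paths are all hypothetical. It shows that one node can carry data and children at the same time.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class ZNode:
    """Toy stand-in for a zNode: it can hold data and children at once."""
    data: bytes = b""
    children: Dict[str, "ZNode"] = field(default_factory=dict)


class ZNodeTree:
    """Hypothetical in-memory tree addressed by slash-separated paths."""

    def __init__(self) -> None:
        self.root = ZNode()  # plays the role of "/"

    def _find(self, path: str) -> ZNode:
        node = self.root
        for part in [p for p in path.split("/") if p]:
            node = node.children[part]  # raises KeyError if the path is missing
        return node

    def create(self, path: str, data: bytes = b"") -> None:
        parent_path, _, name = path.rpartition("/")
        if not name:
            raise ValueError("cannot create the root or an empty name")
        parent = self._find(parent_path)
        if name in parent.children:
            raise ValueError(f"{path} already exists")
        parent.children[name] = ZNode(data)

    def get(self, path: str) -> bytes:
        return self._find(path).data

    def get_children(self, path: str) -> List[str]:
        return sorted(self._find(path).children)


tree = ZNodeTree()
tree.create("/config", b"version=1")              # acts like a file...
tree.create("/config/db_url", b"db.example:5432")  # ...and like a directory
```

Here /config keeps its own payload while also being the parent of /config/db_url, which is the "both data and children" case described above.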

Data in ZooKeeper is stored in memory and in persistent logs. Keeping the data tree in memory helps ZooKeeper achieve high throughput and low latency. The persistent store holds transaction logs and fuzzy snapshots. As the name indicates, snapshots represent the data tree at a given moment. They're called fuzzy because they are taken while the server keeps processing updates, so they may miss or only partially reflect changes made during that time. For example, if a zNode is removed while a snapshot is being taken, the snapshot may still refer to something that no longer exists. Transaction logs, on the other hand, help ZooKeeper ensure that no operation is lost. Every zNode update is first written to the persistent log files before it is applied.
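The interplay between the log and a fuzzy snapshot can be sketched as follows. This is a simplified model under stated assumptions, not ZooKeeper's actual storage code: ToyStore, recover, and the transaction tuple layout are hypothetical, and a Python list stands in for the log files on disk. The point it illustrates is that a snapshot that matches no single moment can still be repaired by replaying the logged transactions made after the snapshot started.

```python
from typing import Dict, List, Optional, Tuple

# A transaction is (zxid, path, data); data=None means "delete the path".
Txn = Tuple[int, str, Optional[bytes]]


def apply_txn(tree: Dict[str, bytes], txn: Txn) -> None:
    """Apply one transaction; applying it twice gives the same result."""
    _, path, data = txn
    if data is None:
        tree.pop(path, None)
    else:
        tree[path] = data


class ToyStore:
    """Hypothetical store: every update is logged before it is applied."""

    def __init__(self) -> None:
        self.log: List[Txn] = []          # stands in for the on-disk log
        self.tree: Dict[str, bytes] = {}  # in-memory data tree
        self.last_zxid = 0

    def write(self, path: str, data: Optional[bytes]) -> int:
        self.last_zxid += 1
        txn = (self.last_zxid, path, data)
        self.log.append(txn)       # 1. record the update in the log
        apply_txn(self.tree, txn)  # 2. only then change the in-memory tree
        return self.last_zxid


def recover(start_zxid: int, snapshot: Dict[str, bytes], log: List[Txn]) -> Dict[str, bytes]:
    """Rebuild the tree from a snapshot plus every transaction logged after it started."""
    tree = dict(snapshot)
    for txn in log:
        if txn[0] > start_zxid:
            apply_txn(tree, txn)
    return tree


store = ToyStore()
store.write("/a", b"1")
store.write("/b", b"1")

start_zxid = store.last_zxid       # the snapshot starts here
fuzzy = {"/a": store.tree["/a"]}   # first half of the copy
store.write("/a", None)            # updates arrive while the copy is running
store.write("/b", b"2")
fuzzy["/b"] = store.tree["/b"]     # second half already sees the new value

recovered = recover(start_zxid, fuzzy, store.log)
```

The fuzzy dictionary still contains /a, which was removed during the copy, together with the newer value of /b. Replaying the two logged transactions brings the recovered tree back in line with the live one.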

Data stored in ZooKeeper should be relatively small. Storing big objects can hurt latency because network operations take longer to complete. One solution is to store big files in a bulk storage system such as HDFS and use ZooKeeper only to keep the paths to these files. When we try to save content bigger than 1 MB (the default limit), the client connection is closed.
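The "keep only the location" idea can be sketched with a small helper. Everything here is an assumption made for illustration: the znode_payload function, the MAX_ZNODE_BYTES constant, the "ref:" prefix, and the example HDFS path are hypothetical, and uploading the content to the bulk storage is left to the caller.

```python
MAX_ZNODE_BYTES = 1024 * 1024  # assumed limit, mirroring the 1 MB mentioned above


def znode_payload(content: bytes, bulk_path: str) -> bytes:
    """Return what to store in the zNode: the content itself when it is
    small enough, otherwise a reference to where the content lives."""
    if len(content) < MAX_ZNODE_BYTES:
        return content
    # Hypothetical convention: a "ref:" prefix marks a pointer to bulk storage.
    return f"ref:{bulk_path}".encode("utf-8")


small = znode_payload(b"timeout=30", "hdfs://namenode/configs/app.bin")
big = znode_payload(b"x" * (2 * 1024 * 1024), "hdfs://namenode/configs/app.bin")
```

With this convention a reader that finds a value starting with "ref:" would fetch the real content from the referenced location, while small values are used directly.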

What vocabulary is used in ZooKeeper?

To work efficiently with Apache ZooKeeper, it's important to understand its vocabulary. Below you can find a list of the concepts that are, subjectively, the most important to learn:

This article presents some basic information that should help you understand and start working with Apache ZooKeeper. The first part describes in general terms what ZooKeeper is and introduces one of its important terms: zNodes. The remaining terms, subjectively judged important, are presented in the second part.
