Distributing data introduces a problem that traditional standalone relational databases did not have to face: data consistency. Cassandra handles this problem well with its configurable consistency levels.
This article is the first of two about data consistency in Cassandra. It is purely theoretical; the examples are left for the second one. The first part introduces the concept of quorum. The second part describes strong consistency. The last part presents the consistency levels available for both writes and reads.
Quorum in Cassandra
Quorum is explained first because the consistency levels are hard to understand without it. Generally, a quorum defines the number of replicas that must acknowledge a read or a write. It is directly tied to a parameter called the replication factor.
The formula used to calculate quorum is: N / 2 + 1, with the division rounded down to a whole number, where N is the sum of the replication factors of all data centers. Some examples to illustrate that:
- 1 data center, replication factor of 4 -> quorum is 3. It means that 1 replica node can be down.
- 2 data centers, replication factor of 5 on each -> quorum is 6. It means that 4 replica nodes can be down.
- 4 data centers, replication factor of 5 on each -> quorum is 11. It means that 9 replica nodes can be down.
The quorum therefore defines the number of unavailable replica nodes that can be tolerated (or, seen from the other side, the number of replica nodes that must be available to satisfy the client's request). The formula is the same for the write and the read consistency levels that rely on quorum.
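The quorum arithmetic above can be checked with a few lines of Python. This is only a sketch of the formula, not Cassandra code: the function names `quorum` and `tolerated_failures` are made up for the illustration, and each data center is represented simply by its replication factor.

```python
def quorum(replication_factors):
    """Quorum size for a list of per-data-center replication factors."""
    n = sum(replication_factors)
    # Integer division rounds down, as in the formula N / 2 + 1.
    return n // 2 + 1


def tolerated_failures(replication_factors):
    """Number of replicas that can be down while a quorum is still reachable."""
    return sum(replication_factors) - quorum(replication_factors)


# The three examples from the list above.
for factors in ([4], [5, 5], [5, 5, 5, 5]):
    print(factors, quorum(factors), tolerated_failures(factors))
```

The loop prints a quorum of 3, 6 and 11 with 1, 4 and 9 tolerated failures, which matches the examples above.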
Strong consistency in Cassandra
Another concept worth knowing before looking at the consistency levels is strong consistency. Here too, there is a formula. Strong consistency is described by R + W > N, where R is the number of replicas that must respond to a read, W is the number of replicas that must acknowledge a write, and N is the replication factor. R and W come from the consistency levels chosen for read and write requests.
The formula is easy to understand. Let's take a replication factor of 3 as an example. First, suppose we use quorum for both writes and reads, so R and W are both equal to 2. This gives strong consistency because 2 + 2 > 3: the replicas that are read always include at least one replica that received the latest write. But if we use quorum only for writes and a less strict rule for reads (for example ONE, which uses only 1 replica), we won't achieve strong consistency because 2 + 1 = 3, which is not greater than 3.
On the other side, we find the concept of weak consistency (also called "eventual consistency"). It applies when R + W <= N. The symbols have the same meaning as for strong consistency.
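The inequality can be expressed as a small check. The sketch below is only an illustration of the R + W > N rule; `is_strongly_consistent` is a hypothetical helper name, not a part of Cassandra or of its drivers.

```python
def is_strongly_consistent(read_replicas, write_replicas, replication_factor):
    """Return True when R + W > N, i.e. reads and writes always overlap."""
    return read_replicas + write_replicas > replication_factor


# Replication factor of 3, quorum (2 replicas) for both reads and writes.
print(is_strongly_consistent(2, 2, 3))

# Replication factor of 3, quorum for writes but ONE for reads.
print(is_strongly_consistent(1, 2, 3))
```

The first call returns True and the second returns False, since 2 + 1 is equal to 3 and not greater than it.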
Consistency levels in Cassandra
With these two concepts in mind, we can look more closely at the consistency levels available in Apache Cassandra for writes and reads. Let's begin with the write consistency levels:
- ALL - data must be written on all replica nodes in the cluster
- EACH_QUORUM, QUORUM, LOCAL_QUORUM - apply the quorum idea described earlier. The difference lies in the scope. The first one requires a quorum of replica nodes in each data center. The second one requires a quorum computed over the replicas of all data centers, while the last one requires a quorum only among the replica nodes of the coordinator node's data center.
- ONE, TWO, THREE, LOCAL_ONE - specify the minimal number of replicas which must acknowledge the write (1, 2, 3). LOCAL_ONE is a variation of ONE that applies only to replica nodes of the local data center.
- ANY - at least one node must handle the write.
- SERIAL, LOCAL_SERIAL - provide consistency for Cassandra's lightweight transactions, where the given operations must execute without being interrupted by other operations. Once again, LOCAL is a variant: it applies this quorum-based consistency only to the replica nodes of the local data center.
Apart from ANY, which is write-only, the same levels exist for reads. The only difference is in what the replicas do. For example, ALL no longer concerns writing but returning data: if one of the replicas is not available, the read fails. The same response-oriented approach applies to the remaining levels (EACH_QUORUM, QUORUM, LOCAL_QUORUM, ONE, TWO, THREE, LOCAL_ONE). One subtle difference comes from the (LOCAL_)SERIAL level. With it, a read that encounters an uncommitted transaction in progress commits this transaction as part of the read.
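To make the differences between the levels more concrete, the sketch below computes how many replicas must respond for some of them. It is a simplified model written for this article, not driver code: `required_replicas`, `quorum_of`, the data center names and the dictionary of replication factors are all assumptions. ANY and the SERIAL levels are left out on purpose because they can't be reduced to a simple replica count.

```python
def quorum_of(replicas):
    """Quorum for a given number of replicas (N / 2 + 1, rounded down)."""
    return replicas // 2 + 1


def required_replicas(level, rf_by_dc, local_dc):
    """Number of replicas that must respond for a consistency level.

    rf_by_dc maps a (hypothetical) data center name to its replication
    factor; local_dc is the data center of the coordinator node.
    """
    fixed = {"ONE": 1, "LOCAL_ONE": 1, "TWO": 2, "THREE": 3}
    if level in fixed:
        return fixed[level]
    if level == "ALL":
        return sum(rf_by_dc.values())
    if level == "QUORUM":
        # Quorum over the replicas of all data centers together.
        return quorum_of(sum(rf_by_dc.values()))
    if level == "LOCAL_QUORUM":
        # Quorum only in the coordinator's data center.
        return quorum_of(rf_by_dc[local_dc])
    if level == "EACH_QUORUM":
        # A quorum in every data center, summed up.
        return sum(quorum_of(rf) for rf in rf_by_dc.values())
    raise ValueError(f"level not covered by this sketch: {level}")


replication = {"dc1": 3, "dc2": 3}
for level in ("ALL", "EACH_QUORUM", "QUORUM", "LOCAL_QUORUM", "ONE"):
    print(level, required_replicas(level, replication, "dc1"))
```

With two data centers and a replication factor of 3 in each, ALL needs 6 replicas, EACH_QUORUM and QUORUM need 4, LOCAL_QUORUM needs 2 and ONE needs 1. Note that LOCAL_ONE returns the same count as ONE here; the model only counts replicas and does not capture that the responding replica must be in the local data center.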
This article shows that consistency requires finding a balance between availability and data accuracy. We learned the concept of quorum, widely used in the consistency levels for writes and reads. Then we discovered the idea of strong consistency, which is an inequality between the read and write consistency levels and the replication factor. The last part described the available consistency levels.