Data consistency in Cassandra

Distributed data brings a new problem to historical standalone relational databases - data consistency. Cassandra deals with this problem pretty nice with its different consistency levels.

Data Engineering Design Patterns

Looking for a book that defines and solves most common data engineering problems? I'm currently writing one on that topic and the first chapters are already available in πŸ‘‰ Early Release on the O'Reilly platform

I also help solve your data engineering problems πŸ‘‰ πŸ“©

This article is the first one describing this data consistency topic. It's purely theoretical and only the second one contains some examples. Here, in the first part we introduce the concept of quorum. After that, we describe strong consistency concept. Finally, we present available consistency levels on both, write and read, sides.

Quorum in Cassandra

The idea of quorum is important to understand consistency types. It's the reason why it's explained before them. Generally, quorum defines the number of replicas which should acknowledge data read or write. A quorum is strictly related to a parameter called replication factory.

The formula used to calculate quorum is: N / 2 + 1, where N is the sum of replication factors in each data center. To illustrate that, some examples:

The quorum helps to define the number of tolerated unavailable replica nodes (or the number of available replica nodes to satisfy client's request, the same thing seen in different manner). Presented formula is the same for writing and reading consistency levels invoking quorum.

Strong consistency in Cassandra

Another consistency concept good to know before discovering consistency types is strong consistency. Here too, we have to deal with a formula. The one describing strong consistency is R + W > N, where: R represents read nodes, W written nodes and N is the replication factor. R and W come from consistency levels for read and write requests.

The formula is easily understandable. Let's take an example for the replication factor of 3. First, we have a quorum for both writes and reads, so R and W values are equal to 2. It helps to achieve strong consistency (data replicated in 100%) because 2+2 > 3. But if we have quorum for only writes and less strict rule for reads (for example ONE, using only 1 replica), we won't achieve strong consistency because 2+1 > 3.

In the other side we can find the concept of weak consistency (used synonymous is "eventually consistent"). It occurs when this formula occurs: R + W <= N. Meaning of the symbols is the same as for strong consistency.

Consistency levels in Cassandra

After discovering two important consistency concepts in Apache Cassandra, we can dive in more exact topic of available consistency levels for writing and reading. Let's begin by writing consistency levels:

The same levels can be found for reading consistency levels. The difference consists only on replicas actions. For example, ALL is not concerned anymore by writing but by returning data. If one of replicas is not available, read will fail. The same, response-oriented approach, concerns remaining levels (EACH_QUORUM, QUORUM, LOCAl_QUORUM, ONE, TWO, THREE, LOCAl_ONE). One subtle difference comes from (LOCAL_)SERIAL level. For it, read of uncommited write transaction in progress will commit this transaction as a part of read process.

Through this article we can see that consistency demands to find a balance between availability and data accuracy. We learned the concept of quorum, widely used in consistency levels for writes and reads. After we discovered an idea of strong consistency which is an inequality between read and write consistency levels and the replication factor. The last part described available consistency levels.

If you liked it, you should read:

πŸ“š Newsletter Get new posts, recommended reading and other exclusive information every week. SPAM free - no 3rd party ads, only the information about waitingforcode!