Introduction to Apache Cassandra on waitingforcode.com

After some articles about data ingestion and serialization in Big Data applications, it's time to learn about storage. This part begins with Apache Cassandra.

This article presents the basic concepts of Apache Cassandra. The first part explains the architecture and general concepts of this solution. The second part focuses on developer topics and describes the main points of data organization.

Apache Cassandra - general concepts

Apache Cassandra is a NoSQL database designed around 2 of the 3 properties of the CAP theorem: Availability and Partition tolerance (by the way, it's almost impossible to have all 3 of them and still keep acceptable latency). So, Cassandra is quick and fault-tolerant. It's also linearly scalable, i.e. if one machine can support 100 writes per second, 2 machines should support 200 writes per second, 3 machines 300, and so on.

From the architectural point of view, Cassandra is based on nodes organized in a ring. Every node has the same role, which means there is no concept of a master node. The nodes communicate with each other through a protocol called gossip. This kind of architecture guarantees that there is no single point of failure: when one node is dead, its data can still be read from a replica.
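
The ring idea can be illustrated with a minimal Python sketch. It is only a toy model built on assumptions: the node names, the `Ring` class and the MD5-based `token` function are hypothetical, and real Cassandra features such as virtual nodes, gossip or configurable replication strategies are left out. It shows how a key maps to a position on the ring and how a replica can serve a read when the first node is dead.

```python
import bisect
import hashlib


def token(value: str) -> int:
    """Map a string to a position on the ring (MD5 is used only for illustration)."""
    return int(hashlib.md5(value.encode("utf-8")).hexdigest(), 16)


class Ring:
    """Toy ring: every node is a peer that owns one position (token)."""

    def __init__(self, nodes, replication_factor=2):
        self.replication_factor = replication_factor
        self.entries = sorted((token(name), name) for name in nodes)
        self.tokens = [t for t, _ in self.entries]

    def replicas(self, key: str):
        """Return the nodes holding the key: the owner and its clockwise successors."""
        start = bisect.bisect_left(self.tokens, token(key)) % len(self.entries)
        count = min(self.replication_factor, len(self.entries))
        return [self.entries[(start + i) % len(self.entries)][1] for i in range(count)]

    def read_from(self, key: str, dead_nodes=()):
        """Pick the first replica that is still alive, or None if all are down."""
        for node in self.replicas(key):
            if node not in dead_nodes:
                return node
        return None


ring = Ring(["node-a", "node-b", "node-c", "node-d"], replication_factor=2)
owners = ring.replicas("user:42")
# If the first replica is dead, the read is served by the second one.
fallback = ring.read_from("user:42", dead_nodes={owners[0]})
print(owners, fallback)
```

Since no node plays a special role in this sketch, removing any single node only changes which replica answers, not whether the data can be read.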

Apache Cassandra - data storage

Developers who previously worked with relational databases can quickly start working with Cassandra thanks to its dedicated Cassandra Query Language (CQL). The syntax of CQL resembles the syntax of SQL, which makes it very intuitive to use. However, there are also some differences from relational databases:

no joins - Cassandra isn't designed to JOIN tables. Instead, it prefers either denormalized data or related data stored in a column of a collection type. Since JOIN is not supported, the concept of relations between tables is not supported either.
aggregations - support for aggregations is quite limited in Cassandra. CQL v3.4.0 ships only a handful of native aggregate functions: count, min, max, sum and avg. This list can be extended with User Defined Functions (UDF) and the User Defined Aggregates built on top of them, which a programmer can write to cover other aggregation cases, such as grouping. But be aware of the possible impact on performance when making aggregations.
model per query - in the relational world, data is modeled through normalization. In Cassandra, modeling is driven by the queries. It means that the same data is sometimes duplicated across different tables, and this shouldn't be considered a bad practice. Trying to normalize data in Cassandra will in most cases lead to inefficient queries.
keys and indexes - Cassandra also implements the concepts of indexes and primary keys, but the role of primary keys is a bit broader. Without going into details (it's not the right moment), keys in Cassandra define on which node a given row is stored and in which order.
schemaless - like most NoSQL solutions, Cassandra allows flexible schema changes over time.
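
The "model per query" and "keys" points can be sketched in a few lines of Python. This is a simplified illustration and not Cassandra's implementation: the `Table` class, the table names and the sample songs are all hypothetical. One part of the key groups rows into a partition, the other part keeps them sorted inside it, and the same rows are deliberately written to two tables, one per query.

```python
import bisect
from collections import defaultdict


class Table:
    """Toy table: the partition key groups rows, the clustering key orders them."""

    def __init__(self, partition_key, clustering_key):
        self.partition_key = partition_key
        self.clustering_key = clustering_key
        self.partitions = defaultdict(list)  # partition value -> sorted rows

    def insert(self, row: dict):
        partition = self.partitions[row[self.partition_key]]
        sort_keys = [r[self.clustering_key] for r in partition]
        position = bisect.bisect_right(sort_keys, row[self.clustering_key])
        partition.insert(position, row)  # rows stay sorted inside the partition

    def select(self, partition_value):
        """Read one whole partition, already in clustering order."""
        return list(self.partitions.get(partition_value, []))


# One table per query: the data is duplicated on purpose.
songs_by_artist = Table(partition_key="artist", clustering_key="title")
songs_by_year = Table(partition_key="year", clustering_key="title")

songs = [
    {"artist": "Artist A", "title": "Song 2", "year": 2001},
    {"artist": "Artist A", "title": "Song 1", "year": 2003},
    {"artist": "Artist B", "title": "Song 3", "year": 2001},
]
for song in songs:
    songs_by_artist.insert(song)
    songs_by_year.insert(song)

print([s["title"] for s in songs_by_artist.select("Artist A")])  # ['Song 1', 'Song 2']
print([s["title"] for s in songs_by_year.select(2001)])          # ['Song 2', 'Song 3']
```

Each query reads a single partition without any join, at the price of writing every song twice.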

But how is the data stored? The flow is simple. First, incoming data is written to a persistent file called the commit log. Next, it goes to an in-memory structure called the memtable. When this structure reaches a configured threshold, its content is flushed to another persistent file on disk, called an SSTable (Sorted String Table). Normally, several files of this type can exist for a given table; they are merged together by the compaction operation. For reads, Cassandra is helped by a structure called a Bloom filter. It tells whether a given SSTable probably contains the searched data or definitely does not. If the filter indicates that a given file may have the data, Cassandra checks the in-memory cache and retrieves the final data from the SSTable.
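
The write and read paths described above can be sketched as a toy storage engine in Python. Everything here is an assumption made for illustration: the class names, the tiny `memtable_limit` threshold and the SHA-256-based Bloom filter are hypothetical, the files are modelled as in-memory lists, and the in-memory cache step is omitted.

```python
import hashlib


class BloomFilter:
    def __init__(self, size=256, hashes=3):
        self.size, self.hashes, self.bits = size, hashes, 0

    def _positions(self, key):
        for seed in range(self.hashes):
            digest = hashlib.sha256(f"{seed}:{key}".encode("utf-8")).hexdigest()
            yield int(digest, 16) % self.size

    def add(self, key):
        for position in self._positions(key):
            self.bits |= 1 << position

    def might_contain(self, key):
        # False means "definitely absent"; True only means "possibly present".
        return all((self.bits >> p) & 1 for p in self._positions(key))


class SSTable:
    """Immutable, sorted key/value file, modelled here as a sorted list."""

    def __init__(self, items: dict):
        self.rows = sorted(items.items())
        self.bloom = BloomFilter()
        for key, _ in self.rows:
            self.bloom.add(key)

    def get(self, key):
        return dict(self.rows).get(key)


class ToyStorage:
    def __init__(self, memtable_limit=2):
        self.commit_log = []  # stands in for the append-only file on disk
        self.memtable = {}
        self.sstables = []    # oldest first
        self.memtable_limit = memtable_limit

    def write(self, key, value):
        self.commit_log.append((key, value))  # 1. persist the write first
        self.memtable[key] = value            # 2. then update the memtable
        if len(self.memtable) >= self.memtable_limit:
            self.flush()                      # 3. threshold reached

    def flush(self):
        self.sstables.append(SSTable(self.memtable))
        self.memtable = {}
        self.commit_log.clear()  # simplification: flushed data no longer needs the log

    def compact(self):
        merged = {}
        for sstable in self.sstables:  # newer tables overwrite older values
            merged.update(sstable.rows)
        self.sstables = [SSTable(merged)]

    def read(self, key):
        if key in self.memtable:
            return self.memtable[key]
        for sstable in reversed(self.sstables):  # newest first
            if sstable.bloom.might_contain(key):  # skip files that cannot hold the key
                value = sstable.get(key)
                if value is not None:
                    return value
        return None


storage = ToyStorage(memtable_limit=2)
storage.write("k1", "v1")
storage.write("k2", "v2")      # second write triggers a flush to a first SSTable
storage.write("k1", "v1-new")
storage.write("k3", "v3")      # second flush, two SSTables now exist
tables_before = len(storage.sstables)
storage.compact()
print(tables_before, len(storage.sstables), storage.read("k1"))  # 2 1 v1-new
```

The Bloom filter never replaces the lookup itself: it only lets the read skip SSTables that certainly do not contain the key.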

This article described the basic ideas of Apache Cassandra. The first part showed some basic architectural concepts, such as the ring organization and Cassandra's position with regard to the CAP theorem. The second part focused on the developer and storage side. It described the differences between relational databases and Cassandra, and explained how data is written and read.
