Introduction to Apache Cassandra

After several articles about data ingestion and serialization in Big Data applications, it's time to start learning about storage. This part begins with Apache Cassandra.

This article presents the basic concepts of Apache Cassandra. The first part explains the architecture and general concepts of this solution. The second part focuses more on developer topics and describes some main points about data organization.

Bartosz Konieczny

Apache Cassandra - general concepts

Apache Cassandra is a NoSQL database designed around 2 of the 3 properties of the CAP theorem: Availability and Partition tolerance (by the way, it's almost impossible to have all 3 of them and still keep acceptable latency). As a result, Cassandra is fast and fault-tolerant. It's also linearly scalable, i.e. if one machine can support 100 writes per second, 2 machines should support 200 writes per second, 3 machines 300, and so on.

From the architectural point of view, Cassandra is made of nodes organized in a ring. Every node has the same role, which means that there is no concept of a master node. Nodes communicate with each other through a protocol called gossip. This kind of architecture avoids a single point of failure: when one node is dead, data can still be read from its replica.
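The ring idea can be illustrated with a small Python sketch. It is only a toy model under stated assumptions: the `Ring` class, the `token_for` helper, the node names and the use of MD5 are all hypothetical choices made for the illustration, and gossip is not modeled at all. It shows how a key can be mapped to a position on the ring, how its replicas can be the next nodes clockwise, and how a read can fall back to a replica when the first node is dead.

```python
import bisect
import hashlib


def token_for(key: str) -> int:
    """Map a key to a ring position (MD5 is an assumption made for this sketch)."""
    return int.from_bytes(hashlib.md5(key.encode("utf-8")).digest()[:8], "big")


class Ring:
    """Hypothetical toy ring: every node has the same role, there is no master."""

    def __init__(self, nodes, replication_factor=2):
        self.replication_factor = min(replication_factor, len(nodes))
        # Each node owns one token; sorting the tokens forms the ring.
        self.tokens = sorted((token_for(name), name) for name in nodes)

    def replicas(self, key):
        """Return the first node at or after the key's token, plus its successors."""
        positions = [token for token, _ in self.tokens]
        start = bisect.bisect_left(positions, token_for(key)) % len(self.tokens)
        return [
            self.tokens[(start + i) % len(self.tokens)][1]
            for i in range(self.replication_factor)
        ]

    def read_from(self, key, dead_nodes=()):
        """Pick the first replica that is still alive, or None if all are down."""
        for node in self.replicas(key):
            if node not in dead_nodes:
                return node
        return None


ring = Ring(["node-a", "node-b", "node-c", "node-d"], replication_factor=2)
owners = ring.replicas("user:42")
print(owners)
# With the first owner marked as dead, the read goes to the second replica.
print(ring.read_from("user:42", dead_nodes={owners[0]}))
```

In this sketch no node is special: removing any single node from the list still leaves another replica for every key, which is the property the paragraph above describes.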

Apache Cassandra - data storage

Developers who previously worked with relational databases can quickly start working with Cassandra thanks to its dedicated Cassandra Query Language (CQL). The syntax of CQL resembles the syntax of SQL, which makes it very intuitive to use. However, there are also some differences from relational databases.

But how is the data stored? The flow is simple. First, incoming data is written to a persistent file called the commit log. Then it goes to an in-memory structure called the memtable. When this structure reaches a configured threshold, its content is written to disk once again, this time to a file called an SSTable (Sorted String Table). Several files of this type can exist for a given table; they are merged together by an operation called compaction. On the read side, Cassandra is helped by a structure called a Bloom filter, which tells whether a given SSTable probably contains the requested data. If the filter indicates that the file may contain the data, Cassandra checks its in-memory cache and retrieves the final data from the SSTable.
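The write and read flow described above can be sketched in a few lines of Python. This is a deliberately simplified model, not Cassandra's implementation: the `Table`, `SSTable` and `BloomFilter` classes, the `memtable_limit` parameter and the SHA-256 based hashing are all assumptions made for the illustration, everything lives in memory, and the in-memory caches of the read path are left out.

```python
import hashlib


class BloomFilter:
    """Answers "definitely absent" or "possibly present" for a key."""

    def __init__(self, size=1024, hashes=3):
        self.size = size
        self.hashes = hashes
        self.bits = [False] * size

    def _positions(self, key):
        # Derive several bit positions per key by salting the hash with a seed.
        for seed in range(self.hashes):
            digest = hashlib.sha256(f"{seed}:{key}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.size

    def add(self, key):
        for position in self._positions(key):
            self.bits[position] = True

    def might_contain(self, key):
        return all(self.bits[position] for position in self._positions(key))


class SSTable:
    """Immutable snapshot of a memtable, with keys kept in sorted order."""

    def __init__(self, rows):
        self.rows = dict(sorted(rows.items()))
        self.bloom = BloomFilter()
        for key in self.rows:
            self.bloom.add(key)


class Table:
    def __init__(self, memtable_limit=2):
        self.commit_log = []  # stands in for the persistent commit log file
        self.memtable = {}
        self.sstables = []    # oldest first
        self.memtable_limit = memtable_limit

    def write(self, key, value):
        self.commit_log.append((key, value))  # 1. append to the commit log
        self.memtable[key] = value            # 2. update the memtable
        if len(self.memtable) >= self.memtable_limit:
            self.flush()                      # 3. threshold reached

    def flush(self):
        self.sstables.append(SSTable(self.memtable))
        self.memtable = {}
        self.commit_log.clear()  # simplification: flushed writes need no replay

    def compact(self):
        """Merge all SSTables into one; values from newer files win."""
        merged = {}
        for sstable in self.sstables:
            merged.update(sstable.rows)
        self.sstables = [SSTable(merged)] if merged else []

    def read(self, key):
        if key in self.memtable:  # data not flushed yet
            return self.memtable[key]
        for sstable in reversed(self.sstables):  # newest first
            # The Bloom filter lets us skip files that cannot hold the key.
            if sstable.bloom.might_contain(key) and key in sstable.rows:
                return sstable.rows[key]
        return None


table = Table(memtable_limit=2)
table.write("k1", "v1")
table.write("k2", "v2")      # second write reaches the threshold: first flush
table.write("k1", "v1-new")
table.write("k3", "v3")      # second flush
table.compact()              # the two SSTables become one
print(len(table.sstables), table.read("k1"), table.read("missing"))
# prints: 1 v1-new None
```

The Bloom filter can return a false positive but never a false negative, which is why the sketch still checks the SSTable content after the filter answers "possibly present".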

This article describes the basic ideas of Apache Cassandra. The first part shows some basic architectural components, such as the ring organization and Cassandra's position with respect to the CAP theorem. The second part focuses more on developer and storage topics. It describes differences between SQL and Cassandra and explains how data is written and read.
