Introduction to Apache Cassandra

After several articles about data ingestion and serialization in Big Data applications, it's time to look at storage. This part begins with Apache Cassandra.

This article presents the basic concepts of Apache Cassandra. The first part explains the architecture and general concepts of this solution. The second part focuses on developer topics and describes the main points of data organization.

Apache Cassandra - general concepts

Apache Cassandra is a NoSQL database designed around 2 of the 3 guarantees of the CAP theorem: Availability and Partition tolerance (by the way, it's almost impossible to have all 3 of them and still keep acceptable latency). So, Cassandra is quick and fault-tolerant. It's also linearly scalable, i.e. if one machine can support 100 writes per second, 2 machines should support 200 writes per second, 3 machines 300 and so on.

From the architectural point of view, Cassandra is made of nodes organized in a ring. Every node has the same role, which means there is no concept of a master node. The nodes communicate through a protocol called gossip. This kind of architecture guarantees that there is no single point of failure: when one node is dead, its data can still be read from a replica.
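
The ring idea can be illustrated with a minimal Python sketch. It is a toy model and not Cassandra's real implementation: the `Ring` class, the node names, the single token per node and the MD5-based token are all assumptions made for the example. Each key is hashed to a position on the ring and stored on the next nodes clockwise, so a read can still be served when one of them is down.

```python
import bisect
import hashlib


class Ring:
    """Toy token ring (hypothetical): every node is equal and owns one token."""

    def __init__(self, nodes, replication_factor=2):
        self.replication_factor = replication_factor
        # Sort nodes by token so the ring can be walked clockwise.
        self.tokens = sorted((self._token(n), n) for n in nodes)
        self.alive = set(nodes)

    @staticmethod
    def _token(value):
        # MD5 is used only as a stable hash for this illustration.
        return int(hashlib.md5(value.encode("utf-8")).hexdigest(), 16)

    def replicas(self, key):
        """Return the nodes holding the key: the owner plus its clockwise neighbours."""
        start = bisect.bisect_left(self.tokens, (self._token(key), ""))
        count = min(self.replication_factor, len(self.tokens))
        return [self.tokens[(start + i) % len(self.tokens)][1] for i in range(count)]

    def read_from(self, key):
        """Return the first live replica for the key, or None if all are down."""
        for node in self.replicas(key):
            if node in self.alive:
                return node
        return None


ring = Ring(["node-a", "node-b", "node-c"], replication_factor=2)
owners = ring.replicas("user:42")
ring.alive.discard(owners[0])          # simulate the first replica dying
survivor = ring.read_from("user:42")   # the second replica still serves the read
```

No node in this sketch plays a special role: any of them can be removed from `alive` and the keys it held stay readable as long as one replica survives.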

Apache Cassandra - data storage

Developers who previously worked with relational databases can quickly start working with Cassandra thanks to its dedicated Cassandra Query Language (CQL). The syntax of CQL looks like the syntax of SQL, which makes it very intuitive to use. However, there are also some differences with relational databases.

But how is the data stored? The flow is simple. First, incoming data is written to a persistent file called the commit log. Next, it goes to an in-memory structure called the memtable. When this structure reaches a configured threshold, its content is flushed to another persistent file on disk, called an SSTable (Sorted Strings Table). Normally, several files of this type can exist for a given table. They are merged together by the compaction operation. On the read side, Cassandra is helped by a structure called a Bloom filter. It tells whether a given SSTable probably contains the searched data or definitely does not. If the filter indicates that a given file may have the data, Cassandra checks the in-memory cache and retrieves the final data from the SSTable.
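
The write and read flow described above can be sketched in a few lines of Python. This is a simplified model under stated assumptions, not Cassandra's code: `ToyStore`, `SSTable`, `BloomFilter` and the `memtable_limit` parameter are hypothetical names, files are replaced by in-memory structures, the in-memory cache step is left out, and the read also looks at the memtable first because recent writes live there.

```python
import hashlib


class BloomFilter:
    """Answers "maybe present" or "definitely absent" for a key."""

    def __init__(self, size=1024, hashes=3):
        self.size = size
        self.hashes = hashes
        self.bits = [False] * size

    def _positions(self, key):
        for i in range(self.hashes):
            digest = hashlib.sha256(f"{i}:{key}".encode("utf-8")).hexdigest()
            yield int(digest, 16) % self.size

    def add(self, key):
        for pos in self._positions(key):
            self.bits[pos] = True

    def might_contain(self, key):
        return all(self.bits[pos] for pos in self._positions(key))


class SSTable:
    """Immutable, key-sorted snapshot of a memtable with its own Bloom filter."""

    def __init__(self, rows):
        self.rows = dict(sorted(rows.items()))
        self.bloom = BloomFilter()
        for key in self.rows:
            self.bloom.add(key)


class ToyStore:
    def __init__(self, memtable_limit=2):
        self.commit_log = []   # stands in for the append-only file on disk
        self.memtable = {}
        self.sstables = []     # oldest first
        self.memtable_limit = memtable_limit

    def write(self, key, value):
        self.commit_log.append((key, value))  # 1. commit log first
        self.memtable[key] = value            # 2. then the memtable
        if len(self.memtable) >= self.memtable_limit:
            self.flush()                      # 3. threshold reached

    def flush(self):
        self.sstables.append(SSTable(self.memtable))
        self.memtable = {}
        self.commit_log.clear()  # flushed entries no longer need replaying

    def compact(self):
        merged = {}
        for table in self.sstables:  # oldest first, so newer values win
            merged.update(table.rows)
        self.sstables = [SSTable(merged)]

    def read(self, key):
        if key in self.memtable:
            return self.memtable[key]
        for table in reversed(self.sstables):  # newest first
            # Skip SSTables that the Bloom filter rules out.
            if table.bloom.might_contain(key) and key in table.rows:
                return table.rows[key]
        return None


store = ToyStore(memtable_limit=2)
for key, value in [("a", 1), ("b", 2), ("a", 10), ("c", 3)]:
    store.write(key, value)
tables_before = len(store.sstables)  # two flushes happened
store.compact()
tables_after = len(store.sstables)   # compaction merged them into one
```

The Bloom filter can return a false positive, which is why the sketch still checks `table.rows` afterwards, but it never returns a false negative, so skipping a table it rejects is safe.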

This article described the basic ideas of Apache Cassandra. The first part showed some basic architectural components, such as the ring organization and the response to the CAP theorem. The second part covered developer and storage topics. It described the differences between SQL and Cassandra and explained how data is written and read.
