Data part in Apache Cassandra

The previous article introduced Apache Cassandra with a broad overview of its main concepts. This article looks in more detail at data-related topics.

The first part describes the general mechanism of data storage. The second part demystifies the different types of primary keys. The last part lists and details the data types supported in Cassandra tables.

Data storage

Data in Cassandra is stored in logical units called keyspaces. A keyspace is similar to a database in the relational world: it's a container that holds all tables and indexes, and it also defines the replication settings. There are 2 different replication strategies: SimpleStrategy and NetworkTopologyStrategy. The first one is meant for a single data center: it places replicas on consecutive nodes of the ring without taking data centers or racks into account. The second one is aware of data centers and takes as arguments the number of replicas to store in each of them. NetworkTopologyStrategy is safer because it protects against unavailability when all nodes of one data center stop working at the same time.
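To make the two strategies more concrete, here is a minimal sketch that renders the corresponding CREATE KEYSPACE statements as strings. The helper name, the keyspace name (shop) and the data center names (dc_paris, dc_london) are assumptions made for the illustration; only the replication map layout follows CQL.

```python
def cql_literal(value):
    """Quote strings the CQL way (single quotes); leave numbers as they are."""
    return f"'{value}'" if isinstance(value, str) else str(value)


def create_keyspace_cql(name, strategy, **options):
    """Render a CREATE KEYSPACE statement (hypothetical helper, not a driver API)."""
    replication = {"class": strategy, **options}
    entries = ", ".join(
        f"'{key}': {cql_literal(value)}" for key, value in replication.items()
    )
    return f"CREATE KEYSPACE {name} WITH replication = {{{entries}}};"


# Single data center: one replication factor for the whole ring.
simple = create_keyspace_cql("shop", "SimpleStrategy", replication_factor=3)

# Several data centers: one replica count per data center (names are made up).
multi_dc = create_keyspace_cql(
    "shop", "NetworkTopologyStrategy", dc_paris=3, dc_london=2
)

print(simple)
print(multi_dc)
```

With NetworkTopologyStrategy, the data center names in the map are expected to match the names known to the cluster, so the ones above would have to be adapted.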

As explained in the introduction to Apache Cassandra, data is first written to the commit log, then to an in-memory structure called a memtable, and finally, when the memtable reaches a configured threshold, to files called SSTables. On write, data is also distributed to replicas according to the replication settings described above.
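The write path can be sketched with a toy model. This is not how Cassandra is implemented: the ToyNode class is hypothetical, the flush threshold is counted in rows here for simplicity, and commit log cleanup after a flush is left out.

```python
class ToyNode:
    """Toy model of the write path: commit log, then memtable, then SSTable."""

    def __init__(self, memtable_limit=3):
        self.memtable_limit = memtable_limit  # assumed threshold, in rows
        self.commit_log = []  # append-only record of every write
        self.memtable = {}    # in-memory structure, keyed by row key
        self.sstables = []    # flushed "files", each one a sorted list of rows

    def write(self, key, value):
        self.commit_log.append((key, value))  # 1. commit log first
        self.memtable[key] = value            # 2. then the memtable
        if len(self.memtable) >= self.memtable_limit:
            self.flush()                      # 3. threshold reached

    def flush(self):
        # An SSTable is written once, sorted by key, and never modified.
        self.sstables.append(sorted(self.memtable.items()))
        self.memtable = {}


node = ToyNode(memtable_limit=3)
for key, value in [("b", 2), ("a", 1), ("c", 3), ("d", 4)]:
    node.write(key, value)

print(node.sstables)  # one flushed SSTable holding a, b and c
print(node.memtable)  # "d" is still only in memory
```

The point of the sketch is the ordering: every write reaches the commit log before the memtable, and SSTables only appear once the memtable is flushed.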

Data is grouped in tables. Each table has rows, which are described by columns. Since Cassandra doesn't support joins and relations, tables can have thousands of columns (or even more!), and this shouldn't degrade performance.

Another topic related to tables is compaction, which merges SSTables together. The first compaction strategy, SizeTieredCompactionStrategy, is triggered when Cassandra sees that there are enough SSTables of similar size. This number is configurable through the min_threshold compaction option of a given table. Another strategy, DateTieredCompactionStrategy, concerns time series data and groups together data written within a similar period of time. It's also based on the min_threshold parameter. Note that later Cassandra versions deprecate it in favor of TimeWindowCompactionStrategy. The last strategy is LeveledCompactionStrategy, which creates relatively small SSTables organized in levels. Each level is 10 times bigger than the previous one. When a level is full, it doesn't accept new data: the overflowing SSTables are compacted with the data held by the SSTables of the next level.
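The idea behind size-tiered compaction can be sketched as follows. It's a simplified illustration rather than Cassandra's actual code: SSTables are reduced to their sizes, the function names are made up, and the defaults mirror the bucket_low, bucket_high and min_threshold options as I understand them.

```python
def size_tiered_buckets(sizes, bucket_low=0.5, bucket_high=1.5):
    """Group SSTable sizes into buckets of similar size."""
    buckets = []
    for size in sorted(sizes):
        for bucket in buckets:
            average = sum(bucket) / len(bucket)
            # "Similar" means within [bucket_low, bucket_high] x bucket average.
            if average * bucket_low <= size <= average * bucket_high:
                bucket.append(size)
                break
        else:
            buckets.append([size])  # no similar bucket found: start a new one
    return buckets


def buckets_to_compact(sizes, min_threshold=4):
    """Keep only the buckets that hold enough SSTables to trigger a compaction."""
    return [b for b in size_tiered_buckets(sizes) if len(b) >= min_threshold]


sstable_sizes_mb = [10, 11, 12, 9, 100, 105, 1000]
print(size_tiered_buckets(sstable_sizes_mb))  # [[9, 10, 11, 12], [100, 105], [1000]]
print(buckets_to_compact(sstable_sizes_mb))   # [[9, 10, 11, 12]]
```

Only the bucket of four small SSTables reaches min_threshold, so it's the only compaction candidate in this example.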

Primary keys

Primary keys in Cassandra are a more complicated topic than primary keys in relational databases. The simplest primary key is composed of only one column. Let's call it a simple primary key. This type of key is also responsible for choosing the node where a given row will be stored, which is why it's commonly called the partition key. One of Cassandra's goals is to distribute data evenly across nodes. That's the reason why the choice of a good partition key is important.
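How a partition key leads to a node can be illustrated with a toy token ring. The sketch below rests on several assumptions: MD5 stands in for Cassandra's partitioner, the ring has 4 nodes with evenly spaced tokens, and the key-to-node mapping rule is simplified.

```python
import bisect
import hashlib
from collections import Counter

RING_SIZE = 2**32  # assumed token range for this toy ring


def token_for(partition_key):
    """Hash a partition key to a token (MD5 is a stand-in, not Cassandra's hash)."""
    digest = hashlib.md5(str(partition_key).encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % RING_SIZE


def owner_of(partition_key, ring):
    """Return the first node whose token is >= the key's token, wrapping around."""
    tokens = [token for token, _ in ring]
    index = bisect.bisect_left(tokens, token_for(partition_key))
    return ring[index % len(ring)][1]


# 4 hypothetical nodes with evenly spaced tokens, sorted by token.
ring = [(i * RING_SIZE // 4, f"node{i}") for i in range(4)]

# A key with many distinct values spreads rows over all nodes...
spread = Counter(owner_of(f"user-{i}", ring) for i in range(1000))
# ...while a key with a single value sends every row to the same node.
skewed = Counter(owner_of("FR", ring) for _ in range(1000))

print(spread)
print(skewed)
```

The second counter shows why a partition key with few distinct values is a poor choice: all rows end up on one node.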

Another type of primary key is the compound primary key (also called composite primary key). As in the relational world, this kind of key is composed of two or more columns. The first column plays the role of the partition key while the others are called clustering keys. Their role is to sort data inside a given partition.
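The respective roles of the partition key and the clustering key can be mimicked in a few lines. This is only a mental model with made-up column names (sensor_id, reading_time), not a description of the on-disk format.

```python
from collections import defaultdict


def store(rows, partition_key, clustering_key):
    """Group rows by partition key, then sort each partition by clustering key."""
    partitions = defaultdict(list)
    for row in rows:
        partitions[row[partition_key]].append(row)
    for partition_rows in partitions.values():
        partition_rows.sort(key=lambda row: row[clustering_key])
    return dict(partitions)


rows = [
    {"sensor_id": "s1", "reading_time": 3, "value": 20.5},
    {"sensor_id": "s2", "reading_time": 1, "value": 18.0},
    {"sensor_id": "s1", "reading_time": 1, "value": 19.9},
    {"sensor_id": "s1", "reading_time": 2, "value": 20.1},
]

partitions = store(rows, partition_key="sensor_id", clustering_key="reading_time")
for sensor_id, readings in partitions.items():
    print(sensor_id, [reading["reading_time"] for reading in readings])
```

All readings of one sensor live in the same partition, and inside it they are kept ordered by reading time.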

The partition key itself can also be composed of two or more columns. In this case, its columns should be declared inside their own parentheses, for example: PRIMARY KEY ((col1, col2), col1_cluster_key, col2_cluster_key).
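To summarize the three shapes of primary key, here is a small sketch that renders the PRIMARY KEY clause for each of them. The helper is hypothetical and only builds strings; the column names reuse the ones from the example in the text.

```python
def primary_key_clause(partition_columns, clustering_columns=()):
    """Render a PRIMARY KEY clause (hypothetical helper, string building only)."""
    if len(partition_columns) > 1:
        # A multi-column partition key needs its own pair of parentheses.
        partition_part = "(" + ", ".join(partition_columns) + ")"
    else:
        partition_part = partition_columns[0]
    return "PRIMARY KEY (" + ", ".join([partition_part, *clustering_columns]) + ")"


# Simple primary key: one column, which is also the partition key.
print(primary_key_clause(["col1"]))
# Compound primary key: partition key followed by a clustering key.
print(primary_key_clause(["col1"], ["col1_cluster_key"]))
# Multi-column partition key followed by two clustering keys.
print(primary_key_clause(["col1", "col2"], ["col1_cluster_key", "col2_cluster_key"]))
```

The inner parentheses are what distinguish a two-column partition key from a partition key followed by a clustering key.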

Data types

Apache Cassandra supports almost the same data types as relational databases. But some of them don't exist in the relational world or serve different purposes. As a consequence, we can list some differences:

Aside from the specific types listed previously, Cassandra also has more common data types, which we can group into the following categories:

This article starts to dig into the details of Cassandra. The first part describes data organization and compaction. We saw that data is organized around keyspaces containing tables, and we learned about compaction and its different strategies. The second part covers primary keys and their role in Cassandra. We saw that primary keys aren't only used to identify a row but also to route it to the right node. The last part lists and details the data types supported in tables.
