Data part in Apache Cassandra on waitingforcode.com

The previous article introduced Apache Cassandra by briefly presenting its main concepts. This article looks in more detail at data-related topics.

Bartosz Konieczny

The first part describes the general mechanism of data storage. The second part demystifies the different types of primary keys. The last part lists and details the data types supported in Cassandra tables.

Data storage

Data in Cassandra is stored in logical units called keyspaces. A keyspace is similar to a database in the relational world: it's a container holding all tables and indexes, and it also defines the replication level. There are 2 different replication strategies: SimpleStrategy and NetworkTopologyStrategy. The first one is meant for a single data center: it places replicas without taking data centers into account. The second one replicates data across different data centers and takes as arguments the number of replicas stored in each of them. NetworkTopologyStrategy is safer because it protects against unavailability when all nodes of one data center stop working at the same time.
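To make the two strategies more concrete, here is a minimal Python sketch that only builds the corresponding CREATE KEYSPACE statements as strings. The keyspace name shop and the data center names dc1 and dc2 are hypothetical, and nothing here connects to a cluster: executing the statements would require a running Cassandra and a driver, which this sketch doesn't assume.

```python
def simple_strategy_keyspace(name, replication_factor):
    """Build a CREATE KEYSPACE statement using SimpleStrategy."""
    return (
        f"CREATE KEYSPACE {name} WITH replication = "
        f"{{'class': 'SimpleStrategy', 'replication_factor': {replication_factor}}};"
    )


def network_topology_keyspace(name, replicas_per_dc):
    """Build a CREATE KEYSPACE statement using NetworkTopologyStrategy.

    replicas_per_dc maps a data center name to its number of replicas.
    """
    options = ", ".join(f"'{dc}': {count}" for dc, count in replicas_per_dc.items())
    return (
        f"CREATE KEYSPACE {name} WITH replication = "
        f"{{'class': 'NetworkTopologyStrategy', {options}}};"
    )


# Hypothetical keyspace and data center names, used only for illustration.
simple_cql = simple_strategy_keyspace("shop", 3)
multi_dc_cql = network_topology_keyspace("shop", {"dc1": 3, "dc2": 2})
print(simple_cql)
print(multi_dc_cql)
```

The second statement asks for 3 replicas in dc1 and 2 in dc2, which is the per data center argument mentioned above.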

As described in the introduction to Apache Cassandra, data is first written to the commit log, then to an in-memory structure called a memtable and finally, when the memtable reaches a configured threshold, to files called SSTables. On write, data is also distributed according to the replication factor described above.
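The order of these steps can be illustrated with a toy Python model. It's only a sketch under simplifying assumptions: the class ToyWritePath and its memtable_threshold parameter are hypothetical, the threshold counts rows instead of bytes, and replication and commit log cleanup are left out.

```python
class ToyWritePath:
    """Toy model of the write path: commit log, then memtable, then SSTable."""

    def __init__(self, memtable_threshold):
        self.memtable_threshold = memtable_threshold  # hypothetical: max rows kept in memory
        self.commit_log = []  # append-only record of every write
        self.memtable = {}    # in-memory structure, keyed by row key
        self.sstables = []    # flushed files, modeled as immutable sorted tuples

    def write(self, key, value):
        self.commit_log.append((key, value))  # 1. the commit log comes first
        self.memtable[key] = value            # 2. then the in-memory structure
        if len(self.memtable) >= self.memtable_threshold:
            self._flush()                     # 3. threshold reached: write an SSTable

    def _flush(self):
        # The flushed data is sorted by key and never modified afterwards.
        self.sstables.append(tuple(sorted(self.memtable.items())))
        self.memtable = {}


node = ToyWritePath(memtable_threshold=2)
node.write("user:2", "Bob")
node.write("user:1", "Alice")  # second row reaches the threshold and triggers a flush
node.write("user:3", "Carol")  # stays in the memtable for now
print(node.sstables)
print(node.memtable)
```

After these three writes, the first two rows sit in one SSTable while the third one is still only in the commit log and the memtable.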

Data is grouped in tables. Each table has rows which are described by columns. Since Cassandra doesn't support joins and relations, tables can have thousands of columns (or even more!), and this shouldn't degrade performance.

Another topic related to tables is compaction, which merges SSTables together. It's triggered when Cassandra sees that there are many similar SSTables. The number of similar SSTables that triggers it is configurable through the min_threshold option of a given table. The first compaction strategy is called SizeTieredCompactionStrategy and groups SSTables of similar size. Another one, DateTieredCompactionStrategy, targets time series data and groups together data written within a similar period of time (it was deprecated in later Cassandra releases in favor of TimeWindowCompactionStrategy). It's also based on the min_threshold parameter. The last strategy is LeveledCompactionStrategy, which creates relatively small SSTables organized in levels. Each level is 10 times bigger than the previous one. When a level is full, it doesn't accept new data; the overflowing data is compacted with the data held by the SSTables of the next level.
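The size-tiered trigger and the growth of levels can be sketched in a few lines of Python. This is a simplified illustration, not Cassandra's actual algorithm: the function names, the 0.5 and 1.5 similarity bounds and the sizes are assumptions of this sketch. Only min_threshold and the factor of 10 come from the description above.

```python
def bucket_by_size(sstable_sizes, low=0.5, high=1.5):
    """Group SSTable sizes into buckets of similar size.

    A size joins a bucket when it lies within [low * avg, high * avg] of the
    bucket's current average; otherwise it starts a new bucket.
    """
    buckets = []
    for size in sorted(sstable_sizes):
        for bucket in buckets:
            average = sum(bucket) / len(bucket)
            if low * average <= size <= high * average:
                bucket.append(size)
                break
        else:
            buckets.append([size])  # no similar bucket found
    return buckets


def buckets_to_compact(sstable_sizes, min_threshold=4):
    """Return the buckets holding at least min_threshold similar SSTables."""
    return [b for b in bucket_by_size(sstable_sizes) if len(b) >= min_threshold]


def level_capacities(first_level_mb, levels, factor=10):
    """Capacity of each level when every level is `factor` times the previous one."""
    return [first_level_mb * factor ** i for i in range(levels)]


# Hypothetical SSTable sizes in MB: four small files and two large ones.
sizes_mb = [10, 11, 9, 12, 100, 110]
candidates = buckets_to_compact(sizes_mb, min_threshold=4)
print(candidates)
print(level_capacities(100, 3))
```

With min_threshold set to 4, only the bucket of four small SSTables is a compaction candidate; the two large ones wait until more files of similar size appear.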

Primary keys

Primary keys in Cassandra are a more complicated topic than primary keys in relational databases. The simplest primary key is composed of only one column. Let's call it a simple primary key. This type of key is also responsible for choosing the node where a given row will be stored, which is why it's commonly called the partition key. One of Cassandra's goals is to distribute data evenly across nodes. That's the reason why choosing a good partition key is important.
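How a partition key can decide the target node is easier to see with a toy token ring. The sketch below is an assumption-heavy simplification: ToyRing, the node names and the 32-bit token space are hypothetical, and SHA-256 is used only as a stand-in hash (Cassandra's default partitioner relies on Murmur3).

```python
import bisect
import hashlib
from collections import Counter

TOKEN_SPACE = 2 ** 32  # hypothetical, small token space for readability


def token_for(partition_key):
    """Hash a partition key to a token (SHA-256 is only a stand-in hash)."""
    digest = hashlib.sha256(partition_key.encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big")  # 4 bytes give a value in [0, 2**32)


class ToyRing:
    """Every node owns the token range ending at its own token."""

    def __init__(self, node_names):
        step = TOKEN_SPACE // len(node_names)
        self.tokens = [(i + 1) * step - 1 for i in range(len(node_names))]
        self.nodes = list(node_names)

    def node_for(self, partition_key):
        index = bisect.bisect_left(self.tokens, token_for(partition_key))
        return self.nodes[index % len(self.nodes)]  # wrap around the ring


ring = ToyRing(["node-a", "node-b", "node-c", "node-d"])
counts = Counter(ring.node_for(f"user-{i}") for i in range(10_000))
print(counts)
```

Keys with many distinct values spread roughly evenly over the 4 nodes. A partition key with only a few distinct values would send most rows to a few nodes, which is why its choice matters.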

Another type of primary key is the compound primary key (also called composite primary key). As in the relational world, this kind of key is composed of two or more columns. The first column plays the role of the partition key while the others are called clustering keys. Their role is to sort data within a given partition.
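The split of roles between the partition key and the clustering key can be modeled with a small Python sketch. ToyTable and the sensor data are hypothetical; the model only shows that rows are grouped by partition key, kept sorted by clustering key, and overwritten when the whole primary key is the same.

```python
import bisect


class ToyTable:
    """Rows grouped by partition key and kept sorted by clustering key."""

    def __init__(self):
        self.partitions = {}  # partition key -> sorted list of (clustering key, value)

    def insert(self, partition_key, clustering_key, value):
        rows = self.partitions.setdefault(partition_key, [])
        keys = [ck for ck, _ in rows]
        position = bisect.bisect_left(keys, clustering_key)
        if position < len(rows) and rows[position][0] == clustering_key:
            rows[position] = (clustering_key, value)  # same primary key: overwrite
        else:
            rows.insert(position, (clustering_key, value))  # keep the sort order

    def read_partition(self, partition_key):
        return list(self.partitions.get(partition_key, []))


# Hypothetical sensor readings: sensor id = partition key, day = clustering key.
table = ToyTable()
table.insert("sensor-1", "2016-03-02", 21.5)
table.insert("sensor-1", "2016-03-01", 20.0)
table.insert("sensor-2", "2016-03-01", 18.0)
table.insert("sensor-1", "2016-03-02", 22.0)  # same primary key as the first insert
print(table.read_partition("sensor-1"))
```

Reading the sensor-1 partition returns its rows ordered by day, regardless of the insertion order.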

The partition key part of a primary key can also be composed of two or more columns. In this case, it must be declared inside its own parentheses, for example: PRIMARY KEY ((col1, col2), col1_cluster_key, col2_cluster_key).
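The three shapes of primary key can be compared side by side with a small helper that builds the PRIMARY KEY clause as a string. It's only a sketch: primary_key_clause is a hypothetical helper and it assumes the CQL form in which a multi-column partition key sits in its own inner parentheses.

```python
def primary_key_clause(partition_columns, clustering_columns=()):
    """Build a CQL PRIMARY KEY clause from partition and clustering columns."""
    if not partition_columns:
        raise ValueError("at least one partition key column is required")
    if len(partition_columns) == 1:
        partition = partition_columns[0]
    else:
        # A multi-column partition key gets its own pair of parentheses.
        partition = "(" + ", ".join(partition_columns) + ")"
    return "PRIMARY KEY (" + ", ".join([partition, *clustering_columns]) + ")"


# Hypothetical column names, used only for illustration.
simple_key = primary_key_clause(["id"])
compound_key = primary_key_clause(["id"], ["created_at"])
composite_partition_key = primary_key_clause(
    ["col1", "col2"], ["col1_cluster_key", "col2_cluster_key"]
)
print(simple_key)
print(compound_key)
print(composite_partition_key)
```

Without the inner parentheses, only the first column would be the partition key and all the following ones would be treated as clustering keys.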

Data types

Apache Cassandra supports almost the same data types as relational databases. But some of them don't exist in the relational world or serve different purposes. Consequently, we can list some differences:

blobs - in theory, the maximum supported size is 2GB, but in practice they should store data smaller than 1MB.
collections - can store items of one specific type, such as text or number, but can't directly store nested collections (in recent versions a nested collection must be declared frozen). Collections shouldn't be used to store large amounts of data; a reasonable size is 64KB.
counter - as the name indicates, it's a column holding a counter whose value can't be set explicitly. Instead, it supports increment and decrement operations. A column of this type can't be part of the primary key.
UUID and timeuuid - standardized unique identifiers; timeuuid is the time-based variant.
tuple - a field able to group up to 32768 other fields. It's declared by specifying the types of the stored fields.
User-Defined Type (UDT) - can be used to attach multiple data fields to a single column. Imagine that you have contact information about a user, such as e-mail and phone number. You can group them into a UDT called 'contact' and store it as such in the table.
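The types from this list could be declared as in the sketch below. To stay in one language, the CQL statements are kept as Python strings and are not executed; the table names users and page_views, their columns and the 'contact' fields are hypothetical. The frozen keyword is used for the UDT and the tuple on the assumption that it's accepted across versions.

```python
# Hypothetical schema illustrating a UDT, a collection, a tuple and a counter.
CREATE_CONTACT_TYPE = "CREATE TYPE contact (email text, phone text);"

CREATE_USERS_TABLE = (
    "CREATE TABLE users ("
    "id uuid PRIMARY KEY, "
    "registered_at timeuuid, "
    "contact frozen<contact>, "                # UDT stored in a single column
    "tags set<text>, "                         # collection of one item type
    "location frozen<tuple<double, double>>, " # tuple declared with its field types
    "avatar blob"
    ");"
)

# In a counter table, every column outside the primary key is a counter.
CREATE_PAGE_VIEWS_TABLE = "CREATE TABLE page_views (page text PRIMARY KEY, views counter);"

# A counter is never set explicitly, only incremented or decremented.
INCREMENT_PAGE_VIEWS = "UPDATE page_views SET views = views + 1 WHERE page = 'home';"

statements = [
    CREATE_CONTACT_TYPE,
    CREATE_USERS_TABLE,
    CREATE_PAGE_VIEWS_TABLE,
    INCREMENT_PAGE_VIEWS,
]
for statement in statements:
    print(statement)
```

Note that the counter lives in its own table: counter columns aren't mixed with regular columns, which is why page views aren't stored directly in users.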

Besides the specific types listed above, Cassandra also has more common data types, which we can categorize as follows:

string - ascii, inet (IP address), text (UTF-8 encoded), varchar (another name for UTF-8 encoded text)
numeric - bigint, decimal, double, float, int, varint
date - timestamp
boolean - boolean

This article starts to go into the details of Cassandra. The first part describes data organization and the compaction operation. We saw that data is organized around keyspaces containing tables, and we learned some points about compaction and its different strategies. The second part covers primary keys and their role in Cassandra: they aren't only used to identify a row but also to dispatch it to the right node. The last part lists and details the types supported in tables.
