Tables in Apache Cassandra

Because tables in Apache Cassandra are very similar to tables in relational databases, this article skips the basics. Instead, it explores more Cassandra-specific subjects, such as table configuration, static columns and user-defined types.

The first part of the article describes some options that can be assigned to tables in Apache Cassandra. The second part lists some particularities of tables. The last part shows how to work with them through the Java API.

Table configuration in Cassandra

Often, CREATE TABLE is more than a simple list of columns to create. It also accepts configuration properties, which are defined after the columns through the WITH clause:

CREATE TABLE my_table (...columns here) 
  WITH config_entry1 = value1 AND config_entry2 = value2

As you can see, more than one option can be set at a time, each entry being joined to the next with AND. Among others, we can configure the following properties (non-exhaustive list):

caching - configures the caching policy for the rows held in the given table: keys only, rows only, or both, together with the number of rows cached per partition.
compaction - defines the compaction strategy to use. As already described in the article about compaction in Apache Cassandra, Cassandra supports the size-tiered (the default one), date-tiered and leveled compaction strategies.
compression - specifies the compression to use for the table. The choice depends on the table's requirements: should space savings be preferred over read performance, or the other way around?
default_time_to_live - specifies the default time to live, in seconds, applied to data written to the table. It concerns the lifetime of the rows, not of the table itself.
gc_grace_seconds - as described in the article about delete in Apache Cassandra, this value specifies how long, in seconds, tombstones are kept before the data is finally removed.
memtable_flush_period_in_ms - defines the period, in milliseconds, after which the table's memtable is flushed to disk.
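
To make the WITH clause more concrete, here is a minimal sketch that assembles a CREATE TABLE statement with several of the options listed above. It uses plain Java string handling rather than the driver. The visits table, the TableOptionsSketch class and all option values are assumptions made for illustration, not recommendations.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.StringJoiner;

public class TableOptionsSketch {

    // Renders " WITH a = b AND c = d", or an empty string when no option is given.
    public static String withClause(Map<String, String> options) {
        StringJoiner clause = new StringJoiner(" AND ", " WITH ", "");
        clause.setEmptyValue("");
        options.forEach((name, value) -> clause.add(name + " = " + value));
        return clause.toString();
    }

    // Hypothetical table; the option values below are only illustrative.
    public static String createVisitsTable() {
        Map<String, String> options = new LinkedHashMap<>();
        options.put("caching", "{'keys': 'ALL', 'rows_per_partition': '10'}");
        options.put("compaction", "{'class': 'LeveledCompactionStrategy'}");
        options.put("default_time_to_live", "86400"); // seconds
        options.put("gc_grace_seconds", "3600");      // seconds
        options.put("memtable_flush_period_in_ms", "60000");
        return "CREATE TABLE visits (user_id text, visit_time timestamp, page text, "
            + "PRIMARY KEY (user_id, visit_time))" + withClause(options);
    }

    public static void main(String[] args) {
        System.out.println(createVisitsTable());
    }
}
```

The compression option is left out on purpose, because the keys of its map differ between Cassandra versions.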

Table particularities

Some table features are very specific to Cassandra. The first one is the static column. By making a column static, we share its value between all rows having the same partition key. Only columns that are not part of the PRIMARY KEY can be static. An additional constraint is that a table without clustering columns can't have static columns: in that case each partition holds only 1 row, so every column already behaves as if it were static. To declare a static column, append the "static" keyword after the column type.
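
The sketch below shows the declaration and a toy in-memory model of the resulting behavior. The schema is an assumed, reduced version of a team/player table. The StaticColumnSketch class is hypothetical and only mimics the semantics; it says nothing about how Cassandra stores the data.

```java
import java.util.LinkedHashSet;
import java.util.Set;

public class StaticColumnSketch {

    // Assumed schema: "static" follows the column type, and the table
    // has a clustering column (player), as static columns require.
    public static final String CREATE_TABLE =
        "CREATE TABLE static_team_player ("
        + "teamName text, player text, division int static, "
        + "PRIMARY KEY (teamName, player))";

    // Toy model of ONE partition: a single slot for the static value,
    // shared by every row (clustering key) of the partition.
    private Integer division;
    private final Set<String> players = new LinkedHashSet<>();

    public void insert(String player, int newDivision) {
        players.add(player);
        this.division = newDivision; // overwrites the value seen by all rows
    }

    public int divisionOf(String player) {
        if (!players.contains(player)) {
            throw new IllegalArgumentException("Unknown player: " + player);
        }
        return division;
    }

    public static void main(String[] args) {
        StaticColumnSketch partition = new StaticColumnSketch();
        partition.insert("Player_1", 2);
        partition.insert("Player_2", 1);
        // Player_1 now reports the division written with Player_2
        System.out.println(partition.divisionOf("Player_1"));
    }
}
```

In this model, writing a second row with another division changes the value read for the first row too.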

Functions are another Cassandra-specific tool. The first one concerns the time when a value was written: WRITETIME(column_name) returns the timestamp, expressed in microseconds, at which the given value was written. When we want to page through results, Cassandra provides the special function TOKEN(value). It makes paging possible when unordered partitioners (RandomPartitioner and Murmur3Partitioner) are used. In this case, the comparison is made not on the stored values but on the tokens generated by this function.
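
As a hedged illustration of both functions, the sketch below keeps the queries as plain strings and adds a small helper that converts a WRITETIME value into a java.time.Instant. The teams table (name text PRIMARY KEY, country text) and the CqlFunctionsSketch class are assumptions made for this example. A table without clustering columns is used on purpose, so that paging by token can't cut a partition in the middle.

```java
import java.time.Instant;

public class CqlFunctionsSketch {

    // WRITETIME is expressed in microseconds since the epoch.
    public static final String WRITE_TIME_QUERY =
        "SELECT country, WRITETIME(country) AS countryWriteTime FROM teams WHERE name = ?";

    // First page: no lower bound, rows come back in token order.
    public static final String FIRST_PAGE_QUERY =
        "SELECT name, country FROM teams LIMIT ?";

    // Following pages: restart after the token of the last partition key already read.
    public static final String NEXT_PAGE_QUERY =
        "SELECT name, country FROM teams WHERE TOKEN(name) > TOKEN(?) LIMIT ?";

    // Converts a WRITETIME value (microseconds) to an Instant without losing precision.
    public static Instant writeTimeToInstant(long microsSinceEpoch) {
        long seconds = Math.floorDiv(microsSinceEpoch, 1_000_000L);
        long nanos = Math.floorMod(microsSinceEpoch, 1_000_000L) * 1_000L;
        return Instant.ofEpochSecond(seconds, nanos);
    }

    public static void main(String[] args) {
        System.out.println(writeTimeToInstant(1_500_000_000_123_456L));
    }
}
```

The second bind marker of NEXT_PAGE_QUERY is the page size. The first one is the partition key of the last row returned by the previous page.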

The 3rd special Cassandra feature is user-defined types. They make it possible to build customized data types that can later be used in the definitions of table columns. Such a data type is defined with the CREATE TYPE query.
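
A possible shape of such a definition is sketched below as the two CQL statements it would take, kept as Java strings. The address type, the customer_address table and the UserDefinedTypeSketch class are hypothetical names chosen for this example. The column is declared with frozen<...> because that form is accepted wherever user-defined types are supported.

```java
import java.util.Arrays;
import java.util.List;

public class UserDefinedTypeSketch {

    // The type must exist before a table can reference it, hence the order.
    public static List<String> schemaStatements() {
        return Arrays.asList(
            "CREATE TYPE address (street text, city text, zip_code text)",
            "CREATE TABLE customer_address (id text PRIMARY KEY, home frozen<address>)"
        );
    }

    public static void main(String[] args) {
        schemaStatements().forEach(System.out::println);
    }
}
```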

Example of table creation in Cassandra Java API

The test cases are quite short this time. They show table creation with options, the particularity of static columns and the WRITETIME function:

@Test
public void should_correctly_create_table_from_java_api() {
  Statement createStatement = SchemaBuilder.createTable("customer")
    .addPartitionKey("id", DataType.text())
    .addClusteringColumn("login", DataType.text())
    .addClusteringColumn("age", DataType.smallint())
    .withOptions().clusteringOrder("login", SchemaBuilder.Direction.DESC)
    .comment("test comment").gcGraceSeconds(2);
  SESSION.execute(createStatement);

  KeyspaceMetadata ks = CLUSTER.getMetadata().getKeyspace("tableTest");
  TableMetadata table = ks.getTable("customer");

  assertThat(table).isNotNull();
  assertThat(table.getPartitionKey()).hasSize(1);
  assertThat(table.getPartitionKey().get(0).getName()).isEqualTo("id");
  assertThat(table.getClusteringColumns()).hasSize(2);
  assertThat(table.getClusteringColumns()).extracting("name").containsOnly("login", "age");
  assertThat(table.getOptions().getComment()).isEqualTo("test comment");
  assertThat(table.getOptions().getGcGraceInSeconds()).isEqualTo(2);
}

@Test
public void should_correctly_work_with_static_column() {
  // To check that static columns really work, first we create a row
  // representing 1 player of a team.
  // Then we INSERT a new player under the same partition key but
  // with a different value for the static column (division).
  // When we read the data after the change, the static column should
  // have the new value for both rows (only 1 was explicitly written).

  // However, it doesn't work with UPDATE in this flow:
  // 1/ Create 2 players for the same partition key
  // 2/ Update the static column for one of them
  // The expected error in this case is:
  // Invalid restrictions on clustering columns since the UPDATE statement modifies only static columns
  String team = "RC Lens";
  int foundationYear = 1906;
  String country = "FR";
  int division = 2;
  SESSION.execute("INSERT INTO static_team_player (teamName, player, foundationYear, country, division) " +
    " VALUES (?, ?, ?, ?, ?)", team, "Player_1", foundationYear, country, division);


  ResultSet result = SESSION.execute("SELECT * FROM static_team_player");
  List<Row> rows = result.all();
  assertThat(rows).hasSize(1);
  assertThat(rows.get(0).getString("teamName")).isEqualTo(team);
  assertThat(rows.get(0).getString("player")).isEqualTo("Player_1");
  assertThat(rows.get(0).getInt("foundationYear")).isEqualTo(foundationYear);
  assertThat(rows.get(0).getInt("division")).isEqualTo(2);

  // Now, add row with Player_2
  SESSION.execute("INSERT INTO static_team_player (teamName, player, foundationYear, country, division) " +
    " VALUES (?, ?, ?, ?, ?)", team, "Player_2", foundationYear, country, 1);

  result = SESSION.execute("SELECT * FROM static_team_player");
  rows = result.all();
  assertThat(rows).hasSize(2);
  assertThat(rows.stream().map(r -> r.getString("teamName"))).containsOnly(team);
  assertThat(rows.stream().map(r -> r.getString("player"))).containsOnly("Player_1", "Player_2");
  assertThat(rows.stream().map(r -> r.getInt("foundationYear"))).containsOnly(foundationYear);
  assertThat(rows.stream().map(r -> r.getInt("division"))).containsOnly(1);
}

@Test
public void should_correctly_get_write_time_of_a_column() {
  long insertTime = System.currentTimeMillis();
  SESSION.execute("INSERT INTO static_team_player (teamName, player, foundationYear, country, division, bornYear) " +
    " VALUES ('Team1', 'Player1', 1999, 'BE', 1, 1980)");

  ResultSet writetimeResult = SESSION.execute("SELECT bornYear, WRITETIME(country) AS countryCreationTime FROM static_team_player " +
    " WHERE teamName = 'Team1' AND player = 'Player1'");

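  // WRITETIME is expressed in microseconds and the write happens after
  // insertTime was captured, so a strict equality check would be fragile
  assertThat(writetimeResult.one().getLong("countryCreationTime")).isGreaterThanOrEqualTo(insertTime*1000);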
}

The article describes several features of Cassandra tables. The first part presents options available during table creation. They concern things like compaction, compression or how long tombstones are kept. The second part covers more specific features: static columns, functions such as WRITETIME and TOKEN, and user-defined types. The last part illustrates some of the described cases through JUnit tests.