Collections in Apache Cassandra

Collections are among the more interesting data types available in Apache Cassandra. In our data model we can freely use maps, sets, or lists.

This article focuses on these data types. The first part lists and explains the available collection types. The second part explains the idea of frozen collections. The last part shows collections at work through some simple test cases.

Collection types

The first collection type is a list. When we insert data of this type, we write it between '[' and ']'. The second type, set, is written between '{' and '}'. The last type, map, also uses '{' and '}', but each entry is a key: value pair, for example {'1998/1999': 'SC Bastia'}.
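To make the three literal forms concrete, here is a minimal sketch that renders Java collections of text values as CQL literals. The class and method names (CqlLiteralSketch, listLiteral, setLiteral, mapLiteral) are hypothetical and only illustrate the syntax; in real code, bound parameters of the driver are usually a better choice than building query strings by hand.

```java
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;
import java.util.TreeSet;
import java.util.stream.Collectors;

// Hypothetical helper, not part of any driver: it only illustrates literal syntax.
class CqlLiteralSketch {

  // Wraps a text value in single quotes and doubles embedded single quotes.
  static String quote(String value) {
    return "'" + value.replace("'", "''") + "'";
  }

  // list<text> literal, for example ['SC Bastia', 'Cagliari']
  static String listLiteral(List<String> values) {
    return values.stream()
      .map(CqlLiteralSketch::quote)
      .collect(Collectors.joining(", ", "[", "]"));
  }

  // set<text> literal, for example {'Cagliari', 'Modena'}
  // A TreeSet is used only to make the generated text deterministic.
  static String setLiteral(Set<String> values) {
    return new TreeSet<>(values).stream()
      .map(CqlLiteralSketch::quote)
      .collect(Collectors.joining(", ", "{", "}"));
  }

  // map<text, text> literal, for example {'1998/1999': 'SC Bastia'}
  static String mapLiteral(Map<String, String> entries) {
    return new TreeMap<>(entries).entrySet().stream()
      .map(e -> quote(e.getKey()) + ": " + quote(e.getValue()))
      .collect(Collectors.joining(", ", "{", "}"));
  }
}
```

The only syntactic difference between a set and a map literal is the key: value pair inside the braces.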

So, what are the differences between the types? Intuitively, a set doesn't accept duplicates. In addition, it is a sorted set: stored values are returned in sorted order. A list, on the other hand, allows duplicates and returns data in insertion order, i.e. items inserted first are returned before items added after them. As for maps, entries are returned sorted by key.
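These ordering rules can be pictured with plain Java collections. The sketch below is only an analogy and not how Cassandra stores data: an ArrayList stands in for a list, a TreeSet for a set and a TreeMap for a map. The class name CollectionOrderingSketch is hypothetical, and the comparison is Java's String ordering, which may differ from Cassandra's ordering for non-ASCII text.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;
import java.util.TreeSet;

// Analogy only: Java collections that behave like the three CQL collection types.
class CollectionOrderingSketch {

  // Values in the order they are written, with "SC Bastia" appearing twice.
  static final List<String> INSERTED = Arrays.asList(
    "SC Bastia", "Cagliari", "Modena", "AS Monaco", "Olympiakos", "SC Bastia");

  // List: keeps duplicates and insertion order.
  static List<String> asList() {
    return new ArrayList<>(INSERTED);
  }

  // Set: drops the duplicate and iterates in sorted order.
  static Set<String> asSet() {
    return new TreeSet<>(INSERTED);
  }

  // Map: iterates by sorted key, whatever the order of the put() calls.
  static Map<String, String> asMap() {
    Map<String, String> teams = new TreeMap<>();
    teams.put("1998/1999", "SC Bastia");
    teams.put("1999/2004", "Cagliari");
    teams.put("1997/1998", "SC Bastia reserve");
    return teams;
  }

  public static void main(String[] args) {
    System.out.println(asList());
    System.out.println(asSet());
    System.out.println(asMap());
  }
}
```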

By default, collections are not frozen: we can add or remove the items stored inside with the '+' and '-' operators of an UPDATE query. We can also delete specific items with a DELETE query, by index for lists and by key for maps. This second form does not apply to sets: a DELETE query can only remove a set entirely, so single elements must be removed with the '-' operator.
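The effect of these operations on a list can be emulated in memory. The sketch below is an illustration under stated assumptions, not Cassandra code: the class ListOperatorSketch and its methods are hypothetical, minus() mirrors "teams = teams - [...]" by removing every occurrence of the given values, plus() mirrors "teams = teams + [...]" by appending, and withoutIndexes() mirrors "DELETE teams[i] ..." on a list.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// In-memory emulation of list updates; the names are illustrative only.
class ListOperatorSketch {

  // Emulates "teams = teams - [...]": all occurrences of the values are removed.
  static List<String> minus(List<String> current, List<String> toRemove) {
    List<String> result = new ArrayList<>(current);
    result.removeAll(toRemove);
    return result;
  }

  // Emulates "teams = teams + [...]": new values are appended at the end.
  static List<String> plus(List<String> current, List<String> toAdd) {
    List<String> result = new ArrayList<>(current);
    result.addAll(toAdd);
    return result;
  }

  // Emulates "DELETE teams[0], teams[1] ...": positions refer to the original list.
  static List<String> withoutIndexes(List<String> current, Integer... indexes) {
    Set<Integer> skipped = new HashSet<>(Arrays.asList(indexes));
    List<String> result = new ArrayList<>();
    for (int i = 0; i < current.size(); i++) {
      if (!skipped.contains(i)) {
        result.add(current.get(i));
      }
    }
    return result;
  }
}
```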

However, collections have some limitations. First, they can't exceed 64K or 2B (depending on the protocol version used). If this limit is exceeded, the extra data is lost and only the accepted part (64K or 2B) is stored. Second, for performance reasons collections should hold relatively small amounts of data. Otherwise queries can get slower, because Cassandra reads an entire collection at once.

Frozen collections

But collections are not as simple as they appear. The first problems come when user-defined types or other collections are nested inside a collection, such as collection<team_history>. If we try to create a table containing such a column, the following error is returned:

com.datastax.driver.core.exceptions.InvalidQueryException: Non-frozen collections are not allowed inside collections: list

To solve this issue, we must use frozen collections. What is the difference from normal collections? Items of a frozen collection can't be manipulated individually. To update one, we must first read all the collection's items, change them on the client side, and finally overwrite the current collection with the modified one. This is because Cassandra flattens a frozen collection into a single-value field and treats it like a blob. An attempt to update a single item throws an exception similar to this one:
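The read, modify and overwrite cycle described above can be sketched in plain Java. This is an in-memory analogy under stated assumptions, not driver code: FrozenListSketch is a hypothetical class in which every stored list is an unmodifiable value, so the only way to change it is to replace it as a whole.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical in-memory stand-in for a table with a frozen list column.
class FrozenListSketch {

  private final Map<String, List<String>> rows = new HashMap<>();

  // Writing always stores the whole list as one immutable value.
  void write(String name, List<String> teams) {
    rows.put(name, Collections.unmodifiableList(new ArrayList<>(teams)));
  }

  // Reading returns the stored value; modifying it in place is rejected.
  List<String> read(String name) {
    return rows.get(name);
  }

  // Appending means: read everything, change a copy, overwrite the old value.
  void append(String name, String team) {
    List<String> copy = new ArrayList<>(read(name));
    copy.add(team);
    write(name, copy);
  }
}
```

Calling read(...).add(...) on this sketch throws UnsupportedOperationException, which plays the role of the InvalidQueryException raised by Cassandra for a partial update of a frozen column.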

com.datastax.driver.core.exceptions.InvalidQueryException: Invalid operation (teams = teams + [(1900, 'Not played yet')]) for frozen collection column teams

Frozen collections are declared with the frozen keyword wrapped around the collection type, i.e. frozen<...>. To solve the issue presented at the beginning of this part, the following query could help:

CREATE TABLE players_multi_team (
  currentTeam text,
  fromYear int,
  name text,
  teams frozen>,
  PRIMARY KEY (currentTeam, name)
)

Example of collections in Cassandra

The test cases represent one football player and the teams he played for during his career. These teams are stored in different collections: list, set, map and frozen list. The type used is indicated in the table name. The tests look like this:

@After
public void cleanTables() {
  SESSION.execute("TRUNCATE TABLE collectionsTest.players_teams_list");
  SESSION.execute("TRUNCATE TABLE collectionsTest.players_teams_set");
  SESSION.execute("TRUNCATE TABLE collectionsTest.players_teams_map");
  SESSION.execute("TRUNCATE TABLE collectionsTest.players_teams_frozen_list");
}

@Test
public void should_read_teams_in_the_order_of_data_adding() {
  insertList();

  ResultSet resultSet = SESSION.execute("SELECT * FROM collectionsTest.players_teams_list");

  List<String> teams = resultSet.one().getList("teams", String.class);
  assertThat(teams).hasSize(6);
  assertThat(teams).containsExactly("SC Bastia", "Cagliari", "Modena", "AS Monaco", "Olympiakos", "SC Bastia");
}

@Test
public void should_read_teams_in_sorted_order_from_set_column() {
  insertSet();

  ResultSet resultSet = SESSION.execute("SELECT * FROM collectionsTest.players_teams_set");

  Set<String> teams = resultSet.one().getSet("teams", String.class);
  // A set doesn't accept duplicated data, so we have only 5 teams
  // In addition, they are sorted in ascending order
  assertThat(teams).hasSize(5);
  assertThat(teams).containsExactly("AS Monaco", "Cagliari",  "Modena", "Olympiakos", "SC Bastia");
}

@Test
public void should_read_map_in_order_of_data_definition() throws InterruptedException {
  insertMap();
  for (int i = 0; i < 10; i++) {
    ResultSet resultSet = SESSION.execute("SELECT * FROM collectionsTest.players_teams_map");

    Map<String, String> teams = resultSet.one().getMap("teams", String.class, String.class);
    assertThat(teams).hasSize(7);
    assertThat(teams.values()).containsExactly("SC Bastia reserve", "SC Bastia", "Cagliari", "Modena", "AS Monaco",
      "Olympiakos", "SC Bastia");
    assertThat(teams.keySet()).containsExactly("1997/1998", "1998/1999", "1999/2004", "2004/2004", "2004/2010",
      "2010/2013", "2013/?");
    Thread.sleep(1000);
  }
}

@Test
public void should_correctly_change_list_elements() {
  insertList();
  // Remove all Italian clubs and add 2 unknown clubs instead
  SESSION.execute("UPDATE collectionsTest.players_teams_list SET " +
    " teams  =  teams - ['Cagliari', 'Modena']  WHERE name = 'François Modesto'");
  SESSION.execute("UPDATE collectionsTest.players_teams_list SET " +
    " teams  = teams + ['Unknown_1', 'Unknown_2'] WHERE name = 'François Modesto'");

  ResultSet resultSet = SESSION.execute("SELECT * FROM collectionsTest.players_teams_list");

  List<String> teams = resultSet.one().getList("teams", String.class);
  assertThat(teams).hasSize(6);
  assertThat(teams).containsExactly("SC Bastia", "AS Monaco", "Olympiakos", "SC Bastia", "Unknown_1", "Unknown_2");
}

@Test
public void should_correctly_change_set_elements() {
  insertSet();
  // Remove all Italian clubs and add 2 unknown clubs instead
  SESSION.execute("UPDATE collectionsTest.players_teams_set SET " +
    " teams  =  teams - {'Cagliari', 'Modena'}  WHERE name = 'François Modesto'");
  SESSION.execute("UPDATE collectionsTest.players_teams_set SET " +
    " teams  = teams + {'Unknown_1', 'Unknown_2'} WHERE name = 'François Modesto'");

  ResultSet resultSet = SESSION.execute("SELECT * FROM collectionsTest.players_teams_set");

  Set<String> teams = resultSet.one().getSet("teams", String.class);
  assertThat(teams).hasSize(5);
  assertThat(teams).containsExactly("AS Monaco", "Olympiakos", "SC Bastia", "Unknown_1", "Unknown_2");
}

@Test
public void should_correctly_change_map_elements() throws InterruptedException {
  insertMap();
  // replace Italian teams by unknown ones
  SESSION.execute("UPDATE collectionsTest.players_teams_map SET " +
    " teams['1999/2004'] = 'Unknown_1'," +
    " teams['2004/2004'] = 'Unknown_2'  WHERE name = 'François Modesto'");


  for (int i = 0; i < 10; i++) {
    ResultSet resultSet = SESSION.execute("SELECT * FROM collectionsTest.players_teams_map");

    Map<String, String> teams = resultSet.one().getMap("teams", String.class, String.class);
    assertThat(teams).hasSize(7);
    assertThat(teams.values()).containsExactly("SC Bastia reserve", "SC Bastia", "Unknown_1", "Unknown_2", "AS Monaco",
      "Olympiakos", "SC Bastia");
    assertThat(teams.keySet()).containsExactly("1997/1998", "1998/1999", "1999/2004", "2004/2004", "2004/2010",
      "2010/2013", "2013/?");
    Thread.sleep(1000);
  }
}

@Test
public void should_correctly_delete_3_first_items_in_list() {
  insertList();

  SESSION.execute("DELETE teams[0], teams[1], teams[2]  FROM collectionsTest.players_teams_list WHERE name = 'François Modesto'");

  ResultSet resultSet = SESSION.execute("SELECT * FROM collectionsTest.players_teams_list");

  List<String> teams = resultSet.one().getList("teams", String.class);
  assertThat(teams).hasSize(3);
  assertThat(teams).containsExactly("AS Monaco", "Olympiakos", "SC Bastia");
}

@Test(expected = SyntaxError.class)
public void should_correctly_delete_3_first_items_in_set() {
  insertSet();

  // In sets, items can't be removed by index
  // Instead we should use the '-' operator, already shown
  // in previous tests
  // Another intuitive option, for example "DELETE teams{'Modena'} FROM...", won't work either.
  SESSION.execute("DELETE teams[0], teams[1], teams[2]  FROM collectionsTest.players_teams_set WHERE name = 'François Modesto'");
}

@Test
public void should_correctly_delete_2_Italian_periods_of_player_from_map() throws InterruptedException {
  insertMap();

  // Unlike a set, a map allows entries to be removed directly
  // by key name
  SESSION.execute("DELETE teams['1999/2004'], teams['2004/2004'] FROM collectionsTest.players_teams_map " +
    "WHERE name = 'François Modesto'");

  for (int i = 0; i < 10; i++) {
    ResultSet resultSet = SESSION.execute("SELECT * FROM collectionsTest.players_teams_map");

    Map<String, String> teams = resultSet.one().getMap("teams", String.class, String.class);
    assertThat(teams).hasSize(5);
    assertThat(teams.values()).containsExactly("SC Bastia reserve", "SC Bastia","AS Monaco",
      "Olympiakos", "SC Bastia");
    assertThat(teams.keySet()).containsExactly("1997/1998", "1998/1999", "2004/2010",
      "2010/2013", "2013/?");
    Thread.sleep(1000);
  }
}

@Test(expected = InvalidQueryException.class)
public void should_not_allow_to_update_single_value_on_frozen_list() {
  SESSION.execute("INSERT INTO collectionsTest.players_teams_frozen_list (name, teams) " +
    " VALUES ('François Modesto', [(1998, 'SC Bastia'), (1999, 'Cagliari'), (2004, 'Modena'), " +
    "(2010, 'AS Monaco'), (2013, 'Olympiakos')]) ");

  SESSION.execute("UPDATE collectionsTest.players_teams_frozen_list SET " +
    " teams  = teams + [(1900, 'Not played yet')] WHERE name = 'François Modesto'");
}

private void insertList() {
  SESSION.execute("INSERT INTO collectionsTest.players_teams_list (name, teams) " +
    " VALUES ('François Modesto', ['SC Bastia', 'Cagliari', 'Modena', 'AS Monaco', 'Olympiakos', 'SC Bastia']) ");
}

private void insertSet() {
  SESSION.execute("INSERT INTO collectionsTest.players_teams_set (name, teams) " +
    " VALUES ('François Modesto', {'SC Bastia', 'Cagliari', 'Modena', 'AS Monaco', 'Olympiakos', 'SC Bastia'}) ");
}

private void insertMap() {
  SESSION.execute("INSERT INTO collectionsTest.players_teams_map (name, teams) " +
    " VALUES ('François Modesto', {'1998/1999': 'SC Bastia', '1999/2004': 'Cagliari', '2004/2004': 'Modena', " +
    "'2004/2010': 'AS Monaco', '2010/2013': 'Olympiakos', '2013/?': 'SC Bastia', '1997/1998': 'SC Bastia reserve'}) ");
}
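The tests above assume that the keyspace and the four tables already exist, but their definitions are not shown. The sketch below proposes one possible schema, inferred only from the queries in the tests. It is an untested assumption, not the schema actually used: the class name CollectionsTestSchemaSketch, the replication settings, the choice of name as the sole primary key and the tuple<int, text> element type of the frozen list are all guesses.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical schema inferred from the test queries; adapt before use.
class CollectionsTestSchemaSketch {

  // DDL statements in the order they should be executed.
  static List<String> ddl() {
    return Arrays.asList(
      // Assumption: single-node test cluster, hence SimpleStrategy with factor 1.
      "CREATE KEYSPACE IF NOT EXISTS collectionsTest WITH replication = "
        + "{'class': 'SimpleStrategy', 'replication_factor': 1}",
      "CREATE TABLE IF NOT EXISTS collectionsTest.players_teams_list "
        + "(name text PRIMARY KEY, teams list<text>)",
      "CREATE TABLE IF NOT EXISTS collectionsTest.players_teams_set "
        + "(name text PRIMARY KEY, teams set<text>)",
      "CREATE TABLE IF NOT EXISTS collectionsTest.players_teams_map "
        + "(name text PRIMARY KEY, teams map<text, text>)",
      // Assumption: the (year, team) pairs inserted by the test are tuples.
      "CREATE TABLE IF NOT EXISTS collectionsTest.players_teams_frozen_list "
        + "(name text PRIMARY KEY, teams frozen<list<tuple<int, text>>>)"
    );
  }
}
```

In a test class, these statements could be run once before the tests, for example by passing each of them to SESSION.execute in a method annotated with @BeforeClass.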

This article explains the different collection types implemented in Apache Cassandra. The first part describes the available types and the differences between them. The second part focuses on a special, flattened kind of collection: the frozen one. The last part tests some of the collection features listed earlier.

