Keeping old data eternally takes place and makes reads longer. Apache Cassandra is not an exception and has a mechanism to remove data.
Data Engineering Design Patterns
Looking for a book that defines and solves most common data engineering problems? I'm currently writing
one on that topic and the first chapters are already available in π
Early Release on the O'Reilly platform
I also help solve your data engineering problems π contact@waitingforcode.com π©
In this article we explore delete part of Apache Cassandra. The first part describes the general idea of it by presenting CQL queries adapted to each case. The second part explains the concept of tombstones. The third part shows how to use different delete cases in Cassandra Java API.
Delete in Cassandra
First of all, let's try to explain what and how can be removed from Cassandra tables. The first thing is a row. As in the relational database, we can remove a whole row by specifying its primary key. Delete by other column than primary key is not allowed. The syntax for this case is very similar to the SQL syntax:
DELETE FROM my_table WHERE id = 3
We can also remove only concrete columns of a row. But these columns can't be one from defined primary keys. Column delete is also more strict than the row one. If our table has partitioning as well as clustering keys, we must to specify all of them. Otherwise, an exception telling that "Range deletions are not supported for specific columns" is thrown. The query responsible for remove columns from one row looks like:
DELETE column1, column2 FROM my_table WHERE id = 3
There are also special case of delete, based on rows lifetime. Each time when new row is added, Cassandra appends the creation time in milliseconds to it. By default, this time is defined by the server and it's the time of received query.But it can also be defined by the client thanks to clause USING TIMESTAMP. This clause can be used when adding or updating new data, but also when trying to delete one. For the last case, the syntax looks like:
DELETE FROM simple_team USING TIMESTAMP 10000 WHERE id = 3
The query deletes all rows older than specified timestamp.
Note also that we can remove collections (maps or lists).
Tombstones
Described queries won't delete row or column immediately. Instead of that, delete produces a structure called tombstone. It contains information about deleted data (column name, row) and is used to keep deleted data hidden for the database. Once some period of time elapsed, rows or columns associated with existing tombstones are definitively deleted from the database. It occurs during compaction phase. Nodes holding replicas behave in the same manner (they rely on tombstones too).
However, it's one touchy situation in which delete can't occur, even if deleted item was marked in coordinator node. Generally, tombstones expire after the time configured in gc_grace_seconds property. So if one replica goes down and wakes up after the expiration time of tombstone, it won't be able to apply the change. In this case, it's advised in Cassandra documentation, to "remove the node, wipe it, and bootstrap it again".
Delete with Cassandra Java API
After the introduction, let's take a look at delete in action:
@Test(expected = SyntaxError.class) public void should_not_delete_all_data_at_once() { insertTeam("X", 1990, "FR", 1); ResultSet resultSet = SESSION.execute("SELECT * FROM simple_team"); assertThat(resultSet.all()).isNotEmpty(); // Deleting all rows is not possible without WHERE clause SESSION.execute("DELETE FROM simple_team"); } @Test(expected = InvalidQueryException.class) public void should_not_allow_the_delete_on_where_clause_which_is_not_a_key() { insertTeam("X", 1990, "FR", 1); insertTeam("Y", 1990, "DE", 1); insertTeam("Z", 1990, "PL", 1); ResultSet resultSet = SESSION.execute("SELECT * FROM simple_team"); assertThat(resultSet.all()).isNotEmpty(); // Deleting all rows is not possible without WHERE clause applied on // partition key SESSION.execute("DELETE FROM simple_team WHERE division = 1"); } @Test public void should_correctly_remove_row_by_partition_key() { insertTeam("X", 1990, "FR", 1); insertTeam("Y", 1990, "DE", 1); insertTeam("Z", 1990, "PL", 1); ResultSet resultSet = SESSION.execute("SELECT * FROM simple_team"); assertThat(resultSet.all()).isNotEmpty(); // Deleting all rows is not possible without WHERE clause applied on // partition key SESSION.execute("DELETE FROM simple_team WHERE teamName IN ?", Lists.newArrayList("X", "Y", "Z")); resultSet = SESSION.execute("SELECT * FROM simple_team"); assertThat(resultSet.all()).isEmpty(); } @Test public void should_remove_only_column() { insertTeam("A", 1990, "FR", 1); ResultSet resultSet = SESSION.execute("SELECT * FROM simple_team"); assertThat(resultSet.all()).isNotEmpty(); // Once again, only columns can be removed, no keys // Without specifying all keys in WHERE clause, it's not possible to remove // a column. This exception is returned by Cassandra when all of 3 keys are not // used: // com.datastax.driver.core.exceptions.InvalidQueryException: // Range deletions are not supported for specific columns // It's not a problem when we try to delete whole row (we can use only partition key). SESSION.execute("DELETE division FROM simple_team WHERE teamName = ? AND foundationYear = ? AND country = ?", "A", 1990, "FR"); resultSet = SESSION.execute("SELECT * FROM simple_team WHERE teamName = ?", "A"); Row row = resultSet.one(); assertThat(row.get("division", Integer.class)).isNull(); assertThat(row.getString("teamName")).isEqualTo("A"); } @Test public void should_delete_obsolete_data_using_timestamp_clause() throws InterruptedException { insertTeam("A", 1990, "FR", 1); insertTeam("A", 1991, "FR", 1); insertTeam("A", 1992, "FR", 1); long timestamp = System.currentTimeMillis(); Thread.sleep(2000); insertTeam("A", 1993, "FR", 1); insertTeam("A", 1994, "FR", 1); ResultSet resultSet = SESSION.execute("SELECT * FROM simple_team"); assertThat(resultSet.all()).hasSize(5); SESSION.execute("DELETE FROM simple_team USING TIMESTAMP ? WHERE teamName = ?", timestamp, "A"); resultSet = SESSION.execute("SELECT * FROM simple_team WHERE teamName = ?", "A"); List<Row> allRows = resultSet.all(); assertThat(allRows).hasSize(2); assertThat(allRows.stream().map(r -> r.getInt("foundationYear"))).containsOnly(1993, 1994); } private void insertTeam(String name, int foundationYear, String country, int division) { Statement insert = QueryBuilder.insertInto("deleteTest", "simple_team") .value("teamName", name) .value("foundationYear", foundationYear) .value("country", country) .value("division", division) // Helps to make correct test on DELETE USING TIMESTAMP // Otherwise server-side timestamp is used and it not easy to predict // which value will be set. .setDefaultTimestamp(System.currentTimeMillis()) .enableTracing(); SESSION.execute(insert); }
The article shows how to work with Cassandra deletes. In the first part it introduces different kinds of removal, such as row, column or timestamp-based ones. The second part describes shortly how the delete is handled by Cassandra through tombstones. The last part illustrates, through test cases, how to make deletes with Java API.