Wide rows in column-oriented stores

https://gist.github.com/bartosz25/d0f9c4fb852a63e75f446a51dccb65e3

Big Data enforces denormalized storage. Joins are costly and it's often much more efficient to store all related information in a single row. Such rows with a lot of columns are called wide rows and they'll be explained in the sections below.

The post focuses on the column-oriented storage called wide rows. The first section explains this concept. The second one gives some positive and negative points about it. The last compares different wide rows (tables with 80, 40 and 20 columns) in Apache Cassandra.

Wide rows defined

Data stored in column-family stores is not homogeneous. Sometimes it's pretty obvious to write it in RDBMS-like style, one event per row. But there are the situations when it's useful to group events of given key in the same row and then, to store them in dozens or more columns. The row in such storage strategy is called a wide row.

When we could use wide rows? The first use case coming in mind is the one of 1:n relationship. A wide row can be used as its flattened version. For instance, if we consider football players and the number of goals scored in each of 38 matches, we could store it in row-oriented fashion like this:

KeyGoals
Player 1#2018/2019_13
Player 1#2018/2019_20
...
Player 1#2018/2019_371
Player 1#2018/2019_382

But we could also store the data in column-oriented way:

KeyMatch 1Match 2...Match 37Match 38
Player 1#2018/201930...12

Which version is better? It all depends on the query pattern. If we'll always retrieve all stats for given player, then hitting given row only once on disk can be better. On the other side if we'll retrieve the stats only for one match, then the first proposal is more adapted. Unfortunately, the query pattern rule is not always valid and we can learn that in the next section.

Benefits and drawbacks

As already told, one of the main benefits of wide rows is the query pattern. If we're sure to retrieve always all columns and that the number of columns is not enormous, putting them all into one common row makes sense. Thanks to this method the query hits only 1 partition, thus it reduces the I/O and may have a positive impact on the read performance. An alternative to storing particular entries in columns would be the storage in maps. However, in this case, all the times the engine will need to read the entire map and eventually at the end filter it out. Having these values in separate columns would reduce the amount of returned data. Of course, this point depends on the underlying storage.

Furthermore, often distributed column-oriented stores don't provide the traditional transactional character of all or nothing. With wide rows, we can accomplish it because the "transaction" will apply on row level and never a half of row columns will be persisted.

But wide rows also have the drawbacks. First of all, we must keep in mind that their performance will vary and for some data specificities, they can perform better than for another. Before using them we should always measure the gain or the loss. Secondly, wide rows can very quickly hit storage limitations. It can be especially harmful when the number of columns is unbounded. For instance, storing a second-basis measurement for an hourly IoT sensor activity can be a bad idea because too many events in a row may violate some of storages best practices.

Another drawback of wide rows is their impact on the cache. Let's imagine the case when only 2 from 100 big columns are frequently used by the clients. Since the cache is row-based, we'll store extra 98 columns that are accessed less frequently. By doing that we'll prevent other rows to be cached because of already taken storage. This point of not-all-used columns brings another drawback - sometimes 90% of values may be missing. In such case, we'll make client reading more difficult since it must deal with null values. This is especially true in dynamically created columns. For instance, if we decide to add a dynamic column with insertion date, on a year scale 364 or 365 other dates won't be specified. It's why mixing dynamic columns with wide rows is often not so good idea as it may appear at the beginning.

Wide rows test - writing

Writing queries to show wide rows doesn't have a lot of sense. Instead, I'll try to get an idea of how querying on wide rows performs on my standalone 2-nodes Apache Cassandra cluster. Please notice however that it's executed only locally and it can behave slightly different in the real clusters.

First, let's create an environment for Cassandra Cluster Manager tool and construct a cluster with 2 nodes:

virtualenv .venv
source .venv/bin/activate
ccm create test_wide_rows -v 3.11.3 -n 2 -s
ccm start

A small test to check if both nodes are up:

(.venv) bartosz:~/programming/cassandra_cluster_manager$ ccm node1 status

Datacenter: datacenter1
=======================
Status=Up/Down
|/ State=Normal/Leaving/Joining/Moving
--  Address    Load       Tokens       Owns (effective)  Host ID                               Rack
UN  127.0.0.1  100.56 KiB  1            100.0%            6abeda6b-ad6c-4480-91a8-7cdb75fa524a  rack1
UN  127.0.0.2  25.12 KiB  1            100.0%            e2225988-bc26-458a-b77a-bfe7957681dd  rack1


(.venv) bartosz:~/programming/cassandra_cluster_manager$ ccm node2 status

Datacenter: datacenter1
=======================
Status=Up/Down
|/ State=Normal/Leaving/Joining/Moving
--  Address    Load       Tokens       Owns (effective)  Host ID                               Rack
UN  127.0.0.1  29.99 KiB  1            66.7%             6abeda6b-ad6c-4480-91a8-7cdb75fa524a  rack1
UN  127.0.0.2  185.36 KiB  1            66.7%             e2225988-bc26-458a-b77a-bfe7957681dd  rack1
UN  127.0.0.3  25.13 KiB  1            66.7%             33a18509-a423-46e9-a232-8545248e47d3  rack1


(.venv) bartosz:~/programming/cassandra_cluster_manager$

To show the performances of both schemas, I'm using JMH and the following common setup code for both reading and writing cases:

public class CassandraSession {

  private static final String TEST_KEYSPACE = "test_keyspace"+System.currentTimeMillis();

  private final Session session;

  private final List<String> wideRowTable80Columns = new ArrayList<>();

  private final List<String> wideRowTable40Columns = new ArrayList<>();

  private final List<String> wideRowTable20Columns = new ArrayList<>();

  public CassandraSession() {
    Cluster cluster = Cluster.builder().addContactPoint("127.0.0.1").build();
    session = cluster.connect();
    for (int i = 0; i < 80; i++) {
      String columnName = "column"+i;
      wideRowTable80Columns.add(columnName);
      if (i < 40) {
        wideRowTable40Columns.add(columnName);
        if (i < 20) {
            wideRowTable20Columns.add(columnName);
        }
      }
    }
    session.execute("CREATE KEYSPACE IF NOT EXISTS "+ TEST_KEYSPACE +" WITH REPLICATION = { 'class' : 'SimpleStrategy', 'replication_factor' : 2}");
    session.execute("USE "+ TEST_KEYSPACE);
  }

  public void execute(String query) {
    session.execute(query);
  }

  public void createTables() {
    session.execute(getWideRowTableCreationQuery(wideRowTable80Columns));
    session.execute(getWideRowTableCreationQuery(wideRowTable40Columns));
    session.execute(getWideRowTableCreationQuery(wideRowTable20Columns));
    String skinnyRowTableQuery = "CREATE TABLE skinny_row_test ( col text, value text, PRIMARY KEY (col)) ";
    session.execute(skinnyRowTableQuery);
  }

  private String getWideRowTableCreationQuery(List<String> columns) {
    int columnsCount = columns.size();
    String wideRowTableQuery = "CREATE TABLE wide_rows_test_"+columnsCount+" (id INT, ";
    String wideRowTableCols = columns.stream().map(column -> column+" text ")
            .collect(Collectors.joining(","));
    wideRowTableQuery += wideRowTableCols;
    wideRowTableQuery += " , PRIMARY KEY(id) )";
    return wideRowTableQuery;
  }

  public void closeSession() {
    session.close();
    session.getCluster().close();
  }
}

The tests focus on writing and reading case. The former ones checks saving of 100 rows for each of tested tables. Moreover, the test contains a case verifying the performance of insert into a skinny table (2 column/row). The tested code is available on gist. The results show pretty clearly that the difference between the tables with 20 and 40 columns is very small for throughput, average execution time and single shot invocation time:

Benchmark                                                                                        Mode  Cnt     Score       Error   Units
WriteMicroBenchmark.write_wide_rows_to_20_cols_table                                            thrpt    3     0.017 ±     0.045  ops/ms
WriteMicroBenchmark.write_wide_rows_to_40_cols_table                                            thrpt    3     0.014 ±     0.043  ops/ms
WriteMicroBenchmark.write_wide_rows_to_20_cols_table                                             avgt    3    59.848 ±   181.634   ms/op
WriteMicroBenchmark.write_wide_rows_to_40_cols_table                                             avgt    3    74.905 ±   244.420   ms/op
WriteMicroBenchmark.write_wide_rows_to_20_cols_table                                               ss    3   163.301 ±   295.528   ms/op
WriteMicroBenchmark.write_wide_rows_to_40_cols_table                                               ss    3   187.843 ±   244.468   ms/op

The test writing data to 80 columns performs slightly worse for above measures:

Benchmark                                                                                        Mode  Cnt     Score       Error   Units
WriteMicroBenchmark.write_wide_rows_to_80_cols_table                                            thrpt    3     0.010 ±     0.020  ops/ms
WriteMicroBenchmark.write_wide_rows_to_80_cols_table                                             avgt    3   100.400 ±   255.359   ms/op
WriteMicroBenchmark.write_wide_rows_to_80_cols_table                                               ss    3   248.805 ±   889.762   ms/op

However, the difference is less visible for the case of sample execution time:

WriteMicroBenchmark.write_wide_rows_to_40_cols_table:write_wide_rows_to_40_cols_table·p1.00    sample        233.570               ms/op
WriteMicroBenchmark.write_wide_rows_to_80_cols_table:write_wide_rows_to_80_cols_table·p1.00    sample        246.415               ms/op

Regarding to skinny rows, they perform much worse than all wide rows scenarios:

Benchmark                                                                                        Mode  Cnt     Score       Error   Units
WriteMicroBenchmark.write_skinny_rows                                                           thrpt    3    ≈ 10⁻⁴              ops/ms
WriteMicroBenchmark.write_skinny_rows                                                            avgt    3  3549.173 ±  8300.291   ms/op
WriteMicroBenchmark.write_skinny_rows:write_skinny_rows·p1.00                                  sample       5729.419               ms/op
WriteMicroBenchmark.write_skinny_rows                                                              ss    3  4048.182 ± 24205.465   ms/op

From that we can easily deduce that the performance degrades with increasing number of columns. We can also notice that saving the same amount of data in columns performs much better than in rows (skinny rows table). And the difference is of 35 order of magnitude with rows of 80 columns.

Wide rows tests - reading

Reading tests verify the performances of reading all rows from each table (20-, 40-, 80-columns + skinny) and reading of only 20 columns. Once again, the code is available on gist and it's why we focus here only on the results. The results for all of 3 wide row tables are pretty consistent. As for writing, here too the performance degrades with an increased number of columns - and so for each of test cases:

Benchmark                                                                                                                     Mode  Cnt     Score       Error   Units

ReadMicroBenchmark.read_20_columns_from_all_wide_row_20_cols_table                                                           thrpt    3     0.015 ±     0.012  ops/ms
ReadMicroBenchmark.read_20_columns_from_all_wide_row_40_cols_table                                                           thrpt    3     0.014 ±     0.011  ops/ms
ReadMicroBenchmark.read_20_columns_from_all_wide_row_80_cols_table                                                           thrpt    3     0.013 ±     0.009  ops/ms
ReadMicroBenchmark.read_20_columns_from_all_wide_row_20_cols_table                                                            avgt    3    66.470 ±    56.636   ms/op
ReadMicroBenchmark.read_20_columns_from_all_wide_row_40_cols_table                                                            avgt    3    70.430 ±    69.421   ms/op
ReadMicroBenchmark.read_20_columns_from_all_wide_row_80_cols_table                                                            avgt    3    76.030 ±    76.495   ms/op
ReadMicroBenchmark.read_20_columns_from_all_wide_row_20_cols_table:read_20_columns_from_all_wide_row_20_cols_table·p1.00    sample        148.374               ms/op
ReadMicroBenchmark.read_20_columns_from_all_wide_row_40_cols_table:read_20_columns_from_all_wide_row_40_cols_table·p1.00    sample        173.015               ms/op
ReadMicroBenchmark.read_20_columns_from_all_wide_row_80_cols_table:read_20_columns_from_all_wide_row_80_cols_table·p1.00    sample        261.095               ms/op
ReadMicroBenchmark.read_20_columns_from_all_wide_row_20_cols_table                                                              ss    3   140.156 ±   493.088   ms/op
ReadMicroBenchmark.read_20_columns_from_all_wide_row_40_cols_table                                                              ss    3   127.799 ±   347.604   ms/op
ReadMicroBenchmark.read_20_columns_from_all_wide_row_80_cols_table                                                              ss    3   160.245 ±   566.044   ms/op

ReadMicroBenchmark.read_all_wide_rows_20_cols                                                                                thrpt    3     0.017 ±     0.015  ops/ms
ReadMicroBenchmark.read_all_wide_rows_40_cols                                                                                thrpt    3     0.015 ±     0.016  ops/ms
ReadMicroBenchmark.read_all_wide_rows_80_cols                                                                                thrpt    3     0.013 ±     0.013  ops/ms
ReadMicroBenchmark.read_all_wide_rows_20_cols                                                                                 avgt    3    57.757 ±    38.058   ms/op
ReadMicroBenchmark.read_all_wide_rows_40_cols                                                                                 avgt    3    66.321 ±    83.872   ms/op
ReadMicroBenchmark.read_all_wide_rows_80_cols                                                                                 avgt    3    81.196 ±   115.854   ms/op
ReadMicroBenchmark.read_all_wide_rows_20_cols:read_all_wide_rows_20_cols·p1.00                                              sample        176.161               ms/op
ReadMicroBenchmark.read_all_wide_rows_40_cols:read_all_wide_rows_40_cols·p1.00                                              sample        164.889               ms/op
ReadMicroBenchmark.read_all_wide_rows_80_cols:read_all_wide_rows_80_cols·p1.00                                              sample        279.446               ms/op
ReadMicroBenchmark.read_all_wide_rows_20_cols                                                                                   ss    3   124.794 ±   189.612   ms/op
ReadMicroBenchmark.read_all_wide_rows_40_cols                                                                                   ss    3   145.048 ±   416.505   ms/op
ReadMicroBenchmark.read_all_wide_rows_80_cols                                                                                   ss    3   156.285 ±   951.021   ms/op

Regarding to skinny rows table, as for writing tests, it performs much worse than wide rows table:

Benchmark                                                                                                                     Mode  Cnt     Score       Error   Units
ReadMicroBenchmark.read_all_skinny_rows                                                                                      thrpt    3    ≈ 10⁻⁴              ops/ms
ReadMicroBenchmark.read_20_rows_from_all_skinny_rows                                                                         thrpt    3     0.001 ±     0.001  ops/ms
ReadMicroBenchmark.read_20_rows_from_all_skinny_rows                                                                          avgt    3  1120.074 ±  1076.035   ms/op
ReadMicroBenchmark.read_all_skinny_rows                                                                                       avgt    3  4481.257 ±  4532.688   ms/op

It shows one again than retrieving needed information from columns is much better than from rows. And the network doesn't play an important role here since the amount of data to transfer is the same.

This post talked about wide rows. Its first section defined this storage concept. As we learned there, wide rows prefer to store all logically related information in many different columns of the same row rather than on different smaller rows. It has its advantages, especially when the query pattern involves the reading of all columns every time. But on the other side, the decision to store the data in such a way should be dictated by the analysis of real need. It doesn't come without the cost (data store limits risk, performance troubles, cache issues) and this despite the positive results of micro-benchmark from the last section.