https://gist.github.com/bartosz25/d0f9c4fb852a63e75f446a51dccb65e3
Big Data enforces denormalized storage. Joins are costly and it's often much more efficient to store all related information in a single row. Such rows with a lot of columns are called wide rows and they'll be explained in the sections below.
Data Engineering Design Patterns
Looking for a book that defines and solves most common data engineering problems? I'm currently writing
one on that topic and the first chapters are already available in 👉
Early Release on the O'Reilly platform
I also help solve your data engineering problems 👉 contact@waitingforcode.com 📩
The post focuses on the column-oriented storage called wide rows. The first section explains this concept. The second one gives some positive and negative points about it. The last compares different wide rows (tables with 80, 40 and 20 columns) in Apache Cassandra.
Wide rows defined
Data stored in column-family stores is not homogeneous. Sometimes it's pretty obvious to write it in RDBMS-like style, one event per row. But there are the situations when it's useful to group events of given key in the same row and then, to store them in dozens or more columns. The row in such storage strategy is called a wide row.
When we could use wide rows? The first use case coming in mind is the one of 1:n relationship. A wide row can be used as its flattened version. For instance, if we consider football players and the number of goals scored in each of 38 matches, we could store it in row-oriented fashion like this:
Key | Goals |
---|---|
Player 1#2018/2019_1 | 3 |
Player 1#2018/2019_2 | 0 |
... | |
Player 1#2018/2019_37 | 1 |
Player 1#2018/2019_38 | 2 |
But we could also store the data in column-oriented way:
Key | Match 1 | Match 2 | ... | Match 37 | Match 38 |
---|---|---|---|---|---|
Player 1#2018/2019 | 3 | 0 | ... | 1 | 2 |
Which version is better? It all depends on the query pattern. If we'll always retrieve all stats for given player, then hitting given row only once on disk can be better. On the other side if we'll retrieve the stats only for one match, then the first proposal is more adapted. Unfortunately, the query pattern rule is not always valid and we can learn that in the next section.
Benefits and drawbacks
As already told, one of the main benefits of wide rows is the query pattern. If we're sure to retrieve always all columns and that the number of columns is not enormous, putting them all into one common row makes sense. Thanks to this method the query hits only 1 partition, thus it reduces the I/O and may have a positive impact on the read performance. An alternative to storing particular entries in columns would be the storage in maps. However, in this case, all the times the engine will need to read the entire map and eventually at the end filter it out. Having these values in separate columns would reduce the amount of returned data. Of course, this point depends on the underlying storage.
Furthermore, often distributed column-oriented stores don't provide the traditional transactional character of all or nothing. With wide rows, we can accomplish it because the "transaction" will apply on row level and never a half of row columns will be persisted.
But wide rows also have the drawbacks. First of all, we must keep in mind that their performance will vary and for some data specificities, they can perform better than for another. Before using them we should always measure the gain or the loss. Secondly, wide rows can very quickly hit storage limitations. It can be especially harmful when the number of columns is unbounded. For instance, storing a second-basis measurement for an hourly IoT sensor activity can be a bad idea because too many events in a row may violate some of storages best practices.
Another drawback of wide rows is their impact on the cache. Let's imagine the case when only 2 from 100 big columns are frequently used by the clients. Since the cache is row-based, we'll store extra 98 columns that are accessed less frequently. By doing that we'll prevent other rows to be cached because of already taken storage. This point of not-all-used columns brings another drawback - sometimes 90% of values may be missing. In such case, we'll make client reading more difficult since it must deal with null values. This is especially true in dynamically created columns. For instance, if we decide to add a dynamic column with insertion date, on a year scale 364 or 365 other dates won't be specified. It's why mixing dynamic columns with wide rows is often not so good idea as it may appear at the beginning.
Wide rows test - writing
Writing queries to show wide rows doesn't have a lot of sense. Instead, I'll try to get an idea of how querying on wide rows performs on my standalone 2-nodes Apache Cassandra cluster. Please notice however that it's executed only locally and it can behave slightly different in the real clusters.
First, let's create an environment for Cassandra Cluster Manager tool and construct a cluster with 2 nodes:
virtualenv .venv source .venv/bin/activate ccm create test_wide_rows -v 3.11.3 -n 2 -s ccm start
A small test to check if both nodes are up:
(.venv) bartosz:~/programming/cassandra_cluster_manager$ ccm node1 status Datacenter: datacenter1 ======================= Status=Up/Down |/ State=Normal/Leaving/Joining/Moving -- Address Load Tokens Owns (effective) Host ID Rack UN 127.0.0.1 100.56 KiB 1 100.0% 6abeda6b-ad6c-4480-91a8-7cdb75fa524a rack1 UN 127.0.0.2 25.12 KiB 1 100.0% e2225988-bc26-458a-b77a-bfe7957681dd rack1 (.venv) bartosz:~/programming/cassandra_cluster_manager$ ccm node2 status Datacenter: datacenter1 ======================= Status=Up/Down |/ State=Normal/Leaving/Joining/Moving -- Address Load Tokens Owns (effective) Host ID Rack UN 127.0.0.1 29.99 KiB 1 66.7% 6abeda6b-ad6c-4480-91a8-7cdb75fa524a rack1 UN 127.0.0.2 185.36 KiB 1 66.7% e2225988-bc26-458a-b77a-bfe7957681dd rack1 UN 127.0.0.3 25.13 KiB 1 66.7% 33a18509-a423-46e9-a232-8545248e47d3 rack1 (.venv) bartosz:~/programming/cassandra_cluster_manager$
To show the performances of both schemas, I'm using JMH and the following common setup code for both reading and writing cases:
public class CassandraSession { private static final String TEST_KEYSPACE = "test_keyspace"+System.currentTimeMillis(); private final Session session; private final List<String> wideRowTable80Columns = new ArrayList<>(); private final List<String> wideRowTable40Columns = new ArrayList<>(); private final List<String> wideRowTable20Columns = new ArrayList<>(); public CassandraSession() { Cluster cluster = Cluster.builder().addContactPoint("127.0.0.1").build(); session = cluster.connect(); for (int i = 0; i < 80; i++) { String columnName = "column"+i; wideRowTable80Columns.add(columnName); if (i < 40) { wideRowTable40Columns.add(columnName); if (i < 20) { wideRowTable20Columns.add(columnName); } } } session.execute("CREATE KEYSPACE IF NOT EXISTS "+ TEST_KEYSPACE +" WITH REPLICATION = { 'class' : 'SimpleStrategy', 'replication_factor' : 2}"); session.execute("USE "+ TEST_KEYSPACE); } public void execute(String query) { session.execute(query); } public void createTables() { session.execute(getWideRowTableCreationQuery(wideRowTable80Columns)); session.execute(getWideRowTableCreationQuery(wideRowTable40Columns)); session.execute(getWideRowTableCreationQuery(wideRowTable20Columns)); String skinnyRowTableQuery = "CREATE TABLE skinny_row_test ( col text, value text, PRIMARY KEY (col)) "; session.execute(skinnyRowTableQuery); } private String getWideRowTableCreationQuery(List<String> columns) { int columnsCount = columns.size(); String wideRowTableQuery = "CREATE TABLE wide_rows_test_"+columnsCount+" (id INT, "; String wideRowTableCols = columns.stream().map(column -> column+" text ") .collect(Collectors.joining(",")); wideRowTableQuery += wideRowTableCols; wideRowTableQuery += " , PRIMARY KEY(id) )"; return wideRowTableQuery; } public void closeSession() { session.close(); session.getCluster().close(); } }
The tests focus on writing and reading case. The former ones checks saving of 100 rows for each of tested tables. Moreover, the test contains a case verifying the performance of insert into a skinny table (2 column/row). The tested code is available on gist. The results show pretty clearly that the difference between the tables with 20 and 40 columns is very small for throughput, average execution time and single shot invocation time:
Benchmark Mode Cnt Score Error Units WriteMicroBenchmark.write_wide_rows_to_20_cols_table thrpt 3 0.017 ± 0.045 ops/ms WriteMicroBenchmark.write_wide_rows_to_40_cols_table thrpt 3 0.014 ± 0.043 ops/ms WriteMicroBenchmark.write_wide_rows_to_20_cols_table avgt 3 59.848 ± 181.634 ms/op WriteMicroBenchmark.write_wide_rows_to_40_cols_table avgt 3 74.905 ± 244.420 ms/op WriteMicroBenchmark.write_wide_rows_to_20_cols_table ss 3 163.301 ± 295.528 ms/op WriteMicroBenchmark.write_wide_rows_to_40_cols_table ss 3 187.843 ± 244.468 ms/op
The test writing data to 80 columns performs slightly worse for above measures:
Benchmark Mode Cnt Score Error Units WriteMicroBenchmark.write_wide_rows_to_80_cols_table thrpt 3 0.010 ± 0.020 ops/ms WriteMicroBenchmark.write_wide_rows_to_80_cols_table avgt 3 100.400 ± 255.359 ms/op WriteMicroBenchmark.write_wide_rows_to_80_cols_table ss 3 248.805 ± 889.762 ms/op
However, the difference is less visible for the case of sample execution time:
WriteMicroBenchmark.write_wide_rows_to_40_cols_table:write_wide_rows_to_40_cols_table·p1.00 sample 233.570 ms/op WriteMicroBenchmark.write_wide_rows_to_80_cols_table:write_wide_rows_to_80_cols_table·p1.00 sample 246.415 ms/op
Regarding to skinny rows, they perform much worse than all wide rows scenarios:
Benchmark Mode Cnt Score Error Units WriteMicroBenchmark.write_skinny_rows thrpt 3 ≈ 10⁻⁴ ops/ms WriteMicroBenchmark.write_skinny_rows avgt 3 3549.173 ± 8300.291 ms/op WriteMicroBenchmark.write_skinny_rows:write_skinny_rows·p1.00 sample 5729.419 ms/op WriteMicroBenchmark.write_skinny_rows ss 3 4048.182 ± 24205.465 ms/op
From that we can easily deduce that the performance degrades with increasing number of columns. We can also notice that saving the same amount of data in columns performs much better than in rows (skinny rows table). And the difference is of 35 order of magnitude with rows of 80 columns.
Wide rows tests - reading
Reading tests verify the performances of reading all rows from each table (20-, 40-, 80-columns + skinny) and reading of only 20 columns. Once again, the code is available on gist and it's why we focus here only on the results. The results for all of 3 wide row tables are pretty consistent. As for writing, here too the performance degrades with an increased number of columns - and so for each of test cases:
Benchmark Mode Cnt Score Error Units ReadMicroBenchmark.read_20_columns_from_all_wide_row_20_cols_table thrpt 3 0.015 ± 0.012 ops/ms ReadMicroBenchmark.read_20_columns_from_all_wide_row_40_cols_table thrpt 3 0.014 ± 0.011 ops/ms ReadMicroBenchmark.read_20_columns_from_all_wide_row_80_cols_table thrpt 3 0.013 ± 0.009 ops/ms ReadMicroBenchmark.read_20_columns_from_all_wide_row_20_cols_table avgt 3 66.470 ± 56.636 ms/op ReadMicroBenchmark.read_20_columns_from_all_wide_row_40_cols_table avgt 3 70.430 ± 69.421 ms/op ReadMicroBenchmark.read_20_columns_from_all_wide_row_80_cols_table avgt 3 76.030 ± 76.495 ms/op ReadMicroBenchmark.read_20_columns_from_all_wide_row_20_cols_table:read_20_columns_from_all_wide_row_20_cols_table·p1.00 sample 148.374 ms/op ReadMicroBenchmark.read_20_columns_from_all_wide_row_40_cols_table:read_20_columns_from_all_wide_row_40_cols_table·p1.00 sample 173.015 ms/op ReadMicroBenchmark.read_20_columns_from_all_wide_row_80_cols_table:read_20_columns_from_all_wide_row_80_cols_table·p1.00 sample 261.095 ms/op ReadMicroBenchmark.read_20_columns_from_all_wide_row_20_cols_table ss 3 140.156 ± 493.088 ms/op ReadMicroBenchmark.read_20_columns_from_all_wide_row_40_cols_table ss 3 127.799 ± 347.604 ms/op ReadMicroBenchmark.read_20_columns_from_all_wide_row_80_cols_table ss 3 160.245 ± 566.044 ms/op ReadMicroBenchmark.read_all_wide_rows_20_cols thrpt 3 0.017 ± 0.015 ops/ms ReadMicroBenchmark.read_all_wide_rows_40_cols thrpt 3 0.015 ± 0.016 ops/ms ReadMicroBenchmark.read_all_wide_rows_80_cols thrpt 3 0.013 ± 0.013 ops/ms ReadMicroBenchmark.read_all_wide_rows_20_cols avgt 3 57.757 ± 38.058 ms/op ReadMicroBenchmark.read_all_wide_rows_40_cols avgt 3 66.321 ± 83.872 ms/op ReadMicroBenchmark.read_all_wide_rows_80_cols avgt 3 81.196 ± 115.854 ms/op ReadMicroBenchmark.read_all_wide_rows_20_cols:read_all_wide_rows_20_cols·p1.00 sample 176.161 ms/op ReadMicroBenchmark.read_all_wide_rows_40_cols:read_all_wide_rows_40_cols·p1.00 sample 164.889 ms/op ReadMicroBenchmark.read_all_wide_rows_80_cols:read_all_wide_rows_80_cols·p1.00 sample 279.446 ms/op ReadMicroBenchmark.read_all_wide_rows_20_cols ss 3 124.794 ± 189.612 ms/op ReadMicroBenchmark.read_all_wide_rows_40_cols ss 3 145.048 ± 416.505 ms/op ReadMicroBenchmark.read_all_wide_rows_80_cols ss 3 156.285 ± 951.021 ms/op
Regarding to skinny rows table, as for writing tests, it performs much worse than wide rows table:
Benchmark Mode Cnt Score Error Units ReadMicroBenchmark.read_all_skinny_rows thrpt 3 ≈ 10⁻⁴ ops/ms ReadMicroBenchmark.read_20_rows_from_all_skinny_rows thrpt 3 0.001 ± 0.001 ops/ms ReadMicroBenchmark.read_20_rows_from_all_skinny_rows avgt 3 1120.074 ± 1076.035 ms/op ReadMicroBenchmark.read_all_skinny_rows avgt 3 4481.257 ± 4532.688 ms/op
It shows one again than retrieving needed information from columns is much better than from rows. And the network doesn't play an important role here since the amount of data to transfer is the same.
This post talked about wide rows. Its first section defined this storage concept. As we learned there, wide rows prefer to store all logically related information in many different columns of the same row rather than on different smaller rows. It has its advantages, especially when the query pattern involves the reading of all columns every time. But on the other side, the decision to store the data in such a way should be dictated by the analysis of real need. It doesn't come without the cost (data store limits risk, performance troubles, cache issues) and this despite the positive results of micro-benchmark from the last section.