I have already described grouping sets feature in the context of Apache Spark. But natively they are a part of SQL standard and that's why I would like to extend the previous post here. After all, you don't need Big Data to use them - even though nowadays it's difficult to not to deal with it.
A virtual conference at the intersection of Data and AI. This is not a conference for the hype. Its real users talking about real experiences.
- 40+ speakers with the likes of Hannes from Duck DB, Sol Rashidi, Joe Reis, Sadie St. Lawrence, Ryan Wolf from nvidia, Rebecca from lidl
- 12th September 2024
- Three simultaneous tracks
- Panels, Lighting Talks, Keynotes, Booth crawls, Roundtables and Entertainment.
- Topics include (ingestion, finops for data, data for inference (feature platforms), data for ML observability
- 100% virtual and 100% free
👉 Register here
The post is composed of 3 sections. The first one talks more generally about data representation with grouping sets. The second one gives some precise examples of their use in PotgreSQL. The final part focuses on some internal execution details .
Summarizing data with GROUPING
SQL's GROUP BY clause is a powerful operation helping to generate aggregates in the context of one or more columns. However, with simple GROUP BY it's hard to generate some data, like for instance the totals for grouped aggregates. One of the solutions to do that are UNIONs:
WITH players (name, team, points, city) AS ( SELECT 'player1' AS name, 'team1' AS team, 3 AS points, 'Paris' AS city UNION ALL SELECT 'player2' AS name, 'team3' AS team, 4 AS points, 'Berlin' AS city UNION ALL SELECT 'player3' AS name, 'team4' AS team, 1 AS points, 'Warsaw' AS city UNION ALL SELECT 'player4' AS name, 'team2' AS team, 2 AS points, 'London' AS city UNION ALL SELECT 'player5' AS name, 'team1' AS team, 2 AS points, 'Warsaw' AS city UNION ALL SELECT 'player6' AS name, 'team2' AS team, 1 AS points, 'Berlin' AS city UNION ALL SELECT 'player7' AS name, 'team2' AS team, 6 AS points, 'Berlin' AS city ) SELECT team, city, SUM(points) FROM players GROUP BY team, city UNION ALL SELECT NULL, NULL, SUM(points) FROM players; team | city | sum -------+--------+----- team2 | Berlin | 7 team3 | Berlin | 4 team2 | London | 2 team4 | Warsaw | 1 team1 | Warsaw | 2 team1 | Paris | 3 | | 19 (7 rows)
If you're a careful reader, you can immediately see the inefficiency of above statement. After all the engine could compute the totals directly from aggregates and consider them as a partial aggregation, similarly to Tree aggregations in Spark. To mitigate this issue SQL standard provides GROUPING SETS operator. Simply speaking, it's a set of attributes used in group by statement. The previous query could be rewritten with GROUPING SETS like that:
WITH players (name, team, points, city) AS ( SELECT 'player1' AS name, 'team1' AS team, 3 AS points, 'Paris' AS city UNION ALL SELECT 'player2' AS name, 'team3' AS team, 4 AS points, 'Berlin' AS city UNION ALL SELECT 'player3' AS name, 'team4' AS team, 1 AS points, 'Warsaw' AS city UNION ALL SELECT 'player4' AS name, 'team2' AS team, 2 AS points, 'London' AS city UNION ALL SELECT 'player5' AS name, 'team1' AS team, 2 AS points, 'Warsaw' AS city UNION ALL SELECT 'player6' AS name, 'team2' AS team, 1 AS points, 'Berlin' AS city UNION ALL SELECT 'player7' AS name, 'team2' AS team, 6 AS points, 'Berlin' AS city ) SELECT team, city, SUM(points) FROM players GROUP BY GROUPING SETS ((), (team, city)); team | city | sum -------+--------+----- | | 19 team2 | Berlin | 7 team3 | Berlin | 4 team2 | London | 2 team4 | Warsaw | 1 team1 | Warsaw | 2 team1 | Paris | 3 (7 rows)
Grouping sets operator comes with 2 shorter operators, CUBE and ROLLUP. CUBE computes all possible permutations of the grouping set whereas ROLLUP produces an aggregation over a hierarchy of groups, e.g. for a group (a, b, c) it'll create the aggregates for (a, b, c), (a, b) and (a).
Grouping sets and their shortcuts (ROLLUP, CUBE) are implemented in a lot of databases. You can find them in usual relational systems like PostgreSQL, MySQL, Oracle or SQL Server, and also in distributed data stores used mainly in the context of data warehousing like BigQuery. You can also find them on AWS where Athena service provides the support for ROLLUP and CUBE operations.
Queries examples
The examples from this section are executed against PostgreSQL 10 started with the following Docker command docker run --name some-postgres -e POSTGRES_PASSWORD=password -d postgress. To access the image's psql you can call docker run -it --rm --link some-postgres:postgres postgres psql -h postgres -U postgres
Since the first part of the post gave an example of a query using GROUPING SETS operator, we'll start this samples section by illustrating the use of GROUPING function, returning 1 if the column in aggregation represents all values or 0 if it doesn't. In other words, 1 will be returned when our grouping will take either all countries or all teams:
WITH players (name, team, country, points) AS ( SELECT 'player1' AS name, 'team1' AS team, 'fr' AS country, 3 AS points UNION ALL SELECT 'player2' AS name, 'team3' AS team, 'de' AS country, 4 AS points UNION ALL SELECT 'player3' AS name, 'team4' AS team, 'pl' AS country, 1 AS points UNION ALL SELECT 'player4' AS name, 'team2' AS team, 'uk' AS country, 2 AS points UNION ALL SELECT 'player5' AS name, 'team1' AS team, 'pl' AS country, 2 AS points UNION ALL SELECT 'player6' AS name, 'team2' AS team, 'de' AS country, 1 AS points UNION ALL SELECT 'player7' AS name, 'team2' AS team, 'de' AS country, 6 AS points ) SELECT GROUPING(country) AS aggregated_country, GROUPING(team) AS aggregated_team, country, team, SUM(points) FROM players GROUP BY GROUPING SETS ((country), (team)); aggregated_country | aggregated_team | country | team | sum --------------------+-----------------+---------+-------+----- 0 | 1 | de | | 11 0 | 1 | uk | | 2 0 | 1 | pl | | 3 0 | 1 | fr | | 3 1 | 0 | | team3 | 4 1 | 0 | | team1 | 5 1 | 0 | | team4 | 1 1 | 0 | | team2 | 9 (8 rows)
In addition, we can also concatenate grouping sets with multiple expressions. In such a case, the result is a cross product of individual items. In the example below the aggregates are generated from the combination of the first and the second grouping statements, i.e. level + team and level + (team, country):
WITH players (name, team, country, points, level) AS ( SELECT 'player1' AS name, 'team1' AS team, 'fr' AS country, 3 AS points, 1 AS level UNION ALL SELECT 'player2' AS name, 'team3' AS team, 'de' AS country, 4 AS points, 1 AS level UNION ALL SELECT 'player3' AS name, 'team4' AS team, 'pl' AS country, 1 AS points, 1 AS level UNION ALL SELECT 'player4' AS name, 'team2' AS team, 'uk' AS country, 2 AS points, 1 AS level UNION ALL SELECT 'player5' AS name, 'team1' AS team, 'pl' AS country, 2 AS points, 1 AS level UNION ALL SELECT 'player6' AS name, 'team2' AS team, 'de' AS country, 1 AS points, 1 AS level UNION ALL SELECT 'player7' AS name, 'team2' AS team, 'de' AS country, 6 AS points, 2 AS level ) SELECT level, country, team, SUM(points) FROM players GROUP BY level, GROUPING SETS ((team), (team, country)); level | country | team | sum -------+---------+-------+----- 1 | pl | team1 | 2 1 | de | team3 | 4 1 | pl | team4 | 1 1 | fr | team1 | 3 2 | de | team2 | 6 1 | de | team2 | 1 1 | uk | team2 | 2 2 | | team2 | 6 1 | | team1 | 5 1 | | team4 | 1 1 | | team3 | 4 1 | | team2 | 3 (12 rows)
In the grouping sets we can also use composite columns, i.e. columns defined inside ‘()'. In such a case, the wrapped columns are considered as a single one. You can observe that in the following snippet where we don't have the result for (team, level) tuple. Normally we would get it if the level and country were written separately:
WITH players (name, team, country, points, level, city) AS ( SELECT 'player1' AS name, 'team1' AS team, 'fr' AS country, 3 AS points, 1 AS level, 'Paris' AS city UNION ALL SELECT 'player2' AS name, 'team3' AS team, 'de' AS country, 4 AS points, 1 AS level, 'Berlin' AS city UNION ALL SELECT 'player3' AS name, 'team4' AS team, 'pl' AS country, 1 AS points, 1 AS level, 'Warsaw' AS city UNION ALL SELECT 'player4' AS name, 'team2' AS team, 'uk' AS country, 2 AS points, 1 AS level, 'London' AS city UNION ALL SELECT 'player5' AS name, 'team1' AS team, 'pl' AS country, 2 AS points, 1 AS level, 'Warsaw' AS city UNION ALL SELECT 'player6' AS name, 'team2' AS team, 'de' AS country, 1 AS points, 1 AS level, 'Berlin' AS city UNION ALL SELECT 'player7' AS name, 'team2' AS team, 'de' AS country, 6 AS points, 2 AS level, 'Berlin' AS city ) SELECT team, level, country, city, SUM(points) FROM players GROUP BY ROLLUP (team, (level, country), city); team | level | country | city | sum -------+-------+---------+--------+----- team1 | 1 | fr | Paris | 3 team1 | 1 | fr | | 3 team1 | 1 | pl | Warsaw | 2 team1 | 1 | pl | | 2 team1 | | | | 5 team2 | 1 | de | Berlin | 1 team2 | 1 | de | | 1 team2 | 1 | uk | London | 2 team2 | 1 | uk | | 2 team2 | 2 | de | Berlin | 6 team2 | 2 | de | | 6 team2 | | | | 9 team3 | 1 | de | Berlin | 4 team3 | 1 | de | | 4 team3 | | | | 4 team4 | 1 | pl | Warsaw | 1 team4 | 1 | pl | | 1 team4 | | | | 1 | | | | 19 (19 rows)
And finally, of course, we can use grouping sets with usual HAVING clause . The next example filters all results and keeps only the one for France:
WITH players (name, team, country, points, level, city) AS ( SELECT 'player1' AS name, 'team1' AS team, 'fr' AS country, 3 AS points, 1 AS level, 'Paris' AS city UNION ALL SELECT 'player2' AS name, 'team3' AS team, 'de' AS country, 4 AS points, 1 AS level, 'Berlin' AS city UNION ALL SELECT 'player3' AS name, 'team4' AS team, 'pl' AS country, 1 AS points, 1 AS level, 'Warsaw' AS city UNION ALL SELECT 'player4' AS name, 'team2' AS team, 'uk' AS country, 2 AS points, 1 AS level, 'London' AS city UNION ALL SELECT 'player5' AS name, 'team1' AS team, 'pl' AS country, 2 AS points, 1 AS level, 'Warsaw' AS city UNION ALL SELECT 'player6' AS name, 'team2' AS team, 'de' AS country, 1 AS points, 1 AS level, 'Berlin' AS city UNION ALL SELECT 'player7' AS name, 'team2' AS team, 'de' AS country, 6 AS points, 2 AS level, 'Berlin' AS city ) SELECT team, level, country, city, SUM(points) FROM players GROUP BY ROLLUP (team, (level, country), city) HAVING country = 'fr'; team | level | country | city | sum -------+-------+---------+-------+----- team1 | 1 | fr | Paris | 3 team1 | 1 | fr | | 3 (2 rows)
Execution plan
But why it's preferred to use grouping sets over their alternative version based on UNION ALL shown in the first section? Essentially it's because of performance. In most of the implementations (all?), grouping sets compute the final aggregates in a single pass over the data. On the other side, UNION, unless the optimizer decides differently, uses as many passes as SELECT statements.
For instance, the SQL Server implementation reduces the number of passes and compute the final aggregates on top of base aggregates. PostgreSQL's ones does the same:
CREATE TABLE players ( name CHAR(7) NOT NULL PRIMARY KEY, team CHAR(5) NOT NULL, points INT, city VARCHAR(20) ); INSERT INTO players(name, team, points, city) VALUES ('player1', 'team1', 3, 'Paris'), ('player2', 'team3', 4, 'Berlin'), ('player3', 'team4', 1, 'Warsaw'), ('player4', 'team2', 2, 'London'), ('player5', 'team1', 2, 'Warsaw'), ('player6', 'team2', 1, 'Berlin'), ('player7', 'team2', 6, 'Berlin'); EXPLAIN (SELECT name, team, city, SUM(points) FROM players GROUP BY ROLLUP (name, team, city)); postgres=# EXPLAIN (SELECT name, team, city, SUM(points) FROM players GROUP BY ROLLUP (name, team, city)); QUERY PLAN ------------------------------------------------------------------ MixedAggregate (cost=0.00..45.76 rows=1651 width=122) Hash Key: name, team, city Hash Key: name, team Hash Key: name Group Key: () -> Seq Scan on players (cost=0.00..15.50 rows=550 width=118) (6 rows)
As you can see in the plan explanation, the query executes with a single table scan using different hashes to group the aggregates. Moreover, MixedAggregate is able to populate different hash tables in parallel to group sorting. If we allocate a sufficient amount of memory to the operation, it'll perform really good. On the other side, the naïve solution with UNION ALL will scan the table twice, thus obviously, will be less efficient:
postgres=# EXPLAIN SELECT team, city, SUM(points) FROM players GROUP BY team, city postgres-# UNION ALL postgres-# SELECT NULL, NULL, SUM(points) FROM players; QUERY PLAN -------------------------------------------------------------------------------- Append (cost=19.62..40.52 rows=201 width=90) -> HashAggregate (cost=19.62..21.62 rows=200 width=90) Group Key: players.team, players.city -> Seq Scan on players (cost=0.00..15.50 rows=550 width=86) -> Aggregate (cost=16.88..16.89 rows=1 width=72) -> Seq Scan on players players_1 (cost=0.00..15.50 rows=550 width=4) (6 rows)
Grouping sets are an interesting feature available in a lot of databases. It's valid not only for classical RDBMS as PostgreSQL, Oracle, but also for distributed data warehouse solutions as Google's BigQuery. The main feature of grouping sets consists of aggregating values by different groups. It's different from doing the same with UNION operator because it significantly reduces the number of scans over the input table. In addition, grouping sets feature comes with pretty handful ROLLUP and CUBE shortcuts that can be used to build group hierarchies or all group permutations automatically.