SQL GROUPING SETS operator

I have already described grouping sets feature in the context of Apache Spark. But natively they are a part of SQL standard and that's why I would like to extend the previous post here. After all, you don't need Big Data to use them - even though nowadays it's difficult to not to deal with it.

The post is composed of 3 sections. The first one talks more generally about data representation with grouping sets. The second one gives some precise examples of their use in PotgreSQL. The final part focuses on some internal execution details .

Summarizing data with GROUPING

SQL's GROUP BY clause is a powerful operation helping to generate aggregates in the context of one or more columns. However, with simple GROUP BY it's hard to generate some data, like for instance the totals for grouped aggregates. One of the solutions to do that are UNIONs:

WITH players (name, team, points, city) AS (
    SELECT 'player1' AS name, 'team1' AS team, 3 AS points, 'Paris' AS city UNION ALL
    SELECT 'player2' AS name, 'team3' AS team, 4 AS points, 'Berlin' AS city UNION ALL
    SELECT 'player3' AS name, 'team4' AS team, 1 AS points, 'Warsaw' AS city UNION ALL
    SELECT 'player4' AS name, 'team2' AS team, 2 AS points, 'London' AS city UNION ALL
    SELECT 'player5' AS name, 'team1' AS team, 2 AS points, 'Warsaw' AS city UNION ALL
    SELECT 'player6' AS name, 'team2' AS team, 1 AS points, 'Berlin' AS city UNION ALL
    SELECT 'player7' AS name, 'team2' AS team, 6 AS points, 'Berlin' AS city
)

SELECT team, city, SUM(points) FROM players GROUP BY team, city
UNION ALL
SELECT NULL, NULL, SUM(points) FROM players;
  team  |  city  | sum
-------+--------+-----
 team2 | Berlin |   7
 team3 | Berlin |   4
 team2 | London |   2
 team4 | Warsaw |   1
 team1 | Warsaw |   2
 team1 | Paris  |   3
       |        |  19
(7 rows)

If you're a careful reader, you can immediately see the inefficiency of above statement. After all the engine could compute the totals directly from aggregates and consider them as a partial aggregation, similarly to Tree aggregations in Spark. To mitigate this issue SQL standard provides GROUPING SETS operator. Simply speaking, it's a set of attributes used in group by statement. The previous query could be rewritten with GROUPING SETS like that:

WITH players (name, team, points, city) AS (
    SELECT 'player1' AS name, 'team1' AS team, 3 AS points, 'Paris' AS city UNION ALL
    SELECT 'player2' AS name, 'team3' AS team, 4 AS points, 'Berlin' AS city UNION ALL
    SELECT 'player3' AS name, 'team4' AS team, 1 AS points, 'Warsaw' AS city UNION ALL
    SELECT 'player4' AS name, 'team2' AS team, 2 AS points, 'London' AS city UNION ALL
    SELECT 'player5' AS name, 'team1' AS team, 2 AS points, 'Warsaw' AS city UNION ALL
    SELECT 'player6' AS name, 'team2' AS team, 1 AS points, 'Berlin' AS city UNION ALL
    SELECT 'player7' AS name, 'team2' AS team, 6 AS points, 'Berlin' AS city
)
SELECT team, city, SUM(points) FROM players
GROUP BY GROUPING SETS ((), (team, city));
 team  |  city  | sum
-------+--------+-----
       |        |  19
 team2 | Berlin |   7
 team3 | Berlin |   4
 team2 | London |   2
 team4 | Warsaw |   1
 team1 | Warsaw |   2
 team1 | Paris  |   3
(7 rows)

Grouping sets operator comes with 2 shorter operators, CUBE and ROLLUP. CUBE computes all possible permutations of the grouping set whereas ROLLUP produces an aggregation over a hierarchy of groups, e.g. for a group (a, b, c) it'll create the aggregates for (a, b, c), (a, b) and (a).

Grouping sets and their shortcuts (ROLLUP, CUBE) are implemented in a lot of databases. You can find them in usual relational systems like PostgreSQL, MySQL, Oracle or SQL Server, and also in distributed data stores used mainly in the context of data warehousing like BigQuery. You can also find them on AWS where Athena service provides the support for ROLLUP and CUBE operations.

Queries examples

The examples from this section are executed against PostgreSQL 10 started with the following Docker command docker run --name some-postgres -e POSTGRES_PASSWORD=password -d postgress. To access the image's psql you can call docker run -it --rm --link some-postgres:postgres postgres psql -h postgres -U postgres

Since the first part of the post gave an example of a query using GROUPING SETS operator, we'll start this samples section by illustrating the use of GROUPING function, returning 1 if the column in aggregation represents all values or 0 if it doesn't. In other words, 1 will be returned when our grouping will take either all countries or all teams:

WITH players (name, team, country, points) AS (
    SELECT 'player1' AS name, 'team1' AS team, 'fr' AS country, 3 AS points UNION ALL
    SELECT 'player2' AS name, 'team3' AS team, 'de' AS country, 4 AS points UNION ALL
    SELECT 'player3' AS name, 'team4' AS team, 'pl' AS country, 1 AS points UNION ALL
    SELECT 'player4' AS name, 'team2' AS team, 'uk' AS country, 2 AS points UNION ALL
    SELECT 'player5' AS name, 'team1' AS team, 'pl' AS country, 2 AS points UNION ALL
    SELECT 'player6' AS name, 'team2' AS team, 'de' AS country, 1 AS points UNION ALL
    SELECT 'player7' AS name, 'team2' AS team, 'de' AS country, 6 AS points
)

SELECT GROUPING(country) AS aggregated_country, GROUPING(team) AS aggregated_team, country, team, SUM(points) FROM players
GROUP BY GROUPING SETS ((country), (team));

 aggregated_country | aggregated_team | country | team  | sum
--------------------+-----------------+---------+-------+-----
                  0 |               1 | de      |       |  11
                  0 |               1 | uk      |       |   2
                  0 |               1 | pl      |       |   3
                  0 |               1 | fr      |       |   3
                  1 |               0 |         | team3 |   4
                  1 |               0 |         | team1 |   5
                  1 |               0 |         | team4 |   1
                  1 |               0 |         | team2 |   9
(8 rows)

In addition, we can also concatenate grouping sets with multiple expressions. In such a case, the result is a cross product of individual items. In the example below the aggregates are generated from the combination of the first and the second grouping statements, i.e. level + team and level + (team, country):

WITH players (name, team, country, points, level) AS (
    SELECT 'player1' AS name, 'team1' AS team, 'fr' AS country, 3 AS points, 1 AS level UNION ALL
    SELECT 'player2' AS name, 'team3' AS team, 'de' AS country, 4 AS points, 1 AS level UNION ALL
    SELECT 'player3' AS name, 'team4' AS team, 'pl' AS country, 1 AS points, 1 AS level UNION ALL
    SELECT 'player4' AS name, 'team2' AS team, 'uk' AS country, 2 AS points, 1 AS level UNION ALL
    SELECT 'player5' AS name, 'team1' AS team, 'pl' AS country, 2 AS points, 1 AS level UNION ALL
    SELECT 'player6' AS name, 'team2' AS team, 'de' AS country, 1 AS points, 1 AS level UNION ALL
    SELECT 'player7' AS name, 'team2' AS team, 'de' AS country, 6 AS points, 2 AS level
)

SELECT level, country, team, SUM(points) FROM players
GROUP BY level, GROUPING SETS ((team), (team, country));

 level | country | team  | sum
-------+---------+-------+-----
     1 | pl      | team1 |   2
     1 | de      | team3 |   4
     1 | pl      | team4 |   1
     1 | fr      | team1 |   3
     2 | de      | team2 |   6
     1 | de      | team2 |   1
     1 | uk      | team2 |   2
     2 |         | team2 |   6
     1 |         | team1 |   5
     1 |         | team4 |   1
     1 |         | team3 |   4
     1 |         | team2 |   3
(12 rows)

In the grouping sets we can also use composite columns, i.e. columns defined inside ‘()'. In such a case, the wrapped columns are considered as a single one. You can observe that in the following snippet where we don't have the result for (team, level) tuple. Normally we would get it if the level and country were written separately:

WITH players (name, team, country, points, level, city) AS (
    SELECT 'player1' AS name, 'team1' AS team, 'fr' AS country, 3 AS points, 1 AS level, 'Paris' AS city UNION ALL
    SELECT 'player2' AS name, 'team3' AS team, 'de' AS country, 4 AS points, 1 AS level, 'Berlin' AS city UNION ALL
    SELECT 'player3' AS name, 'team4' AS team, 'pl' AS country, 1 AS points, 1 AS level, 'Warsaw' AS city UNION ALL
    SELECT 'player4' AS name, 'team2' AS team, 'uk' AS country, 2 AS points, 1 AS level, 'London' AS city UNION ALL
    SELECT 'player5' AS name, 'team1' AS team, 'pl' AS country, 2 AS points, 1 AS level, 'Warsaw' AS city UNION ALL
    SELECT 'player6' AS name, 'team2' AS team, 'de' AS country, 1 AS points, 1 AS level, 'Berlin' AS city UNION ALL
    SELECT 'player7' AS name, 'team2' AS team, 'de' AS country, 6 AS points, 2 AS level, 'Berlin' AS city
)

SELECT team, level, country, city, SUM(points) FROM players
GROUP BY ROLLUP (team, (level, country), city);

 team  | level | country |  city  | sum
-------+-------+---------+--------+-----
 team1 |     1 | fr      | Paris  |   3
 team1 |     1 | fr      |        |   3
 team1 |     1 | pl      | Warsaw |   2
 team1 |     1 | pl      |        |   2
 team1 |       |         |        |   5
 team2 |     1 | de      | Berlin |   1
 team2 |     1 | de      |        |   1
 team2 |     1 | uk      | London |   2
 team2 |     1 | uk      |        |   2
 team2 |     2 | de      | Berlin |   6
 team2 |     2 | de      |        |   6
 team2 |       |         |        |   9
 team3 |     1 | de      | Berlin |   4
 team3 |     1 | de      |        |   4
 team3 |       |         |        |   4
 team4 |     1 | pl      | Warsaw |   1
 team4 |     1 | pl      |        |   1
 team4 |       |         |        |   1
       |       |         |        |  19
(19 rows)

And finally, of course, we can use grouping sets with usual HAVING clause . The next example filters all results and keeps only the one for France:

WITH players (name, team, country, points, level, city) AS (
    SELECT 'player1' AS name, 'team1' AS team, 'fr' AS country, 3 AS points, 1 AS level, 'Paris' AS city UNION ALL
    SELECT 'player2' AS name, 'team3' AS team, 'de' AS country, 4 AS points, 1 AS level, 'Berlin' AS city UNION ALL
    SELECT 'player3' AS name, 'team4' AS team, 'pl' AS country, 1 AS points, 1 AS level, 'Warsaw' AS city UNION ALL
    SELECT 'player4' AS name, 'team2' AS team, 'uk' AS country, 2 AS points, 1 AS level, 'London' AS city UNION ALL
    SELECT 'player5' AS name, 'team1' AS team, 'pl' AS country, 2 AS points, 1 AS level, 'Warsaw' AS city UNION ALL
    SELECT 'player6' AS name, 'team2' AS team, 'de' AS country, 1 AS points, 1 AS level, 'Berlin' AS city UNION ALL
    SELECT 'player7' AS name, 'team2' AS team, 'de' AS country, 6 AS points, 2 AS level, 'Berlin' AS city
)

SELECT team, level, country, city, SUM(points) FROM players
GROUP BY ROLLUP (team, (level, country), city) HAVING country = 'fr';

 team  | level | country | city  | sum
-------+-------+---------+-------+-----
 team1 |     1 | fr      | Paris |   3
 team1 |     1 | fr      |       |   3
(2 rows)

Execution plan

But why it's preferred to use grouping sets over their alternative version based on UNION ALL shown in the first section? Essentially it's because of performance. In most of the implementations (all?), grouping sets compute the final aggregates in a single pass over the data. On the other side, UNION, unless the optimizer decides differently, uses as many passes as SELECT statements.

For instance, the SQL Server implementation reduces the number of passes and compute the final aggregates on top of base aggregates. PostgreSQL's ones does the same:

CREATE TABLE players (
    name CHAR(7) NOT NULL PRIMARY KEY, 
    team CHAR(5) NOT NULL,
    points INT,
    city VARCHAR(20)
);
INSERT INTO players(name, team, points, city) VALUES 
('player1', 'team1', 3, 'Paris'),
('player2', 'team3', 4, 'Berlin'),
('player3', 'team4', 1, 'Warsaw'),
('player4', 'team2', 2, 'London'),
('player5', 'team1', 2, 'Warsaw'),
('player6', 'team2', 1, 'Berlin'),
('player7', 'team2', 6, 'Berlin');

EXPLAIN (SELECT name, team, city, SUM(points) FROM players  GROUP BY ROLLUP (name, team, city));


postgres=# EXPLAIN (SELECT name, team, city, SUM(points) FROM players  GROUP BY ROLLUP (name, team, city));
                            QUERY PLAN
------------------------------------------------------------------
 MixedAggregate  (cost=0.00..45.76 rows=1651 width=122)
   Hash Key: name, team, city
   Hash Key: name, team
   Hash Key: name
   Group Key: ()
   ->  Seq Scan on players  (cost=0.00..15.50 rows=550 width=118)
(6 rows)

As you can see in the plan explanation, the query executes with a single table scan using different hashes to group the aggregates. Moreover, MixedAggregate is able to populate different hash tables in parallel to group sorting. If we allocate a sufficient amount of memory to the operation, it'll perform really good. On the other side, the naïve solution with UNION ALL will scan the table twice, thus obviously, will be less efficient:

postgres=# EXPLAIN SELECT team, city, SUM(points) FROM players GROUP BY team, city
postgres-# UNION ALL
postgres-# SELECT NULL, NULL, SUM(points) FROM players;
                                   QUERY PLAN
--------------------------------------------------------------------------------
 Append  (cost=19.62..40.52 rows=201 width=90)
   ->  HashAggregate  (cost=19.62..21.62 rows=200 width=90)
         Group Key: players.team, players.city
         ->  Seq Scan on players  (cost=0.00..15.50 rows=550 width=86)
   ->  Aggregate  (cost=16.88..16.89 rows=1 width=72)
         ->  Seq Scan on players players_1  (cost=0.00..15.50 rows=550 width=4)
(6 rows)

Grouping sets are an interesting feature available in a lot of databases. It's valid not only for classical RDBMS as PostgreSQL, Oracle, but also for distributed data warehouse solutions as Google's BigQuery. The main feature of grouping sets consists of aggregating values by different groups. It's different from doing the same with UNION operator because it significantly reduces the number of scans over the input table. In addition, grouping sets feature comes with pretty handful ROLLUP and CUBE shortcuts that can be used to build group hierarchies or all group permutations automatically.