Hierarchical queries

Versions: PostreSQL 10.2

SQL hides for the most of its daily users a lot of interesting and powerful functions that though are not used very frequently. One of them are hierarchical queries.

Data Engineering Design Patterns

Looking for a book that defines and solves most common data engineering problems? I'm currently writing one on that topic and the first chapters are already available in πŸ‘‰ Early Release on the O'Reilly platform

I also help solve your data engineering problems πŸ‘‰ contact@waitingforcode.com πŸ“©

This post explain hierarchical queries in 3 sections. The first gives some basic definitions about this type of queries. The second one shows a simple example of hierarchical query imitating for loop behavior. The last part illustrates more real-world example of their use in the case of a blog post comments hierarchy.

Hierarchical queries definition

To understand the hierarchical queries we must begin by the definition of the hierarchical data. As the name indicates, the hierarchical data represents hierarchies. An example of that kind of hierarchy can be the genealogical tree where some people are ancestor of others (= they are hierarchically in higher position). Thus to generalize we can tell that a hierarchical data represents the data related to each other by a hierarchical relationship (e.g. "is parent of" in genealogical tree).

Thus the hierarchical query is a query able to handle hierarchical data. Oracle implements them with CONNECT BY statement. In the standard SQL the hierarchical query is executed with the help of Common Table Expression and their WITH RECURSIVE statement. More details are included in the next section.

Common Table Expression (CTE)

The CTE represents a temporary result set defined through WITH operator. This temporary set can be later referenced in one or multiple places of the query. It also contributes to improve the readability of the queries.

Hierarchical query - for loop example

An example of a hierarchical query can be a simulation of a for loop existing in many programming languages. It can be defined as follows:

WITH RECURSIVE for_loop_imitation (i) AS  (
  SELECT 0
  UNION ALL
  SELECT i+1  FROM for_loop_imitation WHERE i < 9)

SELECT * FROM for_loop_imitation;
 i
---
 0
 1
 2
 3
 4
 5
 6
 7
 8
 9
(10 rows)

As you can clearly see, the query generated the numbers from 0 to 9, exactly as the for loop could do. To understand better the hierarchical queries let's see how does it work by detailing the components of the recursive CTE:

To have a clearer vision let's now see how the above query translates to the query plan (PostgreSQL case):

                                                QUERY PLAN                                     
-----------------------------------------------------------------------------------------------------------
 CTE Scan on for_loop_imitation  (cost=2.95..3.57 rows=31 width=4)
   CTE for_loop_imitation
     ->  Recursive Union  (cost=0.00..2.95 rows=31 width=4)
           ->  Result  (cost=0.00..0.01 rows=1 width=4)
           ->  WorkTable Scan on for_loop_imitation for_loop_imitation_1  (cost=0.00..0.23 rows=3 width=4)
                 Filter: (i < 9)
(6 rows)

As you can see, the output references WorkTable that represents working table where all outputs of recursive executions are stored. More concretely, the execution algorithm behaves finally as iteration rather than recursion (according to the doc, recursion was chosen to respect the SQL standards) and it executes in the following steps:

Hierarchical queries examples

To see the hierarchical queries in more real-world examples, we'll use the case of hierarchical comments:

CREATE TABLE comments (
  id INT NOT NULL,
  parent_comment_id INT NULL,
  replied_comment_id INT NULL,
  author VARCHAR(10) NOT NULL,
  added TIMESTAMP NOT NULL,
  content TEXT NOT NULL,
  PRIMARY KEY(id)
);


INSERT INTO comments (id, parent_comment_id, replied_comment_id, author, added, content) VALUES
(1, 1, NULL, 'user_1', TIMESTAMP '2018-01-01 10:00:00', 'user_1 comment_id_1'),
(2, 1, 1, 'admin', TIMESTAMP '2018-01-01 10:01:00','admin answer to user_1 comment_id_1'),
(3, 1, 2, 'user_1', TIMESTAMP '2018-01-01 10:02:00','user_1 answer to admin comment_id_2'),
(4, 4, NULL, 'user_2', TIMESTAMP '2018-01-01 14:00:00','user_2 comment_id_5'),
(5, 5, NULL, 'user_3', TIMESTAMP '2018-01-01 14:01:00','user_3 comment_id_6'),
(6, 5, 5, 'user_1', TIMESTAMP '2018-01-01 15:00:00','user_1 answer to user_3 comment_id_6'),
(7, 5, 5, 'user_2', TIMESTAMP '2018-01-01 15:01:00','user_2 answer to user_3 comment_id_6'),
(8, 1, 3, 'admin', TIMESTAMP '2018-01-01 16:03:00','admin answer to user_1 comment_id_3'),
(9, 5, 7, 'admin', TIMESTAMP '2018-01-01 15:02:00','admin answer to user_1 comment_id_7'),
(10, 1, 8, 'admin', TIMESTAMP '2018-01-01 15:03:00','admin answer to user_2 comment_id_8'),
(11, 1, 10, 'admin', TIMESTAMP '2018-01-01 15:03:30','admin answer to admin comment_id_10'),
(12, 1, 10, 'user_3', TIMESTAMP '2018-01-01 15:03:50','user_3 answer to admin comment_id_10'),
(13, 1, 2, 'user_3', TIMESTAMP '2018-01-01 15:03:50','user_3 answer to admin comment_id_2');

As you can see, the comments can be related to each others through the replied_comment_id field. Let's start with a simple query returning the comments tree for one specific parent comment:

WITH RECURSIVE comments_tree AS (
  (
    SELECT
      id, replied_comment_id, author, added, parent_comment_id, ARRAY[id] AS path,
      content
    FROM comments WHERE id = 1
  )
  UNION ALL
  (
    SELECT
      child.id, child.replied_comment_id, child.author, child.added,
      child.parent_comment_id, (ct.path || child.id) AS path,
      LPAD(child.content, LENGTH(child.content) + CARDINALITY(path), ' ') AS content
      FROM comments_tree ct
      JOIN comments child ON child.replied_comment_id = ct.id
  )
)

SELECT id, parent_comment_id, added, replied_comment_id, content FROM comments_tree;

The generated output will be:

correlated_subqueries-# SELECT id, parent_comment_id, added, replied_comment_id, content FROM comments_tree;
 id | parent_comment_id |        added        | replied_comment_id |                  content                 
 
----+-------------------+---------------------+--------------------+------------------------------------------
-
  1 |                 1 | 2018-01-01 10:00:00 |                    | user_1 comment_id_1
  2 |                 1 | 2018-01-01 10:01:00 |                  1 |  admin answer to user_1 comment_id_1
  3 |                 1 | 2018-01-01 10:02:00 |                  2 |   user_1 answer to admin comment_id_2
 13 |                 1 | 2018-01-01 15:03:50 |                  2 |   user_3 answer to admin comment_id_2
  8 |                 1 | 2018-01-01 16:03:00 |                  3 |    admin answer to user_1 comment_id_3
 10 |                 1 | 2018-01-01 15:03:00 |                  8 |     admin answer to user_2 comment_id_8
 11 |                 1 | 2018-01-01 15:03:30 |                 10 |      admin answer to admin comment_id_10
 12 |                 1 | 2018-01-01 15:03:50 |                 10 |      user_3 answer to admin comment_id_10
(8 rows)

As you can see through this output, it returned a list of hierarchically ordered comments. The hierarchy is well visible thanks to the lpad function adding one space for every level of comments. The recursive CTE can also be applied to all the comments and create a list of all comments with nested levels:

WITH RECURSIVE comments_tree AS (
  (
    SELECT
      id, replied_comment_id, author, added, parent_comment_id,
      content
    FROM comments WHERE replied_comment_id IS NULL
  )
  UNION ALL
  (
    SELECT
      child.id, child.replied_comment_id, child.author, child.added,
      child.parent_comment_id
      LPAD(child.content, LENGTH(child.content) + CARDINALITY(path), ' ') AS content
      FROM comments_tree ct
      JOIN comments child ON child.replied_comment_id = ct.id
  )
)

-- shows all queries, how it goes down for each level
SELECT id, parent_comment_id, added, replied_comment_id, content FROM comments_tree
ORDER BY parent_comment_id ASC, replied_comment_id IS NOT NULL ASC, replied_comment_id ASC;

With the output:

 id | parent_comment_id |        added        | replied_comment_id |                  content                 
 
----+-------------------+---------------------+--------------------+------------------------------------------
-
  1 |                 1 | 2018-01-01 10:00:00 |                    | user_1 comment_id_1
  2 |                 1 | 2018-01-01 10:01:00 |                  1 |  admin answer to user_1 comment_id_1
  3 |                 1 | 2018-01-01 10:02:00 |                  2 |   user_1 answer to admin comment_id_2
 13 |                 1 | 2018-01-01 15:03:50 |                  2 |   user_3 answer to admin comment_id_2
  8 |                 1 | 2018-01-01 16:03:00 |                  3 |    admin answer to user_1 comment_id_3
 10 |                 1 | 2018-01-01 15:03:00 |                  8 |     admin answer to user_2 comment_id_8
 11 |                 1 | 2018-01-01 15:03:30 |                 10 |      admin answer to admin comment_id_10
 12 |                 1 | 2018-01-01 15:03:50 |                 10 |      user_3 answer to admin comment_id_10
  4 |                 4 | 2018-01-01 14:00:00 |                    | user_2 comment_id_5
  5 |                 5 | 2018-01-01 14:01:00 |                    | user_3 comment_id_6
  7 |                 5 | 2018-01-01 15:01:00 |                  5 |  user_2 answer to user_3 comment_id_6
  6 |                 5 | 2018-01-01 15:00:00 |                  5 |  user_1 answer to user_3 comment_id_6
  9 |                 5 | 2018-01-01 15:02:00 |                  7 |   admin answer to user_1 comment_id_7
(13 rows)

However the above output also shows some limitation of the use of recursive CTE to nested hierarchy construction. It's because this expression works on breadth-first manner instead of depth-first. That said, it returns a complete level every time instead of the complete branch. To remedy this problem we can make the same trick as in the article quoted in "Read more" section (Recursive CTEs Explained) consisting on constructing a path and ordering by it:

WITH RECURSIVE comments_tree AS (
  (
    SELECT
      id, replied_comment_id, author, added, parent_comment_id, ARRAY[id] AS path,
      content
    FROM comments WHERE replied_comment_id IS NULL
  )
  UNION ALL
  (
    SELECT
      child.id, child.replied_comment_id, child.author, child.added,
      child.parent_comment_id
      , (ct.path || child.id) AS path,
      LPAD(child.content, LENGTH(child.content) + CARDINALITY(path), ' ') AS content
      FROM comments_tree ct
      JOIN comments child ON child.replied_comment_id = ct.id
  )
)

Now the output will look like:

correlated_subqueries-# SELECT id, parent_comment_id, added, replied_comment_id, content FROM comments_tree
correlated_subqueries-# ORDER BY path ASC;
 id | parent_comment_id |        added        | replied_comment_id |                  content                 
 
----+-------------------+---------------------+--------------------+------------------------------------------
-
  1 |                 1 | 2018-01-01 10:00:00 |                    | user_1 comment_id_1
  2 |                 1 | 2018-01-01 10:01:00 |                  1 |  admin answer to user_1 comment_id_1
  3 |                 1 | 2018-01-01 10:02:00 |                  2 |   user_1 answer to admin comment_id_2
  8 |                 1 | 2018-01-01 16:03:00 |                  3 |    admin answer to user_1 comment_id_3
 10 |                 1 | 2018-01-01 15:03:00 |                  8 |     admin answer to user_2 comment_id_8
 11 |                 1 | 2018-01-01 15:03:30 |                 10 |      admin answer to admin comment_id_10
 12 |                 1 | 2018-01-01 15:03:50 |                 10 |      user_3 answer to admin comment_id_10
 13 |                 1 | 2018-01-01 15:03:50 |                  2 |   user_3 answer to admin comment_id_2
  4 |                 4 | 2018-01-01 14:00:00 |                    | user_2 comment_id_5
  5 |                 5 | 2018-01-01 14:01:00 |                    | user_3 comment_id_6
  6 |                 5 | 2018-01-01 15:00:00 |                  5 |  user_1 answer to user_3 comment_id_6
  7 |                 5 | 2018-01-01 15:01:00 |                  5 |  user_2 answer to user_3 comment_id_6
  9 |                 5 | 2018-01-01 15:02:00 |                  7 |   admin answer to user_1 comment_id_7
(13 rows)

The hierarchical queries are quite nice feature present in the RDBMS. As shown in the first section they allow to work with hierarchical data structures, as for instance trees. One of great examples illustrating it is the genealogical tree. The second section shown a simple example of the hierarchical query simulating the for loop behavior. The last part was about more realistic use-case of a website comments returned through 3 different queries: one comment query and queries showing breadth-first and depth-first result generations.