Hierarchical queries

on waitingforcode.com

Hierarchical queries

SQL hides for the most of its daily users a lot of interesting and powerful functions that though are not used very frequently. One of them are hierarchical queries.

This post explain hierarchical queries in 3 sections. The first gives some basic definitions about this type of queries. The second one shows a simple example of hierarchical query imitating for loop behavior. The last part illustrates more real-world example of their use in the case of a blog post comments hierarchy.

Hierarchical queries definition

To understand the hierarchical queries we must begin by the definition of the hierarchical data. As the name indicates, the hierarchical data represents hierarchies. An example of that kind of hierarchy can be the genealogical tree where some people are ancestor of others (= they are hierarchically in higher position). Thus to generalize we can tell that a hierarchical data represents the data related to each other by a hierarchical relationship (e.g. "is parent of" in genealogical tree).

Thus the hierarchical query is a query able to handle hierarchical data. Oracle implements them with CONNECT BY statement. In the standard SQL the hierarchical query is executed with the help of Common Table Expression and their WITH RECURSIVE statement. More details are included in the next section.

Common Table Expression (CTE)

The CTE represents a temporary result set defined through WITH operator. This temporary set can be later referenced in one or multiple places of the query. It also contributes to improve the readability of the queries.

Hierarchical query - for loop example

An example of a hierarchical query can be a simulation of a for loop existing in many programming languages. It can be defined as follows:

WITH RECURSIVE for_loop_imitation (i) AS  (
  SELECT 0
  UNION ALL
  SELECT i+1  FROM for_loop_imitation WHERE i < 9)

SELECT * FROM for_loop_imitation;
 i
---
 0
 1
 2
 3
 4
 5
 6
 7
 8
 9
(10 rows)

As you can clearly see, the query generated the numbers from 0 to 9, exactly as the for loop could do. To understand better the hierarchical queries let's see how does it work by detailing the components of the recursive CTE:

  • SELECT 0 is the definition of the query starting point. In our case we'll assign the value 0 to the field i defined in the CTE signature (aka anchor member)
  • UNION ALL it can be thought as a marker for the beginning of the recursive execution
  • SELECT i+1 FROM for_loop_imitation defines the recursive logic to apply on each call (aka recursive member). You can correctly notice that the recursive logic references to the for_loop_imitation.
  • WHERE i < 9 it's called termination check because it determines when the recursive query execution will terminate, a little bit as < in this code for (int i = 0; i < 9; i++). If the termination check is not defined the query can fail because of too depth recursion call.

To have a clearer vision let's now see how the above query translates to the query plan (PostgreSQL case):

                                                QUERY PLAN                                     
-----------------------------------------------------------------------------------------------------------
 CTE Scan on for_loop_imitation  (cost=2.95..3.57 rows=31 width=4)
   CTE for_loop_imitation
     ->  Recursive Union  (cost=0.00..2.95 rows=31 width=4)
           ->  Result  (cost=0.00..0.01 rows=1 width=4)
           ->  WorkTable Scan on for_loop_imitation for_loop_imitation_1  (cost=0.00..0.23 rows=3 width=4)
                 Filter: (i < 9)
(6 rows)

As you can see, the output references WorkTable that represents working table where all outputs of recursive executions are stored. More concretely, the execution algorithm behaves finally as iteration rather than recursion (according to the doc, recursion was chosen to respect the SQL standards) and it executes in the following steps:

  • The anchor member is executed first and its results are put into the working table (for_loop_imitation in the previous query).
  • Next, the recursive logic is executed as long as there are matching results (= working table is not empty):
    • the results are first stored in an intermediate table
    • later these results replace the content of the working table and the intermediate table is emptied

Hierarchical queries examples

To see the hierarchical queries in more real-world examples, we'll use the case of hierarchical comments:

CREATE TABLE comments (
  id INT NOT NULL,
  parent_comment_id INT NULL,
  replied_comment_id INT NULL,
  author VARCHAR(10) NOT NULL,
  added TIMESTAMP NOT NULL,
  content TEXT NOT NULL,
  PRIMARY KEY(id)
);


INSERT INTO comments (id, parent_comment_id, replied_comment_id, author, added, content) VALUES
(1, 1, NULL, 'user_1', TIMESTAMP '2018-01-01 10:00:00', 'user_1 comment_id_1'),
(2, 1, 1, 'admin', TIMESTAMP '2018-01-01 10:01:00','admin answer to user_1 comment_id_1'),
(3, 1, 2, 'user_1', TIMESTAMP '2018-01-01 10:02:00','user_1 answer to admin comment_id_2'),
(4, 4, NULL, 'user_2', TIMESTAMP '2018-01-01 14:00:00','user_2 comment_id_5'),
(5, 5, NULL, 'user_3', TIMESTAMP '2018-01-01 14:01:00','user_3 comment_id_6'),
(6, 5, 5, 'user_1', TIMESTAMP '2018-01-01 15:00:00','user_1 answer to user_3 comment_id_6'),
(7, 5, 5, 'user_2', TIMESTAMP '2018-01-01 15:01:00','user_2 answer to user_3 comment_id_6'),
(8, 1, 3, 'admin', TIMESTAMP '2018-01-01 16:03:00','admin answer to user_1 comment_id_3'),
(9, 5, 7, 'admin', TIMESTAMP '2018-01-01 15:02:00','admin answer to user_1 comment_id_7'),
(10, 1, 8, 'admin', TIMESTAMP '2018-01-01 15:03:00','admin answer to user_2 comment_id_8'),
(11, 1, 10, 'admin', TIMESTAMP '2018-01-01 15:03:30','admin answer to admin comment_id_10'),
(12, 1, 10, 'user_3', TIMESTAMP '2018-01-01 15:03:50','user_3 answer to admin comment_id_10'),
(13, 1, 2, 'user_3', TIMESTAMP '2018-01-01 15:03:50','user_3 answer to admin comment_id_2');

As you can see, the comments can be related to each others through the replied_comment_id field. Let's start with a simple query returning the comments tree for one specific parent comment:

WITH RECURSIVE comments_tree AS (
  (
    SELECT
      id, replied_comment_id, author, added, parent_comment_id, ARRAY[id] AS path,
      content
    FROM comments WHERE id = 1
  )
  UNION ALL
  (
    SELECT
      child.id, child.replied_comment_id, child.author, child.added,
      child.parent_comment_id, (ct.path || child.id) AS path,
      LPAD(child.content, LENGTH(child.content) + CARDINALITY(path), ' ') AS content
      FROM comments_tree ct
      JOIN comments child ON child.replied_comment_id = ct.id
  )
)

SELECT id, parent_comment_id, added, replied_comment_id, content FROM comments_tree;

The generated output will be:

correlated_subqueries-# SELECT id, parent_comment_id, added, replied_comment_id, content FROM comments_tree;
 id | parent_comment_id |        added        | replied_comment_id |                  content                 
 
----+-------------------+---------------------+--------------------+------------------------------------------
-
  1 |                 1 | 2018-01-01 10:00:00 |                    | user_1 comment_id_1
  2 |                 1 | 2018-01-01 10:01:00 |                  1 |  admin answer to user_1 comment_id_1
  3 |                 1 | 2018-01-01 10:02:00 |                  2 |   user_1 answer to admin comment_id_2
 13 |                 1 | 2018-01-01 15:03:50 |                  2 |   user_3 answer to admin comment_id_2
  8 |                 1 | 2018-01-01 16:03:00 |                  3 |    admin answer to user_1 comment_id_3
 10 |                 1 | 2018-01-01 15:03:00 |                  8 |     admin answer to user_2 comment_id_8
 11 |                 1 | 2018-01-01 15:03:30 |                 10 |      admin answer to admin comment_id_10
 12 |                 1 | 2018-01-01 15:03:50 |                 10 |      user_3 answer to admin comment_id_10
(8 rows)

As you can see through this output, it returned a list of hierarchically ordered comments. The hierarchy is well visible thanks to the lpad function adding one space for every level of comments. The recursive CTE can also be applied to all the comments and create a list of all comments with nested levels:

WITH RECURSIVE comments_tree AS (
  (
    SELECT
      id, replied_comment_id, author, added, parent_comment_id,
      content
    FROM comments WHERE replied_comment_id IS NULL
  )
  UNION ALL
  (
    SELECT
      child.id, child.replied_comment_id, child.author, child.added,
      child.parent_comment_id
      LPAD(child.content, LENGTH(child.content) + CARDINALITY(path), ' ') AS content
      FROM comments_tree ct
      JOIN comments child ON child.replied_comment_id = ct.id
  )
)

-- shows all queries, how it goes down for each level
SELECT id, parent_comment_id, added, replied_comment_id, content FROM comments_tree
ORDER BY parent_comment_id ASC, replied_comment_id IS NOT NULL ASC, replied_comment_id ASC;

With the output:

 id | parent_comment_id |        added        | replied_comment_id |                  content                 
 
----+-------------------+---------------------+--------------------+------------------------------------------
-
  1 |                 1 | 2018-01-01 10:00:00 |                    | user_1 comment_id_1
  2 |                 1 | 2018-01-01 10:01:00 |                  1 |  admin answer to user_1 comment_id_1
  3 |                 1 | 2018-01-01 10:02:00 |                  2 |   user_1 answer to admin comment_id_2
 13 |                 1 | 2018-01-01 15:03:50 |                  2 |   user_3 answer to admin comment_id_2
  8 |                 1 | 2018-01-01 16:03:00 |                  3 |    admin answer to user_1 comment_id_3
 10 |                 1 | 2018-01-01 15:03:00 |                  8 |     admin answer to user_2 comment_id_8
 11 |                 1 | 2018-01-01 15:03:30 |                 10 |      admin answer to admin comment_id_10
 12 |                 1 | 2018-01-01 15:03:50 |                 10 |      user_3 answer to admin comment_id_10
  4 |                 4 | 2018-01-01 14:00:00 |                    | user_2 comment_id_5
  5 |                 5 | 2018-01-01 14:01:00 |                    | user_3 comment_id_6
  7 |                 5 | 2018-01-01 15:01:00 |                  5 |  user_2 answer to user_3 comment_id_6
  6 |                 5 | 2018-01-01 15:00:00 |                  5 |  user_1 answer to user_3 comment_id_6
  9 |                 5 | 2018-01-01 15:02:00 |                  7 |   admin answer to user_1 comment_id_7
(13 rows)

However the above output also shows some limitation of the use of recursive CTE to nested hierarchy construction. It's because this expression works on breadth-first manner instead of depth-first. That said, it returns a complete level every time instead of the complete branch. To remedy this problem we can make the same trick as in the article quoted in "Read more" section (Recursive CTEs Explained) consisting on constructing a path and ordering by it:

WITH RECURSIVE comments_tree AS (
  (
    SELECT
      id, replied_comment_id, author, added, parent_comment_id, ARRAY[id] AS path,
      content
    FROM comments WHERE replied_comment_id IS NULL
  )
  UNION ALL
  (
    SELECT
      child.id, child.replied_comment_id, child.author, child.added,
      child.parent_comment_id
      , (ct.path || child.id) AS path,
      LPAD(child.content, LENGTH(child.content) + CARDINALITY(path), ' ') AS content
      FROM comments_tree ct
      JOIN comments child ON child.replied_comment_id = ct.id
  )
)

Now the output will look like:

correlated_subqueries-# SELECT id, parent_comment_id, added, replied_comment_id, content FROM comments_tree
correlated_subqueries-# ORDER BY path ASC;
 id | parent_comment_id |        added        | replied_comment_id |                  content                 
 
----+-------------------+---------------------+--------------------+------------------------------------------
-
  1 |                 1 | 2018-01-01 10:00:00 |                    | user_1 comment_id_1
  2 |                 1 | 2018-01-01 10:01:00 |                  1 |  admin answer to user_1 comment_id_1
  3 |                 1 | 2018-01-01 10:02:00 |                  2 |   user_1 answer to admin comment_id_2
  8 |                 1 | 2018-01-01 16:03:00 |                  3 |    admin answer to user_1 comment_id_3
 10 |                 1 | 2018-01-01 15:03:00 |                  8 |     admin answer to user_2 comment_id_8
 11 |                 1 | 2018-01-01 15:03:30 |                 10 |      admin answer to admin comment_id_10
 12 |                 1 | 2018-01-01 15:03:50 |                 10 |      user_3 answer to admin comment_id_10
 13 |                 1 | 2018-01-01 15:03:50 |                  2 |   user_3 answer to admin comment_id_2
  4 |                 4 | 2018-01-01 14:00:00 |                    | user_2 comment_id_5
  5 |                 5 | 2018-01-01 14:01:00 |                    | user_3 comment_id_6
  6 |                 5 | 2018-01-01 15:00:00 |                  5 |  user_1 answer to user_3 comment_id_6
  7 |                 5 | 2018-01-01 15:01:00 |                  5 |  user_2 answer to user_3 comment_id_6
  9 |                 5 | 2018-01-01 15:02:00 |                  7 |   admin answer to user_1 comment_id_7
(13 rows)

The hierarchical queries are quite nice feature present in the RDBMS. As shown in the first section they allow to work with hierarchical data structures, as for instance trees. One of great examples illustrating it is the genealogical tree. The second section shown a simple example of the hierarchical query simulating the for loop behavior. The last part was about more realistic use-case of a website comments returned through 3 different queries: one comment query and queries showing breadth-first and depth-first result generations.

Read also about Hierarchical queries here: Recursive CTEs Explained , 7.8. WITH Queries (Common Table Expressions) .

Share, like or comment this post on Twitter: