Another way to manipulate hierarchical data in MySQL are nested sets. This approach uses an interesting technique to represent hierarchies of data.
Data Engineering Design Patterns
Looking for a book that defines and solves most common data engineering problems? I'm currently writing
one on that topic and the first chapters are already available in π
Early Release on the O'Reilly platform
I also help solve your data engineering problems π contact@waitingforcode.com π©
The first part of the article will introduce us to a new concept of hierarchical data management. It will explain in what this concept is innovative and will present a sample structure of a table. The second part will show us, how to retrieve and manipulate the data based on nested set model.
What is nested set model ?
Why the nested set is revolutionary ? It's because of its way of thinking about hierarchical data. In fact, it doesn't consider the data as a tree but as a nested set. Take a look of the picture and the SQL table with family members :
Parents
Daughter
Son
This is a translation of this image on SQL language :
CREATE TABLE IF NOT EXISTS `family` ( `id` smallint(5) NOT NULL AUTO_INCREMENT, `role` varchar(30) NOT NULL, `lft` smallint(5) NOT NULL, `rgt` smallint(5) NOT NULL, PRIMARY KEY (`id`) ); -- add some family members -- parents INSERT INTO family (id, role, lft, rgt) VALUES (NULL, 'parents', 1, 28); -- daughter INSERT INTO family (id, role, lft, rgt) VALUES (NULL, 'daughter', 2, 15 ); -- son INSERT INTO family (id, role, lft, rgt) VALUES (NULL, 'son', 16, 27); -- grandchildren INSERT INTO family (id, role, lft, rgt) VALUES (NULL, 'son', 3, 8); INSERT INTO family (id, role, lft, rgt) VALUES (NULL, 'son', 9, 14); INSERT INTO family (id, role, lft, rgt) VALUES (NULL, 'son', 17, 18); INSERT INTO family (id, role, lft, rgt) VALUES (NULL, 'daughter', 19, 26); -- great grandchildren INSERT INTO family (id, role, lft, rgt) VALUES (NULL, 'daughter', 4, 5); INSERT INTO family (id, role, lft, rgt) VALUES (NULL, 'son', 6, 7); INSERT INTO family (id, role, lft, rgt) VALUES (NULL, 'daughter', 10, 11); INSERT INTO family (id, role, lft, rgt) VALUES (NULL, 'son', 12, 13); INSERT INTO family (id, role, lft, rgt) VALUES (NULL, 'daughter', 20, 21); INSERT INTO family (id, role, lft, rgt) VALUES (NULL, 'daughter', 22, 23); INSERT INTO family (id, role, lft, rgt) VALUES (NULL, 'daughter', 24, 25);
We observe that almost every family member contains another one, and the another one contains another one and so on. You note that next to each member we put the numbers (columns lft and rgt). These numbers are growing up at each level. They serve to identify the groups belonging to the parent. The numbers at the left and at the right of every item can be used, for example, to retrieve the family members with SQL's BETWEEN clause :
SELECT * FROM family WHERE lft BETWEEN 2 AND 15;
This query returns the family tree of the parent's daughter. Now we can compare it with the query from the article about managing hierarchical data in adjacency list.
Retrieving a data in nested set model
Full tree construction can be done at the same manner as in adjacency list model. But the way of getting a node of one branch changes. It's simpler to do it in nested set model. We don't have to base our SQL query on JOINs anymore. Thanks to lft and rgt columns which fix the size of a node, we can use a simple query with BETWEEN clause.
We did it previously very shortly. Here we will present in more on details. We want to get all family of the parent's daughter side. Code to use should be :
SELECT children.* FROM family parent JOIN family children ON children.lft BETWEEN parent.lft AND parent.rgt WHERE parent.id = 3;
Simple and thick, isn't it ? Everything what we need to know is the id of the root node.
Adding new data in nested set model
From both queries, the query based on nested set is simpler to maintain. It's also simpler to extend. Imagine that you have to add a new child for a person whose children (daughter and son) have lft and rgt columns between 4 and 7. They are only 2 things to do when we add a new node in nested set model :
- generate lft and rgt numbers
- update all lft and rgt numbers greater than generated rgt by adding 2
Let's do them one by one. First, we generate lft and rgt values for the new node. We retrieve the greater number of rgt column of concerned row. The query which we can use for that is :
SELECT MAX(rgt) FROM family WHERE id = 9; -- returns 7
Query returns 7. We already know that the next inserted row will store 8 and 9 as lft and rgt columns values. The generated insert query is following (but don't execute it !) :
INSERT INTO family (id, role, lft, rgt) VALUES (NULL, 'son', 8, 9);
Before executing the insert query, we have to update already existing rows by increasing theirs rgt and lft values by 2. Only the rows with lft or rgt column bigger than 7 will be updated :
UPDATE family SET rgt = rgt + 2 WHERE rgt > 7; UPDATE family SET lft = lft + 2 WHERE lft > 7;
Now we can execute the insert query and check if it works. To check it, we can prepare a simple query susceptible to retrieve added row. In our case, we want to get whole genealogical tree on the side of parent's daughter. Used query is following :
SELECT * FROM family WHERE lft BETWEEN 3 AND 10;
New son should appear at the returned list of results :
+----+----------+-----+-----+ | id | role | lft | rgt | +----+----------+-----+-----+ | 4 | son | 3 | 10 | | 8 | daughter | 4 | 5 | | 9 | son | 6 | 7 | | 15 | son | 8 | 9 | +----+----------+-----+-----+
Deleting a row from nested set model
But not only adding a new row is simple. The deleting is quite simple too. Unlike adjacency model, we don't have to retrieve all concerned nodes before executing the delete statement (version with MyISAM engine which doesn't support cascade operations on foreign keys). The steps to follow are :
- get lft and rgt values concerned by the delete
- calculate the depth of deleted node
- execute the delete query
- update all depending rows
The second and last point might appear as difficult. But let's make everything in order :
- get lft and rgt values concerned by the delete
- calculate the depth of deleted node
We can make this both steps in one step with this query :
SELECT @lft := lft, @rgt := rgt, @depth := rgt - lft + 1 FROM family WHERE id = 3
The lft and rgt values are set directly in SELECT statement (@lft :=, @rgt :=). It avoids us to write a client-side code to handle the deleting. You can write it in your own, but for the reasons of consistency, we avoid it here.
Why we have to calculate this depth ? The depth helps us to generate the new values of lft and rgt columns and maintain the data consistency.
- execute the delete query
DELETE FROM family WHERE lft BETWEEN @lft AND @rgt;
Nothing special. A simple DELETE statement with values got in previous query.
- update all depending rows
UPDATE family SET rgt = rgt - @depth WHERE rgt > @rgt; UPDATE family SET lft = lft - @depth WHERE lft > @lft;
Now we can use our @depth variable to shift every rows that rgt is bigger than @rgt or lft is bigger than @lft. Thanks to it we avoid the situations where nested set contains one or more holes. A hole appears when two following rgt and lft values aren't incremented by 1. For example : we have a row with lft = 4 and rgt = 5. We expect that we'll find a row with lft = 6. But this row has lft = 7 instead of 6. The absent 6 value is here considered as a "hole".
Let's check how it works. Firstly, make a SELECT query to see which rows will be deleted :
SELECT children.* FROM family parent JOIN family children ON children.lft BETWEEN parent.lft AND parent.rgt WHERE parent.id = 3;
After that, execute the previous queries one after one. At the end, select all rows and check if the "holes" don't exist. They are a full code snippet to use :
SELECT @lft := lft, @rgt := rgt, @depth := rgt - lft + 1 FROM family WHERE id = 3; DELETE FROM family WHERE lft BETWEEN @lft AND @rgt; UPDATE family SET rgt = rgt - @depth WHERE rgt > @rgt; UPDATE family SET lft = lft - @depth WHERE lft > @lft; SELECT * FROM family ORDER BY lft ASC;
If everything go well, you should see the following results :
mysql> SELECT * FROM family ORDER BY lft ASC; +----+----------+-----+-----+ | id | role | lft | rgt | +----+----------+-----+-----+ | 1 | parents | 1 | 18 | | 2 | daughter | 2 | 17 | | 4 | son | 3 | 10 | | 8 | daughter | 4 | 5 | | 9 | son | 6 | 7 | | 15 | son | 8 | 9 | | 5 | son | 11 | 16 | | 10 | daughter | 12 | 13 | | 11 | son | 14 | 15 | +----+----------+-----+-----+ 9 rows in set (0.00 sec)
In this article we saw that another way of thinking trees in SQL exists. Instead of making a flat relation based on one parent column (adjacency list), we used here a set whose construction is based on the maximum left and right values. Thanks to it the data reading is simplified. Data manipulation is based on this maximum values too. When we insert or delete an existen row, we have to shift some of rows to right or to left side.