Database Table design. It is not best to have only the most suitable (adjacent table, Path Enumeration, nested set, closure table), enumeration packet table
When designing a database, do we break through the general rules and find the most suitable design scheme? The following is an example: A parent_id field will be added to the commonly used adjacent table design, for example, the region table (country, province, city, district ):
CREATE TABLE Area ([id] [int] NOT NULL,[name] [nvarchar] (50) NULL,[parent_id] [int] NULL,[type] [int] NULL );
Name: name of the region, parent_id is the parent ID, Province parent ID is the country, city parent ID is the province, and so on. Type is the class of the region: 1: country, 2: province, 3: City, 4: The level is determined, so there is no problem in designing the table, it is convenient to call. However, this design method does not meet all requirements. When we are not sure about the hierarchy, let's assume that I have the following comment structure: record the comment data (comments table) in the adjacent table ):
Comment_id |
Parent_id |
Author |
Comment |
1 |
0 |
James |
I don't quite agree with this point of view |
2 |
1 |
Xiao Zhang |
I do not agree. |
3 |
2 |
Xiaohong |
I agree upstairs |
4 |
1 |
Xiaoquan |
Why do you disagree? |
5 |
4 |
James |
I have encountered this situation before. |
6 |
5 |
Xiao Zhang |
That doesn't mean you are right. |
7 |
5 |
Xiaoxin |
This depends on the situation. |
No one found that it is difficult to query all the descendants of a node in this design table. You can use association query to obtain a comment and its descendants:
SELECT c1.*, c2.* FROM comments c1 LEFT OUTER JOIN comments c2 ON c2.parent_id = c1.comment_id;
However, this query can only obtain two layers of data. The feature of this tree is that it can be expanded in any depth. You need to have corresponding methods to obtain its deep data. For example, you may need to calculate the number of comment branches or the total overhead of a mechanical device. In some cases, the use of an adjacent table in a project applies. The advantage of the adjacent table design is that it can quickly obtain the direct parent and child nodes of a given node, and it is easy to insert new nodes. If you want to perform all operations on the layered data in your project, you can use the adjacent table. There are several solutions to the above tree model: Path Enumeration, Nested Sets, and closure tables. These solutions usually look much more complex than the neighboring tables, but they do make some operations that use the adjacent tables more complex or inefficient. If your project does need to provide these operations, these designs will be a better choice for the adjacent tables. I. path Enumeration in the comments table, we use the path FIELD OF THE varchar type to replace the original parent_id field. The content stored in this path field is the sequence from the top-level ancestor of the current node to its own. Just like a UNIX path, you can even use '/' as the path separator.
Comment_id |
Path |
Author |
Comment |
1 |
1 |
James |
I don't quite agree with this point of view |
2 |
1/2 |
Xiao Zhang |
I do not agree. |
3 |
1/2/3 |
Xiaohong |
I agree upstairs |
4 |
1/4 |
Xiaoquan |
Why do you disagree? |
5 |
1/4/5 |
James |
I have encountered this situation before. |
6 |
1/4/5/6 |
Xiao Zhang |
That doesn't mean you are right. |
7 |
1/4/5/7 |
Xiaoxin |
This depends on the situation. |
You can query a node ancestor by comparing the paths of each node. For example, to find comment #7, the path is the ancestor of 1/4/5/7, you can do this:
SELECT * FROM comments AS c WHERE '1/4/5/7' LIKE c.path || '%' ;
The query statement matches the nodes with the paths 1/4/5/%, 1/4/%, and 1/%. These nodes are the ancestor of comment #7. You can also query all the descendants of a given node by swapping the parameters on both sides of the LIKE keyword. For example, to query comments #4 and all descendants whose path is '123', you can use the following statement:
SELECT * FROM comemnts AS c WHERE c.path LIKE '1/4' || '%' ;
The following background paths can be found in the query statement: 1/4/5, 1/4/5/6, and 1/4/5/7. Once you can easily obtain a subtree or a path from a descendant node to an ancestor node, you can easily implement more queries, for example, you can query the sum of values on all nodes of a subtree. Inserting a node can be as simple as using an adjacent table. All you need to do is copy the Father's Day path of the node to be inserted and append the ID of the new node to the end of the path. Path Enumeration also has some disadvantages. For example, the database cannot ensure that the path format is always correct or that the nodes in the path do exist. It depends on the logic code of the application program to maintain the path string, and it is costly to verify the correctness of the string. No matter how large the varchar length is, the length limit still exists. Therefore, the tree structure cannot be infinitely expanded.
Ii. Nested SetsThe nested set solution stores information about child nodes, rather than the direct ancestor of nodes. We use two numbers to encode each node to indicate this information. These two numbers can be called nsleft and nsright. Each node determines the value of nsleft and nsright in the following way: the value of nsleft is smaller than the IDs of all future generations of the node, and the value of nsright is greater than the IDs of all future generations of the node. These numbers are not associated with the value of comment_id. The simple method for determining these three values (nsleft, comment_id, nsright) is to perform a depth-first traversal on the tree, and incrementally allocate the nsleft value in the layer-by-layer in-depth process, the nsright values are sequentially allocated when the return result is returned. The data obtained is as follows:
Comment_id |
Nsleft |
Nsright |
Author |
Comment |
1 |
1 |
14 |
James |
I don't quite agree with this point of view |
2 |
2 |
5 |
Xiao Zhang |
I do not agree. |
3 |
3 |
4 |
Xiaohong |
I agree upstairs |
4 |
6 |
13 |
Xiaoquan |
Why do you disagree? |
5 |
7 |
12 |
James |
I have encountered this situation before. |
6 |
8 |
9 |
Xiao Zhang |
That doesn't mean you are right. |
7 |
10 |
11 |
Xiaoxin |
This depends on the situation. |
Once you assign these numbers to each node, you can use them to find the ancestor and descendant of the specified node. For example, to search for comments #4 and all its descendants, you can search for the node IDs in the range of nsleft and nsright of comment #4. For example:
SELECT c2.* FROM comments AS c1 JOIN comments AS c2 ON c2.nsleft BETWEEN c1.nsleftAND c1.nsright WHERE c1.comment_id = 4;
For example, you can search for comments #6 and all its ancestors by searching for the nsleft and nsright ranges of nodes with the ID of #6. For example:
SELECT c2.* FROM comments AS c1 JOIN comments AS c2 ON c1.nsleft BETWEEN c2.nsleftAND c2.nsright WHERE c1.comment_id = 6;
The main advantage of nested set design is that when you want to delete a non-leaf node, its descendant will automatically replace the deleted node and become the direct descendant of its direct ancestor node. That is to say, the layer has been automatically reduced. However, some queries are very simple in the design of the adjacent table. For example, getting the direct father of a node or the direct descendant of a node will become more complicated in the design of the nested set. In the nested set, if you need to query the direct father of a node, we will do this. For example, to find the direct father of comment #6:
SELECT parent.* FROM comments AS c JOIN comments AS parent ON c.nsleft BETWEEN parent.nsleft AND parent.nsrightLEFT OUTER JOIN comments AS in_between ON c.nsleft BETWEEN in_between.nsleft AND in_between.nsrightAND in_between.nsleft BETWEEN parent.nsleft AND parent.nsright WHERE c.comment_id = 6AND in_between.comment_id IS NULL;
In short, it is complicated. Tree operations, such as inserting and moving nodes, are much more complex than other designs to use nested assembly. When you insert a new node, You need to recalculate the adjacent sibling nodes, ancestor nodes, and the brothers of the newly inserted nodes, to make sure that their left and right values are greater than the left values of the new node. At the same time, if this new node is a non-leaf node, you also need to check its child nodes. If simple and fast query is the most important part of the program, the nested set is the best choice, which is much easier and quicker than operating on individual nodes. However, the insertion and movement of Nested Sets are complicated because you need to re-allocate left and right values. If your application needs to frequently insert or delete nodes, the nested set may not be suitable.
Iii. Closure tableA closure table is a simple and elegant solution for Hierarchical Storage. It records the relationships between all nodes in the tree, not just those direct parent and child nodes. When designing the comment system, we created an additional table named tree_paths, which contains two columns, each of which points to the foreign key in comments. Instead of using the comments table to store the tree structure, we store any node pairs in the tree that have (ancestor-descendant) relationships in the treepaths table, even if the two nodes do not have a direct parent-child relationship, we also add a line pointing to the node itself.
Ancestor |
Descendant |
Ancestor |
Descendant |
Ancestor |
Descendant |
Ancestor |
Descendant |
1 |
1 |
1 |
7 |
4 |
6 |
7 |
7 |
1 |
2 |
2 |
2 |
4 |
7 |
|
|
1 |
3 |
2 |
3 |
5 |
5 |
|
|
1 |
4 |
3 |
3 |
5 |
6 |
|
|
1 |
5 |
4 |
4 |
5 |
7 |
|
|
1 |
6 |
4 |
5 |
6 |
6 |
|
|
Using the treepaths table to obtain the ancestor and descendant is more direct than using the nested set. For example, to obtain the descendant of comment #4, you only need to search the row where the ancestor is comment #4 in the treepaths table. The same is true for retrieving future generations. To insert a new leaf node, such as a subnode of comment #6, you should first insert a link between yourself and yourself, and then search for the child node of comment #6 in the treepaths table, add the "First Ancestor and one descendant" Relationship Between the node and the new inserted node (the ID of the new node should be 8 ):
INSERT INTO treepaths (ancestor, descendant)SELECT t.ancestor, 8FROM treepaths AS tWHERE t.descendant = 6UNION ALL SELECT 8, 8;
To delete a leaf node, such as comment #7, delete all rows whose descendants are comments #7 in the treepaths table:
DELETE FROM treepaths WHERE descendant = 7;
To delete a complete subtree, such as comment #4 and all its descendants, you can delete all rows whose descendant is #4 in the treepaths table, and the rows that use comments #4 as descendants. The design of a closure table is more direct than that of a nested set. Both of them can quickly query the ancestor and descendant of a given node, but the closure table can maintain hierarchical information more simply. These two designs make it easier to query the direct descendant and ancestor of a given node than to use the adjacent table or Path Enumeration. However, you can optimize the closure table to make it easier to query Father's Day points or subnodes: Add a path_length field to the treepaths table. The self-referenced path_length of a node is 0, and the path_length of its child node is 1, the next layer is 2, and so on. This makes query much easier.
Conclusion: Which design should you use:Each type of design has its own advantages and disadvantages. How to choose a design depends on which operation of the application is your most need for performance optimization.
Solution |
Table quantity |
Subquery |
Query tree |
Insert |
Delete |
Integrity of reference |
Adjacent table |
1 |
Simple |
Difficult |
Simple |
Simple |
Yes |
Enumeration path |
1 |
Simple |
Simple |
Simple |
Simple |
No |
Nested set |
1 |
Difficult |
Simple |
Difficult |
Difficult |
No |
Closure table |
2 |
Simple |
Simple |
Simple |
Simple |
Yes |
Hierarchical Data design comparison 1. The adjacent table is the most convenient design, and many programmers know about it 2. If your database supports recursive queries WITH or connect by prior, this makes the query of the adjacent table more efficient. 3. Enumeration paths can intuitively show paths between the ancestor and the descendant, but this design is very fragile because it cannot ensure the integrity of the reference. Enumeration paths also make data storage redundant. 4. nested set is a smart solution, but it may be too intelligent. It cannot ensure the integrity of reference. It is best to use a query with high performance requirements and other requirements. 5. Closure tables are the most common design, and the above solution only allows a node to belong to multiple trees. It requires an additional table to store the relationship, and uses the space-for-time solution to reduce the consumption caused by redundant computing during the operation. These design schemes are only part of our daily design. More options will be available for development. Which solution is the most suitable one based on the actual needs of your project and the advantages and disadvantages of the solution. I met some developers who, in order to be perfunctory, only considered whether the current task could be completed when designing database tables. They did not pay much attention to future expansion issues and did not consider whether the query would consume performance. When there may be a small amount of data in the early stage, there is no impact, but a little more data is obvious (for example, you can use external join queries, but instead use subqueries ). I think designing a database is a very interesting and challenging job. It sometimes shows how broad your vision is. Sometimes it can make you sleep. In short, it hurts and is happy.