Logical database Design-simple tree (recursive relational data)

Last Update:2016-04-02 Source: Internet

Author: User

Tags closure joins

Developer on Alibaba Coud: Build your first app with APIs, SDKs, and tutorials on the Alibaba Cloud. Read more ＞

I believe that the development experience of friends have encountered such a demand. Suppose you are developing a comment function for a news site, and readers can comment on the original text or even reply to each other.

This need is not simple, mutual reply will lead to an infinite number of branches, an infinite number of ancestor-descendant relationships. This is a typical recursive relational data.

For this problem, here are a few solutions, you can choose objectively.

First, adjacency table: Dependent parent Node

The scenario for the adjacency table is as follows (just stating the problem):

CREATE TABLE Comments (commentid int PK, ParentID int,--Record parent node ArticleID int, Commentbody nvarchar (500 ), FOREIGN key (ParentID)  REFERENCES Comments (commentid)-self-connected, primary key foreign keys are in their own table FOREIGN key (ArticleID) REFERENCES A Rticles (ArticleID))

As a result of laziness, so the use of books in the picture, Bugs is articles:

This type of design is called adjacency table. This is probably the most common scenario for storing hierarchical data.

Here is some data to show the hierarchical structure data in the comment table. Sample table:

Picture Description Storage Structure:

　　Analysis of the optimal absence of adjacency table

For the adjacency table above, many programmers have used it as a default solution, but even so, it still has problems in the past.

Analysis 1: Query all descendants of a node (seeking subtree) How to check?

Let's take a look at the SQL statements that previously queried the two tiers of data:

SELECT c1.*,c2.* from Comments C1 left OUTER joins COMMENTS2 C2 on C2. ParentID = C1.commentid

Obviously, each need to check more than one layer, you need to join one more table. The number of joins for SQL queries is limited, so it is not possible to get all the descendants infinitely deep. Moreover, such a junction, the execution of aggregate functions such as count () is also very difficult.

Said it was before, now what time, after SQL Server 2005, a common table expression is done, incidentally, also solve the problem of the aggregation function (aggregate functions such as count () can also be simple and practical), such as query comment 4 All child nodes:

With Comment_cte (Commentid,parentid,commentbody,tlevel) as (    --Basic statement    SELECT Commentid,parentid,commentbody, 0 as Tlevel from Comment    WHERE parentid = 4    UNION all  --recursive statement    SELECT c.commentid,c.parentid,c. Commentbody,ce.tlevel + 1 from Comment as C     INNER JOIN comment_cte as CE    --recursive query on    c.parentid = CE. Commentid) SELECT * from Comment_cte

The results appear as follows:

　　So how do you look up the tree of ancestor nodes? For example, check all ancestor nodes of node 6:

With Comment_cte (Commentid,parentid,commentbody,tlevel) as (    --Basic statement    SELECT Commentid,parentid,commentbody, 0 as Tlevel from Comment    WHERE commentid = 6    UNION all    SELECT c.commentid,c.parentid,c.commentbody, Ce.tlevel-1 from  Comment as C     INNER JOIN comment_cte as CE-recursive query on    CE. ParentID = C.commentid    where CE. Commentid <> CE. ParentID) SELECT * from Comment_cte ORDER by Commentid ASC

The results are as follows:

Furthermore, because common table expressions can control the depth of recursion, you can easily get a subtree of any hierarchy.

OPTION (maxrecursion 2)

It seems that brother was vindicated for the adjacency table.

Analysis 2: of course, adjacency tables also have advantages, such as the convenience of adding a record.

INSERT into Comment (Articleid,parentid)    ... --just need to provide the parent node ID to be able to add.

Analysis 3: It is also easy to modify the location of a node or a subtree.

UPDATE Comment SET parentid = ten WHERE Commentid = 6-only modifies the parentid of a node, and the descendant nodes behind it are automatically justified.

　　Analysis 4: Deleting subtrees

Imagine if you delete an intermediate node, then what happens to the child nodes of that node (whose parent node is), so if you want to delete an intermediate node, you have to find all the descendants and delete them before you can delete the middle node.

Of course, this process can also be done automatically through a FOREIGN key constraint on the DELETE cascade Cascade.

Analysis 5: Delete Intermediate nodes and promote child nodes

In the face of the ascending child node, we must first modify the parentid of the direct child node of the intermediate node before we can delete the node:

SELECT parentid from Comments WHERE commentid = 6;　　  --Search for the parent node of the node to be deleted, assuming 4 UPDATE Comments SET parentid = 4 WHERE parentid = 6;　　　　　　--Modify the parentid of the child nodes of the intermediate node to delete the ParentID delete from Comments WHERE Commentid = 6;    --You can finally delete the middle node.

As can be seen from the above analysis, the adjacency table is basically already very powerful.

Second, Path enumeration

The design of a path enumeration is defined by combining information from all ancestors into a single string and saving it as a property of each node.

A path enumeration is a complete path that consists of contiguous direct hierarchical relationships. such as "/home/account/login", where home is the direct father of account, which means that home is the ancestor of login.

There is also an example of a news review, where we use path enumeration instead of adjacency table design:

CREATE TABLE Comments (commentid int PK, Path varchar (100),--just changed the field and deleted the foreign key ArticleID int, Comment Body nvarchar (+), FOREIGN KEY (ArticleID) REFERENCES articles (ArticleID))

A brief description of the problem is provided in the following data sheet:

Commentid Path Commentbody

1 1/What is the cause of this bug?

2 1/2/I think it's a null pointer.

3 1/2/3 No, I checked.

4 1/4/We need to check for invalid input.

5 1/4/5/Yes, that's a problem.

6 1/4/6/Well, check it out.

7 1/4/6/7/solved the

　　Advantages of PATH enumeration:

For the table above, let's say we need to query all the ancestors of a node, and the SQL statement can write this (assuming all ancestor nodes of query 7):

SELECT * from Comment as Cwhere ' 1/4/6/7/' like C.path + '% '

The results are as follows:

Suppose we want to query all descendants of a node, assuming 4 descendants:

SELECT * from Comment as cwhere c.path like ' 1/4/% '

The results are as follows:

Once we can easily get a subtree or a path from a descendant node to an ancestor node, we can easily implement more queries, such as counting the number of all nodes in a word (count aggregate function)

Inserting a node can also be as simple as using an adjacency table. You can insert a leaf node without modifying any of the other rows. All you need to do is copy a logical Father node path to the node to be inserted and append the ID of the new node to the end of the path. If this ID is generated by the database when it is inserted, you may need to insert this record first, then get the ID of the record and update its path.

　　Disadvantages of Path enumeration:

1, the database does not ensure that the path is always the correct format or the node in the path does exist (the middle node is deleted, no foreign KEY constraints).

2. It is expensive to rely on high-level programs to maintain strings in the path, and to verify the correctness of the strings.

3, the length of varchar is difficult to determine. Regardless of the length of the varchar is set to how large, there is a situation can not be infinitely extended.

Path enumeration is designed to be easily sorted according to the hierarchy of nodes, because the distance between nodes in the path is always 1, so you can know the depth of the hierarchy by comparing the length of the string.

Three, nested sets

A nested set solution is the information that stores the descendant nodes, not the direct ancestors of the nodes. We use two numbers to encode each node to represent this information. These two numbers can be referred to as Nsleft and Nsright.

Or with the news-comments above as an example, the way tables for nested sets can be designed to:

CREATE TABLE Comments (commentid int PK, Nsleft int,--A previous parent node       nsright int,--became two ArticleID int, Commentbody nvarchar (+), FOREIGN KEY (ArticleID) REFERENCES articles (ArticleID))

Determination of the Nsleft value: The value of Nsleft is less than the ID of all descendants of that node.

Determination of the Nsright value: The value of Nsright is greater than the ID of all descendants of that node.

Of course, the above two numbers and the value of the Commentid have no association, the way to determine the value is to do a depth-first traversal of the tree, in the process of layer by step in order to incrementally allocate the value of Nsleft, and on return, the value of the nsright is incrementally assigned.

Use the diagram in the book to illustrate the situation:

Once you assign these numbers to each node, you can use them to find the ancestors and descendants of a given node.

　　Advantages of nested sets:

I think it is the only advantage to query the ancestor tree and sub-tree convenient.

For example, by searching for those nodes, the Conmentid between Nsleft and Nsright in Comment 4 can be obtained with all its descendants:

SELECT c2.* from Comments as C1   joins Comments as C2 on cs.neleft between C1.nsleft and C1.nsright WHERE C1. Commentid = 1;

The results are as follows:

Comment 6 and all its ancestors can be obtained by searching for the ID of the comment 6 between the nsleft and nsright ranges of the nodes:

SELECT c2.* from Comment as C1 joins Comment as C2 on c1.nsleft between C2.nsleft and c2.nsright WHERE c1.commentid = 6;

The design of this nesting set also has the advantage that when you want to delete a non-leaf node, its descendants automatically replace the deleted node, which is referred to as the direct descendant of its immediate ancestor node.

Nested set design does not have to preserve hierarchical relationships. Therefore, when a node is deleted resulting in a discontinuous value, it does not have any effect on the structure of the tree.

Nesting Set Disadvantages:

1, inquires the direct father.

In the design of nested sets, the idea of this requirement is that the direct father of a given node C1 is an ancestor of this node, and there should be no other nodes between the two nodes, so you can query a node with a recursive outer join, which is the ancestor of C1 and the descendant of the other node Y. Then we make Y=x query, until the query returns NULL, that is, there is no such node, at this time y is the direct Father node of C1.

For example, to find the direct parent node of comment 6: Frankly, the SQL statement is long and smelly, the line must be OK, but I really can't write.

2, the operation of the tree, such as inserting and moving nodes.

When inserting a node, you need to recalculate the neighboring sibling nodes of the new insertion node, the ancestor nodes, and the siblings of its ancestor nodes to ensure that their left and right values are larger than the value of the new node. Also, if the new node is a non-leaf node, you should also check its descendant nodes.

That's enough, that's enough. It is difficult to check the direct parent node, this thing is very unpopular. I'm sure I won't use this design.

Four, closure of the table

The closure table is a simple and elegant solution for tiered storage, which records all the node relationships in a table and is not just a direct parent-child relationship.
In the design of the closure table, an additional treepaths table (space for time) is created, which contains two columns, each of which is a foreign key that points to commentid in comments.

CREATE TABLE Comments (commentid int PK, ArticleID int, commentbody int, FOREIGN KEY (ArticleID) REFERENCES articles (Id ))

Parent-child Relationship table:

CREATE TABLE treepaths (ancestor    int, descendant int, PRIMARY key (Ancestor,descendant),    --composite primary key FOREIGN key (an cestor) REFERENCES Comments (Commentid), FOREIGN KEY (descendant) REFERENCES Comments (Commentid))

In this design, the comments table will no longer store the tree structure, but instead stores the ancestor-descendant relationship in the book as a row of treepaths, even if the two nodes are not direct parent-child relationships, but also add a line to the node itself, can not understand? Is that the Treepaths table stores records of all ancestor-descendant relationships. Such as:

Comment Table:

Treepaths table:

Advantages:

　　1. Query all descendant nodes (Chazi):

SELECT c.* from Comment as C    INNER joins treepaths t on c.commentid = t.descendant    WHERE t.ancestor = 4

The results are as follows:

　　2. Check all ancestors of review 6 (check ancestral tree):

SELECT c.* from Comment as C    INNER joins treepaths t on c.commentid = t.ancestor    WHERE t.descendant = 6

The results appear as follows:

3. Insert a new node:

To insert a new leaf node, you should first insert yourself into your own relationship, and then search the Treepaths table for descendants that are comments 5, increasing the node's "ancestor-descendant" relationship with the new node to be inserted.

For example, the following is a Treepaths table statement that inserts a child node of comment 5:

INSERT into Treepaths (ancestor,descendant)    SELECT t.ancestor,8 from    treepaths as t    WHERE t.descendant = 5< C3/>union all    SELECT 8,8

After execution:

As for the comment watch, it's simply not going to be said.

　　4. Delete the leaf node:

For example, to delete a leaf node 7, delete all rows in the Treepaths table with descendants 7:

DELETE from treepaths WHERE descendant = 7

　　5. Delete Subtree:

　　To delete a complete subtree, such as comment 4 and all its descendants, you can delete all rows in the Treepaths table with descendants of 4, and those that are descendants of the descendants of comment 4:

DELETE from Treepaths where descendant in (SELECT descendant from treepaths where ancestor = 4)

In addition, moving nodes, first disconnecting from the original ancestor, and then the new node to establish a relationship with the SQL statement is not difficult to write.

In addition, the closure table can also be optimized, such as adding a path_length field, self-reference to 0, the direct child node is 1, then a lower layer of 2, once and so on, the query directly from the child node becomes very simple.

Summarize

In fact, in the past work, have seen different types of design, adjacency table, Path enumeration, adjacency table path enumeration together have seen.

Each design has its merits and demerits, if the choice of design depends on which operation in the application is most needed for performance optimization.

Here's a table to show how easy it is to design:

Design	Number of tables	Query Sub	Query tree	Insert	Delete	Referential integrity
adjacency table	1	Simple	Simple	Simple	Simple	Is
Enumerate paths	1	Simple	Simple	Simple	Simple	Whether
Nested sets	1	Difficult	Simple	Difficult	Difficult	Whether
Closed-Bag Table	2	Simple	Simple	Simple	Simple	Is

1, adjacency table is the most convenient design, and many software developers are aware of it. And with the help of recursive query, it makes the query of adjacency table more efficient.

2. The enumeration path can visually show the path between ancestors and descendants, but because the referential integrity is not ensured, this design is more fragile. The enumeration path also makes the storage of the data redundant.

3. Nested sets are a smart solution, but they are not guaranteed referential integrity and can only be used for queries with high performance requirements, while others are generally required.

4, the closure table is the most common design, and the most flexible, easy to expand, and a node can belong to more than one tree, can reduce the computational time of redundancy. But it requires an extra table to store the relationship, which is a scheme for the space to be exchanged for time.

Logical database Design-simple tree (recursive relational data)

This article is an English version of an article which is originally in the Chinese language on aliyun.com and is provided for information purposes only. This website makes no representation or warranty of any kind, either expressed or implied, as to the accuracy, completeness ownership or reliability of the article or any translations thereof. If you have any concerns or complaints relating to the article, please send an email, providing a detailed description of the concern or complaint, to info-contact@alibabacloud.com. A staff member will contact you within 5 working days. Once verified, infringing content will be removed immediately.

A Free Trial That Lets You Build Big!

Start building with 50+ products and up to 12 months usage for Elastic Compute Service

Get Started for Free

Sales Support

1 on 1 presale consultation

Chat Contact Sales
After-Sales Support

24/7 Technical Support 6 Free Tickets per Quarter Faster Response

Open a Ticket
Alibaba Cloud offers highly flexible support services tailored to meet your exact needs.

Learn More