http://blog.csdn.net/v_JULY_v/article/details/6530142
In 1984, Guttman of the University of California, Berkeley, published a paper titled "R-Trees: A Dynamic Index Structure for Spatial Searching", which introduced the R-tree, a data structure for storing data in high-dimensional space. This article is based on that paper, so if you are very interested in the R-tree, I think it is best to read the original :). Out of respect for the author, here is the citation first:
Guttman, A.; "R-TREES:A dynamic index structure for spatial searching," ACM, 1984, 14
The R-tree has had a significant impact in fields such as databases, where it handles searching in high-dimensional space very well. Here is a real-world problem an R-tree can solve: find all the restaurants within 20 miles. What would you do without an R-tree? Typically, we would split a restaurant's coordinates (x, y) into two database fields, one recording the longitude and the other the latitude. We would then have to traverse every restaurant, fetch its location, and calculate whether it meets the requirement. If a region has 100 restaurants, that means 100 position calculations, and for a large database such as Google Maps this approach is clearly infeasible.
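To make the cost of the naive approach concrete, here is a minimal sketch of the full scan described above. The restaurant names and coordinates are made up for illustration, and distances are treated as plain Euclidean distances on a flat plane rather than real longitude and latitude.

```python
import math

# Hypothetical sample data: (name, x, y), with coordinates in miles on a flat plane.
restaurants = [
    ("Noodle House", 1.0, 2.0),
    ("Dumpling Bar", 15.0, 12.0),
    ("Far Away Grill", 40.0, 3.0),
]

def within_radius(records, cx, cy, radius):
    """Return the names of all records within `radius` of the point (cx, cy).

    Every record is examined, so the cost grows linearly with the table size.
    """
    hits = []
    for name, x, y in records:
        if math.hypot(x - cx, y - cy) <= radius:
            hits.append(name)
    return hits

print(within_radius(restaurants, 0.0, 0.0, 20.0))  # ['Noodle House', 'Dumpling Bar']
```

Each query touches every row, which is exactly the work an index is meant to avoid.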
The R-tree solves this high-dimensional spatial search problem very well. It extends the idea of the B-tree to multidimensional space: it adopts the B-tree's approach of partitioning the space, and it merges and splits nodes during insertion and deletion to keep the tree balanced. The R-tree is therefore a balanced tree used to store high-dimensional data.
Next, this article describes the data structure of the R-tree and its operations in detail. For extensions of the R-tree and its performance, consult the relevant papers.
Data structure of R-Tree
As mentioned above, the R-tree is an extension of the B-tree to high-dimensional space and is a balanced tree. Each leaf node of an R-tree contains multiple pointers to different data, which can be stored on disk or in memory. Given this structure, when we need to run a high-dimensional spatial query, we only have to follow the pointers contained in a few leaf nodes and check whether the data they point to meets the requirement. This lets us get the answer without traversing all of the data, which improves efficiency significantly. Figure 1 is a simple example of an R-tree:
As we said above, the R-tree uses the idea of partitioning space. How is this idea realized? The R-tree uses a device called the MBR (Minimal Bounding Rectangle), which I translate as the "minimum bounding rectangle". Starting from the leaf nodes, rectangles are used to frame the space; the higher the node, the larger the space it frames, and in this way the space is partitioned. A little confused? Never mind, keep reading. I would also like to mention that the R in R-tree should stand for rectangle (see the Wikipedia introduction to R-trees), not region as in most domestic textbooks (many books call the R-tree a region tree, which is wrong). Let's take two-dimensional space as an example. The following is a figure from Guttman's paper:
Let me explain this figure in more detail. Look at figure (b) first. We assume that all data are points in two-dimensional space; the figure marks only the data in region R8, that is, the shape of the data object. Do not think of that irregular shape as a single data item; treat it as a region containing multiple data items. To build the R-tree structure, we frame this irregular region with a minimum bounding rectangle, which gives us a region: R8. The defining feature of R8 is that it frames all the data in this region exactly. The other regions drawn with solid lines, such as R9, R10, and R12, are formed the same way. This gives us a total of 12 basic minimum rectangles, which are stored in the child nodes. The next step is processing at a higher level. We find that the three rectangles R8, R9, and R10 are closest to each other, so we can frame them exactly with a larger rectangle, R3. Likewise, R15 and R16 are framed by R6, R11 and R12 are framed by R4, and so on. Once all the basic minimum bounding rectangles have been framed by larger rectangles, we iterate again and frame those rectangles with still larger ones. By now the characteristics of this data structure should be clear. In terms of the map example: all the data are restaurant locations; neighboring restaurants are grouped into the same region; once all restaurants are grouped, neighboring regions are grouped into a larger region; and the grouping continues at higher and higher levels until only the two largest regions remain. This makes lookups convenient.
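The framing step can be sketched in a few lines. This is only an illustration: rectangles are assumed to be tuples of the form (xmin, ymin, xmax, ymax), the helper names are my own, and the coordinates are invented stand-ins for the groups R8, R9, and R10.

```python
def mbr_of_points(points):
    """Smallest axis-aligned rectangle (xmin, ymin, xmax, ymax) covering all points."""
    xs = [p[0] for p in points]
    ys = [p[1] for p in points]
    return (min(xs), min(ys), max(xs), max(ys))

def mbr_of_rects(rects):
    """Smallest rectangle covering every rectangle in `rects`."""
    return (
        min(r[0] for r in rects),
        min(r[1] for r in rects),
        max(r[2] for r in rects),
        max(r[3] for r in rects),
    )

# Hypothetical coordinates standing in for three groups of nearby restaurants.
r8 = mbr_of_points([(1, 1), (2, 3), (3, 2)])
r9 = mbr_of_points([(4, 1), (5, 2)])
r10 = mbr_of_points([(2, 5), (4, 6)])

# One level up: a larger rectangle that frames the three smaller ones exactly.
r3 = mbr_of_rects([r8, r9, r10])
print(r3)  # (1, 1, 5, 6)
```

Repeating `mbr_of_rects` on the results gives the next level up, which is the iteration described above.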
These rectangles, large and small, are what we store in the R-tree. The root node holds the two largest rectangles, which between them contain all the remaining rectangles and, of course, all the data. The next level of nodes holds the next-largest rectangles, which narrow the range. Each leaf node stores the smallest rectangles, each of which may contain n data items.
At this point the reader does not need to dwell on how the data is divided into the smallest rectangles, or on how a larger rectangle is chosen to frame the smaller ones; these are discussed in the next section.
Having covered the basic data structure, let's walk through an example of querying specific data. Taking restaurants again, suppose I want the addresses of all restaurants within one kilometer of Tianhe City in Tianhe District, Guangzhou. What do I do? Open the map (the whole R-tree) and first choose domestic or foreign (the root node). Then choose South China (a first-level node), choose Guangzhou (a second-level node), choose Tianhe District (a third-level node), and finally choose the area containing Tianhe City (a leaf node, which stores the smallest rectangles). Traverse all the entries in this area and check whether each meets the requirement. Searching an R-tree really does resemble looking something up on a map, doesn't it? See the corresponding figure:
An R-tree satisfies the following properties:
1. Every leaf node, unless it is the root, contains between m and M index records (entries). A leaf node that is the root may have fewer than m records. Usually, m = M/2.
2. For every record (entry) stored in a leaf node, I is the smallest rectangle that completely covers the points these records represent in space (note: the "rectangle" here extends to high-dimensional space).
3. Every non-leaf node has between m and M child nodes, unless it is the root.
4. For every entry in a non-leaf node, I is the smallest rectangle that completely covers the rectangles of the child node the entry points to (as in property 2).
5. All leaf nodes are on the same level, so the R-tree is a balanced tree.
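One way to get a feel for these properties is to write them as checks. The sketch below is my own illustration, not something from the paper: `Node`, `cover`, and `check_rtree` are hypothetical names, rectangles are (xmin, ymin, xmax, ymax) tuples, and the checker only verifies the entry counts, the tightness of non-leaf rectangles, and the equal depth of the leaves.

```python
class Node:
    def __init__(self, is_leaf, entries):
        self.is_leaf = is_leaf
        self.entries = entries   # (rect, item) pairs; item is a record id or a child Node

def cover(rects):
    """Smallest rectangle covering all the given rectangles."""
    return (min(r[0] for r in rects), min(r[1] for r in rects),
            max(r[2] for r in rects), max(r[3] for r in rects))

def check_rtree(node, m, M, is_root=True):
    """Return the height of the subtree; raise AssertionError if a property fails."""
    count = len(node.entries)
    assert count <= M, "node holds more than M entries"
    if not is_root:
        assert count >= m, "non-root node holds fewer than m entries"  # properties 1, 3
    if node.is_leaf:
        return 1
    heights = set()
    for rect, child in node.entries:
        heights.add(check_rtree(child, m, M, is_root=False))
        # Property 4: the stored rectangle is the tight cover of the child's entries.
        assert rect == cover([r for r, _ in child.entries]), "rectangle is not a tight cover"
    assert len(heights) == 1, "leaves are not all on the same level"   # property 5
    return heights.pop() + 1

leaf1 = Node(True, [((0, 0, 1, 1), "a"), ((2, 2, 3, 3), "b")])
leaf2 = Node(True, [((5, 5, 6, 6), "c"), ((7, 7, 8, 8), "d")])
root = Node(False, [((0, 0, 3, 3), leaf1), ((5, 5, 8, 8), leaf2)])
print(check_rtree(root, m=2, M=4))  # 2
```

A checker like this is handy as a test helper while implementing the insert and delete operations later in the article.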
Structure of leaf nodes
Let's start by looking at the structure of the leaf nodes. The entries a leaf node holds take the form: (I, tuple-identifier).
Here tuple-identifier refers to a tuple stored in the database, that is, a record, which is n-dimensional. I is a rectangle in n-dimensional space that exactly frames the points in n-dimensional space represented by all the records in this leaf node. I = (I0, I1, ..., In-1). Its structure is shown below:
The figure below describes the information a leaf node stores in two-dimensional space.
In this figure, I represents the rectangle, and its range is a <= I0 <= b, c <= I1 <= d. There are two tuple-identifiers, shown in the figure as the two points. This form extends fully to high-dimensional space; just imagine what it looks like in three dimensions. That completes the structure of the leaf node.
Non-leaf nodes
The structure of a non-leaf node is very similar to that of a leaf node. Think of the B-tree, where the leaf nodes store the actual data and the non-leaf nodes store the "boundaries" of that data, in other words an index (readers with doubts can review the section on the B-tree in the first section above).
Similarly, the entries of a non-leaf node of the R-tree take the form: (I, child-pointer).
Here child-pointer is a pointer to a child node, and I is a rectangle that covers all the rectangles in that child node. This is a bit of a mouthful, but I don't think it is hard to understand. Here is a figure:
D, E, F, and G are the rectangles corresponding to the child nodes. A is a larger rectangle that covers them, and it is the rectangle corresponding to the non-leaf node. You have probably realized by now that every node, leaf or non-leaf, corresponds to a rectangle. The rectangle of an upper node in the tree completely covers the rectangles of its child nodes. The root node likewise corresponds to a single rectangle, and this rectangle covers every point in space represented by our data.
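The two entry forms, (I, tuple-identifier) and (I, child-pointer), can be written down as small record types. This is a sketch under my own assumptions: the class and field names are hypothetical, I is represented as one closed interval per axis so that it matches I = (I0, I1, ..., In-1), and a point record is stored as a degenerate rectangle.

```python
from dataclasses import dataclass, field
from typing import List, Tuple, Union

Interval = Tuple[float, float]   # closed interval [low, high] on one axis
Rect = Tuple[Interval, ...]      # I = (I0, I1, ..., In-1)

@dataclass
class LeafEntry:
    I: Rect
    tuple_identifier: int        # stands in for a reference to the stored record

@dataclass
class BranchEntry:
    I: Rect
    child_pointer: "Node"

@dataclass
class Node:
    is_leaf: bool
    entries: List[Union[LeafEntry, BranchEntry]] = field(default_factory=list)

def covering_rect(rects: List[Rect]) -> Rect:
    """Per axis, take the lowest low and the highest high: the tightest cover."""
    return tuple(
        (min(r[axis][0] for r in rects), max(r[axis][1] for r in rects))
        for axis in range(len(rects[0]))
    )

def entry_for(node: Node) -> BranchEntry:
    """Build the (I, child-pointer) entry that a parent would store for `node`."""
    return BranchEntry(covering_rect([e.I for e in node.entries]), node)

# Two point records in 2-D, at (1, 4) and (3, 2).
leaf = Node(True, [LeafEntry(((1.0, 1.0), (4.0, 4.0)), 101),
                   LeafEntry(((3.0, 3.0), (2.0, 2.0)), 102)])
parent_entry = entry_for(leaf)
print(parent_entry.I)  # ((1.0, 3.0), (2.0, 4.0))
```

Because `covering_rect` loops over axes, the same code works unchanged for three or more dimensions.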
Personally, I feel this figure is not quite accurate: rectangle A should cover D, E, F, and G exactly, without leaving so much unused space. However, out of respect for the original drawing, I have left it unmodified.
Operation of the R-tree
This part is probably what interests programmers most. How can such an efficient data structure be implemented? That is the question this section addresses.
Search
The R-tree search operation is simple and very similar to searching a B-tree. It returns all record entries that match the search criteria. And what is the input? As I understand it, the input is more than just a range; it can be seen as a rectangle in space. In other words, the input is a search rectangle.
The pseudo-code is given first:
Function: Search
Description: Suppose T is the root node of an R-tree. Find all record entries covered by the search rectangle S.
S1: [Search subtrees] If T is a non-leaf node and the rectangle of T overlaps S, check every entry stored in T. For each entry whose rectangle overlaps S, invoke Search on the root of the subtree that the entry points to (that is, a child node of T).
S2: [Search leaf node] If T is a leaf node and the rectangle of T overlaps S, check every record entry stored in T directly. Return the records that meet the criteria.
We can use a figure to understand this search operation.
The rectangle corresponding to the shaded section is the search rectangle. It overlaps the largest rectangle (not drawn), which corresponds to the root node. This triggers a search on the root's two subtrees, whose rectangles are R1 and R2. Searching R1, we find that the search rectangle overlaps R4 within R1, so we continue into R4. Finally, we check the two rectangles R11 and R12 contained in R4 for qualifying records. The search of R2 proceeds in the same way. Clearly, the algorithm is recursive.
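The search steps translate almost directly into a recursive function. The following is a minimal sketch, assuming rectangles are (xmin, ymin, xmax, ymax) tuples and that each node simply holds a list of (rectangle, item) pairs; `Node`, `overlaps`, and the small sample tree are my own illustrative choices.

```python
class Node:
    def __init__(self, is_leaf, entries):
        self.is_leaf = is_leaf
        self.entries = entries   # (rect, item) pairs; item is a record or a child Node

def overlaps(a, b):
    """True if rectangles a and b share at least one point."""
    return a[0] <= b[2] and b[0] <= a[2] and a[1] <= b[3] and b[1] <= a[3]

def search(t, s):
    """Return every record whose rectangle overlaps the search rectangle s."""
    results = []
    for rect, item in t.entries:
        if not overlaps(rect, s):
            continue                          # prune: nothing below this entry can match
        if t.is_leaf:
            results.append(item)              # S2: a qualifying record
        else:
            results.extend(search(item, s))   # S1: descend into the subtree
    return results

# A hypothetical two-level tree with two leaves.
leaf_a = Node(True, [((0, 0, 1, 1), "a"), ((2, 2, 3, 3), "b")])
leaf_b = Node(True, [((6, 6, 7, 7), "c"), ((8, 8, 9, 9), "d")])
root = Node(False, [((0, 0, 3, 3), leaf_a), ((6, 6, 9, 9), leaf_b)])

print(search(root, (2.5, 2.5, 6.5, 6.5)))  # ['b', 'c']
```

Note that the search may have to descend into more than one subtree, as it does here, because sibling rectangles are allowed to overlap the same query.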
Insert
The insert operation of the R-tree is similar to that of the B-tree. When a new data record is added to a leaf node and the leaf node overflows, we need to split the leaf node. Insertion is clearly more complex than search, and it needs some helper methods.
Take a look at the pseudo-code:
Function: Insert
Description: Insert a new record entry E into the given R-tree.
I1: [Find a leaf node for the new record] Invoke ChooseLeaf to select the leaf node L in which to place E.
I2: [Add the new record to the leaf node] If L has enough room for a new record entry, add E to L. If there is not enough room, invoke SplitNode to obtain two nodes L and LL, which between them contain the entries of the original leaf node L and the new entry E.
I3: [Propagate changes upward] Invoke AdjustTree on node L, also passing LL if a split was performed.
I4: [Grow the tree taller] If the split propagates upward and causes the root node to split, create a new root node whose two children are the two nodes produced by splitting the original root.
Function: ChooseLeaf
Description: Select the leaf node in which to place the new entry E.
CL1: [Initialize] Set N to the root node.
CL2: [Leaf check] If N is a leaf node, return N.
CL3: [Choose subtree] If N is not a leaf node, scan the entries in N and find the entry whose rectangle needs the least enlargement to include E.I; call this entry F. If several entries tie, choose the one with the smallest rectangle area.
CL4: [Descend until a leaf is reached] Set N to the child node that F points to and repeat from CL2.
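Steps CL1 to CL4 amount to a loop that, at each level, picks the entry needing the least enlargement. Below is a minimal sketch under the same assumptions as before (rectangles as (xmin, ymin, xmax, ymax) tuples, nodes holding (rectangle, item) pairs); the helper names are hypothetical.

```python
class Node:
    def __init__(self, is_leaf, entries):
        self.is_leaf = is_leaf
        self.entries = entries   # (rect, item) pairs; item is a record or a child Node

def area(r):
    return (r[2] - r[0]) * (r[3] - r[1])

def union(a, b):
    """Smallest rectangle covering both a and b."""
    return (min(a[0], b[0]), min(a[1], b[1]), max(a[2], b[2]), max(a[3], b[3]))

def enlargement(rect, new_rect):
    """How much the area of `rect` must grow to include `new_rect`."""
    return area(union(rect, new_rect)) - area(rect)

def choose_leaf(root, new_rect):
    n = root                                             # CL1
    while not n.is_leaf:                                 # CL2
        # CL3: least enlargement first, smallest area as the tie-breaker.
        _, child = min(n.entries,
                       key=lambda e: (enlargement(e[0], new_rect), area(e[0])))
        n = child                                        # CL4
    return n

leaf_a = Node(True, [((0, 0, 2, 2), "p")])
leaf_b = Node(True, [((10, 10, 12, 12), "q")])
root = Node(False, [((0, 0, 2, 2), leaf_a), ((10, 10, 12, 12), leaf_b)])

# Growing leaf_a's rectangle to reach (3, 3, 4, 4) costs 12; growing leaf_b's costs 77.
print(choose_leaf(root, (3, 3, 4, 4)) is leaf_a)  # True
```

Using a tuple as the `min` key keeps the tie-breaking rule from CL3 in one place.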
Function: AdjustTree
Description: Propagate the changes from the leaf node up to the root, adjusting the covering rectangles along the way. Node splits may also propagate upward during this process.
AT1: [Initialize] Set N to L.
AT2: [Check if done] If N is the root node, stop.
AT3: [Adjust the covering rectangle in the parent entry] Let P be the parent node of N, and let EN be the entry in P that points to N. Adjust EN.I so that it tightly encloses all the rectangles in N.
AT4: [Propagate node split upward] If N has a partner node NN produced by an earlier split, create an entry ENN that points to NN. If P has room for ENN, add ENN to P. Otherwise, invoke SplitNode on P to obtain P and PP.
AT5: [Move up to the next level] Set N to P and, if a split occurred, set NN to PP. Repeat from AT2.
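Putting Insert, ChooseLeaf, and AdjustTree together gives something like the sketch below. Please read it with its assumptions in mind: the text above does not spell out SplitNode, so `split_node` here is a deliberately naive stand-in (sort by xmin and cut in half) and not the split heuristic from the paper; `M = 4` is an arbitrary capacity; and all names are my own.

```python
M = 4   # assumed maximum number of entries per node

class Node:
    def __init__(self, is_leaf):
        self.is_leaf = is_leaf
        self.entries = []      # [rect, item] lists; item is a record or a child Node
        self.parent = None

def area(r):
    return (r[2] - r[0]) * (r[3] - r[1])

def union(a, b):
    return (min(a[0], b[0]), min(a[1], b[1]), max(a[2], b[2]), max(a[3], b[3]))

def cover(node):
    rect = node.entries[0][0]
    for r, _ in node.entries[1:]:
        rect = union(rect, r)
    return rect

def add_entry(node, rect, item):
    node.entries.append([rect, item])
    if not node.is_leaf:
        item.parent = node     # keep parent pointers correct for AdjustTree

def split_node(node):
    """Naive stand-in for SplitNode: order entries by xmin and cut the list in half."""
    ordered = sorted(node.entries, key=lambda e: e[0][0])
    half = len(ordered) // 2
    sibling = Node(node.is_leaf)
    node.entries = []
    for rect, item in ordered[:half]:
        add_entry(node, rect, item)
    for rect, item in ordered[half:]:
        add_entry(sibling, rect, item)
    return sibling

def choose_leaf(root, rect):
    n = root
    while not n.is_leaf:
        n = min(n.entries,
                key=lambda e: (area(union(e[0], rect)) - area(e[0]), area(e[0])))[1]
    return n

def insert(root, rect, record):
    """Insert a record and return the (possibly new) root node."""
    leaf = choose_leaf(root, rect)                               # I1
    add_entry(leaf, rect, record)                                # I2
    n = leaf
    nn = split_node(leaf) if len(leaf.entries) > M else None
    while n is not root:                                         # I3, AT2
        p = n.parent
        for e in p.entries:                                      # AT3
            if e[1] is n:
                e[0] = cover(n)
        if nn is not None:                                       # AT4
            add_entry(p, cover(nn), nn)
            nn = split_node(p) if len(p.entries) > M else None
        n = p                                                    # AT5
    if nn is not None:                                           # I4
        new_root = Node(False)
        add_entry(new_root, cover(n), n)
        add_entry(new_root, cover(nn), nn)
        return new_root
    return root

root = Node(True)
for i in range(10):
    root = insert(root, (i, i, i, i), f"rec{i}")
print(root.is_leaf, len(root.entries))  # False 4
```

The caller must keep the returned root, because step I4 may replace it. A real implementation would swap `split_node` for one of the split algorithms discussed in the paper, since the quality of the split strongly affects how much the rectangles overlap.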
As before, we use a figure to understand the insert operation more intuitively.
Let's analyze the insert operation with the figure. We need to insert rectangle R21. We start with ChooseLeaf. The root node has two entries, R1 and R2. R1 already covers R21 completely, whereas adding R21 to R2 would enlarge R2.I considerably, so we obviously choose R1. We then move down to the next level. Adding R21 to R3 is more appropriate than adding it to R4, because R3 needs a relatively small increase in area to cover R21. R21 is therefore inserted into the leaf node that holds R8, R9, and R10. Because that leaf node does not have enough room, a split is performed.
After the insert operation, the tree looks like this:
This insert operation is really similar to the B-tree insert operation in the first section, so I will not go into further detail here; anyone who has read the pseudo-code above should find it clear.
Delete
The delete operation of the R-tree differs from that of the B-tree, although like the B-tree it involves operations such as condensing the tree. After reading the pseudo-code below, the reader should have a feel for it. Deletion in an R-tree is also fairly complex and needs some helper functions.
The pseudo-code is as follows:
Function: Delete
Description: Delete a record E from the given R-tree.
D1: [Find the leaf node containing the record] Invoke FindLeaf to locate the leaf node L that contains E. If the search fails, stop.
D2: [Delete the record] Remove E from L.
D3: [Propagate changes] Invoke CondenseTree on L.
D4: [Shorten the tree] If, after the adjustments above, the root node has only one child node, make that child the new root.
Function: FindLeaf
Description: Given the root node T, find the leaf node that contains the record E.
FL1: [Search subtrees] If T is not a leaf node, check each entry F in T to find those whose rectangle overlaps the rectangle of E (it does not have to cover it completely). For each such F, invoke FindLeaf on the child node it points to, until E is found or all entries have been checked.
FL2: [Search leaf node for the record] If T is a leaf node, check each entry to see whether it matches E; if one does, return T.
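FindLeaf is close to Search, except that it stops at the first leaf holding the record. A minimal sketch, assuming the record is identified by its rectangle together with its value, and using the same hypothetical `Node` layout and (xmin, ymin, xmax, ymax) rectangles as the earlier examples:

```python
class Node:
    def __init__(self, is_leaf, entries):
        self.is_leaf = is_leaf
        self.entries = entries   # (rect, item) pairs; item is a record or a child Node

def overlaps(a, b):
    return a[0] <= b[2] and b[0] <= a[2] and a[1] <= b[3] and b[1] <= a[3]

def find_leaf(t, rect, record):
    """Return the leaf node containing (rect, record), or None if it is not in the tree."""
    if t.is_leaf:                                    # FL2
        for r, item in t.entries:
            if r == rect and item == record:
                return t
        return None
    for r, child in t.entries:                       # FL1
        if overlaps(r, rect):
            found = find_leaf(child, rect, record)
            if found is not None:
                return found                         # stop as soon as the record is found
    return None

leaf_a = Node(True, [((0, 0, 1, 1), "a"), ((2, 2, 3, 3), "b")])
leaf_b = Node(True, [((6, 6, 7, 7), "c"), ((8, 8, 9, 9), "d")])
root = Node(False, [((0, 0, 3, 3), leaf_a), ((6, 6, 9, 9), leaf_b)])

print(find_leaf(root, (6, 6, 7, 7), "c") is leaf_b)  # True
print(find_leaf(root, (4, 4, 5, 5), "zz"))           # None
```

Several subtrees may need to be tried before the record turns up, which is why the loop continues after an unsuccessful recursive call.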
Function: CondenseTree
Description: L is the leaf node from which an entry has been deleted. If L has too few entries (fewer than the required minimum), the leaf node L must be removed from the tree, and after its removal the remaining entries of L must be reinserted into the tree. This process repeats upward until the root node is reached. Along the way, the rectangles of all nodes on the path affected by the change are adjusted as well.
CT1: [Initialize] Set N to L. Initialize a linked list Q to hold the entries of eliminated nodes.
CT2: [Find the parent entry] If N is the root node, go to CT6. Otherwise, let P be the parent node of N, and let EN be the entry in P that points to N.
CT3: [Eliminate underfull node] If N has fewer than m entries, remove EN from P and add the entries of node N to the linked list Q.
CT4: [Adjust covering rectangle] If N has not been eliminated, adjust EN.I so that its rectangle covers the rectangles of all entries in N exactly.
CT5: [Move up one level] Set N to P and repeat from CT2.
CT6: [Reinsert orphaned entries] Reinsert all entries of the nodes in Q. Entries that came from leaf nodes can be reinserted with the Insert operation; entries that came from non-leaf nodes must be inserted at the level they occupied before deletion, so that the subtrees they point to remain on the same level.
The CondenseTree step of R-tree deletion differs from the B-tree. When a B-tree deletion leaves a node less than half full (that is, underfull), its records are "fused" with those of another node; in other words, two adjacent nodes are merged. The R-tree, by contrast, simply reinserts the entries.
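The upward walk of CondenseTree (CT1 to CT5) can be sketched as follows. This is a simplified illustration under my own assumptions: nodes carry a `parent` pointer, rectangles are (xmin, ymin, xmax, ymax) tuples, Q is a plain Python list whose items record the level each orphaned entry came from, and the reinsertion of CT6 is left to the caller.

```python
class Node:
    def __init__(self, is_leaf, entries):
        self.is_leaf = is_leaf
        self.entries = entries   # [rect, item] lists so a rectangle can be updated in place
        self.parent = None
        if not is_leaf:
            for _, child in entries:
                child.parent = self

def cover(node):
    rects = [r for r, _ in node.entries]
    return (min(r[0] for r in rects), min(r[1] for r in rects),
            max(r[2] for r in rects), max(r[3] for r in rects))

def condense_tree(leaf, m):
    """Walk from `leaf` up to the root and return Q, a list of (level, rect, item).

    Level 0 means the entry came from a leaf node (it is a record); level k > 0
    means it came from a non-leaf node k levels above the leaves.
    """
    q = []                                               # CT1
    n, level = leaf, 0
    while n.parent is not None:                          # CT2
        p = n.parent
        en = next(e for e in p.entries if e[1] is n)
        if len(n.entries) < m:                           # CT3: eliminate underfull node
            p.entries = [e for e in p.entries if e is not en]
            q.extend((level, rect, item) for rect, item in n.entries)
        else:                                            # CT4: tighten the rectangle
            en[0] = cover(n)
        n, level = p, level + 1                          # CT5
    return q                                             # CT6 is up to the caller

leaf_a = Node(True, [[(0, 0, 0, 0), "c"], [(2, 2, 2, 2), "d"]])
leaf_b = Node(True, [[(5, 5, 5, 5), "e"], [(6, 6, 6, 6), "f"], [(9, 9, 9, 9), "g"]])
root = Node(False, [[(0, 0, 2, 2), leaf_a], [(5, 5, 9, 9), leaf_b]])

# D2: remove record "c" from its leaf, then condense with m = 2.
leaf_a.entries = [e for e in leaf_a.entries if e[1] != "c"]
q = condense_tree(leaf_a, m=2)
print(q)  # [(0, (2, 2, 2, 2), 'd')]
```

Recording the level alongside each orphaned entry is one way to satisfy CT6, which requires entries from non-leaf nodes to go back in at the level they came from.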
Again, we use a figure to illustrate the operation intuitively.
Assume that the maximum number of entries per node is 4 and the minimum is 2. In this figure, our goal is to delete record C. First, FindLeaf locates the leaf node that holds C, namely R11. Once C is removed from R11, R11 has only one record, fewer than the minimum of 2, so an underflow occurs and CondenseTree is invoked. C is deleted, and the remaining entry of R11, the pointer to record D, is placed in the linked list Q. The same procedure is then applied to the node one level up, so R12 is also placed in the linked list. The principle is the same, so we will not repeat it here.
One thing to explain: as the deletion propagates upward, the root node's entry R1 is also placed in Q, leaving the root node with only R2. Do not worry, the reinsertion step takes care of this. We reinsert R3, R12, and D at the levels where they were originally located. We then find that the root node has only one entry, so, following step D4 of the delete operation, we remove the root node and make its child node, the node holding R5, R6, R7, and R3, the new root. At this point, the delete operation ends.