R-Tree: Introduction and Development History

Source: Internet
Author: User

R-tree spatial index algorithms: research history and analysis of recent progress
Abstract: This article introduces the concept of a spatial index, the R-tree data structure, and the R-tree spatial index algorithm. Starting from the advantages and disadvantages of R-tree indexing, it then discusses the improved structures known as R-tree variants. Finally, it analyzes recent research progress on the R-tree.

 

Keywords: spatial indexing; R-tree; research history; recent progress

A key issue in data search is speed, and the core technology for improving it is spatial indexing. A spatial index is a mapping between spatial locations and spatial objects. Some large database systems, such as Oracle and DB2, already provide spatial indexing capabilities.

Spatial indexing is not only a way to speed up display. More generally, it provides a data structure suited to spatial search so that queries run faster.

The core task of spatial indexing is to quickly find all spatial objects that satisfy a search condition, for example all objects that intersect a given rectangle. When the data set is huge and the query rectangle is very small relative to the full map, the set of candidates is far smaller than the full data set, and running the more complex search only on this reduced set greatly improves efficiency.
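The filtering idea in the paragraph above can be sketched as follows. This is only an illustration of the rectangle test and the candidate set it produces; the names Rect, intersects and filter_candidates are made up for this example, and the linear scan stands in for the index, whose purpose is precisely to avoid visiting every object.

```python
from typing import NamedTuple


class Rect(NamedTuple):
    """Axis-aligned rectangle (hypothetical helper type for this sketch)."""
    xmin: float
    ymin: float
    xmax: float
    ymax: float


def intersects(a: Rect, b: Rect) -> bool:
    """True when the two rectangles share at least one point."""
    return (a.xmin <= b.xmax and b.xmin <= a.xmax and
            a.ymin <= b.ymax and b.ymin <= a.ymax)


def filter_candidates(mbrs: dict, window: Rect) -> list:
    """Keep the IDs whose bounding rectangle meets the query window."""
    return [obj_id for obj_id, mbr in mbrs.items() if intersects(mbr, window)]


# Illustrative data: object ID -> bounding rectangle.
objects = {
    "road": Rect(0, 0, 10, 1),
    "lake": Rect(20, 20, 30, 30),
    "park": Rect(4, 0.5, 6, 3),
}
candidates = filter_candidates(objects, Rect(5, 0, 7, 2))
print(candidates)
```

Only the objects in the candidate list would then go through the more expensive exact geometry test.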

A spatial index is a data structure arranged in a certain order according to the location and shape of spatial entities, or the spatial relationships between them. It holds summary information about each spatial object, such as the object ID, the bounding rectangle, and a pointer to the object's data. Put simply, it partitions spatial objects according to some spatial relationship, and subsequent access to those objects goes through the partitions.

 

1 Introduction
A spatial index describes where data is located on the storage medium, and it is used to make data retrieval more efficient. Spatial indexes were proposed for two reasons. First, computer architecture divides storage into main memory and external storage, whose access times are typically 30 to 40 ns and 8 to 10 ms respectively, a difference of more than 100,000 times. Although "in-memory databases" exist, the vast majority of data is stored on external disks. If the location of data on disk is not recorded and organized, the entire data file has to be scanned for every item queried, and the cost of these disk accesses seriously degrades system efficiency. The system designer must therefore record and organize the data on disk, and replace blind disk access with some computation in memory. This matters especially for GIS, which involves massive amounts of varied, complex data, so indexing is crucial to processing efficiency.

Second, the multidimensional nature of GIS data makes the traditional B-tree index unsuitable. The data types the B-tree targets, such as characters and numbers, form a totally ordered set in a single dimension: any two elements can be compared as greater than, less than, or equal. If multiple fields are indexed, a priority must be assigned to each field to form a composite key. Geographic data, however, is multidimensional with no direction taking priority over another, so the B-tree cannot index it effectively. A dedicated spatial index method adapted to multidimensional data is therefore needed.

In 1984, Guttman published "R-trees: A Dynamic Index Structure for Spatial Searching". The R-tree is a height-balanced tree consisting of intermediate nodes and leaf nodes. The minimum bounding rectangles of the actual data objects are stored in the leaf nodes, and each intermediate node is formed by aggregating the bounding rectangles of its lower-level nodes so that it encloses all of them. Researchers have since proposed various improvements on this basis for different spatial operations, producing a flourishing family of index trees that remains among the most popular spatial indexes today.

The R-tree is an extension of the B-tree to multidimensional space. It partitions spatial objects by range. Each node corresponds to a region and to a disk page. The disk page of a non-leaf node stores the region ranges of all its child nodes, and the regions of all those children fall within the node's own region; the disk page of a leaf node stores the bounding rectangles of all spatial objects within its region. The number of children per node has an upper and a lower bound: the lower bound ensures effective use of disk space, and the upper bound ensures that each node fits in one disk page. When an insertion makes a node need more space than one disk page, the node is split in two. The R-tree is a dynamic index structure: insertions and deletions can be interleaved with queries, and the tree does not need periodic reorganization.

 

2 R-tree data structure
The R-tree is a spatial index data structure. In brief:

(1) An R-tree is an N-ary tree, where N is called the fan-out of the R-tree.

(2) Each node corresponds to a rectangle.

(3) A leaf node contains at most N objects, and its rectangle is the bounding rectangle of all those objects.

(4) The rectangle of a non-leaf node is the bounding rectangle of the rectangles of all its child nodes.
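Rules (3) and (4) both reduce to the same operation: taking the smallest rectangle that encloses a set of rectangles. A minimal sketch, assuming rectangles are stored as (xmin, ymin, xmax, ymax) tuples (a layout chosen here for illustration, not prescribed by the text):

```python
def mbr(rects):
    """Smallest axis-aligned rectangle (xmin, ymin, xmax, ymax) enclosing all rects."""
    xmins, ymins, xmaxs, ymaxs = zip(*rects)
    return (min(xmins), min(ymins), max(xmaxs), max(ymaxs))


# Two leaf nodes holding object rectangles (illustrative data).
leaf_a = [(0, 0, 2, 2), (1, 1, 4, 3)]
leaf_b = [(6, 5, 8, 9)]

# Rule (3): a leaf's rectangle encloses its objects.
rect_a = mbr(leaf_a)
rect_b = mbr(leaf_b)

# Rule (4): a non-leaf node's rectangle encloses its children's rectangles.
parent = mbr([rect_a, rect_b])
print(rect_a, rect_b, parent)
```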

This definition is loose: for the same data set, different construction choices can yield very different R-trees. Which structure is better? There are two criteria:

(1) Objects that are spatially adjacent should, as far as possible, share the same parent node in the tree.

(2) The overlap between the rectangles of sibling nodes on the same level should be as small as possible.
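To make the two criteria concrete, the sketch below groups the same four objects in two ways and measures how much the resulting sibling rectangles overlap. The helper names and the use of summed pairwise overlap area as the measure are assumptions made for this illustration.

```python
from itertools import combinations


def mbr(rects):
    """Smallest rectangle (xmin, ymin, xmax, ymax) enclosing all rects."""
    xmins, ymins, xmaxs, ymaxs = zip(*rects)
    return (min(xmins), min(ymins), max(xmaxs), max(ymaxs))


def overlap_area(a, b):
    """Area of the intersection of two rectangles, 0 if they are disjoint."""
    w = min(a[2], b[2]) - max(a[0], b[0])
    h = min(a[3], b[3]) - max(a[1], b[1])
    return w * h if w > 0 and h > 0 else 0


def total_overlap(groups):
    """Sum of pairwise overlap areas between the sibling nodes' rectangles."""
    boxes = [mbr(g) for g in groups]
    return sum(overlap_area(a, b) for a, b in combinations(boxes, 2))


# Four unit squares: a and b lie on a bottom row, c and d on a top row.
a, b = (0, 0, 1, 1), (2, 0, 3, 1)
c, d = (0, 4, 1, 5), (2, 4, 3, 5)

good = total_overlap([[a, b], [c, d]])  # neighbours share a parent
bad = total_overlap([[a, d], [b, c]])   # distant objects share a parent
print(good, bad)
```

Grouping neighbours together gives disjoint sibling rectangles, while the diagonal grouping produces two siblings that cover the same area, so a query would have to descend into both.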

The R-tree is a data structure for multidimensional data, used to access spatial data made up of area objects in two or more dimensions. It is a balanced tree with two kinds of nodes: leaf nodes and non-leaf nodes. Each node consists of several index entries. In a leaf node, an entry has the form (index, obj_id), where index is the minimum bounding rectangle (MBR) of the spatial data object and obj_id identifies that object. In a non-leaf node, an entry has the form (index, child_pointer), where child_pointer points to a child node and index is the smallest rectangle enclosing the MBRs of all entries in that child.

[Figure: example of an R-tree]
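The entry layout described above can be sketched in a few lines, together with the window search it enables. This is a minimal illustration rather than a full implementation: the Node class, the (xmin, ymin, xmax, ymax) tuple layout and the search function are names chosen here, and the tree is assembled by hand instead of by insertion.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    is_leaf: bool
    # Leaf entries: (index, obj_id). Non-leaf entries: (index, child_pointer),
    # where index is a rectangle and child_pointer is another Node.
    entries: list = field(default_factory=list)


def intersects(a, b):
    """True when two (xmin, ymin, xmax, ymax) rectangles share a point."""
    return a[0] <= b[2] and b[0] <= a[2] and a[1] <= b[3] and b[1] <= a[3]


def search(node, window):
    """Return the obj_ids whose MBR intersects the query window."""
    hits = []
    for rect, item in node.entries:
        if not intersects(rect, window):
            continue  # the whole subtree (or object) can be skipped
        if node.is_leaf:
            hits.append(item)
        else:
            hits.extend(search(item, window))
    return hits


# A hand-built two-level tree (illustrative data).
leaf1 = Node(True, [((0, 0, 2, 2), "a"), ((1, 1, 4, 3), "b")])
leaf2 = Node(True, [((6, 5, 8, 9), "c")])
root = Node(False, [((0, 0, 4, 3), leaf1), ((6, 5, 8, 9), leaf2)])

result = search(root, (3, 2, 7, 6))
print(result)
```

Because sibling rectangles may overlap, a search can descend into more than one child, which is why low overlap matters for query cost.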

 

3 R-tree algorithm description
The construction algorithm is described as follows:

Let the number of objects be N and the node capacity be fan.

(1) Estimate the number of leaf nodes: K = N/fan.

(2) Sort all geometric objects by the X value of the center point of their bounding rectangle.

(3) Group the sorted objects. The size of each group is * fan, and the last group may not be full.

(4) Within each group, sort the objects by the Y value of the center point of their bounding rectangle.

(5) Split each sorted group into subgroups of size fan.

(6) Each subgroup becomes a leaf node; let the number of leaf nodes be NN.

(7) Set N = NN and return to step (1), now treating the node rectangles as the objects, until a single root node remains.
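The steps above resemble a sort-and-tile packing pass, and can be sketched as follows. The multiplier in step (3) is missing from the text; this sketch assumes ceil(sqrt(K)) vertical slices, as in Sort-Tile-Recursive packing, so that each slice holds ceil(sqrt(K)) * fan objects. That choice, and the names pack_level and build_levels, are assumptions made for this example.

```python
import math


def center(r):
    """Center point of an (xmin, ymin, xmax, ymax) rectangle."""
    return ((r[0] + r[2]) / 2, (r[1] + r[3]) / 2)


def mbr(rects):
    """Smallest rectangle enclosing all rects."""
    xmins, ymins, xmaxs, ymaxs = zip(*rects)
    return (min(xmins), min(ymins), max(xmaxs), max(ymaxs))


def pack_level(rects, fan):
    """One pass of steps (1) to (6): group rects into nodes of at most fan entries."""
    k = math.ceil(len(rects) / fan)                    # step (1)
    s = math.ceil(math.sqrt(k))                        # ASSUMPTION: slice count
    by_x = sorted(rects, key=lambda r: center(r)[0])   # step (2)
    slice_size = s * fan                               # step (3)
    groups = []
    for i in range(0, len(by_x), slice_size):
        by_y = sorted(by_x[i:i + slice_size],
                      key=lambda r: center(r)[1])      # step (4)
        for j in range(0, len(by_y), fan):             # step (5)
            groups.append(by_y[j:j + fan])             # step (6)
    return groups


def build_levels(rects, fan):
    """Step (7): repeat the pass on node rectangles until one root remains."""
    if not rects or fan < 2:
        raise ValueError("need at least one rectangle and fan >= 2")
    levels, current = [], list(rects)
    while True:
        groups = pack_level(current, fan)
        levels.append(groups)
        if len(groups) == 1:
            return levels
        current = [mbr(g) for g in groups]


# Sixteen small squares on a 4 x 4 grid, packed with fan = 4.
squares = [(x, y, x + 0.5, y + 0.5) for x in range(4) for y in range(4)]
levels = build_levels(squares, fan=4)
print([len(level) for level in levels])
```

Here levels[0] holds the leaf groups and the last level holds the single root group; a real implementation would build node objects instead of plain lists.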

 

4 R-tree Spatial Index Algorithm
1 R-tree

The history of multidimensional indexing can be traced back to the mid-1970s. At that time, various indexing techniques such as cell methods, the quadtree, and the k-d tree were introduced, but their results were not satisfactory. Driven by the spatial indexing requirements of GIS and CAD systems, Guttman proposed the R-tree index structure in 1984 in the paper "R-trees: A Dynamic Index Structure for Spatial Searching". The R-tree is a height-balanced tree consisting of intermediate nodes and leaf nodes. The minimum bounding rectangles of the actual data objects are stored in the leaf nodes, and each intermediate node is formed by aggregating the bounding rectangles of its lower-level nodes so that it encloses all of them. Researchers have since proposed various improvements on this basis for different spatial operations, producing a flourishing family of index trees that remains among the most popular spatial indexes today.

2 R+ tree

Building on Guttman's work, many R-tree variants have been developed. Sellis et al. proposed the R+ tree. It is similar to the R-tree; the main difference is that the regions of sibling nodes in an R+ tree do not overlap. Partitioning space this way eliminates the "dead zones" caused by the node overlap that the R-tree allows (blank areas of a node that contain none of the node's data) and reduces the number of fruitless searches, which greatly improves the efficiency of the spatial index. However, insertions and deletions of spatial objects must preserve the non-overlap of regions, which lowers their efficiency. In addition, the R+ tree stores objects that span several regions redundantly, and this redundant information keeps growing as the database grows. Greene also proposed a variant of the R-tree.

3 R* tree

In 1990, Beckmann, Kriegel, and colleagues proposed the R* tree, an optimized dynamic R-tree variant. Like the R-tree, the R* tree allows rectangles to overlap. Its construction algorithm, however, takes into account not only the "area" of the index space but also the overlap of the index space. It improves the node insertion and splitting algorithms and uses "forced reinsertion" to optimize the tree structure. Even so, the R* tree algorithm still cannot effectively reduce spatial overlap, especially when the data volume is large and the number of dimensions increases. The R* tree cannot handle cases where the dimension is greater than 20.
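The difference between considering only area and also considering overlap can be illustrated on the choice of which node should receive a new rectangle. The sketch below is a simplification written for this article, not the full R* tree insertion algorithm: choose_by_area picks the node needing the least area enlargement, while choose_by_overlap first minimizes the increase in overlap with sibling nodes and uses area only to break ties. All function names are hypothetical.

```python
def area(r):
    return (r[2] - r[0]) * (r[3] - r[1])


def union(a, b):
    """Smallest rectangle enclosing both a and b."""
    return (min(a[0], b[0]), min(a[1], b[1]), max(a[2], b[2]), max(a[3], b[3]))


def overlap(a, b):
    """Area of the intersection of a and b, 0 if disjoint."""
    w = min(a[2], b[2]) - max(a[0], b[0])
    h = min(a[3], b[3]) - max(a[1], b[1])
    return w * h if w > 0 and h > 0 else 0


def area_enlargement(box, new):
    return area(union(box, new)) - area(box)


def overlap_enlargement(i, boxes, new):
    """Extra overlap with siblings if boxes[i] grows to include new."""
    grown = union(boxes[i], new)
    return sum(overlap(grown, b) - overlap(boxes[i], b)
               for j, b in enumerate(boxes) if j != i)


def choose_by_area(boxes, new):
    """Area-only criterion: least area enlargement, then smallest area."""
    return min(range(len(boxes)),
               key=lambda i: (area_enlargement(boxes[i], new), area(boxes[i])))


def choose_by_overlap(boxes, new):
    """Overlap-aware criterion: least overlap enlargement first."""
    return min(range(len(boxes)),
               key=lambda i: (overlap_enlargement(i, boxes, new),
                              area_enlargement(boxes[i], new),
                              area(boxes[i])))


# Three sibling rectangles and a new rectangle to place (illustrative data).
siblings = [(0, 0, 4, 4), (5, -10, 6, 14), (13, 0, 15, 4)]
new_rect = (7, 0, 8, 4)

by_area = choose_by_area(siblings, new_rect)
by_overlap = choose_by_overlap(siblings, new_rect)
print(by_area, by_overlap)
```

In this data the first sibling needs the least extra area, but growing it would cut across the tall second sibling; the overlap-aware choice accepts a larger rectangle for the third sibling in exchange for keeping the siblings disjoint.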

4 QR tree

The QR tree uses a quadtree to divide space into subspaces and builds R-tree indexes within each subspace, which reduces the overlap of the index space. It combines the advantages of the quadtree and the R-tree and is a joint application of the two. Experiments show that, compared with the R-tree, the QR tree trades slightly more storage (sometimes even slightly less) for higher performance, and that its overall performance advantage grows with the number of indexed objects.

5 SS tree

The SS tree improves on the R* tree and raises the performance of nearest-neighbor queries through the following measures. It represents the shape of a region with a minimum bounding sphere instead of a minimum bounding rectangle, which improves nearest-neighbor query performance and cuts storage space by nearly half. It also improves the forced reinsertion mechanism of the R* tree. When the dimension increases to 5, the overlap of bounding rectangles in the R-tree and its variants reaches 90%. In high-dimensional settings (more than 5 dimensions) their performance therefore becomes poor, sometimes worse than a sequential scan.

6 X-tree

The X-tree is a hybrid of a linear array and a hierarchical R-tree. By introducing supernodes, it greatly reduces the overlap between minimum bounding rectangles and improves query efficiency.

The SS tree, as noted above, indexes with bounding spheres. The diameter (diagonal) of a bounding rectangle is larger than the diameter of a bounding sphere, and the SS tree divides points into regions of small diameter. Because region diameter has a strong effect on nearest-neighbor query performance, the SS tree outperforms the R* tree on nearest-neighbor queries. On the other hand, the average volume of a bounding rectangle is smaller than that of a bounding sphere, and the R* tree divides points into regions of small volume. Because large volumes produce a lot of overlap, the bounding rectangle is superior to the bounding sphere in terms of volume. The SR tree uses both the minimum bounding sphere (MBS) and the minimum bounding rectangle (MBR). Compared with the SS tree, it reduces the volume of regions and improves the separation between them; compared with the R* tree, it improves the performance of nearest-neighbor queries.
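Why using both shapes helps nearest-neighbor search can be seen from the lower-bound distance between a query point and a region. The sketch below assumes that an SR tree region is the intersection of its bounding sphere and bounding rectangle, so the larger of the two distances is still a valid lower bound; the function names are chosen for this illustration only.

```python
import math


def mindist_rect(p, lo, hi):
    """Distance from point p to the rectangle with corners lo and hi."""
    return math.sqrt(sum(max(l - x, 0, x - h) ** 2
                         for x, l, h in zip(p, lo, hi)))


def mindist_sphere(p, center, radius):
    """Distance from point p to the sphere, 0 if p lies inside it."""
    return max(0.0, math.dist(p, center) - radius)


def mindist_both(p, lo, hi, center, radius):
    """Lower bound for a region lying inside both the rectangle and the sphere."""
    return max(mindist_rect(p, lo, hi), mindist_sphere(p, center, radius))


# Points (0,1), (4,1), (2,0), (2,2): rectangle (0,0)-(4,2),
# sphere centered on the centroid (2,1) with radius 2.
lo, hi = (0, 0), (4, 2)
center, radius = (2, 1), 2.0

# Query near a rectangle corner: the sphere gives the tighter bound.
d_corner = mindist_both((5, 3), lo, hi, center, radius)
# Query above the short side: the rectangle gives the tighter bound.
d_top = mindist_both((2, 4), lo, hi, center, radius)
print(d_corner, d_top)
```

A larger lower bound lets a nearest-neighbor search discard a region earlier, so taking the tighter of the two bounds in each direction prunes more than either shape alone.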

5 Latest research on the R-tree spatial index algorithm
The explosion of information makes database search increasingly difficult. In index construction, the primary challenge is to design efficient index algorithms that support various database systems (such as multimedia databases and spatial databases), and in particular to use those algorithms effectively to accelerate search. In summary, an R-tree spatial index algorithm should support high-dimensional data spaces, partition the data space effectively to suit the index organization, and efficiently unify multiple query methods within one system. Research on R-tree index structures should not simply accelerate one query method or improve one aspect of performance while ignoring the effect on others, since that may cause unnecessary performance costs elsewhere.

XML is the Extensible Markup Language. The XR-tree is an indexing method for XML data built on traditional R-tree indexing technology; it constructs an index structure suited to XML data. The XR-tree is a dynamic external-memory index data structure. It is applicable to the structural join problem in XISS (XML Indexing and Storage System), where an algorithm based on the XR-tree index is designed to skip non-matching elements efficiently. However, this indexing method still stores a large number of intermediate matching results during path join operations. An index-based path join algorithm that works on the whole query pattern has therefore been proposed: partial matching results are temporarily pushed onto a linked list of stacks, and stack operations proceed dynamically as matching proceeds. Once the join has been processed, the final result is output directly, which both saves storage space and improves efficiency.
