Database index B + Tree

Last Update:2015-12-18 Source: Internet

Author: User

Developer on Alibaba Coud: Build your first app with APIs, SDKs, and tutorials on the Alibaba Cloud. Read more ＞

The interview was inadvertently asked this question: Database index storage structure is generally B + tree, why not suitable for the red black tree and other ordinary two-tree?

After discussing with classmates, we get the following situation:

1. The database files are placed on the hard disk, each time the database is read on the disk to search, so we need to consider the disk seek time, we all know that the disk seek cost is very large. At the same time, the index is generally very large, the memory can not be put down, so it will be placed on the disk. (also related to the principle of locality and disk pre-reading ).

2. B + Tree All the keywords appear in the leaf node's linked list (dense index), and the list of keywords is ordered. A non-leaf node only acts as an index (sparse index).

The leaf node is equivalent to the data layer where the keyword is stored.

Therefore, we conclude that the number of disk I/O should be minimized due to the amount of data stored on the disk. Exactly, the leaf node of the B + Tree stores the data layer of the key (which we can call the primary key), so we can query the data directly from the primary key. If using a two-fork tree, the binary tree index is only the primary key position, I also need to base on the binary tree index primary key position, I/O operation to get the primary key.

Did not study the database, just and classmates after the discussion made a summary, the mistake is unavoidable.

Below, in the study of B-tree, B-tree, + + tree, b* tree

B-Tree

Binary search tree:

1. All non-leaf nodes have a maximum of two sons (left and right);

2. All nodes store a keyword;

3. The left pointer of a non-leaf node points to a subtree smaller than its key, and the right pointer points to a subtree larger than its key;

Such as:

B-Tree search, starting from the root node, if the keyword of the query is equal to the keyword of the node, then hit; otherwise, if the query keyword is smaller than the node keyword, enter the left son; if the pointer to the left son or right son is empty, the report cannot find the corresponding keyword;

If the number of nodes of all non-leaf nodes in the B-tree remains approximately (balanced), then the search performance of B-tree is approximate to the binary lookup; but it has the advantage over the binary lookup of contiguous memory space that changing the B-tree structure (inserting and deleting nodes) does not require moving large segments of memory data, or even constant overhead;

Such as:

However, after several insertions and deletions, the B-tree may lead to different structures:

The right side is also a B-tree, but its search performance is already linear, the same keyword collection may lead to different tree structure index, so, the use of B-tree to keep the B-tree as far as possible to maintain the structure of the left graph, and avoid the structure of the right graph, so-called "balance" problem;

The actual B-tree is based on the original B-tree to add the balance algorithm, namely "balanced binary tree"; How to keep the equilibrium algorithm of B-tree node distribution evenly is the key to balance binary tree; The equilibrium algorithm is a strategy to insert and delete nodes in the B-tree.

B-Tree

A multi-path search tree (not two-pronged):

1. Definition of any non-leaf node up to only m sons; m>2;

2. The number of sons of the root node is [2, M];

3. The number of sons of non-leaf nodes outside the root node is [M/2, M];

4. Each node is stored at least m/2-1 (rounded) and up to M-1 keywords; (at least 2 keywords)

5. Number of key words for non-leaf nodes = number of pointers to sons-1;

6. Non-leaf node keywords: k[1], k[2], ..., k[m-1]; K[i] < k[i+1];

7. Pointers to non-leaf nodes: p[1], p[2], ..., p[m], where p[1] a subtree that points to a keyword less than k[1], p[m] a subtree that points to a keyword greater than k[m-1], and other p[i] to the subtree of the keyword belonging to (k[i-1], k[i]);

8. All leaf nodes are located on the same floor;

such as: (M=3)

B-Tree search, starting from the root node, the node in the key (ordered) sequence of binary search, if the hit is finished, otherwise enter the query keyword to the range of the son node; repeat until the corresponding son pointer is empty, or is already a leaf node;

B-Tree Features:

1. The keyword set is distributed throughout the tree;

2. Any keyword appears and appears only in one node;

3. Search may end at non-leaf nodes;

4. Its search performance is equivalent to doing one-time binary search within the complete range of keywords;

5. Automatic level control;

Due to the restriction of non-leaf nodes outside the root node, at least the M/2 son, to ensure the minimum utilization of the end point, its minimum search performance is:

where m is the maximum number of subtree of non-leaf nodes, and n is the total number of keywords;

So the performance of B-tree is always equivalent to binary lookup (independent of M-value), there is no problem of B-tree equilibrium;

Due to the limitation of the M/2, when the node is inserted, if the node is full, it is necessary to divide the node into two M/2 nodes, and to delete the nodes, it is necessary to merge the two M/2 brothers nodes;

B + Tree

B + trees are variants of B-trees and are also a multi-path search tree:

1. Its definition is basically the same as the B-tree, except:

2. The sub-tree pointer of non-leaf node is the same as the number of keywords;

3. The subtree pointer of the non-leaf node p[i], pointing to the subtree (b-tree is open interval) of the key value belonging to [K[i], k[i+1]);

5. Add a chain pointer for all leaf nodes;

6. All keywords appear at the leaf node;

such as: (M=3)

The B + search is basically the same as the B. C-tree, except that the second B-tree only hits the leaf nodes (b-trees can be hit on non-leaf nodes), and its performance is equivalent to doing a binary search in the keyword complete.

Features of B +:

1. All keywords appear in the list of leaf nodes (dense index), and the key words in the list are in order;

2. Cannot be hit on non-leaf nodes;

3. The non-leaf node is equivalent to the index of the leaf node (sparse index), and the leaf node is equivalent to the data layer of storing (key) data;

4. More suitable for file indexing system;

b* Tree

B + Tree Variant, the non-root and non-leaf nodes of the B + tree are then increased to point to the brother's pointer;

b* Tree defines the number of non-leaf node keywords at least (2/3) *m, that is, the minimum usage of the block is 2/3 (instead of the B + Tree 1/2);

B + Tree Division: When a node is full, a new node is allocated, and 1/2 of the original node is copied to the new node, and the pointer to the new node is added to the parent node, and the division of the tree is affected only by the original node and the parent node, without affecting the sibling node, so it does not need to point to the brother's pointer

b*: When a node is full, if its next sibling node is not full, then move part of the data to the sibling node, insert the keyword at the original node, and finally modify the keyword of the sibling node in the parent node (because the sibling node's keyword range has changed); If the brothers are full, The new node is added between the original node and the sibling node, and each copy 1/3 of the data to the new node, and finally the pointer of the new node is added to the parent node;

Therefore, the probability of allocating new nodes to b* tree is lower than that of B + tree, and the space utilization rate is higher.

Transferred from: http://blog.csdn.net/manesking/archive/2007/02/09/1505979.aspx

Summary

B-Tree: Two fork tree, each node only stores a keyword, equal to hit, less than the left node, more than the right node;

B-Tree: Multi-path search tree, each node storage M/2 to M keywords, non-leaf node storage points to the key range of sub-nodes;

All keywords appear in the whole tree, and only once, non-leaf nodes can hit;

B + Tree: On the basis of the tree, the leaf nodes are added to the list pointers, all the keywords appear in the leaf nodes, the non-leaf nodes as the index of the leaf nodes; B + The tree is always hit by the leaf knot.

b* tree: On the basis of B + tree, for non-leaf nodes also increase the linked list pointer, the minimum utilization rate of nodes increased from 1/2 to 2/3;

Database index B + Tree

This article is an English version of an article which is originally in the Chinese language on aliyun.com and is provided for information purposes only. This website makes no representation or warranty of any kind, either expressed or implied, as to the accuracy, completeness ownership or reliability of the article or any translations thereof. If you have any concerns or complaints relating to the article, please send an email, providing a detailed description of the concern or complaint, to info-contact@alibabacloud.com. A staff member will contact you within 5 working days. Once verified, infringing content will be removed immediately.

A Free Trial That Lets You Build Big!

Start building with 50+ products and up to 12 months usage for Elastic Compute Service

Get Started for Free

Sales Support

1 on 1 presale consultation

Chat Contact Sales
After-Sales Support

24/7 Technical Support 6 Free Tickets per Quarter Faster Response

Open a Ticket
Alibaba Cloud offers highly flexible support services tailored to meet your exact needs.

Learn More

Database index B + Tree

Contact Us

A Free Trial That Lets You Build Big!

Sales Support

After-Sales Support