B-Tree Detailed

Source: Internet
Author: User

B-Tree

Before going into details, one point is worth stressing again: "B-tree" simply means B tree. The original English name is B-tree, and many people in China like to translate it literally, hyphen included, as if it were a "B minus tree". This is a poor literal translation that easily causes misunderstanding: readers may think that the "B-tree" is one kind of tree and the "B tree" another. In fact, both names refer to the same structure. It is hereby stated.

The B-tree is a balanced multiway search tree (multiway as opposed to binary; as you will see below, a B-tree node has many branches) designed for disks and other storage devices. It is similar to the red-black trees described earlier in this blog, but it is better at reducing disk I/O operations. Many database systems use B-trees or B-tree variants, such as the B+ tree and B* tree introduced below, to store information.

The biggest difference between a B-tree and a red-black tree is that a B-tree node can have many children, from a handful to thousands. Why, then, are B-trees said to be similar to red-black trees? Because, like a red-black tree, a B-tree with n nodes has height O(lg n), although that height may be much smaller than the height of a red-black tree because the branching factor is much larger. A B-tree can therefore implement the dynamic-set operations such as insert and delete in O(log n) time.

The figure for this example shows a B-tree whose keys are English consonants, in which we search for the letter R. An internal node x that contains n[x] keys has n[x]+1 children. All leaf nodes are at the same depth, and the shaded nodes are the ones examined while searching for the letter R.

As the figure makes clear, an internal node x that contains n[x] keys has n[x]+1 children. For example, the internal node with the 2 keys D H has 3 children, and the internal node with the 3 keys Q T X has 4 children.

A B-tree is also called a balanced multiway search tree. An m-order B-tree (m-way tree) has the following properties:

    1. Each node in the tree has at most m children (m >= 2);
    2. Except for the root and the leaf nodes, every node has at least ceil(m/2) children (where ceil(x) rounds x up to the nearest integer);
    3. If the root is not a leaf node, it has at least 2 children (special case: a root with no children, i.e. the root is a leaf and the whole tree consists of the root alone);
    4. All leaf nodes appear on the same level, and the leaf nodes carry no key information (they can be seen as external nodes, or as the nodes reached when a lookup fails; in fact these nodes do not exist and the pointers to them are null). (Reader feedback from @Rengat: there is a mistake here; a leaf node simply has no children and no pointers to children, but such nodes do exist and do hold elements. @JULY: it really comes down to what you regard as a leaf node; as with red-black trees, each null pointer can be treated as a leaf node that is simply not drawn.)
    5. Each non-terminal node contains n keys and is laid out as (n, P0, K1, P1, K2, P2, ......, Kn, Pn), where:
      a) Ki (i = 1...n) are the keys, sorted in ascending order, so K(i-1) < Ki.
      b) Pi is a pointer to the root of a subtree, and every key in the subtree pointed to by P(i-1) is less than Ki but greater than K(i-1).
      c) The number of keys n must satisfy: ceil(m/2) - 1 <= n <= m - 1.

Property 5 above deserves a closer look: the number of keys each B-tree node can hold (such as D H and Q T X earlier) has an upper bound and a lower bound. These bounds can be expressed with a fixed integer t (t >= 2) called the minimum degree of the B-tree (the Chinese edition of Introduction to Algorithms translates it simply as "degree"; the minimum degree is the smallest number of children an internal node may have).

    • Each non-root node must contain at least t-1 keys. Each non-root internal node has at least t children. If the tree is non-empty, the root contains at least one key;
    • Each node can contain at most 2t-1 keys. An internal node can therefore have at most 2t children. If a node has exactly 2t-1 keys, we say the node is full (the B* tree, a common variant of the B-tree described later, requires each internal node to be at least 2/3 full rather than at least half full as the B-tree requires here);
    • The B-tree is simplest when t=2 (t=2 is the smallest allowed value; t can be any value >= 2). Many people mistakenly think a B-tree is a binary search tree, but a binary search tree is a binary search tree and a B-tree is a B-tree. The most accurate definition of a B-tree is: a balanced multiway search tree with minimum degree t (t >= 2). With t=2 each internal node has 2, 3, or 4 children, which gives a 2-3-4 tree; in practice a much larger value of t is usually used.
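The bounds in the list above can be written down as a few small C helpers. This is only a sketch: the function names (min_keys, max_keys, max_children, key_count_ok) are made up for illustration and do not come from the original article.

```c
#include <assert.h>
#include <stdbool.h>

/* Smallest number of keys a node may hold for minimum degree t.
 * The root of a non-empty tree only needs 1 key. */
int min_keys(int t, bool is_root) {
    assert(t >= 2);
    return is_root ? 1 : t - 1;
}

/* Largest number of keys any node may hold; a node with this many is "full". */
int max_keys(int t) {
    assert(t >= 2);
    return 2 * t - 1;
}

/* An internal node with n keys has n + 1 children, so the maximum is 2t. */
int max_children(int t) {
    return max_keys(t) + 1;
}

/* Check a node's key count against both bounds. */
bool key_count_ok(int t, int n, bool is_root) {
    return n >= min_keys(t, is_root) && n <= max_keys(t);
}
```

With t = 2 these helpers give at most 3 keys and 4 children per node, which is the 2-3-4 tree mentioned above.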

Each B-tree node can hold a large number of keys and branches, as the situation requires (it cannot, of course, exceed the size of a disk block; depending on the disk drive, a block is typically around 1 KB to 4 KB). The depth of the tree is therefore small, which means that finding an element only requires reading a few nodes from the external disk into memory before the target data is reached.

The form of a B-tree and its node layout are illustrated in the figure for this section.

For simplicity, the figure uses a small amount of data to build a 3-way tree; in real applications a B-tree node holds many keys. Take the root node in the figure as an example: 17 is the name of a file on disk, the small red square indicates where the contents of file 17 are stored on the hard disk, and P1 is the pointer to the left subtree of 17.

Its structure can be simply defined as:

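typedef struct Btnode {
    /* Number of files (keys) currently stored in this node */
    int file_num;
    /* File names (keys); max_file_num is assumed to be defined elsewhere */
    char *file_name[max_file_num];
    /* Pointers to child nodes (one more than the number of keys) */
    struct Btnode *Btptr[max_file_num + 1];
    /* Location where each file is stored on the hard disk */
    File_hard_addr Offset[max_file_num];
} Btnode;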

Suppose each disk block holds exactly one B-tree node (containing exactly 2 file names). Then one Btnode represents one disk block, and each subtree pointer holds the address of another disk block.

Below, let's simulate the process of finding file 29:

    1. Using the root pointer, locate disk block 1, the root of the file directory, and load its contents into memory. [Disk I/O operation 1]
    2. The block now in memory holds two file names, 17 and 35, and three pointers storing the addresses of other disk pages. Since 17 < 29 < 35, we follow pointer P2.
    3. Following P2, locate disk block 3 and load its contents into memory. [Disk I/O operation 2]
    4. The block now in memory holds two file names, 26 and 30, and three pointers storing the addresses of other disk pages. Since 26 < 29 < 30, we follow pointer P2.
    5. Following P2, locate disk block 8 and load its contents into memory. [Disk I/O operation 3]
    6. The block now in memory holds two file names, 28 and 29. We find file name 29 and obtain the disk address where the file is stored.

The procedure above needs 3 disk I/O operations and 3 in-memory lookups. Because the file names inside a node form an ordered table, the in-memory lookup can use binary search to improve efficiency. The disk I/O operations, however, are the determining factor in the efficiency of the whole B-tree lookup.
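The lookup just described can be sketched in C as follows. This is a simplified in-memory illustration, not the article's on-disk code: it uses integer keys instead of file-name strings, the names BTNode and btree_search are hypothetical, and each node visited stands in for one disk block read.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_KEYS 2   /* two file names per node, as in the example */

typedef struct BTNode {
    int count;                           /* keys in use */
    int key[MAX_KEYS];                   /* sorted keys */
    struct BTNode *child[MAX_KEYS + 1];  /* child[i] holds keys below key[i] */
} BTNode;

/* Returns 1 if target is found, 0 otherwise.
 * *reads receives the number of nodes visited (simulated disk reads). */
int btree_search(const BTNode *node, int target, int *reads) {
    assert(reads != NULL);
    *reads = 0;
    while (node != NULL) {
        ++*reads;
        /* Binary search for the first index whose key is >= target. */
        int lo = 0, hi = node->count;
        while (lo < hi) {
            int mid = lo + (hi - lo) / 2;
            if (node->key[mid] < target)
                lo = mid + 1;
            else
                hi = mid;
        }
        if (lo < node->count && node->key[lo] == target)
            return 1;
        node = node->child[lo];          /* descend; NULL means not found */
    }
    return 0;
}
```

If only the three nodes on the path from the text are built (17 35, then 26 30, then 28 29), searching for 29 visits three nodes, which corresponds to the three disk I/O operations counted above.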

By comparison, if a balanced binary tree were used as the on-disk structure, the same lookup would take at least 4 and at most 5 disk reads. The more files there are, the fewer disk I/O operations the B-tree needs relative to the balanced binary tree, and the greater its efficiency advantage.

The height of the B-tree

As the example above shows, the number of I/O reads from secondary storage depends on the height of the B-tree. So what determines the height of the B-tree?

The height h of a B-tree satisfies:

    h <= log_t((N + 1) / 2)

where t is the minimum degree (every non-root node holds at least t - 1 keys; it plays the role of the so-called order), N is the total number of keys, and h counts the levels below the root (the root is at depth 0).

We can see that t has a decisive effect on the height of the tree. For the same number of elements, the more elements each node holds, the lower the B-tree is likely to be. This is why a clustered index in SQL Server should use keys that are as narrow as possible: each node (page) in SQL Server is 8 KB (8192 bytes), so a smaller key lets each node hold more elements, which reduces the height of the B-tree and improves query performance.

The height formula can also be derived by adding up the minimum number of nodes on each level. For a tree with minimum degree t, the root is 1 node, the second level has at least 2 nodes, the third level at least 2t nodes, the fourth level at least 2t*t nodes, and so on. The root holds at least 1 key and every other node at least t - 1 keys, so for a tree of height h the total number of keys N satisfies:

    N >= 1 + (t - 1) * (2 + 2t + 2t^2 + ... + 2t^(h-1)) = 2t^h - 1

Rearranging gives t^h <= (N + 1) / 2, and taking the logarithm base t of both sides yields the height formula h <= log_t((N + 1) / 2).

This also shows why each node must have at least two children: according to the height formula, if each node had only one child (t = 1), the height would tend to infinity.
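As a quick check of the bound, the sketch below computes the largest height h for which 2t^h - 1 <= N still holds, using integer arithmetic only. The function name btree_max_height is hypothetical, and the height is counted with the root at depth 0.

```c
#include <assert.h>

/* Largest h such that 2 * t^h - 1 <= n, i.e. the maximum possible height
 * (root at depth 0) of a B-tree with minimum degree t holding n keys. */
int btree_max_height(unsigned long long n, unsigned long long t) {
    assert(n >= 1 && t >= 2);
    unsigned long long half = n / 2 + n % 2;  /* floor((n + 1) / 2) without overflow */
    unsigned long long power = 1;             /* t^h for the current h */
    int h = 0;
    /* Growing to h + 1 is allowed while t^(h+1) <= (n + 1) / 2. */
    while (power <= half / t) {
        power *= t;
        h++;
    }
    return h;
}
```

For instance, with t = 512 and one million keys the bound is 2, so a lookup touches at most three levels of nodes.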

4. B+-tree

B+-tree: a variant of the B-tree that arose from the needs of file systems.

The differences between an m-order B+ tree and an m-order B-tree are:

1. A node with n subtrees contains n keys (whereas in a B-tree a node with n subtrees contains n-1 keys).

2. The leaf nodes together contain all of the keys, along with pointers to the records holding those keys, and the leaf nodes themselves are linked in ascending key order. (The leaf nodes of a B-tree do not contain all of the information to be searched.)

3. All non-terminal nodes can be regarded as the index part; each contains only the largest (or smallest) key of the root node of each of its subtrees. (The non-terminal nodes of a B-tree also contain valid information to be searched.)

a) Why is the B+-tree better suited than the B-tree to operating-system file indexes and database indexes in practice?

1) B+-tree disk reads and writes cost less

The internal nodes of a B+-tree hold no pointers to the records associated with the keys, so its internal nodes are smaller than those of a B-tree. If all the keys of one internal node are stored in the same disk block, the block can hold more keys, so more of the keys needed for a lookup are read into memory at once, and the number of I/O reads and writes drops accordingly.

For example, suppose a disk block holds 16 bytes, a key takes 2 bytes, and a pointer to a key's record takes 2 bytes. An internal node of an order-9 B-tree (at most 8 keys per node) then needs 2 disk blocks, while the corresponding B+ tree internal node needs only 1. When the internal node has to be read into memory, the B-tree costs one more block seek than the B+ tree (on a disk, that is the time of a platter rotation).
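The arithmetic in this example can be reproduced with a few lines of C. The helper names below are made up for illustration, and, like the text, the sketch counts only keys and record pointers and ignores child pointers.

```c
#include <assert.h>

/* Number of whole disk blocks needed to hold node_bytes bytes. */
int blocks_needed(int node_bytes, int block_bytes) {
    assert(node_bytes >= 0 && block_bytes > 0);
    return (node_bytes + block_bytes - 1) / block_bytes;  /* round up */
}

/* B-tree internal node: every key carries a pointer to its record. */
int btree_node_bytes(int keys, int key_bytes, int record_ptr_bytes) {
    return keys * (key_bytes + record_ptr_bytes);
}

/* B+ tree internal node: keys only; record pointers live in the leaves. */
int bplus_node_bytes(int keys, int key_bytes) {
    return keys * key_bytes;
}
```

With 8 keys of 2 bytes, 2-byte record pointers and 16-byte blocks, the B-tree node occupies 32 bytes (2 blocks) and the B+ tree node 16 bytes (1 block), matching the figures in the text.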

2) B+-tree query efficiency is more stable

A non-terminal node never points to the file contents itself; it is only an index to the keys in the leaf nodes. Every key lookup must therefore follow a path from the root to a leaf. All lookups have the same path length, so the query cost is the same for every piece of data.

b) Application of the B+-tree: VSAM (Virtual Storage Access Method) files (source: the paper "The Ubiquitous B-Tree", D. Comer, 1979)

5. B*-tree

The B*-tree is a variant of the B+-tree. It adds pointers to siblings in the non-root, non-leaf nodes of the B+ tree, and it requires each non-leaf node to hold at least (2/3)*m keys, so the minimum block utilization is 2/3 (instead of the 1/2 of the B+ tree).

B+ tree split: when a node is full, a new node is allocated, 1/2 of the data in the original node is copied to the new node, and a pointer to the new node is added to the parent. A B+ tree split affects only the original node and the parent, never the siblings, so no sibling pointers are needed.

B* tree split: when a node is full and its next sibling is not, part of the data is moved to the sibling, the key is inserted into the original node, and the sibling's key in the parent is updated (because the sibling's key range has changed). If the sibling is also full, a new node is added between the original node and the sibling, each of the two copies 1/3 of its data to the new node, and a pointer to the new node is added to the parent.

The B* tree therefore allocates new nodes less often than the B+ tree and achieves higher space utilization.

6. B-tree insert and delete operations

Section 3 above briefly described how the structure of a B-tree gives access to data on an external disk. Here a second example introduces the basic insert and delete operations of the B-tree. Before that, let's briefly review the properties of an m-order B-tree (m-way tree):
    1. Each node in the tree has at most m children; the number of children c of a non-root internal node satisfies ceil(m/2) <= c <= m.
    2. Except for the root and the leaf nodes, every node has at least ceil(m/2) children (where ceil(x) rounds x up to the nearest integer);
    3. If the root is not a leaf node, it has at least 2 children (special case: a root with no children, i.e. the root is a leaf and the whole tree consists of the root alone);
    4. All leaf nodes appear on the same level, and the leaf nodes carry no key information (they can be seen as external nodes, or as the nodes reached when a lookup fails; in fact these nodes do not exist and the pointers to them are null);
    5. Each non-terminal node contains n keys and is laid out as (n, P0, K1, P1, K2, P2, ......, Kn, Pn), where:
      a) Ki (i = 1...n) are the keys, sorted in ascending order, so K(i-1) < Ki.
      b) Pi is a pointer to the root of a subtree, and every key in the subtree pointed to by P(i-1) is less than Ki but greater than K(i-1).
      c) The number of keys in every node other than the root must satisfy: ceil(m/2) - 1 <= n <= m - 1 (leaf nodes must satisfy this key-count property too; only the root is exempt).

Let's use an order-5 B-tree (any node in the tree has at most 4 keys and 5 subtrees) as the example.

Note:

    1. The number of keys is 2 to 4 for every non-root node (leaf nodes included); the number of children is 3 to 5 for every node other than the root and the leaves. A root that is not a leaf must still have at least 2 children, otherwise the tree would degenerate into a straight chain.
    2. An interview question: what is the maximum height of an m-order B-tree with N keys in total? Answer: about log_ceil(m/2) N (as property 1 of the m-order B-tree above says, each node has at most m children and each non-root internal node at least ceil(m/2); the fewer children each node has, the taller the tree). This question also appeared in the Microsoft written test of April 2012. For more of the principles, see section 3 above on the height of the B-tree.

The keys in the example are uppercase letters, in ascending alphabetical order.

The nodes are defined as follows:

typedef struct {
    int Count;          /* Number of key elements in the current node */
    ItemType Key[4];    /* Array storing the key elements */
    long branch[5];     /* Pseudo-pointer array (record numbers), which makes merging and splitting easier to handle */
} NodeType;

6.1. Insert operation

To insert an element, first search the B-tree for it. If it does not exist, the search ends at a leaf node, and the new element is inserted into that leaf. If the leaf has enough space, the elements greater than the new key are shifted to the right to make room. If the leaf is full, the node is split: half of the key elements move to a new right-hand neighbor node, and the middle key element moves up into the parent (if the parent is also full, it has to be split in the same way). Whenever key elements shift to the right, the associated pointers shift with them. If the insertion reaches the root and the root is full, the split moves the middle key element of the old root up into a new root, which increases the height of the tree by one level.

1. Let's work through an example step by step. Insert the following letters into an empty order-5 B-tree (a non-root node is merged when it has too few keys, fewer than 2, and split when it has too many, more than 4): C N G A H E K Q M F W L T Z D P R X Y S. At first the node has enough space, and the first 4 letters go into the same node.

2. When we try to insert H, the node has no room left, so it is split into 2 nodes and the middle element G moves up into a new root. In the implementation, A and C stay in the current node, while H and N are placed in the new right-hand neighbor.

3. Inserting E, K and Q requires no split.

4. Inserting M requires a split. M happens to be the middle key element, so it moves up into the parent.

5. Inserting F, W, L and T requires no split.

6. Inserting Z finds the rightmost leaf full, so it is split and the middle element T moves up into the parent. Moving the middle element up keeps the tree balanced, and each node produced by the split holds 2 key elements.

7. Inserting D splits the leftmost leaf. D happens to be the middle element, so it moves up into the parent. The letters P, R, X and Y are then inserted without any split (remember that a node in this tree has at most 5 children).

8. Finally, inserting S requires the node containing N, P, Q, R to split, and the middle element Q moves up into the parent. But now the parent is full as well, so it is split too, and its middle element M moves up into a newly formed root. Note that after the split, the pointer that used to be the third pointer of the parent belongs to the node containing D and G. That completes the insert example. The delete operation is described next; it has more cases to consider than insertion.
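The insertion procedure walked through above (insert into the leaf, split an overfull node, push the middle key up) can be sketched in C as follows. This is a minimal in-memory illustration under assumptions: the names Node, split, insert_rec and btree_insert are hypothetical, real pointers replace the pseudo-pointers of NodeType, each node has one spare slot so that it can overflow briefly before being split, and duplicate keys are ignored.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

#define ORDER 5                  /* maximum children per node */
#define MAX_KEYS (ORDER - 1)     /* maximum keys per node */

typedef struct Node {
    int count;                       /* keys in use */
    char key[MAX_KEYS + 1];          /* one spare slot for a temporary overflow */
    struct Node *child[ORDER + 1];   /* all NULL in a leaf */
} Node;

static Node *new_node(void) {
    Node *n = calloc(1, sizeof *n);
    assert(n != NULL);
    return n;
}

/* Split an overfull node: the middle key goes to *up, the upper half
 * moves to a new right sibling, which is returned. */
static Node *split(Node *n, char *up) {
    int mid = n->count / 2;
    Node *r = new_node();
    *up = n->key[mid];
    r->count = n->count - mid - 1;
    memcpy(r->key, n->key + mid + 1, r->count * sizeof n->key[0]);
    memcpy(r->child, n->child + mid + 1, (r->count + 1) * sizeof n->child[0]);
    n->count = mid;
    return r;
}

/* Insert k into the subtree rooted at n. Returns the new right sibling
 * (with its separator key in *up) if n had to split, otherwise NULL. */
static Node *insert_rec(Node *n, char k, char *up) {
    int i = 0;
    while (i < n->count && n->key[i] < k) i++;
    if (i < n->count && n->key[i] == k) return NULL;   /* duplicate: ignore */
    Node *right = NULL;
    char sep = k;
    if (n->child[0] != NULL) {                         /* internal node: descend */
        right = insert_rec(n->child[i], k, &sep);
        if (right == NULL) return NULL;                /* child absorbed the key */
    }
    /* Shift larger keys and their pointers right, then place the key. */
    memmove(n->key + i + 1, n->key + i, (n->count - i) * sizeof n->key[0]);
    memmove(n->child + i + 2, n->child + i + 1, (n->count - i) * sizeof n->child[0]);
    n->key[i] = sep;
    n->child[i + 1] = right;
    n->count++;
    return n->count > MAX_KEYS ? split(n, up) : NULL;
}

/* Insert k and return the (possibly new) root. Pass NULL for an empty tree. */
Node *btree_insert(Node *root, char k) {
    if (root == NULL) root = new_node();
    char up = 0;
    Node *right = insert_rec(root, k, &up);
    if (right != NULL) {             /* the root split: the tree grows one level */
        Node *r = new_node();
        r->count = 1;
        r->key[0] = up;
        r->child[0] = root;
        r->child[1] = right;
        root = r;
    }
    return root;
}
```

Feeding the letters C N G A H E K Q M F W L T Z D P R X Y S to btree_insert in that order should reproduce the walkthrough: the final root holds M, with D G in its left child and Q T in its right child.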

6.2. Delete operation
(1) The two steps of a deletion
Step one: find the position of the key K to be deleted in the tree.
Step two: delete K.

(2) Deleting K
The B-tree is a generalization of the binary sort tree, so an in-order traversal of a B-tree also yields the keys in sorted order (see the exercises for a specific traversal algorithm). The in-order predecessor (successor) of any key K must be the last (first) key in the rightmost (leftmost) leaf of K's left (right) subtree.
If the key K to be deleted is in a non-leaf node, replace K with its in-order predecessor (or successor) K', and then delete K' from the leaf. Deletion therefore always reduces to deleting a key from a leaf. There are three cases for deleting a key K from a leaf *x:
Case one: if x->keynum > Min, simply delete K and its right pointer (*x is a leaf, so K's right pointer is null) and the deletion is finished.

Case two: if x->keynum = Min, the leaf holds the minimum number of keys, and deleting K and its right pointer would violate B-tree property (3). If the left (or right) neighbor sibling *y of *x holds more than Min keys, the largest (or smallest) key in *y is moved up into the parent node *parent, and the corresponding key in *parent is moved down into *x. This move leaves the number of keys in the parent unchanged. *y loses one key, so its keynum drops by 1, but since it was greater than Min it is still at least Min afterwards. *x gains one key, so after K is deleted *x still holds Min keys. All three nodes involved in the move satisfy B-tree property (3). Please verify that the operation also preserves B-tree property (1). Once the move is complete, the deletion is finished.
Case three: if *x and its neighbor siblings (there may be only one sibling) all hold the minimum Min keys, the move above does not work, and *x must be merged with its left or right sibling. Suppose *x has a right neighbor sibling *y (the discussion for a left sibling is similar). After K is deleted from *x, the key K' in the parent *parent that separates *x and *y is taken as the middle key and merged together with the keys of *x and *y into one new node that replaces *x and *y. Because *x and *y each held Min keys, and the K' taken from the parent offsets the K deleted from *x, the new node holds exactly 2Min keys (2*ceil(m/2) - 2 <= m - 1), which does not violate B-tree property (3). However, moving K' from the parent into the new node amounts to deleting K' from *parent. If parent->keynum was greater than Min, the deletion is finished; otherwise the same treatment is applied to *parent, either by moving a key from one of its left or right siblings or by merging *parent with a sibling, to maintain the B-tree properties. In the worst case the merging propagates all the way up to the root. When the root holds only one key, the merge combines the root and its two children into a new root, which reduces the height of the whole tree by one level.
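The choice among the three cases can be summarized as a small decision function. This is only a sketch of the case analysis, not a full deletion routine: the names LeafDeleteAction, NO_SIBLING, min_keys_for_order and leaf_delete_action are hypothetical, the leaf is assumed not to be the root, and the left sibling is arbitrarily tried before the right one.

```c
#include <assert.h>

typedef enum {
    DELETE_DIRECTLY,      /* case one: the leaf has keys to spare */
    BORROW_FROM_LEFT,     /* case two: rotate a key through the parent */
    BORROW_FROM_RIGHT,    /* case two, mirrored */
    MERGE_WITH_SIBLING    /* case three: merge and pull a key down from the parent */
} LeafDeleteAction;

#define NO_SIBLING (-1)   /* marks a missing left or right sibling */

/* Minimum key count of a non-root node in an m-order B-tree: ceil(m/2) - 1. */
int min_keys_for_order(int m) {
    assert(m >= 3);
    return (m + 1) / 2 - 1;
}

/* Decide how to delete a key from a non-root leaf holding keynum keys,
 * given the key counts of its neighbor siblings (or NO_SIBLING). */
LeafDeleteAction leaf_delete_action(int m, int keynum,
                                    int left_keynum, int right_keynum) {
    int min = min_keys_for_order(m);
    assert(keynum >= min);
    assert(left_keynum != NO_SIBLING || right_keynum != NO_SIBLING);
    if (keynum > min)
        return DELETE_DIRECTLY;
    if (left_keynum != NO_SIBLING && left_keynum > min)
        return BORROW_FROM_LEFT;
    if (right_keynum != NO_SIBLING && right_keynum > min)
        return BORROW_FROM_RIGHT;
    return MERGE_WITH_SIBLING;
}
```

For an order-5 tree (Min = 2), a leaf with 3 keys is handled by case one, a leaf with 2 keys next to a sibling with 3 keys by case two, and a leaf whose siblings also hold 2 keys by case three.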



Analysis of the deletion example:
The first key deleted, H, is in a leaf whose keynum > Min (for an order-5 B-tree, Min = 2), so it is deleted directly. The second, R, is not in a leaf, so its in-order successor S replaces it: S is copied to R's position and then deleted from its leaf. The third, P, is in a leaf that holds only Min keys, but the right sibling has keynum > Min, so a left shift is possible: the parent's S moves down into the node where P was, and the smallest (leftmost) key T of the right sibling moves up to replace S in the parent. When D is deleted, neither D's node nor its left and right siblings have keys to spare, so after D is deleted the node must be merged with one of its two siblings (the figure chooses the left sibling (AB)) together with the parent key C that separates the two merged nodes, forming the new node (ABCE). But after losing C the parent has keynum < Min, so it must be adjusted as well. It has only a right sibling, and that sibling has no keys to spare, so moving a key cannot solve the problem. A second merge follows, and because the root has only one key, the merged tree is one level shorter, giving the last figure.

Height and performance analysis of the B-tree

The time for an operation on a B-tree is usually made up of disk access time and CPU computation time. The number of disk accesses needed by most basic B-tree operations depends on the tree height h. For the same total number of keys, the smaller the B-tree height, the less time disk I/O takes.
Disk I/O is much slower than CPU computation, so the CPU time is sometimes ignored and only the number of disk accesses required by the algorithm is analyzed (the number of disk accesses multiplied by the average time of one disk read or write gives the total disk I/O time).

1. Height of the B-tree
Theorem 9.1: If n >= 1 and m >= 3, then for any m-order B-tree with n keys, the tree height h is at most:
log_t((n + 1) / 2) + 1.
Here t is the minimum degree of each internal node (other than the root), i.e. t = ceil(m/2), and h counts levels with the root as level 1; counting the root as depth 0 instead gives the equivalent bound log_t((n + 1) / 2).

The theorem shows that the B-tree height is O(log_t n). The number of disk reads and writes for search, insert, and delete on a B-tree is therefore O(log_t n), and the CPU time is O(m log_t n).

2. Performance analysis
① For n nodes, the height H of a balanced binary sort tree (about lg n) is roughly lg t times the B-tree height h.
[Example] If m = 1024, then lg t = lg 512 = 9. If the B-tree height is 4, the height of the balanced binary sort tree is approximately 36. Clearly, the larger m is, the smaller the B-tree height.
② As an in-memory lookup table, a B-tree is not necessarily better than a balanced binary sort tree, especially when m is large.
The CPU time for a search on a B-tree is
O(m log_t n) = O(lg n * (m / lg t))
and m / lg t > 1, so when m is large, O(m log_t n) is much greater than the O(lg n) of the corresponding operation on a balanced binary sort tree. A B-tree used only in memory should therefore use a small m. (Usually the minimum value m = 3 is taken, in which case each internal node of the B-tree has 2 or 3 children; this order-3 B-tree is called a 2-3 tree.)

B+ Tree

The B+ tree is a variant of the B-tree and is also a multiway search tree. Its definition is basically the same as that of the B-tree, except:

1. A non-leaf node has as many subtree pointers as keys;

2. The subtree pointer P[i] of a non-leaf node points to the subtree whose key values lie in [K[i], K[i+1]) (the B-tree uses an open interval);

3. A chain pointer is added to all leaf nodes;

4. All keys appear in the leaf nodes.

(Example figure: a B+ tree with m = 3.)

Searching a B+ tree is basically the same as searching a B-tree, except that a B+ tree search only hits at a leaf node (a B-tree search can hit at a non-leaf node). Its performance is equivalent to a binary search over the complete set of keys.
Features of the B+ tree:

1. All keys appear in the linked list of leaf nodes (a dense index), and the keys in the list are in sorted order;

2. A search never hits at a non-leaf node;

3. The non-leaf nodes act as an index over the leaf nodes (a sparse index), and the leaf nodes act as the data layer that stores the (key) data;

4. It is better suited to file index systems.
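The search rule and the leaf chain described above can be sketched in C as follows. The layout follows the convention of this section (as many child pointers as keys, with child[i] covering [key[i], key[i+1])); the names BPlusNode, bplus_contains and bplus_count_keys are hypothetical, and the node size M = 3 mirrors the example figure.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define M 3   /* order used in the example figure */

typedef struct BPlusNode {
    int count;                    /* keys in use */
    int key[M];
    struct BPlusNode *child[M];   /* child[i] covers [key[i], key[i+1]); all NULL in a leaf */
    struct BPlusNode *next;       /* leaf chain pointer; NULL in non-leaf nodes */
} BPlusNode;

/* Look up target. Seeing the key in a non-leaf node is never a hit:
 * the search always continues down to the leaf level. */
bool bplus_contains(const BPlusNode *node, int target) {
    assert(node != NULL && node->count > 0);
    while (node->child[0] != NULL) {       /* non-leaf: pick the covering subtree */
        int i = node->count - 1;
        while (i >= 0 && node->key[i] > target)
            i--;
        if (i < 0)
            return false;                  /* smaller than every key in the tree */
        node = node->child[i];
    }
    for (int i = 0; i < node->count; i++)
        if (node->key[i] == target)
            return true;
    return false;
}

/* Follow the leaf chain from the leftmost leaf and count every key. */
int bplus_count_keys(const BPlusNode *leftmost_leaf) {
    int total = 0;
    for (const BPlusNode *p = leftmost_leaf; p != NULL; p = p->next)
        total += p->count;
    return total;
}
```

Because every key lives in a leaf and the leaves are chained, a full scan in key order only needs to walk the next pointers and never has to revisit the index nodes.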


