From: http://blog.csdn.net/acs713/article/details/6880375
The B-tree is a common data structure. It significantly reduces the number of intermediate steps needed to locate a record, which speeds up access. The B is generally taken to stand for "balanced". B-trees are widely used for database indexing, where they are highly efficient overall.
In a B-tree, each node contains:
1. the number of keys stored in the node;
2. a pointer to the parent node;
3. the keys;
4. pointers to the child nodes.
A B-tree is a multi-way search tree (not a binary tree). For a B-tree of order m:
1. Every non-leaf node has at most m children, where m > 2;
2. The root has between 2 and m children, unless the root is a leaf;
3. Every non-leaf node other than the root has between ceil(m/2) and m children;
4. The number of keys in a non-leaf node = the number of its child pointers - 1;
5. Following from 3, every non-leaf node other than the root stores at least ceil(m/2)-1 and at most m-1 keys;
6. The keys of a non-leaf node are K[1], K[2], ..., K[m-1], with K[i] < K[i+1];
7. The pointers of a non-leaf node are P[1], P[2], ..., P[m], where P[1] points to the subtree whose keys are less than K[1], P[m] points to the subtree whose keys are greater than K[m-1], and every other P[i] points to the subtree whose keys lie in the interval (K[i-1], K[i]);
8. All leaf nodes are on the same level.
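The bounds in rules 3 and 5 can be written as a few small functions. This is only a sketch: the function names are made up for illustration and are not part of the original notes.

```c
#include <assert.h>

/* Hypothetical helpers: occupancy limits for a B-tree of order m (m > 2). */

/* ceil(m / 2) using integer arithmetic */
static int ceil_half(int m) { return (m + 1) / 2; }

/* Children of a non-leaf node other than the root: [ceil(m/2), m] */
int min_children(int m) { assert(m > 2); return ceil_half(m); }
int max_children(int m) { assert(m > 2); return m; }

/* Keys of a node other than the root: [ceil(m/2) - 1, m - 1] */
int min_keys(int m) { assert(m > 2); return ceil_half(m) - 1; }
int max_keys(int m) { assert(m > 2); return m - 1; }
```

For m = 5 this gives 3 to 5 children and 2 to 4 keys, which are the limits used in the order-5 examples later in these notes.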
Example (m = 3): [figure omitted]
A B-tree has the following features:
1. The keys are distributed throughout the whole tree;
2. Each key appears in exactly one node;
3. A search may end at a non-leaf node;
4. Search performance is equivalent to a binary search over the complete set of keys;
5. The number of levels is controlled automatically.
Because every non-leaf node other than the root must have at least ceil(m/2) children, a minimum node utilization is guaranteed. The worst-case search performance is as follows:
Worst-case B-tree search performance [formula image omitted]
Here m is the maximum number of subtrees of a non-leaf node and N is the total number of keys.
A search visits at most about log base ceil(m/2) of N nodes and, with binary search inside each node, makes about log2(m) comparisons per node, so the total is O(log2 N). The performance of a B-tree is therefore always equivalent to binary search regardless of the value of m, and a B-tree has no balancing problem.
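The bound can be worked out from the minimum occupancy rules. The derivation below is a sketch under stated assumptions, not the formula from the original figure: the root is on level 1, keys are stored on levels 1 to h, and the null (failed-search) pointers sit on level h+1.

```latex
% Assumptions: root on level 1, keys on levels 1..h,
% null (failed-search) pointers on level h+1, N keys in total.
\begin{align*}
\text{nodes on level } k \;(k \ge 2) &\;\ge\; 2\left\lceil m/2 \right\rceil^{k-2} \\
N + 1 \;=\; \text{null pointers on level } h+1 &\;\ge\; 2\left\lceil m/2 \right\rceil^{h-1} \\
h &\;\le\; \log_{\lceil m/2 \rceil}\!\left(\frac{N+1}{2}\right) + 1 \\
\text{comparisons} &\;\le\; h \cdot \left\lceil \log_2 m \right\rceil \;=\; O(\log_2 N)
\end{align*}
```

The first line holds because the root has at least 2 children and every other internal node has at least ceil(m/2); the second uses the fact that a tree with N keys has exactly N + 1 null pointers.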
Because of the m/2 lower bound, inserting into a node that is already full splits it into two nodes, each holding about m/2 keys. When a deletion leaves a node with fewer than m/2, it is merged with a sibling node.
B-trees locate records quickly and are often used where search time is critical. For example:
1. A B-tree index is a method for accessing and searching the records (key values) stored in a database file.
2. Data on hard disks is also commonly organized in B-tree nodes. Accessing a data element on a hard disk takes many times longer than accessing it in memory, because the disk's mechanical components read and write far more slowly than purely electronic memory. Compared with a binary tree, in which each node has two branches, a B-tree uses nodes with many branches (subtrees) to reduce the number of nodes visited when fetching a record, and so saves access time.
In most systems, the running time of a B-tree algorithm is determined mainly by the number of disk reads and writes, so transferring as much information as possible in each read or write speeds the algorithm up.
A B-tree node is generally the size of one disk page, so the number of keys and children in a node depends on the disk page size.
Note:
① For a large B-tree on disk, the number of children of each node (that is, the degree of the node) is usually between 50 and 2000.
② A B-tree of degree m is called a B-tree of order m.
③ Choosing a larger node degree reduces the height of the tree and therefore the number of disk accesses needed to find any key.
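To make the link between page size and node degree concrete, here is a minimal sketch that computes the largest order m for which one node fits in one page. The layout (a fixed header, m-1 keys, m child pointers) and all the byte sizes are assumptions chosen for illustration, not values from the original notes.

```c
#include <assert.h>
#include <stddef.h>

/* Largest order m such that one node fits in one page, assuming the layout
 *   header + (m - 1) * key_size + m * ptr_size <= page_size
 * All sizes are in bytes and are illustrative assumptions. */
int max_order_for_page(size_t page_size, size_t header,
                       size_t key_size, size_t ptr_size)
{
    assert(key_size + ptr_size > 0);
    if (page_size < header + ptr_size)
        return 0;                    /* not even one child pointer fits */
    /* Solving the inequality for m gives
     *   m <= (page_size - header + key_size) / (key_size + ptr_size) */
    return (int)((page_size - header + key_size) / (key_size + ptr_size));
}
```

With a 4096-byte page, a 16-byte header, and 8-byte keys and pointers, this gives an order of 255, which falls inside the 50 to 2000 range mentioned in note ①.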
A B-tree of order 1001 with a height of 3 is given as an example.
Note:
① Each node contains 1000 keys, so the third level has more than 1 million leaf nodes, and these leaf nodes can hold more than 1 billion keys.
② The number inside each node in the figure is the number of keys it holds.
③ The root node can generally be kept in main memory at all times, so finding any key in this B-tree requires at most two accesses to external storage.
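The arithmetic behind note ① can be checked with a short sketch. It assumes every node is completely full (1000 keys and therefore 1001 children); the function names are hypothetical.

```c
#include <assert.h>

/* Nodes on a given level (root = level 1) when every node is full,
 * i.e. holds keys_per_node keys and has keys_per_node + 1 children. */
unsigned long long nodes_on_level(unsigned keys_per_node, unsigned level)
{
    assert(level >= 1);
    unsigned long long nodes = 1;
    for (unsigned l = 1; l < level; l++)
        nodes *= (unsigned long long)keys_per_node + 1;
    return nodes;
}

/* Total keys stored on levels 1..levels of such a tree. */
unsigned long long total_keys(unsigned keys_per_node, unsigned levels)
{
    unsigned long long total = 0;
    for (unsigned l = 1; l <= levels; l++)
        total += nodes_on_level(keys_per_node, l) * keys_per_node;
    return total;
}
```

With 1000 keys per node, the third level has 1001 * 1001 = 1,002,001 nodes holding 1,002,001,000 keys, and the whole tree holds 1,003,003,000 keys.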
B-tree storage structure
#define Max 1000 // maximum number of keys in a node: Max = m-1, where m is the order of the B-tree
#define Min 500 // minimum number of keys in a non-root node: Min = ceil(m/2)-1
typedef int KeyType; // KeyType should be defined by the user
typedef struct node { // the pointer to the record represented by each key is omitted from the node definition
  int keynum; // number of keys currently in the node
  KeyType key[Max+1]; // the key vector is key[1..keynum]; key[0] is not used
  struct node *parent; // points to the parent node
  struct node *son[Max+1]; // the child pointer vector is son[0..keynum]
} BTreeNode;
typedef BTreeNode *BTree;
Note:
For simplicity, the auxiliary information fields are omitted. In practice, what is stored alongside each key is usually not the auxiliary information itself but a pointer to another disk page. That page holds the record the key represents, and the auxiliary information is stored in the record.
Some B-tree variants (such as the B+ tree introduced in Chapter 10) store all auxiliary information in the leaf nodes, while the internal nodes (the root may also be regarded as an internal node) store only keys and child pointers, with no pointers to auxiliary information, so that the degree of the internal nodes is as large as possible.
Basic operations on the B-tree
1. B-tree search
(1) B-tree search method
Finding a given key in a B-tree is similar to searching a binary search tree. The difference is that at each node the choice of downward path is not two-way but (keynum + 1)-way.
Within a node, the ordered key vector key[1..keynum] is searched by sequential search or binary search. If the key K is found in a node, the search returns the address of the node and the position of K in key[1..keynum]. Otherwise K must lie between some key[i] and key[i+1], so the node pointed to by son[i] is read from disk and the search continues there. This repeats until K is found in some node, or until the search fails at a leaf node.
[Example] The dotted line on the left shows the search for key I, which fails at the NULL pointer between H and K in a leaf node. The dotted line on the right shows the search for key S, which succeeds and returns the address of the node containing S together with its position, 2, in key[1..keynum].
(2) B-tree search algorithm
BTreeNode *SearchBTree(BTree T, KeyType K, int *pos)
{ // Search for key K in B-tree T. On success, return the address of the node containing K and store the position of K in that node in *pos
  // On failure, return NULL; *pos is then undefined
  int i;
  T->key[0] = K; // set the sentinel, then search key[1..keynum] sequentially
  for (i = T->keynum; K < T->key[i]; i--); // scan from the back for the first key less than or equal to K
  if (i > 0 && T->key[i] == K) { // found: return T and i
    *pos = i;
    return T;
  } // K is not in this node, but T->key[i] < K < T->key[i+1], so the next node to search is
    // T->son[i]
  if (!T->son[i]) // *T is a leaf; if K is not found in a leaf, the whole search fails
    return NULL;
    // To find the insertion position of K instead, set *pos = i and return T here. See the insert operation below.
  DiskRead(T->son[i]); // read the next node to be searched from disk into memory
  return SearchBTree(T->son[i], K, pos); // recursively search the subtree T->son[i]
}
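The listing above scans each node sequentially with a sentinel. The text also mentions binary search within a node, which can be sketched as below. This is an illustrative in-memory variant, not the author's code: the order-5 node layout with char keys, the names find_slot and search, and the absence of any disk read are all assumptions made to keep the sketch self-contained.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_KEYS 4                       /* order 5, an illustrative choice */

typedef struct Node {
    int keynum;                          /* keys in use: key[1..keynum] */
    char key[MAX_KEYS + 1];              /* key[0] is unused */
    struct Node *son[MAX_KEYS + 1];      /* son[0..keynum]; all NULL in a leaf */
} Node;

/* Binary search inside one node: returns the largest i in [0, keynum]
 * with key[i] <= k (i == 0 means k is smaller than every key). */
static int find_slot(const Node *t, char k)
{
    int lo = 1, hi = t->keynum;
    while (lo <= hi) {
        int mid = lo + (hi - lo) / 2;
        if (t->key[mid] <= k) lo = mid + 1;
        else                  hi = mid - 1;
    }
    return hi;
}

/* Iterative search: returns the node holding k and sets *pos,
 * or returns NULL if k is not in the tree (*pos is left unchanged). */
const Node *search(const Node *t, char k, int *pos)
{
    assert(pos != NULL);
    while (t != NULL) {
        int i = find_slot(t, k);
        if (i > 0 && t->key[i] == k) { *pos = i; return t; }
        t = t->son[i];                   /* descend; NULL below a leaf */
    }
    return NULL;
}
```

The loop replaces the recursion of the original listing, and because the node is never written to (no sentinel in key[0]), it can be passed as const.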
(3) Time cost of the search operation
Searching a B-tree involves two basic steps:
① Finding the node in the B-tree. This involves the DiskRead operation, so it is an external search;
② Finding the key within the node, which is an internal search.
The search time is therefore:
① External search: the number of disk reads does not exceed the tree height h, so the time is O(h);
② Internal search: the number of keys in each node is keynum < m (m is the order of the B-tree), so over at most h nodes the time is O(mh).
Note:
① In practice, the external search time may be far greater than the internal search time.
② When a B-tree is used as a database file, the root node is read into memory as soon as the file is opened and stays there until the file is closed, so the time to read the root node is not counted.
B-tree insertion and deletion
Next we use another example to describe the basic insert and delete operations of the B-tree. Before that, let us briefly review the properties of a B-tree of order m:
- Each node in the tree has at most m children.
- Except for the root and the leaf nodes, every node has at least ceil(m/2) children (ceil(x) is the ceiling function, which rounds up);
- If the root is not a leaf node, it has at least two children. (In the special case where the root has no children, the root is itself a leaf and the whole tree consists of that single node);
- All leaf nodes appear on the same level. Leaf nodes contain no key information (they can be regarded as external nodes, that is, nodes where a search fails; in practice these nodes do not exist and all pointers to them are null);
- Each non-terminal node holds n keys and has the form (n, P0, K1, P1, K2, P2, ..., Kn, Pn), where:
a) Ki (i = 1...n) are the keys, sorted in ascending order: K(i-1) < Ki.
b) Pi is a pointer to the root of a subtree, and every key in the subtree pointed to by P(i-1) is less than Ki but greater than K(i-1).
c) The number of keys n in every node other than the root must satisfy ceil(m/2)-1 <= n <= m-1 (leaf nodes must satisfy this too; only the root is exempt).
OK. Next we use a B-tree of order 5 as the example (m = 5, so a node has at most 5 children, and an internal node other than the root has at least 3 children):
Note: the key count range (2 to 4) applies to all non-root nodes, leaf nodes included, while the child count range (3 to 5) applies to internal nodes other than the root and the leaves. The root, if it is not a leaf, must have at least two children; otherwise the tree would degenerate into a linear search tree.
The keys are uppercase letters, ordered alphabetically.
The node is defined as follows:
typedef struct {
  int count; // number of keys currently in the node
  ItemType key[4]; // array that stores the keys
  long branch[5]; // array of pseudo-pointers (record numbers), which makes merging and splitting easier to decide
} NodeType;
Insert operation
To insert a key, first check whether it already exists in the B-tree. If it does not, the search ends at a leaf node, and the new key is inserted into that leaf. If the leaf has enough room, the keys greater than the new key are shifted right to make space for it. If the leaf is full, the node is split: half of the keys move into a new right sibling and the middle key moves up into the parent node (if the parent is full as well, it must be split in turn). Whenever keys in a node are shifted right, the associated pointers must be shifted right with them. If a key has to be inserted into the root and the root is full, the root is split too: its middle key moves up into a new root, and the tree grows one level taller.
1. Let's go through it step by step with an example. Insert the following letters into an empty B-tree (for non-root nodes, merge when the key count is too small, that is, less than 2, and split when it is too large, that is, more than 4): C N G A H E K Q M F W L T Z D P R X Y S. At first the node has enough room, so the first four letters go into the same node, which then holds A, C, G, N.
2. When we try to insert H, the node has no room left, so it is split into two nodes and the middle key G moves up into a new root. In the implementation, A and C stay in the current node, while H and N go into the new right sibling.
3. Inserting E, K, and Q requires no split.
4. Inserting M requires one split. M happens to be the middle key, so it moves up into the parent node.
5. Inserting F, W, L, and T requires no split.
6. When Z is inserted, the rightmost leaf is full, so it is split and the middle key T moves up into the parent node. Note that moving the middle key up keeps the tree balanced; each of the two nodes produced by the split holds two keys.
7. Inserting D splits the leftmost leaf. D happens to be the middle key, so it moves up into the parent node. The letters P, R, X, and Y are then inserted one after another without any split (remember that a node in this tree can have at most five children).
8. Finally, when S is inserted, the node containing N, P, Q, and R must be split and the middle key Q moves up into the parent node. The parent is already full, so it must be split too, and its middle key M moves up into a new root. Note that the pointer that used to be the third pointer of the parent now belongs to the node containing D and G. This completes the insertion example. The delete operation, described next, has more cases to consider than insertion.
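The walkthrough above can be reproduced with a short sketch. This is not the author's implementation: it uses real pointers instead of the record-number pseudo-pointers of NodeType, char keys, and one spare key and child slot per node so that a node can briefly hold 5 keys before it is split. The names new_node, split_child, insert_into, and btree_insert are hypothetical, and memory is never freed.

```c
#include <assert.h>
#include <stdlib.h>

#define ORDER 5                     /* m: maximum number of children */
#define MAX_KEYS (ORDER - 1)

typedef struct Node {
    int count;                      /* keys currently stored */
    char key[MAX_KEYS + 1];         /* one spare slot for the overflow key */
    struct Node *child[ORDER + 1];  /* one spare slot; all NULL in a leaf */
} Node;

static Node *new_node(void)
{
    Node *n = calloc(1, sizeof *n);
    assert(n != NULL);
    return n;
}

/* Split the overfull child p->child[i]: its middle key moves up into p,
 * and the keys to the right of the middle move into a new right sibling. */
static void split_child(Node *p, int i)
{
    Node *left = p->child[i], *right = new_node();
    int mid = left->count / 2;      /* index 2 when the node holds 5 keys */

    right->count = left->count - mid - 1;
    for (int j = 0; j < right->count; j++)
        right->key[j] = left->key[mid + 1 + j];
    for (int j = 0; j <= right->count; j++)
        right->child[j] = left->child[mid + 1 + j];

    /* Shift the parent's keys and pointers right to open slot i. */
    for (int j = p->count; j > i; j--) {
        p->key[j] = p->key[j - 1];
        p->child[j + 1] = p->child[j];
    }
    p->key[i] = left->key[mid];
    p->child[i + 1] = right;
    p->count++;
    left->count = mid;
}

/* Insert k below n; n may end up with ORDER keys for the caller to split. */
static void insert_into(Node *n, char k)
{
    int i = 0;
    while (i < n->count && n->key[i] < k) i++;
    if (i < n->count && n->key[i] == k) return;   /* already present */
    if (n->child[0] == NULL) {                    /* leaf: shift right, place */
        for (int j = n->count; j > i; j--) n->key[j] = n->key[j - 1];
        n->key[i] = k;
        n->count++;
    } else {
        insert_into(n->child[i], k);
        if (n->child[i]->count > MAX_KEYS) split_child(n, i);
    }
}

/* Insert k and return the (possibly new) root. */
Node *btree_insert(Node *root, char k)
{
    if (root == NULL) root = new_node();
    insert_into(root, k);
    if (root->count > MAX_KEYS) {                 /* root overflow: grow upward */
        Node *top = new_node();
        top->child[0] = root;
        split_child(top, 0);
        root = top;
    }
    return root;
}
```

Starting from a NULL root and inserting C N G A H E K Q M F W L T Z D P R X Y S in that order ends with M alone in the root, above the two nodes D G and Q T, as in step 8. Splitting after the overflow (rather than before descending) is a design choice that matches the order in which the text describes the steps.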
Delete operation
First, find the key to be deleted in the B-tree. If it exists, remove it from its node. If the key has left and right child nodes, move an adjacent key up from a child into its place and then handle the node that key was taken from as described below. If it has no children, simply remove it and then handle its node as described below.
After the key has been removed and any replacement key moved, if the number of keys in a node is less than ceil(m/2)-1, check whether an adjacent sibling has keys to spare, that is, holds more than ceil(m/2)-1 keys. (Recall point c) of the fifth property listed earlier: the number of keys n in every node other than the root, leaf nodes included, must satisfy ceil(m/2)-1 <= n <= m-1, where m is the maximum number of children. For the order-5 B-tree in this section, 2 <= n <= 4.) If a sibling has keys to spare, borrow one through the parent node so that the condition holds again. If the adjacent siblings hold only the minimum, so that lending a key would leave them with fewer than ceil(m/2)-1, merge the node with an adjacent sibling into a single node. The following examples show how this works.
Take the order-5 B-tree built by the insertions above as the example (a node has at most m = 5 children, so the minimum number of keys is ceil(m/2)-1 = 2; in other words, merge when the key count falls below 2 and split when it exceeds 4). Delete H, T, R, and E in that order.
1. First delete H. H is found in a leaf node that holds 3 keys, which is more than the minimum ceil(m/2)-1 = 2, so the operation is simple: remove H and shift the keys after it forward (K moves into H's old position and L into K's).
2. Next delete T. T is not in a leaf node but in an internal node. We find its successor W (the next key in ascending alphabetical order), move W up into T's position, and delete W from the child node that contained it. After W is removed, that child still holds more than 2 keys, so no merge is needed.
3. Next delete R. R is in a leaf node, but that leaf holds only 2 keys, so deleting one leaves a single key, which is below the minimum ceil(5/2)-1 = 2. As described earlier, if an adjacent sibling has keys to spare (more than ceil(5/2)-1 = 2), a key can be borrowed through the parent: a key moves down from the parent into the deficient node, and the first or last key of the sibling moves up into the parent (this resembles a left rotation in a red-black tree). In this example the right sibling has keys to spare (3 keys, more than 2), so S first shifts forward into R's old position and W moves down from the parent into the leaf, taking S's old position. Then X moves up from the right sibling into the parent and is removed from the sibling, whose remaining keys shift forward.
4. Finally delete E. This causes more trouble: the node containing E holds exactly the minimum number of keys (ceil(5/2)-1 = 2), and so do its adjacent siblings, so after the deletion the node is deficient and no sibling can lend a key. The node must therefore be merged with an adjacent sibling. First the key in the parent that lies between the two nodes being merged moves down, and then the two nodes are combined into one. In this example, D moves down from the parent into the node that has lost E and now holds only F, and that node is then merged with the adjacent sibling containing A and C.
5. You might think the deletion is now complete, but it is not: the parent node now holds only one key, G, which is below the minimum (the number of keys n in a non-root node, leaves included, must satisfy 2 <= n <= 4, and here n = 1). If a sibling of this deficient node had keys to spare, a key could be borrowed through the parent. Suppose the right sibling (containing Q and X) had one more key (a further key to the right of Q): M would then move down into the deficient node and Q would move up into M's position, and Q's left subtree, the node containing N and P, would become the right subtree of M. In this example, however, there is nothing to borrow, so the node can only be merged with its sibling, with the root's only key M moving down into the merged node. The height of the tree decreases by one level.
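The two repair moves used in steps 3 and 4, borrowing through the parent and merging with a sibling, can be sketched as follows. This is an assumption-laden illustration rather than the author's code: it uses real pointers and char keys, handles only the right-sibling case of each move, expects heap-allocated nodes, and the names borrow_from_right and merge_with_right are hypothetical.

```c
#include <assert.h>
#include <stdlib.h>

#define ORDER 5                          /* m: maximum number of children */
#define MAX_KEYS (ORDER - 1)
#define MIN_KEYS ((ORDER + 1) / 2 - 1)   /* ceil(m/2) - 1 = 2 */

typedef struct Node {
    int count;                           /* keys currently stored */
    char key[MAX_KEYS + 1];
    struct Node *child[ORDER + 1];       /* all NULL in a leaf */
} Node;

/* Borrow through the parent from the right sibling: the separator p->key[i]
 * moves down to the end of child i, and the sibling's first key moves up. */
void borrow_from_right(Node *p, int i)
{
    Node *n = p->child[i], *r = p->child[i + 1];
    assert(r->count > MIN_KEYS);         /* the sibling must have a key to spare */
    n->key[n->count] = p->key[i];
    n->child[n->count + 1] = r->child[0];
    n->count++;
    p->key[i] = r->key[0];
    for (int j = 1; j < r->count; j++)  r->key[j - 1] = r->key[j];
    for (int j = 1; j <= r->count; j++) r->child[j - 1] = r->child[j];
    r->count--;
}

/* Merge the separator p->key[i] and child i+1 into child i, then close the
 * gap in the parent. The emptied sibling is freed (nodes must come from
 * malloc/calloc). The parent may itself become deficient afterwards. */
void merge_with_right(Node *p, int i)
{
    Node *n = p->child[i], *r = p->child[i + 1];
    assert(n->count + 1 + r->count <= MAX_KEYS);
    n->key[n->count++] = p->key[i];
    for (int j = 0; j < r->count; j++)  n->key[n->count + j] = r->key[j];
    for (int j = 0; j <= r->count; j++) n->child[n->count + j] = r->child[j];
    n->count += r->count;
    for (int j = i + 1; j < p->count; j++) {
        p->key[j - 1] = p->key[j];
        p->child[j] = p->child[j + 1];
    }
    p->count--;
    free(r);
}
```

Applied to step 3 (parent Q W over the leaves N P, S, and X Y Z), borrow_from_right on the middle child leaves S W in that leaf, X in the parent, and Y Z in the sibling. Applied to step 4 (parent D G over A C, F, and K L), merge_with_right on the first child leaves A C D F in one leaf and only G in the parent, which is then deficient as step 5 describes.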
To discuss the details of deletion further, let's look at another example:
Here is a different B-tree of order 5. Let's try to delete C.
The key D from the right child of the deleted key C moves up into C's position. After D moves up, however, that child node is left with only one key.
The sibling adjacent to the node containing E holds only the minimum number of keys (2), so no key can be borrowed through the parent and the only option is to merge. The left sibling node (the one containing B) and the node containing E are therefore merged into one node.
Now the node containing F holds only one key. This time its adjacent sibling has keys to spare (3 keys, more than the minimum of 2), so a key can be borrowed through the parent: J moves down from the parent into the deficient node (any keys already there shift to make room), and the first (or last) key of the adjacent sibling moves up into the parent, with the sibling's remaining keys shifting forward (or backward) to close the gap. Note that the node containing K and L, which used to hang to the left of M, now hangs to the right of J. Every node now satisfies the B-tree properties.
From the operations above we can see that the number of keys n in every node other than the root (leaf nodes included) satisfies ceil(m/2)-1 <= n <= m-1, that is, 2 <= n <= 4 here, which confirms the earlier statement. The deletion is now complete.