BST
A BST is a binary search tree:
1. Every non-leaf node has at most two children (Left and Right);
2. Every node stores one keyword;
3. The left pointer of a non-leaf node points to the subtree whose keywords are smaller than its own, and the right pointer points to the subtree whose keywords are larger;
Example: [figure: a binary search tree]
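The three rules above can be sketched in a few lines of Python. This is only an illustration: the names BSTNode, bst_insert and bst_search are made up for this example, keywords are assumed to be integers, and duplicate keywords are simply ignored.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class BSTNode:
    key: int
    left: Optional["BSTNode"] = None    # subtree with smaller keywords
    right: Optional["BSTNode"] = None   # subtree with larger keywords


def bst_insert(root: Optional[BSTNode], key: int) -> BSTNode:
    """Insert key and return the (possibly new) root; duplicates are ignored."""
    if root is None:
        return BSTNode(key)
    if key < root.key:
        root.left = bst_insert(root.left, key)
    elif key > root.key:
        root.right = bst_insert(root.right, key)
    return root


def bst_search(root: Optional[BSTNode], key: int) -> bool:
    """Walk down from the root: left for smaller keywords, right for larger."""
    node = root
    while node is not None:
        if key == node.key:
            return True
        node = node.left if key < node.key else node.right
    return False


# Build a small tree from an arbitrary sample sequence.
root = None
for k in [8, 3, 10, 1, 6, 14]:
    root = bst_insert(root, k)
```

Here bst_search(root, 6) finds the keyword after visiting 8, 3 and 6, while bst_search(root, 7) falls off the tree below 6 and reports a miss.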
B-tree
A B-tree is a multiway search tree (not a binary one):
1. Every non-leaf node has at most M children, where M > 2;
2. The root node has between 2 and M children;
3. Every non-leaf node other than the root has between ceil(M/2) and M children;
4. Every node other than the root holds at least ceil(M/2)-1 and at most M-1 keywords; the root holds at least one;
5. In a non-leaf node, the number of keywords = the number of child pointers - 1;
6. Keywords of a non-leaf node: K[1], K[2], ..., K[M-1], with K[i] < K[i+1];
7. Pointers of a non-leaf node: P[1], P[2], ..., P[M], where P[1] points to the subtree whose keywords are less than K[1], P[M] points to the subtree whose keywords are greater than K[M-1], and every other P[i] points to the subtree whose keywords lie in the open interval (K[i-1], K[i]);
8. All leaf nodes are on the same level;
Example (M = 3): [figure: a B-tree of order 3]
B-tree search: start at the root node and binary-search the node's ordered keyword sequence. On a hit, the search ends. Otherwise, descend into the child whose range contains the keyword, and repeat until the child pointer is null or a leaf node has been searched without a hit;
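The search procedure can be sketched as follows. This is a minimal illustration rather than a full B-tree: BTreeNode and btree_search are hypothetical names, keywords are assumed to be integers, and the sample tree (M = 3) is built by hand instead of through insertion.

```python
from bisect import bisect_left
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class BTreeNode:
    keys: List[int]                                            # sorted keywords
    children: List["BTreeNode"] = field(default_factory=list)  # empty for a leaf


def btree_search(node: Optional[BTreeNode], key: int) -> bool:
    while node is not None:
        i = bisect_left(node.keys, key)        # binary search inside the node
        if i < len(node.keys) and node.keys[i] == key:
            return True                        # hit, possibly at a non-leaf node
        if not node.children:
            return False                       # searched a leaf without a hit
        node = node.children[i]                # child i covers (keys[i-1], keys[i])
    return False


# Hand-built sample tree with M = 3: at most 2 keywords and 3 children per node.
sample = BTreeNode(
    [17, 35],
    [BTreeNode([8, 12]), BTreeNode([26, 30]), BTreeNode([65, 87])],
)
```

With this tree, btree_search(sample, 17) ends at the root, btree_search(sample, 30) ends at a leaf, and btree_search(sample, 31) misses after searching the middle leaf.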
B-tree features:
1. The set of keywords is distributed across the entire tree;
2. Each keyword appears in exactly one node;
3. The search may end at a non-leaf node;
4. The search performance is equivalent to performing a binary search over the complete set of keywords;
5. The tree controls its own depth automatically;
Because every non-leaf node other than the root is required to have at least M/2 children, a minimum node utilization is guaranteed, and the worst-case search cost is O(log N), where N is the total number of keywords.
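To make the depth bound concrete, the following sketch computes the largest number of levels a B-tree can have for a given number of keywords. It assumes the constraints listed in the definition (the root has at least 2 children, every other non-leaf node at least ceil(M/2) children, every non-root node at least ceil(M/2)-1 keywords), from which a tree with h levels holds at least 2 * t**(h-1) - 1 keywords for t = ceil(M/2). The function name max_levels is hypothetical.

```python
def max_levels(n_keys: int, m: int) -> int:
    """Largest number of levels (root = level 1) of an order-m B-tree with n_keys keywords."""
    if n_keys < 1 or m < 3:
        raise ValueError("need n_keys >= 1 and m >= 3")
    t = (m + 1) // 2      # ceil(m / 2): minimum children of a non-root, non-leaf node
    levels = 1
    # A tree with (levels + 1) levels needs at least 2 * t**levels - 1 keywords.
    while 2 * t ** levels - 1 <= n_keys:
        levels += 1
    return levels
```

For M = 3 the bound grows like log2(N): seven keywords allow at most three levels. For a large M the tree stays very shallow even for a huge keyword set, which is the property the later disk-access discussion relies on.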
B+ tree
The B+ tree is a variant of the B-tree and is also a multiway search tree:
1. Its definition is basically the same as that of the B-tree, except that:
2. A non-leaf node has the same number of subtree pointers and keywords;
3. The subtree pointer P[i] of a non-leaf node points to the subtree whose keywords lie in the half-open interval [K[i], K[i+1]) (the B-tree uses an open interval);
4. A chain pointer is added to every leaf node;
5. All keywords appear in the leaf nodes;
Example (M = 3): [figure: a B+ tree of order 3]
Searching a B+ tree is basically the same as searching a B-tree. The difference is that a B+ tree search hits only when it reaches a leaf node (a B-tree search can hit at a non-leaf node). Its performance is likewise equivalent to performing a binary search over the full set of keywords;
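A sketch of this search, following the variant defined above in which a non-leaf node has as many keywords as child pointers and child i covers [K[i], K[i+1]). BPlusNode and bplus_search are hypothetical names, keywords are assumed to be integers, and the sample tree is built by hand.

```python
from bisect import bisect_left, bisect_right
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class BPlusNode:
    keys: List[int]
    children: List["BPlusNode"] = field(default_factory=list)  # empty for a leaf
    next_leaf: Optional["BPlusNode"] = None                    # chain pointer (leaves only)


def bplus_search(root: BPlusNode, key: int) -> bool:
    node = root
    while node.children:                        # non-leaf nodes only route the search
        i = bisect_right(node.keys, key) - 1    # last i with keys[i] <= key
        if i < 0:
            return False                        # smaller than every keyword in the tree
        node = node.children[i]
    i = bisect_left(node.keys, key)             # a hit can only happen in a leaf
    return i < len(node.keys) and node.keys[i] == key


# Hand-built sample with M = 3: three chained leaves under one root.
leaf_c = BPlusNode([20, 26, 27])
leaf_b = BPlusNode([10, 15, 18], next_leaf=leaf_c)
leaf_a = BPlusNode([5, 8, 9], next_leaf=leaf_b)
bplus_root = BPlusNode([5, 10, 20], [leaf_a, leaf_b, leaf_c])
```

Note that the keyword 10 appears both in the root and in a leaf; bplus_search(bplus_root, 10) still descends to the leaf before reporting the hit.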
Features of the B+ tree:
1. All keywords appear in the linked list of leaf nodes (a dense index), and the keywords in the linked list are in sorted order;
2. A search never hits at a non-leaf node;
3. The non-leaf nodes act as an index over the leaf nodes (a sparse index), and the leaf nodes act as the data layer that stores the (keyword) data;
4. It is better suited to file index systems;
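Because the leaves form an ordered linked list, a sequential scan never needs to go back up the tree. The sketch below shows this with a hypothetical Leaf class and scan function; for brevity it starts from the leftmost leaf, whereas a real implementation would first locate the starting leaf with a tree search.

```python
from dataclasses import dataclass
from typing import Iterator, List, Optional


@dataclass
class Leaf:
    keys: List[int]                  # sorted keywords stored in this leaf
    next: Optional["Leaf"] = None    # chain pointer to the next leaf


def scan(start: Optional[Leaf], lo: int, hi: int) -> Iterator[int]:
    """Yield every keyword k with lo <= k <= hi by following the leaf chain."""
    leaf = start
    while leaf is not None:
        for k in leaf.keys:
            if k > hi:
                return               # the chain is sorted, so nothing further can match
            if k >= lo:
                yield k
        leaf = leaf.next


# Three chained leaves, built by hand.
third = Leaf([20, 26, 27])
second = Leaf([10, 15, 18], third)
first = Leaf([5, 8, 9], second)
in_range = list(scan(first, 8, 20))
```

The scan crosses two leaf boundaries and stops as soon as it sees a keyword above the upper bound.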
Advantages of the B+ tree over the B-tree:
1. The B-tree is only suitable for random search, whereas the B+ tree supports both random search and sequential search, which is why it is widely used in practice.
2. Why is the B+ tree better suited than the B-tree to operating-system file indexes and database indexes in practice?
1) The disk read/write cost of the B+ tree is lower.
The internal nodes of a B+ tree hold no pointers to the keywords' records, so they are smaller than B-tree nodes. If all the keywords of one internal node are stored in the same disk block, the block can hold more keywords. One read therefore brings more of the keywords to be searched into memory, and the number of I/O reads and writes drops.
For example, assume that a disk block holds 16 bytes, a keyword takes 2 bytes, and a pointer to a keyword's record takes 2 bytes. An internal node of a B-tree of order 9 (a node can hold up to eight keywords) then needs two disk blocks, while the corresponding internal node of a B+ tree needs only one, because the record pointers are kept in the leaf nodes. When the internal node has to be read into memory, the B-tree spends one more disk-block lookup than the B+ tree (on a disk, that is the platter rotation time). (The difference in space utilization can be pictured this way: suppose a region of memory can hold 10 nodes. If it stores B-tree nodes, those 10 nodes can index at most 10 keys. If it stores B+ tree nodes, each of the 10 nodes is purely an index, and each index entry can point to a linked list of leaves.)
2) The query efficiency of the B+ tree is more stable.
Because a non-leaf node never points to the file content itself and is only an index to the keywords in the leaf nodes, every keyword search must follow a path from the root node to a leaf node. All keyword queries have the same path length, so every lookup has comparable query efficiency.
3. The biggest differences between the B+ tree and the B-tree are:
1) In a B-tree, keywords and records are stored together, and the leaf nodes can be regarded as external nodes that carry no information. The non-leaf nodes of a B+ tree hold only keywords and pointers to the next level of nodes; records are placed only in the leaf nodes.
2) In a B-tree, the closer a record is to the root node, the faster it is found, and finding the keyword is enough to establish that the record exists. In a B+ tree, the search time for every record is basically the same: the search must go from the root node down to a leaf node and then compare keywords within that leaf. From this perspective the B-tree looks faster, but in practice the B+ tree performs better. Because the non-leaf nodes of a B+ tree store no actual data, each node holds more elements than a B-tree node and the tree is shorter than the B-tree, which reduces the number of disk accesses. Although a B+ tree needs more comparisons than a B-tree to find a record, one disk access costs about as much as hundreds of in-memory comparisons, so the B+ tree is likely to be faster in practice. In addition, the leaf nodes of a B+ tree are linked by pointers, which makes sequential traversal easy (for example, listing all files in a directory or all records in a table). This is why many databases and file systems use the B+ tree.
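The disk-block arithmetic from the 16-byte example in point 2 can be checked with a few lines of Python. The constant names are made up for this sketch, the 2-byte record pointer is an assumption about what the example intends, and child pointers are left out of the count just as they are in the example.

```python
BLOCK_BYTES = 16        # size of one disk block in the example
KEY_BYTES = 2           # size of one keyword
RECORD_PTR_BYTES = 2    # assumed size of one pointer to a keyword's record
KEYS_PER_NODE = 8       # a B-tree of order 9 holds at most 8 keywords per node


def blocks_needed(bytes_per_entry: int) -> int:
    """Disk blocks needed to store one full internal node (integer ceiling division)."""
    total = KEYS_PER_NODE * bytes_per_entry
    return -(-total // BLOCK_BYTES)


btree_blocks = blocks_needed(KEY_BYTES + RECORD_PTR_BYTES)  # keywords plus record pointers
bplus_blocks = blocks_needed(KEY_BYTES)                     # keywords only
```

Under these assumptions the B-tree node occupies 32 bytes (two blocks) and the B+ tree node 16 bytes (one block), which is the one extra block read described in the example.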
B* tree
The B* tree is a variant of the B+ tree that adds a pointer to the sibling node in the non-root, non-leaf nodes of the B+ tree;
The B* tree requires every non-leaf node to hold at least (2/3) * M keywords, so the minimum block utilization is 2/3 (instead of the 1/2 of the B+ tree);
B+ tree split: when a node is full, allocate a new node, copy 1/2 of the data from the original node to the new node, and add a pointer to the new node in the parent node. A B+ tree split affects only the original node and the parent node, not the sibling nodes, so no sibling pointer is needed;
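A leaf-level sketch of this split, under the assumption that the split is triggered when an insertion pushes the leaf past its capacity. Leaf and insert_into_leaf are hypothetical names, and updating the parent is left to the caller, who receives the separator keyword and the new node.

```python
from bisect import insort
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class Leaf:
    keys: List[int]
    next: Optional["Leaf"] = None    # chain pointer to the next leaf


def insert_into_leaf(leaf: Leaf, key: int, max_keys: int) -> Optional[Tuple[int, Leaf]]:
    """Insert key; on overflow split the leaf and return (separator, new_leaf)."""
    insort(leaf.keys, key)
    if len(leaf.keys) <= max_keys:
        return None                               # still fits, no split
    mid = len(leaf.keys) // 2
    new_leaf = Leaf(leaf.keys[mid:], leaf.next)   # upper half moves to the new node
    leaf.keys = leaf.keys[:mid]
    leaf.next = new_leaf                          # keep the leaf chain in order
    return new_leaf.keys[0], new_leaf             # the parent adds this keyword and pointer


full = Leaf([10, 20, 30])
result = insert_into_leaf(full, 25, max_keys=3)
```

Only the overflowing leaf and (through the returned pair) its parent are touched; the neighbouring leaf is referenced solely to relink the chain.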
B* tree split: when a node is full and its next sibling is not full, move part of the data to the sibling, insert the keyword into the original node, and finally update the sibling's keyword in the parent node (because the sibling's keyword range has changed). If the sibling is also full, add a new node between the original node and the sibling, copy 1/3 of the data from each into the new node, and add a pointer to the new node in the parent node;
Therefore a B* tree allocates new nodes less often than a B+ tree and achieves higher space utilization;
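The two cases of the B* split can be sketched on plain sorted lists. This is a simplified illustration, not a complete B* tree: bstar_insert is a hypothetical name, nodes are represented as Python lists of integer keywords, the new keyword is assumed to belong to the original node's range, the even three-way redistribution is one possible way to realize the "1/3 each" rule, and fixing up the parent is left to the caller.

```python
from bisect import insort
from typing import List, Optional


def bstar_insert(node: List[int], sibling: List[int], key: int,
                 max_keys: int) -> Optional[List[int]]:
    """Insert key into node (all of whose keywords sort before sibling's).

    Both lists are modified in place. Returns the newly allocated middle
    node if a split was unavoidable, otherwise None.
    """
    insort(node, key)
    if len(node) <= max_keys:
        return None                        # fits without any rebalancing
    if len(sibling) < max_keys:
        sibling.insert(0, node.pop())      # shift the largest keyword to the sibling
        return None                        # caller updates the sibling's keyword in the parent
    # Both nodes are full: spread the keywords over three nodes.
    merged = node + sibling
    base, extra = divmod(len(merged), 3)
    first = base + (1 if extra > 0 else 0)
    second = base + (1 if extra > 1 else 0)
    node[:] = merged[:first]
    middle = merged[first:first + second]  # new node placed between node and sibling
    sibling[:] = merged[first + second:]
    return middle


a, b = [1, 2, 3], [10, 11]
no_split = bstar_insert(a, b, 5, max_keys=3)

c, d = [1, 2, 3, 4, 5, 6], [10, 11, 12, 13, 14, 15]
new_node = bstar_insert(c, d, 7, max_keys=6)
```

In the first call the sibling absorbs the overflow and no node is allocated. In the second call both nodes are full, so the 13 keywords are spread over three nodes of 5, 4 and 4 keywords, each at least 2/3 full.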
Summary
BST: binary search tree. Each node stores only one keyword. A search hits when the keyword is equal, goes to the left child when it is smaller, and goes to the right child when it is larger;
B-tree: multiway search tree. Each node stores between ceil(M/2)-1 and M-1 keywords, and non-leaf nodes store pointers to the child nodes covering each keyword range; every keyword appears exactly once in the whole tree, and a search can hit at a non-leaf node;
B+ tree: built on the B-tree by adding a linked-list pointer to the leaf nodes. All keywords appear in the leaf nodes, and the non-leaf nodes serve as an index over the leaf nodes; a B+ tree search always hits at a leaf node;
B* tree: built on the B+ tree by also adding linked-list pointers to the non-leaf nodes and raising the minimum node utilization from 1/2 to 2/3;