Background: I have recently been reading "High Performance MySQL". In the chapter on creating high-performance indexes, the book says that InnoDB, the MySQL storage engine, uses the B+ tree as its index type. That raises a question: why use the B+ tree as the data structure for a data index, and what advantages does it have over other trees? After reading this article you will understand the principles of these data structures and the scenarios each one suits.
Introduction to Binary Search Trees
A binary search tree, also known as an ordered binary tree, is either an empty tree or a binary tree with the following properties:
- If the left subtree of any node is not empty, every value in the left subtree is smaller than the value of that node.
- If the right subtree of any node is not empty, every value in the right subtree is greater than the value of that node.
- The left and right subtrees of any node are themselves binary search trees.
- No two nodes have equal key values.
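The properties above translate almost directly into code. The following is a minimal Python sketch, not taken from the book; the names Node, insert, contains and in_order are chosen here purely for illustration. It ignores duplicate keys, matching the last property.

```python
class Node:
    def __init__(self, key):
        self.key = key
        self.left = None
        self.right = None


def insert(root, key):
    """Insert key and return the (possibly new) root; duplicates are ignored."""
    if root is None:
        return Node(key)
    if key < root.key:
        root.left = insert(root.left, key)
    elif key > root.key:
        root.right = insert(root.right, key)
    return root


def contains(root, key):
    """Walk down from the root, going left or right by comparison."""
    while root is not None:
        if key == root.key:
            return True
        root = root.left if key < root.key else root.right
    return False


def in_order(root):
    """In-order traversal returns the keys in ascending order."""
    if root is None:
        return []
    return in_order(root.left) + [root.key] + in_order(root.right)


root = None
for k in [50, 30, 70, 20, 40, 60, 80, 30]:  # the second 30 is a duplicate
    root = insert(root, k)
```

Because smaller keys always go left and larger keys go right, an in-order traversal of the tree visits the keys in sorted order.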
Limitations and applications
A binary search tree is built from n nodes inserted in arbitrary order, so in some cases it degenerates into a linear chain of n nodes. For example:
Figure b shows an ordinary binary search tree. Now look at figure a: if the value chosen as the root is the smallest or the largest, the binary search tree degenerates completely into a linear structure. To address this, the AVL tree and the red-black tree were developed. Both are binary search trees with additional constraints imposed to keep them balanced.
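The degeneration is easy to reproduce. The sketch below is illustrative only: the helper names are made up here, and the key count of 200 and the random seed are arbitrary choices. It inserts the same keys once in sorted order and once shuffled into a plain binary search tree with no rebalancing, then compares the heights.

```python
import random


class Node:
    def __init__(self, key):
        self.key = key
        self.left = None
        self.right = None


def insert(root, key):
    """Iterative BST insert with no rebalancing; returns the root."""
    if root is None:
        return Node(key)
    cur = root
    while True:
        if key < cur.key:
            if cur.left is None:
                cur.left = Node(key)
                break
            cur = cur.left
        elif key > cur.key:
            if cur.right is None:
                cur.right = Node(key)
                break
            cur = cur.right
        else:
            break  # duplicate key, ignore
    return root


def height(root):
    """Number of nodes on the longest root-to-leaf path."""
    best, stack = 0, [(root, 1)]
    while stack:
        node, depth = stack.pop()
        if node is None:
            continue
        best = max(best, depth)
        stack.append((node.left, depth + 1))
        stack.append((node.right, depth + 1))
    return best


def build(keys):
    root = None
    for k in keys:
        root = insert(root, k)
    return root


n = 200
sorted_height = height(build(range(1, n + 1)))  # every node hangs to the right

shuffled = list(range(1, n + 1))
random.Random(0).shuffle(shuffled)
shuffled_height = height(build(shuffled))
```

With sorted input the height equals n, so a lookup costs as much as scanning a linked list. The shuffled tree is much shallower, but nothing guarantees that without rebalancing.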
Introduction to AVL Trees
An AVL tree is a binary search tree with a balance condition. Balance is judged by the balance factor (the height difference between a node's left and right subtrees) and restored by rotations, so that the heights of the left and right subtrees of any node differ by at most 1. Unlike the red-black tree, it is a strictly balanced binary tree: the balance condition must hold for every node. Whenever an insert or delete violates the condition, the tree must be rebalanced by rotation, and rotations are expensive. AVL trees are therefore suited to workloads with few insertions and deletions but many lookups.
The figure above shows that, for every node, the heights of the left and right subtrees differ by no more than 1.
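To make "balanced by rotation" concrete, here is a minimal Python sketch of AVL insertion. It is one common textbook formulation rather than any particular library's implementation, and the function names are chosen here for illustration. Deletion is omitted.

```python
class Node:
    def __init__(self, key):
        self.key = key
        self.left = None
        self.right = None
        self.height = 1  # height of the subtree rooted at this node


def h(node):
    return node.height if node else 0


def update(node):
    node.height = 1 + max(h(node.left), h(node.right))


def balance_factor(node):
    return h(node.left) - h(node.right)


def rotate_right(y):
    x = y.left
    y.left = x.right
    x.right = y
    update(y)
    update(x)
    return x


def rotate_left(x):
    y = x.right
    x.right = y.left
    y.left = x
    update(x)
    update(y)
    return y


def insert(node, key):
    """Insert key, rebalance on the way back up, and return the subtree root."""
    if node is None:
        return Node(key)
    if key < node.key:
        node.left = insert(node.left, key)
    elif key > node.key:
        node.right = insert(node.right, key)
    else:
        return node  # duplicate key, nothing to do
    update(node)
    bf = balance_factor(node)
    if bf > 1:  # left subtree too tall
        if balance_factor(node.left) < 0:  # left-right case
            node.left = rotate_left(node.left)
        return rotate_right(node)
    if bf < -1:  # right subtree too tall
        if balance_factor(node.right) > 0:  # right-left case
            node.right = rotate_right(node.right)
        return rotate_left(node)
    return node


def is_balanced(node):
    if node is None:
        return True
    return (abs(balance_factor(node)) <= 1
            and is_balanced(node.left)
            and is_balanced(node.right))


root = None
for k in range(1, 1001):  # sorted input, the worst case for a plain BST
    root = insert(root, k)
```

Even with sorted input the tree stays shallow, because every insertion that pushes a balance factor outside the range -1 to 1 is repaired by one or two rotations.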
Limitations
Because the cost of maintaining such strict balance often outweighs the efficiency gained from it, AVL trees are not used much in practice. More often the red-black tree is used, which pursues local balance rather than very strict overall balance. Of course, if insertions and deletions are infrequent and only lookup performance matters, the AVL tree is a better choice than the red-black tree.
Application
AVL trees are used widely in the Windows NT kernel.
Introduction to Red-Black Trees
A red-black tree is a binary search tree that adds one storage bit to each node to record the node's color, which is either red or black. By constraining how nodes may be colored on any path from the root to a leaf, the red-black tree guarantees that no path is more than twice as long as any other. It is a weakly balanced binary tree: it follows from the balance rules that, for the same number of nodes, an AVL tree is no taller than a red-black tree. Compared with the strict AVL tree it needs fewer rotations, so when a workload mixes many searches, insertions, and deletions, the red-black tree is the usual choice.
Properties
- Each node is either red or black.
- The root node is black.
- Each leaf node (here a leaf means the NIL pointer or NULL node at the end of the tree) is black.
- If a node is red, then both of its children are black.
- For any node, every path from it down to a NIL leaf contains the same number of black nodes.
Each path contains the same number of black nodes.
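One way to internalize these properties is to write a checker for them. The Python sketch below is illustrative: the names are invented here, None stands in for the black NIL leaves, and the checker verifies only the coloring rules, not the key ordering. The three small trees are built by hand for the example.

```python
RED, BLACK = "red", "black"


class Node:
    def __init__(self, key, color, left=None, right=None):
        self.key = key
        self.color = color
        self.left = left
        self.right = right


def black_height(node):
    """Return the black-height of the subtree, or -1 if a property is violated.

    None represents a NIL leaf, which counts as one black node.
    """
    if node is None:
        return 1
    if node.color not in (RED, BLACK):
        return -1
    if node.color == RED:
        for child in (node.left, node.right):
            if child is not None and child.color == RED:
                return -1  # a red node must have black children
    left = black_height(node.left)
    right = black_height(node.right)
    if left == -1 or right == -1 or left != right:
        return -1  # paths disagree on the number of black nodes
    return left + (1 if node.color == BLACK else 0)


def is_red_black(root):
    if root is not None and root.color != BLACK:
        return False  # the root must be black
    return black_height(root) != -1


valid = Node(10, BLACK,
             Node(5, RED, Node(3, BLACK), Node(7, BLACK)),
             Node(20, BLACK, None, Node(30, RED)))
red_red = Node(10, BLACK, Node(5, RED, Node(3, RED)), Node(20, BLACK))
uneven = Node(10, BLACK, Node(5, BLACK, Node(3, BLACK)), Node(20, BLACK))
```

The tree named valid passes all checks, red_red breaks the rule about red nodes having black children, and uneven has paths with different numbers of black nodes.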
Application
- In the STL, which is widely used in C++, both map and set are implemented with red-black trees.
- The well-known Linux process scheduler, the Completely Fair Scheduler, manages process control blocks with a red-black tree. The virtual memory areas of a process are also stored in a red-black tree: each virtual address area corresponds to one node, whose left pointer points to the adjacent lower-address virtual memory area and whose right pointer points to the adjacent higher-address one.
- epoll, the I/O multiplexing implementation, uses a red-black tree to organize and manage socket file descriptors (sockfd), which supports fast insertion, deletion, modification, and lookup.
- Nginx manages its timers with a red-black tree. Because the tree is ordered, the timer closest to expiring can be found quickly.
- TreeMap in Java is implemented with a red-black tree.
B/B+ Tree
Note that "B-tree" simply means "B tree"; the "-" is just a hyphen.
Brief introduction
The B/B+ tree is a balanced multiway search tree designed for disks and other storage devices (unlike a binary tree, each internal node of a B-tree has multiple branches). Compared with a red-black tree holding the same number of nodes, a B/B+ tree is much lower in height (see the performance analysis of the B/B+ tree below). The time of an operation on a B/B+ tree usually consists of disk access time and CPU computation time. The CPU is very fast, so the efficiency of a B-tree operation depends on the number of disk accesses: for the same total number of keywords, the lower the tree, the less time is spent on disk I/O.
Properties of the B-tree
- Every non-leaf node has at most m children, where m > 2;
- The number of children of the root node is in [2, m];
- The number of children of every non-leaf node other than the root is in [⌈m/2⌉, m];
- Every node other than the root stores at least ⌈m/2⌉ - 1 and at most m - 1 keywords (the root stores at least 1);
- Number of keywords in a non-leaf node = number of child pointers - 1;
- Keywords of a non-leaf node: K[1], K[2], ..., K[m-1], with K[i] < K[i+1];
- Pointers of a non-leaf node: P[1], P[2], ..., P[m], where P[1] points to the subtree whose keywords are less than K[1], P[m] points to the subtree whose keywords are greater than K[m-1], and every other P[i] points to the subtree whose keywords lie in the open interval (K[i-1], K[i]);
- All leaf nodes are on the same level.
The figure shows only a simple B-tree; in a real B-tree each node holds many keywords. Taking the node 35 in the figure as an example, 35 is a key (index), and the small black block is a pointer to the place where the content that the key refers to is actually stored.
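The keyword and pointer rules above determine how a lookup moves through the tree. The following Python sketch shows the search only. BTreeNode and search are names invented for this example, the node holds Python lists instead of a disk block, and the keys in the sample tree are made up. Insertion and node splitting are not shown.

```python
from bisect import bisect_left


class BTreeNode:
    def __init__(self, keys, children=None):
        self.keys = keys                # sorted keywords K[1..j]
        self.children = children or []  # j + 1 child pointers, empty for a leaf


def search(node, key):
    """Return the node that contains key, or None if it is absent.

    Each loop iteration visits one node, which on disk would be one block read.
    """
    while node is not None:
        i = bisect_left(node.keys, key)  # first position with keys[i] >= key
        if i < len(node.keys) and node.keys[i] == key:
            return node
        if not node.children:            # reached a leaf without a match
            return None
        node = node.children[i]          # subtree between keys[i-1] and keys[i]
    return None


root = BTreeNode([17, 35], [
    BTreeNode([8, 12]),
    BTreeNode([26, 30]),
    BTreeNode([65, 87]),
])
```

Note that a search can stop at an internal node, as it does for 35 here, because in a B-tree every node carries the data pointer for its own keywords.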
B+ Tree
The B+ tree is a variant of the B-tree that grew out of the needs of file systems (a file system indexes its directories level by level, and only the bottom-most leaf nodes, the files, hold data). Non-leaf nodes store only the index, not the actual data; the data is all stored in the leaf nodes. Isn't that exactly how a file system finds a file? Take a file lookup as an example: there are three folders, a, b, and c, where a contains b, b contains c, and c contains the file yang.c. The folders a, b, and c are the index (stored in non-leaf nodes): they are only the keys used to find yang.c, while the actual data, yang.c, is stored in a leaf node.
All non-leaf nodes can be regarded as the index part.
Properties of the B+ tree (only the properties that differ from the B-tree are listed)
- The number of subtree pointers in a non-leaf node equals its number of keywords;
- The subtree pointer P[i] of a non-leaf node points to the subtree whose keyword values lie in [K[i], K[i+1]) (the B-tree uses an open interval, which means a B-tree does not allow repeated keywords, whereas a B+ tree does);
- A link pointer is added to every leaf node, chaining the leaves together;
- All keywords appear in the leaf nodes (a dense index), and the keywords in the leaf linked list are in sorted order;
- The non-leaf nodes act as an index over the leaf nodes (a sparse index), and the leaf nodes act as the data layer that stores the (keyword) data;
- Better suited to file systems.
See the figure:
The non-leaf nodes (such as 5, 28, 65) are only keys (the index). The actual data lives in the leaf nodes, where entries such as 5, 8, 9 are the real data or pointers to the real data.
Application
B-trees and B+ trees are used mainly for indexes in file systems and databases, for example MySQL.
Performance analysis of the B/B+ tree
- A balanced binary tree of n nodes has height h (about log2(n)), while a B/B+ tree of n nodes has height log_t((n+1)/2) + 1, where t is the minimum number of children of a non-root internal node;
- As an in-memory lookup table, a B-tree is not necessarily better than a balanced binary tree, especially when m is large. The CPU time of a lookup on a B-tree is O(m log_t(n)) = O(lg(n) * (m / lg(t))), and m / lg(t) > 1, so when m is large the O(m log_t(n)) cost is much greater than that of a balanced binary tree. A B-tree used in memory should therefore take a small m. (Usually the minimum, m = 3, is used; each internal node of the B-tree then has 2 or 3 children, and this order-3 B-tree is called a 2-3 tree.)
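Plugging numbers into the two height formulas shows how large the gap is. The sketch below is a rough calculation, not a measurement: n = 1,000,000 keys and t = 100 are values assumed here for illustration, and the function names are made up.

```python
import math


def balanced_binary_height(n):
    """Levels in a perfectly balanced binary tree holding n keys."""
    return math.ceil(math.log2(n + 1))


def btree_height_bound(n, t):
    """Upper bound on B-tree levels: log_t((n + 1) / 2) + 1, rounded down."""
    return math.floor(math.log((n + 1) / 2, t)) + 1


n = 1_000_000  # assumed number of keys
binary_levels = balanced_binary_height(n)
btree_levels = btree_height_bound(n, t=100)  # t = 100 is an assumed value
```

Under these assumptions the balanced binary tree needs 20 levels while the B-tree needs at most 3. If each level costs one disk read, that is the difference between about 20 reads and about 3 reads per lookup.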
Why is the B+ tree better suited than the B-tree to file indexes in operating systems and to database indexes in practice?
- Internal nodes of a B+ tree hold no pointers to the data belonging to each keyword, so they are smaller than B-tree internal nodes. If the keywords of one internal node are stored in a single disk block, that block can hold more keywords, so one read brings more of the keywords to be searched into memory, and the number of I/O reads and writes drops accordingly.
- A non-leaf node never points to the file content itself; it is only an index over the keywords in the leaf nodes. Every keyword search must therefore follow a path from the root to a leaf. All keyword lookups have the same path length, so every query costs about the same.
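The first point can be put in rough numbers. The calculation below is a back-of-the-envelope sketch: the 16 KB block, 8-byte keyword, and 8-byte pointer sizes are all assumptions made here for illustration, and per-block headers are ignored.

```python
BLOCK_SIZE = 16 * 1024  # bytes per disk block (assumed)
KEY_SIZE = 8            # bytes per keyword (assumed)
POINTER_SIZE = 8        # bytes per child pointer or data pointer (assumed)


def entries_per_block(entry_size, block_size=BLOCK_SIZE):
    """How many fixed-size entries fit in one block."""
    return block_size // entry_size


# B-tree internal entry: keyword + pointer to its data + child pointer
btree_fanout = entries_per_block(KEY_SIZE + 2 * POINTER_SIZE)

# B+ tree internal entry: keyword + child pointer only
bplus_fanout = entries_per_block(KEY_SIZE + POINTER_SIZE)
```

With these sizes a B-tree block holds 682 entries and a B+ tree block holds 1024. The difference grows when the B-tree stores the record itself in the node rather than a pointer to it.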
PS: I saw someone make the following point on Zhihu, and I think it is quite reasonable:
Their view is that the main reason database indexes use B+ trees is that the B-tree, while improving disk I/O performance, does not solve the problem of inefficient element traversal, and the B+ tree came into use precisely to solve it. A B+ tree can traverse the whole tree just by walking the leaf nodes. Range queries are very frequent in databases, and the B-tree does not support them (or supports them only inefficiently).
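The range-query argument can be illustrated with a small hand-built B+ tree. The Python sketch below assumes the B+ tree variant described in this article, where an internal node has as many child pointers as keywords and each keyword is the smallest key in its subtree. The class names, the record values such as "r5", and the sample keys are invented for the example. Insertion and splitting are not shown.

```python
from bisect import bisect_right


class Leaf:
    def __init__(self, keys, values):
        self.keys = keys
        self.values = values
        self.next = None  # link to the next leaf in key order


class Internal:
    def __init__(self, keys, children):
        self.keys = keys          # keys[i] is the smallest key under children[i]
        self.children = children  # same count as keys


def find_leaf(node, key):
    """Descend from the root to the leaf whose key range covers key."""
    while isinstance(node, Internal):
        i = max(bisect_right(node.keys, key) - 1, 0)
        node = node.children[i]
    return node


def range_query(root, low, high):
    """Return (key, value) pairs with low <= key <= high via the leaf chain."""
    leaf = find_leaf(root, low)
    out = []
    while leaf is not None:
        for k, v in zip(leaf.keys, leaf.values):
            if k > high:
                return out
            if k >= low:
                out.append((k, v))
        leaf = leaf.next
    return out


leaves = [
    Leaf([5, 8, 9], ["r5", "r8", "r9"]),
    Leaf([28, 30, 35], ["r28", "r30", "r35"]),
    Leaf([65, 73, 79], ["r65", "r73", "r79"]),
]
for a, b in zip(leaves, leaves[1:]):
    a.next = b  # chain the leaves together
root = Internal([5, 28, 65], leaves)
```

A range query descends the tree once to find the starting leaf and then follows the next links, never returning to the internal nodes. In a B-tree the same query needs an in-order traversal that moves up and down between levels.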
The principles and applications of the AVL tree, red-black tree, B-tree, and B+ tree