Use Based on B-tree and B + tree: detailed introduction of data search and database index

Source: Internet
Author: User

B-tree

1. B-tree Definition

B-tree is a balanced multi-path search tree that is useful in file systems.

Definition: an m-Level B-tree, an empty tree, or an m-tree that meets the following features:
(1) Each node in the tree has at most m Subtrees;
(2) If the root node is not a leaf node, there are at least two Subtrees;

(3) All non-terminal nodes except the root node have at least [m/2] subtree;
(4) All non-terminal nodes contain the following information data:

(N, A0, K1, A1, K2 ,..., Kn,)
Where: Ki (I = 1, 2 ,..., N) is the key code, and Ki <Ki + 1,

Ai is the pointer to the Child root node (I = ,..., N), and the key code of all nodes in the subtree indicated by the pointer Ai-1 is less than Ki (I = ,..., N), An indicates that the key code of all nodes in the subtree is greater than Kn.

N is the number of key codes.
All the leaf nodes of the worker appear at the same level without information (it can be seen as an external node or a node that fails to be searched. In fact, these nodes do not exist, the pointer to these nodes is null ).

That is, all leaf nodes have the same depth and are equal to the tree height.

For example, a level 4 B-tree has a depth of 4.


The B-tree search is similar to the query of the binary sorting tree. The difference is that each node of the B-tree is an ordered table with multiple key codes. When a knot is reached, first, search in the ordered table. If it is found, the search is successful. Otherwise, search in the subtree pointing to according to the corresponding pointer information. When it reaches the leaf knot, the key code is not found in the tree.

The process of searching for keyword 47 on the B-tree is as follows:

1) First, locate * Node Based on the Root Node pointer, because * node a only has one keyword and the given value is 47> keyword 35, in this case, the sub-tree indicated by the A1 pointer must exist.

2) The X-c node is found along the pointer. The node has two keywords (43 and 78), and 43 <47 <78, if there is a ratio in the subtree referred to by the A1 pointer.

3) Similarly, find the * g node with the pointer, and find the keyword 47 on the node. The search is successful.

2. Search Algorithms
Copy codeThe Code is as follows:
Typedef int KeyType;
# Define m 5/* level of B tree, set to 5 */
Typedef struct Node {
Int keynum;/* Number of keycodes in the node, that is, the node size */
Struct Node * parent;/* points to the parent Node */
KeyType key [m + 1];/* key code vector. Unit 0 is not used */
Struct Node * ptr [m + 1];/* subtree pointer vector */
Record * recptr [m + 1];/* Record pointer vector */
} NodeType;/* B tree node type */

Typedef struct {
NodeType * pt;/* points to the node found */
Int I;/* key code serial number in the node, node serial number range [1... M] */
Int tag;/* 1: search successful, 0: search failed */
} Result;/* search Result type of tree B */

Result SearchBTree (NodeType * t, KeyType kx)
{
/* Search for the key code kx and reverse return (pt, I, tag) on the B-tree t of level m ). If the search is successful, the feature value tag is 1 ,*/
/* The I-th key code in the node pointed by the pointer pt is equal to kx; otherwise, the feature value tag is 0, which is equal to the key code record of kx */
/* It should be inserted between the key code I and I + 1 in the node indicated by the pointer pt */
P = t; q = NULL; found = FALSE; I = 0;/* initialization, p points to the node to be queried, and q points to the parent node of p */
While (p &&! Found)
{N = p-> keynum; I = Search (p, kx);/* at p --> key [1... In keynum */
If (I> 0 & p-> key [I] = kx) found = TRUE;/* Find */
Else {q = p; p = p-> ptr [I];}
}
If (found) return (p, I, 1);/* search successful */
Else return (q, I, 0);/* If the query fails, the kx Insertion Location Information is returned */
}

B-tree search algorithm analysis

From the search algorithm, we can see that searching in the B-tree contains two basic operations:

(1) Search for nodes in the B-tree;

(2) Search for keywords in the node.

Because the B-tree is usually stored on the disk, the previous search operation is performed on the disk, and the next search operation is performed in the memory, that is, after finding the node pointed by the pointer p on the disk, read the information in the node into the memory, and then use the sequential search or semi-query to find the keyword that the query is equal to K. Obviously, one query on a disk consumes much more time than one query in the memory.

Therefore, the number of searches on the disk, that is, the layered tree on the B-tree where the keyword to be searched is located, is the primary factor determining the search efficiency of the B-tree.

How deep is the worst case for the m-Level B-Tree Containing n key codes? Similar analysis can be performed based on the binary balancing tree. First, we will discuss the minimum number of knots on each layer of B-numbers in level m.

Defined by Tree B: Tree B contains n keywords. Therefore, n + 1 leaves are on the J + 1 layer.

1) The first layer is the root, at least one node, and the root has at least two children. Therefore, the second layer has at least two nodes.

2) Apart from roots and leaves, other nodes have at least [m/2] children. Therefore, the third layer has at least 2 * [m/2] nodes, on the fourth layer, there must be at least 2 * [m/2] 2 nodes...

3) in the J + 1 layer, there are at least 2 * [m/2] J-1 nodes, and the node of the J + 1 layer is the leaf node, the number of leaf nodes is n + 1. Include:

That is to say, in the B-tree search of n keywords, the number of nodes involved from the root node to the node where the keyword is located cannot exceed:


3. B-Tree Insertion

The generation of B-tree is also obtained from the empty tree by inserting keywords one by one. However, because the number of keywords in the B-Tree node must be greater than or equal to ceil (m/2)-1, inserting a keyword each time does not add a leaf node to the tree, instead, add a keyword to a non-terminal node at the lowest layer. If the number of keywords on the node does not exceed the value of S-1, the insertion is complete. Otherwise, a "split" of the node is generated ",

(A) a Level 3 B-tree (the F node is omitted in the figure (that is, the leaf node). Assume that the keywords 30, 26, and 85 need to be inserted in sequence.



1) First, locate and determine the Insert Location. Start from root * a and check that 30 of the data to be inserted is in the * d node. Because the number of keywords in * d cannot exceed 2 (that is, s-1), the first keyword is inserted completely: for example, (B)


2) Similarly, * d should also be inserted by searching for the identified keyword 26. because the number of ** d node keywords exceeds 2, you need to split ** d into two nodes, and the first and last pointers of the keyword 26 are still in the ** d node, the key 37 and its first and last pointers are stored in the newly generated node * d. Insert the pointer of the keyword 30 and the indicator node * d' to the parent node. Because * The number of keywords in Node B does not exceed 2, the insertion is complete. For example, (c) (d)




3) (e)-(g) after 85 is inserted;


Insert algorithm:
Copy codeThe Code is as follows:
Int InserBTree (NodeType ** t, KeyType kx, NodeType * q, int I ){
/* Insert the key code kx between the key [I] and key [I + 1] of the * q node on the * t node of the B-level m tree */
/* If the node is too large, perform necessary node splitting and adjustment along the parent chain so that * t is still the B-tree of level m */
X = kx; ap = NULL; finished = FALSE;
While (q &&! Finished)
{
Insert (q, I, x, ap ); /* insert x and ap into q-> key [I + 1] and q-> ptr [I + 1] */
If (q-> keynum <m) finished = TRUE;/* Insert completed */
Else
{/* Split node * p */
S = m/2; split (q, ap); x = q-> key [s];
/* Convert q-> key [s + 1... M], q-> ptr [s... M] and q-> recptr [s + 1... M] move to new node * ap */
Q = q-> parent;
If (q) I = Search (q, kx);/* query the kx Insertion Location in the parent node * q */
}
}
If (! Finished)/* (* t) is an empty tree or the root node has been split into * q * and ap */
NewRoot (t, q, x, ap);/* generate a new root node containing information (t, x, ap) * t, the original * t and ap are subtree pointers */
}

4. B-tree deletion

If you delete a keyword on the B-tree, you should first find the node where the keyword is located and delete it from it. If the node is a non-terminal node at the lowest level, if the number of keywords is not less than ceil (m/2), the deletion is complete. Otherwise, the "merge" Node operation is required. If the deleted keyword is a non-terminal node's Ki, you can use Y, the smallest keyword in the subtree referred to by Ai, to replace Ki, and then delete Y in the corresponding node. For example, if you delete 45 from the B-tree in Figure 4.1 (a), you can replace 50 in * f node with 45, and then delete 50 in * f node.


Fig. 4.1 ()

Therefore, we only need to discuss how to delete the keywords in the lower-level non-terminal nodes. There are three possibilities:

(1) If the number of keywords in the node where the deleted keyword is located is not less than ceil (m/2), you only need to delete the key Ki and the corresponding pointer Ai from the node, the other parts of the tree remain unchanged. For example, if you delete the keyword 12 from the B-tree shown in Figure 4.1 (a), then the B-tree shown in Figure 4.2 (a) After the deletion is shown:


Fig. 4.2 ()

(2) The number of keywords in the node where the deleted keyword is located is equal to ceil (m/2)-1, and the right brother (or left brother) adjacent to the node) if the number of keywords in a node is greater than ceil (m/2)-1, you need to move the minimum (or maximum) keywords in the sibling node to the parent node, move the key words that are smaller than (or greater than) in the parent node and close to the top-up keyword to the node where the deleted key word is located.

For example, if you delete 50 from Figure 4.2 (a), you need to move 61 from the right brother node to * e node, and move 53 from * e node to * f, so that the number of keywords in * f and * g is not less than ceil m-1)-1, and the number of keywords in the parent node remains unchanged, as shown in Figure 4.2 (B.


Fig. 4.2 (B)

(3) the number of keywords in the node where the deleted keyword is located and Its Adjacent sibling nodes is equal to ceil (m/2)-1. If the node has a right brother and the address of the right brother node is pointed by the pointer Ai In the parent node, the remaining keywords and pointers in the node where the node is deleted after the keyword is deleted, add the key word Ki in the parent node and merge it into the sibling node specified by Ai (if there is no right sibling node, It is merged into the left sibling node ).

For example, if 53 is deleted from the B-tree shown in Figure 4.2 (B), the * f node should be deleted and the remaining information in * f (Pointer "null") should be deleted ") and 61 in the parent * e node are merged into the right brother node * g. The deleted tree 4.2 (c) is shown in.

 

Fig. 4.2 (c)

If the number of keywords in the parent node is less than ceil (m/2)-1, and so on.

[For example], after the keyword 37 is deleted from B-tree in Figure 4.2 (c), the remaining information in the parent B node ("pointer c ") merge it with the keyword 45 in the parent * a node to the right sibling node * e, as shown in the deleted B-tree 4.2 (d.

Figure 4.2 (d)


B-tree is mainly used in the file system

In order to store large database files on the hard disk to reduce the number of hard disk accesses, a balanced multi-path Search Tree-B-tree structure is proposed. The performance analysis shows that its retrieval efficiency is quite high. to improve B-tree performance, there are many B-tree variants, try to improve B-tree


B + Tree B + tree is a B-tree deformation tree generated according to the needs of the file system. A m-Level B + tree and a m-Level B-
The difference between trees is:
(1) nodes with n subtree contain n key codes;
(2) All leaf nodes contain all key code information and pointers to records containing these key codes.
The leaf node itself is connected in a small and large order based on the key code.
(3) All non-terminal nodes can be regarded as the index part. The node only contains the maximum (or minimum) key code of its child root node.
A level-3 B + tree:
Generally, there are two headers on the B + tree, one pointing to the root node, and the other pointing to the leaf node with the smallest keyword. Therefore, you can perform two search operations on the B + tree: one is sequential search from the smallest keyword, and the other is random search from the root node. The process of random search, insertion, and deletion on the B + tree is basically similar to that on the B-tree. Only when searching, if the key code on a non-terminal node is equal to the given value, it does not end, but continues until the leaf node. Therefore, in B +
Tree, no matter whether the search is successful or not, each search takes a path from the root to the leaf node.

Application of B + tree in Database

1. Role of indexes in Databases

During the use of the database system, data query is the most frequently used data operation.

The most basic query algorithm is linear search, which traverses the table and matches the row-by-row match to determine whether the value is equal to the keyword to be searched. the time complexity is O (n ). However, tables with low time complexity of O (n) algorithms and databases with low load can also have good performance. However, when the data increases, the time complexity of the O (n) algorithm is obviously bad, and the performance will soon decline.

Fortunately, the development of computer science provides many better search algorithms, such as binary search and binary tree search. If you perform a slight analysis, you will find that each search algorithm can only be applied to a specific data structure. For example, binary search requires that the retrieved data be ordered, while binary search can only be applied to binary search trees, however, the organizational structure of the data itself cannot fully satisfy all kinds of data structures (for example, theoretically it is impossible to organize both columns in order at the same time, the database system also maintains data structures that meet specific search algorithms. These data structures reference (point to) data in some way, so that advanced search algorithms can be implemented on these data structures. This data structure is an index.

An index is a structure that sorts the values of one or more columns in a database table. Compared to searching all rows in a table, indexes use pointers to the data values stored in the specified column in the table, and sort these pointers in the specified order to help you get information faster. Generally, an index must be created on a table only when data in the index column is frequently queried. Indexes occupy disk space and affect the data update speed. However, in most cases, the data retrieval speed advantage brought by indexes greatly exceeds its shortcomings.

2. Application of B + tree in database index


Currently, most database systems and file systems use B-Tree or its variant B + Tree as the index structure.

1) database index Application

In the application of database indexes, the B + tree is organized as follows:

① Organization of leaf nodes. The search key of the B + tree is the primary key of the data file, and the index is dense. That is to say, in the leaf node, a key and pointer pair is set for the first record of the data file. The data file can be sorted by the primary key or not by the primary key. The data files are sorted by the primary key, the B + tree is a sparse index. In the leaf node, a key and pointer pair are set for each block of the data file. data files are not sorted by key properties, this attribute is the search key of the B + tree. In the leaf node, each attribute K in the data file has a key and pointer pair, the pointer executes the first of the records whose sorting key value is K.

② Organization of non-leaf nodes. The non-leaf nodes in the B + tree form a multi-level sparse index on the leaf nodes. Each non-leaf node has at least ceil (m/2) pointers and at most m pointers.

2) insert and delete B + tree indexes

① When inserting new data into the database, you also need to insert the corresponding index key value to the database index. Then you need to insert a new key value to the B + tree. That is, the B-Tree Insertion Algorithm we mentioned above.

② When deleting data from the database, you also need to delete the corresponding index key value from the database index. You need to delete the key value from the B + tree. B-tree deletion algorithm

Why use B-Tree (B + Tree)

Binary Search Tree data structures such as the red and black trees of evolutionary varieties can also be used for indexing. However, B-/+ Tree is widely used as the index structure in file systems and database systems.

In general, the index itself is also very large and cannot be all stored in the memory. Therefore, the index is often stored on the disk as an index file. In this way, disk I/O consumption is required during index search. Compared with memory access, I/O access consumes several orders of magnitude, therefore, evaluating a data structure as the index's most important indicator is the progressive complexity of the number of disk I/O operations during the search process. In other words, the structure of the index should minimize the number of disk I/O accesses during the search process. The use of B-/+ Tree is also related to the disk access principle.

Local principle and disk pre-read

Because of the characteristics of the storage medium, the access to the disk itself is much slower than the primary storage, coupled with the cost of mechanical movement, the access speed of the disk is often one of the primary storage, so in order to improve efficiency, minimize disk I/O. To achieve this goal, the disk is usually not read strictly on demand, but preread every time. Even if only one byte is required, the disk starts from this location, read data of a certain length in sequence into the memory. This theory is based on the well-known local principle in Computer Science:

When a data is used, the data nearby it is usually used immediately.

The data required during the program running is usually concentrated.

Because sequential disk reading is highly efficient (with little rotation time required without seeking time), preread can improve I/O efficiency for local programs.

The preread length is generally an integer multiple of the page. Pages are logical blocks for computer memory management. Hardware and operating systems often divide primary and disk storage areas into contiguous blocks of the same size, each block is called a page (in many operating systems, the page size is usually 4 k). The primary storage and disk exchange data in pages. When the data to be read by the program is not in the primary storage, a page missing exception is triggered, and the system sends a disk reading signal to the disk, the disk finds the starting position of the data and reads one or more pages consecutively into the memory. If an exception is returned, the program continues to run.

We analyzed above B-/+ Tree to retrieve the most accessible nodes at a time:

H =

The database system cleverly utilizes the disk pre-read principle to set the size of a node to equal to one page, so that each node can be fully loaded only once I/O. To achieve this goal, you also need to use the following techniques to implement B-Tree:

Each time you create a node, you can directly apply for a page space to ensure that a node is physically stored on a page. In addition, the computer storage allocation is page-aligned, A node only needs one I/O operation.

A B-Tree retrieval requires a maximum of H-1 I/O (root node resident memory), and the progressive complexity is O (h) = O (logmN ). In practice, m is a very large number, usually over 100, so h is very small (usually no more than 3 ).

In conclusion, the efficiency of using B-Tree as the index structure is very high.

The structure of the red and black trees is much deeper than h. Because logically close nodes (Parent and Child) may be far physically unable to use locality, the I/O complexity of the red and black trees is O (h ), the efficiency is much lower than that of B-Tree.

MySQL B-Tree index (technically speaking, B + Tree)

In MySQL, there are four types of indexes: B-Tree index, Hash index, Fulltext index, and R-Tree index. We mainly analyze B-Tree indexes.

B-Tree indexes are the most frequently used index types in MySQL databases. All storage engines except the Archive storage engine support B-Tree indexes. The Archive engine does not support indexing until MySQL 5.1 and only supports indexing a single AUTO_INCREMENT column.

Not only in MySQL, but in many other database management systems, the B-Tree index is also the most important index type, this is mainly because the storage structure of B-Tree indexes has excellent performance in database data retrieval.

In general, most physical files of the B-Tree index in MySQL are stored in the Balance Tree structure, that is, all the actually required data is stored in the Leaf Node of the Tree, and the shortest path length to any Leaf Node is identical, therefore, we all call it B-Tree indexes. Of course, various databases (or various MySQL storage engines) may slightly modify the storage structure when storing their own B-Tree indexes. For example, the actual storage structure used by the B-Tree index of the Innodb Storage engine is actually B + Tree, that is, a small transformation has been made on the basis of the B-Tree data structure, in addition to storing the index key information on each Leaf Node, it also stores the pointer information pointing to the next LeafNode adjacent to the Leaf Node (adding sequential access pointers ), this is mainly to accelerate the efficiency of retrieving multiple adjacent Leaf nodes.

The following describes how to implement indexes for two storage engines: MyISAM and InnoDB:

1. MyISAM index implementation:

1) primary key index:

The MyISAM engine uses B + Tree as the index structure. The data domain of the leaf node stores the data record address. Principle of MyISAM primary key indexFigure:


(Figure myisam1)

Here, the table has three columns in total. Assume that we use Col1 as the Primary key, and figure myisam1 is the Primary index (Primary key) of a MyISAM table. It can be seen that the index file of MyISAM only stores the address of the data record.

2) Secondary index (Secondary key)

In MyISAM, the primary index and Secondary index (Secondary key) have no difference in structure, but the primary index requires that the key is unique, and the Secondary index key can be repeated. If we create a secondary index on Col2, the index structure is shown in:

It is also a B + Tree that stores data records in the data field. Therefore, the index search algorithm in MyISAM first searches for indexes based on the B + Tree search algorithm. If the specified Key exists, the value of its data field is obtained, then, read the corresponding data records using the data domain value as the address.

The index method of MyISAM is also called "non-clustered". The reason for this is to distinguish it from the clustered index of InnoDB.

2. InnoDB Index implementation

InnoDB also uses B + Tree as the index structure, but the implementation method is different from that of MyISAM.

1) primary key index:

The MyISAM index file and data file are separated. The index file only stores the data record address. In InnoDB, the table data file itself is an index structure organized by B + Tree. The leaf node data field of this Tree stores complete data records. The key of this index is the primary key of the data table, so the InnoDB table data file itself is the primary index.

(Figure inndb primary key index)

(Figure inndb primary key index) is an InnoDB primary index (also a data file). We can see that the leaf node contains a complete data record. This index is called a clustered index. Because the data files in InnoDB need to be clustered by the primary key, InnoDB requires that the table have a primary key (MyISAM may not). If it is not explicitly specified, mySQL automatically selects a column that uniquely identifies a data record as the primary key. If this column does not exist, MySQL automatically generates an implicit field for the InnoDB table as the primary key, this field is 6 bytes in length and its type is long integer.

2). InnoDB secondary index

All secondary indexes of InnoDB reference the primary key as the data field. For example, to define a secondary index on Col3:

InnoDB tables are created based on Clustered indexes. Therefore, InnoDB indexes provide a fast primary key search performance. However, its Secondary Index (Secondary Index, which is also a non-primary key Index) will also contain primary key columns. Therefore, if the primary key definition is large, other indexes will also be large. If you want to define a large number of indexes on a table, try to define a smaller primary key. InnoDB does not compress indexes.

ASCII code of Chinese characters as a comparison criterion. Clustered index makes the search by primary key very efficient, but secondary index search requires two indexes: first, retrieve the secondary index to obtain the primary key, then, use the primary key to search for the record in the primary index.

The indexing implementation methods of different storage engines are very helpful for correct use and optimization of indexes. For example, after knowing the index Implementation of InnoDB, it is easy to understand why it is not recommended to use too long fields as the primary key, because all secondary indexes reference the primary index, too long primary index will make the secondary index too large. For example, it is not a good idea to use non-monotonous fields as the primary key in InnoDB, because the InnoDB data file itself is a B + Tree, non-monotonous primary keys will cause frequent split and adjustment of data files to maintain the features of B + Tree during the insertion of new records, which is very inefficient, using an auto-increment field as the primary key is a good choice.

Differences between InnoDB indexes and MyISAM indexes:

First, the difference between the primary index and the InnoDB data file itself is the index file. The indexes and data of MyISAM are separated.

Second, the difference between secondary indexes: the secondary index data domain of InnoDB stores the value of the primary key of the corresponding record instead of the address. The secondary index of MyISAM is not much different from the primary index.

Related Article

Contact Us

The content source of this page is from Internet, which doesn't represent Alibaba Cloud's opinion; products and services mentioned on that page don't have any relationship with Alibaba Cloud. If the content of the page makes you feel confusing, please write us an email, we will handle the problem within 5 days after receiving your email.

If you find any instances of plagiarism from the community, please send an email to: info-contact@alibabacloud.com and provide relevant evidence. A staff member will contact you within 5 working days.

A Free Trial That Lets You Build Big!

Start building with 50+ products and up to 12 months usage for Elastic Compute Service

  • Sales Support

    1 on 1 presale consultation

  • After-Sales Support

    24/7 Technical Support 6 Free Tickets per Quarter Faster Response

  • Alibaba Cloud offers highly flexible support services tailored to meet your exact needs.