MYSQL---anatomy innodb indexing principle

Source: Internet
Author: User

Summary:

This article is for reference to other people's articles (http://blog.csdn.net/voidccc/article/details/40077329)

1 Various tree-shaped structures

Originally not intended to start from the two-fork search tree, because there are too many related articles on the Internet, but considering the clear diagram of the understanding of the problem is very helpful, but also to ensure the integrity of the article, and finally added this part.

Let's look at several tree-shaped structures:

1 Search binary Tree: Each node has two sub-nodes, the increase in the amount of data will inevitably lead to a rapid increase in height, obviously this is not suitable as a large number of data storage infrastructure.

2 B-Tree: A M-order B-Tree is a balanced m-Path search tree. The most important property is that each non-root node contains the number of keywords J satisfies: ┌m/2┐-1 <= J <= M-1; the number of child nodes of a node is 1 more than the keyword, so the keyword becomes the split flag of the child node. Usually in the illustration of the key calligraphy and painting to the middle of the child nodes, very image, but also easy and the back of the B + tree to distinguish. Because the data exist both in the leaf node and non-leaf nodes, it is not possible to simply traverse through the keywords in the B-tree sequentially, and must use the method of middle order traversal.

3 B + Tree: A M-order B-tree is a balanced m-Path search tree. The most important property is that each non-root node contains the number of keywords J satisfies: ┌m/2┐-1 <= J <= m; The number of subtrees can be as many as the keyword. A non-leaf node stores the smallest keyword in a subtree. At the same time, the data nodes exist only in the leaf nodes, and the horizontal pointers are added between the leaf nodes, which makes it very easy to traverse all the data sequentially.

4 b* Tree: A M-order B-tree is a balanced m-Path search tree. The most important of the two properties is 1 each non-root node contains the number of keywords J satisfies: ┌m2/3┐-1 <= J <= m;2 non-leaf nodes with horizontal pointers added.

b/b+/b* Three types of trees have similar operations, such as retrieving/inserting/deleting nodes. This only focuses on the insertion of nodes, and only the insertion of their current node is full, because this action is slightly more complex and can adequately reflect the differences in several trees. The comparison is that the retrieval node is easier to implement, and the deletion node as long as the completion of the reverse process with the insertion (in the actual application is not the full inverse of the insert, often only delete the data and leave space for subsequent use).

Look at the division of the B-tree first, and the red value is the node that is newly inserted each time. Every time a node is full, it needs to split (splitting is a recursive process, referring to the following 7 insert leads to a two-layer split), because the B-tree's non-leaf node also holds the key value, so the full node split after the value will be distributed in three places: 1 original node, 2 parent node of the original node, 3 The new sibling node of the original node (refer to the insertion process of the 5,7). Splitting may result in increased height of the tree (refer to the 3,7 insertion process), or it may not affect the height of the tree (refer to the 5,6 insertion process).

Division of B + trees: When a node is full, a new node is allocated and 1/2 of the original node is copied to the new node, and the new node pointer is added to the parent node, and the division of B + tree affects only the original and parent nodes, without affecting the sibling node, so it does not need pointers to the siblings.

b*: When a node is full, if its next sibling node is not full, move part of the data to the sibling node, insert the keyword at the original node, and finally modify the keyword of the sibling node in the parent node (because the sibling node's keyword range has changed). If the sibling is full, add new nodes between the origin and sibling nodes, and copy 1/3 of the data to the new node, and finally the pointer to the parent node to add the new node. Can see the division of the b* tree is very ingenious, because the b* tree to ensure that the split node is also 2/3 full, if the use of B + Tree method, simply the full node is divided into two, will cause each node only 1/2 full, which does not meet the requirements of the b* tree. Therefore, the strategy of the b* tree is to continue to insert the sibling node after this node is full (which is why the b* tree needs to be in a non-leaf node plus a sibling list), until the brother node is stuffed, and then pull up the brothers node together chipped, and the brother node each contributed 1/3 to establish a new node, The result is that 3 nodes are just 2/3 full, to meet the requirements of the b* tree, happy.

The B + tree is well-suited as a database infrastructure, entirely because of the computer's memory-mechanical hard disk two-tiered storage structure. Memory can do fast random access (random access gives any address that requires returning data stored at this address) but with less capacity. and the random access of the hard disk to be mechanical action (1 Head Movement 2 disc rotation), access efficiency than memory a few orders of magnitude, but the hard disk capacity is large. A typical database capacity is much larger than the available memory size, which determines that retrieving a single piece of data in a B + tree is likely to be done with several disk IO operations. As shown: an action that normally reads a node down may be a disk IO operation, but a non-leaf node typically loads memory in the initial phase to speed up access. At the same time, in order to improve the traverse speed between nodes, the blue CPU/memory reading in the graph may be optimized into binary search tree (page directory mechanism in InnoDB).


The B + tree in the real database should be very flat, and you can verify how flat the B + tree in InnoDB is by inserting enough data into the table sequentially. We create a test table with a simple field by using the Create statement, and then add data to populate the table. The statistical data (sources see reference 1) can be analyzed in several intuitive conclusions, which show the scale of the B + tree in the database.

1 each leaf node stores 468 rows of data, and each non-leaf node stores about 1200 key values, which is a balanced 1200-way search Tree!

2 for a 22.1G capacity table, you only need a B + tree height of 3 can be stored, this capacity is likely to meet the needs of many applications. If you increase the height to 4, the storage capacity of the B + tree will increase to 25.9T!

3 for a 22.1G capacity table, the B + Tree height is 3, if you want to load all non-leaf nodes into memory only need less than 18.8M of memory (how to come to this conclusion?). Because for a tree with a height of 2, 1203 leaf nodes also need only 18.8M space, while the 22.1G Magdalene table has a height of 3 and non-leaf nodes 1204. At the same time we assume that the leaf node size is greater than the non-leaf node, because the leaf node stores the row data, not the leaf node only keys and a small amount of data. , using only such a small amount of memory to ensure that only one disk IO operation is required to retrieve the required data, the efficiency is very high.

2 MySQL storage engine and Index

It can be said that the database must have an index, the retrieval process becomes a sequential lookup without an index, and the time complexity of O (n) is almost intolerable. It's very easy to imagine how a table with a single keyword can be indexed using a B + tree, as long as the keyword is stored in the tree's node. When the database contains multiple fields in a single record, a B + tree can only store the primary key, and if the non-primary key field is retrieved, the primary key index loses its function and becomes a sequential lookup.  A second set of indexes should be established on the second column to retrieve. This index is organized by a separate B + tree. There are two common ways to solve the problem of multiple B + trees accessing the same set of table data, a clustered index (clustered index), a non-clustered index (secondary). Although these two names are called indexes, this is not a separate index type, but a way of storing data. For clustered index storage, row data and primary key B + trees are stored together, auxiliary keys B + trees store only secondary and primary keys, and primary and non-primary key B + trees are almost two types of trees. For non-clustered index storage, the primary key B + Tree stores a pointer to the true data row in the leaf node, rather than the primary key.

InnoDB uses a clustered index, organizes the primary key into a B + tree, and the row data is stored on the leaf node, and if the "where id = 14" condition is used to find the primary key, the corresponding leaf node is found according to the B + tree's retrieval algorithm, then the row data is obtained. A conditional search on the Name column requires two steps: The first step is to retrieve the name in the secondary index B + tree and reach its leaf node for the corresponding primary key. The second step uses the primary key to perform a B + tree retrieval operation at the main index B + species, and finally reaches the leaf node to get the entire row of data.

Myism uses the non-clustered index, the nonclustered index of the two B + trees does not look different, the structure of the node is exactly the same as the content of the storage, the primary key index B + Tree node stores the primary key, secondary key index B + Tree stores the secondary key. Table data is stored in a separate place, the leaf nodes of the two B + trees use an address to point to the real table data, and there is no difference between the two keys for table data. Because the index tree is independent, retrieve the index tree without accessing the primary key through the secondary key.

To better visualize the differences between the two indexes, we imagine a table that stores 4 rows of data. Where ID is the primary index, name is the secondary index. The diagram clearly shows the difference between the clustered index and the non-clustered index.

We focus on clustered indexes, which appear to be significantly less efficient than nonclustered indexes, since each use of a secondary index is two times the B + Tree lookup, isn't it superfluous? What are the advantages of clustered indexes?

1 because the row data and leaf nodes are stored together so that the primary key and row data are loaded together into memory, the leaf node can be found immediately return the row data, if you follow the primary key ID to organize data, to obtain data faster.

2 Secondary indexes The advantage of using a primary key as a pointer instead of using an address value as a pointer is to reduce the maintenance of the secondary index when a row move or data page splits, using a primary key value as a pointer to make the secondary index occupy more space. The benefit is that INNODB does not need to update this "pointer" in the secondary index while it is moving. That is, the location of the line (implemented in the 16K page to locate, will be involved later) will be changed with the data in the database changes (the previous B + tree node splitting and page splitting), using the clustered index can be guaranteed regardless of the primary key B + tree node changes, the secondary index tree is not affected.

3 page structure

If the previous content is biased towards the principle of interpretation, then it begins to involve a concrete implementation.

Understanding the implementation of InnoDB has to mention the page structure, page is the most basic component of the entire InnoDB storage, is also the smallest unit of INNODB Disk Management, all the content related to the database is stored in this page structure. Page is divided into several types, the common page type is the data page (B-tree Node) undo page (undo Log Page) system page (systems pages) Transaction data page (Transaction system page), and so on. The size of a single page is 16K (compiled macro univ_page_size control), each page is uniquely identified with a 32-bit int value, which corresponds to the InnoDB maximum 64TB storage capacity (16Kib * 2^32 = 64Tib). The basic structure of a page is as follows:

Each page has a common header and tail, but the content in the middle varies depending on the type of page. The page's head has some data that we care about, showing the page's header details:

We focus on the fields related to the data organization: The page's head holds two pointers, pointing to the previous page and the next page, and the header with the page's type information and the number that uniquely identifies the page. Based on these two pointers, it is easy to imagine that the page link is the structure of a doubly linked list.

Looking at the main content of the page, we focus primarily on the storage of row data and indexes, which are located in the User Records section of page, where user records occupies most of the page, and the user records consists of one record, Each record represents a node (not a leaf node and a leaf node) on the index tree. Within a page, the head and tail of a single-linked list is represented by two records of fixed content, and "infimum" in string form represents the beginning, and "supremum" represents the end. The two record used to represent the beginning end is stored in the section of system records, which is a two parallel segment of the system records and user records. InnoDB There are 4 different record types, which are 1 primary key index tree non-leaf node 2 primary key index tree leaf node 3 Auxiliary key index tree non-leaf node 4 secondary key index tree leaf node. The record formats for these 4 nodes are somewhat different, but they all store the next pointer to the next record. We will describe the 4 nodes in detail, and now we just need to treat the record as a single linked list node that stores the data and contains the next pointer.

The User record exists as a single-linked list within the page, and the initial data is arranged in the order in which it was inserted, but as new data is inserted and old data is deleted, the physical order of the data becomes chaotic, but they remain logically sequenced.

By combining the organization form of the user record with a number of page, we see a slightly complete form.

Now look at how to position a record:

1 through the root node began to traverse an index B + tree, through each layer of non-leaf nodes eventually reached a page, the page is stored in the leaf node.

2 traverse the single-linked list from the "Infimum" node within the page (this traversal is often optimized) and returns successfully if the key is found. If the record arrives at "Supremum", indicating that there is no appropriate key in the page, the page's next page pointer will be used to jump to the next page and continue to find it from "infimum".

To see exactly what data is stored in different types of record, the User record can be divided into four formats according to the color difference, depending on the B + tree node.

1 Primary index tree non-leaf node (green)

1 The minimum value of the primary key stored in the child node (Min Cluster key on child), which is required by the B + tree, is to position the specific record in a page.

2 The number of the Page that contains the smallest value (Child Page #), which is the position of the record.

2 primary index tree leaf node (yellow)

1 primary KEY (Cluster key fields), B + tree required, and part of the data row

2 removes all columns except the primary key (Non-key fields), which is the collection of all other columns of the data row that are removed from the primary key.

The 1 and 22 parts here add up to a complete data line.

3 Secondary index tree non-leaf nodes non-(blue)

The minimum value of the secondary key value stored in the 1 sub-node (min secondary-key on child), which is required by the B + tree, is to position the specific record in a page.

2 primary key value (Cluster key fields), why do non-leaf nodes store primary keys? Because secondary indexes can be not unique, the B + tree requires that the values of the keys must be unique, so the value of the secondary key and the value of the primary key are combined as true key values in the B + tree to ensure uniqueness. However, this also results in a secondary index of the B + tree in which the non-leaf nodes are 4 more bytes than the leaf nodes. (That is, the middle blue node is 4 bytes more than red)

3 The number of the Page that contains the smallest value (Child Page #), which is the position of the record.

4 secondary index tree leaf nodes (red)

1 Secondary index key value (secondary key fields), which is required by the B + tree.

2 primary key value (Cluster key fields), which is used to perform a B + tree search in the main index tree to find the entire record.

The following is the most important part of this article, combined with the structure of B + tree and the contents of the 4 types of record presented above, we can finally draw a panorama picture. Since the B + Tree of the secondary index has a similar structure to the primary key index, only the structure of the primary key index tree is drawn, including the green and yellow parts of the primary key non-leaf node and the primary key leaf node.

Restore it to the more concise tree below, which is part of the B + tree. Note that there is no one by one relationship between page and B + tree nodes, page is just a save container for a record, it exists to facilitate bulk management of disk space, the page numbered 47 is split into two separate nodes in the tree structure.

Reference documents:

[1] Jeremy Cole The physical structure of InnoDB index pages

[2] Jeremy Cole B+TREE index structures in InnoDB

[3] Jeremy Cole The physical structure of records in InnoDB

[4] Kang MySQL technology insider-INNODB Storage Engine Second Edition

[5] Schwartz,b/zaitsev,p/tkach high performance MySQL third edition

[6] B-tree Wiki

MYSQL---anatomy innodb indexing principle

Contact Us

The content source of this page is from Internet, which doesn't represent Alibaba Cloud's opinion; products and services mentioned on that page don't have any relationship with Alibaba Cloud. If the content of the page makes you feel confusing, please write us an email, we will handle the problem within 5 days after receiving your email.

If you find any instances of plagiarism from the community, please send an email to: info-contact@alibabacloud.com and provide relevant evidence. A staff member will contact you within 5 working days.

A Free Trial That Lets You Build Big!

Start building with 50+ products and up to 12 months usage for Elastic Compute Service

  • Sales Support

    1 on 1 presale consultation

  • After-Sales Support

    24/7 Technical Support 6 Free Tickets per Quarter Faster Response

  • Alibaba Cloud offers highly flexible support services tailored to meet your exact needs.