Anatomy of MySQL InnoDB index

Source: Internet
Author: User

Summary:

This article describes MySQL's InnoDB index-related knowledge, from a variety of tree-to-index principles to the details of storage.

InnoDB is the default storage engine for MySQL (Mysql5.5.5 before MyISAM, documentation). Based on the purpose of efficient learning, this article mainly introduces InnoDB. A small amount involves myisam as a control.

This article is a summary of my learning process. The content is mainly from books and blogs (the references are given). The process has added some of its own understanding. Please point out the inaccurate description of the narrative.

1 Various tree-shaped structures

Originally not intended to start from the two-fork search tree, because there are already too many related articles on the Internet, but considering the clear diagram is very helpful to understand the problem. Also to ensure the integrity of the article, and finally added this part.

Let's look at several tree-shaped structures:

1 Search binary Tree: Each node has two child nodes. The increase in data volume must result in high-speed additions. Obviously this is not suitable as an infrastructure for large amounts of data storage.

2 B-Tree: A M-order B-Tree is a balanced m-Path search tree.

The most important property is that each non-root node includes the number of keyword J satisfies: ┌m/2┐-1 <= J <= M-1; the number of child nodes of a node is 1 more than the keyword number, so keyword becomes the cut flag of the child node.

The keyword is usually drawn to the middle of the child node in the illustration. Very image, also easy and behind the B + tree differentiated. Because the data exists at the same time in the leaf node and non-leaf nodes, it is not easy to complete the keyword in sequential traversal of the B-tree, must use the method of the middle sequence traversal.

3 B + Tree: A M-order B-tree is a balanced m-Path search tree.

The most important property is the number of keyword included in each non-root node J satisfies: ┌m/2┐-1 <= J <= m; The number of subtrees can be as many as keyword.

Non-leaf nodes store the smallest keyword in the subtree. At the same time, the data node exists only in the leaf node. And a horizontal pointer is added between the leaf nodes, so it's easy to traverse all the data sequentially.

4 b* Tree: A M-order B-tree is a balanced m-Path search tree. The most important two properties are 1 each non-root node includes a keyword number J satisfies: ┌m2/3┐-1 <= J <= m;2 Non-leaf nodes are joined by a transverse pointer.






B/b+/b* Three trees have similar operations, such as retrieving/inserting/deleting nodes. This focuses only on the insertion of nodes, and only analyzes their insert operations when the current node is full. Because this action is slightly complex and can fully reflect the differences of several trees.

The comparison is that the search node is easier to implement, and the deletion of the node is just the opposite of inserting the process can be (in the actual application to delete is not inserted completely inverse operation. Often just delete the data and leave the space to be used).

Look at the division of the B-tree first, and the red value is the node that is newly inserted each time.

Each time a node is full, a split is required (splitting is a recursive process, and the following 7 inserts result in a two-layer split). Because the non-leaf nodes of the B-tree hold the same key value. So the value after the split of the full node will be distributed in three places: 1 The original node, 2 the parent node of the original node. 3 The new sibling node of the original node (reference 5. 7 of the insertion process).

Splitting has the potential to cause the height of the tree to be added (3. 7 of the insertion process). It may also not affect the height of the tree (5. 6 of the insertion process).


B + Tree splits: When a knot is full. Assign a new node and copy 1/2 of the data from the original node to the new node. Finally, a pointer to the new node is added to the parent node; the division of B + tree affects only the original node and the parent node, without affecting the sibling nodes. So it doesn't need pointers to sibling nodes.



b*: When a node is full, assume that its next sibling node is not full, then move some of the data to the sibling node. Then insert keyword at the original node, and finally change the keyword of the sibling node in the parent node (due to the keyword range of the sibling node). Let's say the brothers are full. The new node is added between the original node and the sibling node. And each copy 1/3 of the data to the new node. Finally, a pointer to the new node is added at the parent node.

Being able to see the division of the b* tree is ingenious. Since the b* tree to ensure that the split node is still 2/3 full, assuming that the method of B + tree, simply the full node is divided into two, will cause each node just 1/2 full, which does not meet the requirements of b* tree. So the strategy of the b* tree is after this node is full. Continue to insert the sibling node (which is why the b* tree needs to be in a non-leaf node plus a sibling linked list), until the brother node is also stuffed, and then pull up the brothers node together chipped, and the brother node each contributed 1/3 to establish a new node, the result is 3 nodes just 2/3 full, to reach the b* tree requirements, Happy.


The B + tree is suitable as a database infrastructure. is completely due to the computer's memory-mechanical hard disk two-tier storage structure. Memory can be completed high-speed random access questions (random access is given an arbitrary address, asking to return the data stored in this address) but the capacity is small. The random access to the hard drive through the mechanical action (1 Head Movement 2 disc rotation), access to the efficiency of a few orders of magnitude lower than memory, but the hard disk capacity is large. A typical database capacity is much larger than the available memory size, which determines that retrieving a single piece of data in a B + tree is likely to be done with several disk IO operations.

As you can see, for example, an action that reads a node down usually may be a disk IO operation, except that non-leaf nodes typically load memory at the initial stage to speed up access. At the same time to improve the horizontal traverse between nodes speed. The Blue CPU/memory read in the graph may be optimized into a binary search tree (the page directory mechanism in InnoDB) in a real database.


The B + tree in the real database should be very flat and be able to verify how flat the B + tree in InnoDB is by inserting enough data into the table sequentially. We use the CREATE statement for example to create a frequently that simply has a simple field, and then add data to populate the table. The statistical data passed (from the source see reference 1) can be analyzed in several intuitive conclusions, which show the scale of the B + tree in the database.

1 each leaf node stores 468 rows of data, and each non-leaf node stores about 1200 key values, which is a balanced 1200-way search Tree!

2 for a 22.1G capacity table. It only requires a B + tree with a height of 3 to store it, and this capacity is likely to meet the needs of a very large number of applications. Suppose the height is increased to 4. The storage capacity of the B + tree immediately increased to 25.9T!

3 for a 22.1G capacity table, the B + tree has a height of 3. If you want to load all non-leaf nodes into memory only need less than 18.8M of memory (how to come to this conclusion?). For a tree with a height of 2, 1203 leaf nodes also need only 18.8M space, while the 22.1G Magdalene table has a height of 3 and non-leaf nodes of 1204.

At the same time we if the leaf node size is greater than the non-leaf node, because the leaf node stores the row data, not the leaf node is only a key and a small amount of data.

)。 With such a small amount of memory, it is very efficient to retrieve the required data with only one disk IO operation.



2 MySQL storage engine and Index

It is possible to say that the database must be indexed, that the retrieval process becomes a sequential lookup without an index, and that the time complexity of O (n) is almost intolerable. We are easy to imagine a table with only a single keyword How to index a B + tree, just to store keyword in the tree node. When multiple fields are included in a database record, a B + tree can only store the primary key. If you retrieve a non-primary key field, the primary key index loses its function and becomes a sequential lookup.

A second set of indexes should be established on the second column to retrieve.

This index is organized by a separate B + tree.

There are two common ways to address the problem of multiple B + trees visiting the same set of table data. One is called a clustered index (clustered index), which is called a non-clustered index (secondary).

These two names are called indexes, though. However, this is not a separate index type. It's a way of storing data. For clustered index storage, row data is stored together with the primary key B + tree. Auxiliary keys The B + Tree stores only the secondary and primary keys. Primary key and non-primary key B + trees are almost two types of trees.

For non-clustered index storage. The primary key B + Tree stores a pointer to the true data row in the leaf node, rather than the primary key.

InnoDB uses a clustered index, organizes the primary key into a B + tree, and the row data is stored on the leaf node, and if the "where id = 14" condition is used to find the primary key, the corresponding leaf node can be found according to the search algorithm of the B + tree, and then the row data is obtained.

If a conditional search is made on the name column. Requires two steps: The first step is to retrieve the name in the secondary index B + tree. Reach its leaf node to get the corresponding primary key. The second step uses the primary key to run a B + tree retrieval operation at the main index B + species and finally reaches the leaf node to get the entire row of data.

The non-clustered index used by the myism, the two B + trees of the nonclustered index look no different. The structure of the nodes is identical only to the content of the storage. Primary key index The nodes of the B + Tree store the primary key, and the secondary key index B + Tree stores the secondary key. Table data is stored in a separate place, the leaf nodes of the two B + trees use an address to point to the real table data, and for the table data, the two keys are no different. Because the index tree is independent. Retrieve the index tree without visiting the primary key by using the secondary key.

To better visualize the differences between the two indexes. We imagine a table for example to store 4 rows of data. The ID as the primary index. Name as the secondary index. The diagram clearly shows the difference between the clustered index and the non-clustered index.




We focus on clustered indexes, which appear to be significantly less efficient than nonclustered indexes. Since each use of the secondary index is searched two times B + Tree lookup, this is not superfluous? What are the advantages of clustered indexes?

1 because row data and leaf nodes are stored together so that the primary key and row data are loaded together, the leaf nodes can be returned immediately. Assume that data is organized according to the primary key ID to get faster data.

2 Secondary index The advantage of using a primary key as a pointer instead of using an address value as a pointer is to reduce the maintenance of the secondary index when a row is moved or the data page splits, and using a primary key value as a pointer causes the secondary index to occupy a lot of other space. The advantage of this is that INNODB does not need to update this "pointer" in the secondary index when moving rows.

That is, the location of the line (implemented in the 16K page to locate, will be involved in the following) will change with the data in the database changes (the previous B + tree node splitting and page splitting), using a clustered index can ensure that regardless of the primary key B + tree node changes, the secondary index tree is not affected.

3 page structure

Assuming that the previous content is biased towards the principle of interpretation, then it starts with a detailed implementation.

Understanding the implementation of InnoDB has to mention the page structure, page is the most basic component of the entire InnoDB storage, but also the smallest unit of INNODB Disk Management. All the content associated with the database is stored in the page structure.

Page is divided into several types, the common page type is the data page (B-tree Node) undo page (undo Log Page) system page (systems pages) Transaction data page (Transaction system page), and so on. The size of a single page is 16K (compiled macro univ_page_size control), each page is uniquely identified with a 32-bit int value, which is exactly the InnoDB maximum 64TB storage capacity (16Kib * 2^32 = 64Tib). The basic structure of a page such as the one you see:


Each page has a common head and tail. However, the content in the middle varies depending on the type of page. The page's head has some data that we care about, showing the page's header details:


We focus on the fields related to the data organization: The page's head holds two pointers, pointing to the previous page and the next page, and the header with the page's type information and the number that uniquely identifies the page. Based on these two pointers we are very easy to imagine the page link is a doubly linked list structure.

Then look at the main content of the page. We focus primarily on the storage of row data and indexes, which are located in the User Records section of page, where user records occupies most of the page, and the user records consists of one record, Each record represents a node (not a leaf node and a leaf node) on the index tree.

Within a page. The tail of a single-linked list is represented by two records of fixed content. The "infimum" in string form represents the beginning, and "supremum" represents the end.

The two record used to represent the beginning end is stored in the section of system records. This system records and user records are two parallel segments. InnoDB There are 4 different record types, each of which are 1 primary key index tree non-leaf node 2 primary key index tree leaf node 3 Auxiliary key index tree non-leaf node 4 secondary key index tree leaf node.

The record formats for these 4 nodes are somewhat different, but they all store the next pointer to the next record. Perhaps we will specifically introduce these 4 nodes, now only need to think of the record as a stored data at the same time containing the next pointer to a single-linked list node can be.


The User record exists as a single-linked list within the page. The initial data is arranged in the order in which they were inserted, but as new data is inserted and old data is deleted, the physical order of the data becomes chaotic. But they still hold the logical order.


By combining the organization of user record and several page, we see a slightly complete form.


Now look at how to position a record:

1 through the root node starts to traverse an index B + tree, through each layer of non-leaf node finally reached a page, this page is stored in the leaf node.

2 in the page from the "infimum" node start to traverse the single-linked list (such a traversal is often optimized), assuming that the key is found to successfully return. Suppose the record reached "supremum". Indicates that there are no suitable keys in the current page. To use the page's next page pointer, jump to the next page and continue from "infimum" to find it one by one.

Depending on what data is stored in different types of record, the User record can be divided into four formats, depending on the B + tree node. The species is differentiated according to color.

1 Primary index tree non-leaf node (green)

1 The minimum value of the primary key stored in the child node (Min Cluster key on child), which is required by the B + tree, is to locate the detailed record in a page.

2 The number of the Page that contains the smallest value (Child Page #), which is the position of the record.

2 primary index tree leaf node (yellow)

1 primary KEY (Cluster key fields), B + tree required. is also part of the data row

2 Remove all columns except the primary key (Non-key fields). This is the collection of all the other columns of the data row that are removed from the primary key.

The 1 and 22 parts here add up to a complete data line.

3 Secondary index tree non-leaf nodes non-(blue)

The minimum value of the secondary key value stored in the 1 sub-node (min secondary-key on child). This is a B + tree that is required to position the detailed record in a page.

2 primary key value (Cluster key fields). Why do non-leaf nodes store primary keys? Because the secondary index is not unique, the B + tree requires that the value of the key must be unique. So here we combine the value of the auxiliary key and the value of the primary key as the true key value in the B + tree. Guaranteed uniqueness. However, this also results in a secondary index of the B + tree in which the non-leaf nodes are 4 bytes more than the leaf nodes. (That is, the middle blue node is 4 bytes more than red)

3 The number of the Page that contains the smallest value (Child Page #), which is the position of the record.

4 secondary index tree leaf nodes (red)

1 Secondary index key value (secondary key fields), which is required by the B + tree.

2 primary key value (Cluster key fields), which is used to perform a B + tree search in the main index tree to find the entire record.


The following is the most important part of this article, combined with the structure of B + tree and the contents of the 4 types of record presented above, we can finally draw a panorama. Because the B + tree of the secondary index has a similar structure to the primary key index, only the structure of the primary key index tree is drawn, and only the nodes of the primary key non-leaf node and the primary key leaf node are included. That is, the green and yellow parts.


Restore it to the more concise tree, which is part of the B + tree. Note that there is no one by one corresponding relationship between page and B + tree nodes. Page is just a save container for a record, it exists to facilitate bulk management of disk space, the page numbered 47 in the tree structure is split into two separate nodes.


This is the end of this article. This article is simply a summary of the data structures and implementations related to the InnoDB index, and does not involve MySQL's actual combat experience. This is mainly based on several reasons:

1 principle is the cornerstone, only have a full understanding of how the InnoDB index work, we have the ability to effectively use it.

2 The principle of knowledge is particularly suitable for the use of illustrations, I personally like this way of expression.

3 about InnoDB optimization, in the "high-performance MySQL" has a more comprehensive introduction, to optimize MySQL interested in the students are fully able to obtain relevant knowledge, my own accumulation has not reached the point that can share these content.

Another: Students who have a lot of other interests in InnoDB can look at Jeremy Cole's blog (the source of three articles in the literature). This man has been in MySQL. Yahoo,twitter. Google is engaged in database-related work. He's a great article!

References:

[1] Jeremy Cole The physical structure of InnoDB index pages

[2] Jeremy Cole B+TREE index structures in InnoDB

[3] Jeremy Cole The physical structure of records in InnoDB

[4] Kang MySQL technology insider-INNODB Storage Engine Second Edition

[5] Schwartz,b/zaitsev,p/tkach high performance MySQL third edition

[6] B-tree Wiki

Anatomy of MySQL InnoDB index

Contact Us

The content source of this page is from Internet, which doesn't represent Alibaba Cloud's opinion; products and services mentioned on that page don't have any relationship with Alibaba Cloud. If the content of the page makes you feel confusing, please write us an email, we will handle the problem within 5 days after receiving your email.

If you find any instances of plagiarism from the community, please send an email to: info-contact@alibabacloud.com and provide relevant evidence. A staff member will contact you within 5 working days.

A Free Trial That Lets You Build Big!

Start building with 50+ products and up to 12 months usage for Elastic Compute Service

  • Sales Support

    1 on 1 presale consultation

  • After-Sales Support

    24/7 Technical Support 6 Free Tickets per Quarter Faster Response

  • Alibaba Cloud offers highly flexible support services tailored to meet your exact needs.