MySQL database InnoDB Indexing principle detailed Practice Tutorial "diagram"

Source: Internet
Author: User
Tags min requires

InnoDB is the default storage engine for MySQL (Mysql5.5.5 is MyISAM, document). In the purpose of efficient learning, this article is mainly about the introduction of InnoDB, a small number involving MyISAM as a contrast.

This article is a summary of my learning process, the content is mainly from books and blogs (reference will be given), the process of adding some of their own understanding, the description of inaccurate places to say.

1 Various tree-shaped structures

It was not intended to start with a two-fork search tree, because there are already too many related articles on the web, but given the clarity of the illustrations that are very helpful to understanding the problem, and to ensure the integrity of the article, finally add this part.

Let's look at a few tree-shaped structures:

1 Search binary Tree: Each node has two child nodes, the increase of data will inevitably lead to a rapid increase in height, obviously this is not suitable as a large number of data storage infrastructure.

2 B-Tree: A M-order B-Tree is a balanced m-road search tree. The most important attribute is the number of keywords each non-root node contains: ┌m/2┐–1 <= J <= m–1; the number of child nodes of a node is 1 more than the key, so the keyword becomes the split flag of the child node. Usually in the diagram of the key calligraphy to the middle of the child node, very image, but also easy to distinguish with the B + tree behind. Because the data exist at the same time in leaf node and non leaf nodes, it is not easy to complete the key word in sequential traversing B-tree, which must be traversed in the middle sequence.

3 B + Tree: A M-order tree is a balanced m-road search tree. The most important attribute is the number of keywords each non-root node contains: ┌m/2┐–1 <= J <= m; The number of subtrees can be as many as the keywords. Non-leaf nodes store the smallest keywords in the subtree. The data nodes only exist in the leaf nodes and the horizontal pointer is added between the leaf nodes, which makes it very easy to traverse all the data sequentially.

4 b* Tree: A M-order B-tree is a balanced m-road search tree. The most important two properties are 1 the number of keywords that each non-root node contains is J-Satisfied: ┌m2/3┐–1 <= J <= m;2 A horizontal pointer is added between the non-leaf nodes.

B/b+/b* Three of trees have similar operations, such as retrieving/inserting/deleting nodes. The focus here is only on inserting nodes, and only on the inserts that are full in the current node, because the action is slightly more complex and can fully reflect the differences in several trees. In contrast, the retrieval node is relatively easy to implement, and the deletion node as long as the reverse process can be completed (in the actual application of the deletion is not inserted completely reverse operation, often only delete data and save space for subsequent use).

Look at the split of the B-tree, the red value of the following figure is the new inserted node each time. Every time a node is full, splitting is needed (splitting is a recursive process, the insertion of the following 7 causes a two-layer split), because the non-leaf node of the B-Tree also holds the key value, so the value of the full node split will be distributed in three places: 1 original node, 2 original node's parent node, 3 The new sibling node of the original node (reference 5,7 Insert process). Splitting can cause the height of the tree to increase (refer to the 3,7 Insert process), or it may not affect the height of the tree (refer to the 5,6 Insert process).

B + Tree splitting: When a node is full, a new node is allocated, and 1/2 of the data in the original node is copied to the new node. Finally, the pointer of the new node is added to the parent node; the split of the B + tree affects only the original node and the parent node, without affecting the sibling nodes, so it does not need to point to the sibling pointer.

The division of the b* tree: When a knot is full, if its next sibling is not full, move some of the data to sibling nodes, insert the keyword at the original node, and finally modify the key of the sibling node in the parent node (because the sibling node's keyword range has changed). If the brother is full, then add a new node between the original node and the sibling node, copy 1/3 data to the new node, and finally add a pointer to the new node at the parent node. Can see b* tree division is very ingenious, because the b* tree to ensure that after the split node is 2/3 full, if the use of B + Tree method, but simply the full node in two, will cause each node only 1/2 full, this does not meet the requirements of the b* tree. So the strategy adopted by the b* tree is that after this node is full, continue inserting sibling nodes (which is why the b* tree needs to add a sibling link to the non-leaf node) until the sibling node is stuffed, then pull up the sibling node together collections, and the sibling node each funded 1/3 to set up a new node, The result is that 3 nodes are just 2/3 full, reaching the b* tree's requirements and happy.

B + trees are suitable for the infrastructure of the database, solely because of the memory of the computer-the two-tier storage structure of the mechanical hard disk. Memory can be done with fast random access (random access gives you an address that requires you to return the data stored by this address) but with less capacity. and the random access to the hard disk through mechanical action (1 head Mobile 2 disc rotation), access efficiency is lower than the memory of several orders of magnitude, but the hard disk capacity is larger. A typical database capacity significantly exceeds the available memory size, which determines that retrieving a single piece of data in a B + tree is likely to be done with several disk IO operations. As shown in the following illustration: Typically, an action that reads a node down may be a disk IO operation, but a non-leaf node typically loads memory at the initial stage to speed up access. At the same time, in order to improve the traverse speed between the nodes, the real database may optimize the Blue CPU calculation/memory reading in the graph into the binary search tree (page directory mechanism in InnoDB).


A B + tree in a real database should be very flat, and you can verify how flat a B + tree in InnoDB is by inserting enough data into the table sequentially. We create a test table with a simple field from the CREATE statement in the following figure, and then keep adding data to populate the table. A few intuitive conclusions can be analyzed by using the statistical data of the following graph (sources see reference 1), and these conclusions show the scale of the B + tree in the database.

1 each leaf node stores 468 rows of data, each non-leaf node stores about 1200 key values, which is a balanced 1200-way search Tree!

2 for a 22.1G-capacity table, it can only be stored in a B + tree with a height of 3, which can probably meet the needs of many applications. If you increase the height to 4, then the storage capacity of the B + tree immediately increases to 25.9T!

3 for a 22.1G-capacity table, the height of a B + tree is 3, and if you want to load all the non-leaf nodes into memory, you need less than 18.8M of memory (how do you come to this conclusion?) Because for a tree with a height of 2, 1203 leaf nodes only need 18.8M space, and 22.1G is the height of the table is 3, non-leaf nodes 1204. We also assume that the leaf node size is larger than the Non-leaf node, because the leaf node stores the row data instead of the leaf node with only the key and a small amount of data. , with so little memory, it is highly efficient to retrieve the data needed only once for disk IO operations.

2 MySQL's storage engine and index

It can be said that the database must have an index, no index is the retrieval process into a sequential lookup, O (n) time complexity is almost unbearable. It is very easy to imagine how a table with only one keyword can be indexed using a B + tree, as long as the keyword is stored in the tree node. When a database contains more than one field in a record, a B + tree can store only the primary key, and if a non-key field is retrieved, the primary key index loses its effect and becomes sequential lookup.  You should then create a second set of indexes on the second column you want to retrieve. This index is organized by an independent B + tree. There are two common ways to solve the problem of multiple B + trees accessing the same set of table data, one called the Clustered index (clustered index), and the other is called a nonclustered index (secondary index). Although these two names are all called indexes, this is not a separate index type, but rather a way of storing data. For clustered index storage, row data is stored with the primary key B + tree, and the secondary key B + Tree stores only the secondary keys and primary keys, and the primary key and the non primary key B + trees are almost two types of trees. For nonclustered index storage, primary key B + trees store pointers to true data rows, not primary keys, in leaf nodes.

InnoDB uses a clustered index, the primary key is organized into a B + tree, and the row data is stored on the leaf node, and if you use a condition such as the WHERE id = 14″ to find the primary key, you can find the corresponding leaf node according to the B + Tree retrieval algorithm, and then get the row data. A conditional search for the Name column requires two steps: The first step is to retrieve name in the secondary index B + tree, to reach its leaf node and get the corresponding primary key. The second step is to use the primary key to perform another B + tree retrieval operation on the primary index B + trees, eventually reaching the leaf node to get the entire row of data.

Myism is a nonclustered index, and two B + trees with non-clustered indexes look nothing different, the structure of the nodes is exactly the same as the contents of the storage, the primary key is stored by the node of the B + tree, and the secondary key index B + Tree stores the secondary key. The table data is stored in a separate place where the leaf nodes of the two B + trees use an address to point to the real table data, and there is no difference between the two keys for table data. Because the index tree is independent, the index tree without access to the primary key is retrieved through the secondary key.

To illustrate the difference between the two indexes, we assume that a table has 4 rows of data stored in the following figure. Where ID is the primary index and name is the secondary index. The diagram shows clearly the difference between clustered and nonclustered indexes.

We focus on clustered indexes, and it looks as if the efficiency of the clustered index is significantly lower than the nonclustered index, since it takes two B + tree lookups each time you use the secondary index search, isn't it superfluous? What are the advantages of clustered indexes?

1 because the row data and the leaf node are stored together so that the primary key and row data are loaded together in memory, the leaf node can immediately return the row data, and if you organize the data according to the primary key ID, get the data faster.

2 Secondary indexes The advantage of using a primary key as a pointer instead of using an address value is to reduce the maintenance of secondary indexes when rows are moved or when a data page splits, and using primary key values as pointers makes the secondary index take up more space. The benefit is that INNODB does not need to update this "pointer" in the secondary index while moving a row. That is, the position of the line (implemented in the 16K page to locate, later on) will change as the data in the database changes (in front of the B + tree node splitting and page splitting), using a clustered index ensures that no secondary index tree is affected regardless of how the node of the primary key B + tree changes.

3 page structure

If the previous content is biased towards the principle of interpretation, then it begins to involve concrete implementation.

Understanding the implementation of InnoDB has to mention page structure, page is the most basic component of the entire InnoDB storage, is also INNODB Disk Management of the smallest unit, all the content related to the database is stored in this page structure. The page is divided into several types, and the common types of pages are the data page (B-tree Node) undo page (The Undo Log Page) system page (Transaction system pages), and so on. The size of a single page is 16K (compiled macro univ_page_size control), with each page uniquely identified with a 32-bit int value, which corresponds exactly to the InnoDB maximum 64TB storage capacity (16Kib * 2^32 = 64Tib). The basic structure of a page is shown in the following illustration:

Each page has a common header and tail, but the contents of the middle are changed depending on the type of page. Page's head contains some of the data we care about, and the following figure shows the page's header details:

We focus on the fields that are related to the structure of the data: page's head holds two pointers, pointing to the previous page and the next page, the header and the page type information and the number that uniquely identifies the page. Based on these two pointers, it is easy to imagine that the page link is a two-way linked list structure.

And then look at the main page, we focus on the storage of row data and indexes, they are all located in the User Records section of page, user records occupies most of page space, user records consists of a single record, Each record represents a node (not a leaf node and a leaf node) on the index tree. Inside a page, the head and tail of a single list are represented by two records of the fixed content, the "infimum" representation at the beginning of the string form, and the "supremum" represents the end. The two record names that represent the beginning end are stored in the system records segment, and the system records and user records are two parallel segments. There are 4 different record types in InnoDB, they are 1 primary key index tree NON-LEAF Node 2 primary key index tree leaf node 3 Auxiliary key index tree non-leaf node 4 auxiliary key index tree leaf node. The record formats for these 4 nodes are somewhat different, but they all store the next pointer to the next record. We'll give you a detailed description of these 4 nodes, and now we just need to take the record as a single linked table node that stores the data at the same time as the next pointer.

The User record exists as a single linked list in the page, and the initial data is in the order of insertion, but as new data is inserted and old data is deleted, the physical order of the data becomes confusing, but they remain logically sequential.

Combining the form of the user record with a number of page, you see a slightly complete form.

Now look at how to locate a record:

1 through the root node to start traversing an index of the B + tree, through each layer of non-leaf nodes eventually reached a page, this page is stored in the leaf node.

2 in the page from the "infimum" node to start traversing a single list (this traversal is often optimized), if the key is found to successfully return. If the record arrives in "Supremum" and the current page does not have the appropriate key, the page's next page pointer is used to jump to the next page to continue looking from "infimum".

Look at the different types of record in the end what data stored, according to the B + tree node, the User can be divided into four formats, the image below according to the color of the difference.

1 Primary index tree non-leaf node (green)

1 the smallest value in the primary key stored by the child node (Min Cluster key on child), which is required by the B + tree to position the specific record in a page.

The number of the page that contains the minimum value of 2 (child Page numbers), which is the position of the record.

2 primary index tree leaf node (yellow)

1 primary KEY (Cluster key Fields), B + tree must also be part of the data row

2 removes all columns except the primary key (Non-key Fields), which is a collection of all other columns for the drop primary key of the data row.

The 1 and 22 parts here add up to a complete data row.

3 Secondary index tree non-leaf node non-(blue)

1 the smallest value in the secondary key value stored in the child node (Min secondary-key on Child), which is required by the B + tree to locate the location of the specific record in a page.

2 primary key value (Cluster key Fields), non-leaf node why should I store the primary key? Because the secondary index is not unique, the B + tree requires that the value of the key must be unique, so the value of the secondary key and the value of the primary key are combined as true key values in the B + tree to ensure uniqueness. However, this also results in the secondary index B + Tree non-leaf nodes are more than 4 bytes more than the leaf node. (That is, the blue node in the following figure is 4 bytes more than red)

The number of the page that contains the minimum value of 3 (child Page numbers), which is the position of the record.

4 secondary index tree leaf nodes (red)

1 Secondary index key value (secondary key Fields), which is required by the B + tree.

2 primary key value (Cluster key Fields), which is used to do another B + tree retrieval in the main index tree to find the entire record.

Here is the most important part of this article, combined with the structure of the B + tree and the contents of the 4 kinds of record described above, we can finally draw a panoramic picture. Because the B + tree of the secondary index has a similar structure to the primary key index, only the structure diagram of the primary key index tree is drawn, containing only the "PRIMARY key Non-leaf node" and "Primary key leaf node" nodes, the green and yellow portions of the above image.

Restore the above figure to the simpler tree diagram below, which is part of the B + tree. Note that there is no one by one correspondence between page and B + tree nodes, and page is only a storage container for a record, which exists to facilitate batch management of disk space, and the page numbered 47 in the figure above is split into two separate nodes on the tree structure.

At this point even the end of this article is only the InnoDB index-related data structure and implementation of a number of carding summary, does not involve the actual experience of MySQL. This is mainly based on several reasons:

1 principle is the cornerstone, only fully understand the InnoDB index of the working way, we have the ability to use it efficiently.

2 The principle of knowledge is particularly suitable for use of illustrations, I personally like this way of expression.

3 about InnoDB optimization, in the "High-performance MySQL" in a more comprehensive introduction, to optimize MySQL interested students can get the relevant knowledge, my own accumulation has not reached the point of sharing these content.

Related Article

Contact Us

The content source of this page is from Internet, which doesn't represent Alibaba Cloud's opinion; products and services mentioned on that page don't have any relationship with Alibaba Cloud. If the content of the page makes you feel confusing, please write us an email, we will handle the problem within 5 days after receiving your email.

If you find any instances of plagiarism from the community, please send an email to: info-contact@alibabacloud.com and provide relevant evidence. A staff member will contact you within 5 working days.

A Free Trial That Lets You Build Big!

Start building with 50+ products and up to 12 months usage for Elastic Compute Service

  • Sales Support

    1 on 1 presale consultation

  • After-Sales Support

    24/7 Technical Support 6 Free Tickets per Quarter Faster Response

  • Alibaba Cloud offers highly flexible support services tailored to meet your exact needs.