Analyze the data structure behind MySQL index design in layman's

Source: Internet
Author: User
Tags mysql index

In our company's DB specification, it is expressly stated that:

1, the building statement must explicitly specify the primary key 2, no special case, the primary key must be monotonically increment

Many research and development partners do not understand this requirement. In this paper, we will analyze the data structure and algorithms behind MySQL index design in depth, so as to help you to make the following questions:

1 , why does the InnoDB table need a primary key?  2.Why is it suggested that the InnoDB table primary key is monotonically increasing?  3, why do not recommend InnoDB table primary key set too long? 

B-tree (Multi-path search tree, not binary) is a common data structure. Using the B-TREE structure can significantly reduce the intermediate process that is experienced when locating records, thus speeding up the access speed. B is generally regarded as the abbreviation of balance. This data structure is generally used to index the database, the overall efficiency is high. Currently, many database products are indexed based on the b+tree structure. MySQL also uses b+tree, it is a variant of the b-tree, in fact, the characteristics of almost, understand the B-tree also understand the b+tree.

1, a M-order b-tree has the characteristics of "memorize in the Heart"

1) Number of children at the root node >=2(if the tree height is greater than 1)2Root nodes and leaf nodes, other nodes of the child count as [Ceil (m/2), M] each. The Ceil function means that the integer is taken3all leaf nodes appear on the same layer, and leaf nodes do not store data. 4each node contains n keyword information: (P0,K1,P1,K2,P2 ...). KN,PN) of which:4.1) Ki (i=1,2... n) as the keyword, and K (i-1) <Ki, that sort from small to large4.2The number of keywords must be met by N: [Ceil (m/2)-1, M-1]        4.3Pi points to the subtree, and the pointer p (i-1All the keywords in the subtree node pointed to are less than KI. That is: The left child of any key in the parent node is smaller than it, and the right child is greater than it.

2. B-tree Insert operation

1 Insert new element, if leaf node space enough, then insert it, follow from small to large sort; 2) If the node space is full, split. Splits half of the keywords in the node into the new node, and the middle keyword moves up to the parent node.

"Example": if the single from the above characteristics and insert rules can not be seen, please combine the following steps of the legend:

Insert the following number into a 5-step b-tree: [3,14,7,1,8,5,11,17,13,6,23,12,20,26,4,16,18,24,25,19]

First of all, according to the b-tree characteristics, the number of keywords per node range is: 2<=n<=4

"First step": Inserting 3,14,7,1

Here, the number of keywords in the first node is just full.

"Step Two": Insert 8

Since 8 is greater than 7, it should be inserted into the right subtree, a node to store up to 4 keywords, according to the insertion rule, the Middle keyword 7 is moved up to form the parent node, the other in accordance with 50% split into two nodes, such as.

"Step three": Inserting 5,11,17

Since 5 is less than 7, insert the left subtree, 11,17 greater than 7, and insert the right sub-tree. The leaf node is not full of 4 keywords, so you can insert 5,11,17 directly.

"Fourth Step": Insert 13

13 is greater than 7, should be inserted into the right subtree node, because the node is full of 4 keywords, need to split. 13 is just the middle keyword that moves up to the parent node, and the other splits into two nodes according to 50%.

"Fifth step": Inserting 6,23,12,20

These numbers can be inserted directly into the rules without splitting the operation.

"Sixth step": Insert 26

Since 26 is greater than 13, it should be inserted in the right subtree node of 13, but the node is full, needs to be split, the middle 20 is moved up to the parent node, and the others are divided into 50% nodes according to two.

"Seventh Step": Insert 4

Since 4 is less than 7, it should be inserted into the left node of 7, but the node is full, splitting is required, the Middle keyword 4 is moved up to the parent node, and the others are divided into 50% nodes according to two.

"Eighth step": Inserting 16,18,24,25

The above 4 digits are inserted directly into the corresponding position by size, without splitting operation.

"Nineth Step": Insert 19

Insert 19, need to put in the back of 18, but because the node is full, you need to split the operation, the Middle keyword 17 is moved to the parent node, the other according to 50% split into 14,16 and 18, 192 nodes;

Don't think it's over. 17 is moved up to the parent node, because the parent node is full, so the parent node is split, the Middle keyword 13 is moved up to form a new parent node, the other in accordance with 50% split into 4, 7 and 17, 202 nodes, to this, the data insert is complete, Formed a b-tree.

3. Delete operation

The delete operation is slightly more complex, and there are no examples to unfold. The approximate idea is as follows:

1) find the element to be deleted in the B-tree, and if the element exists in the B-tree, the element isdeleted in its node.  2) after deleting the element, determine whether the element has left and right child nodes, if any, then move the child node up to a similar element in the parent, and then enter the third step; if not, go to step three.  3) after moving the corresponding element, if the number of elements in the node is less than Ceil (m/2)-1, then it is necessary to see if the adjacent sibling nodes are sufficient (the number of elements in the node is greater than Ceil (m/2). 1), if sufficient, borrow an element from the parent node;
If the neighboring brothers are not enough, that is, after the completion of the number of node element is less than ceil (m/2)-1, then the node and adjacent to a sibling node merge into a node, in order to meet the conditions

In summary, for index files, whether inserting or deleting b-tree nodes, it is very expensive to continuously split and merge nodes to maintain the b-tree structure.

4, B+tree Introduction:

The MySQL index uses b+tree, which is a b-tree deformation tree that is generated as required by the file system, and their differences are:

1) The subtree pointer of the non-leaf node is the same as the number of keywords; 2) the records in the parent node of B + tree are the minimum values in the lower subtree, 3) All the leaf nodes are connected by a chain pointer; 4) All the keywords appear at the leaf nodes;

For example, here is a typical b+tree (assuming a maximum of 4 keywords per node).

Other features and operations are basically the same as B-tree. To this, B-tree and b+tree basic knowledge has been understood, the following content is based on the above concepts.

The MySQL index implementation is on the storage engine side, different storage engines for the implementation of the index is different, such as InnoDB and MyISAM, the following we focus on the implementation of the InnoDB engine index.

1, InnoDB Index implementation method:

For InnoDB tables, the data file IBD itself is an index structure organized by B+tree, the tree's leaf node data field holds the complete data record.

For example, here is the students table, the ID is the primary key, the name has a secondary index, and there are 6 rows of data records.

If in a 5-step b+tree (keyword range [2,4]), its primary key index is organized as follows:

Is the b+tree of the InnoDB primary key index, the leaf node contains the complete data record, such as the index called the clustered index. Because the InnoDB data file itself is clustered by the primary key, the INNODB requires that the table must have a primary key (MyISAM can not), and if it is not explicitly specified, MySQL will prefer to automatically select a column that uniquely identifies the data record as the primary key, such as a unique index column, if no such column exists. Then MySQL automatically generates an implicit field for the InnoDB table as the primary key with a length of 6 bytes and a type of longint.

Secondary index structure:

For secondary index, the non-leaf node holds the index value, such as the Name field above. The leaf node is no longer stored as a data record, but as a primary key value.

From the above b+tree can be summed up to:

MySQL clustered index makes search by primary key very efficient. Secondary indexes need to search two times index:          First: Retrieve secondary index get primary key value          second: Retrieve records from primary key value to primary key index

Here, again, to analyze the questions raised at the beginning of this article:

Question 1, why does the InnoDB table need a primary key?        1) InnoDB table data files are based on the primary key index organization, there is no primary key, MySQL will find a way to fix me, so the primary key must have;
2) High efficiency based on primary key query;
3) All other types of indexes refer to the primary key index; question 3, why is it not recommended that the InnoDB table primary key be set too long?
Because the secondary index holds the reference primary key index, an too long primary key index causes the secondary index to become too large;

In the example above: Insert the following number into a 5-step b-tree: [3,14,7,1,8,5,11,17,13,6,23,12,20,26,4,16,18,24,25,19]

Inserting these unordered data has been a total of 6 splits, and for disk index files, each split is an expensive operation;

If the above data in order, again inserted is not the effect will be good, I tried the next, although each time is inserted into the most right node, involving less migration data, but the number of divisions is still quite many, need 7 times split.

Each split is carried out in accordance with 50%, so the obvious disadvantage is that the index page space utilization is around 50%, and for the incremental insertion efficiency is not good, on average every two times inserted, the right node has to be split. How did the InnoDB improve?

InnoDB is simply improving the increment/decrement situation, no longer using the 50% split strategy, but using the following split strategy:

For increment/Decrement Index insert operation: 1, insert a new element, determine whether the leaf node space is sufficient, if enough, directly insert 2, if the leaf node space is full, determine whether the parent node space is sufficient, if sufficient, insert the new element into the parent node, if the parent node space is full, then split.

For example, the following 5-step B+tree:

Now that you are inserting 10,11,14,15,17 consecutively, the step-up legend with the optimized split strategy is as follows:

"First step": Insert 10

Because there is room for the right node, you can insert it directly.

"Step Two": Insert 11

Insert 11 o'clock, because the right node space is full, if you use the 50% split strategy, you need to split the operation, but using the optimized split strategy, when the node space is full, but also to determine whether the node's parent node is full, if the parent node has space, then inserted into the parent node, so 11 is inserted into the parent node, At the same time form a sub-node.

"Step two": Insert 14,15,17

The optimized split strategy only targets the increment/decrement situation, significantly reducing the number of splits and greatly improving the utilization of the index page space.

If it is randomly inserted, it may cause a higher cost of splitting probabilities. So the InnoDB storage engine maintains a last-inserted positional variable for each index page, and the last insertion is an increment/decrement identity. InnoDB can determine whether the newly inserted data satisfies the increment/decrement condition According to this information, and if it satisfies, adopts the improved splitting strategy and, if not satisfied, divides the strategy by 50%.

Here we can answer another question raised at the beginning of this article:

Question 2: Why is it recommended that the InnoDB table primary key be monotonically incremented?

if the InnoDB table primary key is monotonically increasing, the improved B+tree splitting strategy can be used to significantly reduce the number of b-tree splits and data migration, thus improving data insertion efficiency.

Not only that, it also greatly improves index page space utilization.

Conclusion

By learning b+tree data structures to deepen our understanding of the MySQL index storage structure, we are very helpful in designing and optimizing indexes.

The above is what I want to share with you, we welcome you to exchange learning.

Reference article:

Http://database.51cto.com/art/201107/275030_1.htm

Http://www.2cto.com/database/201411/351106.html

http://hedengcheng.com/?p=525

Analyze the data structure behind MySQL index design in layman's

Related Article

Contact Us

The content source of this page is from Internet, which doesn't represent Alibaba Cloud's opinion; products and services mentioned on that page don't have any relationship with Alibaba Cloud. If the content of the page makes you feel confusing, please write us an email, we will handle the problem within 5 days after receiving your email.

If you find any instances of plagiarism from the community, please send an email to: info-contact@alibabacloud.com and provide relevant evidence. A staff member will contact you within 5 working days.

A Free Trial That Lets You Build Big!

Start building with 50+ products and up to 12 months usage for Elastic Compute Service

  • Sales Support

    1 on 1 presale consultation

  • After-Sales Support

    24/7 Technical Support 6 Free Tickets per Quarter Faster Response

  • Alibaba Cloud offers highly flexible support services tailored to meet your exact needs.