MySQL indexing principle

Last Update:2016-05-02 Source: Internet

Author: User

Tags mysql query mysql index


The previous article, "MySQL Index," covered the basics of indexes. This article explains how MySQL indexes work.

MySQL indexing principle

## Purpose of an index

Why do we need indexes? Put simply, an index improves query efficiency, much like the table of contents of a book. The other advantages of indexes are not repeated here; please consult other material for those.

## How an index works

Besides a book's table of contents, everyday life offers many similar examples, such as dictionaries and train stations. They all work the same way: they narrow the range of data to be examined until the desired result is found, and they turn random lookups into sequential ones, so that we always locate data using the same search method.

A database works the same way, but is obviously much more complex, because it must handle not only equality queries but also range queries (>, <, between, in), fuzzy queries (like), union-style queries (or), and so on.

How should the database handle all of these cases?

Recall the dictionary example: can we divide the data into segments and then search segment by segment? In the simplest case, with 1,000 records, records 1 to 100 form the first segment, 101 to 200 the second, 201 to 300 the third, and so on. To find record 250 we only need to look in the third segment, which eliminates 90% of the irrelevant data at once. But what about 10 million records? How many segments would be best? Anyone with some background in algorithms will think of a search tree, whose average lookup complexity is O(log N) and which offers good query performance.

This overlooks a key problem, however: the complexity model assumes that every operation costs the same. A database implementation is more complicated. The data is stored on disk, and for performance only part of it is read into memory for processing at a time. Since accessing the disk costs roughly 100,000 times as much as accessing memory, a simple search tree struggles to meet the needs of complex applications.

### Disk I/O and read-ahead

Since disk access was just mentioned, here is a brief introduction to disk I/O and read-ahead. A disk reads data through mechanical movement, and the time for each read can be divided into three parts: seek time, rotational delay, and transfer time.

Seek time is the time required for the disk arm to move to the specified track; on mainstream disks it is generally below 5 ms.

Rotational delay is determined by the disk's rotation speed. A 7200 rpm disk rotates 7,200 times per minute, or 120 times per second, so the average rotational delay (half a revolution) is 1/120/2 s = 4.17 ms.

Transfer time is the time spent reading data from or writing data to the disk. It is typically a fraction of a millisecond and is negligible compared with the first two.

The time for one disk access, that is, one disk I/O, is therefore approximately 5 + 4.17 ≈ 9 ms. That may sound acceptable, but consider that a 500 MIPS machine executes 500 million instructions per second, because instructions run at electronic speed. In other words, in the time of a single I/O the machine could execute roughly 4.5 million instructions. Databases routinely hold millions or even tens of millions of rows; at 9 milliseconds per access, that would obviously be a disaster.
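The arithmetic above can be reproduced in a few lines of Python. This is only a back-of-the-envelope sketch: the function names are made up for illustration, and the inputs (5 ms seek, 7200 rpm, 500 MIPS) are the figures quoted in the text, not measurements.

```python
def average_io_ms(seek_ms: float, rpm: int) -> float:
    """Seek time plus average rotational delay (half a revolution), in ms."""
    revolutions_per_second = rpm / 60
    rotational_delay_ms = 1000 / revolutions_per_second / 2
    return seek_ms + rotational_delay_ms


def instructions_per_io(mips: int, io_ms: float) -> float:
    """Instructions a CPU rated at `mips` could execute during one I/O."""
    return mips * 1_000_000 * io_ms / 1000


io_ms = average_io_ms(seek_ms=5.0, rpm=7200)
print(f"one disk I/O: {io_ms:.2f} ms")
print(f"instructions per 9 ms I/O: {instructions_per_io(500, 9.0):,.0f}")
```

With these inputs one I/O comes to a little over 9 ms, during which a 500 MIPS machine could run about 4.5 million instructions.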

Because disk I/O is so expensive, the operating system applies an optimization: on each I/O it reads not only the data at the requested disk address but also the adjacent data into a memory buffer. This follows from the principle of locality, which says that when a program accesses data at one address, the data adjacent to it is likely to be accessed soon as well. The data read by one I/O is called a page. The page size depends on the operating system and is generally 4 KB or 8 KB, so reading any data within a single page costs only one I/O. This fact is very helpful for designing the index data structure.

### Index data structure

So far we have covered everyday examples of indexes, the basic principle of indexing, the complexity a database faces, and the relevant operating system background. The point is that no data structure arises in a vacuum; each has its own background and use cases. To summarize what we need from this data structure, the requirement is actually simple: each lookup should keep the number of disk I/Os very small, ideally a constant. Could a multi-way search tree whose height is kept under control meet this need? This is how the B+ tree came about.

### The B+ tree in detail

The accompanying figure shows a B+ tree (for the formal definition, see a reference on B+ trees); only a few points are covered here. Each light blue block is called a disk block. Each disk block contains several data items (dark blue) and pointers (yellow). For example, disk block 1 contains data items 17 and 35 and pointers P1, P2, and P3: P1 points to the disk block for values less than 17, P2 to the block for values between 17 and 35, and P3 to the block for values greater than 35. The real data lives in the leaf nodes: 3, 5, 9, 10, 13, 15, 28, 29, 36, 60, 75, 79, 90, 99. Non-leaf nodes do not store real data, only data items that guide the search; for example, 17 and 35 do not actually exist in the data table.

### B+ tree lookup process

Using the same figure, suppose we want to find data item 29. First, disk block 1 is loaded from disk into memory, which is the first I/O. A binary search in memory determines that 29 lies between 17 and 35, selecting the P2 pointer of disk block 1; the in-memory time is so short compared with a disk I/O that it can be ignored. Disk block 3 is then loaded into memory via the disk address in the P2 pointer of disk block 1, which is the second I/O. 29 lies between 26 and 30, selecting the P2 pointer of disk block 3, and disk block 8 is loaded into memory via that pointer, which is the third I/O. A binary search in memory then finds 29 and the query ends, after three I/Os in total. In practice, a three-level B+ tree can represent millions of rows. If finding one row among millions takes only three I/Os, the performance gain is huge; without an index, each data item could cost one I/O, for millions of I/Os in total, which is obviously very expensive.
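The lookup described above can be simulated in Python to count the I/Os. This is a minimal sketch, not how MySQL stores pages: the "disk" is a plain dictionary, blocks 1, 3, and 8 follow the text, and every other block number and inner key is an assumption made so that the leaf values listed earlier fit into a three-level tree.

```python
from bisect import bisect_left, bisect_right

# Simulated disk: block id -> block. Inner blocks have "children"; leaves do not.
# Blocks 1, 3 and 8 follow the text; all other ids and inner keys are assumed.
DISK = {
    1: {"keys": [17, 35], "children": [2, 3, 4]},
    2: {"keys": [8, 12], "children": [5, 6, 7]},
    3: {"keys": [26, 30], "children": [None, 8, None]},  # only P2 is described
    4: {"keys": [65, 87], "children": [9, 10, 11]},
    5: {"keys": [3, 5]},
    6: {"keys": [9, 10]},
    7: {"keys": [13, 15]},
    8: {"keys": [28, 29]},
    9: {"keys": [36, 60]},
    10: {"keys": [75, 79]},
    11: {"keys": [90, 99]},
}


def search(key, root_id=1):
    """Return (found, io_count), counting one I/O per block loaded."""
    io_count = 0
    block_id = root_id
    while block_id is not None:
        block = DISK[block_id]  # stands in for one disk read
        io_count += 1
        keys = block["keys"]
        if "children" not in block:  # leaf: binary search for the key itself
            i = bisect_left(keys, key)
            return i < len(keys) and keys[i] == key, io_count
        # Inner block: binary search picks the pointer (P1, P2, ...) to follow.
        block_id = block["children"][bisect_right(keys, key)]
    return False, io_count


print(search(29))
```

Searching for 29 visits blocks 1, 3, and 8 in that order, so the function reports the key as found after three simulated I/Os.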

### Properties of the B+ tree

1. From the analysis above, the number of I/Os depends on the height h of the B+ tree. Suppose the table holds N rows and each disk block holds m data items; then h = log_(m+1) N. For a given N, the larger m is, the smaller h is. And m = disk block size / data item size, where the disk block size is the size of one data page and is fixed, so the less space each data item occupies, the more items fit in a block and the lower the tree. This is why each data item, that is, the index field, should be as small as possible: an int takes 4 bytes, half of the 8 bytes a bigint takes. It is also why the B+ tree keeps the real data in the leaf nodes rather than the inner nodes: if real data were placed in the inner nodes, the number of data items per disk block would drop sharply and the tree would grow taller. In the extreme case where each block holds only one data item, the tree degenerates into a linear list.

2. When the data item of a B+ tree is a composite structure such as (name, age, sex), the B+ tree builds the search tree by comparing the fields from left to right. For example, when searching for (Zhang San, 20, F), the B+ tree first compares name to decide the search direction; if name is equal, it compares age and then sex in turn until it reaches the data. But for a search like (20, F), which has no name, the B+ tree does not know which node to visit next, because name is the first comparison key used when the tree was built; you must search by name first to know where to go. For a search like (Zhang San, F), the B+ tree can use name to determine the search direction, but the next field, age, is missing, so it can only locate all entries whose name is Zhang San and then filter them for sex equal to F. This is a very important property: the leftmost-prefix matching of indexes.
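The height relation in point 1 can be checked numerically. The sketch below is illustrative only: the 4 KB block size and the 10 million row count are assumed values, and m is computed as block size divided by item size, ignoring the space that pointers and page headers take in a real index.

```python
def tree_height(n_rows: int, items_per_block: int) -> int:
    """Smallest h with (m + 1) ** h >= N, i.e. log base (m + 1) of N rounded up."""
    height, reachable = 0, 1
    while reachable < n_rows:
        reachable *= items_per_block + 1
        height += 1
    return height


BLOCK_SIZE = 4096      # assumed page size in bytes (the text mentions 4 KB or 8 KB)
N_ROWS = 10_000_000    # assumed table size

for label, item_size in [("int key", 4), ("bigint key", 8), ("200-byte item", 200)]:
    m = BLOCK_SIZE // item_size  # ignores pointer and header overhead
    print(f"{label}: m = {m}, height = {tree_height(N_ROWS, m)}")
```

Under these assumptions the narrow keys keep the tree at 3 levels, while 200-byte items push it to 6 levels, which is the effect point 1 describes.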

Slow query optimization

The principles behind MySQL indexes are fairly dry; an intuitive understanding is enough, and there is no need to master every detail. With that understanding, do you already have some ideas about slow queries? Let's first summarize the main principles for building indexes:

Several principles of index building

1. Leftmost-prefix matching. This is a very important principle: MySQL keeps matching index columns to the right until it meets a range condition (>, <, between, like), then stops. For example, with a = 1 and b = 2 and c > 3 and d = 4, an index built in the order (a,b,c,d) cannot use d, whereas an index built as (a,b,d,c) can use all four columns; the order of a, b, and d within the index can be adjusted freely.

2. = and in conditions can appear in any order. For example, for a = 1 and b = 2 and c = 3, an (a,b,c) index can be built with the columns in any order; the MySQL query optimizer rewrites the query into a form the index can use.

3. Choose highly selective columns for indexes where possible. Selectivity is computed as count(distinct col)/count(*), the proportion of distinct values in the column. The larger the ratio, the fewer records are scanned. A unique key has a selectivity of 1, while some status or gender columns may have a selectivity close to 0 on large data sets. What is a good empirical threshold for this ratio? It varies by scenario and is hard to pin down; for columns used in joins, we generally require it to be above 0.1, meaning on average about 10 records are scanned per lookup.

4. Index columns must not take part in computations; keep the column "clean". For example, from_unixtime(create_time) = '2014-05-29' cannot use the index. The reason is simple: the B+ tree stores the raw column values from the table, so to evaluate this condition every stored value would have to be passed through the function before comparison, which is obviously too costly. The condition should instead be written as create_time = unix_timestamp('2014-05-29').

5. Extend an existing index where possible instead of creating a new one. For example, if the table already has an index on a and you now need an index on (a,b), you only need to modify the existing index.
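Principles 1 and 2 can be expressed as a small rule in Python. This is a simplified model of the rule as stated in this article, not MySQL's actual optimizer logic; the function name and the way conditions are represented are made up for illustration.

```python
EQUALITY_OPS = {"=", "in"}


def usable_index_columns(index_columns, predicates):
    """Return the leading index columns a query can match.

    predicates maps column name -> operator, e.g. {"a": "=", "c": ">"}.
    Because it is a mapping, the order the conditions are written in is irrelevant.
    """
    used = []
    for column in index_columns:
        op = predicates.get(column)
        if op is None:
            break  # gap in the prefix: later columns cannot be matched
        used.append(column)
        if op not in EQUALITY_OPS:
            break  # a range condition is matched, then matching stops
    return used


query = {"a": "=", "b": "=", "c": ">", "d": "="}
print(usable_index_columns(["a", "b", "c", "d"], query))
print(usable_index_columns(["a", "b", "d", "c"], query))
print(usable_index_columns(["name", "age", "sex"], {"name": "=", "sex": "="}))
```

For the example query, the (a,b,c,d) index matches only a, b, and c, while (a,b,d,c) matches all four columns. The last call mirrors the (Zhang San, F) case: only name is usable because age is missing.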

Query optimization tool: the EXPLAIN command

Most readers will already be familiar with the EXPLAIN command; for detailed usage and the meaning of each field, see the EXPLAIN output format page in the official MySQL documentation. The point to emphasize here is that rows is the core indicator: in most cases a statement with a small rows value runs fast (there are exceptions). Optimizing a statement therefore mostly means reducing rows.

Basic steps for slow query optimization

0. Run the query first to see whether it is really slow; remember to set SQL_NO_CACHE.

1. Check each where condition against its single table and find the table that returns the fewest records. That is, apply the query's where conditions table by table, start from the table that returns the fewest records, and query each field of that table separately to see which field has the highest selectivity.

2. Use explain to view the execution plan and check whether it matches the expectation from step 1 (the query starts from the table with fewer matching records).

3. For SQL of the form order by ... limit, let the sorted table be looked up first.

4. Understand the usage scenarios of the business side.

5. When adding indexes, follow the principles for building indexes above.

6. Observe the results; if they do not match expectations, repeat the analysis from step 0.
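Step 1 asks which field has the highest selectivity. The ratio count(distinct col)/count(*) given earlier can be estimated on a sample of column values, as in the sketch below. The table rows are generated purely for illustration, and the column names id, status, and sex are hypothetical.

```python
def selectivity(values):
    """count(distinct col) / count(*) for one column's values."""
    if not values:
        raise ValueError("need at least one value")
    return len(set(values)) / len(values)


# Hypothetical sample of 1,000 rows: a unique id, a 3-valued status, a 2-valued sex.
rows = [
    {"id": i, "status": i % 3, "sex": "F" if i % 2 else "M"}
    for i in range(1, 1001)
]

for column in ("id", "status", "sex"):
    ratio = selectivity([row[column] for row in rows])
    print(f"{column}: {ratio:.3f}")
```

In this sample the unique id column has a selectivity of 1, while status and sex fall far below the 0.1 threshold mentioned above, so they would be poor leading index columns.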

Slow Query Case Optimization


This article is an English version of an article which is originally in the Chinese language on aliyun.com and is provided for information purposes only. This website makes no representation or warranty of any kind, either expressed or implied, as to the accuracy, completeness ownership or reliability of the article or any translations thereof. If you have any concerns or complaints relating to the article, please send an email, providing a detailed description of the concern or complaint, to info-contact@alibabacloud.com. A staff member will contact you within 5 working days. Once verified, infringing content will be removed immediately.
