MySQL indexing principle and slow query optimization

Source: Internet
Author: User
Tags: mysql

With its excellent performance, low cost, and rich ecosystem, MySQL has become the relational database of choice for most Internet companies. But as the saying goes, "a good horse deserves a good saddle": using it well is a required skill for development engineers, and job postings routinely ask for things like "proficient in MySQL", "SQL statement optimization", and "understanding of database internals". In a typical application the read-to-write ratio is roughly 10:1, and inserts and ordinary updates rarely cause performance problems; the most common and most troublesome issues come from complex queries, so optimizing query statements is clearly the top priority.

This article explains the principles of database indexing and how to optimize slow queries, from a development engineer's perspective.

Reflections prompted by a slow query

select
    count(*)
from
    task
where
    status = 2
    and operator_id = 20839
    and operate_time > 1371169729
    and operate_time < 1371174603
    and type = 2;

A user reported that one feature of the system was getting slower and slower, so an engineer tracked it down to the SQL above.

He came to me, full of enthusiasm: "This SQL needs optimizing. Add an index on every field for me."

Surprised, I asked, "Why does every field need an index?"

"Indexing the queried fields makes lookups faster," the engineer said confidently.

"A composite index would work here. Because of leftmost-prefix matching, operate_time should go last. We also need to look at the other related queries and make an overall assessment."

"Composite index? Leftmost-prefix matching? An overall assessment?" The engineer couldn't help sinking into deep thought.

In most cases we know that an index can improve query efficiency, but how should we build one? In what order should the columns go? Many people have only a vague idea. In fact these concepts are not hard to understand, and the principles behind indexing are far less complicated than you might imagine.
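To make the composite-index idea concrete, here is a minimal, hypothetical reconstruction of the task table using SQLite from the Python standard library. SQLite's planner is not InnoDB's, but it applies the same leftmost-prefix matching, and the index name idx_task is made up for illustration:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE task (id INTEGER PRIMARY KEY, status INTEGER,"
    " operator_id INTEGER, type INTEGER, operate_time INTEGER)"
)
# One composite index instead of five single-column ones: the equality
# columns come first, the range column (operate_time) goes last.
conn.execute(
    "CREATE INDEX idx_task ON task (status, operator_id, type, operate_time)"
)

plan = conn.execute(
    "EXPLAIN QUERY PLAN SELECT COUNT(*) FROM task"
    " WHERE status = 2 AND operator_id = 20839 AND type = 2"
    " AND operate_time > 1371169729 AND operate_time < 1371174603"
).fetchall()
print(plan[-1][-1])  # the plan should mention a search using idx_task
```

The point is that all three equality predicates plus the range on operate_time can be served by this one index, whereas a query filtering on operate_time alone could not use it, because status, the leftmost column, would be missing.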

MySQL indexing principle

Indexing purpose

The purpose of an index is to improve query efficiency. An analogy is a dictionary: to look up the word "mysql", we first locate the letter m, then find y below it, and then the remaining sql. Without an index, you might have to read through every word to find the one you want. And what if I want all the words beginning with m, or all the words beginning with ze? Don't you feel that without an index the task could never be finished?

Indexing principle

Besides dictionaries, there are many examples of indexes in everyday life, such as the train timetable at a station or the table of contents of a book. They all work on the same principle: keep narrowing the range of data to be examined until the desired result is reached, and turn a random lookup into a sequential one. In other words, we always use the same kind of search procedure to locate the data.

The same is true for a database, but it is obviously far more complicated, because it faces not only equality queries but also range queries (>, <, between, in), fuzzy queries (like), union conditions (or), and so on. How should a database handle all of these problems? Recall the dictionary example: can we divide the data into segments and then search segment by segment? At its simplest, with 1,000 records we could make records 1 to 100 the first segment, 101 to 200 the second, 201 to 300 the third, and so on; then to find record 250 we only need to look in the third segment, eliminating 90% of the irrelevant data in one step. But what if there are 10 million records? How should they be segmented? Anyone with a little algorithms background will think of a search tree, whose average lookup complexity is O(log n) and which offers good query performance. But this overlooks a key point: that complexity model assumes every operation costs the same. A real database is more complicated, because the data lives on disk and, to improve performance, portions of it must be read into memory for each computation. Since accessing disk costs roughly 100,000 times as much as accessing memory, a plain search tree cannot cope with such complex application scenarios.
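The jump from scanning everything to halving the search range each step can be made concrete with a small probe-count comparison (plain Python, not database code, using the 1,000-record example above):

```python
data = list(range(1, 1001))   # 1,000 sorted records, as in the example
target = 250

# Linear scan: examines records one by one until the target is found.
linear_probes = next(i + 1 for i, v in enumerate(data) if v == target)

# Binary search: halves the range each step, about log2(n) probes.
def binary_probes(seq, x):
    lo, hi, probes = 0, len(seq), 0
    while lo < hi:
        mid = (lo + hi) // 2
        probes += 1
        if seq[mid] < x:
            lo = mid + 1
        elif seq[mid] > x:
            hi = mid
        else:
            return probes
    return probes

print(linear_probes, binary_probes(data, target))  # 250 vs. about 10
```

The gap widens with scale: at 10 million records a linear scan averages millions of probes while binary search needs about 24. The catch, as the text notes, is that each probe may be a disk access, which is what motivates the structures below.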

Disk IO and pre-read

Since disk access was mentioned above, here is a brief introduction to disk IO and prefetching. A disk reads data through mechanical motion, and the time for each read breaks down into three parts: seek time, rotational delay, and transfer time. Seek time is the time for the arm to move to the specified track; on mainstream disks it is generally under 5 ms. Rotational delay corresponds to the spindle speed we often hear quoted: a 7,200 rpm disk spins 7,200 times per minute, i.e. 120 times per second, so the average rotational delay is 1/120/2 = 4.17 ms. Transfer time is the time to read the data off the platter or write it; it is typically a few tenths of a millisecond and negligible next to the other two. The time for one disk IO is therefore about 5 + 4.17 ≈ 9 ms. That may sound acceptable, but consider that a 500-MIPS machine executes 500 million instructions per second, because instructions rely on the speed of electricity: in the time of one IO it could execute roughly 4.5 million instructions. A database easily holds hundreds of thousands, millions, or even hundreds of millions of rows, so 9 ms per access is clearly a disaster. [Figure: comparison of computer hardware latencies]
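A quick back-of-the-envelope check of the figures above (the 5 ms seek time and the 500-MIPS machine are the article's assumed round numbers):

```python
# Average latency of one disk IO, from the assumed figures above.
seek_ms = 5.0                            # average seek time, mainstream disk
rpm = 7200
revs_per_sec = rpm / 60                  # 120 revolutions per second
rot_delay_ms = 1000 / revs_per_sec / 2   # average wait: half a revolution
io_ms = seek_ms + rot_delay_ms           # transfer time ignored as negligible

mips = 500_000_000                       # 500-MIPS machine: instructions/sec
instr_per_io = mips * io_ms / 1000       # instructions executable per IO

print(round(rot_delay_ms, 2), round(io_ms, 2), int(instr_per_io))
# about 4.17 ms rotation, about 9 ms per IO, millions of instructions per IO
```

So every avoided disk IO buys back several million instructions' worth of time, which is why minimizing IOs dominates index design.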

Because disk IO is such an expensive operation, the operating system makes an optimization: on each IO it reads not only the data at the requested disk address but also the adjacent data into the memory buffer, since the principle of locality tells us that when the computer accesses data at one address, nearby data will usually be accessed soon afterwards. The unit of data read per IO is called a page. The size of a page depends on the operating system, generally 4 KB or 8 KB, and reading one page of data costs only a single IO. This fact is very helpful for designing the data structure of an index.
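You can inspect the page size your own system uses; a minimal sketch with Python's mmap module, assuming a POSIX-like platform:

```python
import mmap

# The OS transfers and caches data in page-sized units; reading one page
# corresponds to the single IO described above.
page = mmap.PAGESIZE
print(page, "bytes per page")  # typically 4096 on Linux/x86
```

An index node sized to one page (or a small multiple of it) gets the whole node into memory with one IO, which is exactly the property the B+ tree exploits below.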

Data structure of the index

The everyday examples of indexes, the basic principle of indexing, the complexity of databases, and the operating-system background above are all meant to show that no data structure arises out of thin air; each has its own background and usage scenarios. Let us now summarize what we need this data structure to do. It is actually simple: keep the number of disk IOs per lookup down to a very small order of magnitude, ideally a constant. This suggests a multiway search tree whose height is tightly controlled. In this way, the B+ tree came into being.

The B+ tree in detail

The figure above shows a B+ tree (for the formal definition, see the literature on B+ trees). A few points: the light-blue blocks are what we call disk blocks, and each disk block contains several data items (dark blue) and pointers (yellow). For example, disk block 1 contains the data items 17 and 35 and the pointers P1, P2, and P3, where P1 points to the disk block holding values less than 17, P2 to the block holding values between 17 and 35, and P3 to the block holding values greater than 35. The real data exists only in the leaf nodes: 3, 5, 9, 10, 13, 15, 28, 29, 36, 60, 75, 79, 90, 99. Non-leaf nodes store no real data, only items that guide the search, such as 17 and 35, which do not actually exist in the table.

The B+ tree search process

As the figure shows, suppose we want to find data item 29. First, disk block 1 is loaded from disk into memory: the first IO. A binary search in memory determines that 29 lies between 17 and 35, selecting pointer P2 of disk block 1 (the in-memory time is negligible compared with a disk IO). Disk block 3 is then loaded via the disk address in P2: the second IO. Since 29 lies between 26 and 30, pointer P2 of disk block 3 is selected, and disk block 8 is loaded through it: the third IO. A binary search in memory then finds 29, ending the query after a total of three IOs. The real situation is that a three-level B+ tree can represent millions of rows. If finding any one of millions of rows costs only three IOs, the performance gain is enormous; without an index, each row might cost one IO, for millions of IOs in total, which is clearly prohibitively expensive.
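The walkthrough can be simulated with a toy model of the tree in the figure, where loading one block stands in for one disk IO. The block numbers on the search path (1, 3, 8) follow the walkthrough; the keys and leaf values not shown in the figure, such as those of blocks 2 and 4, are invented for illustration:

```python
from bisect import bisect_right

# Internal blocks are (keys, child-block-ids) tuples; plain lists are leaves.
BLOCKS = {
    1: ([17, 35], [2, 3, 4]),     # root: P1 < 17, 17 <= P2 < 35, P3 >= 35
    2: ([9, 13], [5, 6, 7]),      # keys here are invented
    3: ([26, 30], [9, 8, 10]),    # P2 leads to block 8, as in the walkthrough
    4: ([75, 90], [11, 12, 13]),  # keys here are invented
    5: [3, 5], 6: [9, 10], 7: [13, 15],
    9: [20, 25], 8: [28, 29], 10: [31, 33],   # 20, 25, 31, 33 invented
    11: [36, 60], 12: [75, 79], 13: [90, 99],
}

def search(key):
    ios, block_id = 0, 1              # every search starts at the root
    while True:
        ios += 1                      # loading a block = one disk IO
        block = BLOCKS[block_id]
        if isinstance(block, list):   # leaf: the real data lives here
            return key in block, ios
        keys, children = block        # in-memory binary search on the keys
        block_id = children[bisect_right(keys, key)]

print(search(29))  # (True, 3): found after exactly three simulated IOs
```

Any key, present or not, is resolved in exactly three block loads here, because the cost depends only on the tree's height, not on which leaf is hit.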
