With its excellent performance, low cost, and rich ecosystem, MySQL has become the relational database of choice for most Internet companies. But good performance out of the box is only half the story: as the saying goes, a good horse deserves a good saddle, and using MySQL well has become a required skill for development engineers. Job postings routinely ask for things like "proficient in MySQL", "SQL statement optimization", and "understands database internals". In a typical application system the read-to-write ratio is roughly 10:1, and inserts and ordinary updates rarely cause performance problems. What we run into most often, and what is most prone to trouble, are complex queries, so optimizing query statements is clearly the top priority.
This article explains the principles of database indexing and how to optimize slow queries, from the perspective of a development engineer.
Thoughts triggered by a slow query
select
    count(*)
from
    task
where
    status = 2
    and operator_id = 20839
    and operate_time > 1371169729
    and operate_time < 1371174603
    and type = 2;
A user reported that a feature was getting slower and slower, so an engineer dug up the SQL above.
He came to me excitedly: "This SQL needs to be optimized. Give me an index on every field."
Surprised, I asked, "Why does every field need an index?"
"Indexing the fields in the query makes it faster," the engineer replied, full of confidence.
"A composite index would work perfectly well here. Because of leftmost prefix matching, operate_time needs to go at the end, and you also need to gather all the other queries against this table for a comprehensive evaluation."
"Composite index? Leftmost prefix matching? Comprehensive evaluation?" The engineer sank into contemplation.
In most cases we know that indexes can improve query efficiency, but which indexes should we build, and in what column order? Many people have only a vague idea. In fact, these concepts are not hard to understand, and the principles behind indexing are far less complicated than you might imagine.
MySQL indexing principles
The purpose of an index
The purpose of an index is to improve query efficiency. Think of a dictionary: to look up the word "mysql", we first locate the letter m, then find y, and then the remaining sql. Without an index, we might have to scan every word to find the one we want. And what if I want every word starting with m, or every word starting with ze? Without an index, the task becomes practically impossible.
The principle of indexing
Beyond dictionaries, there are many examples of indexes in everyday life, such as train timetables and book tables of contents. They all work the same way: by continually shrinking the range of data to be examined until the desired result is filtered out, and by turning random events into sequential ones; that is, we always use the same search method to lock onto data.
Databases work the same way, but they are obviously far more complex, because they face not only equality queries but also range queries (>, <, between, in), fuzzy queries (like), union queries (or), and more. How should a database choose a scheme that copes with all of these problems? Recall the dictionary example: can we split the data into segments and then search segment by segment? In the simplest scheme, with 1,000 rows, rows 1 to 100 form the first segment, 101 to 200 the second, 201 to 300 the third, and so on; to find row 250 we only need the third segment, eliminating 90% of the irrelevant data in one step. But what about 10 million records; how should those be segmented? Students with a little algorithms background will think of a search tree, whose average lookup complexity is O(log n) and whose query performance is good. But here we have overlooked a key point: that complexity model assumes each comparison has the same cost. A database's implementation is more complicated: the data lives on disk, and to improve performance a portion of it is read into memory for each computation. Since accessing disk costs roughly 100,000 times as much as accessing memory, a plain search tree cannot cope with these complex application scenarios.
Disk IO and read-ahead
Having mentioned disk access, here is a brief introduction to disk IO and read-ahead. A disk reads data through mechanical movement, and the time for each read breaks down into three parts: seek time, rotational delay, and transfer time. Seek time is the time for the arm to move to the specified track, generally under 5 ms on mainstream disks. Rotational delay comes from the spindle speed we often hear quoted: a 7,200 rpm disk spins 7,200 times per minute, i.e. 120 times per second, so the average rotational delay is 1/120/2 = 4.17 ms. Transfer time, the time to read data from or write data to the platter, is typically a fraction of a millisecond and negligible next to the other two. So one disk access, i.e. one disk IO, takes roughly 5 + 4.17 = 9 ms. That may sound acceptable, but remember that a 500-MIPS machine executes 500 million instructions per second; instructions run at the speed of electricity, so in the time of a single IO the machine could execute about 4.5 million instructions. For a database holding hundreds of thousands, millions, or even tens of millions of rows, 9 ms per access would clearly be a disaster. (The original article includes a comparison chart of computer hardware latencies here for reference.)
Considering that disk IO is such an expensive operation, the operating system makes an optimization: on each IO it reads not only the data at the current disk address but also the adjacent data into the memory buffer, because the principle of locality tells us that when the computer accesses data at one address, the neighboring data will usually be accessed soon as well. The data read by each IO is called a page. How much data a page holds depends on the operating system, generally 4 KB or 8 KB, and reading any data within one page costs only a single IO. This theory is very helpful for designing the data structure of an index.
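As a small aside (my addition, not the original author's): MySQL's InnoDB storage engine manages pages of its own on top of the operating system, 16 KB by default, and you can check the value with:

show variables like 'innodb_page_size';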
The data structure of an index
The preceding discussion of everyday indexes, the basic principle of indexing, the complexity of databases, and the relevant operating-system knowledge all serve one purpose: to make clear that no data structure is created in a vacuum; each has its background and its use cases. Let us now summarize what we need this data structure to do. It is actually very simple: keep the number of disk IOs per lookup at a very small order of magnitude, ideally a constant. Then we might ask whether a height-controlled multiway search tree could meet the need. In just this way, the B+ tree was born.
The B+ tree in detail
The figure (shown in the original article) is an example of a B+ tree; for the formal definition, see any B+ tree reference. Here we highlight only a few points. The light blue blocks are what we call disk blocks. Each disk block contains several data items (dark blue) and pointers (yellow). For example, disk block 1 contains data items 17 and 35 and pointers P1, P2, P3: P1 points to the disk block holding values less than 17, P2 to the block holding values between 17 and 35, and P3 to the block holding values greater than 35. The real data exists only in the leaf nodes: 3, 5, 9, 10, 13, 15, 28, 29, 36, 60, 75, 79, 90, 99. The non-leaf nodes store no real data, only the data items that guide the search, such as 17 and 35, which need not even exist in the data table.
The lookup process of a B+ tree
Suppose, in the figure, we want to find data item 29. First disk block 1 is loaded from disk into memory, costing one IO. A binary search in memory determines that 29 lies between 17 and 35, locking onto pointer P2 of disk block 1; the in-memory time is negligible compared with the disk IO. Disk block 3 is then loaded through P2's disk address, the second IO; 29 lies between 26 and 30, locking onto P2 of disk block 3, through which disk block 8 is loaded into memory, the third IO. A binary search in memory then finds 29 and the query ends, for a total of three IOs. The real situation is that a 3-level B+ tree can represent millions of rows. If finding one row among millions costs only three IOs, the performance gain is enormous; without the index, each data item could cost an IO to examine, for millions of IOs in total, which is obviously prohibitively expensive.
Properties of B+ trees
1. Through the above analysis, we know that the number of IOs depends on the height h of the B+ tree. Suppose the table holds N rows and each disk block holds m data items; then h = log_(m+1) N. With N fixed, the larger m is, the smaller h is; and m = (size of a disk block) / (size of a data item). The size of a disk block, i.e. of one data page, is fixed, so the less space each data item occupies, the more items fit per block and the shorter the tree. This is why each data item, i.e. the indexed field, should be as small as possible; an int takes 4 bytes, half of a bigint's 8. It is also why the B+ tree places the real data in the leaf nodes rather than in the interior nodes: real data in an interior node would drastically cut the number of data items per disk block and make the tree taller. When each node degenerates to a single data item, the tree degenerates into a linked list.
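A rough worked example of this formula (the page and item sizes here are my assumptions for illustration, not figures from the original): with a 16 KB page and roughly 16 bytes per data item plus pointer, m is about 1,000, so a tree of height 3 can address on the order of 1000^3 = one billion rows. This is why three levels, i.e. three IOs per lookup, comfortably cover millions of records.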
2. When a data item of the B+ tree is a compound data structure, such as (name, age, sex), the B+ tree builds its search order from left to right. When a row like (Zhang San, 20, F) is retrieved, the B+ tree first compares name to decide the search direction; if name is equal, it compares age, then sex, finally obtaining the target data. But for a condition like (20, F) with no name, the B+ tree does not know which node to examine next, because name was the first comparison factor when the tree was built; you must search by name first to know where to go next. For a condition like (Zhang San, F), the B+ tree can use name to fix the search direction, but the next field, age, is missing, so it can only find all rows whose name equals Zhang San and then filter them for sex = F. This is a very important property: the leftmost matching property of indexes.
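Here is a minimal SQL sketch of the leftmost matching property; the table and index names are hypothetical, for illustration only:

create table person_demo (
    id   int primary key,
    name varchar(50),
    age  int,
    sex  char(1),
    key idx_name_age_sex (name, age, sex)  -- compound data item, compared left to right
);

-- Walks the tree normally: name first, then age, then sex
select * from person_demo where name = 'Zhang San' and age = 20 and sex = 'F';

-- Cannot use the index: with no name, the tree has no starting direction
select * from person_demo where age = 20 and sex = 'F';

-- Uses the index for name only, then filters the matched rows on sex
select * from person_demo where name = 'Zhang San' and sex = 'F';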
Slow query optimization
The principles of MySQL indexing are relatively dry material; we only need an intuitive feel for them, not an exhaustive, deep understanding. Now look back at the slow query at the beginning of the article: having understood indexing principles, do any ideas come to mind? Let's first summarize some basic principles of building indexes.
Several principles for building indexes
1. The leftmost prefix matching principle, a very important one: MySQL keeps matching to the right until it encounters a range condition (>, <, between, like) and then stops. For a = 1 and b = 2 and c > 3 and d = 4, an index built in the order (a, b, c, d) cannot use d; one built in the order (a, b, d, c) can use all four columns, and a, b, d may appear in the where clause in any order. (A sketch of principles 1-4 follows this list.)
2. Equality (=) and in conditions may be written in any order. For example, a = 1 and b = 2 and c = 3 can use an (a, b, c) index however the three conditions are ordered; MySQL's query optimizer rewrites the query into the form the index can recognize.
3. Try to choose highly selective columns as the index. The selectivity formula is count(distinct col)/count(*), the proportion of distinct values in the field: the larger it is, the fewer records we scan per lookup. A unique key has selectivity 1, while status or gender fields may have selectivity approaching 0 on large data sets. One may ask what empirical value this ratio should reach. It varies with the scenario and is hard to pin down; for fields we join on, we generally require it to be above 0.1, i.e. on average 10 records scanned per match.
4. Index columns must not participate in computation; keep the column "clean". For instance, from_unixtime(create_time) = '2014-05-29' cannot use the index. The reason is simple: the B+ tree stores the raw field values from the data table, so the comparison would require applying the function to every element, which is obviously far too costly. The statement should instead be written as create_time = unix_timestamp('2014-05-29').
5. Extend an existing index whenever possible instead of creating a new one. For example, if the table already has an index on a and you now want an index on (a, b), you only need to modify the original index.
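A minimal sketch of principles 1 through 4 on a hypothetical table t (the table, columns, and index name are mine, for illustration):

-- Principle 1: c is used as a range condition, so it goes last and all four columns can help
alter table t add index idx_a_b_d_c (a, b, d, c);

-- Principles 1 and 2: all four columns hit the index; a, b, d may be written in any order
select * from t where b = 2 and d = 4 and a = 1 and c > 3;

-- Principle 3: measure a column's selectivity before indexing it
select count(distinct a) / count(*) from t;

-- Principle 4: compute on the constant, never on the indexed column
select * from t where create_time = unix_timestamp('2014-05-29');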
Back to the slow query from the beginning
By the leftmost matching principle, the index for the SQL statement at the top should be a composite index on status, operator_id, type, and operate_time, where status, operator_id, and type may be ordered among themselves freely. But as I said, that is not the whole story: you should gather every related query against this table and analyze them comprehensively;
for example, suppose there are also the following queries:
select * from task where status = 0 and type = 10 limit 10;
select count(*) from task where status = 0;
Then building the index as (status, type, operator_id, operate_time) is exactly right, because it covers all of the cases. This is making use of the leftmost matching principle of indexes.
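A minimal sketch of the corresponding DDL (the index name is mine):

alter table task add index idx_status_type_operator_time (status, type, operator_id, operate_time);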
Query optimization artifact: the explain command
Everyone is presumably familiar with the explain command; for detailed usage and column meanings, refer to the official explain-output documentation. What must be emphasized here is that rows is the core metric: most statements with a small rows value execute fast (with exceptions, described below). So optimization work is largely about reducing rows.
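As a minimal sketch, checking the plan of the opening query looks like this (the exact output columns vary by MySQL version; \G merely prints the result vertically):

explain select count(*) from task
where status = 2 and operator_id = 20839
  and operate_time > 1371169729 and operate_time < 1371174603
  and type = 2\G
-- Read the key, rows, and Extra columns: rows estimates how many records must be examined.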
Basic steps of slow query optimization
0. Run it first to see whether it is really slow; remember to set SQL_NO_CACHE (a sketch follows this list).
1. Check the where conditions table by table and lock onto the table that returns the fewest records. This means: apply the query's where clause to each single table, starting from the table with the smallest returned record count, and query each field of that table individually to see which field has the highest selectivity.
2. Run explain to check that the execution plan is consistent with the expectation from step 1 (i.e. the query starts from the table with the fewest locked records).
3. For order by ... limit style SQL statements, let the sorted table be joined first.
4. Understand the business side's usage scenario.
5. When adding indexes, follow the indexing principles above.
6. Observe the result; if it does not meet expectations, go back to step 0 and analyze again.
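A minimal sketch of step 0, reusing the task table from the opening example (note that the query cache, and with it this hint, was removed in MySQL 8.0, so SQL_NO_CACHE matters only on older versions):

select sql_no_cache count(*) from task where status = 2 and operator_id = 20839;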
Several slow query cases
The following examples explain in detail how to analyze and optimize slow queries.
Writing complex statements
In many cases we write SQL merely to implement a feature. That is only the first step: different ways of writing a statement often differ fundamentally in efficiency, and recognizing this requires a very clear grasp of MySQL's execution plans and indexing principles. Consider the following statement:
select
    distinct cert.emp_id
from
    cm_log cl
inner join
    (
        select
            emp.id as emp_id,
            emp_cert.id as cert_id
        from
            employee emp
        left join
            emp_certificate emp_cert
                on emp.id = emp_cert.emp_id
        where
            emp.is_deleted = 0
    ) cert
        on (
            cl.ref_table = 'Employee'
            and cl.ref_oid = cert.emp_id
        )
        or (
            cl.ref_table = 'EmpCertificate'
            and cl.ref_oid = cert.cert_id
        )
where
    cl.last_upd_date >= '2013-11-07 15:03:00'
    and cl.last_upd_date <= '2013-11-08 16:00:00';
0. Run it first: 53 records in 1.87 seconds, and it uses no aggregation, yet it is this slow.
53 rows in set (1.87 sec)
1. explain
Briefly describing the execution plan: MySQL first scans 379 records from the cm_log table using the idx_last_upd_date index, then scans a derived table of 63,727 records. "derived" marks a constructed table, one that does not physically exist; think of it simply as the result set formed by a statement, with the trailing number giving that statement's id, so derived2 means the query with id = 2 constructs a virtual table returning 63,727 records. What does the id = 2 statement do to return so much data? First it full-table-scans all 13,317 records of the employee table, then joins the emp_certificate table through the index emp_certificate_empid; rows = 1 means each join locks only a single record, which is reasonably efficient. Once built, the result is joined against cm_log's 379 records according to the on rules. From this execution process we can see that far too much data is returned: most of it is never used, since cm_log locks only 379 records.
How to optimize it? Can we join cm_log first instead of joining it after the derived table is built? A careful look at the statement reveals the basic idea: when cm_log's ref_table is 'EmpCertificate' it joins the emp_certificate table, and when ref_table is 'Employee' it joins the employee table. So we can split the statement into two parts and connect them with union. Note that union is used rather than union all because the original statement has distinct to deduplicate records, and union happens to provide exactly that. If the original statement had no distinct, we could use union all directly, since union forces a deduplication step that hurts SQL performance.
The optimized statement is as follows:
select
    emp.id
from
    cm_log cl
inner join
    employee emp
        on cl.ref_table = 'Employee'
        and cl.ref_oid = emp.id
where
    cl.last_upd_date >= '2013-11-07 15:03:00'
    and cl.last_upd_date <= '2013-11-08 16:00:00'
    and emp.is_deleted = 0
union
select
    emp.id
from
    cm_log cl
inner join
    emp_certificate ec
        on cl.ref_table = 'EmpCertificate'
        and cl.ref_oid = ec.id
inner join
    employee emp
        on emp.id = ec.emp_id
where
    cl.last_upd_date >= '2013-11-07 15:03:00'
    and cl.last_upd_date <= '2013-11-08 16:00:00'
    and emp.is_deleted = 0;
4. There is no need to understand the business usage scenario here; the transformed statement only needs to return results identical to the statement before the transformation.
5. The existing indexes suffice; no new index needs to be built.
6. Experimenting with the rewritten statement: it needs only 10 ms, a nearly 200-fold improvement!
Understand the application scenario
The purpose of this example is to overturn our intuition about column selectivity. Ordinarily we assume that the more selective a column is, the more easily it locks onto a small number of records; but in some special scenarios that theory has its limits.
select
    *
from
    stage_poi sp
where
    sp.accurate_result = 1
    and (
        sp.sync_status = 0
        or sp.sync_status = 2
        or sp.sync_status = 4
    );
0. First see how long it runs: 951 records in 6.22 seconds, genuinely slow.
951 rows in set (6.22 sec)
1. explain first: rows reaches 3.61 million and type = ALL, indicating a full table scan.
2. Apply each where field individually and count the returned records; since this is a single-table query, step 0 already did so: 951 records.
3. Get explain's rows as close to 951 as possible.
First look at how many records have accurate_result = 1 (a sketch of these checks follows below).
We find that the accurate_result field has very low selectivity: the whole table holds only the three values -1, 0, and 1, so an index on it cannot lock onto a small enough amount of data.
Then look at the sync_status field.
Its selectivity is just as low; by the theory, it too is unsuitable for indexing.
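For reference, the two observations above come from checks of roughly this shape (the original article shows the actual result sets as screenshots):

select accurate_result, count(*) from stage_poi group by accurate_result;
select sync_status, count(*) from stage_poi group by sync_status;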
Analyzed this far, the problem seems to lead to the conclusion that this table cannot be optimized: both columns have very low selectivity, and even if indexed they would only suit this particular data distribution, making a universally applicable optimization hard. If, say, the sync_status values 0 and 3 were distributed evenly, the locked records would still number in the millions.
4. Talk to the business side to learn the usage scenario. It turns out the business uses this SQL as follows: every five minutes it scans for matching records and, after processing, sets sync_status to 1; the number of matching records per five minutes is not large, about 1,000. Once the business usage is understood, the SQL becomes easy to optimize, because the business guarantees the data is lopsided: with an index added, most of the unneeded data can be filtered out.
5. Following the indexing principles, build the index with the following statement:
alter table stage_poi add index idx_acc_status (accurate_result, sync_status);
6. Observe the result: as expected, only 200 ms is needed, more than 30 times faster.
952 rows in set (0.20 sec)
Let's review the analysis process. Single-table queries are comparatively easy to optimize; most of the time it is enough to index the where-clause fields according to the principles above. But if we optimized that "mindlessly", some very low-selectivity columns, columns that should not be indexed, would get indexed too, seriously hurting insert and update performance and possibly affecting other query statements as well. So step 4, understanding the SQL's usage scenario, is critical: only by knowing the business scenario can we analyze and optimize query statements well.
Statements that cannot be optimized
select
    c.id,
    c.name,
    c.position,
    c.sex,
    c.phone,
    c.office_phone,
    c.feature_info,
    c.birthday,
    c.creator_id,
    c.is_keyperson,
    c.giveup_reason,
    c.status,
    c.data_source,
    from_unixtime(c.created_time) as created_time,
    from_unixtime(c.last_modified) as last_modified,
    c.last_modified_user_id
from
    contact c
inner join
    contact_branch cb
        on c.id = cb.contact_id
inner join
    branch_user bu
        on cb.branch_id = bu.branch_id
        and bu.status in (1, 2)
inner join
    org_emp_info oei
        on oei.data_id = bu.user_id
        and oei.node_left >= 2875
        and oei.node_right <= 10802
        and oei.org_category = -1
order by
    c.created_time desc
limit 0, 10;
Still the same steps:
0. See how long the statement runs: 10 records in 13 seconds, already intolerable.
10 rows in set (13.06 sec)
1. explain
From the execution plan, MySQL first scans 8,849 records of the org_emp_info table, then joins the branch_user table through the index idx_userid_status, then the contact_branch table through the index idx_branch_id, and finally the contact table through its primary key.
rows returned is tiny and nothing looks abnormal. Looking back at the statement, we notice an order by + limit combination at the end: could the amount of data being sorted be too large? So we simplify the SQL, removing the trailing order by and limit, to see how many records are actually used for the sort.
select
    count(*)
from
    contact c
inner join
    contact_branch cb
        on c.id = cb.contact_id
inner join
    branch_user bu
        on cb.branch_id = bu.branch_id
        and bu.status in (1, 2)
inner join
    org_emp_info oei
        on oei.data_id = bu.user_id
        and oei.node_left >= 2875
        and oei.node_right <= 10802
        and oei.org_category = -1;
+----------+
| count(*) |
+----------+
|   778878 |
+----------+
1 row in set (5.19 sec)
It turns out 778,878 records were locked before the sort. Sorting a result set of some 700,000 rows would be disastrous; no wonder it is so slow. So let's change our thinking: sort by contact's created_time first, and then join; would that be faster?
Hence the statement is rewritten as follows (straight_join could also be used to achieve the same optimization):
select
    c.id,
    c.name,
    c.position,
    c.sex,
    c.phone,
    c.office_phone,
    c.feature_info,
    c.birthday,
    c.creator_id,
    c.is_keyperson,
    c.giveup_reason,
    c.status,
    c.data_source,
    from_unixtime(c.created_time) as created_time,
    from_unixtime(c.last_modified) as last_modified,
    c.last_modified_user_id
from
    contact c
where
    exists (
        select 1
        from
            contact_branch cb
        inner join
            branch_user bu
                on cb.branch_id = bu.branch_id
                and bu.status in (1, 2)
        inner join
            org_emp_info oei
                on oei.data_id = bu.user_id
                and oei.node_left >= 2875
                and oei.node_right <= 10802
                and oei.org_category = -1
        where
            c.id = cb.contact_id
    )
order by
    c.created_time desc
limit 0, 10;
Verifying the effect: it finishes within 1 ms, an improvement of more than 13,000 times!
10 rows in set (0.00 sec)
I thought the job was done at this point, but we glossed over a detail in the earlier analysis: sorting first and then joining should, in theory, cost the same as joining first and then sorting. Why the huge improvement? Because of the limit! The rough execution flow is: MySQL gets the first 10 records in index-sorted order, then performs the join to filter them; whenever fewer than 10 survive, it fetches 10 more and joins again. Clearly, when the inner join filters out a great deal of data, this becomes disastrous; in the extreme case, when the inner layer matches nothing at all, MySQL still doggedly fetches 10 at a time and ends up traversing nearly the entire data table!
Experimenting with the same SQL under different parameters:
select
    sql_no_cache c.id,
    c.name,
    c.position,
    c.sex,
    c.phone,
    c.office_phone,
    c.feature_info,
    c.birthday,
    c.creator_id,
    c.is_keyperson,
    c.giveup_reason,
    c.status,
    c.data_source,
    from_unixtime(c.created_time) as created_time,
    from_unixtime(c.last_modified) as last_modified,
    c.last_modified_user_id
from
    contact c
where
    exists (
        select 1
        from
            contact_branch cb
        inner join
            branch_user bu
                on cb.branch_id = bu.branch_id
                and bu.status in (1, 2)
        inner join
            org_emp_info oei
                on oei.data_id = bu.user_id
                and oei.node_left >= 2875
                and oei.node_right <= 2875
                and oei.org_category = -1
        where
            c.id = cb.contact_id
    )
order by
    c.created_time desc
limit 0, 10;
Empty set (2 min 18.99 sec)
2 min 18.99 sec! Far worse than before. Given MySQL's nested-loop mechanism, this situation is basically impossible to optimize; ultimately, this statement can only be handed back to the application system to optimize its own logic.
From this example we can see that not every statement can be optimized. Moreover, when we optimize and the SQL use-case regression misses some extreme case, the consequences can be worse than the original. So, first: do not expect every statement to be fixable through SQL optimization. Second: do not be overconfident, optimizing only for a specific case while ignoring more complex situations.
The slow-query case analysis ends here; these are only a few typical cases. In the course of optimization we have met "garbage SQL" of more than 1,000 lines joining 16 tables; online/offline database differences that let slow queries drag an application straight down; varchar equality comparisons written without single quotes; and Cartesian-product queries that killed a replica outright. Many cases are really just accumulated experience: once you are familiar with the internals of the query optimizer and of indexes, analyzing these cases becomes remarkably simple.
Afterword
This article used one slow-query case to introduce MySQL indexing principles, presented some methodology for optimizing slow queries, and analyzed in detail the typical cases encountered. In truth, after this long a time optimizing statements, I find that no database-level optimization can match optimizing the application system itself. It is the same MySQL: it can power applications at Google/Facebook/Taobao scale, yet may not even hold up your personal site. To borrow a recently popular phrase: "Querying is easy, optimizing is not; write and cherish!"