MySQL index principles and slow query optimization

MySQL has become the preferred relational database for most Internet companies thanks to its outstanding performance, low cost, and rich ecosystem. But as the saying goes, a good horse deserves a good saddle: using MySQL well has become a required course for development engineers, and job descriptions routinely list requirements such as "proficient in MySQL", "SQL statement optimization", and "understanding of database principles". A typical application system has a read/write ratio of around 10:1, and inserts and ordinary updates rarely cause performance problems. The most common problems, and the most dangerous ones, come from complex query operations, so optimizing query statements is clearly the top priority.

Since July 2013, I have been optimizing slow queries in Meituan's core business systems department, covering more than ten systems and solving and accumulating hundreds of slow-query cases. As business complexity grows, the problems encountered can be strange and almost unbelievable. This article aims to explain the principles of database indexing and how to optimize slow queries from a developer's perspective.

1. thinking about a slow query
select count(*) from task
where status = 2
  and operator_id = 20839
  and operate_time > 1371169729
  and operate_time < 1371174603
  and type = 2;

A system user reported that a feature was getting slower and slower, so an engineer dug up the SQL above.

He came to me excitedly: "This SQL needs to be optimized. Add an index to every field for me."

Surprised, I asked: "Why does every field need an index?"

"Adding all the queried fields to the index will make the query faster." The engineer is full of confidence.

"In this case, you can create a joint index. because it is the leftmost prefix match, operate_time needs to be placed at the end, and other related queries need to be used. a comprehensive evaluation is required."

"Joint Index? What is the leftmost prefix? Comprehensive Evaluation ?" Engineers can't help thinking.

In most cases we know that an index can improve query efficiency, but how should indexes be built, and in what column order? Many people have only a vague idea. In fact, these concepts are not hard to understand, and indexing principles are far less complex than imagined.

2. MySQL index principles

2.1 The purpose of indexing

The purpose of an index is to improve query efficiency. An analogy is a dictionary: to look up the word "mysql", we locate the letter m, then find the letter y within the m section, and continue until we find the rest of the word. Without an index, we might need to read every word to find the one we want. And what if we want all words starting with m? Or all words starting with ze? Without an index, such tasks are effectively impossible.

2.2 Indexing principles

Besides dictionaries, examples of indexes are everywhere in daily life: train timetables at stations, tables of contents in books, and so on. They all work the same way: by continually narrowing the range of data to examine, we filter down to the final result, turning a random lookup into an ordered one; that is, we always lock onto the data through the same search method.

Databases work the same way, but they are much more complex, because they face not only equality queries but also range queries (>, <, between, in), fuzzy queries (like), and union conditions (or). How should a database handle all these problems? Think back to the dictionary example: can we divide the data into segments and then search by segment? In the simplest case, records 1 to 100 form the first segment, 101 to 200 the second, 201 to 300 the third, and so on; to find record 250, we only need to look in the third segment, eliminating 90% of the irrelevant data in one step. But what about 10 million records; into how many segments should we divide them? A search tree has an average complexity of lg N and good query performance. However, this ignores a key issue: the complexity model assumes every comparison has the same cost. A database's implementation is more complicated; the data lives on disk, and to improve performance part of the data must be read into memory for computation. Since one disk access costs roughly 100,000 times one memory access, a simple search tree cannot handle these complex application scenarios.

2.2.1 Disk I/O and pre-reading

Since disk access was mentioned above, here is a brief introduction to disk I/O and pre-reading. Reading data from disk depends on mechanical motion, and the time spent can be divided into three parts: seek time, rotational delay, and transfer time. Seek time is the time to move the arm to the specified track, generally under 5 ms on mainstream disks. Rotational delay is determined by the spindle speed we often hear about: a 7200 rpm disk rotates 7200 times per minute, or 120 times per second, giving an average rotational delay of 1/120/2 = 4.17 ms. Transfer time is the time to read data from or write data to the disk, generally a fraction of a millisecond, negligible compared with the other two. One disk access, that is, one disk I/O, therefore takes about 5 + 4.17 = 9 ms. That may sound fine, but a 500-MIPS machine executes 500 million instructions per second, because instructions rely on electricity rather than mechanics; in the time of one I/O, the machine could execute roughly 4.5 million instructions. Databases routinely hold hundreds of thousands, millions, or even tens of millions of records, so 9 ms per access would clearly be a disaster. (The original article includes here a chart comparing computer hardware latencies, for reference.)

Because disk I/O is such an expensive operation, the operating system makes some optimizations: on each I/O it reads not only the data at the current disk address but also adjacent data into the memory buffer, since the principle of locality tells us that when the computer accesses data at one address, nearby data will usually be accessed soon. The data read by one I/O is called a page. The exact size of a page depends on the operating system, generally 4 KB or 8 KB, and reading all the data within one page costs only a single I/O. This theory is very helpful for designing the index data structure.
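
Relatedly, the storage engine also reads in fixed-size pages of its own. As a quick, hedged aside (this system variable exists in standard MySQL, where InnoDB's page size defaults to 16 KB), you can check it on a running server:

-- Inspect InnoDB's page size (16 KB by default):
show variables like 'innodb_page_size';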

2.2.2 index data structure

We have covered indexing examples from daily life, the basic principle of indexes, the complexity of databases, and the relevant operating-system knowledge. The purpose is to make clear that no data structure appears out of thin air; each has its background and application scenario. Let's summarize what we need this data structure to do. It is actually simple: keep the number of disk I/Os per lookup at a very small order of magnitude, ideally constant. We might then ask whether a multi-way search tree with a tightly controlled height could meet the need. And so the B+ tree came into being.

2.2.2.1 The B+ tree in detail

To understand the B+ tree, start with the binary search tree.

The binary search tree is an efficient data structure with three characteristics:

(1) each node has at most two subtrees;

(2) every value in the left subtree is smaller than the parent node's value, and every value in the right subtree is larger;

(3) finding a target value among n nodes generally requires only log(n) comparisons.

The binary search tree structure is unsuitable for databases because its search efficiency depends on its depth: the deeper the data sits, the more comparisons are needed, and in the extreme case finding a target among n items takes n comparisons. For a database, descending one level means another read from disk. That is fatal, because reading from disk takes far longer than processing data in memory, and the fewer disk reads a database performs, the better.

The B-tree improves on the binary search tree. Its design philosophy is to pack related data together so that many values are read at a time, reducing the number of disk operations.

The figure in the original article shows an example B+ tree (for the formal definition, see any B+ tree reference; only the key points are covered here). Each light-blue block is a disk block, and each disk block contains several data items (dark blue) and pointers (yellow). For example, disk block 1 contains data items 17 and 35 and pointers P1, P2, and P3: P1 points to the disk block holding values less than 17, P2 to the block holding values between 17 and 35, and P3 to the block holding values greater than 35. The real data lives in the leaf nodes: 3, 5, 9, 10, 13, 15, 28, 29, 36, 60, 75, 79, 90, and 99. Non-leaf nodes do not store real data; they store only data items that direct the search. Data items 17 and 35, for instance, do not actually exist in the data table.

The B+ tree has three characteristics:

(1) a node can hold multiple values; in the figure, for example, each node holds two.

(2) a new level is added only when the existing nodes are full; that is, the tree keeps its number of levels as small as possible.

(3) the values in a child node have a strict ordering relationship with the values in the parent node. In general, a parent node with k values has k + 1 children. For example, a parent with two values (17 and 35) has three children: the first holds values smaller than 17, the last holds values greater than 35, and the middle one holds values between 17 and 35.

2.2.2.2 The B+ tree search process

Suppose we want to find data item 29. Disk block 1 is first loaded from disk into memory; this is the first I/O. A binary search in memory determines that 29 lies between 17 and 35, locking disk block 1's P2 pointer. The in-memory time is negligible compared with a disk I/O. Disk block 3 is then loaded through P2's address (the second I/O); 29 lies between 26 and 30, locking disk block 3's P2 pointer, through which disk block 8 is loaded (the third I/O). A binary search in memory then finds 29 and the query ends, with three I/Os in total. The reality is that a 3-level B+ tree can represent millions of records (a binary search tree would need about 20 levels!). If finding any of millions of records costs only three I/Os, the performance gain is enormous; without an index, each data item would require its own I/O, and a full scan would take millions of I/Os, which is obviously far too expensive.

2.2.2.3 Properties of the B+ tree

(1) From the analysis above, we know that the number of I/Os depends on the height h of the B+ tree. Suppose the table currently holds N records and each disk block holds m data items; then h = log_(m+1) N, the logarithm of N to base m + 1. For a fixed N, the larger m is, the smaller h is; and m = (size of a disk block) / (size of a data item). The size of a disk block, i.e. of one data page, is fixed, so the less space each data item occupies, the more items fit per block and the shorter the tree. This is why each data item, that is, the indexed field, should be as small as possible: an int occupies 4 bytes, half of a bigint's 8. It is also why the B+ tree places the real data in the leaf nodes rather than the interior nodes: data in interior nodes would sharply reduce the number of data items per disk block and make the tree taller. If each block held only one data item, the tree would degenerate into a linked list. (Two hedged sketches follow point (2) below.)

(2) When a data item in a B+ tree is a composite structure such as (name, age, sex), the tree orders entries by comparing from left to right. When retrieving data like (Zhang San, 20, F), the B+ tree first compares name to decide the search direction; if names are equal, it compares age and then sex in sequence, finally obtaining the data. But for a query like (20, F), with no name available, the tree does not know which node to visit next, because name is the first comparison factor when the tree is built; the search must go through name before it can know where to look next. Likewise, for a query like (Michael, F), the tree can use name to fix the search direction, but the next field, age, is missing, so it must find all records whose name equals Michael and then filter for those whose sex is F. This is a very important property of indexes: the leftmost matching property.
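
A hedged sketch of point (1); the table and column names here are hypothetical illustrations, not from the original article. Choosing the smallest type that fits keeps more index entries per page and the tree shorter:

create table t_order (
    id int unsigned not null auto_increment,  -- 4-byte key; a bigint key would double every index entry
    user_id int unsigned not null,            -- also kept to 4 bytes for the secondary index below
    primary key (id),
    key idx_user (user_id)
) engine = InnoDB;

And a sketch of the leftmost matching property from point (2), with an equally hypothetical person table:

create table person (
    name varchar(50),
    age  int,
    sex  char(1),
    key idx_name_age_sex (name, age, sex)
);

-- Can use the full index: name, the leftmost column, is present.
select * from person where name = 'Zhang San' and age = 20 and sex = 'F';

-- Cannot use the index: without name, the tree has no entry point.
select * from person where age = 20 and sex = 'F';

-- Uses the index on name only; the matching rows are then filtered on sex.
select * from person where name = 'Michael' and sex = 'F';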

3. slow query optimization

The theory behind MySQL indexing is dry; an intuitive grasp is enough, and there is no need to master it exhaustively. Now look back at the slow query from the beginning: after learning the indexing principles, do you have any ideas? First, let's summarize the basic principles for building indexes.

3.1 Principles for building indexes

3.1.1 The leftmost-prefix matching principle

This one is extremely important: mysql keeps matching to the right until it hits a range condition (>, <, between, like), then stops. For a = 1 and b = 2 and c > 3 and d = 4, with an index built in the order (a, b, c, d), d cannot use the index; with an index built as (a, b, d, c), all four conditions can use it, and the order of a, b, and d in the query can be arranged freely.
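
A hedged sketch of the rule; the table t and the index names are hypothetical:

-- With index (a, b, c, d), matching stops at the range on c, so d cannot use the index:
alter table t add index idx_abcd (a, b, c, d);
select * from t where a = 1 and b = 2 and c > 3 and d = 4;

-- With index (a, b, d, c), the equalities on a, b, and d all match, and the range on c can still use the index:
alter table t add index idx_abdc (a, b, d, c);
select * from t where a = 1 and b = 2 and d = 4 and c > 3;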

3.1.2 = and in conditions can be in any order

For example, a = 1 and b = 2 and c = 3 can use an (a, b, c) index built in any order of the three columns; mysql's query optimizer rewrites the conditions into a form the index can recognize.
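
For instance, given a hypothetical table t with an index (a, b, c), the optimizer treats these two statements as equivalent:

select * from t where c = 3 and b = 2 and a = 1;  -- conditions are rewritten to the index order (a, b, c)
select * from t where a = 1 and b = 2 and c = 3;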

3.1.3 use a column with a high degree of discrimination as an index

The discrimination formula is count(distinct col)/count(*), the proportion of distinct values in the field: the larger the proportion, the fewer records we scan. A unique key has discrimination 1, while status or gender fields on large tables may approach 0. You may ask: is there a rule of thumb for this ratio? Different use cases make it hard to fix a value, but we generally require join fields to be above 0.1, i.e. to scan at most 10 records on average per lookup.
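
A hedged way to measure discrimination on the task table from the opening example (the column choice is illustrative):

select count(distinct operator_id) / count(*) as operator_id_selectivity,
       count(distinct status)      / count(*) as status_selectivity
from task;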

3.1.4 index columns cannot participate in calculation

Keep the column "clean". For example, from_unixtime(create_time) = '2017-05-29' cannot use an index. The reason is simple: the B+ tree stores the field values as they exist in the data table, and comparing against them would require applying the function to every element, which is obviously far too costly. The statement should therefore be written as create_time = unix_timestamp('2017-05-29').
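
Side by side, assuming an index exists on create_time:

-- Cannot use the index: every row's create_time must pass through from_unixtime() before comparing.
select * from task where from_unixtime(create_time) = '2017-05-29';

-- Can use the index: unix_timestamp('2017-05-29') is evaluated once into a constant.
select * from task where create_time = unix_timestamp('2017-05-29');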

3.1.5 Extend an existing index when possible instead of creating a new one

For example, if the table already has an index on a and you now want an index on (a, b), you only need to modify the original index, as sketched below.
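
A hedged sketch with hypothetical index names; in MySQL, "extending" means replacing the old index with the wider one:

alter table t drop index idx_a, add index idx_a_b (a, b);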

3.2 Back to the opening slow query

By the leftmost matching principle, the index for the opening SQL should be a composite index on status, operator_id, type, and operate_time, in which the order of the equality columns status, operator_id, and type can be shuffled. This is why I said earlier that all the related queries on this table should be gathered for a comprehensive evaluation.

For example, the table also serves queries such as:

select * from task where status = 0 and type = 12 limit 10;
select count(*) from task where status = 0;

Building the index as (status, type, operator_id, operate_time) is therefore exactly right, because it covers all of these cases through the leftmost matching principle of the index.
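
A minimal sketch of creating that index; the index name is hypothetical:

alter table task add index idx_status_type_op_time (status, type, operator_id, operate_time);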

3.3 A secret weapon for query optimization: the explain command

You are probably familiar with the explain command; for usage details and the meaning of each field, see the official explain-output documentation. One point deserves emphasis here: rows is the core metric, and statements with smaller rows values are almost always faster (there are exceptions, discussed below). Optimizing a statement is therefore largely a matter of reducing rows.
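
For example, running explain on the opening query (the resulting plan depends, of course, on the actual table and indexes):

explain select count(*) from task
where status = 2
  and operator_id = 20839
  and operate_time > 1371169729
  and operate_time < 1371174603
  and type = 2;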

3.4 basic steps for slow query optimization

(1) Run it first to see whether it is really slow; remember to set SQL_NO_CACHE (a sketch follows this list).

(2) Apply the where conditions to each table individually and lock onto the table that returns the fewest records: start the investigation from the table where applying the query's where clause returns the fewest rows, and query each field of that single table separately to see which field has the highest discrimination.

(3) Run explain to check whether the execution plan matches expectations (the query should start from the table that locks the fewest records).

(4) For SQL statements of the order by ... limit form, let the sorted table be queried first.

(5) understand business application scenarios

(6) When adding indexes, follow the indexing principles above.

(7) If the observed result is not as expected, return to step (1) and keep analyzing.
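
A minimal sketch of step (1) on the opening query; SQL_NO_CACHE (available in the MySQL 5.x versions this article targets) bypasses the query cache so the timing reflects real execution:

select sql_no_cache count(*) from task
where status = 2
  and operator_id = 20839
  and operate_time > 1371169729
  and operate_time < 1371174603
  and type = 2;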

4. several slow query cases

The following examples illustrate how to analyze and optimize slow queries.

4.1 complex statement writing

In many cases we write SQL just to implement a feature, but that is only the first step. Different ways of writing a statement often differ fundamentally in efficiency, which requires a clear understanding of mysql's execution plans and indexing principles. Consider the following statement:

select distinct cert.emp_id
from cm_log cl
inner join (
    select emp.id as emp_id, emp_cert.id as cert_id
    from employee emp
    left join emp_certificate emp_cert on emp.id = emp_cert.emp_id
    where emp.is_deleted = 0
) cert
    on (cl.ref_table = 'Employee' and cl.ref_oid = cert.emp_id)
    or (cl.ref_table = 'EmpCertificate' and cl.ref_oid = cert.cert_id)
where cl.last_upd_date >= '2013-11-07 15:03:00'
  and cl.last_upd_date <= '2013-11-08 16:00:00';

(1) Run it first: 53 records in 1.87 seconds, with no aggregation involved, so it counts as slow.

53 rows in set (1.87 sec)

(2) explain

+----+-------------+------------+-------+---------------------------------+-----------------------+---------+-------------------+-------+--------------------------------+
| id | select_type | table      | type  | possible_keys                   | key                   | key_len | ref               | rows  | Extra                          |
+----+-------------+------------+-------+---------------------------------+-----------------------+---------+-------------------+-------+--------------------------------+
|  1 | PRIMARY     | cl         | range | cm_log_cls_id,idx_last_upd_date | idx_last_upd_date     | 8       | NULL              |   379 | Using where; Using temporary   |
|  1 | PRIMARY     | <derived2> | ALL   | NULL                            | NULL                  | NULL    | NULL              | 63727 | Using where; Using join buffer |
|  2 | DERIVED     | emp        | ALL   | NULL                            | NULL                  | NULL    | NULL              | 13317 | Using where                    |
|  2 | DERIVED     | emp_cert   | ref   | emp_certificate_empid           | emp_certificate_empid | 4       | meituanorg.emp.id |     1 | Using index                    |
+----+-------------+------------+-------+---------------------------------+-----------------------+---------+-------------------+-------+--------------------------------+
 

A brief walk through the execution plan: mysql first scans the cm_log table using the idx_last_upd_date index, obtaining 379 records; it then joins against a derived table, scanning 63,727 records. DERIVED marks a constructed table, one that does not physically exist; it can be understood as the result set produced by a subquery, and the number that follows is the id of the statement that builds it, so <derived2> is the virtual table constructed by the query with id = 2, which returns 63,727 records. What does the id = 2 statement do to return so much data? It first full-table-scans the 13,317 records of the employee table, then joins the emp_certificate table through the emp_certificate_empid index; rows = 1 means each join locks only one record, which is efficient. Once built, this result set is joined against cm_log's 379 records by the original rules. From the execution process we can see that far too much data is returned, and most of it is never used, because cm_log locks only 379 records in the first place.

How to optimize it? We can see that the join with cm_log happens only after the derived table is fully built; can we join with cm_log first? Careful analysis shows the split is not hard. The basic idea: when cm_log's ref_table is 'EmpCertificate' it joins the emp_certificate table, and when ref_table is 'Employee' it joins employee, so the statement can be split into two parts connected with union. Note that union is used here rather than union all, because the original statement has distinct to deduplicate records, and union provides exactly that. If the original statement had no distinct and needed no deduplication, union all would be preferable, since union's deduplication step costs SQL performance.

The optimized statements are as follows:

select emp.id
from cm_log cl
inner join employee emp
    on cl.ref_table = 'Employee' and cl.ref_oid = emp.id
where cl.last_upd_date >= '2013-11-07 15:03:00'
  and cl.last_upd_date <= '2013-11-08 16:00:00'
  and emp.is_deleted = 0
union
select emp.id
from cm_log cl
inner join emp_certificate ec
    on cl.ref_table = 'EmpCertificate' and cl.ref_oid = ec.id
inner join employee emp
    on emp.id = ec.emp_id
where cl.last_upd_date >= '2013-11-07 15:03:00'
  and cl.last_upd_date <= '2013-11-08 16:00:00'
  and emp.is_deleted = 0;

(3) There is no need to understand the business scenario here; the rewritten statement only needs to keep its results consistent with the original.

(4) Existing indexes can satisfy the query; there is no need to build new ones.

(5) Run the rewritten statement: it takes only 10 ms, a reduction of nearly 200 times!

+------+--------------+------------+--------+---------------------------------+-------------------+---------+-----------------------+------+-------------+
| id   | select_type  | table      | type   | possible_keys                   | key               | key_len | ref                   | rows | Extra       |
+------+--------------+------------+--------+---------------------------------+-------------------+---------+-----------------------+------+-------------+
|    1 | PRIMARY      | cl         | range  | cm_log_cls_id,idx_last_upd_date | idx_last_upd_date | 8       | NULL                  |  379 | Using where |
|    1 | PRIMARY      | emp        | eq_ref | PRIMARY                         | PRIMARY           | 4       | meituanorg.cl.ref_oid |    1 | Using where |
|    2 | UNION        | cl         | range  | cm_log_cls_id,idx_last_upd_date | idx_last_upd_date | 8       | NULL                  |  379 | Using where |
|    2 | UNION        | ec         | eq_ref | PRIMARY,emp_certificate_empid   | PRIMARY           | 4       | meituanorg.cl.ref_oid |    1 |             |
|    2 | UNION        | emp        | eq_ref | PRIMARY                         | PRIMARY           | 4       | meituanorg.ec.emp_id  |    1 | Using where |
| NULL | UNION RESULT | <union1,2> | ALL    | NULL                            | NULL              | NULL    | NULL                  | NULL |             |
+------+--------------+------------+--------+---------------------------------+-------------------+---------+-----------------------+------+-------------+

53 rows in set (0.01 sec)
 
4.2 Understand the application scenario

The purpose of this example is to overturn our assumptions about column discrimination. Generally we think that the higher a column's discrimination, the more easily it locks down a small number of records; but in some special circumstances this theory has its limits.

select * from stage_poi sp
where sp.accurate_result = 1
  and (sp.sync_status = 0 or sp.sync_status = 2 or sp.sync_status = 4);

(1) First, see how long it runs: 951 rows in 6.22 seconds, genuinely slow.

951 rows in set (6.22 sec)

(2) Run explain: rows reaches 3.61 million, and type = ALL indicates a full-table scan.

+----+-------------+-------+------+---------------+------+---------+------+---------+-------------+
| id | select_type | table | type | possible_keys | key  | key_len | ref  | rows    | Extra       |
+----+-------------+-------+------+---------------+------+---------+------+---------+-------------+
|  1 | SIMPLE      | sp    | ALL  | NULL          | NULL | NULL    | NULL | 3613155 | Using where |
+----+-------------+-------+------+---------------+------+---------+------+---------+-------------+

(3) Apply all the where conditions to get the number of records returned; since this is a single-table query, step (1) already established the answer: 951 records.

(4) Try to bring the rows in explain as close to 951 as possible. First, look at how the records distribute over accurate_result:

select count(*), accurate_result from stage_poi group by accurate_result;

+----------+-----------------+
| count(*) | accurate_result |
+----------+-----------------+
|     1023 |              -1 |
|  2114655 |               0 |
|   972815 |               1 |
+----------+-----------------+

The discrimination of the accurate_result field is clearly very low: the entire table has only three values (-1, 0, and 1), so no index on it can lock down a sufficiently small amount of data.

Let's take a look at the sync_status field.

select count(*), sync_status from stage_poi group by sync_status;

+----------+-------------+
| count(*) | sync_status |
+----------+-------------+
|     3080 |           0 |
|  3085413 |           3 |
+----------+-------------+

Its discrimination is just as low; in theory, then, this field is equally unsuitable for an index.

With the analysis at this point, the conclusion seems to be that the table cannot be optimized: both columns have very low discrimination, and even if indexed they would only suit this particular data distribution, making a generally applicable optimization hard. For example, if the 0 and 3 values of sync_status were evenly distributed, an index would still lock millions of records.

(5) Talk to the business side to learn the application scenario. The business side runs this SQL every five minutes to scan for qualifying data, and after processing, sync_status is changed to 1; the number of qualifying records within any five-minute window is not large, about 1,000. Once the business scenario is understood, optimizing the SQL becomes easy, because the business side guarantees the skew of the data: an index can filter out most of the unneeded rows.

(6) Use the following statement to create an index based on the index creation rules:

alter table stage_poi add index idx_acc_status(accurate_result,sync_status);

(7) Observe the result: it now takes only 200 ms, more than 30 times faster.

952 rows in set (0.20 sec)

Reviewing the analysis process: single-table queries are relatively easy to optimize; most of the time it is enough to add the fields in the where condition to an index according to the rules. But if optimization were done that "mindlessly", columns with very low discrimination would get indexed when they should not be, seriously hurting insert and update performance and possibly affecting other queries too. So step (5), understanding the business side's usage scenario, is critical; only with that knowledge can we properly analyze and optimize query statements.

4.3 statements that cannot be optimized
select
    c.id, c.name, c.position, c.sex, c.phone, c.office_phone,
    c.feature_info, c.birthday, c.creator_id, c.is_keyperson,
    c.giveup_reason, c.status, c.data_source,
    from_unixtime(c.created_time) as created_time,
    from_unixtime(c.last_modified) as last_modified,
    c.last_modified_user_id
from contact c
inner join contact_branch cb
    on c.id = cb.contact_id
inner join branch_user bu
    on cb.branch_id = bu.branch_id and bu.status in (1, 2)
inner join org_emp_info oei
    on oei.data_id = bu.user_id
    and oei.node_left >= 2875
    and oei.node_right <= 10802
    and oei.org_category = -1
order by c.created_time desc
limit 0, 10;

Follow the same steps:

(1) First, see how long the statement runs: 13 seconds for 10 records, which is intolerable.

10 rows in set (13.06 sec)

(2) explain

+----+-------------+-------+--------+-------------------------------------+-------------------------+---------+--------------------------+------+----------------------------------------------+
| id | select_type | table | type   | possible_keys                       | key                     | key_len | ref                      | rows | Extra                                        |
+----+-------------+-------+--------+-------------------------------------+-------------------------+---------+--------------------------+------+----------------------------------------------+
|  1 | SIMPLE      | oei   | ref    | idx_category_left_right,idx_data_id | idx_category_left_right | 5       | const                    | 8849 | Using where; Using temporary; Using filesort |
|  1 | SIMPLE      | bu    | ref    | PRIMARY,idx_userid_status           | idx_userid_status       | 4       | meituancrm.oei.data_id   |   76 | Using where; Using index                     |
|  1 | SIMPLE      | cb    | ref    | idx_branch_id,idx_contact_branch_id | idx_branch_id           | 4       | meituancrm.bu.branch_id  |    1 |                                              |
|  1 | SIMPLE      | c     | eq_ref | PRIMARY                             | PRIMARY                 | 108     | meituancrm.cb.contact_id |    1 |                                              |
+----+-------------+-------+--------+-------------------------------------+-------------------------+---------+--------------------------+------+----------------------------------------------+

From the execution plan, mysql first queries the org_emp_info table, scanning 8,849 records, then uses the idx_userid_status index to join branch_user, the idx_branch_id index to join contact_branch, and finally the primary key to join contact.

Each join returns very few rows, and nothing looks abnormal. Looking back at the statement, though, there is an order by + limit at the end. Could the sort be handling an enormous number of rows? So we simplify the SQL, removing the order by and limit, to see how many records actually feed the sort:

select count(*)
from contact c
inner join contact_branch cb
    on c.id = cb.contact_id
inner join branch_user bu
    on cb.branch_id = bu.branch_id and bu.status in (1, 2)
inner join org_emp_info oei
    on oei.data_id = bu.user_id
    and oei.node_left >= 2875
    and oei.node_right <= 10802
    and oei.org_category = -1;

+----------+
| count(*) |
+----------+
|   778878 |
+----------+
1 row in set (5.19 sec)

We found that 778,878 records are locked before sorting; sorting a roughly 700,000-row result set would be disastrous, no wonder it is so slow. Can we flip the approach: sort the records by contact.created_time first and then join? Would that be faster?

The statement is therefore rewritten as below (straight_join is another way to force such a join order):

SELECT
    c.id, c.name, c.position, c.sex, c.phone, c.office_phone,
    c.feature_info, c.birthday, c.creator_id, c.is_keyperson,
    c.giveup_reason, c.status, c.data_source,
    from_unixtime(c.created_time) AS created_time,
    from_unixtime(c.last_modified) AS last_modified,
    c.last_modified_user_id
FROM contact c
WHERE EXISTS (
    SELECT 1
    FROM contact_branch cb
    INNER JOIN branch_user bu
        ON cb.branch_id = bu.branch_id AND bu.status IN (1, 2)
    INNER JOIN org_emp_info oei
        ON oei.data_id = bu.user_id
        AND oei.node_left >= 2875
        AND oei.node_right <= 10802
        AND oei.org_category = -1
    WHERE c.id = cb.contact_id
)
ORDER BY c.created_time DESC
LIMIT 0, 10;

Verify that the effect is as expected: it returns in under 1 ms, an improvement of more than 13,000 times!

10 rows in set (0.00 sec)

I thought the job was done, but a detail slipped through the earlier analysis. In theory, sorting first and then joining should cost the same as joining first and then sorting; the improvement here comes entirely from the limit! Roughly, the execution goes like this: mysql takes the first 10 records in index-sorted order, then runs the join filter; if fewer than 10 survive, it takes the next 10 and joins again. When the join filters out a large share of the rows this is a disaster; in the extreme case, when no row qualifies at all, mysql still fetches 10 at a time and ends up traversing nearly the whole table!

Testing the same SQL with a different parameter:

select sql_no_cache
    c.id, c.name, c.position, c.sex, c.phone, c.office_phone,
    c.feature_info, c.birthday, c.creator_id, c.is_keyperson,
    c.giveup_reason, c.status, c.data_source,
    from_unixtime(c.created_time) as created_time,
    from_unixtime(c.last_modified) as last_modified,
    c.last_modified_user_id
from contact c
where exists (
    select 1
    from contact_branch cb
    inner join branch_user bu
        on cb.branch_id = bu.branch_id and bu.status in (1, 2)
    inner join org_emp_info oei
        on oei.data_id = bu.user_id
        and oei.node_left >= 2875
        and oei.node_right <= 2875
        and oei.org_category = -1
    where c.id = cb.contact_id
)
order by c.created_time desc
limit 0, 10;

Empty set (2 min 18.99 sec)

2 min 18.99 sec! Far worse than before. Because of mysql's nested-loop mechanism, this situation basically cannot be optimized at the SQL level; the statement can only be handed back to the application system to optimize its own logic.

This example shows that not every statement can be optimized, and a statement optimized for one use case can regress badly under extreme parameters. Therefore: first, do not expect every statement to be fixable through SQL optimization; second, do not be overconfident, optimizing only for a specific case while ignoring more complex situations.

The slow-query case analysis ends here; the above are only some typical examples. During optimization we have encountered "garbage SQL" of over 1,000 lines joining 16 tables, slow queries caused by differences between online and offline databases that brought applications down, varchar equality comparisons written without single quotes, and Cartesian-product queries that killed the slave database outright. Many cases are really just accumulated experience: once you are familiar with the inner workings of the query optimizer and indexes, analyzing such cases becomes easy.

5. Closing remarks

Using a slow-query case, this article introduced MySQL's indexing principles and some methodology for optimizing slow queries, with detailed analysis of several typical cases. After this long stretch of statement optimization, we have found that no database-level optimization can match optimization of the application system itself: the same MySQL that supports Google/Facebook/Taobao-scale applications may fail to support your personal website. To borrow a recently popular phrase: "Querying is easy; optimizing is not. Write with care and cherish it!"

6. References

(1) High Performance MySQL

(2) Data Structures and Algorithm Analysis

(3) The simplest implementation of a database: http://www.ruanyifeng.com/blog/2014/07/database_implementation.html

(4) MySQL index principles and slow query optimization: http://tech.meituan.com/mysql-index.html
