MySQL ORDER BY, GROUP BY, and DISTINCT: implementation principles


Preface

Besides regular JOIN statements, another class of queries is used very frequently: ORDER BY, GROUP BY, and DISTINCT. Because all three involve sorting or grouping data, I discuss them together. The following is a basic analysis of how MySQL handles each of them.

Implementation and optimization of ORDER BY

In MySQL, ORDER BY can be implemented in two ways:

The first is to read the data through an ordered index, so that the rows come back already sorted and can be returned to the client without any sort operation.

The second is to take the rows returned by the storage engine, sort them with MySQL's sorting algorithm, and then return the sorted data to the client.

Let's briefly analyze both implementations, starting with the one that needs no sorting. As before, an example illustrates it:

    example 09:48:41> EXPLAIN
        -> SELECT m.id, m.subject, c.content
        -> FROM group_message m, group_message_content c
        -> WHERE m.group_id = 1 AND m.id = c.group_msg_id
        -> ORDER BY m.user_id\G
    *************************** 1. row ***************************
               id: 1
      select_type: SIMPLE
            table: m
             type: ref
    possible_keys: PRIMARY,idx_group_message_gid_uid
              key: idx_group_message_gid_uid
          key_len: 4
              ref: const
             rows: 4
            Extra: Using where
    *************************** 2. row ***************************
               id: 1
      select_type: SIMPLE
            table: c
             type: ref
    possible_keys: group_message_content_msg_id
              key: group_message_content_msg_id
          key_len: 4
              ref: example.m.id
             rows: 11
            Extra:

The query above contains ORDER BY user_id, so why does the execution plan show no sort operation? The reason is that the MySQL Query Optimizer chose an ordered index (idx_group_message_gid_uid) to access the table, so the rows obtained through the group_id condition are already sorted by group_id and then by user_id. Although we sort only by user_id, the WHERE condition guarantees that every returned row has the same group_id, so the result set is identical whether or not group_id is part of the sort key.
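The reasoning above can be checked with a small Python sketch. It is a conceptual model rather than MySQL's actual implementation: the composite index is modeled as a sorted list of (group_id, user_id, row_id) tuples, and the data and the names `index_gid_uid` and `rows_for_group` are made up for illustration.

```python
from bisect import bisect_left, bisect_right

# Hypothetical index entries: (group_id, user_id, row_id), kept sorted the
# way an ordered index on (group_id, user_id) would keep them.
index_gid_uid = sorted([
    (2, 7, 105), (1, 9, 101), (1, 3, 102),
    (3, 1, 106), (1, 5, 103), (2, 2, 104),
])


def rows_for_group(index, group_id):
    """Return the index entries for one group_id, in index order."""
    lo = bisect_left(index, (group_id,))
    hi = bisect_right(index, (group_id, float("inf")))
    return index[lo:hi]


matching = rows_for_group(index_gid_uid, 1)
user_ids = [user_id for _, user_id, _ in matching]
print(user_ids)  # prints [3, 5, 9]
```

Because every entry in the slice shares the same group_id, the slice is already ordered by user_id and no separate sort step is needed.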

[Figure: the execution process of the query above. TableA and TableB in the figure are the two tables group_message and group_message_content from the query.]

Sorting through an index is the best way to sort a result set in MySQL, because it completely avoids the resource consumption of the sort computation. So when optimizing a query that contains ORDER BY, use an existing index to avoid the actual sort wherever possible; this can improve the performance of the ORDER BY considerably. In some cases it is even worth adding a column to an index or reordering the indexed columns to avoid the sort. Of course, before adjusting an index you need to evaluate the impact on other queries and weigh the overall gain against the loss.

If no index can be used, how does MySQL implement the sort? In that case MySQL has no choice but to sort the data returned by the storage engine with a sorting algorithm. Let's analyze this implementation next.

MySQL currently has two algorithms for sorting the data:

1. Read only the sort columns and the row pointer (which locates the row directly) for the rows that satisfy the filter conditions, perform the actual sort in the sort buffer, then use the row pointers to go back to the table, fetch the other columns the client requested, and return the result to the client;

2. Read the sort columns and all the other columns requested by the client in one pass, according to the filter conditions, and store the columns that do not take part in the sort in a separate memory area. Then sort the sort columns and row pointers in the sort buffer. Finally, match the sorted row pointers against the row pointers stored with the other columns in the memory area, merge them into the result set, and return it to the client in order.
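The two algorithms can be contrasted with a minimal Python sketch. It follows the description above and is only a model, not MySQL's source code: the table is a hypothetical dictionary keyed by row pointer, and `reads` simply counts how many times a row is fetched from it.

```python
# Hypothetical table: row_id -> row. Each fetch from TABLE stands in for a
# read against the storage engine.
TABLE = {
    1: {"id": 1, "group_id": 1, "subject": "pear"},
    2: {"id": 2, "group_id": 2, "subject": "kiwi"},
    3: {"id": 3, "group_id": 1, "subject": "apple"},
    4: {"id": 4, "group_id": 1, "subject": "fig"},
}


def original_filesort(table, group_id, sort_col, wanted):
    """Sort (sort key, row pointer) pairs, then fetch each row a second time."""
    reads = 0
    sort_buffer = []
    for row_id, row in table.items():
        reads += 1                       # first access: filter and pick the key
        if row["group_id"] == group_id:
            sort_buffer.append((row[sort_col], row_id))
    sort_buffer.sort()
    result = []
    for _, row_id in sort_buffer:
        reads += 1                       # second access, through the row pointer
        row = table[row_id]
        result.append({col: row[col] for col in wanted})
    return result, reads


def improved_filesort(table, group_id, sort_col, wanted):
    """Fetch the requested columns during the first pass and keep them in memory."""
    reads = 0
    sort_buffer = []
    other_columns = {}                   # the separate memory area
    for row_id, row in table.items():
        reads += 1
        if row["group_id"] == group_id:
            sort_buffer.append((row[sort_col], row_id))
            other_columns[row_id] = {col: row[col] for col in wanted}
    sort_buffer.sort()
    result = [other_columns[row_id] for _, row_id in sort_buffer]
    return result, reads


old_rows, old_reads = original_filesort(TABLE, 1, "subject", ["id", "subject"])
new_rows, new_reads = improved_filesort(TABLE, 1, "subject", ["id", "subject"])
print(old_reads, new_reads)  # prints 7 4
```

Both functions return the same ordered rows; the second one never goes back to the table, at the price of holding the extra columns in memory.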

The first algorithm is the one MySQL has always had; the second is an improved algorithm introduced in MySQL 4.1. Its main advantage over the first is that it avoids accessing the data twice: after sorting there is no need to go back to the table to fetch data, which saves I/O. The second algorithm does consume more memory, a typical case of trading memory for time. Now let's look at the execution plan when MySQL has to use a sorting algorithm, changing only the sort column:

    example 10:09:06> EXPLAIN
        -> SELECT m.id, m.subject, c.content
        -> FROM group_message m, group_message_content c
        -> WHERE m.group_id = 1 AND m.id = c.group_msg_id
        -> ORDER BY m.subject\G
    *************************** 1. row ***************************
               id: 1
      select_type: SIMPLE
            table: m
             type: ref
    possible_keys: PRIMARY,idx_group_message_gid_uid
              key: idx_group_message_gid_uid
          key_len: 4
              ref: const
             rows: 4
            Extra: Using where; Using filesort
    *************************** 2. row ***************************
               id: 1
      select_type: SIMPLE
            table: c
             type: ref
    possible_keys: group_message_content_msg_id
              key: group_message_content_msg_id
          key_len: 4
              ref: example.m.id
             rows: 11
            Extra:

At first glance the execution plan looks no different. But careful readers will have noticed "Using filesort" in the Extra information of the group_message table. This is the MySQL Query Optimizer telling us that it needs a sort operation to return the data in the order the client requested. The query executes as follows:

Once MySQL has obtained the data from the first table, it first sorts that data by the sort condition (the filesort). It then uses the sorted result set as the driving result set and accesses the second table through a nested loop join. Note that filesort does not mean the sort goes through files on disk; it only tells us that a sort operation is performed.

Above, the sort is a simple filesort on the result set of a single table. In real applications the requirements are often different: the sort columns may come from two tables, or MySQL may only be able to sort after the join. In such cases MySQL cannot simply sort in the sort buffer; it must first put the result set of the join into a temporary table and then read the data from the temporary table into the sort buffer. Let's change the sort requirement once more to see such an execution plan, this time sorting by the content column of the group_message_content table:

    example 10:22:42> EXPLAIN
        -> SELECT m.id, m.subject, c.content
        -> FROM group_message m, group_message_content c
        -> WHERE m.group_id = 1 AND m.id = c.group_msg_id
        -> ORDER BY c.content\G
    *************************** 1. row ***************************
               id: 1
      select_type: SIMPLE
            table: m
             type: ref
    possible_keys: PRIMARY,idx_group_message_gid_uid
              key: idx_group_message_gid_uid
          key_len: 4
              ref: const
             rows: 4
            Extra: Using temporary; Using filesort
    *************************** 2. row ***************************
               id: 1
      select_type: SIMPLE
            table: c
             type: ref
    possible_keys: group_message_content_msg_id
              key: group_message_content_msg_id
          key_len: 4
              ref: example.m.id
             rows: 11
            Extra:

"Using temporary" now appears in the execution plan because the sort has to be done after the two tables are joined. The query executes as follows:

First, group_message and group_message_content (TableA and TableB in the figures) are joined, then the result set goes into a temporary table, then the filesort runs, and finally the ordered result set is returned to the client.

The two examples above show how MySQL implements ORDER BY when it cannot avoid a sort and has to use a sorting algorithm. Although two sorting algorithms are available, their internal mechanisms are largely the same.

What can we do to optimize when a sort cannot be avoided? Clearly, we should get MySQL to use the second algorithm whenever possible, because it reduces the amount of random I/O and greatly improves sorting efficiency.

1. Increase the max_length_for_sort_data setting;

MySQL decides between the old sorting algorithm and the improved one based on the max_length_for_sort_data parameter. When the total maximum length of all the columns we return is less than this value, MySQL chooses the improved algorithm; otherwise it chooses the old one. So if there is enough memory for MySQL to hold the non-sort columns that need to be returned, we can raise this parameter so that MySQL uses the improved algorithm.

2. Remove unnecessary return columns;

When memory is not abundant, we cannot simply force MySQL onto the improved algorithm by raising the parameter above, because MySQL might then have to split the data into many segments and sort them separately, and the cost could outweigh the benefit. In this case, drop the columns that do not need to be returned so that the returned row length fits within max_length_for_sort_data.

3. Increase the sort_buffer_size setting;

Increasing sort_buffer_size is not meant to make MySQL choose the improved algorithm. Its purpose is to reduce how often MySQL has to split the data being sorted into segments, because segmenting forces MySQL to use temporary files on disk to exchange data during the sort.
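The effect of the buffer size on segmentation can be illustrated with a short Python sketch. It is only a model under simplifying assumptions: the buffer is measured in rows rather than bytes, the segments stay in memory instead of going to temporary storage, and `sort_with_buffer` is a hypothetical helper, not a MySQL function.

```python
import heapq


def sort_with_buffer(values, buffer_rows):
    """Sort `values` with a buffer that holds at most `buffer_rows` items.

    Returns the sorted list and the number of sorted segments produced.
    One segment means the whole sort fit into the buffer.
    """
    segments = [
        sorted(values[start:start + buffer_rows])
        for start in range(0, len(values), buffer_rows)
    ]
    # Merging the sorted segments is the extra work caused by a small buffer.
    merged = list(heapq.merge(*segments))
    return merged, len(segments)


data = [42, 7, 19, 3, 88, 61, 5, 27, 14, 70]
small_sorted, small_segments = sort_with_buffer(data, buffer_rows=3)
large_sorted, large_segments = sort_with_buffer(data, buffer_rows=16)
print(small_segments, large_segments)  # prints 4 1
```

The result is the same in both cases; what changes is how many segments have to be produced and merged, which is the work a larger sort buffer saves.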

Implementation and optimization of GROUP BY

GROUP BY also involves sorting; compared with ORDER BY, it mainly adds a grouping step after the sort. If aggregate functions are used with the grouping, they have to be computed as well. So, like ORDER BY, GROUP BY can take advantage of indexes.

MySQL implements GROUP BY in three ways. Two of them use existing index information to complete the GROUP BY, and the third is for cases where no index can be used at all. Let's analyze the three implementations one by one.

1. Using a loose index scan to implement GROUP BY

What is a loose index scan? When MySQL implements GROUP BY entirely through an index scan, it does not need to read every index key that satisfies the conditions; reading only part of the keys is enough to produce the result.

The following example illustrates GROUP BY with a loose index scan. Before running it, we need to adjust the indexes of the group_message table by adding the gmt_create column to the index on group_id and user_id:

    example 08:49:45> CREATE INDEX idx_gid_uid_gc
        -> ON group_message(group_id, user_id, gmt_create);
    Query OK, rows affected (0.03 sec)
    Records: 96  Duplicates: 0  Warnings: 0

    example 09:07:30> DROP INDEX idx_group_message_gid_uid
        -> ON group_message;
    Query OK, affected (0.02 sec)
    Records: 96  Duplicates: 0  Warnings: 0

Then look at the following query execution plan:

    example 09:26:15> EXPLAIN
        -> SELECT user_id, MAX(gmt_create)
        -> FROM group_message
        -> WHERE group_id < 10
        -> GROUP BY group_id, user_id\G
    *************************** 1. row ***************************
               id: 1
      select_type: SIMPLE
            table: group_message
             type: range
    possible_keys: idx_gid_uid_gc
              key: idx_gid_uid_gc
          key_len: 8
              ref: NULL
             rows: 4
            Extra: Using where; Using index for group-by
    1 row in set (0.00 sec)

The Extra information of the execution plan shows "Using index for group-by". This tells us that the MySQL Query Optimizer uses a loose index scan to perform the GROUP BY.

[Figure: approximate process of the loose index scan]

To use a loose index scan for GROUP BY, at least the following conditions must be met:

The GROUP BY columns must form a contiguous leftmost prefix of a single index;

Only the aggregate functions MAX and MIN may be used with the GROUP BY;

If the query references an index column that is not part of the GROUP BY, that column must be compared with a constant.

Why is a loose index scan so efficient?

When there is no WHERE clause, the query would otherwise need a full index scan, but a loose index scan reads only about as many keys as there are groups, which is far fewer than the number of keys that actually exist. When the WHERE clause contains a range or equality condition, the loose index scan finds the first key of each group that satisfies the range condition and again reads the smallest possible number of keys.
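To make the key-count argument concrete, here is a small Python sketch. It is a conceptual model, not how MySQL is implemented internally: a hypothetical index on (group_id, user_id, gmt_create) is represented as a sorted list, and `loose_scan_max` is an illustrative name. The counter tracks only the keys the scan lands on, not the comparisons made inside each seek.

```python
from bisect import bisect_right

# Hypothetical index: 3 group_ids x 4 user_ids x 50 gmt_create values.
INDEX = sorted(
    (group_id, user_id, gmt_create)
    for group_id in range(1, 4)
    for user_id in range(1, 5)
    for gmt_create in range(50)
)


def loose_scan_max(index):
    """MAX(gmt_create) per (group_id, user_id), skipping over each group."""
    result, keys_read, pos = {}, 0, 0
    while pos < len(index):
        group_id, user_id, _ = index[pos]     # first key of the group
        keys_read += 1
        # Seek straight to the end of the group instead of walking through it.
        end = bisect_right(index, (group_id, user_id, float("inf")), lo=pos)
        result[(group_id, user_id)] = index[end - 1][2]   # last key holds MAX
        keys_read += 1
        pos = end
    return result, keys_read


maxima, keys_read = loose_scan_max(INDEX)
print(len(INDEX), len(maxima), keys_read)  # prints 600 12 24
```

The sketch lands on two keys per group (the first to learn the prefix, the last to read the maximum), so the work grows with the number of groups rather than with the number of index entries.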

2. Using a tight (compact) index scan to implement GROUP BY

The difference between a tight index scan and a loose index scan is that the tight scan has to read every index key that satisfies the conditions, and then performs the GROUP BY on the data it has read.

    example 08:55:14> EXPLAIN
        -> SELECT MAX(gmt_create)
        -> FROM group_message
        -> WHERE group_id = 2
        -> GROUP BY user_id\G
    *************************** 1. row ***************************
               id: 1
      select_type: SIMPLE
            table: group_message
             type: ref
    possible_keys: idx_group_message_gid_uid,idx_gid_uid_gc
              key: idx_gid_uid_gc
          key_len: 4
              ref: const
             rows: 4
            Extra: Using where; Using index
    1 row in set (0.01 sec)

This time the Extra information does not contain "Using index for group-by", but that does not mean the GROUP BY is not done through an index. MySQL simply has to read all the index keys that satisfy the WHERE condition before it can produce the result. This is what the execution plan of a GROUP BY implemented with a tight index scan looks like.
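A tight index scan can be modeled in a few lines of Python. As with the other sketches, this is an illustration under assumptions rather than MySQL's implementation: the index is a hypothetical sorted list of (group_id, user_id, gmt_create) tuples and `tight_scan_max` is a made-up name.

```python
from bisect import bisect_left, bisect_right
from itertools import groupby

# Hypothetical sorted index on (group_id, user_id, gmt_create).
INDEX = sorted([
    (1, 10, 100), (2, 30, 140), (2, 10, 120), (2, 10, 160),
    (2, 20, 130), (2, 20, 180), (3, 10, 150),
])


def tight_scan_max(index, group_id):
    """MAX(gmt_create) per user_id for one group_id, reading every matching key."""
    lo = bisect_left(index, (group_id,))
    hi = bisect_right(index, (group_id, float("inf")))
    keys = index[lo:hi]                  # every key in the range is read
    # The keys arrive ordered by user_id, so consecutive grouping is enough
    # and no sort or temporary table is needed.
    result = {
        user_id: max(entry[2] for entry in entries)
        for user_id, entries in groupby(keys, key=lambda entry: entry[1])
    }
    return result, len(keys)


maxima, keys_read = tight_scan_max(INDEX, 2)
print(maxima, keys_read)  # prints {10: 160, 20: 180, 30: 140} 5
```

Unlike the loose scan, the number of keys read equals the number of index entries that match the WHERE condition, but the index order still makes a separate sort unnecessary.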

[Figure: approximate execution process of GROUP BY with a tight index scan]

The MySQL Query Optimizer first tries to implement GROUP BY with a loose index scan, and falls back to a tight index scan when it finds that the conditions for a loose index scan are not met.

When the GROUP BY columns are not contiguous in the index, or do not form a prefix of it, the MySQL Query Optimizer cannot use a loose index scan, and the GROUP BY cannot be completed directly through the index because the missing index key parts are unknown. However, if the query contains a constant condition on the missing key parts, a tight index scan can complete the GROUP BY, because the constants fill the "gaps" in the search key and form a complete index prefix that can be used for index lookups. If the GROUP BY result needs to be sorted and the search keys form an index prefix, MySQL can also avoid an extra sort, because a lookup by prefix in an ordered index retrieves all the keys in order.

3. Using temporary tables to implement GROUP BY

For MySQL to perform GROUP BY through an index, all the GROUP BY columns must be in the same index, and that index must be ordered (a hash index, for example, does not qualify). Beyond that, whether an index can be used for GROUP BY also depends on the aggregate functions used.

The first two GROUP BY implementations apply when a usable index exists. When the MySQL Query Optimizer cannot find a suitable index, it has to read the required data first and then complete the GROUP BY with a temporary table.

    example 09:02:40> EXPLAIN
        -> SELECT MAX(gmt_create)
        -> FROM group_message
        -> WHERE group_id > 1 AND group_id < 10
        -> GROUP BY user_id\G
    *************************** 1. row ***************************
               id: 1
      select_type: SIMPLE
            table: group_message
             type: range
    possible_keys: idx_group_message_gid_uid,idx_gid_uid_gc
              key: idx_gid_uid_gc
          key_len: 4
              ref: NULL
             rows: 32
            Extra: Using where; Using index; Using temporary; Using filesort

The execution plan makes it clear: MySQL finds the data we need through the index, then creates a temporary table and sorts it to produce the GROUP BY result.

[Figure: approximate execution process of GROUP BY with a temporary table]

When the MySQL Query Optimizer finds that an index scan alone cannot produce the GROUP BY result directly, it has to use a temporary table and then a filesort.

That is the case in this example. The condition on group_id is a range rather than a constant, and the GROUP BY column is user_id. MySQL therefore cannot rely on the index order to implement the GROUP BY; it can only read the required data through an index range scan, save it into a temporary table, and then sort and group it to complete the GROUP BY.
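The following Python sketch shows why the index order does not help here. It is a simplified model with made-up data: a dictionary plays the role of the temporary table and the final `sorted` call plays the role of the filesort; neither reflects MySQL's internal data structures.

```python
# Hypothetical sorted index on (group_id, user_id, gmt_create).
INDEX = sorted([
    (2, 30, 140), (2, 10, 120), (3, 20, 130), (3, 10, 170),
    (4, 30, 110), (10, 10, 999), (1, 20, 200),
])


def range_scan(index, low, high):
    """Index entries with low < group_id < high, in index order."""
    return [entry for entry in index if low < entry[0] < high]


def group_by_with_temp_table(index, low, high):
    """MAX(gmt_create) per user_id over a group_id range."""
    temp_table = {}
    for _, user_id, gmt_create in range_scan(index, low, high):
        current = temp_table.get(user_id)
        if current is None or gmt_create > current:
            temp_table[user_id] = gmt_create
    return sorted(temp_table.items())    # the sort step after grouping


scan_order = [user_id for _, user_id, _ in range_scan(INDEX, 1, 10)]
grouped = group_by_with_temp_table(INDEX, 1, 10)
print(scan_order)  # prints [10, 30, 10, 20, 30]
print(grouped)     # prints [(10, 170), (20, 130), (30, 140)]
```

Across several group_id values the user_id values come back interleaved, so rows of the same user are not adjacent and have to be collected in a temporary structure first.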

From the three ways MySQL handles GROUP BY, we can derive two optimization ideas:

1. Let MySQL complete the GROUP BY through an index whenever possible, preferably with a loose index scan. Where the system allows, we can achieve this by adjusting either the index or the query;

2. When no index can be used, the GROUP BY needs a temporary table and a filesort, so make sure sort_buffer_size is large enough for the sort, and avoid GROUP BY on large result sets. Once the temporary table grows beyond the configured limit, its data is copied to disk, and the sort and grouping then slow down by orders of magnitude.

How best to apply these two ideas has to be tried and tested in real application scenarios until a good plan emerges. There is also a small trick for GROUP BY that avoids the filesort in some cases where no index can be used: append an ordering clause on NULL (ORDER BY NULL) to the end of the statement. Try it out.

Implementation and optimization of DISTINCT

DISTINCT is very similar to GROUP BY, except that after grouping only one record is taken from each group. Its implementation is therefore basically the same as that of GROUP BY. It can be done with a loose index scan or a tight index scan, and when the index alone is not enough, MySQL has to use a temporary table. One difference from GROUP BY is that DISTINCT does not need to sort. In a query that only performs DISTINCT, if MySQL cannot complete the operation with the index alone, it uses a temporary table to "cache" the data but does not filesort it. Of course, if we combine DISTINCT with GROUP BY and use an aggregate function such as MAX, the filesort cannot be avoided.

Here are a few simple query examples that illustrate how DISTINCT is implemented.

1. First, a DISTINCT completed with a loose index scan:

    example 11:03:41> EXPLAIN SELECT DISTINCT group_id
        -> FROM group_message\G
    *************************** 1. row ***************************
               id: 1
      select_type: SIMPLE
            table: group_message
             type: range
    possible_keys: NULL
              key: idx_gid_uid_gc
          key_len: 4
              ref: NULL
             rows: 10
            Extra: Using index for group-by
    1 row in set (0.00 sec)

The Extra information clearly shows "Using index for group-by". What does this mean? Why does the execution plan say the GROUP BY is done through the index when the query contains no GROUP BY? This follows from how DISTINCT is implemented: it also has to group the data and then return one row from each group to the client. The Extra information tells us that MySQL completed the whole operation with a loose index scan. It would be easier to understand if the MySQL Query Optimizer were a little friendlier and showed "Using index for distinct" here instead.

2. Next, an example of a tight index scan:

    example 11:03:53> EXPLAIN SELECT DISTINCT user_id
        -> FROM group_message
        -> WHERE group_id = 2\G
    *************************** 1. row ***************************
               id: 1
      select_type: SIMPLE
            table: group_message
             type: ref
    possible_keys: idx_gid_uid_gc
              key: idx_gid_uid_gc
          key_len: 4
              ref: const
             rows: 4
            Extra: Using where; Using index
    1 row in set (0.00 sec)

The output here is exactly the same as for GROUP BY implemented with a tight index scan. To execute this query, MySQL has the storage engine scan all the index keys with group_id = 2 and read every user_id. Because the index is sorted, MySQL keeps one record each time the user_id key value changes, and the DISTINCT operation is complete once all the index keys with group_id = 2 have been scanned.
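The "keep one record whenever the key changes" step can be sketched in Python as follows. This is an illustrative model with hypothetical data and names (`INDEX`, `distinct_via_index`), not MySQL code.

```python
from bisect import bisect_left, bisect_right

# Hypothetical sorted index on (group_id, user_id, gmt_create).
INDEX = sorted([
    (1, 10, 100), (2, 10, 120), (2, 10, 160), (2, 20, 130),
    (2, 20, 180), (2, 30, 140), (3, 10, 150),
])


def distinct_via_index(index, group_id):
    """DISTINCT user_id for one group_id, relying on the index order."""
    lo = bisect_left(index, (group_id,))
    hi = bisect_right(index, (group_id, float("inf")))
    result = []
    for _, user_id, _ in index[lo:hi]:
        # Equal user_ids are adjacent, so comparing with the previous
        # value is enough to detect a new one.
        if not result or result[-1] != user_id:
            result.append(user_id)
    return result


print(distinct_via_index(INDEX, 2))  # prints [10, 20, 30]
```

Only the previously emitted value has to be remembered, which is why neither a temporary table nor a sort is needed in this case.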

3. Let's look at what happens when you can't use the index alone to complete distinct:

    example 11:04:40> EXPLAIN SELECT DISTINCT user_id
        -> FROM group_message
        -> WHERE group_id > 1 AND group_id < 10\G
    *************************** 1. row ***************************
               id: 1
      select_type: SIMPLE
            table: group_message
             type: range
    possible_keys: idx_gid_uid_gc
              key: idx_gid_uid_gc
          key_len: 4
              ref: NULL
             rows: 32
            Extra: Using where; Using index; Using temporary
    1 row in set (0.00 sec)

When MySQL cannot complete DISTINCT with the index alone, it has to use a temporary table. But notice one small difference from GROUP BY when a temporary table is used: there is no filesort. MySQL's grouping algorithm does not necessarily have to sort in order to group, as already mentioned in the GROUP BY optimization tip above. Here MySQL completes the DISTINCT without sorting, so the filesort step is absent.
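A rough Python model of this case is shown below. It assumes that the temporary table behaves like a lookup structure keyed by user_id, which is a simplification for illustration; the data and the name `distinct_with_temp_table` are hypothetical.

```python
# Hypothetical sorted index on (group_id, user_id, gmt_create).
INDEX = sorted([
    (2, 30, 1), (2, 40, 2), (3, 10, 3), (3, 30, 4),
    (4, 20, 5), (1, 50, 6), (10, 60, 7),
])


def distinct_with_temp_table(index, low, high):
    """DISTINCT user_id for low < group_id < high, without any sort."""
    temp_table = {}                      # dicts keep insertion order
    for group_id, user_id, _ in index:
        if low < group_id < high and user_id not in temp_table:
            temp_table[user_id] = True
    return list(temp_table)


distinct_ids = distinct_with_temp_table(INDEX, 1, 10)
print(distinct_ids)  # prints [30, 40, 10, 20]
```

The values come out in the order they were first seen rather than in sorted order, which matches the observation that no filesort is involved.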

4. Finally, DISTINCT combined with GROUP BY:

    example 11:05:06> EXPLAIN SELECT DISTINCT MAX(user_id)
        -> FROM group_message
        -> WHERE group_id > 1 AND group_id < 10
        -> GROUP BY group_id\G
    *************************** 1. row ***************************
               id: 1
      select_type: SIMPLE
            table: group_message
             type: range
    possible_keys: idx_gid_uid_gc
              key: idx_gid_uid_gc
          key_len: 4
              ref: NULL
             rows: 32
            Extra: Using where; Using index; Using temporary; Using filesort
    1 row in set (0.00 sec)

This last example combines DISTINCT with GROUP BY and an aggregate function. Compared with the third example above, there is now an additional filesort, because we used the MAX function.

The optimization ideas for DISTINCT are basically the same as for GROUP BY. The key is to make good use of indexes, and when no index can be used, to avoid DISTINCT on large result sets, because disk I/O is orders of magnitude slower than operations in memory.

Summary

This chapter covered ideas and methods for tuning the performance of MySQL queries, together with some examples, in the hope of helping readers broaden their thinking in real work. Although the chapter touched on initial index design, principles for writing efficient queries, and debugging statements, there is far more to query tuning than this. Many tuning skills can only be truly understood and mastered through hands-on tuning experience. I therefore hope readers will experiment more, grounding their work in theory and in facts; only then can their understanding of query tuning keep improving.
