Under what circumstances does the "INNER JOIN" paging optimization in MySQL take effect?

Last Update:2017-06-13 Source: Internet

Author: User



The source of this article: http://www.cnblogs.com/wy123/p/7003157.html

I recently came across a test case for MySQL paging optimization. It presented one classic scenario as a given, without describing the test conditions in any detail.
Real-world situations vary widely, so any practice or rule that claims to be general has to hold up across many scenarios.
At the same time, when an optimization works we should ask why it works; and when the same approach is applied in a different scenario, we should ask again whether it still helps, and why.
I was skeptical about that test case, so I ran my own tests, which exposed some problems and confirmed some of my expectations.
This article starts from the simplest case and gives a simple analysis of MySQL paging optimization.

Note: the test environment for this article is the lowest-spec cloud server, so the hardware is limited, but the conditions are "equal" for every way of writing the statement.

The classic MySQL paging "optimization"

MySQL paging has a classic problem: the further "back" the requested page, the slower the query. (This depends on the index type on the table; it holds for B-tree indexes, and the same is true in SQL Server.)
SELECT * FROM t ORDER BY id LIMIT m,n;
That is, as m increases, querying the same number of rows becomes slower and slower.
For this problem there is a classic approach, written like the following (or some variant of it).
It first finds the IDs in the paging range, then joins them to the base table to fetch the data that is needed:
SELECT * FROM t
INNER JOIN (SELECT id FROM t ORDER BY id LIMIT m,n) t1 ON t1.id = t.id;

Is this practice always effective? Under what circumstances does the rewrite actually achieve the optimization? Have you ever seen a rewrite that made no difference, or even slowed the query down?

At the same time, the vast majority of queries have filter conditions.
With a filter condition, the SQL statement becomes
SELECT * FROM t WHERE *** ORDER BY id LIMIT m,n;
Following the same pattern, it would be rewritten as
SELECT * FROM t
INNER JOIN (SELECT id FROM t WHERE *** ORDER BY id LIMIT m,n) t1 ON t1.id = t.id;
In this case, does the rewritten SQL statement still deliver an optimization?

Test environment setup

The test data is simple: it is written by a stored procedure into a test table that uses the InnoDB engine.

Note that the binary log write mode must be changed to sync_binlog = 0. Otherwise, with the default setting, writing the 5 million (500w) rows would probably not finish in a day. This is related to how the log is flushed, which I will not go into here.
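
The original setup script is not reproduced in this article. As a rough sketch of what such a setup could look like, the following creates an InnoDB table and fills it from a stored procedure. The table name Test_table1 and the id column come from the queries later in the article; the payload and created_at columns, the procedure name fill_test_table1, and the batch size of 10000 are assumptions made only for illustration.

```sql
-- Hypothetical schema: only the table name and the id primary key come from the article.
CREATE TABLE Test_table1 (
    id         INT NOT NULL AUTO_INCREMENT,
    payload    VARCHAR(100) NOT NULL,   -- assumed filler column
    created_at DATETIME NOT NULL,       -- assumed filler column
    PRIMARY KEY (id)                    -- InnoDB clusters the rows on the primary key
) ENGINE = InnoDB;

-- Relax binary log syncing for the bulk load, as noted above (needs the SUPER privilege).
SET GLOBAL sync_binlog = 0;

DELIMITER $$
CREATE PROCEDURE fill_test_table1(IN total_rows INT)
BEGIN
    DECLARE i INT DEFAULT 0;
    START TRANSACTION;
    WHILE i < total_rows DO
        INSERT INTO Test_table1 (payload, created_at)
        VALUES (CONCAT('row-', i), NOW());
        SET i = i + 1;
        -- Commit in batches so that one transaction does not grow without bound.
        IF i % 10000 = 0 THEN
            COMMIT;
            START TRANSACTION;
        END IF;
    END WHILE;
    COMMIT;
END$$
DELIMITER ;

CALL fill_test_table1(5000000);
```

DELIMITER is a command of the mysql command-line client, so this sketch is meant to be run from that client.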

Why paging queries need optimization

First, look at the classic problem itself: when paging, the further "back" the page, the slower the query.

Test one: querying rows 1-20 takes 0.01 seconds.

Querying the same number of rows further "back", here rows 4900001-4900020, takes 1.97 seconds.

So under the same query conditions, the further back the page, the lower the query efficiency. Put simply: for the same 20 rows, the further back they are, the higher the query cost.
Why the later pages are less efficient is analyzed step by step below.

The test environment is CentOS 7 with MySQL 5.7; the test table holds 5 million (500w) rows.

Reproducing the classic paging "optimization": no improvement when there is no filter condition and the sort column is the clustered index

Here we compare the following two forms when the clustered index column is the sort column:
SELECT * FROM t ORDER BY id LIMIT m,n;
SELECT * FROM t
INNER JOIN (SELECT id FROM t ORDER BY id LIMIT m,n) t1 ON t1.id = t.id;

The first form:

SELECT * FROM Test_table1 ORDER BY id ASC LIMIT 4900000, 20;
In the test, the execution time is 8.31 seconds.

The second form, the rewrite:

SELECT t1.* FROM Test_table1 t1
INNER JOIN (SELECT id FROM Test_table1 ORDER BY id LIMIT 4900000,20) t2 ON t1.id = t2.id;
The execution time is 8.43 seconds.

Clearly, the classic rewrite brings no performance improvement here; it is even slightly slower.
I ran the test many times, and there is no meaningful performance difference between the two forms.

Personally, I think a conclusion like this has to be measured. It cannot rest on guesswork or luck: if something improves efficiency there is a reason why, and if it does not, there is a reason for that too.

So why does the rewrite not deliver the legendary performance gain?
What prevents the rewrite from improving performance here?
And on what principle does the rewrite improve performance when it does?

First look at the structure of the test table. The sort column has an index, which is fine; the key point is that this index is the primary key (the clustered index).

Why does the "optimized" SQL not achieve "optimization" when the column is a clustered index?

When the sort column is the clustered index, both forms scan the table sequentially to find the qualifying rows.
The second form runs a subquery first and then uses the subquery result to drive the main table,
but the subquery does not change the "scan the table sequentially to find the qualifying rows" approach, so in this case the rewrite is simply superfluous.

Compare the two execution plans: the plan of the first form and the third row of the rewritten SQL's plan (the row with id = 2) are basically the same.
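
To inspect the plans yourself, a minimal sketch is to prefix each form with EXPLAIN, using the table from this test. No expected output is shown here, because it depends on the server version and the table statistics.

```sql
-- Plan for the first form.
EXPLAIN
SELECT * FROM Test_table1 ORDER BY id ASC LIMIT 4900000, 20;

-- Plan for the rewritten form; the subquery appears as its own row (id = 2).
EXPLAIN
SELECT t1.*
FROM Test_table1 t1
INNER JOIN (SELECT id FROM Test_table1 ORDER BY id LIMIT 4900000, 20) t2
        ON t1.id = t2.id;
```

The columns worth comparing between the first plan and the id = 2 row of the second are type, key, and rows.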

When there is no filter condition and the sort column is the clustered index, the so-called paging optimization is simply superfluous.

At this point both forms are very slow for this data. What can be done if this data really has to be queried?
To see why it is slow, first understand the balanced structure of the B-tree. My own rough understanding is as follows.
When the queried data is "at the back", it actually lies toward one end of the B-tree index, like the target data shown in the two figures below.
In a balanced tree there is no absolute "front" or "back": the two are relative to each other and depend on the direction of the scan.
Data that is "at the back" from one direction is "at the front" from the other.

The two figures below are a rough representation of the B-tree index structure. If the location of the target data is fixed, it is "at the back" only relative to a left-to-right scan;

if you scan right-to-left, the same data is actually "at the front".

As long as the data is at the front, it can be found efficiently and cheaply. MySQL should have something similar to the forward scan and backward scan in SQL Server.

If the "back" data is read with a reverse scan, it should be found quickly; re-sorting the rows found in ascending order (ASC) should then give the same result.
First look at the effect: the result is exactly the same as that of the queries above, but it takes only 0.07 seconds, while the previous two forms took more than 8 seconds. That is roughly a hundredfold improvement.

As for why, the explanation above should make it clear. The SQL is attached below.
If you often query the so-called "back" data, such as rows with larger IDs or newer rows along the time dimension, you can scan the index in reverse order to page efficiently.

SELECT * FROM (SELECT * FROM Test_table1 ORDER BY id DESC LIMIT 99980, 20) t ORDER BY id;
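
The offset 99980 in the statement above follows from the table size: reading rows m+1 to m+n in ascending order is the same as skipping total - m - n rows in descending order, and 5,000,000 - 4,900,000 - 20 = 99,980. The following is a hedged sketch of how that offset could be computed rather than hard-coded. The user variable names and the statement name page_from_end are made up for illustration, and a prepared statement is used because LIMIT does not accept user variables directly.

```sql
-- Hypothetical variable names; m and n are the ascending-order offset and page size.
SET @m := 4900000;
SET @n := 20;

-- COUNT(*) on an InnoDB table is itself a scan, so in practice the total would be cached.
SET @total := (SELECT COUNT(*) FROM Test_table1);

-- Rows to skip when scanning from the other end. Assumes m + n <= total.
SET @desc_offset := @total - @m - @n;

PREPARE page_from_end FROM
    'SELECT * FROM (
         SELECT * FROM Test_table1 ORDER BY id DESC LIMIT ?, ?
     ) t
     ORDER BY id ASC';
EXECUTE page_from_end USING @desc_offset, @n;
DEALLOCATE PREPARE page_from_end;
```

Because the offset is derived from the row count and not from the id values, this does not depend on the ids being contiguous.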

When there is no filter condition and the sort column is a nonclustered index, the rewrite does improve performance

The following changes are made to the test table Test_table1:
1. Add an id_2 column.
2. Create a unique index on this column.
3. Populate the column with the corresponding primary key ID.
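
The three steps could look roughly like the sketch below. The column type INT and the index name uk_id_2 are assumptions; the article only names the column id_2.

```sql
-- Step 1: add the new column (assumed type).
ALTER TABLE Test_table1 ADD COLUMN id_2 INT NULL;

-- Step 3: copy the primary key value into the new column.
UPDATE Test_table1 SET id_2 = id;

-- Step 2: unique secondary (nonclustered) index on the new column.
CREATE UNIQUE INDEX uk_id_2 ON Test_table1 (id_2);
```

Here the column is populated before the index is created so that the index is built in one pass; the end result is the same as with the order listed above.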

The tests above sorted by the primary key index (the clustered index). Now we sort by a nonclustered index, the new column id_2, and test the two paging forms mentioned at the beginning.

The first form:

SELECT * FROM Test_table1 ORDER BY id_2 ASC LIMIT 4900000, 20;
The execution time is a little over 1 minute; call it 60 seconds.

The second form:

SELECT t1.* FROM Test_table1 t1
INNER JOIN (SELECT id FROM Test_table1 ORDER BY id_2 LIMIT 4900000,20) t2 ON t1.id = t2.id;
The execution time is 1.67 seconds.

In this case, when the sort column is a nonclustered index column, the second form really does improve efficiency greatly, by almost 40 times.
So what is the reason?
First consider the execution plan of the first form. Put simply, it performs a full table scan, re-sorts the rows by id_2, and finally takes the 20 rows of the requested page.
The full table scan is very time-consuming and the sort is also very expensive, so performance is very poor.

Now look at the execution plan of the second form. It first runs the subquery, which scans in the order of the index on id_2, and then uses the qualifying primary key IDs to fetch the table data.
This avoids reading a large amount of data and then re-sorting it (Using filesort).
If you know SQL Server execution plans, the second form also avoids frequent lookups back to the table (what SQL Server calls a key lookup or bookmark lookup).
You can think of the subquery as driving the outer table to fetch the 20 qualifying rows in one batch.
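
One way to check this explanation, sketched under the assumption that the unique index on id_2 described above exists: in InnoDB a secondary index entry also stores the primary key value, so the subquery that selects only id can be answered from the id_2 index without reading the table rows. In EXPLAIN output this typically shows up as "Using index" in the Extra column of the subquery row, whereas the first form would be expected to show "Using filesort".

```sql
-- First form: sorts the full rows by id_2.
EXPLAIN
SELECT * FROM Test_table1 ORDER BY id_2 ASC LIMIT 4900000, 20;

-- Second form: the subquery reads only id and id_2, both present in the secondary index.
EXPLAIN
SELECT t1.*
FROM Test_table1 t1
INNER JOIN (SELECT id FROM Test_table1 ORDER BY id_2 LIMIT 4900000, 20) t2
        ON t1.id = t2.id;
```

Note that the outer query of the second form has no ORDER BY of its own, so if the order of the 20 returned rows matters, an ORDER BY t1.id_2 should be added to the outer query.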

In fact, the rewritten SQL improves paging efficiency only in this situation, that is, when the sort column is a nonclustered index column.
Even so, this "optimized" paging statement is still significantly slower than the following form.
As seen above, the query below returns the same data in 0.07 seconds, roughly 24 times faster than the 1.67 seconds here, which is more than one order of magnitude.

SELECT * FROM (SELECT * FROM Test_table1 ORDER BY id DESC LIMIT 99980, 20) t ORDER BY id;

Another point worth mentioning: if you page frequently in a particular order, why not make that column the clustered index?
For example, if you declare an auto-increment ID, or time plus other fields that guarantee uniqueness, as the primary key, MySQL automatically creates the clustered index on it.
With a clustered index, "front" and "back" are only relative, logical concepts; if most queries fetch the "back" or newer data, you can use the form above.

Optimization of paged queries when there are filter conditions

This part is too varied to summarize in one representative case, so I did not test much here.
SELECT * FROM t WHERE *** ORDER BY id LIMIT m,n;
1. If the filter condition is itself highly selective and leaves only a small portion of the data, then rewriting the SQL matters little, because the filter already does the work efficiently.
2. If the filter condition is not very selective (the filtered data is still huge), we are effectively back to the case with no filter condition, and the result again depends on how the data is sorted, ascending or descending, and so on.
3. If the filter condition is not very selective (the filtered data is still huge), a very practical issue to consider is data distribution.
The distribution of the data also affects how efficiently the SQL executes (this is my experience in SQL Server; MySQL should not be very different).
4. When the query itself is complex, it is difficult to find one approach that is efficient in every case.

The more complex the situation, the harder it is to summarize a universal rule or method. Everything has to be judged in its concrete circumstances, and it is hard to draw a conclusion.
I will not analyze the filtered queries one by one here, but one thing is certain: detached from the actual scenario, there is no cure-all.

In addition, when querying the current page you can filter on the maximum value of the previous page, which locates the current page quickly. That certainly works, but it is a different approach and is not discussed in this article.
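
For reference only, that other approach could be sketched as follows. The variable @last_id is hypothetical and stands for the largest id seen on the previous page.

```sql
-- @last_id is a hypothetical variable holding the largest id of the previous page.
SET @last_id := 4900000;

SELECT *
FROM Test_table1
WHERE id > @last_id   -- seek straight to the start of the page through the primary key
ORDER BY id ASC
LIMIT 20;
```

This only moves from one page to the next; it cannot jump to an arbitrary page number, and it matches rows 4900001-4900020 only if the ids are contiguous.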

Summary

Paging queries get slower the further back you go. But for a B-tree index, "front" and "back" are relative, logical concepts; the performance difference comes from the B-tree index structure and the scan direction.
Once a filter is added, the situation becomes more complex. The same holds in SQL Server, where I ran the same tests; they are not repeated here.
In that case the sort order varies, the query conditions vary, and the data distribution varies, so it is hard to "optimize" with one specific method, and a superfluous rewrite can even have side effects.
Therefore, paging optimization must be analyzed against the specific scenario. There is not necessarily only one method, and conclusions divorced from the actual scenario are nonsense.
Only by understanding the ins and outs of a problem can we handle it with ease.
So my view on database "optimization" conclusions is that each problem needs its own analysis. It is a bad idea to sum up a set of rules (rule 1, 2, 3, 4, 5) for people to "apply"; I am still a novice myself, so I dare even less to hand down dogma.


