In MySQL paging optimization, when does the "inner join" paging optimization take effect?


 

Source: http://www.cnblogs.com/wy123/p/7003157.html

 

I recently came across a MySQL paging optimization test case. It did not describe the test scenario very specifically, but simply presented a classic solution.
In reality, many situations are not fixed; common practices or rules can only be summarized after considering many scenarios.
At the same time, when a rewrite does achieve an optimization, we should investigate the cause; likewise, if changing the scenario makes the optimization ineffective, we should investigate that too.
I was skeptical about this scenario, tested it myself, found some problems, and confirmed some expectations.
This article does a simple analysis of MySQL paging optimization, starting from the simplest case.

In addition, the test environment in this article is a minimally configured ECS instance, so the server hardware is quite limited; but the different statements (writings) are still compared on "equal" footing.

 

MySQL classic paging "optimization" practices

In MySQL paging, there is a classic problem: the further back the page, the slower the query (this depends on the table's index type; for B-tree indexes, the same holds in SQL Server):
select * from t order by id limit m, n;
That is, as m increases, querying the same amount of data becomes slower.
Facing this problem, a classic practice arose, written as follows (or as a variant of it):
first locate the ids in the paging range separately, then join back to the base table, and finally fetch the required rows.
select * from t
inner join (select id from t order by id limit m, n) t1 on t1.id = t.id

Is this practice always effective? Under what circumstances does the rewrite help? And are there situations where the rewrite is ineffective, or even slower?

 

 

At the same time, most queries have filtering conditions.
If filtering conditions exist,
the SQL statement becomes: select * from t where *** order by id limit m, n
If this is the case, the rewrite looks like:
select * from t
inner join (select id from t where *** order by id limit m, n) t1 on t1.id = t.id
In this case, can the rewritten SQL statement still be optimized?

 

 

Test Environment setup

  The test data is relatively simple: rows are written in a loop by a stored procedure into an InnoDB test table.

  

Note that the binlog sync mode should be changed to sync_binlog = 0; otherwise, with the default flushing behavior, generating millions of rows would take an estimated day or more. This is related to the log flush mode and is only mentioned in passing here.
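The original post does not include the data-generation script. Below is a minimal sketch of such a stored procedure; the table name test_table1 matches the later examples, but the column list, row width, and the 5-million row count are assumptions:

```sql
-- Hypothetical schema matching the later examples (id is the primary key,
-- i.e. the clustered index in InnoDB).
CREATE TABLE test_table1 (
    id  INT NOT NULL AUTO_INCREMENT PRIMARY KEY,
    pad VARCHAR(100) NOT NULL
) ENGINE = InnoDB;

-- Loop-insert rows; run SET GLOBAL sync_binlog = 0 first, otherwise the
-- per-commit binlog flush makes bulk generation extremely slow.
DELIMITER //
CREATE PROCEDURE fill_test_table1(IN total INT)
BEGIN
    DECLARE i INT DEFAULT 0;
    WHILE i < total DO
        INSERT INTO test_table1(pad) VALUES (REPEAT('x', 100));
        SET i = i + 1;
    END WHILE;
END //
DELIMITER ;

CALL fill_test_table1(5000000);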

 

 

Why paging queries need optimization

First, let's look at this classic problem: in paging queries, the further back the page, the slower the query.

Test 1: querying rows 1-20 takes 0.01 seconds

  

This also queries 20 rows of data, but rows relatively "toward the end" of the table: for example, querying rows 4,900,001-4,900,020 takes 1.97 seconds.

  

As you can see, with the query conditions unchanged, the further back the target page is, the lower the query efficiency. It can be simply understood that although 20 rows are fetched either way, the closer the data is to the end, the higher the query cost.
As to why the latter is less efficient, we will analyze it later.

 The test environment is CentOS 7 with MySQL 5.7, and the test table contains 5 million rows.

  

 

 

Reproducing the classic paging "optimization": with no filtering condition and the clustered index as the sorting column, the rewrite brings no improvement

Here we compare the performance of the two statements when the clustered index column is used as the sorting column:
select * from t order by id limit m, n;
select * from t
inner join (select id from t order by id limit m, n) t1 on t1.id = t.id

 

Method 1:

select * from test_table1 order by id asc limit 4900000,20; execution time: 8.31 seconds

Method 2, the rewritten form:

select t1.* from test_table1 t1
inner join (select id from test_table1 order by id limit 4900000,20) t2 on t1.id = t2.id; execution time: 8.43 seconds

Clearly, performance is not improved by the classic rewrite; it is even slightly slower.
Actual testing shows no meaningful difference in performance between the two.

Personally, I think such conclusions have to be tested: where efficiency can be improved there is a reason, and where it cannot be improved there is a reason too.

So why doesn't the rewrite improve performance as the legend says?
Why does the rewrite fail to improve performance here?
And when it does improve performance, what is the principle behind it?

First, look at the structure of the test table. There is an index on the sorting column, which is fine; the key point is that the index on the sorting column is the primary key (clustered index).

  

Why can't the "optimized" SQL statement achieve the goal of optimization when the sorting column is the clustered index?

When the sorting column is the clustered index column, both statements scan the table sequentially to find the qualifying rows.
Although the latter runs a subquery first and then uses its result to drive the base table,
the subquery does not change the "sequentially scan the table for qualifying rows" approach, so in this case the rewrite is simply superfluous.

Refer to the following two execution plans. The line of the first plan is essentially the same as the id = 2 line of the rewritten SQL's plan.

  

  

 

When there is no filtering condition and the sorting column is the clustered index, the so-called paging-query optimization is just a superfluous addition.

Both of the above methods are very slow. So what should we do if we want to query that data?
To see why they are slow, we must first understand the balanced structure of the B-tree index. In my own rough understanding:
when the queried data is "toward the end", it is simply far from the starting point of the scan over the B-tree index.
In fact, on a balanced tree there is no absolute "front" or "end" of the data; "front" and "end" are relative, determined by the scan direction:
data at the "end" when scanned in one direction is at the "front" when scanned in the other; neither is absolute.

 

The following two figures are rough representations of the B-tree index structure. With the position of the target data fixed, the so-called "end" is relative to a left-to-right scan;

if you scan from right to left, that same "end" data is actually at the "front".

As long as you scan from the side the data is close to, this part of the data can be found efficiently. MySQL should have practices similar to SQL Server's forward and backward index scans.


If backward scanning is used for data near the end, this part of the data can be found quickly and then re-sorted (asc); the result set should be the same.
First, look at the result: exactly the same rows as the queries above, but taking only 0.07 seconds, while the previous two methods both exceeded 8 seconds; the efficiency differs by over a hundred times.

  

As for the reason, the explanation above should make it understandable. The SQL statement is attached here.
If you frequently query data "toward the end", such as rows with larger id values, or newer data along a time dimension, you can use a reverse (descending) index scan to achieve efficient paging.

select * from (select * from test_table1 order by id desc limit 99980,20) t order by id;

 

 

 When no filtering condition exists and the sorting column is not the clustered index, the rewrite does improve performance

Here the test_table1 test table is changed as follows:
1. add an id_2 column;
2. create a unique index on this column;
3. fill the column with the corresponding primary key id values.
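The exact DDL is not shown in the original. A minimal sketch of the three steps above might look like this (the index name uk_id_2 is an assumption; the column is filled before the unique index is built to keep the update cheap):

```sql
-- 1. add the id_2 column (nullable while it is being filled)
ALTER TABLE test_table1 ADD COLUMN id_2 INT NULL;

-- 3. fill it with the corresponding primary key values
UPDATE test_table1 SET id_2 = id;

-- 2. create a unique index on the new column
ALTER TABLE test_table1 ADD UNIQUE INDEX uk_id_2 (id_2);
```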

  

In the tests above, sorting was by the primary key index (clustered index). Now we sort by a non-clustered index, that is, the newly added column id_2, and test the two paging methods mentioned at the beginning.

First, the first writing method:

select * from test_table1 order by id_2 asc limit 4900000,20; the execution time is a little over 1 minute; call it 60 seconds for now

  

Method 2:

select t1.* from test_table1 t1
inner join (select id from test_table1 order by id_2 limit 4900000,20) t2 on t1.id = t2.id; execution time: 1.67 seconds

  

In this case, when the sorting column is a non-clustered index column, the latter method greatly improves efficiency: almost a 40-fold improvement.
So why?
First, look at the execution plan of the first method. It can be simply understood as: the SQL statement performs a full table scan, re-sorts the rows by id_2, and then takes the 20 target rows.
A full table scan is a very time-consuming process, and the sort is also very costly, so performance is very low.

Next, look at the latter's execution plan. First, the subquery scans the rows in order along the index on id_2, and the qualifying primary key ids are then used to look up the rows in the base table.
This way there is no need to fetch a large amount of data and re-sort it (Using filesort).
If you are familiar with SQL Server execution plans, the latter should also avoid frequent lookups back to the base table (key lookup, or bookmark lookup, in SQL Server).
The process of driving the outer table to fetch the 20 qualifying rows can be considered a batched, one-time operation.
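The plan screenshots are not reproduced here. To inspect this yourself, EXPLAIN can be run on both forms; the comments below describe what this article's test observed, though the optimizer's choices may vary by version and statistics:

```sql
EXPLAIN SELECT * FROM test_table1 ORDER BY id_2 ASC LIMIT 4900000,20;
-- observed here: a full scan with "Using filesort" in the Extra column

EXPLAIN SELECT t1.* FROM test_table1 t1
INNER JOIN (SELECT id FROM test_table1 ORDER BY id_2 LIMIT 4900000,20) t2
        ON t1.id = t2.id;
-- observed here: the id = 2 (derived table) line scans the id_2 index in
-- order, so no filesort is needed; the outer join is primary key lookups
```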

  

In fact, only in this situation, that is, when the sorting column is not the clustered index column, can the rewritten SQL statement improve paging efficiency.
Even so, the paging statement "optimized" in this way is still significantly slower than the reverse-scan paging described earlier.
As shown above, for the same returned data, the following query takes 0.07 seconds, more than 20 times faster than the 1.67 seconds.

select * from (select * from test_table1 order by id desc limit 99980,20) t order by id;

Another question worth raising: if paging queries are regularly done in a certain order, why not create the clustered index on that column?
For example, declare an auto-increment id, or time plus other columns guaranteeing uniqueness, as the primary key; MySQL automatically creates the clustered index on the primary key.
Then, with the clustered index, "front" and "end" are just relative logical concepts; to get data at the "end", that is, newer data, use the statement above.

 

Optimization of paging queries when filtering conditions exist

This part is too complex to reduce to a single representative case, so fewer tests were done here; it can only be analyzed qualitatively.
select * from t where *** order by id limit m, n
1. If the filtering condition itself is very efficient and only a small portion of data remains after filtering, rewriting the SQL matters little, because the filtering condition itself already achieves efficient filtering.
2. If the filtering condition does not help much (the data volume after filtering is still huge), the case resembles having no filtering condition at all; it then also depends on how the sort is done, whether the scan is forward or backward, and so on.
3. If the filtering condition does not help much (the data volume after filtering is still huge), there is another very practical issue to consider: data distribution.
Data distribution also affects the execution efficiency of SQL statements (experience from SQL Server; MySQL should differ only slightly).
4. When the query itself is complex, it is hard to expect it to become efficient through some simple rewrite.

The more complex the situation, the harder it is to summarize a general rule or method; everything must be judged against the specific circumstances, and a universal conclusion is difficult to draw.
The query conditions will not be analyzed one by one here, but it is certain that, divorced from the actual scenario, there is no fixed solution.

 

In addition, when querying the current page, you can use the maximum value of the previous page as a filter and then find the current page's data very quickly. That works fine, but it is a different practice, not discussed in this article.
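For completeness, a minimal sketch of that alternative (commonly called keyset paging); the boundary value 4900000 is assumed to be the last id seen on the previous page:

```sql
-- remember the max id of the previous page, e.g. 4900000, then:
select *
from test_table1
where id > 4900000
order by id
limit 20;
```

Because the filter seeks directly into the clustered index, the cost no longer grows with the page offset; the trade-off is that you cannot jump to an arbitrary page number.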

 

Summary

In paging queries, the further back the page, the slower the query. In fact, for B-tree indexes, "front" and "end" are logically relative concepts, and the performance difference is rooted in the B-tree index structure and the scan direction.
If filtering conditions are added, the situation becomes more complex. The principle of this problem is the same in SQL Server; it was originally tested in SQL Server and will not be repeated here.
Since the sorting column is not fixed, the query conditions are not fixed, and the data distribution is not fixed, it is difficult to achieve "optimization" with any single method, and applying one blindly can even have side effects.
Therefore, paging optimization must be analyzed against the specific scenario; there is not necessarily only one method, and conclusions divorced from actual scenarios are empty talk.
Only by figuring out the ins and outs of the problem can it be solved with ease.
So my personal view on such "optimizations" is that each problem requires concrete analysis. Handing people a fixed set of rules (1, 2, 3, 4, 5) to "apply" mechanically is a real taboo; being still a novice myself, I dare not sum up dogmas.

  

 
