Optimizing MySQL LIMIT Paging for Large Data Volumes


MySQL optimization is very important, and one of the most commonly used and most urgently needed optimizations is LIMIT. MySQL's LIMIT makes paging very convenient, but when the data volume is large, its performance drops sharply.

The following two statements both retrieve 10 records:

Select * from yanxue8_visit limit 10000, 10

and

Select * from yanxue8_visit limit 0, 10

but their performance is not in the same order of magnitude.

There are many LIMIT optimization guidelines online, translated from the MySQL manual; they are correct but not very practical. Today I found an article on LIMIT optimization that is quite good.

Instead of using limit directly, it first obtains the id at the offset and then uses limit size to fetch the data. According to that author's data, this is much faster than using limit directly. Here I test both cases with my own data. (Test environment: Win2003 + P4 dual-core (3 GHz) + 4 GB RAM, MySQL 5.0.19)
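For reference, here is a minimal sketch of what such a test table might look like. The article only names the vid column, so the remaining columns and types are assumptions for illustration:

create table yanxue8_visit (
    vid int unsigned not null auto_increment,  -- auto-increment primary key used for paging
    visit_time datetime not null,              -- assumed payload column
    note varchar(255),                         -- assumed payload column
    primary key (vid)
);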

1. When the offset value is small.

Select * from yanxue8_visit limit 10, 10

Run multiple times; the time stays between 0.0004 and 0.0005 seconds.

Select * From yanxue8_visit Where vid >= (
    Select vid From yanxue8_visit Order By vid limit 10, 1
) limit 10

Run multiple times; the time stays between 0.0005 and 0.0006 seconds, mostly 0.0006.

Conclusion: when the offset is small, using limit directly is better. This is clearly due to the overhead of the subquery.

2. When the offset value is large.

 

Select * from yanxue8_visit limit 10000, 10

Run multiple times; the time stays around 0.0187 seconds.

Select * From yanxue8_visit Where vid >= (
    Select vid From yanxue8_visit Order By vid limit 10000, 1
) limit 10

Run multiple times; the time stays around 0.0061 seconds, only about one third of the former. It is predictable that the larger the offset, the greater the latter's advantage.

Going forward, I will pay attention to correcting my own limit statements and to optimizing MySQL.
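The pattern generalizes to any table paged by an indexed, monotonically increasing key. A sketch with placeholder names (my_table and pk are assumptions, not from the tests above):

-- locate the key value at the target offset via the index first,
-- then range-scan from there instead of discarding offset rows
select * from my_table
where pk >= (
    select pk from my_table order by pk limit 10000, 1
)
order by pk
limit 10;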

 

The following concerns a table collect with four fields (id, title, info, vtype): title is fixed-length, info is text, id is auto-increment, and vtype is a tinyint with an index on it. This is a simple model of a basic news system. Now fill it with data: 100,000 news articles.
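A sketch of the collect schema as described; the exact column types are assumptions (the text only says title is fixed-length, info is text, id auto-increments, and vtype is an indexed tinyint):

create table collect (
    id int unsigned not null auto_increment,
    title char(60) not null,   -- fixed-length title (width assumed)
    info text,                 -- article body
    vtype tinyint not null,    -- category type
    primary key (id),
    key idx_vtype (vtype)
);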

In the end collect holds 100,000 records, and the table occupies 1.6 GB of disk space. OK, now look at the following SQL statement:

Select id, title from collect limit 1000, 10; — very fast, basically done in 0.01 seconds. Now look at the following:

Select id, title from collect limit 90000, 10; — paging starting from record 90,000. The result?

It takes 8-9 seconds to complete. My god, what went wrong?! In fact, the answer to optimizing this can be found online. Look at the following statement:

Select id from collect order by id limit 90000, 10; — very fast, done in about 0.04 seconds.
Why? Because using the id primary-key index is faster. The fix found online is:

Select id, title from collect where id >= (
    select id from collect order by id limit 90000, 1
) limit 10;

This is the result of indexing on id. But make the problem just a little more complicated and it falls apart. Look at the following statement:

Select id from collect where vtype = 1 order by id limit 90000,10;

It took 8 to 9 seconds!

At this point, I believe many people will, like me, feel close to collapse! Isn't vtype indexed? How can it be slow? Yes, vtype is indexed, and a direct select id from collect where vtype = 1 limit 1000,10; is very fast, basically 0.05 seconds. But raise the offset 90-fold, starting from 90,000, and that works out to roughly 0.05 * 90 = 4.5 seconds, the same order of magnitude as the measured 8-9 seconds. From here, some people proposed the idea of table sharding, the same idea behind the discuz forums. The idea is as follows:

Create an index table t (id, title, vtype) with fixed-length rows, do the paging against it, and then fetch the full rows from collect by id. Is it feasible? An experiment will tell.
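One way to build and fill such an index table, as a sketch (the article does not give the exact DDL; the types mirror the assumed collect schema above):

create table t (
    id int unsigned not null,
    title char(60) not null,   -- fixed-length, as the text requires
    vtype tinyint not null,
    primary key (id),
    key idx_vtype (vtype)
);

-- copy the paging columns over from collect in one pass
insert into t (id, title, vtype)
select id, title, vtype from collect;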

100,000 records go into t (id, title, vtype); the table size is about 20 MB. Using

Select id from t where vtype = 1 order by id limit 90000,10;

Very fast: it basically completes in 0.1-0.2 seconds. Why? I suspect it is because collect has so much data that paging over it has a long way to travel; limit is entirely tied to the size of the data table. In fact this is still a full table scan; it is only fast because the data volume is small, just 100,000 rows. OK, time for a crazy experiment: grow the table to 1 million records and test performance.

With 10 times the data, the t table immediately exceeds 200 MB, with fixed-length rows. The same query as before now takes 0.1-0.2 seconds! So sharded-table performance is fine? Wrong! Our limit is still 90,000, which is why it is fast. Use a big offset, starting at 900,000:

Select id from t where vtype = 1 order by id limit 900000, 10; — look at the result: the time is 1-2 seconds!

Why? Even with a sharded table the time is still this long; very depressing! Some say fixed row length improves limit performance. At first I thought so too: since every record's length is fixed, MySQL should be able to calculate the position of record 900,000 directly, right? But we overestimated MySQL's intelligence; it is not a commercial database, and it turns out fixed versus variable length has little impact on limit. No wonder people say discuz becomes very slow once it reaches a million records. I now believe that, and it comes down to database design!

Can't MySQL break the one-million barrier? Does paging really hit its limit at a million records?

 

The answer is: NO!!! The reason you cannot break a million is that you do not know how to design for MySQL. Next comes the non-sharding method: a crazy test of fast paging with a single table holding a million records in a 10 GB database!

 

Now our test returns to the collect table. The test conclusion so far: with 300,000 records, the table sharding method is feasible; beyond 300,000 it slows to the point where you cannot bear it! Of course, sharding combined with the method below would be absolutely perfect. But the method below solves the problem perfectly on its own, with no sharding at all!

 

The answer: composite indexes! Once, while designing a MySQL index, I noticed by accident that the index name can be anything and that several fields can be included. What is that good for? The earlier select id from collect order by id limit 90000, 10; is fast precisely because it walks the index, but add a where clause and the index is no longer used. On a let's-try-it impulse I added a search (vtype, id) index. Then tested:

Select id from collect where vtype = 1 limit 90000,10; — very fast! Finished in 0.04 seconds!

Test again: select id, title from collect where vtype = 1 limit 90000,10;
Very regrettably, 8-9 seconds; it did not use the search index!

Test again with search (id, vtype), still the same select id statement: also very regrettably, 0.5 seconds.

To sum up: if you have a where condition and want limit to use an index, you must design an index that puts the where column first and the primary key used by limit second, and you can select only the primary key!
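As a sketch, the index and query that follow this rule, using the index name search from the text:

-- where column first, the primary key used by limit second
alter table collect add index search (vtype, id);

-- fast: the query is satisfied entirely from the composite index
-- (a covering index), so the table rows are never touched
select id from collect where vtype = 1 limit 90000, 10;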

This solves the paging problem perfectly: if you can return the ids quickly, there is hope of optimizing limit. By this logic, a million-row limit should finish in hundredths of a second. It seems MySQL statement optimization and indexing are very important!

Now, back to the original question: how do we quickly apply this finding in development? With a composite query, my lightweight framework becomes useless; I would have to write the paging string myself, and what a hassle that is. Let's look at another example, and the idea emerges:
Select * from collect where id in (9000, 12, 50, 7000); — the query completes in effectively 0 seconds!

My god, MySQL's indexes work just as well for in statements! It seems the claim online that in cannot use indexes is wrong!

With this conclusion, we can easily apply it to the lightweight framework:
The Code is as follows:



$db = dblink();
$db->pagesize = 20;
$sql = "select id from collect where vtype=$vtype";
$db->execute($sql);
$strpage = $db->strpage();
// Save the paging string in a temporary variable so it is easy to output later
$strid = '';
while ($rs = $db->fetch_array()) {
    $strid .= $rs['id'] . ',';
}
$strid = substr($strid, 0, strlen($strid) - 1);
// Construct the id string
$db->pagesize = 0;
// Critical: reset pagesize without destroying the class, so the database
// is connected only once and never needs to be reopened
$db->execute("select id, title, url, sTime, gTime, vtype, tag from collect where id in ($strid)");
?>
<?php while ($rs = $db->fetch_array()): ?>
<tr>
<td><?php echo $rs['id']; ?></td>
<td><?php echo $rs['url']; ?></td>
<td><?php echo $rs['sTime']; ?></td>
<td><?php echo $rs['gTime']; ?></td>
<td><?php echo $rs['vtype']; ?></td>
<td><a href="?act=show&id=<?php echo $rs['id']; ?>" target="_blank"><?php echo $rs['title']; ?></a></td>
<td><?php echo $rs['tag']; ?></td>
</tr>
<?php endwhile; ?>
</table>
<?php
echo $strpage;
?>

Through this simple transformation, the idea is in fact very simple: 1) optimize the index to find just the ids, and splice them into a comma-separated string; 2) run a second query to fetch the full results by id.
With one small index and a few changes, MySQL can support efficient paging over millions, even tens of millions, of records!
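In plain SQL, the two steps look like this (a sketch; the id list in step 2 stands in for whatever ids step 1 actually returned):

-- step 1: fetch only the ids for the current page via the covering index
select id from collect where vtype = 1 limit 90000, 20;

-- step 2: fetch the full rows for exactly those ids
-- (these values are placeholders for the ids from step 1)
select id, title, url, sTime, gTime, vtype, tag
from collect
where id in (90001, 90005, 90009);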
Through the example here, I have also reflected on one point: for large systems, PHP must not use frameworks, especially ones where you cannot even see the SQL statements! My lightweight framework almost collapsed at first; it is only suitable for the rapid development of small applications. For ERP, OA, and large websites, the data layer, and even the logic layer, cannot use frameworks. If programmers lose control over SQL statements, the risk of the project grows exponentially! This is especially true with MySQL: it takes a professional DBA to get its best performance. A single index can make a performance difference of a thousandfold!


Performance Optimization:
Based on the high performance of LIMIT in MySQL 5.0, I have a new understanding of data paging.

1.
Select * From cyclopedia Where ID >= (
    Select Max(ID) From (
        Select ID From cyclopedia Order By ID limit 90001
    ) As tmp
) limit 100;

2.
Select * From cyclopedia Where ID >= (
    Select Max(ID) From (
        Select ID From cyclopedia Order By ID limit 90000, 1
    ) As tmp
) limit 100;

Both fetch the 100 records after the 90,000th; is the first statement faster, or the second?
The first statement fetches the first 90,001 records, takes the largest ID among them as the starting marker, and then uses it to locate the next 100 records quickly.
The second statement fetches only the single record after the first 90,000, and uses its ID as the starting marker to locate the next 100 records.
First statement: 100 rows in set (0.23 sec)
Second statement: 100 rows in set (0.19 sec)

Clearly the second statement wins. It seems limit does not, as I previously imagined, do a full table scan and return offset + length records; so limit performs far better than MS-SQL's Top.

In fact, the second statement can be simplified further:

Select * From cyclopedia Where ID >= (
    Select ID From cyclopedia limit 90000, 1
) limit 100;

It uses the ID of the record right after the 90,000th directly, with no need for the Max operation. In theory this is more efficient, but in practice the difference is almost invisible, because the inner positioning query returns a single record and Max gets its result with essentially no work. The simplified form is simply clearer, sparing us a redundant step.

However, since MySQL's limit can directly control the position from which records are fetched, why not simply use Select * From cyclopedia limit 90000, 1? Wouldn't that be more concise?

I thought so too; trying it showed otherwise. The result: 1 row in set (8.88 sec). How scary is that? It reminds me of the "high score" I got in 4.1 yesterday. Select * is best not used casually: follow the principle of selecting only what you need, because the more fields you select and the larger their data, the slower the query. Either of the two paging methods above beats this single-statement version by far; although they seem to make more queries, they trade a small cost for efficient performance, which is very much worth it.

The first scheme can also be used with MS-SQL, and may be the best choice there, because locating the starting segment by the primary-key ID is always fastest:

Select Top 100 * From cyclopedia Where ID >= (
    Select Max(ID) From (
        Select Top 90001 ID From cyclopedia Order By ID
    ) As tmp
)

But whether it is implemented in a stored procedure or directly in code, the bottleneck remains: MS-SQL's Top always has to return the first N records. With modest data volumes this is barely noticeable, but at hundreds of thousands or millions of rows, efficiency will certainly suffer. By contrast, MySQL's limit has far more advantages. Execute:

Select ID From cyclopedia limit 90000
Select ID From cyclopedia limit 90000, 1

The results are as follows:
90000 rows in set (0.36 sec)
1 row in set (0.06 sec)

MS-SQL, by contrast, can only use Select Top 90000 ID From cyclopedia, which takes 390 ms to execute; even for the same operation it cannot match MySQL's 360 ms.
