Query Optimization and Paging Algorithm Solutions for Massive Databases

Select * from table1 where name = 'zhangsan' and tID > 10000

and this statement:

Select * from table1 where tID > 10000 and name = 'zhangsan'

Some people do not know whether the two statements above execute with the same efficiency, because their condition order really is different. One might think that if tID is a clustered index, the second statement only needs to search the records with tID greater than 10000, while the first statement must first scan the whole table to find the rows where name = 'zhangsan' and then filter that result by the condition tID > 10000.

In fact, such a worry is unnecessary. SQL SERVER has a "query analysis optimizer" that evaluates the search conditions in the where clause and determines which indexes can narrow the search space of a table scan; in other words, it optimizes automatically.

Although the query optimizer can automatically optimize a query based on the where clause, it is still necessary for you to understand how it works, because sometimes the query optimizer will not produce the fast query you intend.

During the query analysis phase, the query optimizer examines each stage of the query and decides whether it is useful for limiting the amount of data that must be scanned. If a stage can be used as a search argument (SARG), it is said to be optimizable, and the required data can be obtained quickly through the index.
Definition of SARG: an operation used to restrict a search, usually referring to a specific match, a match within a range of values, or an AND connection of two or more conditions. Its form is:
Column name operator <constant or variable>
Or
<Constant or variable> operator column name
The column name can appear on one side of the operator, while the constant or variable appears on the other side of the operator. For example:
Name = 'zhangsan'

Price > 5000

5000 < Price

Name = 'zhangsan' and Price > 5000
If an expression cannot take the form of a SARG, it cannot limit the search range; that is, SQL SERVER has to check every row to determine whether it meets all the conditions in the where clause. So an index is of no use for an expression that does not meet the SARG format.

Having introduced SARG, let us summarize the experience of using SARG, including some conclusions drawn in practice that differ from what certain materials say:

1. Whether a Like condition is a SARG depends on the type of wildcard used.

For example, name like 'zhang%' is a SARG,

but name like '%zhang' is not a SARG.

The reason is that a wildcard % at the start of the string prevents the index from being used.

2. or may cause a full table scan.

Name = 'zhangsan' and price > 5000 conforms to the SARG format, but Name = 'zhangsan' or price > 5000 does not. Using or will cause a full table scan.

3. Non-SARG statements caused by negative operators and functions

The most typical examples of statements that do not meet the SARG format are those that include negative operators, such as NOT, !=, <>, !<, !>, not exists, not in, and not like, as well as statements that apply functions to columns. Below are several examples that do not meet the SARG format:

ABS(price) < 5000

Name like '%3'

Some expressions, such as:

Where price * 2 > 5000

will also be treated by SQL SERVER as a SARG, and SQL SERVER will convert the expression into:

Where price > 5000 / 2

However, we do not recommend writing it this way, because SQL SERVER cannot always guarantee that the conversion is completely equivalent to the original expression.
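
To stay within the SARG format yourself, the sketch below rewrites the two non-SARG predicates above by hand so that the column stands alone on one side of the operator. This is only an illustration using the hypothetical table1 and price names from the earlier examples; the rewritten forms should still be verified against your own data.

-- Non-SARG: the function wraps the column, so an index on price cannot be used
Select * from table1 where ABS(price) < 5000

-- Equivalent SARG form: the column stands alone on one side of each comparison
Select * from table1 where price > -5000 and price < 5000

-- Non-SARG: arithmetic is applied to the column
Select * from table1 where price * 2 > 5000

-- Equivalent SARG form: move the arithmetic to the constant side
Select * from table1 where price > 5000 / 2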

4. IN works the same as OR.

The statement:

Select * from table1 where tid in (2, 3)

and the statement:

Select * from table1 where tid = 2 or tid = 3

are the same: both cause a full table scan, and if there is an index on tid, that index also becomes ineffective.

5. Use NOT as little as possible
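
As a hedged sketch of what avoiding NOT can look like in practice, the pair below rewrites a negated comparison into a positive one that fits the SARG format; table1 and price are the hypothetical names used earlier.

-- The negation keeps this from being used as a search argument
Select * from table1 where not (price > 5000)

-- Equivalent positive form that fits the SARG format
Select * from table1 where price <= 5000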

6. The execution efficiency of exists and in is the same.

Many materials claim that exists is more efficient than in, and that not exists should be used wherever possible to replace not in. In fact, I tried it and found that the two have the same execution efficiency, with or without the leading not. Because subqueries are involved, this time we use the pubs sample database that ships with SQL SERVER, and before running we turn on SQL SERVER's statistics I/O display.

(1) select title, price from titles where title_id in (select title_id from sales where qty > 30)

The execution result of this statement is:

Table 'sales'. Scan count 18, logical reads 56, physical reads 0, read-ahead reads 0.

Table 'titles'. Scan count 1, logical reads 2, physical reads 0, read-ahead reads 0.

(2) select title, price from titles where exists (select * from sales where sales.title_id = titles.title_id and qty > 30)

The execution result of the second statement is:

Table 'sales'. Scan count 18, logical reads 56, physical reads 0, read-ahead reads 0.

Table 'titles'. Scan count 1, logical reads 2, physical reads 0, read-ahead reads 0.

We can see that the execution efficiency of exists and in is the same.

7. The charindex() function is no more efficient than LIKE with a leading wildcard %.

As mentioned earlier, putting the wildcard % at the front of a LIKE pattern causes a full table scan, so its execution efficiency is low. However, some documents claim that replacing LIKE with the charindex() function greatly improves the query speed. My experiments show that this explanation is also incorrect:

Select gid, title, fariqi, reader from tgongwen where charindex('criminal investigation detachment', reader) > 0 and fariqi > '2017-5-5'

Time: 7 seconds. Scan count 4, logical reads 7155, physical reads 0, read-ahead reads 0.

Select gid, title, fariqi, reader from tgongwen where reader like '%' + 'criminal investigation detachment' + '%' and fariqi > '2017-5-5'

Time: 7 seconds. Scan count 4, logical reads 7155, physical reads 0, read-ahead reads 0.

8. union is not necessarily more efficient than or.

We have already mentioned that using or in the where clause causes a full table scan. Generally speaking, the materials I have seen all recommend replacing or with union here, and the facts prove that this recommendation does apply in most cases.

Select gid, fariqi, neibuyonghu, reader, title from Tgongwen where fariqi = '2017-9-16' or gid > 9990000

Time: 68 seconds. Scan count 1, logical reads 404008, physical reads 283, read-ahead reads 392163.

Select gid, fariqi, neibuyonghu, reader, title from Tgongwen where fariqi = '2017-9-16'

Union

Select gid, fariqi, neibuyonghu, reader, title from Tgongwen where gid > 9990000

Time: 9 seconds. Scan count 8, logical reads 67489, physical reads 216, read-ahead reads 7499.

It can be seen that, in general, union is much more efficient than or.

However, after experimenting I found that if the query columns on both sides of the or are the same, union is actually much slower than or, even though the union here scans an index while the or scans the whole table.

Select gid, fariqi, neibuyonghu, reader, title from Tgongwen where fariqi = '2017-9-16' or fariqi = '2017-2-5'

Time: 6423 milliseconds. Scan count 2, logical reads 14726, physical reads 1, read-ahead reads 7176.

Select gid, fariqi, neibuyonghu, reader, title from Tgongwen where fariqi = '2017-9-16'

Union

Select gid, fariqi, neibuyonghu, reader, title from Tgongwen where fariqi = '2017-2-5'

Time: 11640 milliseconds. Scan count 8, logical reads 14806, physical reads 108, read-ahead reads 1144.

9. Field selection should follow the principle of "request only as much as you need", avoiding "select *"

Let's do a test:

Select top 10000 gid, fariqi, reader, title from tgongwen order by gid desc

Time: 4673 milliseconds

Select top 10000 gid, fariqi, title from tgongwen order by gid desc

Time: 1376 milliseconds

Select top 10000 gid, fariqi from tgongwen order by gid desc

Time: 80 ms

From this we can see that every field we leave out makes data retrieval correspondingly faster, and the amount of improvement depends on the size of the field that is dropped.

10. count(*) is no slower than count(field).

Some documents say that using * counts all columns and is therefore obviously less efficient than counting a single column name. This claim is actually unfounded. Let's take a look:

Select count(*) from Tgongwen

Time: 1500 milliseconds

Select count(gid) from Tgongwen

Time: 1483 milliseconds

Select count(fariqi) from Tgongwen

Time: 3140 milliseconds

Select count(title) from Tgongwen

Time: 52050 milliseconds

From the above we can see that count(*) and count(primary key) run at the same speed, while count(*) is faster than counting any field other than the primary key, and the longer the field, the slower the aggregation. I suspect that when count(*) is used, SQL SERVER automatically picks the smallest field to aggregate over. Of course, writing count(primary key) directly is even more straightforward.
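
One caveat worth keeping alongside the timing comparison (a minimal illustration, not from the original experiments): count(column) skips NULL values, so it is not always interchangeable with count(*). The title column is used here purely for illustration, on the assumption that it may contain NULLs.

-- count(*) counts every row; count(title) counts only rows where title is not NULL
Select count(*) as total_rows, count(title) as rows_with_title from Tgongwen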

11. order by is most efficient when it sorts on the clustered index column

Let's look (gid is the primary key, and fariqi is the clustered index column):

Select top 10000 gid, fariqi, reader, title from tgongwen

Time: 196 milliseconds. Scan count 1, logical reads 289, physical reads 1, read-ahead reads 1527.

Select top 10000 gid, fariqi, reader, title from tgongwen order by gid asc

Time: 4720 milliseconds. Scan count 1, logical reads 41956, physical reads 0, read-ahead reads 1287.

Select top 10000 gid, fariqi, reader, title from tgongwen order by gid desc

Time: 4736 milliseconds. Scan count 1, logical reads 55350, physical reads 10, read-ahead reads 775.

Select top 10000 gid, fariqi, reader, title from tgongwen order by fariqi asc

Time: 173 milliseconds. Scan count 1, logical reads 290, physical reads 0, read-ahead reads 0.

Select top 10000 gid, fariqi, reader, title from tgongwen order by fariqi desc

Time: 156 milliseconds. Scan count 1, logical reads 289, physical reads 0, read-ahead reads 0.

From the above we can see that the speed and number of logical reads of the unsorted query are equivalent to those of the query that orders by the clustered index column, but both are much faster than queries that order by a non-clustered index column.

At the same time, when sorting on a given field, ascending and descending order are basically the same speed.

12. Efficient TOP

In fact, when querying and extracting data from very large data sets, the biggest factor in database response time is not the data lookup but the physical I/O operations. For example:

Select top 10 * from (

Select top 10000 gid, fariqi, title from tgongwen

where neibuyonghu = 'Office'

order by gid desc) as a

order by gid asc

Theoretically, the execution time of the whole statement should be longer than that of the inner statement alone, but the opposite is true, because the inner statement returns 10,000 records while the whole statement returns only 10. This shows that the biggest factor affecting database response time is physical I/O, and one of the most effective ways to limit physical I/O here is to use the TOP keyword. TOP is a system-optimized keyword in SQL SERVER for extracting the first N rows or the first N percent of rows of a result. Through practical application, the author has found TOP to be indeed very useful and efficient. However, this keyword does not exist in ORACLE, another large database, which is something of a pity, even though other methods (such as rownumber) can be used in ORACLE instead. We will keep using the TOP keyword in the later discussion of paged display stored procedures for tens of millions of records.
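
Since the text mentions rownumber as ORACLE's alternative, it is worth noting that SQL SERVER 2005 and later also provide a ROW_NUMBER() ranking function. The following is only a hedged sketch, not part of the original article, assuming gid is a unique sort key on the Tgongwen table used above and using an arbitrary page size and page number:

-- Rows 41-60 (page 3 at 20 rows per page), newest gid first
Select gid, fariqi, title
from (
    Select gid, fariqi, title, ROW_NUMBER() over (order by gid desc) as rn
    from Tgongwen
) as t
where t.rn > 40 and t.rn <= 60
order by t.rn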

So far, we have discussed how to quickly query the data you need from a large database. Of course, the methods we introduced are all "soft" methods. In practice, we also need to consider various "hard" factors, such as network performance, server performance, and operating system performance, or even network adapters and switches.

III. A general paged display stored procedure for small data volumes and massive data

Building a web application requires paged display, and this problem is very common in database processing. The typical data paging method is ADO recordset paging, that is, using the paging functionality provided by ADO (implemented with a cursor). However, this paging method is only suitable for small data volumes, because the cursor itself has disadvantages: a cursor is stored in memory and is very memory-intensive, and once a cursor is established, the related records are locked until the cursor is released. A cursor provides a way to scan a particular result set row by row; it is generally used to traverse the data row by row and perform different operations depending on the data encountered. For cursors defined over multiple tables or large tables (large data sets), such a loop easily makes the program enter a long wait or even crash.

More importantly, for a very large data model, loading the entire data source on every paged retrieval is a waste of resources. The now popular paging approach is to retrieve only a block of data the size of one page, rather than all of the data, and then step through to the current row.

The earliest method of extracting data by page size and page number is probably the "Russian stored procedure". It uses a cursor, and because of the cursor's limitations this method has not been widely accepted.

Later, someone on the Internet rewrote this stored procedure. The following stored procedure is a paging stored procedure written for our office automation example:

create procedure pagination1
(@pagesize int,    -- page size, for example 20 records per page
@pageindex int     -- current page number
)
as
set nocount on
begin
declare @indextable table (id int identity(1,1), nid int)    -- define a table variable
declare @PageLowerBound int    -- lower bound of this page
declare @PageUpperBound int    -- upper bound of this page
set @PageLowerBound = (@pageindex - 1) * @pagesize
set @PageUpperBound = @PageLowerBound + @pagesize
set rowcount @PageUpperBound
insert into @indextable (nid) select gid from TGongwen where fariqi > dateadd(day, -365, getdate()) order by fariqi desc
select O.gid, O.mid, O.title, O.fadanwei, O.fariqi from TGongwen O, @indextable t where O.gid = t.nid
and t.id > @PageLowerBound and t.id <= @PageUpperBound order by t.id
end
set nocount off

The above stored procedure makes use of one of SQL SERVER's newer features, the table variable.
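
As a quick usage sketch (the page size and page number below are arbitrary values chosen purely for illustration), the procedure above can be called like this:

-- Fetch page 3, with 20 records per page
exec pagination1 @pagesize = 20, @pageindex = 3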
