Massive database query optimization and paging algorithm solution 2

(1) select title, price from titles where title_id in (select title_id from sales where qty > 30)

The execution result of this statement is:

Table 'sales'. Scan count 18, logical reads 56, physical reads 0, read-ahead reads 0.

Table 'titles'. Scan count 1, logical reads 2, physical reads 0, read-ahead reads 0.

(2) select title, price from titles where exists (select * from sales where sales.title_id = titles.title_id and qty > 30)

The execution result of the second statement is:

Table 'sales'. Scan count 18, logical reads 56, physical reads 0, read-ahead reads 0.

Table 'titles'. Scan count 1, logical reads 2, physical reads 0, read-ahead reads 0.

We can see that the execution efficiency of exists and in is the same.
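This equivalence is easy to check row-for-row (though not, of course, the I/O statistics). The sketch below is purely illustrative: it uses Python's built-in sqlite3 module with a tiny made-up schema, not the real pubs tables, to confirm that the in and exists forms return the same result set.

```python
import sqlite3

# Illustrative only: a made-up miniature of the titles/sales tables,
# used to confirm that the IN and EXISTS queries select the same rows.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE titles (title_id INTEGER PRIMARY KEY, title TEXT, price REAL);
CREATE TABLE sales  (title_id INTEGER, qty INTEGER);
INSERT INTO titles VALUES (1, 'A', 9.99), (2, 'B', 19.99), (3, 'C', 4.99);
INSERT INTO sales  VALUES (1, 50), (2, 10), (3, 75);
""")

in_rows = conn.execute(
    "select title, price from titles "
    "where title_id in (select title_id from sales where qty > 30)").fetchall()

exists_rows = conn.execute(
    "select title, price from titles where exists "
    "(select * from sales where sales.title_id = titles.title_id and qty > 30)"
).fetchall()

# Both forms pick out exactly the titles with a qualifying sale (qty > 30)
assert sorted(in_rows) == sorted(exists_rows) == [('A', 9.99), ('C', 4.99)]
```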

7. The charindex() function is no more efficient than LIKE with the % wildcard.

As mentioned above, a LIKE pattern with a leading % wildcard causes a full table scan, so its execution efficiency is low. However, some documents claim that replacing LIKE with the charindex() function greatly improves speed. My experiments show that this claim is also incorrect:

Select gid, title, fariqi, reader from tgongwen where charindex('criminal investigation detachment', reader) > 0 and fariqi > '2017-5-5'

Time: 7 seconds. Scan count 4, logical reads 7155, physical reads 0, read-ahead reads 0.

Select gid, title, fariqi, reader from tgongwen where reader like '%' + 'criminal investigation detachment' + '%' and fariqi > '2017-5-5'

Time: 7 seconds. Scan count 4, logical reads 7155, physical reads 0, read-ahead reads 0.
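The two predicates are also interchangeable row-for-row. The following sketch is illustrative only (Python's sqlite3, with SQLite's instr() standing in for SQL SERVER's charindex(), and invented sample rows); it says nothing about timing, only that both forms select the same records:

```python
import sqlite3

# Illustrative only: instr(reader, s) > 0 and reader LIKE '%s%' pick the
# same rows; neither predicate can use an index seek on `reader`.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE tgongwen (gid INTEGER, reader TEXT)")
conn.executemany("INSERT INTO tgongwen VALUES (?, ?)",
                 [(1, "criminal investigation detachment"),
                  (2, "patrol unit"),
                  (3, "city criminal investigation detachment")])

by_instr = conn.execute(
    "select gid from tgongwen "
    "where instr(reader, 'criminal investigation detachment') > 0").fetchall()
by_like = conn.execute(
    "select gid from tgongwen "
    "where reader like '%criminal investigation detachment%'").fetchall()

assert by_instr == by_like == [(1,), (3,)]
```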

8. union is not necessarily more efficient than or.

We have already mentioned that using or in the where clause causes a full table scan, and virtually every reference I have seen recommends replacing or with union here. Experiments show that this advice holds in most cases.

Select gid, fariqi, neibuyonghu, reader, title from Tgongwen where fariqi = '2017-9-16' or gid > 9990000

Time: 68 seconds. Scan count 1, logical reads 404008, physical reads 283, read-ahead reads 392163.

Select gid, fariqi, neibuyonghu, reader, title from Tgongwen where fariqi = '2017-9-16'

Union

Select gid, fariqi, neibuyonghu, reader, title from Tgongwen where gid > 9990000

Time: 9 seconds. Scan count 8, logical reads 67489, physical reads 216, read-ahead reads 7499.

In general, union is much more efficient than or.

However, experiments show that if the columns on both sides of the or are the same, union executes much more slowly than or, even though union scans the index here while or scans the entire table.

Select gid, fariqi, neibuyonghu, reader, title from Tgongwen where fariqi = '2017-9-16' or fariqi = '2017-2-5'

Time: 6423 milliseconds. Scan count 2, logical reads 14726, physical reads 1, read-ahead reads 7176.

Select gid, fariqi, neibuyonghu, reader, title from Tgongwen where fariqi = '2017-9-16'

Union

Select gid, fariqi, neibuyonghu, reader, title from Tgongwen where fariqi = '2017-2-5'

Time: 11640 milliseconds. Scan count 8, logical reads 14806, physical reads 108, read-ahead reads 1144.
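Whatever the timings, the two forms must return the same row set, because union eliminates duplicates just as or does. A purely illustrative sketch with Python's sqlite3 and invented rows:

```python
import sqlite3

# Illustrative only: OR and UNION over the same column yield identical rows.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE tgongwen (gid INTEGER, fariqi TEXT)")
conn.executemany("INSERT INTO tgongwen VALUES (?, ?)",
                 [(1, "2017-9-16"), (2, "2017-2-5"), (3, "2017-5-5")])

or_rows = conn.execute(
    "select gid from tgongwen "
    "where fariqi = '2017-9-16' or fariqi = '2017-2-5' order by gid").fetchall()

union_rows = conn.execute(
    "select gid from tgongwen where fariqi = '2017-9-16' "
    "union "
    "select gid from tgongwen where fariqi = '2017-2-5' order by gid").fetchall()

assert or_rows == union_rows == [(1,), (2,)]
```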

9. Field extraction should follow the principle of "request only what you need" to avoid "select *"

Let's do a test:

Select top 10000 gid, fariqi, reader, title from tgongwen order by gid desc

Time: 4673 milliseconds

Select top 10000 gid, fariqi, title from tgongwen order by gid desc

Time: 1376 milliseconds

Select top 10000 gid, fariqi from tgongwen order by gid desc

Time: 80 milliseconds

From this we can see that each field we omit correspondingly improves extraction speed; the size of the improvement depends on the size of the discarded field.

10. count(*) is no slower than count(field).

Some documents claim that using * to count all columns is obviously less efficient than naming a column. This claim is actually unfounded. Let's take a look:

Select count(*) from Tgongwen

Time: 1500 milliseconds

Select count(gid) from Tgongwen

Time: 1483 milliseconds

Select count(fariqi) from Tgongwen

Time: 3140 milliseconds

Select count(title) from Tgongwen

Time: 52050 milliseconds

From the above we can see that count(*) and count(primary key) run at the same speed, while count(*) is faster than count of any field other than the primary key, and the longer the field, the slower the count. I suspect that for count(*), SQL SERVER automatically picks the smallest field to count on; of course, writing count(primary key) directly is even more explicit.

11. order by is most efficient when it sorts by the clustered index column

Let's look: (gid is the primary key; fariqi is the clustered index column)

Select top 10000 gid, fariqi, reader, title from tgongwen

Time: 196 milliseconds. Scan count 1, logical reads 289, physical reads 1, read-ahead reads 1527.

Select top 10000 gid, fariqi, reader, title from tgongwen order by gid asc

Time: 4720 milliseconds. Scan count 1, logical reads 41956, physical reads 0, read-ahead reads 1287.

Select top 10000 gid, fariqi, reader, title from tgongwen order by gid desc

Time: 4736 milliseconds. Scan count 1, logical reads 55350, physical reads 10, read-ahead reads 775.

Select top 10000 gid, fariqi, reader, title from tgongwen order by fariqi asc

Time: 173 milliseconds. Scan count 1, logical reads 290, physical reads 0, read-ahead reads 0.

Select top 10000 gid, fariqi, reader, title from tgongwen order by fariqi desc

Time: 156 milliseconds. Scan count 1, logical reads 289, physical reads 0, read-ahead reads 0.

From the above we can see that the speed and logical read count of an unsorted query are equivalent to those of a query ordered by the clustered index column, and both are much faster than a query ordered by a non-clustered index column.

At the same time, when sorting by a given field, ascending and descending order are basically equally fast.

12. Efficient TOP

In fact, when querying and extracting very large data sets, the biggest factor in database response time is not the data search but the physical I/O operations. For example:

Select top 10 * from (
    Select top 10000 gid, fariqi, title from tgongwen
    where neibuyonghu = 'Office'
    order by gid desc) as a
order by gid asc

In theory, the whole statement should take longer than its subquery alone, but the opposite is true, because the subquery returns 10000 records while the whole statement returns only 10: the biggest factor in response time really is physical I/O. One of the most effective ways to limit physical I/O is to use the TOP keyword. TOP is a system-optimized keyword in SQL SERVER for extracting the first N rows or the first N percent of rows, and in my practice I have found it very useful and efficient. The keyword does not exist in that other large database, ORACLE, which is something of a pity, even though other methods (such as rownumber) can be used there. We will use TOP again in the later discussion of "paged display stored procedures for tens of millions of records".
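The nested-TOP pattern above can be sketched outside SQL SERVER too. The snippet below is illustrative only: Python's sqlite3 with LIMIT standing in for TOP (SQLite has no TOP keyword) on a small invented table, taking the last page of a descending window and flipping it back to ascending order.

```python
import sqlite3

# Illustrative only: "top 10 of (top N ordered desc), re-ordered asc"
# expressed with SQLite's LIMIT instead of SQL SERVER's TOP.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE tgongwen (gid INTEGER PRIMARY KEY)")
conn.executemany("INSERT INTO tgongwen VALUES (?)", [(i,) for i in range(1, 101)])

rows = conn.execute("""
    select gid from (
        select gid from tgongwen order by gid desc limit 20
    ) as a
    order by gid asc limit 10
""").fetchall()

# The 20 largest gids are 81..100; the 10 smallest of those are 81..90
assert rows == [(i,) for i in range(81, 91)]
```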

So far, we have discussed how to quickly query the data you need from a large-capacity database. Of course, these are all "soft" methods; in practice we also have to consider various "hard" factors, such as network performance, server performance, and operating system performance, even network adapters and switches.

  III. A general paging display stored procedure for small and massive data volumes

Paging is needed when building a web application, and it is a very common problem in database processing. The typical solution is ADO recordset paging, that is, using ADO's built-in paging capability (implemented with a cursor). However, this paging method suits only small data volumes, because the cursor itself has a drawback: it is kept in memory and is very memory-intensive, and once the cursor is established, the related records are locked until it is released. A cursor provides a means of scanning a particular result set row by row; it is generally used to traverse the data and perform different operations depending on each row's values. For a cursor defined over multi-table joins or big tables (large data sets), looping can easily drive the program into a long wait or even a crash.

More importantly, for a very large data set, loading the entire data source on every paged retrieval is a waste of resources. The popular paging method today is to retrieve only a page-sized block of data, rather than all of it, and then step through the rows of the current page.

The earliest method of extracting data by page size and page number was probably the "Russian stored procedure". It uses a cursor, and because of the cursor's limitations, it never gained wide acceptance.

Later, someone on the Internet modified this stored procedure. The following is a paging stored procedure adapted to our office automation example:

CREATE procedure pagination1
(@pagesize int,    -- page size, e.g. 20 records per page
 @pageindex int    -- current page number
)
as

set nocount on

begin

declare @indextable table (id int identity(1,1), nid int)  -- define the table variable
declare @PageLowerBound int  -- lower bound of this page
declare @PageUpperBound int  -- upper bound of this page

set @PageLowerBound = (@pageindex - 1) * @pagesize
set @PageUpperBound = @PageLowerBound + @pagesize
set rowcount @PageUpperBound

insert into @indextable (nid)
    select gid from TGongwen where fariqi > dateadd(day, -365, getdate()) order by fariqi desc

select O.gid, O.mid, O.title, O.fadanwei, O.fariqi
    from TGongwen O, @indextable t
    where O.gid = t.nid and t.id > @PageLowerBound and t.id <= @PageUpperBound
    order by t.id

end

set nocount off

The above stored procedure uses one of SQL SERVER's newer features, table variables. It should be said that this is a very good paging stored procedure. Of course, you could also write the table variable as a temporary table (create table #Temp), but it is clear that in SQL SERVER temporary tables are not as fast as table variables. So when I first started using this stored procedure, I thought it was very good, faster than the original ADO approach. But later I found an even better method.
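For clarity, the window arithmetic that pagination1 performs on the identity column can be isolated in a few lines. The helper below is hypothetical, not from the article; it simply reproduces the @PageLowerBound/@PageUpperBound computation:

```python
# Hypothetical helper mirroring pagination1's arithmetic: page N covers
# index-table ids (N-1)*size + 1 .. N*size.
def page_bounds(page_index, page_size):
    lower = (page_index - 1) * page_size   # @PageLowerBound
    upper = lower + page_size              # @PageUpperBound (ROWCOUNT limit)
    return lower, upper

lower, upper = page_bounds(page_index=3, page_size=20)
assert (lower, upper) == (40, 60)   # page 3 keeps rows with 40 < id <= 60
```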

The author once saw a short article on the Internet, "How to retrieve records n through m from a data table", the full text of which follows:

Retrieve records n through m from the publish table:

SELECT TOP m-n+1 *
FROM publish
WHERE (id NOT IN
    (SELECT TOP n-1 id
     FROM publish))

id is the primary key of the publish table
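The idea translates directly to other engines. Below is an illustrative Python/sqlite3 rendering, with LIMIT standing in for TOP; note that it adds an explicit ORDER BY on the outer query, which the original snippet relies on implicitly:

```python
import sqlite3

# Illustrative only: records n..m = "m-n+1 rows whose id is NOT IN the
# first n-1 ids", expressed with SQLite's LIMIT instead of TOP.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE publish (id INTEGER PRIMARY KEY)")
conn.executemany("INSERT INTO publish VALUES (?)", [(i,) for i in range(1, 31)])

n, m = 11, 20   # we want records 11 through 20
rows = conn.execute("""
    select id from publish
    where id not in (select id from publish order by id limit ?)
    order by id limit ?
""", (n - 1, m - n + 1)).fetchall()

assert rows == [(i,) for i in range(11, 21)]
```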

When I saw this article, I was truly inspired: the idea is excellent. Later, while working on an office automation system (ASP.NET + C# + SQL SERVER), I suddenly remembered it and thought that by modifying this statement I might get a very good paging stored procedure. So I searched the Internet for the article. I never found it again, but I did find a paging stored procedure written from this statement; it is in fact a very popular paging stored procedure, and I regret not having rushed to turn the idea into a stored procedure myself:

Create procedure pagination2
(
 @SQL nVARCHAR(4000),    -- the SQL statement, without its ORDER BY clause
 @Page int,              -- page number
 @RecsPerPage int,       -- number of records per page
 @ID varchar(255),       -- the unique key to sort on
 @Sort VARCHAR(255)      -- sort field and direction
)
AS

DECLARE @Str nVARCHAR(4000)

SET @Str = 'SELECT TOP ' + CAST(@RecsPerPage AS varchar(20)) + ' * FROM ('
    + @SQL + ') T WHERE T.' + @ID + ' NOT IN
    (SELECT TOP ' + CAST(@RecsPerPage * (@Page - 1) AS varchar(20)) + ' ' + @ID
    + ' FROM (' + @SQL + ') T9 ORDER BY ' + @Sort + ') ORDER BY ' + @Sort

PRINT @Str

EXEC sp_ExecuteSql @Str
GO

In fact, the preceding statement can be simplified:

Select top <page size> *
FROM Table1
WHERE (ID NOT IN
    (Select top <page size * page number> ID
     FROM Table1
     Order by ID))
ORDER BY ID

However, this stored procedure has a fatal drawback: it contains not in. Although I can transform it:

Select top <page size> *
FROM Table1 a
WHERE not exists
    (select * from (select top <page size * page number> * from Table1 order by id) b
     where b.id = a.id)
Order by id

That is to say, not exists is used to replace not in; but as we discussed earlier, the two show no difference in execution efficiency.

Even so, this combination of TOP and not in is still somewhat faster than using a cursor.

Although using not exists cannot rescue the efficiency of this stored procedure, using the TOP keyword in SQL SERVER is still a wise choice, because the ultimate goal of paging optimization is to avoid producing overly large record sets, and we have already mentioned TOP's advantage: with TOP, we can control the data volume.

In paging algorithms, two key factors affect query speed: TOP and not in. TOP raises query speed, while not in drags it down. So, to raise the speed of the whole paging algorithm, we must completely rework not in and replace it with something else.

We know that max(field) and min(field) can extract the maximum or minimum value of almost any field, so if the field is not duplicated, we can use the max or min of that unique field as the watershed: the reference object that separates the pages in the paging algorithm. Here we can use the operators ">" and "<" to do the job, which makes the query conform to SARG form. For example:

Select top 10 * from table1 where id > 200

This yields the following paging solution:

Select top <page size> *
From table1
Where id >
    (Select max(id) From
        (Select top <(page number - 1) * page size> id From table1 Order by id) as T
    )
Order by id
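Here is a purely illustrative Python/sqlite3 rendering of this scheme (LIMIT standing in for TOP, invented sample data): each page seeks past the max id of all preceding pages instead of excluding them with not in.

```python
import sqlite3

# Illustrative only: "keyset" paging - page N starts just past the max id
# of the first (N-1)*size rows.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE table1 (id INTEGER PRIMARY KEY)")
conn.executemany("INSERT INTO table1 VALUES (?)", [(i,) for i in range(1, 101)])

page, size = 4, 10   # fetch page 4
rows = conn.execute("""
    select id from table1
    where id > (select max(id) from
                (select id from table1 order by id limit ?) as t)
    order by id limit ?
""", ((page - 1) * size, size)).fetchall()

assert rows == [(i,) for i in range(31, 41)]
```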

For a column with no repeated values whose order is easy to compare, we usually choose the primary key. Taking the 10-million-row table of our office automation system, using the primary key as the sort column and extracting the gid, fariqi, and title fields, and taking pages 1, 10, 100, 500, 1000, 10000, 100000, 250000, and 500000 as examples, I tested the execution speed of the three paging schemes above (unit: milliseconds):

Page number      Scheme 1    Scheme 2    Scheme 3
1                      60          30          76
10                     46          16          63
100                  1076         720         130
500                   540       12943          83
1000                17110         470         250
10000               24796        4500         140
100000              38326       42283        1553
250000              28140      128720        2330
500000             121686      127846        7168

From the table above, we can see that all three stored procedures can be trusted when paging within the first 100 pages or so, and all are fast. Scheme 1, however, slows down beyond about 1000 pages; scheme 2 begins to slow beyond about 10000 pages; but scheme 3 never degrades significantly and still has plenty of stamina.

Having settled on scheme 3, we can write a stored procedure around it. As you know, SQL SERVER stored procedures are compiled in advance, so their execution efficiency is higher than that of SQL statements sent from a web page. The following stored procedure embeds the paging scheme and also decides, based on a parameter sent from the page, whether to count the total number of records.

-- Get data from the specified page
CREATE procedure pagination3

@tblName      varchar(255),         -- table name
@strGetFields varchar(1000) = '*',  -- the columns to return
@fldName      varchar(255) = '',    -- name of the sort field
@PageSize     int = 10,             -- page size
@PageIndex    int = 1,              -- page number
@doCount      bit = 0,              -- return the total record count; non-zero means yes
@OrderType    bit = 0,              -- sort type; non-zero means descending
@strWhere     varchar(1500) = ''    -- query condition (note: do not include "where")

AS

declare @strSQL   varchar(5000)  -- main statement
declare @strTmp   varchar(110)   -- temporary variable
declare @strOrder varchar(400)   -- sort clause

if @doCount != 0
begin
    if @strWhere != ''
        set @strSQL = 'select count(*) as Total from [' + @tblName + '] where ' + @strWhere
    else
        set @strSQL = 'select count(*) as Total from [' + @tblName + ']'
end

-- The code above runs the count when @doCount is passed as non-zero;
-- all the code below handles @doCount = 0

else
begin

    if @OrderType != 0
    begin
        set @strTmp = '<(select min'
        set @strOrder = ' order by [' + @fldName + '] desc'
        -- If @OrderType is non-zero, sort descending. This line is very important!
    end
    else
    begin
        set @strTmp = '>(select max'
        set @strOrder = ' order by [' + @fldName + '] asc'
    end

    if @PageIndex = 1
    begin
        if @strWhere != ''
            set @strSQL = 'select top ' + str(@PageSize) + ' ' + @strGetFields
                + ' from [' + @tblName + '] where ' + @strWhere + @strOrder
        else
            set @strSQL = 'select top ' + str(@PageSize) + ' ' + @strGetFields
                + ' from [' + @tblName + ']' + @strOrder
        -- The code above handles the first page, which speeds up execution
    end
    else
    begin
        -- The code below gives @strSQL the SQL that will actually be executed
        set @strSQL = 'select top ' + str(@PageSize) + ' ' + @strGetFields
            + ' from [' + @tblName + '] where [' + @fldName + ']' + @strTmp
            + '([' + @fldName + ']) from (select top ' + str((@PageIndex - 1) * @PageSize)
            + ' [' + @fldName + '] from [' + @tblName + ']' + @strOrder
            + ') as tblTmp)' + @strOrder

        if @strWhere != ''
            set @strSQL = 'select top ' + str(@PageSize) + ' ' + @strGetFields
                + ' from [' + @tblName + '] where [' + @fldName + ']' + @strTmp
                + '([' + @fldName + ']) from (select top ' + str((@PageIndex - 1) * @PageSize)
                + ' [' + @fldName + '] from [' + @tblName + '] where ' + @strWhere
                + @strOrder + ') as tblTmp) and ' + @strWhere + @strOrder
    end

end

exec (@strSQL)
GO

The stored procedure above is a general-purpose one, with its comments written inline.
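To make the branching easier to follow, the dynamic-SQL construction can be mirrored in ordinary code. The helper below is hypothetical and simplified (it always selects * and omits the @doCount branch); it reproduces the page-1 shortcut and the max/min seek used for later pages:

```python
# Hypothetical, simplified mirror of pagination3's string building.
def build_page_sql(tbl, fld, page_size, page_index, desc=False, where=""):
    order = f" order by [{fld}] {'desc' if desc else 'asc'}"    # @strOrder
    cmp_ = "<(select min" if desc else ">(select max"           # @strTmp
    w = f" where {where}" if where else ""
    if page_index == 1:
        # First page: a plain TOP query, no subquery needed
        return f"select top {page_size} * from [{tbl}]{w}{order}"
    prev = (page_index - 1) * page_size
    tail = f" and {where}" if where else ""
    # Later pages: seek past the max/min key of all preceding pages
    return (f"select top {page_size} * from [{tbl}] where [{fld}]{cmp_}([{fld}])"
            f" from (select top {prev} [{fld}] from [{tbl}]{w}{order}) as tblTmp)"
            f"{tail}{order}")

sql = build_page_sql("Tgongwen", "gid", 10, 3)
assert sql.startswith("select top 10 ")
assert "select top 20 [gid] from [Tgongwen]" in sql
```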

With large data volumes, especially when querying the last few pages, this procedure's query time generally does not exceed 9 seconds, whereas other stored procedures tend to time out in practice; it is therefore very well suited to queries against large-capacity databases.

I hope that the analysis of these stored procedures offers some inspiration and brings efficiency gains to your work, and I hope fellow practitioners will propose even better paging algorithms for real-time data.

  IV. The importance of clustered indexes and how to choose one

In the title of the previous section, I wrote "a general paging display stored procedure for small and massive data volumes". This is because, when applying this stored procedure to the "office automation" system, I found that the third stored procedure behaves as follows at small data volumes:

1. The paging speed is generally between 1 second and 3 seconds.

2. When querying the last page, the speed is generally 5 to 8 seconds, even when the total is only 3 pages or 300,000 pages.

Although this paging implementation is very fast at very large capacities, on the first few pages this 1-to-3-second speed is slower than even the first, unoptimized paging method. In the users' words, "it isn't even as fast as the ACCESS database" — a perception sufficient to make users abandon the system you developed.

I analyzed the problem, and the crux turns out to be simple yet vitally important: the sort field is not the clustered index!

The title of this article is "query optimization and paging algorithm solutions". I put the topics of "query optimization" and "paging algorithm" together precisely because both depend on one very important thing: the clustered index.

As we mentioned earlier, clustered indexes have two major advantages:

1. Narrow the query range as quickly as possible.

2. Sort fields as quickly as possible.

The first is mostly used in query optimization; the second is mostly used for data sorting during paging.

Only one clustered index can be created per table, which makes it all the more important. The choice of clustered index may well be the most critical factor in both "query optimization" and "efficient paging".

However, it is usually a contradiction for the clustered index column to satisfy both the needs of the query columns and the needs of the sort column.

In the earlier discussion of indexes, I made fariqi, the document's issue date, the leading column of the clustered index, with the date accurate to the "day". The advantages of this choice were mentioned earlier: for fast queries over a time range, it beats using the ID primary key column.

However, because this clustered index column contains duplicate records, max and min cannot serve as the paging reference objects, making efficient sorting impossible. If instead the ID primary key is made the clustered index, the clustered index is useless for anything except sorting, which is in fact a waste of a valuable resource.

To resolve this contradiction, I later added a date column whose default value is getdate(). When a record is written, this column automatically records the current time, accurate to the millisecond. Even so, to rule out even a small chance of collision, a UNIQUE constraint must also be created on the column. This date column then serves as the clustered index column.

With this time-based clustered index column, the column serves both to narrow queries to the time window around data insertion and, as a unique column, to work with max or min as the reference object of the paging algorithm.

After this optimization, I found that paging speed is usually a few dozen milliseconds, or even 0 milliseconds, at both large and small data volumes, while queries narrowed by date range are no slower than before.

The clustered index is so important and precious that I have concluded it must be built on:

1. The most frequently used field to narrow the query scope;

2. The most frequently used field to be sorted.

  Conclusion:

This article brings together my recent experience using databases and is an accumulation of practical experience from the "office automation" system. I hope it helps not only your work but also your approach to analyzing problems; most of all, I hope it kindles everyone's interest in study and discussion, so that together we can advance the cause of public security science and technology and the Golden Shield Project.

Finally, I should point out that in my experiments I found that the biggest influence on database speed is not memory size but the CPU. On my P4 2.4 GHz machine, watching "Task Manager", the CPU usage often stayed at 100%, while memory usage did not change, or changed little. Even on our HP ML 350 G3 server, peak CPU usage reaches 90%, and it generally sits around 70%.

The experimental data in this article comes from our HP ML 350 server. Server configuration: dual Intel Xeon hyper-threaded CPUs at 2.4 GHz, 1 GB of memory, operating system Windows Server 2003 Enterprise Edition, database SQL Server 2000 SP3.
