Query optimization of massive databases and paging algorithm schemes (part 2 of 2)

FROM publish
WHERE (id NOT IN
    (SELECT TOP n-1 id
     FROM publish))
where id is the key field of the publish table.
When I read that article at the time, I was truly inspired and thought the idea was very good. Later, while working on an office automation system (ASP.NET + C# + SQL Server), I suddenly remembered it and thought that, with some changes, that statement could make a very good paging stored procedure. So I searched the Internet for the article again; unexpectedly, I could not find the article itself, but I did find a paging stored procedure written according to that statement, which is currently one of the more popular paging stored procedures. I regretted not having been the first to turn that text into a stored procedure:
CREATE PROCEDURE Pagination2
(
    @SQL         nvarchar(4000),  -- the SQL statement, without an ORDER BY clause
    @Page        int,             -- page number
    @RecsPerPage int,             -- number of records per page
    @ID          varchar(255),    -- the non-duplicate ID column used for sorting
    @Sort        varchar(255)     -- sort column and direction
)
AS
DECLARE @Str nvarchar(4000)
SET @Str = 'SELECT TOP ' + CAST(@RecsPerPage AS varchar) + ' * FROM (' + @SQL + ') T WHERE T.' + @ID + ' NOT IN
(SELECT TOP ' + CAST(@RecsPerPage * (@Page - 1) AS varchar) + ' ' + @ID + ' FROM (' + @SQL + ') T9 ORDER BY ' + @Sort + ') ORDER BY ' + @Sort
PRINT @Str
EXEC sp_executesql @Str
GO
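As a usage sketch (the column list and the sort direction below are illustrative assumptions, not taken from the original system), the procedure could be invoked like this to fetch the third page of 20 rows from the publish table mentioned above:

EXEC Pagination2
     @SQL         = N'SELECT id, title FROM publish',  -- statement without ORDER BY; title is a hypothetical column
     @Page        = 3,
     @RecsPerPage = 20,
     @ID          = 'id',
     @Sort        = 'id ASC'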
In fact, the core statement of the stored procedure above can be simplified to:
SELECT TOP <page size> *
FROM Table1
WHERE (ID NOT IN
    (SELECT TOP <page size * (page number - 1)> ID
     FROM Table1
     ORDER BY ID))
ORDER BY ID
But this stored procedure has a fatal drawback: it contains the NOT IN keyword. Although I could change it to:
SELECT TOP <page size> *
FROM Table1 a
WHERE NOT EXISTS
    (SELECT * FROM (SELECT TOP <page size * (page number - 1)> * FROM Table1 ORDER BY ID) b WHERE b.ID = a.ID)
ORDER BY ID
That is, we use NOT EXISTS instead of NOT IN, but as we discussed earlier, the execution efficiency of the two is virtually indistinguishable.
Even so, this method of combining TOP with NOT IN is still somewhat faster than using a cursor.
Although NOT EXISTS cannot rescue the efficiency of the previous stored procedure, using the TOP keyword in SQL Server is a very wise choice, because the ultimate goal of paging optimization is to avoid creating overly large recordsets, and we have already mentioned the advantage of TOP: it controls the volume of data returned.
In the paging algorithm, two key factors affect our query speed: TOP and NOT IN. TOP improves query speed, while NOT IN slows it down. So to improve the speed of the whole paging algorithm, we need to completely rework NOT IN and replace it with another method.
We know that for almost any field, we can extract its maximum or minimum value with MAX(field) or MIN(field). So if the field has no duplicate values, we can use the MAX or MIN of this non-duplicate field as the watershed that separates one page from the next in the paging algorithm. Here, the ">" or "<" operators can do the job, and they keep the query statement in SARG form. For example:
SELECT TOP <page size> * FROM Table1 WHERE ID > 200
Hence the following paging scheme:
SELECT TOP <page size> *
FROM Table1
WHERE ID >
    (SELECT MAX(ID) FROM
        (SELECT TOP <(page number - 1) * page size> ID FROM Table1 ORDER BY ID) AS T)
ORDER BY ID
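To make the scheme concrete, here is a hypothetical instance assuming a page size of 10 and a request for page 1000, so the inner query skips the first (1000 - 1) * 10 = 9990 rows:

SELECT TOP 10 *
FROM Table1
WHERE ID >
    (SELECT MAX(ID) FROM
        (SELECT TOP 9990 ID FROM Table1 ORDER BY ID) AS T)
ORDER BY ID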
When choosing a column that has no duplicate values and whose values are easy to compare, we usually choose the primary key. The following table uses the 10-million-row table from the office automation system, in which GID is the primary key (but not the clustered index), as the sort column, extracting the gid, fariqi, and title fields. Taking pages 1, 10, 100, 500, 1000, 10,000, 100,000, 250,000, and 500,000 as examples, it tests the execution speed of the above three paging schemes (unit: milliseconds):
Page        Scheme 1 (ms)   Scheme 2 (ms)   Scheme 3 (ms)
1           60              30              76
10          46              16              63
100         1076            720             130
500         540             12943           83
1000        17110           470             250
10,000      24796           4500            140
100,000     38326           42283           1553
250,000     28140           128720          2330
500,000     121686          127846          7168
From the table above, we can see that all three schemes are trustworthy and very fast when paging within the first 100 pages. However, scheme 1 slows down sharply once it goes beyond about 1,000 pages; scheme 2 starts to fall off after roughly 10,000 pages; but scheme 3 never suffers a big drop and still has plenty in reserve.
After settling on the third paging scheme, we can write a stored procedure for it. As you know, SQL Server stored procedures are precompiled SQL statements, which execute more efficiently than SQL statements submitted from a web page. The following stored procedure not only contains the paging scheme, but also decides, based on a parameter passed in from the page, whether to return the total record count.
-- Gets the data for the specified page
CREATE PROCEDURE Pagination3
@tblName      varchar(255),          -- table name
@strGetFields varchar(1000) = '*',   -- columns to return
@fldName      varchar(255) = '',     -- name of the sort column
@PageSize     int = 10,              -- page size
@PageIndex    int = 1,               -- page number
@doCount      bit = 0,               -- if nonzero, return only the total record count
@OrderType    bit = 0,               -- sort type: nonzero means descending
@strWhere     varchar(1500) = ''     -- query condition (note: do not include WHERE)
AS

DECLARE @strSQL   varchar(5000)      -- main statement
DECLARE @strTmp   varchar(110)       -- temporary variable
DECLARE @strOrder varchar(400)       -- sort clause

IF @doCount != 0
BEGIN
    IF @strWhere != ''
        SET @strSQL = 'SELECT count(*) AS Total FROM [' + @tblName + '] WHERE ' + @strWhere
    ELSE
        SET @strSQL = 'SELECT count(*) AS Total FROM [' + @tblName + ']'
END
-- The code above means: if @doCount is nonzero, only the total count is executed.
-- All of the code below handles the case where @doCount is 0.
ELSE
BEGIN
    IF @OrderType != 0
    BEGIN
        SET @strTmp = ' < (SELECT MIN'
        SET @strOrder = ' ORDER BY [' + @fldName + '] DESC'
        -- If @OrderType is nonzero, perform a descending sort. This line is important!
    END
    ELSE
    BEGIN
        SET @strTmp = ' > (SELECT MAX'
        SET @strOrder = ' ORDER BY [' + @fldName + '] ASC'
    END

    IF @PageIndex = 1
    BEGIN
        IF @strWhere != ''
            SET @strSQL = 'SELECT TOP ' + str(@PageSize) + ' ' + @strGetFields + ' FROM [' + @tblName + '] WHERE ' + @strWhere + ' ' + @strOrder
        ELSE
            SET @strSQL = 'SELECT TOP ' + str(@PageSize) + ' ' + @strGetFields + ' FROM [' + @tblName + '] ' + @strOrder
        -- If this is the first page, the code above is executed, which speeds things up
    END
    ELSE
    BEGIN
        -- The following code assigns to @strSQL the SQL that will actually be executed
        SET @strSQL = 'SELECT TOP ' + str(@PageSize) + ' ' + @strGetFields + ' FROM ['
            + @tblName + '] WHERE [' + @fldName + ']' + @strTmp + '([' + @fldName + ']) FROM (SELECT TOP ' + str((@PageIndex - 1) * @PageSize) + ' [' + @fldName + '] FROM [' + @tblName + '] ' + @strOrder + ') AS tblTmp) ' + @strOrder

        IF @strWhere != ''
            SET @strSQL = 'SELECT TOP ' + str(@PageSize) + ' ' + @strGetFields + ' FROM ['
                + @tblName + '] WHERE [' + @fldName + ']' + @strTmp + '(['
                + @fldName + ']) FROM (SELECT TOP ' + str((@PageIndex - 1) * @PageSize) + ' ['
                + @fldName + '] FROM [' + @tblName + '] WHERE ' + @strWhere + ' '
                + @strOrder + ') AS tblTmp) AND ' + @strWhere + ' ' + @strOrder
    END
END
EXEC (@strSQL)
GO
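As a usage sketch (the table name OfficeDocs is a placeholder; gid, fariqi, and title are the columns used in the test above), the procedure might be called once to get the total and once to fetch a page:

-- Hypothetical call: return only the total record count
EXEC Pagination3 @tblName = 'OfficeDocs', @fldName = 'gid', @doCount = 1

-- Hypothetical call: page 100,000, 10 rows per page, ascending by gid
EXEC Pagination3
     @tblName      = 'OfficeDocs',
     @strGetFields = 'gid, fariqi, title',
     @fldName      = 'gid',
     @PageSize     = 10,
     @PageIndex    = 100000,
     @doCount      = 0,
     @OrderType    = 0,
     @strWhere     = ''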
The stored procedure above is a general-purpose paging stored procedure, with its comments written inline.
With large data volumes, especially when querying the last few pages, the query time generally does not exceed 9 seconds, while other stored procedures in practice cause timeouts, so this stored procedure is well suited to queries on large databases.
I hope the analysis of the above stored procedures offers some inspiration and brings some efficiency gains to your work, and I hope colleagues will come up with even better real-time data paging algorithms.
IV. The importance of clustered indexes and how to choose them
In the title of the previous section, I wrote: a general paging display stored procedure that works for both small and massive data volumes. That is because, when applying this stored procedure to the office automation system in practice, the author found that with small data volumes the third kind of stored procedure shows the following behavior:
1. The paging speed generally stays between 1 and 3 seconds.
2. When querying the last page, the speed is generally 5 to 8 seconds, even if the total number of pages is only 3 or 300,000.
Although this paging implementation is very fast for very large data volumes, for the first few pages this 1-to-3-second speed is slower than even the first, unoptimized paging method. In the users' words, it is "not even as fast as the Access database," a perception serious enough to make users abandon the system you developed.
The author analyzed this, and the cause of the phenomenon turned out to be both simple and important: the field being sorted on is not the clustered index!
The title of this article is "Query optimization and paging algorithm schemes." The author dared to put these two loosely connected topics, "query optimization" and "paging algorithm," together only because both depend on one very important thing: the clustered index.
As we mentioned earlier in the discussion, a clustered index has two major advantages:
1. It narrows the query range at the fastest possible speed.
2. It sorts a field at the fastest possible speed.
Point 1 is used mostly in query optimization, while point 2 is used mostly for sorting data when paging.
Only one clustered index can be created per table, which makes it all the more valuable. The choice of clustered index can be said to be the most critical factor in achieving both "query optimization" and "efficient paging."
However, making the clustered index column satisfy the needs of both the query column and the sort column is often a contradiction.
In the earlier discussion of indexes, the author made fariqi, the date the document was issued, the leading column of the clustered index, with the date precise to the day. The advantages of this approach have already been mentioned: for quick queries over a time range, it has a great advantage over the ID primary key column.
When paging, however, because this clustered index column contains duplicate values, MAX or MIN cannot be used as the paging reference, so a more efficient sort cannot be achieved. Yet if the ID primary key column is used as the clustered index, the clustered index is good for nothing but sorting, which in fact squanders this valuable resource.
To resolve this contradiction, the author later added a date column whose default value is GETDATE(). When a user writes a record, this column automatically records the time, precise to the millisecond. Even so, to rule out even a tiny chance of collision, a UNIQUE constraint is created on this column, and this date column is used as the clustered index column.
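A minimal sketch of this layout, assuming a hypothetical table named OfficeDocs with the gid, fariqi, and title columns discussed above (the column types and the name AddedDate are illustrative assumptions, not from the original system):

CREATE TABLE OfficeDocs (
    gid       int IDENTITY(1,1) NOT NULL
              CONSTRAINT PK_OfficeDocs PRIMARY KEY NONCLUSTERED,       -- primary key, but not the clustered index
    fariqi    datetime NULL,                                           -- issue date, precise to the day
    title     varchar(200) NULL,
    AddedDate datetime NOT NULL
              CONSTRAINT DF_OfficeDocs_AddedDate DEFAULT (GETDATE()),  -- insert time, down to the millisecond
    CONSTRAINT UQ_OfficeDocs_AddedDate UNIQUE CLUSTERED (AddedDate)    -- unique and clustered on the insert time
)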
With this time-based clustered index column, the user can both use it to narrow queries to data inserted during a given period and, since it is a unique column, use MAX or MIN on it as the reference point for the paging algorithm.
After this optimization, the author found that whether the data volume is large or small, paging generally takes only a few dozen milliseconds, or even 0 milliseconds, and the speed of narrowing the range by date is no slower than before.
Clustered indexes are so important and so precious that the author has concluded the clustered index must be built on:
1. the field you use most frequently to narrow the query range;
2. the field you use most frequently that needs to be sorted.
Conclusion:
This article brings together the author's recent experience in using databases, practical experience accumulated while building the office automation system. I hope it not only helps with your work, but also conveys a way of analyzing problems; most of all, I hope it sparks interest, study, and discussion, so that together we can contribute our best to the cause of strengthening the police through science and technology and to the Golden Shield Project.
Finally, in the experiments I found that when users query large volumes of data, what most affects database speed is not memory size but the CPU. Experimenting on my P4 2.4 GHz machine while watching the task manager, the CPU would repeatedly stay at 100%, while memory usage barely changed. Even when testing on our HP ML 350 G3 server, CPU peaks reached 90%, and it generally stayed around 70%.
The test data in this article come from our HP ML 350 server. Server configuration: dual Intel Xeon hyper-threaded 2.4 GHz CPUs, 1 GB of memory, operating system Windows Server 2003 Enterprise Edition, database SQL Server SP3.
