Query optimization for a massive database (MS SQL Server)

Source: Internet
Author: User
Tags: constant, data structures, datetime, getdate, access, database
create table [dbo].[Tgongwen] (     -- Tgongwen is the name of the red-header document table
   [Gid] [int] identity (1, 1) not null,
-- The ID number of this table, also the primary key
   [title] [varchar] (80) collate Chinese_PRC_CI_AS null,   -- length assumed; the original value was garbled
-- Red-header document title
   [fariqi] [datetime] null,
-- Release date
   [neibuYonghu] [varchar] (50) collate Chinese_PRC_CI_AS null,   -- length assumed
-- Publishing user
   [reader] [varchar] (900) collate Chinese_PRC_CI_AS null
-- Users who need to read the document, separated from each other by ","
) on [primary] textimage_on [primary]
go

Below, we add 10,000,000 rows of data to the table:

declare @i int
set @i=1
while @i<=250000
begin
    insert into Tgongwen (fariqi,neibuyonghu,reader,title)
    values ('2004-2-5','Communications Section','Communications Section,Office,Director Wang,Director Liu,Director Zhang,admin,Criminal Investigation Detachment,Secret Service Detachment,Patrol Detachment,Investigation Detachment,Family Administration Branch,Security Detachment,Foreign Affairs Section','This is the first 250,000 records')
    set @i=@i+1
end
go


declare @i int
set @i=1
while @i<=250000
begin
    insert into Tgongwen (fariqi,neibuyonghu,reader,title)
    values ('2004-9-16','Office','Office,Communications Section,Director Wang,Secretary Liu,Secretary Zhang,admin,Criminal Investigation Detachment,Secret Service Detachment,Patrol Detachment,Investigation Detachment,Family Administration Branch,Foreign Affairs Section','This is the middle 250,000 records')
    set @i=@i+1
end
go

declare @h int
set @h=1
while @h<=100
begin
    declare @i int
    set @i=2002
    while @i<=2003
    begin
        declare @j int
        set @j=0
        while @j<50
        begin
            declare @k int
            set @k=0
            while @k<50
            begin
                insert into Tgongwen (fariqi,neibuyonghu,reader,title)
                values (cast(@i as varchar(4))+'-8-15 3:'+cast(@j as varchar(2))+':'+cast(@k as varchar(2)),'Communications Section','Office,Communications Section,Director Wang,Director Liu,Director Zhang,admin,Criminal Investigation Detachment,Secret Service Detachment,Patrol Detachment,Investigation Detachment,Family Administration Branch,Foreign Affairs Section','This is the last 500,000 records')
                set @k=@k+1
            end
            set @j=@j+1
        end
        set @i=@i+1
    end
    set @h=@h+1
end
go
declare @i int
set @i=1
while @i<=9000000
begin
    insert into Tgongwen (fariqi,neibuyonghu,reader,title)
    values ('2004-5-5','Communications Section','Communications Section,Office,Director Wang,Director Liu,Secretary Zhang,admin,Criminal Investigation Detachment,Secret Service Detachment,Patrol Detachment,Investigation Detachment,Family Administration Branch,Security Detachment,Foreign Affairs Section','This is the last 9 million records added')
    set @i=@i+1
end
go

Through the statements above, we have created 250,000 records dated 2004-2-5 issued by the Communications Section, 250,000 records dated 2004-9-16 issued by the Office, 500,000 records issued by the Communications Section on the same date in 2002 and 2003 but at 2,500 different minute-and-second combinations per year (1,000,000 records so far), plus 9,000,000 records dated 2004-5-5 issued by the Communications Section, for a total of 10,000,000 records.

I. Establishing an "appropriate" index according to the actual situation

Establishing an "appropriate" index is the first prerequisite for query optimization.

An index is another important, user-defined data structure stored on physical media apart from the table itself. When you search for data by the value of an index key, the index provides fast access to that data. In fact, without an index the database can still retrieve the results of a SELECT statement, but as the table grows, the benefit of an "appropriate" index becomes more and more apparent. Note the word "appropriate": if an index is used without careful consideration of how it works, it can damage database performance as easily as improve it.

(i) Understanding the index structure in plain terms

In fact, you can think of an index as a special kind of directory. Microsoft SQL Server provides two kinds of index: the clustered index and the nonclustered index. Let's look at the difference between them with an example:

The body of a Chinese dictionary is itself a clustered index. For example, to look up the character "an", we naturally open the first pages of the dictionary, because its pinyin is "an" and the dictionary orders characters by pinyin from "a" through "z", so "an" sits near the front. If you exhaust all the "a" entries without finding it, the character is simply not in your dictionary. Likewise, to look up "zhang" you turn toward the back, because its pinyin is "zhang". That is, the body of the dictionary is itself a directory; you do not need to consult any other catalog to find what you are looking for.

We call this kind of body text, arranged directly according to fixed rules, a "clustered index".

If you know a character's pronunciation, you can find it quickly this way. But you may also meet a character you do not know and cannot pronounce. Then you cannot search as before; instead you find it through the "radical catalog", and from its entry in the stroke-count table turn directly to the page number listed there. The ordering of characters in the combined "radical catalog" and stroke-count table is not the ordering of the body text. For example, when you look up "zhang" in the stroke-count table, you may see that its page number is 672; above it in the table is "chi", and below it is "nu" (crossbow) on page 390. Clearly these characters are not physically adjacent to "zhang" in the body text; the consecutive "chi, zhang, nu" you see is their ordering in the nonclustered index, a mapping of the body's characters into that index. You can find the character this way, but it takes two steps: first find the entry in the catalog, then turn to the page it points to.

We call this arrangement, where the catalog is purely a catalog and the body text is purely body text, a "nonclustered index".

From the example above we can understand what a "clustered index" and a "nonclustered index" are.

Going further, it is easy to see why each table can have only one clustered index: as with the dictionary, the body can be physically sorted in only one way.

(ii) When to use clustered or nonclustered indexes

The following table summarizes when to use a clustered or a nonclustered index (very important):

Action description                       Use clustered index   Use nonclustered index
Columns often grouped or sorted          Should                Should
Returning data in a given range          Should                Should not
One or very few distinct values          Should not            Should not
A small number of distinct values        Should                Should not
A large number of distinct values        Should not            Should
Frequently updated columns               Should not            Should
Foreign key columns                      Should                Should
Primary key columns                      Should                Should
Frequently modified index columns        Should not            Should

In fact, we can understand the table above through the earlier definitions of clustered and nonclustered indexes. Take "returning data in a given range": suppose a table has a time column and the clustered index is built on that column; then a query for all data between two dates will be very fast, because the body of your "dictionary" is sorted by date. The clustered index only needs to find the first and last rows of the range, whereas with a nonclustered index you must first look up each item's page number in the catalog and then fetch the content page by page.

(iii) Common misconceptions about index use, in light of practice

The purpose of theory is application. Although we have just listed when to use a clustered or a nonclustered index, in practice these rules are easily ignored or cannot be weighed against the actual situation. Below we discuss index use based on problems met in practice, to make the method of building indexes easier to grasp.

1. Misconception: the primary key should be the clustered index

This idea is extremely wrong; it wastes the clustered index, even though SQL Server does create a clustered index on the primary key by default.

Typically, we create an ID column in each table to distinguish rows; the ID column grows automatically, usually with a step of 1. That is the case for the column Gid in our office-automation example. If we make this column the primary key, SQL Server will make it the clustered index. The benefit is that your data is physically sorted by ID in the database, but I do not think that is worth much.

Clearly, the advantages of a clustered index are obvious, and the rule that each table can have only one makes the clustered index all the more precious.

From the definition of the clustered index discussed earlier, its biggest benefit is quickly narrowing the query range according to the query requirement, avoiding a full table scan. In practice, because the ID number is generated automatically, we do not know each record's ID, so it is very hard to query by ID; making the ID primary key the clustered index therefore wastes this resource. Second, putting the clustered index on a field where every value is different violates the rule that a clustered index should not be built on "a large number of distinct values"; admittedly, that drawback matters mainly when records, and especially the indexed field, are frequently modified, and it does not affect query speed.

In an office-automation system, whether the home page shows files awaiting the user's signature, or the user queries files or meetings, the query in every case involves two fields: the "date" and the user's own "user name".

Usually, the home page of an office-automation system displays the files or meetings each user has not yet signed. Although the WHERE clause limits results to what the current user has not signed, if your system has been running a long time and the data volume is large, a full table scan every time every user opens the home page makes little sense: the vast majority of users finished browsing the files from months ago, and the scan only adds load on the database. In fact, we can let the home page query only the user's unread files from the last 3 months, using the "date" field to limit the table scan and improve query speed. If your office-automation system has been running for 2 years, your home page will in theory display 8 times faster, or even faster.
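As a sketch of this idea, using the column names of the example table: the CHARINDEX test on [reader] below is a hypothetical stand-in for the real "not yet signed by this user" condition, which the article does not define.

```sql
-- Limit the home-page query to the current user's documents from
-- the last 3 months instead of scanning the whole table.
declare @username varchar(50)
set @username = 'admin'

select Gid, fariqi, neibuyonghu, title
from   Tgongwen
where  fariqi > dateadd(month, -3, getdate())   -- date field limits the scan
  and  charindex(@username, reader) > 0         -- placeholder "unread" test
order by fariqi desc
```

With the clustered index on fariqi, the date predicate narrows the scan to the most recent rows before the per-user filter is applied.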

The word "theoretically" is used here because if your clustered index is still blindly built on the ID primary key, your query will not be that fast, even if you build an index (not a clustered index) on the "date" field. Now let's look at the speed of various queries with 10 million rows of data (extracting 3 months of data):

(1) A clustered index on the primary key only, with no time-period restriction:

select Gid,fariqi,neibuyonghu,title from Tgongwen

Time taken: 128,470 ms (i.e. 128 seconds)

(2) A clustered index on the primary key and a nonclustered index on fariqi:

select Gid,fariqi,neibuyonghu,title from Tgongwen
where fariqi > dateadd(day,-90,getdate())

Time taken: 53,763 ms (54 seconds)

(3) The clustered index built on the date column (fariqi):

select Gid,fariqi,neibuyonghu,title from Tgongwen
where fariqi > dateadd(day,-90,getdate())

Time taken: 2,423 ms (2 seconds)

Although each statement extracts the same data, the differences are huge, especially when the clustered index is on the date column. In fact, if your database really holds 10 million rows and the clustered index is on the ID primary key, as in cases 1 and 2 above, the Web page simply times out and cannot display at all. This is the most important reason I abandoned the ID column as the clustered index.
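The article does not show the DDL for case (3). A sketch of moving the clustered index from the ID primary key to the date column might look like this; the constraint name PK_Tgongwen is an assumption (the real name can be found in sys.key_constraints):

```sql
-- 1. Drop the clustered primary key, then re-add it as nonclustered,
--    freeing the table's one clustered index for the date column.
alter table Tgongwen drop constraint PK_Tgongwen
alter table Tgongwen add constraint PK_Tgongwen
      primary key nonclustered (Gid)

-- 2. Build the clustered index on the date column instead.
create clustered index IX_Tgongwen_fariqi on Tgongwen (fariqi)
```

Rebuilding the clustered index physically reorders a 10-million-row table, so this is best done in a maintenance window.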

The timings above were obtained by adding, before each SELECT statement:

declare @d datetime
set @d=getdate()

and, after each SELECT statement:

select [statement execution time (ms)] = datediff(ms, @d, getdate())

2. Misconception: building an index on any field significantly improves query speed

In fact, we can see that in the example above the 2nd and 3rd statements are identical, and the indexed field is the same; the only difference is that the former has a nonclustered index on fariqi while the latter has the clustered index on that field. Yet their query speeds differ enormously. So building an index on just any field does not necessarily improve query speed.

From the table-creation statements we can see that the fariqi field of this 10-million-row table holds 5,003 distinct values, which makes it well suited for a clustered index. In reality, we issue several documents every day and their release dates coincide, which matches exactly the rule for a clustered index: "neither almost all the same nor almost all distinct". Seen this way, building the "appropriate" clustered index is very important for improving query speed.
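A quick way to check whether a candidate column has a suitable number of distinct values, following the rule just quoted (a sketch):

```sql
-- A column whose values are "neither almost all the same nor
-- almost all distinct" is a good clustered-index candidate.
select count(distinct fariqi) as distinct_dates,   -- 5,003 in this table
       count(*)               as total_rows        -- 10,000,000 here
from   Tgongwen
```

A ratio near 1 (every value unique, like Gid) or near 0 (one value everywhere) would argue against clustering on the column.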

3. Misconception: adding every field that needs fast queries into the clustered index improves query speed

As mentioned above, data queries here always involve the "date" field and the user's own "user name". Since these two fields are so important, we can combine them into a single composite index (compound index).

Many people think that adding any field into the clustered index speeds up queries, while others wonder: if the fields of a composite clustered index are queried separately, will queries slow down? With these questions, let's look at the following query speeds (the result set is all matching data). (In the composite clustered index, the date column fariqi comes first and the user name neibuyonghu second.)

(1) select Gid,fariqi,neibuyonghu,title from Tgongwen where fariqi>'2004-5-5'

Query speed: 2,513 ms

(2) select Gid,fariqi,neibuyonghu,title from Tgongwen where fariqi>'2004-5-5' and neibuyonghu='office'

Query speed: 2,516 ms

(3) select Gid,fariqi,neibuyonghu,title from Tgongwen where neibuyonghu='office'

Query speed: 60,280 ms

From the experiments above we can see that if only the leading column of the composite clustered index is used as the query condition, the query is about as fast as when all columns of the composite index are used, and even slightly faster than using all of them (with equal result-set sizes); but if only the non-leading column of the composite clustered index is used, the index has no effect at all. Of course, statements 1 and 2 are equally fast here because they return the same number of rows; when all columns of the composite index are used and the result set is small, an "index covering" occurs and performance is optimal. Also remember: whichever other columns of the clustered index you use, the leading column must be the most frequently used one.
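A sketch of creating such a composite clustered index (this assumes any existing clustered index on the table has already been dropped, since a table allows only one):

```sql
-- Composite clustered index: fariqi leads because it is the most
-- frequently used query condition; neibuyonghu comes second.
create clustered index IX_Tgongwen_fariqi_user
    on Tgongwen (fariqi, neibuyonghu)
```

Queries filtering on fariqi alone, or on fariqi plus neibuyonghu, can seek on this index; queries filtering on neibuyonghu alone cannot.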

(iv) A summary of index-usage experience not found in other books

1. Querying on the clustered index column is faster than querying on a nonclustered primary key

Example statements (each extracts 250,000 rows):

select Gid,fariqi,neibuyonghu,reader,title from Tgongwen where fariqi='2004-9-16'

Time taken: 3,326 ms

select Gid,fariqi,neibuyonghu,reader,title from Tgongwen where gid<=250000

Time taken: 4,470 ms

Here, using the clustered index is nearly 1/4 faster than using the nonclustered primary key.

2. ORDER BY on the clustered index column is faster than on an ordinary primary key, especially with small data volumes

select Gid,fariqi,neibuyonghu,reader,title from Tgongwen order by fariqi

Time taken: 12,936 ms

select Gid,fariqi,neibuyonghu,reader,title from Tgongwen order by gid

Time taken: 18,843 ms

Here, ordering by the clustered index column is about 3/10 faster than ordering by the ordinary primary key. In fact, when the data volume is small, sorting by the clustered index column is much faster than by a nonclustered index; when the data volume is very large, the difference between the two is less obvious.

3. Using a time period within the clustered index column, search time shrinks in proportion to the fraction of the table the result occupies, regardless of how many rows are returned

select Gid,fariqi,neibuyonghu,reader,title from Tgongwen where fariqi>'2004-1-1'

Time taken: 6,343 ms (millions of rows extracted)

select Gid,fariqi,neibuyonghu,reader,title from Tgongwen where fariqi>'2004-6-6'

Time taken: 3,170 ms (millions of rows extracted)

select Gid,fariqi,neibuyonghu,reader,title from Tgongwen where fariqi='2004-9-16'

Time taken: 3,326 ms (almost identical to the previous result: with the same number of rows extracted, > and = cost the same)

select Gid,fariqi,neibuyonghu,reader,title from Tgongwen where fariqi>'2004-1-1' and fariqi<'2004-6-6'

Time taken: 3,280 ms

4. A date column is not slowed down by data entered with minutes and seconds

In the following example, the rows dated after January 1, 2004 number in the millions but contain only two distinct dates, accurate to the day; the earlier rows also number in the millions and contain 5,000 distinct dates, accurate to the second.

select Gid,fariqi,neibuyonghu,reader,title from Tgongwen where fariqi>'2004-1-1' order by fariqi

Time taken: 6,390 ms

select Gid,fariqi,neibuyonghu,reader,title from Tgongwen where fariqi<'2004-1-1' order by fariqi

Time taken: 6,453 ms

(v) Other considerations

"Water can carry a boat, can also overturn", the index is the same. Indexing can help improve retrieval performance, but too much or improper indexing can lead to inefficient systems. Because the user adds an index to each table, the database does more work. Too many indexes can even cause index fragmentation.

So we should build an "appropriate" indexing system, striving for excellence especially in creating the clustered index, so that the database can deliver high performance.

Of course, in practice, as a conscientious database administrator, you need to test a few more scenarios to find out which is the most efficient and effective.

II. Improving SQL statements

Many people do not know how SQL statements are executed in SQL Server, and fear that their SQL will be misunderstood by it. For example:

select * from table1 where name='zhangsan' and tID>10000

versus:

select * from table1 where tID>10000 and name='zhangsan'

Some people do not know that these two statements execute identically, because read naively they do look different: if tID is the clustered index column, the second statement seemingly only searches the records after the table's first 10,000 rows, while the first statement seemingly checks name='zhangsan' over the whole table first and then filters by the condition tID>10000.

In fact, such fears are unnecessary. SQL Server has a query optimizer that evaluates the search conditions in the WHERE clause and decides which indexes can reduce the search space of a table scan; that is, it optimizes automatically.

Although the query optimizer performs this optimization automatically based on the WHERE clause, it is still necessary to understand how it works; otherwise, the optimizer sometimes will not produce the fast query you intended.

During the query-analysis phase, the optimizer looks at each stage of the query and decides whether it can usefully limit the amount of data scanned. If a stage can be used as a search argument (SARG), it is called optimizable, and the required data can be obtained quickly via the index.

Definition of SARG: an operation used to restrict a search, usually a specific match, a match within a range of values, or an AND connection of two or more such conditions. Its form is:

column name operator <constant or variable>

or

<constant or variable> operator column name

The column name can appear on one side of the operator, and the constant or variable appears on the other side of the operator. Such as:

Name='John'

Price>5000

5000<Price

Name='John' and Price>5000

If an expression does not satisfy the SARG form, it cannot limit the scope of the search; that is, SQL Server must evaluate every row to determine whether it satisfies the WHERE clause. So an index is useless for expressions that do not satisfy the SARG form.

Having introduced SARGs, let's summarize some experience of using them, together with conclusions met in practice that differ from what some written materials claim:

1. Whether a LIKE statement is a SARG depends on the type of wildcard used

For example: name like 'Zhang%' is a SARG,

while: name like '%Zhang' is not a SARG.

The reason is that a wildcard % at the beginning of the string makes the index unusable.

2. OR causes a full table scan

Name='John' and Price>5000 is a SARG, while Name='John' or Price>5000 is not. Using OR causes a full table scan.

3. Statements made non-SARG by negation operators or functions

Statements that do not satisfy the SARG form are typically those containing negation operators, such as NOT, !=, <>, !<, !>, NOT EXISTS, NOT IN, NOT LIKE, as well as function calls. Here are a few examples that do not satisfy the SARG form:

abs(Price)<5000

Name like '%san'

Some expressions, such as:

where Price*2>5000

are nevertheless treated as SARGs: SQL Server converts this into:

where Price>2500/2

This is not recommended, however, because sometimes SQL Server cannot guarantee that the conversion is completely equivalent to the original expression.
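Rather than relying on the optimizer's conversion, the non-SARG examples above can be rewritten into index-friendly form by hand; a sketch (table and column names follow the article's generic examples):

```sql
-- Non-SARG: a function wrapped around the column hides it from the index.
--   where abs(Price) < 5000
-- SARG rewrite: state the same condition on the bare column.
select * from table1 where Price > -5000 and Price < 5000

-- Non-SARG: arithmetic on the column side.
--   where Price * 2 > 5000
-- SARG rewrite: move the arithmetic onto the constant side.
select * from table1 where Price > 2500
```

In both rewrites the column stands alone on one side of the operator, so an index on Price can be used.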

4. IN is equivalent to OR

The statement:

select * from table1 where tid in (2,3)

and:

select * from table1 where tid=2 or tid=3

are the same. Both can cause a full table scan, and any index on tid may go unused.

5. Use NOT as little as possible

6. EXISTS and IN execute with the same efficiency

Much written material says that EXISTS is more efficient than IN, and that NOT EXISTS should replace NOT IN wherever possible. But in fact I experimented and found that, with or without NOT, the two execute equally efficiently. Because a subquery is involved, we experimented on SQL Server's pubs database, turning on SQL Server's statistics I/O before running:

(1) select title,price from titles where title_id in (select title_id from sales where qty>30)

The result of this statement:

Table 'sales'. Scan count, logical reads, physical reads 0, read-ahead reads 0.

Table 'titles'. Scan count 1, logical reads 2, physical reads 0, read-ahead reads 0.

(2) select title,price from titles where exists (select * from sales where sales.title_id=titles.title_id and qty>30)

The result of the second statement:

Table 'sales'. Scan count, logical reads, physical reads 0, read-ahead reads 0.

Table 'titles'. Scan count 1, logical reads 2, physical reads 0, read-ahead reads 0.

From this we can see that EXISTS and IN execute equally efficiently.
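The statistics shown above come from SQL Server's per-session I/O reporting, which is switched on like this (a sketch, run against the pubs sample database):

```sql
-- Report scan count, logical/physical reads and read-ahead reads
-- for every statement that follows in this session.
set statistics io on

select title, price
from   titles
where  exists (select * from sales
               where  sales.title_id = titles.title_id
                 and  qty > 30)

set statistics io off
```

The counts appear on the Messages tab rather than in the result set, one line per table touched by the query.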

7. CHARINDEX() is no more efficient than LIKE with a leading wildcard

Earlier, we noted that a LIKE with a leading wildcard causes a full table scan and so executes slowly. Some material says that replacing such a LIKE with the function charindex() gives a big speedup; after trying it, I found this claim is also wrong:

select Gid,title,fariqi,reader from Tgongwen where charindex('Criminal Investigation Detachment',reader)>0 and fariqi>'2004-5-5'

Time taken: 7 seconds; also: scan count 4, logical reads 7,155, physical reads 0, read-ahead reads 0.

select Gid,title,fariqi,reader from Tgongwen where reader like '%'+'Criminal Investigation Detachment'+'%' and fariqi>'2004-5-5'

Time taken: 7 seconds; also: scan count 4, logical reads 7,155, physical reads 0, read-ahead reads 0.

8. UNION is not always more efficient than OR

We have said that OR in a WHERE clause causes a full table scan, and the material I have seen generally recommends replacing OR with UNION. It turns out this claim is true most of the time.

select Gid,fariqi,neibuyonghu,reader,title from Tgongwen where fariqi='2004-9-16' or gid>9990000

Time taken: seconds. Scan count 1, logical reads 404,008, physical reads 283, read-ahead reads 392,163.

select Gid,fariqi,neibuyonghu,reader,title from Tgongwen where fariqi='2004-9-16'
union
select Gid,fariqi,neibuyonghu,reader,title from Tgongwen where gid>9990000

Time taken: 9 seconds. Scan count 8, logical reads 67,489, physical reads 216, read-ahead reads 7,499.

It seems that, in general, UNION is indeed much more efficient than OR.

But after further experiments, I found that if both sides of the OR test the same column, UNION is much slower than OR, even though UNION scans the index while OR scans the whole table.

select Gid,fariqi,neibuyonghu,reader,title from Tgongwen where fariqi='2004-9-16' or fariqi='2004-2-5'

Time taken: 6,423 ms. Scan count 2, logical reads 14,726, physical reads 1, read-ahead reads 7,176.

select Gid,fariqi,neibuyonghu,reader,title from Tgongwen where fariqi='2004-9-16'
union
select Gid,fariqi,neibuyonghu,reader,title from Tgongwen where fariqi='2004-2-5'

Time taken: 11,640 ms. Scan count 8, logical reads 14,806, physical reads 108, read-ahead reads 1,144.

9. Extract fields on the principle of "take only as much as you need"; avoid "select *"

Let's do a test:

select top 10000 gid,fariqi,reader,title from Tgongwen order by gid desc

Time taken: 4,673 ms

select top 10000 gid,fariqi,title from Tgongwen order by gid desc

Time taken: 1,376 ms

select top 10000 gid,fariqi from Tgongwen order by gid desc

Time taken: ms

From this, every field we omit speeds up data extraction correspondingly, and the speedup depends on the size of the discarded field.

10. COUNT(*) is no slower than COUNT(field)

Some material says that using * to count all columns is obviously less efficient than counting a named column. This claim is in fact unfounded. Look:

select count(*) from Tgongwen

Time taken: 1,500 ms

select count(gid) from Tgongwen

Time taken: 1,483 ms

select count(fariqi) from Tgongwen

Time taken: 3,140 ms

select count(title) from Tgongwen

Time taken: 52,050 ms

As you can see from the above, count(*) and count(primary key) are equally fast, while count(*) is faster than counting any field other than the primary key, and the longer the field, the slower the count. I suspect that with count(*), SQL Server automatically finds the smallest field to count. Of course, writing count(primary key) directly is more direct still.

11. ORDER BY on the clustered index column is the most efficient

Let's see (gid is the primary key; fariqi is the clustered index column):

select top 10000 gid,fariqi,reader,title from Tgongwen

Time taken: 196 ms. Scan count 1, logical reads 289, physical reads 1, read-ahead reads 1,527.

select top 10000 gid,fariqi,reader,title from Tgongwen order by gid asc

Time taken: 4,720 ms. Scan count 1, logical reads 41,956, physical reads 0, read-ahead reads 1,287.

select top 10000 gid,fariqi,reader,title from Tgongwen order by gid desc

Time taken: 4,736 ms. Scan count 1, logical reads 55,350, physical reads 10, read-ahead reads 775.

select top 10000 gid,fariqi,reader,title from Tgongwen order by fariqi asc

Time taken: 173 ms. Scan count 1, logical reads 290, physical reads 0, read-ahead reads 0.

select top 10000 gid,fariqi,reader,title from Tgongwen order by fariqi desc

Time taken: 156 ms. Scan count 1, logical reads 289, physical reads 0, read-ahead reads 0.

As we can see from the above, an unordered query and an "order by clustered index column" query are comparable in speed and in number of logical reads, and both are much faster than queries that "order by a nonclustered index column".

At the same time, sorting by a given field is roughly equally fast in ascending and descending order.

12. Efficient TOP

In fact, when querying and extracting from a very large data set, the biggest factor affecting database response time is not the data lookup but the physical I/O operations. For example:

select top 10 * from (
    select top 10000 gid,fariqi,title from Tgongwen
    where neibuyonghu='office'
    order by gid desc) as a
order by gid asc

In theory, the whole statement should take longer than its subquery, but the opposite is true: the subquery returns 10,000 records while the whole statement returns only 10, so the dominant cost is physical I/O. One of the most effective ways to limit physical I/O here is the TOP keyword, SQL Server's system-optimized word for extracting the first N rows or first N percent of data. From my use of it in practice, TOP is indeed handy and efficient. But this keyword does not exist in Oracle, the other large database, which is a pity, although in Oracle the same thing can be achieved in other ways (for example with ROWNUM). We will use the TOP keyword in a later discussion of "paging display of tens of millions of rows with stored procedures".

So far, we have discussed how to quickly query the data you need out of a large-capacity database. Of course, the methods introduced here are all "soft" methods; in practice we must also consider various "hard" factors, such as network performance, server performance, operating-system performance, and even the network adapters and switches.

III. Implementing a general paging stored procedure for small and massive data volumes

To build a web application, paged browsing is essential, and it is a very common problem in database processing. The classic approach is ADO recordset paging, that is, using ADO's built-in paging capability (which uses cursors). However, this approach only suits small amounts of data, because the cursor itself has drawbacks: a cursor is stored in memory and is very time-consuming, and once a cursor is established, the associated records are locked until it is cancelled. A cursor provides row-by-row scanning of a particular set, typically traversing the data line by line and performing different operations depending on each row. For multiple tables and large tables, looping over a cursor (a large data set) easily sends the program into a long wait or even a crash.

More importantly, for a very large data model, loading the entire data source on every paged retrieval, as the traditional method does, is a waste of resources. The popular approach now is to retrieve only a chunk of data the size of one page, rather than retrieving all the data and then stepping through to the current row.
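As a minimal sketch of that idea (hypothetical names, not from the article): retrieving only a page-sized chunk amounts to slicing out just the rows belonging to the requested page:

```python
def fetch_page(rows, page_size, page_index):
    # Return only the rows belonging to page `page_index` (1-based),
    # instead of materializing the whole result set.
    start = (page_index - 1) * page_size
    return rows[start:start + page_size]

rows = list(range(1, 101))   # stand-in for a 100-row result set
print(fetch_page(rows, 10, 3))   # third page: rows 21..30
```

The rest of this section is about making the database itself produce exactly this slice, without fetching the other rows.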

The earliest method of extracting data by page size and page number was probably the "Russian stored procedure". It uses a cursor, and because of the cursor's limitations, the method has not been widely accepted.

Later, someone on the web modified that stored procedure. The following procedure combines that idea with our office automation example:

CREATE PROCEDURE pagination1
(
    @pagesize int,    -- page size, e.g. 20 records per page
    @pageindex int    -- current page number
)
AS
SET NOCOUNT ON
BEGIN
    DECLARE @indextable TABLE (id int IDENTITY(1,1), nid int)   -- define a table variable
    DECLARE @PageLowerBound int   -- lower bound of this page
    DECLARE @PageUpperBound int   -- upper bound of this page
    SET @PageLowerBound = (@pageindex - 1) * @pagesize
    SET @PageUpperBound = @PageLowerBound + @pagesize
    SET ROWCOUNT @PageUpperBound
    INSERT INTO @indextable (nid)
        SELECT gid FROM Tgongwen
        WHERE fariqi > DATEADD(day, -365, GETDATE())
        ORDER BY fariqi DESC
    SELECT o.gid, o.mid, o.title, o.fadanwei, o.fariqi
    FROM Tgongwen o, @indextable t
    WHERE o.gid = t.nid
      AND t.id > @PageLowerBound AND t.id <= @PageUpperBound
    ORDER BY t.id
END
SET NOCOUNT OFF

The stored procedure above uses one of SQL SERVER's newer features: table variables. It must be said that this is a very good paging stored procedure. Of course, you could also replace the table variable with a temporary table (CREATE TABLE #Temp), but obviously, in SQL SERVER, a temporary table is not as fast as a table variable. When the author first started using this procedure it felt very good, faster than the original ADO approach. But later, I found a method better still.
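The bound arithmetic of pagination1 can be mimicked outside SQL. The following Python sketch (illustrative only; `pagination1_sim` is an invented name) mirrors how SET ROWCOUNT caps the rows materialized and how the lower/upper bounds slice out one page:

```python
def pagination1_sim(gids, page_size, page_index):
    lower = (page_index - 1) * page_size   # @PageLowerBound
    upper = lower + page_size              # @PageUpperBound
    index_table = gids[:upper]             # SET ROWCOUNT @PageUpperBound
    # t.id > @PageLowerBound AND t.id <= @PageUpperBound
    return index_table[lower:upper]

gids = list(range(1, 1001))
print(pagination1_sim(gids, 20, 3))   # page 3: rows 41..60
```

Note that even with SET ROWCOUNT, the server still has to build the index table up to the upper bound, which is why deep pages get progressively slower with this scheme.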

The author once saw a short article on the web, "Retrieving records N through M from a data table", as follows:

To retrieve records N through M from the publish table:

SELECT TOP M-N+1 *
FROM publish
WHERE id NOT IN
    (SELECT TOP N-1 id FROM publish)

where id is the key field of the publish table.
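A hypothetical in-memory simulation of that NOT IN pattern (the `rows_n_to_m` helper is invented for illustration; ids stand in for the publish table's key column) behaves like this:

```python
def rows_n_to_m(ids, n, m):
    skipped = set(ids[:n - 1])                          # SELECT TOP N-1 id
    remaining = [i for i in ids if i not in skipped]    # WHERE id NOT IN (...)
    return remaining[:m - n + 1]                        # SELECT TOP M-N+1 *

ids = list(range(1, 21))
print(rows_n_to_m(ids, 5, 8))   # records 5 through 8
```

The elegance of the idea is that TOP does both the skipping and the taking; the cost, discussed below, is the NOT IN against the skipped set.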

When I saw this article, I was genuinely inspired; the idea was excellent. Later, while working on the office automation system (ASP.NET + C# + SQL SERVER), I suddenly remembered it and thought that, with some changes to the statement, it might make a very good paging stored procedure. I searched the web for the article again, but instead of finding it I found a paging stored procedure someone had written from that statement. It is currently one of the more popular paging stored procedures, and I regret not having rushed to turn the idea into a stored procedure myself:

CREATE PROCEDURE pagination2
(
    @SQL nvarchar(4000),     -- the SQL statement, without an ORDER BY clause
    @Page int,               -- page number
    @RecsPerPage int,        -- number of records per page
    @ID varchar(255),        -- the non-duplicated ID column used for paging
    @Sort varchar(255)       -- sort column and rule
)
AS
DECLARE @Str nvarchar(4000)
SET @Str = 'SELECT TOP ' + CAST(@RecsPerPage AS varchar) + ' * FROM (' + @SQL + ') T WHERE T.' + @ID + ' NOT IN
    (SELECT TOP ' + CAST(@RecsPerPage * (@Page - 1) AS varchar) + ' ' + @ID + ' FROM (' + @SQL + ') T9 ORDER BY ' + @Sort + ')
    ORDER BY ' + @Sort
PRINT @Str
EXEC sp_executesql @Str
GO

In fact, the above statement can be simplified to:

SELECT TOP page-size *
FROM Table1
WHERE id NOT IN
    (SELECT TOP page-size * (page-number - 1) id
     FROM Table1
     ORDER BY id)
ORDER BY id

But this stored procedure has a fatal drawback: it contains NOT IN. Although I could change it to:

SELECT TOP page-size *
FROM Table1 a
WHERE NOT EXISTS
    (SELECT 1
     FROM (SELECT TOP page-size * (page-number - 1) id
           FROM Table1 ORDER BY id) b
     WHERE b.id = a.id)
ORDER BY id

That is, using NOT EXISTS instead of NOT IN. But as we discussed earlier, the two are virtually indistinguishable in execution efficiency.

Even so, this method of combining TOP with NOT IN is still somewhat quicker than using a cursor.

Although NOT EXISTS cannot rescue this stored procedure's efficiency, using SQL SERVER's TOP keyword is still a wise choice, because the ultimate goal of paging optimization is to avoid generating oversized record sets, and we have already mentioned TOP's advantage: it gives us control over data volume.

In the paging algorithm, two key factors affect our query speed: TOP and NOT IN. TOP speeds the query up, while NOT IN slows it down. So, to improve the speed of the whole paging algorithm, we must completely replace NOT IN with something else.

We know that for almost any field, MAX(field) or MIN(field) extracts that field's maximum or minimum value. So, if the field does not repeat, we can use the MAX or MIN of that non-repeating field as a watershed: the reference point for each page in the paging algorithm. Here, the ">" or "<" operator completes the mission, and the query condition stays in SARG form. For example:

SELECT TOP page-size * FROM Table1 WHERE id > 200

This yields the following paging scheme:

SELECT TOP page-size *
FROM Table1
WHERE id >
      (SELECT MAX(id)
       FROM (SELECT TOP page-size * (page-number - 1) id
             FROM Table1 ORDER BY id) AS T)
ORDER BY id
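As an illustrative check of scheme 3's logic (assuming unique, orderable ids; the `scheme3` helper is an invention for this sketch, not part of the article), the MAX of the skipped rows acts as the watershed:

```python
def scheme3(ids, page_size, page_number):
    ordered = sorted(ids)
    skip = (page_number - 1) * page_size
    # SELECT MAX(id) FROM (SELECT TOP skip id ... ORDER BY id) AS T
    watershed = max(ordered[:skip]) if skip else float('-inf')
    # WHERE id > watershed ... SELECT TOP page_size
    return [i for i in ordered if i > watershed][:page_size]

ids = list(range(1, 101))
print(scheme3(ids, 10, 3))   # page 3: ids 21..30
```

The key property is that the outer query is a simple range seek (`id > watershed`), which an index can satisfy directly; only the watershed subquery still pays a TOP-skip cost.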

When choosing a column whose values do not repeat and are easy to compare, we usually pick the primary key. The table below uses the 10-million-record office automation table, ordered by gid (gid is the primary key, but not the clustered index), extracting the gid, fariqi and title fields. The execution speed of the three paging schemes above was tested at pages 1, 10, 100, 500, 1000, 10000, 100000, 250000 and 500000 (times in milliseconds):

Page      Scheme 1   Scheme 2   Scheme 3
1         —          —          —
10        —          —          —
100       1076       720        130
500       540        12943      —
1000      17110      470        —
10000     24796      4500       140
100000    38326      42283      1553
250000    28140      128720     2330
500000    121686     127846     7168

("—" marks a value not preserved in the source.)

As the table shows, all three stored procedures are trustworthy and fast when paging through the first hundred or so pages. But scheme 1 slows sharply after roughly page 1,000; scheme 2 begins to fall off after roughly page 10,000; scheme 3 never drops significantly and still has plenty of stamina at the end.

Having settled on the third paging scheme, we can write a stored procedure around it. Remember that SQL SERVER stored procedures are precompiled SQL statements, which execute more efficiently than SQL statements sent from a web page. The following stored procedure not only contains the paging scheme, but also decides, based on a parameter from the page, whether to count the total number of records.

-- Gets the data for the specified page
CREATE PROCEDURE pagination3
    @tblName varchar(255),              -- table name
    @strGetFields varchar(1000) = '*',  -- columns to return
    @fldName varchar(255) = '',         -- name of the column to sort on
    @PageSize int = 10,                 -- page size
    @PageIndex int = 1,                 -- page number
    @doCount bit = 0,                   -- nonzero: return the total record count
    @OrderType bit = 0,                 -- nonzero: sort descending
    @strWhere varchar(1500) = ''        -- query criteria (note: do not include the word WHERE)
AS
DECLARE @strSQL varchar(5000)    -- the main statement
DECLARE @strTmp varchar(110)     -- temporary variable
DECLARE @strOrder varchar(400)   -- sort clause

IF @doCount != 0
BEGIN
    IF @strWhere != ''
        SET @strSQL = 'SELECT COUNT(*) AS Total FROM [' + @tblName + '] WHERE ' + @strWhere
    ELSE
        SET @strSQL = 'SELECT COUNT(*) AS Total FROM [' + @tblName + ']'
END
-- If @doCount is nonzero, only the count above runs; all code below assumes @doCount is 0
ELSE
BEGIN
    IF @OrderType != 0
    BEGIN
        SET @strTmp = '< (SELECT MIN'
        SET @strOrder = ' ORDER BY [' + @fldName + '] DESC'
        -- When @OrderType is nonzero, descending order is used; this is important
    END
    ELSE
    BEGIN
        SET @strTmp = '> (SELECT MAX'
        SET @strOrder = ' ORDER BY [' + @fldName + '] ASC'
    END

    IF @PageIndex = 1
    BEGIN
        IF @strWhere != ''
            SET @strSQL = 'SELECT TOP ' + STR(@PageSize) + ' ' + @strGetFields
                + ' FROM [' + @tblName + '] WHERE ' + @strWhere + ' ' + @strOrder
        ELSE
            SET @strSQL = 'SELECT TOP ' + STR(@PageSize) + ' ' + @strGetFields
                + ' FROM [' + @tblName + '] ' + @strOrder
        -- The first page runs the code above, which speeds up execution
    END
    ELSE
    BEGIN
        -- The code below assigns the SQL that will actually execute to @strSQL
        SET @strSQL = 'SELECT TOP ' + STR(@PageSize) + ' ' + @strGetFields + ' FROM ['
            + @tblName + '] WHERE [' + @fldName + '] ' + @strTmp + ' (['
            + @fldName + ']) FROM (SELECT TOP ' + STR((@PageIndex - 1) * @PageSize) + ' ['
            + @fldName + '] FROM [' + @tblName + '] ' + @strOrder + ') AS tblTmp) ' + @strOrder

        IF @strWhere != ''
            SET @strSQL = 'SELECT TOP ' + STR(@PageSize) + ' ' + @strGetFields + ' FROM ['
                + @tblName + '] WHERE [' + @fldName + '] ' + @strTmp + ' (['
                + @fldName + ']) FROM (SELECT TOP ' + STR((@PageIndex - 1) * @PageSize) + ' ['
                + @fldName + '] FROM [' + @tblName + '] WHERE ' + @strWhere + ' '
                + @strOrder + ') AS tblTmp) AND ' + @strWhere + ' ' + @strOrder
    END
END
EXEC (@strSQL)
GO

The stored procedure above is a general-purpose one, with its explanation written in the inline comments.

With large data volumes, especially when querying the last few pages, query time generally stays under 9 seconds, whereas other stored procedures in practice cause timeouts, so this procedure is well suited to queries against large-capacity databases.

I hope the analysis of these stored procedures provides some inspiration, brings some efficiency gains to your work, and that colleagues will come up with even better real-time data paging algorithms.

IV. The importance of clustered indexes and how to choose them

In the title of the previous section, I wrote "a general paging stored procedure for small and massive data volumes". This is because, when applying the procedure to the office automation system in practice, the author found that the third stored procedure shows the following behaviour with small data volumes:

1. Paging speed generally stays between 1 and 3 seconds.

2. When querying the last page, the speed is generally 5 to 8 seconds, even if the total is only 3 pages or as many as several hundred thousand pages.

Although at very large capacities this paging implementation is very fast, for the first few pages that 1-3 seconds is slower even than the first, unoptimized paging method. In the users' words, it is "not even as fast as the Access database", an impression serious enough to make users abandon the system you developed.

On analysis, the author found the cause of this phenomenon to be simple yet important: the sorted field is not the clustered index!

The title of this article is "Query optimization and paging algorithm schemes". The author brought the two seemingly unrelated topics of "query optimization" and "paging algorithms" together precisely because both depend on one vitally important thing: the clustered index.

As we mentioned in the earlier discussion, a clustered index has two major advantages:

1. It narrows the query range at the fastest possible speed.

2. It sorts a field at the fastest possible speed.

The first advantage is used mostly in query optimization; the second mostly in sorting data for paging.

Only one clustered index can be created per table, which makes it all the more precious. The choice of the clustered index can be said to be the most critical factor in achieving both "query optimization" and "efficient paging".

However, making the clustered index column satisfy both the query criteria and the sort order is often a contradiction.

In the earlier discussion of indexes, the author made fariqi, the document issue date, the starting column of the clustered index, with day-level precision. The advantages of this approach were mentioned earlier: for quick queries over time ranges it beats using the ID primary key column by a wide margin.

When paging, however, because this clustered index column contains duplicate records, MAX or MIN cannot be used as the paging reference, so more efficient sorting is impossible. Yet if the ID primary key column is made the clustered index, the clustered index is useless for anything except sorting, which wastes this precious resource.

To resolve the contradiction, the author later added a date column whose default value is GETDATE(). When a user writes a record, this column automatically records the time, down to the millisecond. Even so, to rule out even a tiny coincidence, a UNIQUE constraint is also created on the column. This date column then serves as the clustered index column.

With this time column as the clustered index, users can both query by the time range in which data was inserted and, since it is a unique column, apply MAX or MIN to it as the reference point for the paging algorithm.

After this optimization, the author found that whether the data volume is large or small, paging speed is generally dozens of milliseconds, even 0 milliseconds, and queries that narrow by date range are no slower than before.

Clustered indexes are so important and so precious that the author concludes the clustered index must be built on:

1. the field you use most frequently to narrow the query range;

2. the field you use most frequently that needs sorting.

As for what I saw on CSDN, a few points there are problematic.
For example:
"'The primary key is the clustered index' — this idea is extremely wrong and a waste of the clustered index, although SQL Server creates a clustered index on the primary key by default."
I still think we should put the clustered index on the primary key.

His test data is also problematic: 250,000 records fall on the same day, and 9 million on May 5, 2004.
Such test data undermines the generality of the test results.
In other words, his conclusions may only suit 10 million unevenly distributed records, not hundreds of thousands of evenly distributed ones.
