Clustered index and non-clustered index (sorting)

Source: Internet
Author: User

From: Http://www.cnblogs.com/aspnethot/articles/1504082.html

Official definitions:

 

Clustered Index

An index in which the logical order of the key values determines the physical order of the corresponding rows in the table.
The clustered index determines the physical order of the data in the table. It is similar to a phone book, which sorts its entries by last name. Because the clustered index dictates the physical storage order of the rows, a table can have only one clustered index. That index can, however, contain multiple columns (a composite index), just as a phone book is organized by last name and then by first name.

Clustered indexes are particularly effective on columns that are frequently searched for ranges of values. Once the clustered index has located the row containing the first value, the rows containing the subsequent values are guaranteed to be physically adjacent. For example, if an application often runs a query that retrieves records within a certain date range, the clustered index can quickly locate the row containing the start date and then read all adjacent rows in the table until the end date is reached. This improves the performance of that type of query. Similarly, if a column is frequently used to sort the data retrieved from a table, the table can be clustered (physically sorted) on that column, which avoids the cost of sorting on every query.

When the index value is unique, a clustered index is also efficient for finding a specific row. For example, the fastest way to find a specific employee by the unique employee ID column emp_id is to create a clustered index or a primary key constraint on the emp_id column.
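The range-scan behavior described above can be sketched with a sorted list standing in for the clustered table. This is only an illustrative model, not how SQL Server stores pages; the table contents, the `range_scan` helper, and the order labels are assumptions made up for the example.

```python
from bisect import bisect_left, bisect_right
from datetime import date

# Hypothetical table: rows are kept physically sorted by order date,
# which plays the role of the clustered index key.
rows = [
    (date(2004, 1, 5), "order A"),
    (date(2004, 2, 10), "order B"),
    (date(2004, 2, 20), "order C"),
    (date(2004, 3, 1), "order D"),
    (date(2004, 4, 15), "order E"),
]
keys = [order_date for order_date, _ in rows]  # already sorted, like the rows

def range_scan(start, end):
    """Return rows with start <= date <= end: seek once, then read neighbors."""
    lo = bisect_left(keys, start)   # position of the first qualifying row
    hi = bisect_right(keys, end)    # position just past the last qualifying row
    return rows[lo:hi]              # one contiguous slice, no per-row lookups

february = range_scan(date(2004, 2, 1), date(2004, 2, 29))
print([label for _, label in february])
```

Because the rows are already in key order, the result also comes back sorted by date without a separate sort step.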

 

 

 

Non-clustered Index

An index in which the logical order of the index differs from the physical storage order of the rows on disk.

Indexes are stored as B-trees (balanced trees, not binary trees). A clustered index can be understood as follows: its leaf nodes are the data nodes themselves. The leaf nodes of a non-clustered index are still index nodes, but each holds a pointer to the corresponding data block. For example:

 

 

 

[Figure: non-clustered index]

[Figure: clustered index]

 

 

I. An in-depth understanding of the index structure

In effect, you can think of an index as a special kind of directory. Microsoft SQL Server provides two types of indexes: clustered indexes and non-clustered indexes (also written "nonclustered"). The following example illustrates the difference between them:
The body of a Chinese dictionary is itself a clustered index. For example, to look up the character "an", we naturally open the first few pages of the dictionary, because its pinyin is "an" and a dictionary sorted by pinyin starts with the letter "a" and ends with "z", so "an" sits near the front. If you go through the whole "a" section without finding the character, your dictionary does not contain it. Similarly, to look up "zhang", you turn to the last part of the dictionary, because its pinyin is "zhang". In other words, the body of the dictionary is itself a directory, and you do not need to consult any other directory to find what you need. We call body content that is itself arranged according to a certain rule a "clustered index".
If you know a character, you can find it quickly this way. However, you may also meet a character you do not know and cannot pronounce. Then the method above fails: you have to find the character through its radical and then turn directly to the page number listed next to it. The order of the characters found through the "radical directory" and the "character lookup table" is not the order of the body. For example, when you look up "zhang", the lookup table shows that "zhang" is on page 672, while the character listed just above it, "chi", is on page 63, and the character listed just below it is on page 390. Clearly these characters are not really located directly above and below "zhang" in the body. The three consecutive entries you see are ordered by the non-clustered index, which is a mapping of the body's characters into that index. You can find the character you need this way, but it takes two steps: first find the entry in the directory, then turn to the page number it gives. We call this arrangement, in which the directory is purely a directory and the body is purely the body, a "non-clustered index".
From this example we can understand what a "clustered index" and a "non-clustered index" are. It also follows that each table can have only one clustered index, because the body can be sorted in only one way.
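The dictionary analogy can be sketched as follows. The entries, the radical keys, and both lookup functions are hypothetical and exist only to show the one-step versus two-step lookup; they are not taken from a real dictionary or from SQL Server.

```python
from bisect import bisect_left

# Body of the dictionary: entries physically sorted by pinyin.
# This sorted body plays the role of the clustered index.
body = [
    ("an", "entry for an"),
    ("chi", "entry for chi"),
    ("zhang", "entry for zhang"),
]
body_keys = [pinyin for pinyin, _ in body]

def lookup_by_pinyin(pinyin):
    """One step: the search lands directly on the data."""
    i = bisect_left(body_keys, pinyin)
    if i < len(body) and body_keys[i] == pinyin:
        return body[i][1]
    return None

# Radical table: sorted by its own key (hypothetical radical name, stroke count)
# and storing only a position in the body. This plays the role of the
# non-clustered index.
radical_index = [
    (("gong", 4), 2),   # points at "zhang"
    (("ma", 3), 1),     # points at "chi"
    (("mian", 3), 0),   # points at "an"
]
radical_keys = [key for key, _ in radical_index]

def lookup_by_radical(key):
    """Two steps: find the pointer in the index, then follow it into the body."""
    i = bisect_left(radical_keys, key)
    if i < len(radical_index) and radical_keys[i] == key:
        position = radical_index[i][1]   # step 1: read the "page number"
        return body[position][1]         # step 2: turn to that page
    return None

print(lookup_by_pinyin("zhang"))
print(lookup_by_radical(("gong", 4)))
```

Neighbors in `radical_index` point to scattered positions in `body`, which is why the body itself can follow only one order.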

II. When to use clustered or non-clustered indexes

The following table summarizes when to use a clustered or a non-clustered index (important):

Action description                        Use clustered index    Use non-clustered index
Column is often grouped or sorted         Should                 Should
Returns data within a certain range       Should                 Should not
One or very few distinct values           Should not             Should not
Small number of distinct values           Should                 Should not
Large number of distinct values           Should not             Should
Frequently updated column                 Should not             Should
Foreign key column                        Should                 Should
Primary key column                        Should                 Should
Index column is frequently modified       Should not             Should



We can understand the table above through the earlier definitions of clustered and non-clustered indexes. Take "returns data within a certain range" as an example. If your table has a time column and you have built a clustered index on it, then querying all data between January 1, 2004 and October 1, 2004 is very fast, because the body of your dictionary is sorted by date: the clustered index only needs to find the first and last rows of the range to be retrieved. A non-clustered index, by contrast, must first look up the page number of each item in the directory and then fetch the content from that page.
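A rough way to see why the range query favors the clustered index is to count how many pages hold the matching rows under each layout. The sketch below is a toy model under stated assumptions: 1000 hypothetical rows, 10 rows per page, and a made-up formula that scatters day values across row ids. It does not reproduce SQL Server's actual page format.

```python
ROWS_PER_PAGE = 10   # assumed page capacity, for illustration only
NUM_ROWS = 1000

# Hypothetical rows (row_id, day) stored in insertion (row_id) order.
# The formula deliberately scatters each day value across the table.
heap = [(row_id, (row_id * 7) % 100) for row_id in range(NUM_ROWS)]

# Clustered layout: the same rows physically re-sorted by day.
clustered = sorted(heap, key=lambda row: row[1])

def pages_touched(table, first_day, last_day):
    """Count distinct pages holding rows with first_day <= day <= last_day."""
    return len({
        position // ROWS_PER_PAGE
        for position, (_, day) in enumerate(table)
        if first_day <= day <= last_day
    })

clustered_pages = pages_touched(clustered, 10, 19)
heap_pages = pages_touched(heap, 10, 19)
print("pages read with rows sorted by day:", clustered_pages)
print("pages read with rows in id order:  ", heap_pages)
```

The same 100 matching rows sit on a handful of adjacent pages when the table is sorted by day, and are spread over many more pages when a separate index has to point into a table stored in id order.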

III. Common misconceptions about index usage in practice

Theory exists to be applied. Although we have just listed when to use clustered or non-clustered indexes, in practice these rules are easily ignored or not analyzed thoroughly against the actual situation. The following discusses common misconceptions about indexing, based on problems encountered in practice, so that you can master how to build indexes.

1. The primary key should be the clustered index
I consider this idea extremely mistaken and a waste of the clustered index, even though SQL Server creates a clustered index on the primary key by default.
We usually create an ID column in every table to distinguish each row. This column auto-increments, generally with a step of 1, as does the GID column in our office automation example. If we make this column the primary key, SQL Server makes it the clustered index by default. The benefit is that your data is physically sorted by ID in the database, but I think this is of little value.
The advantages of a clustered index are obvious, and the rule that each table can have only one makes it all the more precious.
From the definition of a clustered index given earlier, its greatest benefit is that it quickly narrows the query range according to the query conditions and avoids a full table scan. In practice, because the ID is generated automatically and we do not know the ID of each record, we rarely query by ID. That makes an ID primary key a waste of the clustered index. Moreover, a clustered index on a column whose value differs in every row violates the rule that a clustered index should not be built on a column with a large number of distinct values. Admittedly, that rule matters only when users frequently modify record contents, especially the indexed columns; it has no effect on query speed.
In an office automation system, whether it is the home page listing documents and meetings awaiting the user's sign-off or any document query, data queries always depend on the "date" field and the user's own "user name".
The home page of an office automation system usually displays the documents or meetings that each user has not yet signed for. Although the where clause could restrict the query only to items the current user has not signed for, if your system has been running for a long time and holds a lot of data, performing a full table scan every time a user opens the home page makes little sense: the vast majority of users have already read the files from more than a month ago, and scanning them only adds database overhead. In fact, when a user opens the home page we can have the database query only the documents from the last three months that the user has not read, using the "date" field to limit the table scan and speed up the query. If your office automation system has been running for two years, your home page will in theory display 8 times faster than before (24 months of data reduced to 3).
I say "in theory" because if your clustered index is blindly built on the ID primary key, your queries will not be that fast, even if you create a (non-clustered) index on the "date" field. Let us look at the speed of various queries against 10 million rows (250,000 of them from the last three months):

(1) Clustered index on the primary key only, with no time range restriction:

select GID, fariqi, neibuyonghu, title from tgongwen

Time used: 128470 milliseconds (128 seconds)

(2) Clustered index on the primary key and a non-clustered index on fariqi:

select GID, fariqi, neibuyonghu, title from tgongwen
where fariqi > dateadd(day, -90, getdate())

Time used: 53763 milliseconds (54 seconds)

(3) Clustered index on the date column (fariqi):

select GID, fariqi, neibuyonghu, title from tgongwen
where fariqi > dateadd(day, -90, getdate())

Time used: 2423 milliseconds (2 seconds)

Although each statement retrieves 250,000 rows, the differences between the cases are huge, and the clustered index on the date column stands out in particular. In fact, if your database really holds 10 million rows and the primary key is built on the ID column as in cases 1 and 2 above, the web page simply times out and cannot be displayed at all. This is one of the most important reasons I gave up using the ID column as the clustered index. The timings above were obtained by adding the following before each SELECT statement:

declare @d datetime
set @d = getdate()

and the following after the SELECT statement:

select [statement execution time (ms)] = datediff(ms, @d, getdate())
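The same measurement idea (record the time before the statement, then take the difference afterwards) can be sketched on the client side. The `timed_ms` helper and the stand-in workload below are assumptions for illustration; in a real test the callable would run the query through whatever database driver is in use.

```python
import time

def timed_ms(run_query):
    """Run a zero-argument callable and return (result, elapsed milliseconds)."""
    start = time.perf_counter()                  # like: set @d = getdate()
    result = run_query()
    elapsed_ms = (time.perf_counter() - start) * 1000   # like: datediff(ms, @d, getdate())
    return result, elapsed_ms

# Stand-in workload; replace the lambda with a real query call when testing.
result, elapsed_ms = timed_ms(lambda: sum(range(100_000)))
print(f"statement execution time (ms): {elapsed_ms:.1f}")
```

Client-side timing includes network and result-transfer time, so the numbers are not directly comparable with a measurement taken inside the server.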

2. Creating any index will significantly increase query speed
Notice that in the example above, statements 2 and 3 are identical and the indexed field is the same. The only difference is that the former has a non-clustered index on the fariqi field while the latter has a clustered index on it, yet the query speeds are worlds apart. So simply creating an index on a field does not necessarily speed up queries.
From the table creation statement, we can see that the fariqi field in the table with 10 million rows has 5003 distinct values. A clustered index could hardly be more appropriate for this field. In reality we issue several documents every day, and those documents share the same issue date, which fits the requirement for building a clustered index exactly: "the values must be neither mostly identical nor only rarely identical." Seen this way, building the "appropriate" clustered index is very important for query speed.
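A quick calculation with the two figures quoted above (10 million rows, 5003 distinct fariqi values) shows what "neither mostly identical nor only rarely identical" means in numbers. The variable names are made up for this sketch.

```python
total_rows = 10_000_000   # rows in the table, as stated in the text
distinct_dates = 5003     # distinct fariqi values, as stated in the text

avg_rows_per_date = total_rows / distinct_dates   # rows sharing one date, on average
fraction_per_date = 1 / distinct_dates            # share of the table behind one date

print(f"average rows per distinct date: {avg_rows_per_date:.0f}")
print(f"share of the table per date: {fraction_per_date:.4%}")
```

Each date value therefore groups a couple of thousand rows: enough duplication that physical ordering keeps related rows together, but each value still selects only a tiny fraction of the table.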

3. Adding every field that needs faster queries to the clustered index will speed up queries
As mentioned above, data queries always depend on the "date" field and the user's own "user name". Since both fields are this important, we can combine them into a composite index.
Many people believe that adding any field to the clustered index speeds up queries, and some wonder: if the fields of a composite clustered index are queried separately, will the query slow down? With that question in mind, let us look at the following query speeds (each result set contains 250,000 rows). The date column fariqi is the leading column of the composite clustered index, and the user name column neibuyonghu comes after it:

(1) select GID, fariqi, neibuyonghu, title from tgongwen where fariqi > '2017-5-5'

Query time: 2513 ms

(2) select GID, fariqi, neibuyonghu, title from tgongwen
where fariqi > '2017-5-5' and neibuyonghu = ''

Query time: 2516 ms

(3) select GID, fariqi, neibuyonghu, title from tgongwen where neibuyonghu = ''

Query time: 60280 ms

From this experiment we can see that using only the leading column of the clustered index as the query condition is almost as fast as using all columns of the composite clustered index, and even slightly faster (when the result sets are the same size). If only a non-leading column of the composite clustered index is used as the query condition, the index has no effect at all. Of course, statements 1 and 2 take the same time because they return the same number of rows. If all columns of the composite index are used and the result set is small, an "index covering" effect results and performance is optimal. Also remember: whether or not you often use the other columns of the clustered index, its leading column must be the most frequently used column.
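The leading-column effect can be sketched with a list of tuples sorted on a composite key. The row values and both helper functions are hypothetical; the column names fariqi and neibuyonghu are borrowed from the queries above only for readability.

```python
from bisect import bisect_right

# Hypothetical rows sorted by the composite key (fariqi, neibuyonghu),
# the way a composite clustered index would order them.
rows = sorted([
    ("2004-03-01", "alice"),
    ("2004-05-04", "bob"),
    ("2004-05-06", "alice"),
    ("2004-05-06", "carol"),
    ("2004-07-15", "bob"),
])
dates = [fariqi for fariqi, _ in rows]   # the leading column alone is also sorted

def after_date(cutoff):
    """Leading column: binary-search the start, then read only the matching tail."""
    start = bisect_right(dates, cutoff)
    matches = rows[start:]
    return matches, len(matches)          # rows examined == rows returned

def by_user(user):
    """Non-leading column: user names are not globally sorted, so scan everything."""
    matches = [row for row in rows if row[1] == user]
    return matches, len(rows)             # every row had to be examined

print(after_date("2004-05-05"))
print(by_user("bob"))
```

Within one date the user names are ordered, but across the whole table they are not, which is why a condition on the second column alone cannot use the sort order.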

IV. Index usage lessons not found in other books

1. Using the clustered index is faster than using a primary key that is not the clustered index
Here are example statements (both retrieve 250,000 rows):

select GID, fariqi, neibuyonghu, reader, title from tgongwen where fariqi = '2004-9-16'

Time used: 3326 ms

select GID, fariqi, neibuyonghu, reader, title from tgongwen where GID <= 250000

Time used: 4470 ms

Here, using the clustered index is nearly 1/4 faster than using the primary key that is not the clustered index.

2. Using the clustered index for order by is faster than using an ordinary primary key, especially with small data volumes

select GID, fariqi, neibuyonghu, reader, title from tgongwen order by fariqi

Time used: 12936 ms

select GID, fariqi, neibuyonghu, reader, title from tgongwen order by GID

Time used: 18843 ms

Here, ordering by the clustered index is about 3/10 faster than ordering by the ordinary primary key. In fact, with a small data volume, sorting on the clustered index column is clearly much faster than sorting on a non-clustered index column. With a large data volume, for example more than 100,000 rows, the difference between the two is not noticeable.

3. When you query a time range on the clustered index, the search time shrinks in proportion to the fraction of the table's data that is retrieved, no matter how many key values of the clustered index are involved:

select GID, fariqi, neibuyonghu, reader, title from tgongwen where fariqi > '2004-1-1'

Time used: 6343 milliseconds (1 million rows retrieved)

select GID, fariqi, neibuyonghu, reader, title from tgongwen where fariqi > '2004-6-6'

Time used: 3170 milliseconds (500,000 rows retrieved)

select GID, fariqi, neibuyonghu, reader, title from tgongwen where fariqi = '2004-9-16'

Time used: 3326 milliseconds (essentially the same as the previous statement: when the number of rows retrieved is the same, the > and = operators perform the same)

select GID, fariqi, neibuyonghu, reader, title from tgongwen
where fariqi > '2017-1-1' and fariqi < '2017-6-6'

Time used: 3280 milliseconds

4. A date column does not slow queries down because its values include minutes and seconds
In the following example there are 1 million rows in total. 500,000 rows are dated after January 1, 2004, but they contain only two distinct dates, precise to the day. The other 500,000 rows are dated earlier and contain 5000 distinct dates, precise to the second.

select GID, fariqi, neibuyonghu, reader, title from tgongwen
where fariqi > '2004-1-1' order by fariqi

Time used: 6390 milliseconds

select GID, fariqi, neibuyonghu, reader, title from tgongwen
where fariqi < '2004-1-1' order by fariqi

Time used: 6453 milliseconds

5. Other precautions

Indexes cut both ways. They help search performance, but too many or badly chosen indexes make the system inefficient, because every index a user adds to a table means more work for the database. Too many indexes may even cause index fragmentation.
Therefore, we need to build an "appropriate" index system, and the clustered index in particular deserves careful thought, so that your database can deliver high performance.
Of course, in practice, as a conscientious database administrator, you need to test several alternatives to find the one that is most efficient and effective.

 

 

 

 

 
