SQL Server clustered index key selection from the perspective of performance

Introduction

SQL Server stores data in pages. Once a clustered index is added to a table, SQL Server locates rows by using the clustered index columns as the key. The choice of clustered index key therefore has a significant impact on performance. This article discusses how to choose a clustered index key, but only from the perspective of performance. For tables with special business requirements, choose according to the actual situation.

 

The column or combination of columns in the clustered index should be unique

To explain why, we first need to look at how data is stored. SQL Server stores data in pages rather than as individual rows, so the smallest unit that SQL Server reads is a page. Even if you look up a single small row, SQL Server reads the entire page that contains it into the buffer pool.

Each page is 8 KB in size and has a physical address, written as file number:page number (understanding the file number requires some knowledge of files and filegroups). For example, page 50 of the first file has the address 1:50. When a table has no clustered index, its data pages are stored as a heap. Within a page, SQL Server uniquely identifies each row by an additional row number; the result is the well-known RID (row identifier), written as file number:page number:row number. If a row is row 5 on the page above, its RID is 1:50:5.

Figure 1. RID example

Conceptually, the RID is both the means by which SQL Server identifies each row and the storage location of that row. When pages are organized as a heap, they are rarely moved.
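The file number:page number:row number notation can be illustrated with a small Python sketch. The `Rid` type and `parse_rid` helper are hypothetical names introduced only for this example, and the value `1:50:5` is the illustrative row 5 on page 50 of file 1; this models the notation, not SQL Server's internal storage format.

```python
from typing import NamedTuple


class Rid(NamedTuple):
    """Row identifier in the file:page:row notation used in this article."""
    file_no: int
    page_no: int
    row_no: int

    def __str__(self) -> str:
        return f"{self.file_no}:{self.page_no}:{self.row_no}"


def parse_rid(text: str) -> Rid:
    """Split a 'file:page:row' string into its three integer parts."""
    parts = text.split(":")
    if len(parts) != 3:
        raise ValueError(f"expected file:page:row, got {text!r}")
    file_no, page_no, row_no = (int(p) for p in parts)
    return Rid(file_no, page_no, row_no)


rid = parse_rid("1:50:5")
print(rid.page_no)  # 50
print(str(rid))     # 1:50:5
```

Because the RID encodes the physical location directly, a lookup by RID needs no search: the file and page identify where to read, and the row number identifies the row within that page.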

When a clustered index is created on a table, the table's pages are organized as a B-tree. SQL Server then no longer locates rows by RID; instead, it uses the clustered index columns as the key. Suppose that in the table in Figure 1 we make the DepartmentID column the clustered index column. The rows in the non-leaf nodes of the B-tree then contain only the DepartmentID value and a bookmark pointing to the next node.

When the clustered index values are not unique, SQL Server cannot identify a row by the clustered index column (that is, the key) alone. In this case, SQL Server generates an additional identifier for rows that share the same clustered index value in order to tell them apart. This identifier is called a uniquifier. Using a uniquifier affects performance in two ways:

    • When inserting or updating data, SQL Server must check whether the new key value duplicates an existing key. If it does, SQL Server must generate a uniquifier, which is additional overhead.
    • Because a uniquifier is added to keys with the same value to distinguish them, the keys become larger, so both leaf nodes and non-leaf nodes need more pages for storage. This also affects non-clustered indexes: their bookmarks become larger, so non-clustered indexes also need more pages.
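The two costs above can be sketched with a toy Python model. The function names are hypothetical, and two simplifications are assumptions of this sketch rather than statements from the article: each stored uniquifier is taken to be 4 bytes, and the first row with a given key value is taken to carry no stored uniquifier.

```python
from collections import defaultdict

UNIQUIFIER_BYTES = 4  # assumed size of one stored uniquifier


def assign_uniquifiers(keys):
    """Pair each key with a uniquifier: 0 for the first row holding a
    value, then 1, 2, ... for later rows that repeat it."""
    seen = defaultdict(int)
    result = []
    for key in keys:
        # Checking `seen` models the duplicate check done on every insert.
        result.append((key, seen[key]))
        seen[key] += 1
    return result


def extra_key_bytes(keys):
    """Bytes added to the keys, assuming only rows with a non-zero
    uniquifier pay for storing it."""
    return sum(UNIQUIFIER_BYTES for _, u in assign_uniquifiers(keys) if u > 0)


rows = [10, 20, 20, 30, 20]
print(assign_uniquifiers(rows))  # [(10, 0), (20, 0), (20, 1), (30, 0), (20, 2)]
print(extra_key_bytes(rows))     # 8
```

With a unique key, every uniquifier in this model stays at 0 and no extra bytes are added, which is the situation the recommendation in this section aims for.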

Next, we run a test: create a test table with a clustered index and insert 100,000 rows of test data that include repeated key values, as shown in Figure 2.

Figure 2. Test data insertion code

We can then check the number of pages the table occupies, as shown in Figure 3.

Figure 3. Pages occupied by 100,000 rows inserted with duplicate keys

 

Now insert 100,000 rows again, this time with no duplicate keys, as shown in Figure 4.

Figure 4. Code that inserts 100,000 rows with non-duplicate keys

In this case, the number of pages drops to 335, as shown in Figure 5.

Figure 5. With non-duplicate keys, the table shrinks to 335 pages

 

Therefore, we recommend that the column or columns of the clustered index form a unique key.

 

It is best to use a narrow column or a combination of narrow columns as the clustered index column

The reasoning is the same as for reducing pages above. Narrow columns make the key smaller. A smaller key reduces the number of non-leaf nodes in the clustered index and also makes the bookmarks in non-clustered indexes smaller, so their leaf nodes need fewer pages. This ultimately improves performance.
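A rough Python estimate shows how key width affects the number of non-leaf pages. This is a back-of-the-envelope sketch, not SQL Server's actual space accounting: the 96-byte page header and the 9 bytes of per-entry overhead are assumptions chosen for illustration, and `non_leaf_pages` is a hypothetical helper.

```python
PAGE_BYTES = 8 * 1024  # page size stated in this article
HEADER_BYTES = 96      # assumed page header size
ENTRY_OVERHEAD = 9     # assumed per-entry overhead (child pointer, slot, etc.)


def non_leaf_pages(leaf_pages: int, key_bytes: int) -> int:
    """Estimate how many non-leaf pages sit above `leaf_pages` leaf pages
    when every non-leaf entry stores one key plus a fixed overhead."""
    fanout = (PAGE_BYTES - HEADER_BYTES) // (key_bytes + ENTRY_OVERHEAD)
    if fanout < 2:
        raise ValueError("key too wide for this simple model")
    total = 0
    pages = leaf_pages
    while pages > 1:
        # Each level needs one entry per page of the level below it.
        pages = (pages + fanout - 1) // fanout
        total += pages
    return total


# Compare a 4-byte key, a 16-byte key, and a 100-byte key.
for key_bytes in (4, 16, 100):
    print(key_bytes, non_leaf_pages(100_000, key_bytes))
```

Under these assumptions, a wider key lowers the number of entries that fit on each non-leaf page, so more pages (and eventually more levels) are needed to cover the same number of leaf pages.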

 

Use a column or a combination of columns whose values rarely change as the clustered index column

As described earlier, once a clustered index is created on a table, SQL Server locates rows by key. Because the data in the B-tree is kept in key order, changing a clustered index key value changes not only the value itself but also the position of the row that holds it, so the row may have to move from one page to another to keep the order. This can cause the following problems:

    • Moving a row from one page to another has a cost. The move may also affect other rows, so that they have to be moved as well, and page splits may occur.
    • Moving rows between pages produces index fragmentation.
    • Key changes affect non-clustered indexes, because the bookmarks in the non-clustered indexes must also be changed. This is additional overhead.
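The last point in the list above can be sketched with a toy Python model in which every non-clustered index entry stores the clustering key as its bookmark. `ToyTable`, its methods, and the sample column names and values are hypothetical and only illustrate the bookkeeping, not how SQL Server implements it.

```python
class ToyTable:
    """Rows keyed by a clustering key, plus non-clustered indexes whose
    entries store that key as their bookmark."""

    def __init__(self, indexed_columns):
        self.rows = {}                                   # clustering key -> row
        self.indexes = {c: {} for c in indexed_columns}  # column -> {bookmark: value}

    def insert(self, key, row):
        self.rows[key] = row
        for column, entries in self.indexes.items():
            entries[key] = row[column]

    def change_key(self, old_key, new_key):
        """Re-key one row and return how many index entries were rewritten."""
        self.rows[new_key] = self.rows.pop(old_key)
        rewritten = 0
        for entries in self.indexes.values():
            # The bookmark is the clustering key, so each index must follow.
            entries[new_key] = entries.pop(old_key)
            rewritten += 1
        return rewritten


table = ToyTable(indexed_columns=["city", "postal_code"])
table.insert("1 Main St", {"city": "Bothell", "postal_code": "98011"})
print(table.change_key("1 Main St", "2 Oak Ave"))  # 2
```

In this model a single key change costs one extra write per non-clustered index, which is why a key that never changes avoids this overhead entirely.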

This is why many tables use a column unrelated to the data as the primary key. For example, the Person.Address table in the AdventureWorks database uses the AddressID column, which is independent of the data itself, as the clustered index column, as shown in Figure 6. If AddressLine1 were used as the primary key instead, a change to an employee's address could cause the problems listed above.

Figure 6. Using a column unrelated to the data as the clustered index column

 

It is best to use an auto-incrementing column as the clustered index column

We recommend creating an auto-incrementing column that is unrelated to the data as the clustered index column. If a newly added row has to be inserted into the middle of the ordered B-tree, other rows must be moved to make room for it, which can cause page splits and index fragmentation. It also adds the burden of modifying non-clustered indexes. With an auto-incrementing column, new rows are always added at the end, which greatly reduces page splits and fragmentation.

I ran into such a case recently. Every few months, queries against one table became extremely slow, and the table turned out to have a large amount of index fragmentation. Rebuilding the indexes every few months was tedious. In the end, I found that the database designers had built the clustered index on a GUID column. Because GUIDs are generated randomly, a new row may be inserted at any position in the table, which greatly increases fragmentation.
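The difference between auto-incrementing keys and random keys such as GUIDs can be illustrated with a simplified Python simulation. It is a toy model built on assumptions, not a reproduction of SQL Server's page management: a page holds a fixed 100 keys, a key that lands past the end of the last full page starts a new page without moving rows, and any other insert into a full page splits that page in half. `count_page_splits` is a hypothetical helper.

```python
import bisect
import random

PAGE_CAPACITY = 100  # assumed rows per page, chosen only for illustration


def count_page_splits(keys):
    """Insert keys one by one into key-ordered pages and return
    (mid-page splits, total pages)."""
    pages = [[]]  # each page is a sorted list of keys
    bounds = []   # bounds[i] is the smallest key on pages[i + 1]
    splits = 0
    for key in keys:
        i = bisect.bisect_right(bounds, key)  # page whose range covers key
        page = pages[i]
        if len(page) < PAGE_CAPACITY:
            bisect.insort(page, key)
        elif i == len(pages) - 1 and key > page[-1]:
            # Appending past the end: start a new page, no rows move.
            pages.append([key])
            bounds.append(key)
        else:
            # Full page inside the key range: move half of its rows.
            half = PAGE_CAPACITY // 2
            left, right = page[:half], page[half:]
            bisect.insort(left if key < right[0] else right, key)
            pages[i:i + 1] = [left, right]
            bounds.insert(i, right[0])
            splits += 1
    return splits, len(pages)


n = 10_000
sequential = list(range(n))
shuffled = sequential[:]
random.Random(0).shuffle(shuffled)  # stand-in for randomly generated keys

print("sequential:", count_page_splits(sequential))
print("random:    ", count_page_splits(shuffled))
```

In this model the sequential run fills every page completely and never splits one, while the shuffled run splits pages repeatedly and leaves them partly empty, which is the fragmentation pattern described in the anecdote above.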

 

Summary

This article briefly introduced how SQL Server stores data and described several clustered index choices that should be avoided. However, it considers the choice of clustered index only from the perspective of performance; in practice, the choice still requires weighing all factors.
