Choosing SQL Server clustered index keys from a performance perspective

Source: Internet
Author: User

Introduction

SQL Server stores data in pages. Once a table has a clustered index, SQL Server locates rows by using the clustered index columns as the key, so the choice of clustered index key has a significant effect on performance. This article discusses that choice purely from a performance perspective. For tables with special business requirements, choose the key according to the actual situation.

The clustered index column or combination of columns should preferably be unique

The reason follows from how data is stored. SQL Server does not store or read data row by row but page by page, so the smallest unit it reads in a lookup is a page. Even if you look up a single small row, SQL Server reads the entire page containing it into the buffer pool.

Each page is 8 KB in size and has a physical address, written as file number:page number (understanding the file number requires some familiarity with files and filegroups). For example, page 50 of the first file has the address 1:50. When a table has no clustered index, its data pages are stored as a heap, and SQL Server identifies each row within a page by an additional row number. Together these form the row identifier, or RID, written as file number:page number:row number. Assuming a row is row 5 of the page above, its RID is 1:50:5, as shown in Figure 1.

Figure 1. Example of RID

As this shows, a RID is not only how SQL Server uniquely identifies each row but also a record of where the row is stored. When pages are organized as a heap, rows rarely move.
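To make the file number:page number:row number notation concrete, here is a minimal Python sketch that splits a RID string into its three parts. The names `Rid`, `parse_rid` and `page_address` are made up for illustration and are not part of SQL Server.

```python
from typing import NamedTuple


class Rid(NamedTuple):
    """Illustrative model of a RID; the field names are hypothetical."""
    file_id: int   # number of the data file
    page_id: int   # page number within that file
    row_id: int    # row number within that page


def parse_rid(text: str) -> Rid:
    """Split 'file:page:row' text into its three integer parts."""
    file_id, page_id, row_id = (int(part) for part in text.split(":"))
    return Rid(file_id, page_id, row_id)


def page_address(rid: Rid) -> str:
    """Return the 'file:page' address of the page that holds the row."""
    return f"{rid.file_id}:{rid.page_id}"


rid = parse_rid("1:50:5")
print(rid)
print(page_address(rid))
```

Because the page address is part of the RID itself, knowing the RID is enough to know which page to read.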

When a clustered index is created on a table, the pages of the table are organized as a B-tree. SQL Server then no longer looks up rows by RID but by key, that is, by the clustered index column. Suppose we make the DepartmentID column of the table in Figure 1 the clustered index column. Each row in a non-leaf node of the B-tree then contains only a DepartmentID value and a pointer (bookmark) to a node on the next level down.

When the values of the clustered index key are not unique, SQL Server cannot identify a row by the clustered index columns (the key) alone. To distinguish such rows, SQL Server generates an additional identifying value for rows that share the same key value, known as a uniquifier. The uniquifier affects performance in two ways:

    • On every insert or update, SQL Server must check whether the new key value duplicates an existing one and, if so, generate a uniquifier. This is extra overhead.
    • Adding a uniquifier to rows with duplicate key values makes the key larger, so both leaf and non-leaf nodes need more pages. Nonclustered indexes are affected too: their bookmark columns, which hold the clustered index key, become larger, so nonclustered indexes also need more pages.
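The two points above can be pictured with a small Python model. It is only a sketch of the idea, not SQL Server's actual implementation: it assumes that the first row with a given key needs no uniquifier, that later duplicates are numbered 1, 2, and so on, and that each uniquifier costs 4 bytes. The function names are invented for this example.

```python
from itertools import groupby

UNIQUIFIER_BYTES = 4  # assumed size of one uniquifier value


def assign_uniquifiers(keys):
    """Return (key, uniquifier) pairs in key order.

    The first row for each key gets None (no uniquifier needed);
    later duplicates of the same key get 1, 2, 3, ...
    """
    result = []
    for key, group in groupby(sorted(keys)):
        for n, _ in enumerate(group):
            result.append((key, None if n == 0 else n))
    return result


def extra_bytes(rows):
    """Bytes added by uniquifiers, under the 4-byte assumption above."""
    return sum(UNIQUIFIER_BYTES for _, uniq in rows if uniq is not None)


rows = assign_uniquifiers([10, 20, 10, 30, 20, 10])
print(rows)
print(extra_bytes(rows))
```

In this model only the repeated rows pay for a uniquifier, so a key with no duplicates adds no bytes at all.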

Let us test this. We create a test table with a clustered index and insert 100,000 rows of test data in which each key value occurs twice, as shown in Figure 2.

Figure 2: Test code for inserting data

Now let's look at the number of pages this table occupies, as shown in Figure 3.

Figure 3: 100,000 rows with duplicate keys occupy 359 pages

Next we insert 100,000 rows with non-repeating keys instead, as shown in Figure 4.

Figure 4: Code for inserting 100,000 rows with non-repeating keys

This time the page count drops to 335 pages, as shown in Figure 5.

Figure 5: Page count reduced to 335 after inserting non-repeating keys

Therefore, it is recommended to use a unique key for the clustered index.
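As a rough sanity check on these numbers, the following Python sketch compares the observed difference in page counts with a back-of-the-envelope estimate. The estimate rests on assumptions that are not stated in the test itself: that "each key occurs twice" means half of the rows are duplicates, that each duplicate carries a 4-byte uniquifier, and that page headers and other per-row overhead can be ignored.

```python
PAGE_BYTES = 8 * 1024        # 8 KB page size
ROWS = 100_000               # rows inserted in each test
DUPLICATE_ROWS = ROWS // 2   # assumption: every key occurs twice
UNIQUIFIER_BYTES = 4         # assumption: size of one uniquifier

pages_with_duplicates = 359  # from Figure 3
pages_unique = 335           # from Figure 5

observed_extra_pages = pages_with_duplicates - pages_unique
estimated_extra_pages = DUPLICATE_ROWS * UNIQUIFIER_BYTES / PAGE_BYTES
growth_percent = observed_extra_pages / pages_unique * 100

print(observed_extra_pages)             # 24
print(round(estimated_extra_pages, 1))  # 24.4
print(round(growth_percent, 1))         # 7.2
```

Under these assumptions the estimate is of the same order as the 24 extra pages measured, which is consistent with the uniquifier accounting for the difference. It is an approximation, not a precise accounting of the row format.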

It is best to use a narrow column or combination of narrow columns as the clustered index column

The principle is the same as for reducing page counts above: narrow columns make the key smaller. The clustered index then has fewer non-leaf pages, and the bookmarks in nonclustered indexes are smaller, so those indexes have fewer leaf pages. The result is better performance.
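A simple calculation shows how the key width multiplies across nonclustered indexes. This Python sketch assumes that every leaf row of every nonclustered index stores one copy of the clustered key as its bookmark. The row count and index count are hypothetical, and `key_overhead_bytes` is an illustrative helper. The key sizes are those of a 4-byte integer and a 16-byte GUID.

```python
def key_overhead_bytes(row_count: int, key_bytes: int, nonclustered_indexes: int) -> int:
    """Bytes used by copies of the clustered key in nonclustered index leaf rows.

    Assumes one copy of the key per row per nonclustered index.
    """
    return row_count * key_bytes * nonclustered_indexes


INT_KEY_BYTES = 4    # 4-byte integer key
GUID_KEY_BYTES = 16  # 16-byte GUID key

rows, indexes = 1_000_000, 3  # hypothetical table: one million rows, three nonclustered indexes

narrow = key_overhead_bytes(rows, INT_KEY_BYTES, indexes)
wide = key_overhead_bytes(rows, GUID_KEY_BYTES, indexes)

print(narrow)          # 12000000
print(wide)            # 48000000
print(wide // narrow)  # 4
```

The wider key costs four times as much space in every index that carries it, before counting the clustered index's own non-leaf pages.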

Use a column or combination of columns whose values rarely change as the clustered index column

As explained earlier, once a table has a clustered index, SQL Server finds rows by key. Because data in the B-tree is kept in key order, changing a clustered index key means not only changing the value itself but also moving the row to the position that matches its new key, possibly from one page to another, to preserve the order. This can cause the following problems:

    • Moving a row from one page to another is itself overhead, and it can force other rows to move as well, which may cause page splits
    • Moving rows between pages causes index fragmentation
    • A key change also affects nonclustered indexes, whose bookmarks must be updated to match, which is additional overhead
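The row movement described above can be pictured with a sorted Python list standing in for the ordered leaf level of the index. This is a deliberately simplified model: pages are imagined as fixed runs of four consecutive positions, and `update_key` and `ROWS_PER_PAGE` are names invented for the example.

```python
import bisect

ROWS_PER_PAGE = 4  # hypothetical number of rows per page


def update_key(sorted_keys, old_key, new_key):
    """Change one key while keeping the list ordered.

    Returns (old_position, new_position) so the move can be inspected.
    """
    old_pos = bisect.bisect_left(sorted_keys, old_key)
    if old_pos == len(sorted_keys) or sorted_keys[old_pos] != old_key:
        raise KeyError(old_key)
    del sorted_keys[old_pos]                 # remove the row from its old place
    new_pos = bisect.bisect_left(sorted_keys, new_key)
    sorted_keys.insert(new_pos, new_key)     # re-insert where the new key belongs
    return old_pos, new_pos


keys = list(range(10, 130, 10))  # 10, 20, ..., 120: twelve rows on three pages
old_pos, new_pos = update_key(keys, 20, 115)

print(old_pos // ROWS_PER_PAGE)  # page the row started on: 0
print(new_pos // ROWS_PER_PAGE)  # page the row ends up on: 2
print(keys)
```

Changing the key from 20 to 115 is effectively a delete followed by an insert somewhere else, and every row between the two positions shifts as a result.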

This is why many tables use a column unrelated to the data itself as the primary key. For example, the Person.Address table in the AdventureWorks database uses AddressID, which is not part of the address data, as the clustered index column, as shown in Figure 6. If AddressLine1 were the primary key instead, a change to an employee's address could cause the problems listed above.

Figure 6. Create a column independent of the data itself as a clustered index column

It is best to use an auto-increment column as the clustered index column

It is also advisable to make that data-independent column an auto-increment column. If a new row has to be inserted into the middle of the ordered B-tree, existing rows must be moved to make room for it, which can cause page splits and index fragmentation, and modifying the nonclustered indexes adds further overhead. With an auto-increment column, new rows are always added at the end, which greatly reduces page splits and fragmentation.

I ran into this recently. One table became incredibly slow every few months, and at first glance the cause was heavy index fragmentation. Rebuilding the index every few months was tiresome, though. I eventually found the real problem: whoever designed the database had built the clustered index on a GUID column. Because the GUIDs were generated randomly, new rows could land anywhere in the table, which greatly increased fragmentation.
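The difference between ever-increasing keys and randomly ordered keys can be illustrated with a small Python simulation. It is a toy model, not SQL Server's storage engine: leaf pages hold a fixed, hypothetical number of keys, a full page that receives a key in the middle is split in half, and a key larger than everything so far simply starts a new page. Shuffled integers stand in for random GUIDs.

```python
import bisect
import random

PAGE_CAPACITY = 100  # hypothetical number of rows per leaf page


def simulate_inserts(keys, capacity=PAGE_CAPACITY):
    """Insert unique keys one at a time into ordered leaf pages.

    Returns (page_count, split_count, average_fill).
    """
    pages = [[]]             # each page is a sorted list of keys
    lows = [float("-inf")]   # lower bound of the key range of each page
    splits = 0
    for key in keys:
        i = bisect.bisect_right(lows, key) - 1   # page whose range covers the key
        page = pages[i]
        if len(page) == capacity:
            if i == len(pages) - 1 and key > page[-1]:
                # Key goes past the end of the index: start a new page, no rows move.
                pages.append([key])
                lows.append(key)
                continue
            # Key belongs inside a full page: move the upper half to a new page.
            half = capacity // 2
            new_page = page[half:]
            del page[half:]
            pages.insert(i + 1, new_page)
            lows.insert(i + 1, new_page[0])
            splits += 1
            if key >= new_page[0]:
                page = new_page
        bisect.insort(page, key)
    fill = sum(len(p) for p in pages) / (len(pages) * capacity)
    return len(pages), splits, fill


N = 20_000
sequential_keys = list(range(N))     # like an auto-increment column
random_keys = list(range(N))
random.Random(42).shuffle(random_keys)  # stand-in for randomly generated GUIDs

seq_pages, seq_splits, seq_fill = simulate_inserts(sequential_keys)
rnd_pages, rnd_splits, rnd_fill = simulate_inserts(random_keys)

print("sequential:", seq_pages, "pages,", seq_splits, "splits, fill", round(seq_fill, 2))
print("random:    ", rnd_pages, "pages,", rnd_splits, "splits, fill", round(rnd_fill, 2))
```

In this model the sequential keys fill every page completely with no splits, while the random keys cause many splits and leave pages partly empty, so the same rows occupy more pages. Real behaviour depends on fill factor, row size and other settings, so the numbers indicate the direction of the effect only.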

Summary

This article briefly introduced how SQL Server stores data and described several clustered index choices to avoid, but it considered the choice only from the perspective of performance. In practice, the clustered index should be chosen with all relevant factors in mind.
