Introduction
In SQL Server, data is stored in pages. When a table has a clustered index, SQL Server locates rows by using the clustered index column (or columns) as the key.
The choice of clustered index therefore has a large effect on performance. This article looks at that choice from a performance perspective only. For a table with special business requirements, you need to choose according to the actual situation.
The clustered index column or column combination should preferably be unique
The reason for this starts with how data is stored.
In SQL Server, the unit of storage is the page, not the row. So when SQL Server looks for data, the smallest unit it reads is actually a page.
This means that even if you look up a single small row, SQL Server reads the entire page and places it in the buffer pool.
Each page is 8 KB in size and has a physical address, written as file number:page number (understanding the file number requires an understanding of files and filegroups). For example, page 50 of the first file has the page address 1:50.
When a table has no clustered index, its data pages are organized as a heap, and within a page SQL Server uniquely identifies each row by an additional row number. This is the well-known RID. A RID is written as file number:page number:row number. If a row is row 5 on the page mentioned above, its RID is 1:50:5, as shown in Figure 1.
Figure 1. Example of RIDs
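The file:page:row notation can be illustrated with a small sketch. This is only a model of the notation described above, not of how SQL Server stores a RID internally; the names Rid and parse_rid are made up for this example.

```python
from typing import NamedTuple


class Rid(NamedTuple):
    """Row identifier in a heap: file number, page number, row number."""
    file_no: int
    page_no: int
    row_no: int

    def page_address(self) -> str:
        # The page part of the address, e.g. "1:50".
        return f"{self.file_no}:{self.page_no}"

    def __str__(self) -> str:
        return f"{self.file_no}:{self.page_no}:{self.row_no}"


def parse_rid(text: str) -> Rid:
    """Parse a 'file:page:row' string such as '1:50:5' (hypothetical helper)."""
    file_no, page_no, row_no = (int(part) for part in text.split(":"))
    return Rid(file_no, page_no, row_no)


rid = parse_rid("1:50:5")
print(rid.page_address())  # 1:50
print(rid)                 # 1:50:5
```

The point of the sketch is that the RID contains the page address, so knowing the RID is the same as knowing where the row is stored.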
As the concept suggests, a RID is not only how SQL Server uniquely identifies each row but also the storage location of the row. When pages are organized as a heap, the data rarely moves.
When a clustered index is created on a table, the pages of the table are organized as a B-tree. SQL Server then no longer looks up rows by RID but by key, that is, by the clustered index column. Suppose that for the table in Figure 1 we make the DepartmentID column the clustered index column. The rows in the non-leaf nodes of the B-tree then contain only the DepartmentID and a pointer (bookmark) to the node at the next level.
When the values in the clustered index column are not unique, SQL Server cannot uniquely identify a row by the clustered index column (the key) alone.
In that case, to tell the rows apart, SQL Server has to generate additional identifying information for rows that share the same clustered index value. This information is called a uniquifier.
Using a uniquifier affects performance in two ways:
- On insert or update, SQL Server must check whether the new key duplicates an existing key. If it does, a uniquifier has to be generated, which is extra overhead.
- Because a uniquifier is added to keys that share the same value, the key becomes larger. Both leaf nodes and non-leaf nodes therefore need more pages. This also affects nonclustered indexes: their bookmark columns become larger, so nonclustered indexes also need more pages.
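The two costs in the list above can be sketched in a few lines of Python. This is a simplified model, not SQL Server's implementation: it assumes the uniquifier is a 4-byte value that is added only to the second and later rows with the same key, and it ignores any other per-row overhead. The function names are made up for this example.

```python
from collections import defaultdict

UNIQUIFIER_BYTES = 4  # assumption: size of the uniquifier value itself


def assign_uniquifiers(keys):
    """Return (key, uniquifier) pairs; the first row for a key gets None."""
    seen = defaultdict(int)
    rows = []
    for key in keys:
        count = seen[key]  # the duplicate check that every insert has to pay for
        rows.append((key, None if count == 0 else count))
        seen[key] += 1
    return rows


def extra_key_bytes(rows):
    """Bytes added to the keys by uniquifier values alone (simplified)."""
    return sum(UNIQUIFIER_BYTES for _, uniq in rows if uniq is not None)


rows = assign_uniquifiers([10, 10, 20, 30, 30, 30])
print(rows)                   # [(10, None), (10, 1), (20, None), (30, None), (30, 1), (30, 2)]
print(extra_key_bytes(rows))  # 12
```

Every duplicated key makes its row wider, and the same wider key is what a nonclustered index has to carry as its bookmark.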
Let's test this. We create a test table, create a clustered index on it, and insert 100,000 rows of test data in which every two rows share the same key, as shown in Figure 2.
Figure 2. Test code for inserting data
Now let's look at the number of pages the table occupies, as shown in Figure 3.
Figure 3. After inserting duplicate keys, the 100,000 rows occupy 359 pages
Next we insert 100,000 rows again, this time with no duplicate keys, as shown in Figure 4.
Figure 4. Code for inserting 100,000 rows without duplicate keys
This time the number of pages drops to 335, as shown in Figure 5.
Figure 5. Without duplicate keys, the page count drops to 335
Therefore, it is recommended to use a unique key for the clustered index column or columns.
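To put the two page counts from the test side by side, here is a small piece of arithmetic. It uses only the figures reported above (359 and 335 pages) and the 8 KB page size; the function name is made up for this example.

```python
PAGE_BYTES = 8 * 1024  # each page is 8 KB


def page_overhead(pages_with_duplicates: int, pages_unique: int):
    """Return (extra pages, extra percent, extra bytes) caused by duplicate keys."""
    extra_pages = pages_with_duplicates - pages_unique
    extra_percent = round(100 * extra_pages / pages_unique, 1)
    return extra_pages, extra_percent, extra_pages * PAGE_BYTES


print(page_overhead(359, 335))  # (24, 7.2, 196608)
```

In this test, duplicate keys cost 24 extra pages, roughly 7% more storage for the same 100,000 rows.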
It is best to use a narrow column or narrow column combination as the clustered index
The principle is the same as the page reduction above: a narrow column makes the key smaller.
The clustered index then needs fewer non-leaf pages, and the bookmarks in nonclustered indexes become smaller, so those indexes need fewer leaf pages. The end result is better performance.
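A rough estimate shows how the width of the clustered key feeds into the size of a nonclustered index. The sketch below assumes an 8,192-byte page with a 96-byte header and a 2-byte row-offset entry per row; the 7-byte per-row overhead is a placeholder chosen for illustration, and real index rows have format details that this model ignores.

```python
import math

PAGE_BYTES = 8192
PAGE_HEADER_BYTES = 96
SLOT_BYTES = 2  # row-offset array entry per row


def rows_per_page(row_bytes: int) -> int:
    """How many rows of the given size fit on one page (simplified)."""
    usable = PAGE_BYTES - PAGE_HEADER_BYTES
    return usable // (row_bytes + SLOT_BYTES)


def leaf_pages(row_count: int, index_key_bytes: int,
               clustered_key_bytes: int, row_overhead: int = 7) -> int:
    """Estimate leaf pages of a nonclustered index whose bookmark is the clustered key.

    row_overhead is a hypothetical per-row cost, not a documented constant.
    """
    row_bytes = index_key_bytes + clustered_key_bytes + row_overhead
    return math.ceil(row_count / rows_per_page(row_bytes))


# One million rows, 4-byte nonclustered key, two choices of clustered key.
print(leaf_pages(1_000_000, 4, 4))   # 4-byte integer clustered key
print(leaf_pages(1_000_000, 4, 16))  # 16-byte clustered key
```

Under these assumptions the 16-byte clustered key needs noticeably more leaf pages than the 4-byte one, and the same penalty is paid again by every nonclustered index on the table.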
Use a column or column combination whose values rarely change as the clustered index
As we saw earlier, when a clustered index is created on a table, SQL Server looks up rows by key.
Since the data in the B-tree is ordered, when a clustered index key changes, it is not only the value itself that must change but also the position of the row that holds the key. The row may therefore have to move from one page to another so that the data stays in order. This can lead to the following problems:
- Moving a row from one page to another is expensive, and it can affect other rows, which may also have to move. This can cause page splits.
- Moving rows between pages causes index fragmentation.
- A key change affects nonclustered indexes, because their bookmarks must also be updated. This is yet another extra cost.
This is why many tables use a column that is independent of the data itself as the primary key. For example, the Person.Address table in the AdventureWorks database uses AddressID, which is independent of the data itself, as the clustered index column, as shown in Figure 6. If AddressLine1 were the primary key instead, a change to an employee's address would cause the problems listed above.
Figure 6. A column independent of the data itself used as the clustered index column
It is best to use an auto-increment column as the clustered index column
It is also recommended that the clustered index column be an auto-increment (identity) column that is independent of the data itself. As we know, if the clustered index value of a newly added row falls in the middle of the already ordered B-tree, other rows have to be moved to make room for it. This can cause page splits and index fragmentation, and it also adds the burden of updating nonclustered indexes. With an auto-increment column, new rows are always added at the end, which greatly reduces page splits and fragmentation.
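The difference between ever-increasing keys and keys that land anywhere in the index can be seen in a toy simulation. This is a simplified model under stated assumptions, not SQL Server's actual algorithm: pages hold a fixed number of keys, a full page in the middle of the index is split in half, and a key larger than everything in the last page simply starts a new page. The function name and the page capacity are made up for this example.

```python
import bisect
import random


def simulate_inserts(keys, page_capacity=100):
    """Insert unique keys into sorted fixed-capacity pages.

    Returns (page_count, split_count). Assumes page_capacity >= 2.
    """
    pages = [[]]
    splits = 0
    for key in keys:
        # Find the page whose key range should contain the new key.
        firsts = [p[0] for p in pages[1:]]
        idx = bisect.bisect_right(firsts, key)
        page = pages[idx]
        if len(page) < page_capacity:
            bisect.insort(page, key)
        elif idx == len(pages) - 1 and key > page[-1]:
            # Appending past the end: start a fresh page, nothing moves.
            pages.append([key])
        else:
            # Insert into a full page: move half of its rows to a new page.
            half = len(page) // 2
            new_page = page[half:]
            del page[half:]
            pages.insert(idx + 1, new_page)
            splits += 1
            bisect.insort(new_page if key >= new_page[0] else page, key)
    return len(pages), splits


sequential = list(range(10_000))
shuffled = sequential[:]
random.Random(42).shuffle(shuffled)  # stands in for randomly distributed keys

print(simulate_inserts(sequential))  # (100, 0)
print(simulate_inserts(shuffled))    # more pages, many splits
```

In this model the sequential keys fill every page completely and never split, while the shuffled keys leave half-empty pages behind after each split, which is the fragmentation described above.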
I ran into a case like this recently. Every few months, one table would become incredibly slow. The initial investigation pointed to heavy index fragmentation, but rebuilding the indexes every few months was wearing me out.
Eventually I found the cause: the person who designed the database had built the clustered index on a GUID column. Because the GUIDs were generated randomly, new rows could be inserted anywhere in the table, which greatly increased fragmentation and caused the situation described above.
Summary
This article briefly introduced how SQL Server stores data and several ways of defining a clustered index that should be avoided, but it discusses the choice of clustered index only from a performance standpoint.
When choosing a clustered index, you still need to weigh all the relevant factors before deciding.