Brief introduction
In SQL Server, data is stored on a page. When you add a clustered index to a table, SQL Server searches for the data by using the columns of the clustered index as the keyword. Therefore, the impact on performance of the selection of clustered indexes becomes very important. This paper focuses on the selection of clustered indexes from the perspective of performance, but this is only considered in terms of performance. For tables with special business requirements, you need to choose the actual situation.
The combination of the column or column that contains the clustered index is best unique
This reason needs to be discussed from the data storage principle. In SQL Server, data is stored not in rows (row), but in page units. Therefore, when looking for data, the smallest unit that SQL Server looks for is actually a page. That is, even if you look for only a small row of data, SQL Server will find the entire page and put it in the buffer pool.
The size of each page is 8K. Each page will have a physical address for SQL Server. This address is written as a file number: Page number (Understanding the file number requires you to understand the file and file group). For example, page 50th of the first file. The page number is 1:50. When a table does not have a clustered index, the data pages in the table are stored in a heap (Heap), and on the basis of the page, SQL Server uniquely identifies each row by an extra line number, which is the legendary RID. The RID is a file number: Page number: line number to represent, assuming that this line is on line 5th of the page mentioned earlier, the RID is represented as 1:50:5, as shown in Figure 1.
Figure 1. Example of a RID
From the concept of the RID, the RID is not just a basis for SQL Server to determine each row, but also a place to store rows. Pages are rarely moved when the page is organized through a heap (Heap).
When a clustered index is established on a table, the pages in the table are organized according to B-Tree. At this point, SQL Server looks for rows that are no longer being looked up by the RID, instead using the keyword, which is the column of the clustered index to look up as a keyword. Assuming the table in Figure 1, we set the DepartmentID column as the clustered index column. The row of the non-leaf node of the B-tree contains only the DepartmentID and the bookmark (bookmark) that points to the next level of nodes.
When we create a clustered index with a value that is not unique, SQL Server cannot determine only one row through the clustered indexed column (that is, the keyword). In this case, in order to achieve a unique distinction for each row, SQL Server is required to make a distinction between the aggregated indexed columns of the same value, which is called uniquifiers. With the use of uniquifier, the impact on performance is divided into the following two parts:
SQL Server must judge the current data at the time of insertion or update whether it duplicates the existing key, and if it repeats, it needs to generate Uniquifier, which is an extra overhead.
Because of the need to add extra uniquifier to the keys of the same value, the size of the key is increased by an extra. Therefore, both leaf nodes and non leaf nodes require more pages for storage. This also affects the nonclustered index, which makes the bookmark columns of the nonclustered index larger so that the nonclustered indexes need more pages to store.
Here we test, create a test table, and create a clustered index. Insert 100,000 test data, each of which repeats in 2, as shown in Figure 2.
See more highlights of this column: http://www.bianceng.cnhttp://www.bianceng.cn/database/SQLServer/
Figure 2. Test code for inserting data
At this point, we'll look at the number of pages in this table, as shown in Figure 3.
Figure 3. After inserting the duplicate key, 100,000 data occupies 359 pages.