Database theory defines the two kinds of index as follows: in a clustered index, the index order is the physical storage order of the data; in a non-clustered index, the index order is independent of the physical arrangement of the data. For this reason, a table can have at most one clustered index.
However, this definition is too abstract. In SQL Server, indexes are organized as B-trees (balanced trees, not binary trees). We can understand a clustered index as follows: the leaf nodes of the index are the data nodes themselves. The leaf nodes of a non-clustered index are still index nodes, each holding a pointer to the corresponding data block. For example:
[Figure: Non-clustered Index]
[Figure: Clustered Index]
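The two layouts described above can be sketched with a toy Python model. This is only an illustration: the dictionaries, the "rid-1" style row locators, and the sample rows are made up for the example and do not reflect how SQL Server lays out its pages.

```python
# Toy model only: real SQL Server index pages are far more elaborate.

# Data blocks, addressed by a hypothetical row locator.
data_blocks = {
    "rid-1": {"id": 1, "name": "alice"},
    "rid-2": {"id": 2, "name": "bob"},
}

# Non-clustered index: each leaf entry holds the key plus a pointer to a data block.
nonclustered_leaf = {1: "rid-1", 2: "rid-2"}

# Clustered index: each leaf entry is the data row itself.
clustered_leaf = {
    1: {"id": 1, "name": "alice"},
    2: {"id": 2, "name": "bob"},
}


def lookup_nonclustered(key):
    """Two steps: find the pointer in the leaf, then follow it to the data block."""
    rid = nonclustered_leaf[key]
    return data_blocks[rid]


def lookup_clustered(key):
    """One step: the leaf entry already is the row."""
    return clustered_leaf[key]


print(lookup_nonclustered(2))
print(lookup_clustered(2))
```

Both lookups return the same row; the difference is the extra hop from the index leaf to the data block in the non-clustered case.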
What is the essential difference between a clustered index and a non-clustered index? When should a clustered index be used, and when a non-clustered one?
This is a complicated question, and it is hard to answer in a few words. Here is a brief introduction from the perspective of SQL Server index and query optimization (if you are interested in this topic, you can read Chapter 3, on data structures, and Chapters 6, 13, and 14 of Microsoft SQL Server 2000 Database Programming, published by Microsoft).
I. Differences between index blocks and data blocks
As we all know, an index improves retrieval efficiency because it is organized as a B-tree and occupies little space, so it is fast to access. Let's work through a calculation. If a record in the table occupies 1000 bytes on disk and we index a 10-byte field, the index entry for that record takes only 10 bytes. The minimum space allocation unit of SQL Server is the page. If one page occupies 8 K on disk (rounded to 8000 bytes here), a page can store 8 of these records but 800 index entries. Now suppose we need to retrieve the records that meet some condition from a table of 8000 records. Without an index, we may have to traverse 8000 records × 1000 bytes / 8 K bytes = 1000 pages to find the result. With the above index on the search field, we can find the index entries that meet the condition within 8000 records × 10 bytes / 8 K bytes = 10 pages and then follow the pointers in those entries to the matching data blocks one by one, so far less I/O is needed.
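The arithmetic in this section can be checked with a few lines of Python. This is a minimal sketch under the text's own simplification that an "8 K" page holds 8000 bytes; the variable names are made up for the example, and page headers and other per-row overhead are ignored.

```python
# Simplifying assumption from the text: an "8 K" page is treated as 8000 bytes.
PAGE_BYTES = 8000
ROW_BYTES = 1000    # size of one record on disk
KEY_BYTES = 10      # size of the indexed field
ROW_COUNT = 8000    # records in the table

rows_per_page = PAGE_BYTES // ROW_BYTES             # records that fit in one page
keys_per_page = PAGE_BYTES // KEY_BYTES             # index entries that fit in one page
data_pages = ROW_COUNT * ROW_BYTES // PAGE_BYTES    # pages touched by a full scan
index_pages = ROW_COUNT * KEY_BYTES // PAGE_BYTES   # pages holding the index entries

print(f"{rows_per_page} records or {keys_per_page} index entries per page")
print(f"{data_pages} data pages, {index_pages} index pages")
```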
II. Index Optimization Technology
Is retrieval always faster when there is an index? No. In some cases, using an index is slower than not using one. Suppose we want to retrieve every record in the table above. Without an index, we need to access 8000 records × 1000 bytes / 8 K bytes = 1000 pages. With an index, we must first read the index, accessing 8000 records × 10 bytes / 8 K bytes = 10 pages to obtain the index search results, and then fetch the corresponding data pages. Because all the data is being retrieved, that is another 8000 records × 1000 bytes / 8 K bytes = 1000 pages. In total 1010 pages are accessed, which is clearly slower than going without the index.
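The comparison above can be turned into a small cost function. This is a rough sketch of the text's model, not of SQL Server's actual optimizer: the function name `pages_read` and its parameters are hypothetical, the whole index is assumed to be read, and each matching row is assumed to sit on its own data page in the worst case.

```python
def pages_read(matching_rows, use_index, row_count=8000,
               row_bytes=1000, key_bytes=10, page_bytes=8000):
    """Estimate pages accessed under the text's simplified model."""
    data_pages = row_count * row_bytes // page_bytes
    if not use_index:
        # Table scan: every data page is read once.
        return data_pages
    index_pages = row_count * key_bytes // page_bytes
    # Worst case: one data page per matching row, capped at the table size.
    return index_pages + min(matching_rows, data_pages)


# Retrieving everything: the index only adds work.
print(pages_read(8000, use_index=False))  # 1000
print(pages_read(8000, use_index=True))   # 1010

# Retrieving a single row: the index wins easily.
print(pages_read(1, use_index=True))      # 11
```

Under these assumptions the index pays off only while the number of matching rows stays well below the number of data pages.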
SQL Server has a complete set of data retrieval optimization techniques. In the case above, SQL Server's query plan will automatically use a table scan instead of any index. So how does SQL Server know when to use an index and when not to? In addition to maintaining the data itself, SQL Server also maintains statistics about the data, which are shown on one of the database property pages:
We can see that SQL Server maintains statistics automatically, including data density and data distribution information. This information helps SQL Server decide how to build a query plan, whether to use an index, and which index to use for a query (we will not explain here exactly how the statistics feed into the query plan). Let's run an experiment. Create a table tabTest (ID, UnqValue, IntValue), where ID is an auto-incrementing integer primary key, UnqValue is of type uniqueidentifier with an ordinary index on it, and IntValue is an integer with no index. The un-indexed IntValue column is there to keep SQL Server from answering the query entirely from an index (a covering index), which would spoil the experiment. Insert 10000 random records into the table. The code is as follows:
-- Test table: ID is the clustered primary key, UnqValue gets an ordinary index,
-- and IntValue is deliberately left without an index.
CREATE TABLE [dbo].[tabTest] (
    [ID] [int] IDENTITY (1, 1) NOT NULL,
    [UnqValue] [uniqueidentifier] NOT NULL,
    [IntValue] [int] NOT NULL
) ON [PRIMARY]
GO
ALTER TABLE [dbo].[tabTest] WITH NOCHECK ADD
    CONSTRAINT [PK_tabTest] PRIMARY KEY CLUSTERED
    (
        [ID]
    ) ON [PRIMARY]
GO
-- Fill UnqValue with a new GUID by default.
ALTER TABLE [dbo].[tabTest] ADD
    CONSTRAINT [DF_tabTest_unqValue] DEFAULT (newid()) FOR [UnqValue]
GO
CREATE INDEX [IX_tabTest_unqValue] ON [dbo].[tabTest] ([UnqValue]) ON [PRIMARY]
GO
-- Insert 10000 rows, each with a random IntValue from 0 to 999.
DECLARE @i int
DECLARE @v int
SET @i = 0
WHILE @i < 10000
BEGIN
    SET @v = rand() * 1000
    INSERT INTO tabTest ([IntValue]) VALUES (@v)
    SET @i = @i + 1
END
Then we execute two queries and view their execution plans. (The execution plan can be turned on from the Query menu of Query Analyzer. The GUID used by the first query in the figure was taken from the author's database; substitute a value from your own database.):
As you can see, in the first query SQL Server uses the IX_tabTest_unqValue index. Following the direction of the arrows, it first finds the index range, then uses a Bookmark Lookup to map the index nodes to the data nodes, and finally returns the SELECT result. In the second query, the system traverses the whole table to produce the result, yet it does so through the clustered index. Why? Don't forget that the leaf nodes of the clustered index are the data nodes! Going through the clustered index is faster this way (it is not affected by the storage holes left behind by deletes and updates, whereas traversing the data directly has to skip over those holes).
Next, we change the clustered index on the ID field to a non-clustered index and execute select * from tabTest again. The execution plan changes:
Instead of using any index, SQL Server performs a Table Scan directly, because that is the most efficient way to retrieve every row.
III. Essential differences between clustered and non-clustered indexes
Now we can discuss the essential differences between clustered and non-clustered indexes. As shown in the two figures at the beginning, the leaf nodes of a clustered index are the data nodes, while the leaf nodes of a non-clustered index are still index nodes that keep a link to the corresponding data block.
Let's look at the difference through a calculation. Suppose there is a table with 8000 records, each occupying 1000 bytes on disk. If we create a non-clustered primary key on a 10-byte field, this simplified model needs 16000 index nodes (8000 of them are leaf nodes, each pointing to one data record). The data then occupies 8000 records × 1000 bytes / 8 K bytes = 1000 pages, and the index occupies 16000 nodes × 10 bytes / 8 K bytes = 20 pages, for a total of 1020 pages.
For the same table, if we create a clustered primary key on the same field, the leaf nodes of the clustered index are the data nodes, so there are only 8000 index nodes, occupying 10 pages. The data still occupies 1000 pages, for a total of 1010 pages.
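The storage figures for the two kinds of primary key can be reproduced as follows. This is a sketch of the text's simplified model only: the node counts (16000 and 8000) are taken from the text as given, the page size is rounded to 8000 bytes, and the variable names are made up for the example.

```python
PAGE_BYTES = 8000   # the text's rounded "8 K" page
ROW_COUNT = 8000
ROW_BYTES = 1000
KEY_BYTES = 10

data_pages = ROW_COUNT * ROW_BYTES // PAGE_BYTES

# Non-clustered primary key: the text counts 8000 leaf nodes plus
# 8000 further index nodes, 16000 in all.
nonclustered_nodes = 2 * ROW_COUNT
nonclustered_index_pages = nonclustered_nodes * KEY_BYTES // PAGE_BYTES

# Clustered primary key: the leaves are the data rows themselves,
# so the text counts only 8000 index nodes.
clustered_nodes = ROW_COUNT
clustered_index_pages = clustered_nodes * KEY_BYTES // PAGE_BYTES

print("non-clustered total:", data_pages + nonclustered_index_pages)
print("clustered total:", data_pages + clustered_index_pages)
```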
Next, let's look at why, in this model, a non-clustered primary key is faster than a clustered primary key during inserts. The primary key constraint requires that key values not repeat. How does SQL Server know that a new key is not a duplicate? The only way is to search. For the non-clustered index, it only needs to examine the 16000 nodes on 20 pages, because all primary key values are contained in those index nodes. For the clustered index, however, the index nodes are only the 8000 intermediate nodes; to rule out a duplicate, the model also has to examine the 1000 data pages, which amounts to searching 10 + 1000 = 1010 pages. Under this model, therefore, inserting with a clustered primary key is much slower than inserting with a non-clustered one. Note that this is a worst-case count: in practice SQL Server walks the B-tree from the root down to a single leaf for this check, so the real gap is far smaller than the model suggests.
Now let's look at the efficiency of data retrieval. Suppose we search the two tables above and the index is used (sometimes the SQL Server execution plan will choose not to use the index, but let's assume here that it must). For retrieval through the clustered index, we may access 10 index pages and 1000 data pages to get the result (this is the worst case; actual cases are better). For the non-clustered index, the system finds the qualifying nodes in 20 pages and then maps them to 1000 data pages (again the worst case). One accesses 1010 pages and the other 1020 pages, so the retrieval efficiency is not very different. Therefore both clustered and non-clustered indexes are suitable for sorting; the clustered index is only slightly faster than the non-clustered one.
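To put the two comparisons side by side, the following sketch plugs in the page counts quoted in the text. The constants are the text's worst-case figures, not measurements, and the variable names are made up for the example.

```python
# Page counts taken from the text's simplified, worst-case model.
NONCLUSTERED_INDEX_PAGES = 20
CLUSTERED_INDEX_PAGES = 10
DATA_PAGES = 1000

# Duplicate check on insert, as the text models it.
dup_check_nonclustered = NONCLUSTERED_INDEX_PAGES
dup_check_clustered = CLUSTERED_INDEX_PAGES + DATA_PAGES

# Worst-case retrieval through the index.
retrieve_nonclustered = NONCLUSTERED_INDEX_PAGES + DATA_PAGES
retrieve_clustered = CLUSTERED_INDEX_PAGES + DATA_PAGES

dup_ratio = dup_check_clustered / dup_check_nonclustered
retrieve_ratio = retrieve_nonclustered / retrieve_clustered

print(f"duplicate check: clustered key reads {dup_ratio:.1f}x as many pages")
print(f"retrieval: non-clustered key reads {retrieve_ratio:.3f}x as many pages")
```

In this model the insert-time check differs by a large factor, while retrieval differs by about one percent.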
Conclusion
Well, I have been writing for a long time and my hands are tired, so I will not run the experiment comparing the efficiency of clustered and non-clustered indexes. If you are interested, you can use Query Analyzer to examine the query plans yourself. SQL Server is a complicated system, especially its indexing and query optimization, and Oracle is more complicated still. Understanding what goes on behind indexes and queries is no bad thing: it helps us understand our own systems better.