In database theory, a clustered index is defined as one whose order matches the physical storage order of the data, while a nonclustered index is one whose order is independent of the physical order of the data. Because of this, a table can have at most one clustered index.
But this definition is too abstract. In SQL Server, an index is stored as a B-tree (a balanced tree), which gives a more concrete way to understand it: in a clustered index, the leaf nodes of the index are the data nodes themselves. In a nonclustered index, the leaf nodes are still index nodes, each holding a pointer to the corresponding data block. See the following figures:
Non-clustered index
Clustered index
What is the essential difference between a clustered index and a non-clustered index? When do I use a clustered index, and when do I use nonclustered indexes?
This is a complicated question, and it is hard to answer clearly in words alone. Here I will discuss it briefly from the perspective of index tuning for SQL Server queries (if you are interested in the topic, see Microsoft SQL Server 2000 Database Programming: the introduction to data structures in unit 3, and units 6, 13, and 14).
One, the difference between the index block and the data block
As we all know, an index improves retrieval efficiency because of its tree structure and its small footprint, which makes access fast. Let's work through some arithmetic: if a record in a table occupies 1,000 bytes on disk and we index a 10-byte field, then the index entry for that record is only 10 bytes. SQL Server's smallest unit of space allocation is the page, and a page occupies 8K on disk, so a page can hold about 8 of these records but about 800 index entries. Now suppose we retrieve one record from a table with 8,000 records. Without an index, we might have to traverse 8,000 x 1,000 bytes / 8K bytes = 1,000 pages to find it. With the index above on the search field, we can find the matching index block within 8,000 x 10 bytes / 8K bytes = 10 pages and then follow the pointer on the index block to the data block, so far less IO is needed.
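As a rough sketch, the arithmetic above can be reproduced in T-SQL. The variable names are my own, and 8K is rounded to 8,000 bytes as in the text (a real page is 8,192 bytes, part of which is overhead), so treat the results as estimates only.

```sql
-- Page-count estimate using the article's round numbers.
DECLARE @rows int, @row_bytes int, @key_bytes int, @page_bytes int
SET @rows = 8000        -- records in the table
SET @row_bytes = 1000   -- bytes per record
SET @key_bytes = 10     -- bytes per index entry
SET @page_bytes = 8000  -- assumption: 8K rounded down, page overhead ignored

SELECT
    @page_bytes / @row_bytes         AS rows_per_page,           -- 8
    @page_bytes / @key_bytes         AS index_entries_per_page,  -- 800
    @rows * @row_bytes / @page_bytes AS data_pages,              -- 1000
    @rows * @key_bytes / @page_bytes AS index_pages              -- 10
```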
Two, index optimization technology
Is retrieval always faster with an index? The answer is no. Sometimes using an index is slower than not using one. For example, suppose we want to retrieve all the records in the table above. Without the index, we need to access 8,000 x 1,000 bytes / 8K bytes = 1,000 pages. With the index, we first search the index, accessing 8,000 x 10 bytes / 8K bytes = 10 pages to get the index search result, and then go to the corresponding data pages. Because we are retrieving all the data, we still need to access 8,000 x 1,000 bytes / 8K bytes = 1,000 pages to read it all, for a total of 1,010 pages, which is clearly slower than not using the index.
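The comparison for retrieving every record can be written out the same way. This is only a sketch of the estimate in the text, with 8K again rounded to 8,000 bytes and column aliases of my own choosing.

```sql
-- Full retrieval: table scan versus going through the index first.
DECLARE @data_pages int, @index_pages int
SET @data_pages  = 8000 * 1000 / 8000   -- 1000 pages of data
SET @index_pages = 8000 * 10 / 8000     -- 10 pages of index

SELECT
    @data_pages                AS pages_with_table_scan,  -- 1000
    @index_pages + @data_pages AS pages_via_index         -- 1010
```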
SQL Server has a complete set of data retrieval optimization techniques. In the case above, the SQL Server query plan automatically retrieves the data with a table scan and uses no index. So how does SQL Server know when to use an index and when not to? Besides maintaining the data itself, SQL Server also maintains statistics about the data. The following illustration is a screenshot of the database properties page:
As the screenshot shows, SQL Server automatically maintains statistics, including data density information and data distribution information. These help SQL Server decide how to build a query plan, whether a query should use an index, and which index to use (I will not explain here exactly how they help SQL Server build a query plan). Let's do an experiment. Create a table Tabtest (ID, Unqvalue, Intvalue), where ID is an identity (auto-numbered) column carrying the clustered primary key index, Unqvalue is of type uniqueidentifier with an ordinary (nonclustered) index on it, and Intvalue is an integer with no index. The unindexed Intvalue column is there to keep SQL Server from using the covering-index optimization, which would spoil the experiment. Insert 10,000 random records into the table with the following code:
CREATE TABLE [dbo].[Tabtest] (
    [ID] [int] IDENTITY (1, 1) NOT NULL,
    [Unqvalue] [uniqueidentifier] NOT NULL,
    [Intvalue] [int] NOT NULL
) ON [PRIMARY]
GO
-- Clustered primary key on ID
ALTER TABLE [dbo].[Tabtest] WITH NOCHECK ADD
    CONSTRAINT [pk_tabtest] PRIMARY KEY CLUSTERED
    (
        [ID]
    ) ON [PRIMARY]
GO
-- Unqvalue defaults to a new GUID, so the INSERT below only supplies Intvalue
ALTER TABLE [dbo].[Tabtest] ADD
    CONSTRAINT [Df_tabtest_unqvalue] DEFAULT (NEWID()) FOR [Unqvalue]
GO
-- Ordinary (nonclustered) index on Unqvalue
CREATE INDEX [Ix_tabtest_unqvalue] ON [dbo].[Tabtest] ([Unqvalue]) ON [PRIMARY]
GO
-- Insert 10,000 rows with a random Intvalue between 0 and 999
DECLARE @i int
DECLARE @v int
SET @i = 0
WHILE @i < 10000
BEGIN
    SET @v = RAND() * 1000
    INSERT INTO Tabtest ([Intvalue]) VALUES (@v)
    SET @i = @i + 1
END
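If you want to look at the density and distribution statistics mentioned earlier for this table, something like the following should work. It is a minimal sketch that assumes the table and index were created exactly as above; the statistics are refreshed first so that they reflect the rows just inserted.

```sql
-- Confirm the row count (should be 10,000 after the loop above).
SELECT COUNT(*) AS row_count FROM Tabtest

-- Refresh the statistics, then show them for the nonclustered index.
UPDATE STATISTICS Tabtest
DBCC SHOW_STATISTICS (Tabtest, Ix_tabtest_unqvalue)
```

The DBCC output lists the density of the index key and a histogram of its value distribution, which is the information the optimizer consults when choosing between an index and a scan.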
Then we execute two queries and look at their execution plans, as shown in the figure. (You can turn on the query plan from the Query menu in Query Analyzer. The GUID in the first query in the figure is one I looked up in my database; when you run the experiment, use a value from your own database.)
As the figure shows, in the first query SQL Server uses the Ix_tabtest_unqvalue index. Following the direction of the arrows, it first searches within the index range and, once it finds a match, uses a bookmark lookup to map the index node to the data node. Finally it returns the SELECT result. In the second query, the system traverses the whole table to produce the result, yet it uses the clustered index. Why? Don't forget that the leaf nodes of the clustered index are the data nodes! Scanning through the clustered index can be faster because it is not affected by the storage gaps left by deletes and updates; the empty space is simply skipped.
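The two queries themselves appear only in the figure, so here is a plausible reconstruction rather than the exact text used. It assumes the first query filters on Unqvalue and the second reads the whole table. Instead of a literal GUID, it picks one from the table into a variable (@g is my own name), and it uses SET STATISTICS PROFILE so the plan rows come back as text alongside the results.

```sql
SET STATISTICS PROFILE ON
GO
-- Pick an existing GUID so the first query is guaranteed to match a row.
DECLARE @g uniqueidentifier
SELECT TOP 1 @g = Unqvalue FROM Tabtest

-- Query 1: selective lookup on the nonclustered index column.
SELECT * FROM Tabtest WHERE Unqvalue = @g

-- Query 2: retrieve every row.
SELECT * FROM Tabtest
GO
SET STATISTICS PROFILE OFF
GO
```

The plan chosen with a variable may differ slightly from the one chosen with a literal value, so compare it against the graphical plan in Query Analyzer if the result looks unexpected.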
Next, we change the clustered index on the ID field to a nonclustered index and then execute SELECT * FROM Tabtest again. This is the execution plan we see:
This time SQL Server uses no index at all and performs a table scan directly, because that is the most efficient way to retrieve the data.
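One way to make the change described above is to drop the primary key constraint and add it back as nonclustered. This is a sketch that assumes the constraint is still named pk_tabtest as in the creation script; the same change can also be made through the table designer.

```sql
-- Drop the clustered primary key; the table becomes a heap.
ALTER TABLE Tabtest DROP CONSTRAINT pk_tabtest
GO
-- Re-create the primary key as a nonclustered index.
ALTER TABLE Tabtest ADD CONSTRAINT pk_tabtest PRIMARY KEY NONCLUSTERED (ID)
GO
-- Run the full retrieval again and check its plan.
SELECT * FROM Tabtest
GO
```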
Three, the essential difference between clustered index and non-clustered index
We can now discuss the essential difference between a clustered index and a nonclustered index. As shown in the first two figures of this article, the leaf nodes of a clustered index are the data nodes, while the leaf nodes of a nonclustered index are still index nodes that keep a link to the corresponding data block.
Let's again use some arithmetic to see the difference. Suppose a table has 8,000 records, each occupying 1,000 bytes on disk. If we create a nonclustered primary key index on a 10-byte field, then, modeling the index as a simple binary tree, it needs 16,000 nodes (8,000 of the 16,000 nodes are leaf nodes, and each leaf node points to one data record). The data occupies 8,000 x 1,000 bytes / 8K bytes = 1,000 pages, and the index occupies 16,000 nodes x 10 bytes / 8K bytes = 20 pages, for a total of 1,020 pages.
For the same table, if we create a clustered primary key index on that field, the leaf nodes of the clustered index are the data nodes, so there are only 8,000 index nodes, which occupy 10 pages, and the data still occupies 1,000 pages, for a total of 1,010 pages.
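The two storage totals can be checked with a short calculation. As before, this is only a sketch of the simplified model in the text: 8K is rounded to 8,000 bytes, the index is counted as a binary tree with one 10-byte entry per node, and the variable names are mine.

```sql
DECLARE @rows int, @key_bytes int, @page_bytes int, @data_pages int
SET @rows = 8000
SET @key_bytes = 10
SET @page_bytes = 8000                      -- assumption: 8K rounded down
SET @data_pages = @rows * 1000 / @page_bytes -- 1000 pages of data

SELECT
    -- Nonclustered: 8,000 leaf nodes plus 8,000 intermediate nodes.
    @data_pages + (2 * @rows * @key_bytes / @page_bytes) AS nonclustered_total_pages, -- 1020
    -- Clustered: the leaves are the data pages, so only 8,000 index nodes remain.
    @data_pages + (@rows * @key_bytes / @page_bytes)     AS clustered_total_pages     -- 1010
```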
Now let's see why, by this count, a nonclustered primary key is faster than a clustered primary key on insert. A PRIMARY KEY constraint requires that key values not repeat, so how does SQL Server know there is no duplicate? The only way is to search. With a nonclustered index, it only needs to search the 16,000 nodes in 20 pages to know whether there is a duplicate, because all the primary key values are contained in those 16,000 index nodes. With a clustered index, the index nodes are only the 8,000 intermediate nodes, so to know whether there is a duplicate it must also search the 1,000 pages of data nodes, which in the worst case amounts to searching 10 + 1,000 = 1,010 pages. By this worst-case count, inserting with a clustered primary key is much slower than inserting with a nonclustered primary key.
Now consider retrieval efficiency. Suppose we search both tables and the index is used (sometimes the SQL Server execution plan chooses not to use the index, but let's assume it does). With the clustered index, we may visit 10 index pages plus 1,000 data pages to get the result (the actual case is better than this). With the nonclustered index, the system finds the matching nodes in 20 pages and then maps them to 1,000 data pages (this is also the worst case). Comparing the two, one visits 1,010 pages and the other 1,020 pages, so the difference in retrieval efficiency is not significant. So both nonclustered and clustered indexes are suitable for sorting; the clustered index is only slightly faster than the nonclustered one.
Conclusion
Well, after half a day of writing my hands are tired, so I will not run the experiment comparing the efficiency of clustered and nonclustered indexes. You can use Query Analyzer to analyze your own query plans. SQL Server is a complex system, especially its indexing and query optimization techniques, and Oracle is even more complex. Understanding indexes and what goes on behind a query is no bad thing; it helps us understand our systems more deeply.