Understanding the concept of indexes in SQL Server, the principle

Source: Internet
Author: User

Transfer from: http://www.cnblogs.com/CareySon/archive/2011/12/22/2297568.html introduction

In SQL Server, an index is an enhanced presence, which means that even without an index, SQL Server can still implement its due functionality. However, indexes can greatly improve query performance in most cases, especially in OLAP. To fully understand the concept of the index, you need to understand a lot of original knowledge, including B-tree, heap, database pages, zones, fill factors, fragments, filegroups and so on a series of related knowledge, this knowledge to write a small book is not too. So this article does not discuss these topics in depth.

What is an index

An index is a structure that sorts the values of one or more columns in a database table and uses an index to quickly access specific information in a database table.

In the case of streamlining, an index is a structure. In SQL Server, the storage structure of the Index and table (which refers to the table with the clustered index) is the same as the B-tree, which is a balanced multi-fork tree used for lookups. Understanding the concept of B-trees such as:

Understanding why a B-tree is used as a structure for indexes and tables (with clustered indexes) first requires understanding how SQL Server stores data.

In SQL Server, the smallest unit of storage is a page, and the page is not re-divided. Just as cells are no longer part of biology, or atoms are the smallest units in chemistry that cannot be divided. This means that SQL Server reads the page, either the entire read, or not at all, without compromise.

In the case of database retrieval, the disk IO scan is the most time consuming. Because disk scanning involves a lot of physical features, these are quite time consuming. So the B-tree was designed to reduce the number of scans on the disk. If a table or index does not use a B-tree (for tables that do not have a clustered index that uses heap heap storage), then looking up a data requires a full scan of the database pages contained in the entire table. This will undoubtedly greatly increase the IO burden. Instead of using B-trees for storage in SQL Server, you just need to store the root node of the B-tree in memory, and after a few lookups, you can find the page containing the leaf node that holds the required data! The overall scan to avoid this improves performance.

Below, an example is shown to illustrate:

In SQL Server, if the clustered index is not established on the table, it is stored in heap (heap), assuming I have a table like this:

Now there are no indexes on this table, that is, heap storage, and I show a decrease in IO by adding a clustered index (stored in B-tree) to it:

Understanding Clustered and Clustered Indexes

In SQL Server, the two most important types of indexes are clustered and nonclustered indexes. As you can see, these two classifications are clustered around this keyword. So first of all to understand what is aggregation.

The definitions that are clustered in the index:

In order to improve the query speed of a property (or group of properties), it is called aggregation that a tuple with the same value on this or these attributes (called the aggregation code) is stored in a contiguous set of physical blocks.

In simple terms, a clustered index is:

In SQL Server, the role of aggregation is to change the physical order of one column (or columns) to match the logical order, for example, I extract 5 data from the employee of the AdventureWorks database:

When I set up a clustered index on ContactID, I query again:

In SQL Server, the storage of a clustered index is stored as a B-tree, and the leaves of the B-tree store the data of the clustered index directly:

Because the clustered index changes the physical storage order of the table in which it resides, there can be only one clustered index per table.

Nonclustered indexes

Because each table can have only one clustered index, if our query on a table is not limited to the fields on the clustered index. We also have requirements for indexes outside of the clustered index column, so we need a nonclustered index.

A nonclustered index is, in essence, a clustered index. A nonclustered index does not alter the physical structure of the table in which it is located, but rather generates an additional B-tree structure for a clustered index, but the leaf node is a reference to its own table, which is divided into two types, and the row number is referenced if there is no clustered index on the table. If the clustered index already exists on the table, the page referencing the clustered index is referenced.

A simple nonclustered index concept is as follows:

As you can see, nonclustered indexes require additional space for storage, clustered indexes by indexed columns, and a leaf node in the B-tree contains pointers to the table where the nonclustered indexes are located.

In MSDN, the description graph for nonclustered indexes is:

As you can see, a nonclustered index is also a B-tree structure, unlike a clustered index, where the leaf node of a B-tree is a pointer to a heap or clustered index.

The principle of nonclustered indexes shows that if the physical structure of the table is changed, such as adding or deleting a clustered index, then all nonclustered indexes need to be rebuilt, and this loss of performance is considerable. Therefore, it is best to set up a clustered index and then set up a corresponding nonclustered index.

Clustered index VS Nonclustered index

This is explained earlier by the principle of clustered and nonclustered indexes. It's easy to see that, in most cases, clustered indexes are faster than nonclustered indexes. Because the B-tree leaf nodes of the clustered index store data directly, the nonclustered index also requires additional pointers to the leaf nodes to find the data.

Also, for large numbers of continuous data lookups, nonclustered indexes are weak because nonclustered indexes need to find a pointer to each row in the B-Tree of the nonclustered index, and then go to the table where the data is found, so performance can be compromised. Sometimes it's better not to add nonclustered indexes.

Therefore, in most cases, the clustered index is faster than the nonclustered index. But there can only be one clustered index, so choosing the columns that are applied to the clustered index is critical to query performance improvements.

Use of indexes

The use of indexes does not need to be used explicitly, and the Query Analyzer automatically finds the shortest Path usage index after indexing.

But this is the case. As the amount of data grows, the index fragmentation is generated, and many of the stored data is not properly spread, causing fragmentation (introduction to spreads and fragments and fill factors, as I'll say in subsequent articles) we need to re-index to speed performance:

For example, a clustered index and a nonclustered index established on the previous TEST_TB2 can query its index through the DMV statement:

SELECT Index_type_desc,alloc_unit_type_desc,avg_fragmentation_in_percent,fragment_count,avg_fragment_size_in_ Pages,page_count,record_count,avg_page_space_used_in_percentfrom sys.dm_db_index_physical_stats (DB_ID (' AdventureWorks '), object_id (' Test_tb2 '), Null,null, ' Sampled ')

We can improve the speed by rebuilding the index:

ALTER INDEX Idx_text_tb2_employeeid on TEST_TB2 REBUILD

Another situation is that, as the amount of data in the table increases, sometimes it is necessary to update the statistics on the table so that the Query Analyzer chooses the path based on the information, using:

UPDATE STATISTICS Table Name

So when do you know if you need to update these statistics, that is, when the estimated number of rows in the execution plan is not the same as the actual table:

The cost of using an index

My favorite word is "everything has the price". Any performance gains we get through indexing are not without a price. The cost comes from several aspects.

1. The principle of the clustered index we know that when the table is indexed, the data is stored as a B-tree. So when the update is inserted and deleted, it is necessary to move the page physically to adjust the B-tree. Therefore, when the update is inserted to delete data, it can degrade performance. For clustered indexes, when the table is updated, the nonclustered indexes also need to be updated, which is equivalent to a multiple update of N (n= nonclustered index) tables. Therefore, it also degrades performance.

2. With the above introduction to the principle of nonclustered indexes, you can see that nonclustered indexes require additional disk space.

3. As mentioned earlier, inappropriate nonclustered indexes can degrade performance.

Therefore, the use of indexes requires a tradeoff based on the actual situation. I usually put the nonclustered index on a separate hard disk, so that I can scatter IO so that the query is parallel.

Summarize

This article introduces the index in SQL Server from the principles and concepts of the index, which is a powerful tool and a double-edged sword. For proper use of indexes, it is necessary to systematically learn the principles of indexes and the related principles of database storage.

Understanding the concept of indexes in SQL Server, the principle

Related Article

Contact Us

The content source of this page is from Internet, which doesn't represent Alibaba Cloud's opinion; products and services mentioned on that page don't have any relationship with Alibaba Cloud. If the content of the page makes you feel confusing, please write us an email, we will handle the problem within 5 days after receiving your email.

If you find any instances of plagiarism from the community, please send an email to: info-contact@alibabacloud.com and provide relevant evidence. A staff member will contact you within 5 working days.

A Free Trial That Lets You Build Big!

Start building with 50+ products and up to 12 months usage for Elastic Compute Service

  • Sales Support

    1 on 1 presale consultation

  • After-Sales Support

    24/7 Technical Support 6 Free Tickets per Quarter Faster Response

  • Alibaba Cloud offers highly flexible support services tailored to meet your exact needs.