T-SQL Queries: An In-Depth Understanding of Index Principles (B-Tree)

Source: Internet
Author: User

In SQL Server, an index is an optional enhancement: even without any index, SQL Server still does everything it is supposed to do. In most cases, however, indexes greatly improve query performance, and this is particularly obvious in OLAP. To fully understand indexes, we need a good deal of background knowledge, including B-trees, heaps, database pages, extents, fill factor, fragmentation, filegroups, and a series of related topics.

An index is a structure that sorts the values of one or more columns of a database table. Using an index gives fast access to specific information in the table.

Put simply, an index is a structure. In SQL Server, an index and a table (here meaning a table with a clustered index) share the same storage structure, the B-tree. A B-tree is a balanced multiway tree used for searching. The B-tree concept is explained step by step below.

Understanding why a B-tree is used as a structure for indexes and tables (with clustered indexes) first requires understanding how SQL Server stores data.

Start with the binary search tree, the simpler structure that the B-tree generalizes. Its structural features:

1. Every non-leaf node has at most two child nodes (left and right)

2. Every node stores one key

3. The left pointer of a non-leaf node points to a subtree whose keys are smaller than its key, and the right pointer points to a subtree whose keys are larger

(Figure: an example binary search tree.)

A binary search tree lookup starts at the root node. If the search key equals the node's key, it is a hit. Otherwise, go to the left child if the search key is smaller than the node's key, or to the right child if it is larger. If the required child pointer is empty, report that the key was not found.
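
The search walk described above can be sketched in Python as a binary search tree lookup. This is only an illustration: the `Node` class, the `insert` and `search` helpers, and the sample keys are assumptions made for this example, not anything SQL Server exposes.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Node:
    key: int
    left: Optional["Node"] = None
    right: Optional["Node"] = None


def insert(root: Optional[Node], key: int) -> Node:
    """Insert key into the subtree rooted at root and return the subtree root."""
    if root is None:
        return Node(key)
    if key < root.key:
        root.left = insert(root.left, key)
    elif key > root.key:
        root.right = insert(root.right, key)
    return root  # duplicate keys are ignored


def search(root: Optional[Node], key: int) -> bool:
    """Walk down from the root; an empty child pointer means the key is absent."""
    node = root
    while node is not None:
        if key == node.key:
            return True  # hit
        node = node.left if key < node.key else node.right
    return False


# Hypothetical sample keys, inserted in an order that gives a balanced shape.
root = None
for k in [50, 30, 70, 20, 40, 60, 80]:
    root = insert(root, k)
```

With these sample keys, `search(root, 60)` finds the key after visiting 50, 70, and 60, while `search(root, 65)` runs off an empty pointer and reports a miss.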

If the left and right subtrees of every non-leaf node hold roughly the same number of nodes (that is, the tree is balanced), search performance approaches that of binary search. Compared with binary search over a contiguous block of memory, the tree has an advantage: changing its structure (inserting and deleting nodes) does not require moving large segments of data, and the cost is usually constant. This is part of the reason tables and indexes are stored as trees.

However, this brings a corresponding disadvantage: after many insertions and deletions, the shape of the tree can change considerably.

For example, inserting keys in sorted order produces a chain in which every node has only a right child. That chain is still a valid binary search tree, but its search performance is already linear. The same set of keys can therefore produce differently shaped trees, so when using such a tree we want to keep it in the balanced shape and avoid the degenerate one. This is the so-called balance problem.

Binary search trees used in practice add a balancing algorithm on top of the basic tree, giving a "balanced binary tree". How the algorithm keeps the nodes evenly distributed is the key to a balanced binary tree. The balancing algorithm is a strategy applied when inserting and deleting nodes.
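
To make the balance problem concrete, here is a small Python sketch that builds the same 127 keys into a tree twice, once in sorted order and once in a median-first order, and compares the heights. It does not implement a balancing algorithm; it only shows how much the shape depends on insertion order when no rebalancing is done. All names (`build`, `height`, `median_first`) are assumptions for this illustration.

```python
from typing import List, Optional


class Node:
    def __init__(self, key: int) -> None:
        self.key = key
        self.left: Optional["Node"] = None
        self.right: Optional["Node"] = None


def build(keys: List[int]) -> Optional[Node]:
    """Insert keys one by one with no rebalancing (iteratively)."""
    root: Optional[Node] = None
    for key in keys:
        new = Node(key)
        if root is None:
            root = new
            continue
        node = root
        while True:
            if key < node.key:
                if node.left is None:
                    node.left = new
                    break
                node = node.left
            else:
                if node.right is None:
                    node.right = new
                    break
                node = node.right
    return root


def height(root: Optional[Node]) -> int:
    """Number of levels, counted level by level to avoid deep recursion."""
    levels = 0
    frontier = [root] if root is not None else []
    while frontier:
        levels += 1
        frontier = [c for n in frontier for c in (n.left, n.right) if c is not None]
    return levels


def median_first(keys: List[int]) -> List[int]:
    """Reorder sorted keys so that each subtree's median is inserted first."""
    if not keys:
        return []
    mid = len(keys) // 2
    return [keys[mid]] + median_first(keys[:mid]) + median_first(keys[mid + 1:])


keys = list(range(1, 128))                            # 127 keys, already sorted
chain_height = height(build(keys))                    # sorted inserts form a chain
balanced_height = height(build(median_first(keys)))   # same keys, balanced shape
```

The sorted insert order yields a tree 127 levels deep, so a lookup may visit every node. The median-first order yields 7 levels for the same keys.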

The B-tree proper is a multiway search tree (not a binary one). Its properties are:

1. Every non-leaf node has at most M child nodes, where M > 2

2. The root node has between 2 and M child nodes

3. Every non-leaf node other than the root has between M/2 (rounded up) and M child nodes

4. Every node stores at least M/2 - 1 (with M/2 rounded up) and at most M - 1 keys

5. Number of keys in a non-leaf node = number of pointers to child nodes - 1

6. Keys of a non-leaf node: K[1], K[2], ..., K[M-1], with K[i] < K[i+1]

7. Pointers of a non-leaf node: P[1], P[2], ..., P[M], where P[1] points to the subtree whose keys are smaller than K[1], P[M] points to the subtree whose keys are larger than K[M-1], and every other P[i] points to the subtree whose keys fall in the range (K[i-1], K[i])

8. All leaf nodes are on the same level

(Figure: an example B-tree with M = 3.)

A B-tree search starts at the root node and does a binary search over the (ordered) keys inside the node. If the key is found, the search ends. Otherwise it moves to the child node whose range covers the search key. This repeats until the corresponding child pointer is empty or the node is already a leaf.
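
The multiway search above can be sketched in Python as follows. The `BTreeNode` class and the hand-built tree with M = 3 are assumptions for illustration; the sketch covers lookup only and does not implement insertion or deletion.

```python
from bisect import bisect_left
from dataclasses import dataclass, field
from typing import List


@dataclass
class BTreeNode:
    keys: List[int]  # sorted keys stored in this node
    # A node with n keys has n + 1 child pointers; a leaf has none.
    children: List["BTreeNode"] = field(default_factory=list)


def btree_search(node: BTreeNode, key: int) -> bool:
    while True:
        i = bisect_left(node.keys, key)  # binary search inside the node
        if i < len(node.keys) and node.keys[i] == key:
            return True                  # hit, possibly in a non-leaf node
        if not node.children:
            return False                 # reached a leaf without a hit
        node = node.children[i]          # child whose range covers the key


# Hand-built example with M = 3: at most 3 children and 2 keys per node.
root = BTreeNode(
    keys=[30, 60],
    children=[
        BTreeNode(keys=[10, 20]),
        BTreeNode(keys=[40, 50]),
        BTreeNode(keys=[70]),
    ],
)
```

Searching for 60 ends at the root, which is a non-leaf node. Searching for 55 descends into the middle child, because 55 lies between 30 and 60, and misses there.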

B-tree features:

1. The keys are distributed throughout the whole tree

2. Every key appears in exactly one node

3. A search may end at a non-leaf node

4. Search performance is equivalent to one binary search over the complete set of keys

5. The number of levels is controlled automatically

Because every non-leaf node other than the root must have at least M/2 children, a minimum node utilization is guaranteed. In the worst case a search visits roughly log_{M/2}(N) nodes and does a binary search over at most M - 1 keys in each of them, which works out to on the order of log2(N) comparisons in total.

Here M is the maximum number of children of a non-leaf node and N is the total number of keys. The performance of a B-tree is therefore always equivalent to binary search (independent of the value of M), and the balance problem of the binary search tree does not arise. Because of the M/2 limit, when a key is inserted into a node that is already full, the node must be split into two nodes that each hold about M/2 keys. When a key is deleted, two sibling nodes that have fallen below M/2 must be merged to keep the tree balanced.
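
The split step can be sketched in a few lines of Python. This follows one common textbook variant (insert first, then split the overfull node and promote the median to the parent). It is an assumption for illustration and not a description of SQL Server's page-split internals; the function name and sample keys are hypothetical.

```python
from typing import List, Tuple


def split_full_node(keys: List[int]) -> Tuple[List[int], int, List[int]]:
    """Split an overfull, sorted key list into (left, median, right).

    The median moves up to the parent; left and right become sibling nodes.
    """
    mid = len(keys) // 2
    return keys[:mid], keys[mid], keys[mid + 1:]


M = 5                            # assumed order: at most M - 1 = 4 keys per node
overfull = [10, 20, 30, 40, 50]  # a full node (4 keys) after one more insert
left, median, right = split_full_node(overfull)
```

Here the node splits into `[10, 20]` and `[40, 50]`, with 30 promoted. Each half keeps 2 keys, which meets the minimum of M/2 - 1 (rounded up) for M = 5.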

This is a little complicated, and to be honest I do not fully understand it myself. A general grasp of the storage structure is enough. In SQL Server, the smallest unit of storage is the page, and a page cannot be subdivided. Page access is atomic: SQL Server either reads a page in its entirety or does not read it at all. There is no middle ground.

In database retrieval, disk I/O is the most time-consuming part, because reading from disk involves physical mechanics that are slow. The B-tree is designed to minimize the number of disk reads. If a table or index does not use a B-tree (a table without a clustered index is stored as a heap), finding a row requires scanning every database page that belongs to the table, which greatly increases the I/O burden. When SQL Server stores the data in a B-tree, only the root node of the B-tree needs to be kept in memory. A few lookups then lead to the leaf page that contains the required data, so a full scan is avoided and performance improves.
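
A rough back-of-the-envelope estimate in Python shows the size of the difference. The row count, row size, and index entry size are hypothetical, and the calculation ignores page headers and other per-page overhead, so treat the numbers as an order-of-magnitude sketch only.

```python
import math

PAGE_BYTES = 8 * 1024  # SQL Server pages are 8 KB; header overhead is ignored here


def heap_scan_reads(row_count: int, row_bytes: int) -> int:
    """Pages read by a full scan: every data page is touched once."""
    rows_per_page = PAGE_BYTES // row_bytes
    return math.ceil(row_count / rows_per_page)


def btree_seek_reads(row_count: int, row_bytes: int, key_bytes: int) -> int:
    """Pages read by one seek: one page per B-tree level, root to leaf."""
    pages = math.ceil(row_count / (PAGE_BYTES // row_bytes))  # leaf level
    fanout = PAGE_BYTES // key_bytes                          # entries per index page
    levels = 1
    while pages > 1:
        pages = math.ceil(pages / fanout)                     # next level up
        levels += 1
    return levels


# Hypothetical table: 1,000,000 rows of 200 bytes, 16-byte index entries.
scan = heap_scan_reads(1_000_000, 200)
seek = btree_seek_reads(1_000_000, 200, 16)
```

Under these assumptions a full scan touches 25,000 pages, while a single seek through the B-tree touches 3 (root, one intermediate level, leaf).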

An example illustrates this:

In SQL Server, a table without a clustered index is stored as a heap. Suppose I have such a table:

There are currently no indexes on this table, so it is stored as a heap. We can show the reduction in I/O by adding a clustered index (stored as a B-tree) to it:

Understanding Clustered and Nonclustered Indexes

In SQL Server, the two most important types of index are clustered and nonclustered indexes. As you can see, both names revolve around the word "clustered", so we first need to understand what clustering is.

The definition of clustering in the context of indexes:

To speed up queries on an attribute (or group of attributes), the tuples that have the same value for that attribute or attributes (called the cluster key) are stored together in contiguous physical blocks. This is called clustering.

In simple terms, a clustered index works like this:

In SQL Server, clustering changes the physical order of the rows to match the logical order of a column (or columns). For example, I take 5 rows from the Employee table in the AdventureWorks database:

After I create a clustered index on ContactID and query again:

In SQL Server, a clustered index is stored as a B-tree, and the leaf nodes of the B-tree store the table's data rows directly:

Because the clustered index changes the physical storage order of the table in which it resides, there can be only one clustered index per table.

Nonclustered indexes

Because each table can have only one clustered index, and our queries on a table are rarely limited to the clustered index columns, we also need indexes on other columns. That is what nonclustered indexes are for.

A nonclustered index has essentially the same B-tree structure as a clustered index, but it does not alter the physical structure of the table it belongs to. Instead, an additional B-tree is generated whose leaf nodes hold references to the rows of the table. The reference takes one of two forms. If the table has no clustered index, the leaf node references the row by its row identifier (RID). If the table already has a clustered index, the leaf node references the row by its clustered index key.

A simple picture of the nonclustered index concept:

As you can see, a nonclustered index requires additional storage space. Its entries are ordered by the indexed columns, and the leaf nodes of its B-tree contain pointers to the rows of the underlying table.

A nonclustered index is therefore also a B-tree. The difference from a clustered index is that its leaf nodes hold pointers into the heap or the clustered index rather than the data rows themselves.
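
The two-step lookup can be modelled in Python with plain dictionaries standing in for the two B-trees. This is a deliberately simplified sketch: the table contents, the `find_by_name` helper, and the use of `dict` instead of a real B-tree are all assumptions for illustration.

```python
from typing import Dict, Optional, Tuple

# Clustered index: the leaf level holds the full rows, keyed by the clustering key.
clustered: Dict[int, Tuple[int, str, str]] = {
    1: (1, "Ann", "Sales"),
    2: (2, "Bob", "Support"),
    3: (3, "Cid", "Sales"),
}

# Nonclustered index on name: the leaf level holds only name -> clustering key.
nonclustered_by_name: Dict[str, int] = {row[1]: key for key, row in clustered.items()}


def find_by_name(name: str) -> Optional[Tuple[int, str, str]]:
    key = nonclustered_by_name.get(name)  # step 1: seek in the nonclustered index
    if key is None:
        return None
    return clustered[key]                 # step 2: follow the reference to the row
```

A lookup by the clustering key needs only the first structure, while `find_by_name("Bob")` has to consult both. That extra hop is the cost of reaching data through a nonclustered index.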

It follows from how nonclustered indexes work that if the physical structure of the table changes, for example when a clustered index is added or dropped, all nonclustered indexes have to be rebuilt, which is a significant performance cost. It is therefore best to create the clustered index first and then create the nonclustered indexes.

Clustered index vs Nonclustered index

From the principles of clustered and nonclustered indexes explained above, it is easy to see that in most cases a clustered index is slightly faster than a nonclustered index. The leaf nodes of the clustered index's B-tree hold the data directly, while a nonclustered index needs an additional lookup through the leaf node's pointer to fetch the data.

Nonclustered indexes are also weak for lookups over a large range of contiguous data. For each row, the pointer must first be found in the B-tree of the nonclustered index and then followed into the table to fetch the data, so performance suffers. Sometimes it is even worse than having no nonclustered index at all.

Therefore, in most cases, clustered indexes are faster than nonclustered indexes. But there can be only one clustered index per table, so choosing the columns for the clustered index is critical to query performance.

Fill factor

Using a B-tree as the physical storage structure of an index brings in a new concept: the fill factor. As the B-tree structure above shows, when rows are inserted into or deleted from the table, the index structure in the B-tree has to be modified dynamically to keep the tree balanced, so that lookups keep their binary-search performance. This requires leaving a certain amount of free space in the pages of the B-tree to accommodate new data. The setting that controls how much space is left is called the fill factor. See the explanation in MSDN:

The fill factor option is provided to fine-tune index storage and performance. When an index is created or rebuilt, the fill factor value determines the percentage of space on each leaf-level page to be filled with data, so that the remainder of each page is reserved as free space for future growth of the index. For example, specifying a fill factor value of 80 means that 20% of the space on each leaf-level page is left empty, providing room for the index to expand as data is added to the underlying table. The free space is reserved between the index rows rather than at the end of the index. Note that a fill factor of 0 does not mean the pages are left empty; it means the leaf-level pages are filled completely. The fill factor value is a percentage from 1 to 100.
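
The arithmetic behind the fill factor can be sketched in Python. The row count and row size are hypothetical, and page header overhead is ignored, so the page counts are approximate.

```python
import math

PAGE_BYTES = 8 * 1024  # SQL Server page size; header overhead is ignored here


def leaf_pages(row_count: int, row_bytes: int, fill_factor: int) -> int:
    """Leaf pages needed when each page is filled to fill_factor percent."""
    if fill_factor == 0:  # 0 behaves like 100: fill the pages completely
        fill_factor = 100
    usable = PAGE_BYTES * fill_factor // 100
    rows_per_page = usable // row_bytes
    return math.ceil(row_count / rows_per_page)


# Hypothetical index: 100,000 rows of 100 bytes each.
full = leaf_pages(100_000, 100, 100)
with_room = leaf_pages(100_000, 100, 80)
```

Under these assumptions the index needs 1,235 leaf pages when filled completely and 1,539 at a fill factor of 80. The trade-off is more pages to read in exchange for free space that absorbs later inserts.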

Use of indexes

An index does not need to be invoked explicitly. Once an index has been created, the query optimizer automatically finds the cheapest access path and uses the index where appropriate.

However, as the amount of data grows, index fragmentation builds up: much of the stored data ends up spread across pages inappropriately. When that happens, we need to rebuild the index to restore performance.

For example, for the clustered index and the nonclustered index built earlier on TEST_TB2, we can check the state of the indexes with a DMV query:

-- Fragmentation and page usage for every index on TEST_TB2
SELECT index_type_desc, alloc_unit_type_desc, avg_fragmentation_in_percent, fragment_count, avg_fragment_size_in_pages, page_count, record_count, avg_page_space_used_in_percent
FROM sys.dm_db_index_physical_stats(DB_ID('AdventureWorks'), OBJECT_ID('TEST_TB2'), NULL, NULL, 'SAMPLED')

We can improve the speed by rebuilding the index:

ALTER INDEX Idx_text_tb2_employeeid ON TEST_TB2 REBUILD

Another situation is that, as the amount of data in the table grows, the statistics on the table sometimes need to be updated so that the query optimizer can choose an access path based on current information. Use:

UPDATE STATISTICS table_name

So how do you know when these statistics need updating? When the estimated number of rows in the execution plan differs from the actual number of rows:

The cost of the index

Of course, indexes come at a price:

1. From the principle of the clustered index we know that once a table has a clustered index, its data is stored in the B-tree. Updates, inserts, and deletes may therefore require physically moving data between pages to adjust the B-tree, which degrades write performance. For nonclustered indexes, whenever the table is modified the nonclustered indexes must be updated as well, which is equivalent to updating N additional tables (N = number of nonclustered indexes). This also degrades performance.

2. As the discussion of nonclustered index principles above shows, nonclustered indexes require additional disk space.

3. As mentioned earlier, inappropriate nonclustered indexes can actually reduce performance.

Using indexes therefore requires a trade-off based on the actual situation. I usually put nonclustered indexes on a separate hard disk to spread the I/O so that queries can run in parallel.
