Introduction
In SQL Server, indexes are enhanced, which means that SQL Server can still implement its functions even if there is no index. However, indexes can greatly improve query performance in most cases, especially in OLAP. to fully understand the concept of indexes, you need to understand a lot of principles, including B-tree, heap, database pages, areas, fill factors, fragments, file groups, and so on, I cannot write a small book with this knowledge. Therefore, this article will not discuss these topics in depth.
What is an index?
An index is a structure that sorts the values of one or more columns in a database table. You can use an index to quickly access specific information in a database table.
In short, the index is a structure. in SQL Server, the storage structure of indexes and tables (the tables with clustered indexes) are the same, both of which are B trees, tree B is a balanced multi-Cross Tree for search. understanding the concept of B-tree is as follows:
To understand why B-tree is used as an index and table (with clustered indexes) structure, you must first understand how SQL Server stores data.
In SQL SERVER, the smallest unit of storage is PAGE, and pages cannot be divided. Just like cells cannot be further divided in biology, or atoms are the smallest unit in chemistry. this means that the SQL SERVER reads the page either completely or completely without any compromise.
In database retrieval, disk I/O scanning is the most time-consuming because disk scanning involves many physical features, which are time-consuming. Therefore, the original intention of the B-tree design is to reduce the number of disk scans. If a table or index does not use the B-tree (heap Storage is used for tables without clustered indexes), you need to scan the entire database page in the whole table for a data query. This will undoubtedly greatly increase the IO burden. in SQL server, if you use B-tree for storage, you only need to store the root node of B-tree into the memory, after several searches, you can find the pages contained by the leaf nodes that store the required data! This avoids full scanning and improves the performance.
The following is an example:
In SQL server, if no clustered index is set up on a table, it is stored by HEAP. Suppose I have a table like this:
Now this table does not have any indexes, so it is stored in a heap. I add a clustered index (stored in the B tree) on it to demonstrate IO reduction:
Understanding clustering and clustered Index
In SQL SERVER, the two most important types of indexes are clustered indexes and non-clustered indexes. We can see that these two categories are centered around the aggregation keyword. First, we need to understand what aggregation is.
Definition of clustered indexes:
To increase the query speed of an attribute (or attribute group), the tuples with the same value on this or these attributes (called clustering codes) are stored in a continuous physical block called clustering.
In short, clustered indexes are:
In SQL SERVER, clustering is used to change the physical order of a column (or multiple columns) to be consistent with the logical order. For example, I extracted 5 pieces of data from the employee of the adventureworks database:
When I create a clustered index on ContactID, query again:
In SQL server, clustered indexes are stored in Tree B, and the leaves of Tree B directly store clustered index data:
Because clustered indexes change the physical storage order of the table in which they are located, each table can have only one clustered index.
Non-clustered Index
Because each table can only have one clustered index, If we query a table, it is not limited to fields on the clustered index. In addition to clustered index columns, we need non-clustered indexes.
Non-clustered index is essentially a type of clustered index. A non-clustered index does not change the physical structure of the table in which it is located. Instead, it generates an additional B-tree structure of the clustered index. However, a leaf node references the table in which it is located. There are two types of reference, if no clustered index exists in the table, the row number is referenced. If a clustered index already exists in the table, the page of the clustered index is referenced.
A simple non-clustered index concept is as follows:
As you can see, non-clustered indexes require additional storage space, clustered indexes are performed according to the indexed columns, and the leaf nodes in the B-tree contain pointers to the tables where the non-clustered indexes are located.
In MSDN, the description of non-clustered indexes is:
As you can see, non-clustered indexes are also a B-tree structure. Different from clustered indexes, the leaf nodes of the B-tree store pointers to heap or clustered indexes.
The principle of non-clustered indexes shows that if the physical structure of the table in which the data is located changes, such as adding or deleting clustered indexes, all non-clustered indexes need to be rebuilt, this is a considerable loss of performance. Therefore, it is best to create a clustered index first and then a corresponding non-clustered index.
Clustered index VS non-clustered Index
We have explained the principles of clustered and non-clustered indexes. in most cases, clustered indexes are faster than non-clustered indexes. because the leaf node of the B-tree that clustered indexes directly stores data, instead of the clustered index, the data needs to be found through the pointer of the leaf node.
In addition, for a large number of continuous data queries, non-clustered indexes are very weak, because non-clustered indexes need to find the pointer of each row in the B tree of non-clustered indexes, and then find the data on the table where the data is located, the performance will be compromised. sometimes it is better not to add non-clustered indexes.
Therefore, clustered indexes are faster than non-clustered indexes in most cases. However, only one clustered index exists. Therefore, selecting the columns applied to the clustered index is crucial for improving the query performance.
Use of Indexes
The index does not need to be explicitly used. After the index is created, the query analyzer automatically finds the shortest path and uses the index.
However, this is the case. as the amount of data increases and index fragmentation occurs, a lot of stored data is not properly cross-page, which may cause fragmentation (about cross-page and fragmentation and fill factor introduction, we need to rebuild the index to speed up the performance:
For example, a clustered index and a non-clustered index created on test_tb2 can be queried using the DMV statement:
SELECT index_type_desc,alloc_unit_type_desc,avg_fragmentation_in_percent,fragment_count,avg_fragment_size_in_pages,page_count,record_count,avg_page_space_used_in_percentFROM sys.dm_db_index_physical_stats(DB_ID('AdventureWorks'),OBJECT_ID('test_tb2'),NULL,NULL,'Sampled')
We can increase the speed by re-indexing:
ALTER INDEX idx_text_tb2_EmployeeID ON test_tb2 REBUILD
In another case, when the table data volume increases, you may need to update the statistical information of the table so that the query analyzer can select the path based on the information and use:
Update statistics table name
When do we know that we need to update these statistics? When the estimated number of rows in the execution plan differs from the actual number of rows in the table:
Cost of using Indexes
My favorite sentence is "everything has price ". We do not have to pay any price for any performance improvement we get through indexes. This cost comes from several aspects.
1. based on the principle of clustered index, we know that after an index is created for a table, data is stored in the B tree. therefore, when updating, inserting, and deleting a page, you need to physically move the page to adjust the B-tree. therefore, when data is updated, inserted, and deleted, the performance may decrease. For clustered indexes, non-clustered indexes also need to be updated after the table is updated, which is equivalent to updating N (N = number of non-clustered indexes) tables. Therefore, performance is also reduced.
2. Based on the introduction of the non-clustered index principle, we can see that non-clustered indexes require additional disk space.
3. As mentioned above, improper non-clustered indexes will reduce the performance.
Therefore, we need to weigh indexes based on the actual situation. Generally, I will put all non-clustered indexes on another independent hard disk, So that I/O can be dispersed, so that the query can be parallel.
Summary
This article describes the indexing principles and concepts of SQL SERVER. indexing is a powerful tool and a double-edged sword. to properly use indexes, you must systematically learn the indexing principles and database storage principles.