An in-depth analysis of the principles of SQL Server indexing

Last Update:2018-07-12 Source: Internet

Author: User


You can think of an index as a special kind of table of contents. This article introduces the principles behind SQL Server indexes and walks through them with sample code in some detail. It should be a useful reference for study or work, and readers who need it are welcome to follow along.

Preface

This article is compiled from my earlier notes. It uses indexes as the entry point for exploring related database knowledge, and has been revised to make it easier to digest. Readers who are new to SQL Server can read only the text highlighted in blue in the original post, which is short, practical, and saves time. Readers who are comfortable with databases can read the whole thing. Discussion is welcome.

The concept of an index

The purpose of an index: the speed of data queries and processing has become a standard by which the success of an application system is measured, and using indexes to speed up data processing is often the most common optimization method.

What an index is: an index in a database is similar to the table of contents of a book. In a book, the table of contents lets you find the information you want quickly without reading the whole book. In a database, the database engine can use an index to look up data in a table without scanning the entire table. The table of contents in a book is a list of terms and the page numbers where they appear; an index in a database is a list of the values in a table and where each value is stored.

The pros and cons of indexes: most of the cost of query execution is I/O, and one of the main goals of using an index to improve performance is to avoid a full table scan. A full table scan has to read every data page of the table from disk, whereas if an index points to the data values, the query only needs to read the disk a few times. Used sensibly, therefore, indexes speed up queries. However, indexes do not always improve system performance: an indexed table takes more storage space in the database, and inserting, updating, and deleting data takes longer because the indexes have to be maintained. So use indexes sensibly, and revise sub-optimal indexes promptly.

1. Clustered indexes and nonclustered indexes

Indexes are divided into clustered indexes and nonclustered indexes.

1.1 Clustered Index

Table data is stored in data pages (a data page has PageType 1). A SQL Server page is 8 KB; when one page fills up, data continues on the next page. If the table has a clustered index, the physical rows are stored in the pages in ascending or descending order of the clustered index key. When you update the clustered index key, or insert or delete a row in the middle, table data may have to move so that the order is preserved, which can affect performance.

Note that the primary key is the clustered index only by default. It can also be made a nonclustered index, and the clustered index can be placed on a non-primary-key column instead. A table can have only one clustered index.
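
As an illustration of that last point, the following sketch declares the primary key as a nonclustered index and puts the single clustered index on another column. The table dbo.OrderDemo, its columns, and the constraint and index names are all hypothetical and are not taken from the article.

```sql
-- Hypothetical table: dbo.OrderDemo and its columns are illustrative names.
CREATE TABLE dbo.OrderDemo
(
    Id        INT IDENTITY(1,1) NOT NULL,
    OrderDate DATETIME          NOT NULL,
    Name1     VARCHAR(50)       NULL,
    -- The primary key is explicitly declared as a nonclustered index.
    CONSTRAINT PK_OrderDemo PRIMARY KEY NONCLUSTERED (Id)
);

-- The table's one clustered index goes on a non-primary-key column instead.
CREATE CLUSTERED INDEX IX_OrderDemo_OrderDate
    ON dbo.OrderDemo (OrderDate);
```

Trying to create a second clustered index on the same table fails, which is the "only one clustered index" rule in practice.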

A good clustered index field typically contains the following 4 features:

(A). Auto-incrementing

New records are always appended at the end, which reduces page splits and index fragmentation.

(B). Rarely changed

This reduces data movement.

(C). Unique

Uniqueness is the most desirable property of any index, because it makes the position of each index key value in the sort order unambiguous.

More importantly, when the index key is unique, each index record can point correctly to its source data row (RID). If the clustered index key is not unique, SQL Server adds an internally generated uniquifier column to the clustered key to guarantee that the "key value" is unique. If a nonclustered index key is not unique, the RID column (the clustered index key, or the row pointer in a heap table) is added to guarantee that the "key value" is unique.

Thinking (can be skipped): the index "key value" is also kept unique on non-leaf nodes, because it has to make the position of an index record within a non-leaf node unambiguous. For example, suppose there is a nonclustered index on column Name2 and the table contains many rows with Name2 = 'a', so that Name2 = 'a' has multiple index records (nodes) at the non-leaf level. When another row with Name2 = 'a' is inserted, the RID stored in the non-leaf records and the RID of the new row are enough to determine quickly which index record (node) the new row belongs under. Without the RID on the non-leaf nodes, all the leaf nodes with Name2 = 'a' would have to be traversed to find the position. Likewise, when we run select * from Table1 where name2 <= 'a', the rows typically come back sorted by the nonclustered index key Name2 and then by RID, which is easy to understand: they come back in the order in which the index stores them. This applies when the query actually uses the Name2 index. If the query plan chooses to scan the table data directly because of the "tipping point" problem, the rows come back in the order of the table data instead.

On key value uniqueness: for a clustered index, the uniquifier column is added only when the index value is duplicated. For a nonclustered index created without the UNIQUE keyword, the RID is added to every record, even if the index values happen to be unique. If the index is created as UNIQUE, the RID is added only at the leaf level, where it is used to find the source data row, which is the bookmark lookup operation.

(D). Short key length

The shorter the clustered index key, the more index records fit on one index page, which reduces the depth of the index B-tree. For example, a table with millions of rows and an int clustered index may need only a 3-level B-tree. If the clustered index is defined on a wider column (for example, a uniqueidentifier column, which takes 16 bytes), the depth may grow to 4 levels. Every clustered index lookup then takes 4 I/O operations (more precisely, 4 logical reads) where 3 used to be enough.
Similarly, a nonclustered index contains the clustered index key value, so the shorter the clustered index key, the smaller each nonclustered index record and the more index records fit on one index page.
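
To check how deep an index actually is rather than estimate it, a query along the following lines can be used. It is a sketch that assumes SQL Server 2005 or later, that it runs in the database that owns the table, and that the table is named dbo.Table1 (an illustrative name). In DETAILED mode the function returns one row per level of each index.

```sql
-- dbo.Table1 is an assumed table name; replace it with a real table.
SELECT index_id,
       index_type_desc,
       index_depth,                -- number of levels in the B-tree
       index_level,                -- 0 = leaf level; the highest value is the root
       page_count,                 -- pages at this level
       record_count,               -- index records at this level
       avg_record_size_in_bytes    -- wider keys mean larger records
FROM sys.dm_db_index_physical_stats(
         DB_ID(),                       -- current database
         OBJECT_ID(N'dbo.Table1'),      -- table to inspect
         NULL,                          -- all indexes
         NULL,                          -- all partitions
         'DETAILED')
ORDER BY index_id, index_level DESC;
```

Comparing index_depth and avg_record_size_in_bytes for an int key and for a uniqueidentifier key on tables of the same size is one way to see the effect described above.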

1.2 Nonclustered indexes

A nonclustered index is also stored in pages (PageType 2, called index pages). For example, if table T has a nonclustered index Index_A and holds 100 rows, then Index_A also holds 100 records. (Strictly speaking, 100 leaf-level records: the index is a B-tree, so if the height of the tree is greater than 0 there are also root or intermediate node pages, and the index then holds more than 100 records.) If table T has a second nonclustered index Index_B, then Index_B also holds at least 100 records. So the more indexes you create, the greater the overhead.

Updating an indexed column, inserting a row, or deleting a row all trigger index maintenance, which has some effect on performance. The size of the effect depends on the situation. For example, with a clustered index, if rows are inserted at the end there is almost no data movement and the impact is small. If rows are inserted in the middle, data generally has to move, and page splits and page fragmentation may result, so the impact is somewhat larger. (If the page that receives the row has enough free space for it and the row lands at the end of that page, no data movement occurs.)

2. Structure of the Index

SQL Server indexes are said to be B-trees (this assumes you have some understanding of the B-tree structure). To see what one actually looks like, you can use SQL statements to view its logical layout.

Open a new query and run: DBCC IND (test,orderbo,-1) -- here the orderbo table in the test database has 10,000 rows and a clustered index on the primary key column ID
(You may as well create a table with a clustered index column, insert 10,000 rows, and run this statement yourself. Seeing is believing.)
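
One possible way to set up that experiment is sketched below. It assumes a scratch database named test already exists and contains no table called orderbo. The columns Name1 and Name2, their sample values, and the dbo schema are assumptions made for illustration, since the article only states that the table has a clustered primary key ID. GO is the batch separator understood by SSMS and sqlcmd.

```sql
USE test;
GO

-- Only the clustered primary key Id comes from the article;
-- Name1 and Name2 are hypothetical filler columns.
CREATE TABLE dbo.orderbo
(
    Id    INT IDENTITY(1,1) NOT NULL PRIMARY KEY CLUSTERED,
    Name1 VARCHAR(50) NULL,
    Name2 VARCHAR(50) NULL
);
GO

-- Generate 10,000 rows by numbering a cross join of a system view.
WITH n AS
(
    SELECT TOP (10000)
           ROW_NUMBER() OVER (ORDER BY (SELECT NULL)) AS i
    FROM sys.all_objects AS a
    CROSS JOIN sys.all_objects AS b
)
INSERT INTO dbo.orderbo (Name1, Name2)
SELECT 'name1_' + CAST(i AS VARCHAR(10)),
       'name2_' + CAST(i % 100 AS VARCHAR(10))
FROM n;
GO

-- List every page that belongs to the table and its indexes.
DBCC IND (test, orderbo, -1);
```

The page numbers and the number of index levels you get will differ from the ones discussed in this article, because they depend on row width and on how the database allocates pages.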

Execution Result:

In the result there is an index page 2112 with IndexLevel = 2. It is the root node of the B-tree: the page with the largest IndexLevel is the root, and below it are its children, grandchildren, and so on. There is only one root page, and it serves as the entry point to the B-tree. An IndexLevel of 2 on the root means there must also be index pages with IndexLevel = 1 and leaf pages with IndexLevel = 0. Because this is a clustered index, the IndexLevel = 0 leaf pages are data pages that store the physical rows. As you can see, the rows with IndexLevel = 0 have PageType 1, that is, they are data pages, as mentioned in section 1.1 on clustered indexes. For a nonclustered index, the IndexLevel = 0 leaf pages have PageType 2 and are still index pages.

Similarly, we can use the DBCC PAGE command to take a look:

--DBCC TRACEON (3604,-1)
DBCC PAGE (test,1,2112,3)  -- root node 2112; its two child nodes are 2280 and 2448
-- then run DBCC PAGE on those two child nodes
DBCC PAGE (test,1,2280,3)
DBCC PAGE (test,1,2448,3)

As the output shows, page 2112 at IndexLevel = 2 has two IndexLevel = 1 child nodes, 2280 and 2448, and those child nodes have children of their own. Each node is responsible for a different range of index key values (the "Id (key)" column; a NULL in the first row stands for the minimum value, or for the maximum value when the order is reversed). This hierarchy is exactly a B-tree structure, and IndexLevel is in effect the height within the B-tree.

When SQL Server looks up a record in an index, it always goes from the root node down to a leaf node, because all data addresses are in the leaf nodes. This is actually one of the characteristics of a B+ tree. (In a plain B-tree, a lookup can return as soon as the value is found on a non-leaf node, and SQL Server clearly does not do this. To verify it, run SET STATISTICS IO ON to turn on I/O statistics, then run a SELECT and look at the number of logical reads.)
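
That check can be sketched as follows. It assumes the orderbo table sits in the dbo schema with its clustered primary key column named Id, and the value 5000 is an arbitrary key chosen for illustration.

```sql
-- Report I/O statistics for each statement in this session.
SET STATISTICS IO ON;

-- A single-row lookup through the clustered index.
SELECT * FROM dbo.orderbo WHERE Id = 5000;

SET STATISTICS IO OFF;
```

The "logical reads" figure appears in the Messages output. If every lookup walks from the root to a leaf, it should stay the same no matter which existing Id you look up.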

Since a lookup must always reach a leaf node, index included columns only need to be recorded on the leaf nodes; non-leaf nodes do not store included columns. For "index included columns", see chapter 3 below.

This B+ tree property (all data addresses are in the leaf nodes) also helps range queries of the form between value1 and value2: just find value1 and value2 in the leaf level, and everything strung between them is the result.

The SQL Server index structure is more like a B+ tree; in the end it is something of a hybrid of the B-tree and the B+ tree. Data structures are designed by people, and a real one does not have to be a pure B-tree or a pure B+ tree.

3. Index included columns and bookmark lookups

On the subject of indexes, one more feature is worth mentioning: "index included columns", added starting with SQL Server 2005, which is very useful.

For example, when a large report query filters on the indexed column Name2 in its WHERE clause but selects the column Name1, you can use an included column to add Name1 to the index on Name2, which can greatly improve query performance.

Syntax: CREATE [UNIQUE] NONCLUSTERED INDEX IndexName ON dbo.Table1 (Name2) INCLUDE (Name1); (the INCLUDE clause applies to nonclustered indexes only)

Next, let's analyze why index included columns can improve performance so much. Again using the DBCC PAGE command, look at the index data of a nonclustered index that has an included column:

The output shows that the included column Name1 is stored in the index data as well. So once the database has located the target row through the indexed column Name2, it can return the value of Name1 directly instead of using the RID (the "HEAP RID (Key)" column) to locate the value in the data page, which avoids the bookmark lookup. When the query returns a single row, one bookmark lookup is of course negligible. But if the query returns a large number of rows and every row requires a trip to the data page, 1,000 rows means 1,000 bookmark lookups. The performance cost is considerable, and this is where index included columns show their value.
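
A concrete sketch of the difference follows, using the Table1, Name1, and Name2 names from this article. The index name IX_Table1_Name2 and the literal 'a' are illustrative assumptions.

```sql
-- Nonclustered index on Name2 that also carries Name1 at the leaf level.
CREATE NONCLUSTERED INDEX IX_Table1_Name2
    ON dbo.Table1 (Name2)
    INCLUDE (Name1);

-- Covered query: the filter column and the selected column are both in
-- the index, so no bookmark lookup into the data pages is needed.
SELECT Name1
FROM dbo.Table1
WHERE Name2 = 'a';

-- Not covered (assuming Table1 has further columns): the remaining columns
-- must be fetched from the data pages for every matching row.
SELECT *
FROM dbo.Table1
WHERE Name2 = 'a';
```

Comparing the execution plans and logical reads of the two queries is a simple way to see the lookups appear and disappear. The optimizer may still choose a scan for the second query when many rows match.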

As for the bookmark lookup itself: on a table with a clustered index (say on Id), it is similar to executing select Name1 from Table1 where Id = 1, that is, a lookup by the clustered index key Id (a B-tree search of the Id index). If the table has no clustered index, the lookup follows the data row pointer (made up of file number, 2 bytes : page number, 4 bytes : slot number, 2 bytes). Clustered index keys and row pointers are generally referred to collectively as RID (row ID) pointers. It follows that if your table has no good candidate column for a clustered index, it is advisable to add an auto-incrementing Id column as the clustered primary key (even a redundant Id column is fine). Such a column is auto-incrementing, never changes, is unique, and is short, which makes it a good choice for a clustered index.
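
Adding such a column to an existing table might look like the sketch below. It assumes dbo.Table1 is currently a heap with no Id column and no existing primary key or clustered index; the constraint name PK_Table1_Id is hypothetical.

```sql
-- Add an auto-incrementing Id and make it the clustered primary key.
-- Existing rows are numbered automatically; on a large table this
-- rebuilds the table and can take a while.
ALTER TABLE dbo.Table1
    ADD Id INT IDENTITY(1,1) NOT NULL
        CONSTRAINT PK_Table1_Id PRIMARY KEY CLUSTERED;
```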

An auto-incrementing ID suits most cases; special cases call for looking at the specific requirements. The auto-incrementing ID also has one drawback to consider: when a table with a large data volume receives concurrent inserts, every thread is inserting into the last page, so there is contention and waiting. To address this you can use a uniqueidentifier column (16 bytes, which I do not recommend) or hash partitioning (that is, splitting one table into several; splitting across databases and tables is normal in large-scale data processing), among other approaches. But I recommend optimizing your inserts first (inserts are fast to begin with) and testing whether the number of concurrent inserts per second meets the needs of the production environment, so that you can keep the simple, stable, and efficient auto-incrementing ID approach.

An auto-incrementing ID does not have to be the one provided by the database. You can also write your own algorithm that generates IDs that stay unique under concurrency (typically a bigint, an 8-byte integer). This suits distributed databases with master-slave replication where the ID values must match on both sides. (In the usual replication mode, the master's ID is incremented by the master and the slave's ID is incremented by the slave itself. If a deadlock or a similar problem breaks replication, the slave's auto-incremented IDs no longer match the master's.) If the auto-incrementing ID is only a redundant primary key, a mismatch between the master and slave ID values has no effect.

In addition, the last column, "Row Size", tells us that the index columns or included columns should not be too long. Otherwise a page holds only a few records, which greatly increases the number of index pages and the space that the index data occupies.

This article is an English version of an article which is originally in the Chinese language on aliyun.com and is provided for information purposes only. This website makes no representation or warranty of any kind, either expressed or implied, as to the accuracy, completeness ownership or reliability of the article or any translations thereof. If you have any concerns or complaints relating to the article, please send an email, providing a detailed description of the concern or complaint, to info-contact@alibabacloud.com. A staff member will contact you within 5 working days. Once verified, infringing content will be removed immediately.
