SQL Server index (principle, storage) clustered index, nonclustered index, heap < first >

Source: Internet
Author: User

first, the storage structure

In SQL Server, there are many different options available for arranging rules.

Binary: Sort by the numeric representation of the characters (ASCII, with the number 32 for spaces, and 68 for the letter "D"). Because everything is represented as a number, it is the fastest to handle, and unfortunately it is not always as you would expect it to be a serious mess to use when comparing in the WHERE clause.

Dictionary order: This sort is sorted in the same way as you see in the dictionary, but with few differences, you can set a number of different additional options to determine whether to differentiate between case, tone, and character set.

  1. Balance tree (b-Tree)

A balance tree or B-tree provides a way to find specific information in a consistent and relatively low-cost way. The "balance" in its name is self-explanatory. The balance tree is self-balancing, which means that each tree branches with nearly half the data on one side, while the other half is on the other side. The origin of the tree name is because, if the structure is drawn, and then upside down, it is found to resemble a tree, so it is called a tree.

The balance tree starts at the root node. If there is a small amount of data, the root node can point directly to the actual location of the data.

Structure diagram:

    

Therefore, start with the root node and browse the record until you find the last page that starts with a value that is less than the lookup value. Then get a pointer to the node and browse for it. Until you find the line you want.

When there is a lot of data, the root node points to the middle node (not the page-level node). A non-page-level node is a node that is located between the root node and the node that describes the physical storage location of the data

    root node, intermediate node (non-leaf-level node) [n]-> Storage location node (leaf-level node)

Non-leaf-level nodes can point to other non-leaf-level nodes or leaf-level nodes. a leaf-level node is a node from which a reference to the actual physical data is obtained.

  

As you can see, the lookup starts at the root node and then moves to the node in the next level node that starts at the highest value that is equal to or less than the lookup value. Then repeat the process-find the highest starting value node with a value equal to or less than the lookup. Continue down the tree level down to level two nodes-thus knowing the physical location of the data.

  2. Page splitting

All of these pages work well in reading-but it can be a bit cumbersome to insert. The B-tree structure is mentioned earlier, and each time you encounter a branch in the tree, because each side has about half the data, the B-tree is balanced. In addition, because the method of adding new data to the tree generally avoids imbalance, the B-tree is sometimes considered self-balanced.

By adding data to the tree, the node will eventually become full and will need to be split. Because in SQL Server, a node is the equivalent of a page- so this is called a page split. :

When page splits occur, the data is automatically moved around to maintain balance, the first half of the data remains on the old page, and the rest of the data is added to the new page-thus forming a half-split so that the tree remains balanced.

If you consider this split process, you will realize that it adds a lot of overhead to the split. Not just inserting a page will also do the following:

    • Create a new page
    • Move rows from an existing data page to a new page
    • Add a new row to one of the pages
    • Add another record entry in the parent node

Note the last one, if you add a record to the parent node, the parent page is full, causing the split, and the entire process restarts. It can even affect the root node. Also, if the root node is split, the actual end result is to create two additional pages, so there can only be one root node, so the page of the previous root node is split into two pages, and is called the new intermediate level node of the tree. The new root node is then created, and there will be two record entries that point to the two intermediate nodes that have just been spun off by the root node.

As you can see from the above, the number of page splits becomes smaller as you move toward the top of the tree. Because a subordinate page split is a record for the superior.

Although SQL Server has many different types of indexes, all of these indexes use this balanced tree method in some way. In fact, because of the flexible nature of the balance tree, all indexes are structurally similar, but they actually have a little bit of a difference, and these differences can have an impact on the performance of the system.

  3. How data is accessed in SQL Server

Broadly speaking, there are only two ways for SQL Server to retrieve the required data:

    • Use full table scan
    • Working with Indexes

  1. Use full table scan

Table scans are fairly straightforward. When a table scan is performed, SQL Server starts at the physical starting point of the table and browses through each row in the table. When a row is found that matches the query criteria, they are included in the result set. Many of the statements about table scanning are inefficient, but if the table data is reduced, the actual use of table scans is the fastest.

  2. Use index

During query optimization, the optimizer looks at all available index structures and chooses the best one (this is based primarily on the information specified in the connection and where clauses, as well as the statistics that SQL Server saves in the index structure). Once the index is selected, SQL Server navigates to the data location that matches the criteria in the tree structure and extracts only the records it needs. The difference with table scanning is that because the data is sorted, the query engine knows when it reaches the lower bound of the current range being looked up. It can then end the query, or move to the next data range as needed. The way exists works is to find the matching records that SQL Server immediately stops. The performance of using an index is similar to or better than using exists, because the process of finding data works the same way, that is, the server may use an index to know when there is no relevant content and stop immediately. In addition, you can perform very fast lookups (called seek) on the data, rather than looking in the entire table.

  3. Index type and index navigation

Although there are two kinds of index structures (clustered and nonclustered indexes) on the surface in SQL Server, there are 3 different index types internally.

    • Clustered index
    • Nonclustered indexes, where nonclustered indexes also include the following two types:
    1. Nonclustered indexes on a heap
    2. Nonclustered indexes on clustered tables

How physical data is stored is different in clustered and nonclustered indexes. The way SQL Server traverses the balance tree to reach the end data is also different in all 3 types of indexes.

All SQL Server indexes have leaf-and non-leaf-level pages, and the leaf level is the "key" level that holds the identity record, and the non-leaf level is the leader of the leaf level.

  The index is created on the clustered table (if the table has a clustered index) or on a heap (for tables that do not have a clustered index).

  (1), Aggregation table

A clustered table is any table on which you have a clustered index. However, they mean that the data is physically stored in the specified order for the table. A column that defines a clustered index is defined by using the clustered index key to uniquely flag a separate row-clustered key.

What if the clustered index is not unique? How can a clustered index be used to uniquely flag a row if the index is not a unique index? SQL Server adds a suffix to the key internally to ensure that the row has a unique identifier.

  (2), heap

  a heap is a table on which there is no clustered index . In this case, a unique identifier, or row ID (RID), is created based on the combination of the row's segment, page, and row offset (the position at the top of the offset page) . If no clustered key is available (no clustered index), then the RID is the only necessary content. The heap table is not a B-tree structure.

  4. Clustered index

A clustered index is unique to any given table, and a table can have only one clustered index. There is not necessarily a clustered index. A special aspect of a clustered index is that the leaf level of the clustered index is the actual data -that is, the data is reordered and stored in the same physical order as the clustered index ordering criteria declaration. This means that once the leaf level of the index is reached, the data is reached. Instead of a clustered index, reaching the leaf level just finds a reference to the data.

  any new records are inserted into the clustered index according to the correct physical order of the clustered columns . The way you create a new page varies depending on where you want to insert the record. If the new record needs to be inserted in the middle of the index structure, normal page splits occur. The second half of the record from the old page is moved to a new page, and when appropriate, the new record is inserted into a new or old page. If the new record is logically at the end of the index structure, a new page is created, but only new records are added to the new page.

  

From the point of view of data insertion, it should be possible to see the benefits of using the int type as a clustered index.

To illustrate the order in which the indexes are tables, take a look at the table:

  

Then set up the clustered index in the ID column:

CLUSTERED on person  (ID) --Build ID column clustered index 

To execute a query statement:

From person

  

Index Person.index_name    --delete index    CLUSTERED --Re-rebuild Name column clustered index    

Then execute the query statement:

From person

The output results are as follows:

  

Notice the same statement that the return has changed. A clustered index can be the order of the table, affecting the top statement.

  5. Navigation tree

Even indexes are stored in the balance tree in SQL Server, and in theory, the balance tree always has general residual information in every possible direction as a branch of a tree. The balance tree of the clustered index is shown in the form.

  

Here, you perform a range query on the number 158-400 (things that the clustered index is good at), and you only need to navigate to the first record and include all the remaining records on that page. Knowing that the rest of the page is needed is because the information from the previous level node also needs data from some other pages. Because this is an ordered table, you can be sure that it is sequential-this means that if the next page has a qualifying record, the rest of the page must be included. No validation is required.

First, navigate to the root node. SQL Server is able to give the sys.indexes system metadata view to locate the root node in the Save record entry.

Light said no practice, purely fraud, the following is a 10,000-line Persontenthousand table to illustrate the B-tree structure of the data page read promotion.

The contents of the table are as follows:

  

At first this table does not have any indexes:

  

Because there is no index on this table, only the required data can be retrieved from the heap table scan, so the entire table is scanned once for either the ID or the name column. Therefore, read all the data pages in a read-ahead, logical read.

The following establishes a clustered index in the ID column of the table:

On Persontenthousand (ID)

To perform the same query again:

  

We see that because the ID column is a clustered index, the advantage of the B-tree structure is fully developed by the ID lookup, which requires only 2 physical reads to locate the data.

There is no index on the name column, so you still need to read it 838 times (or a clustered table scan) to locate the data.

The above examples fully illustrate the advantages of B-tree structure.

  6. Nonclustered indexes

  6.1 Nonclustered Index Benefits:

1, because one page in SQL Server is only 8K, page space is limited, so the fewer columns a row contains, the more rows it can save. A nonclustered index typically does not contain all the columns in a table, and it generally contains only a very small number of columns. Therefore, a single page will be able to contain a nonclustered index that is more rows than the table row (all columns). Therefore, you can also read a page that may contain 200 rows in a nonclustered index, but there may be only 10 rows in the table, with the size of the table rows and the size of the nonclustered columns determined.

2. Another benefit of nonclustered indexes is that it has a structure that is independent of the data table, so it can be placed in different filegroups, using different I/O paths, which means that SQL Server can access indexes and tables in parallel, making lookups faster.

Here's a look at the benefits of nonclustered indexes:

  

Suppose there is a single column table with 27 rows and 3 rows on each page. There is no order, and if we are looking for a row with a value of 5, we need to read 9, because it must scan to the last page to be able to determine that none of the pages has a value of 5.

If a nonclustered index is established:

  

Looking for a row with a value of 5 again, the number of reads required is 2, why? Because nonclustered indexes are sequential, when SQL Server reads the row with a value of 6, it knows that it will not have to be read anymore. So what if you want to read a page with a value of 25? Still requires 9 read operations. Because it happens to be on the last page. Exactly this thing can be optimized by the B-tree structure. The B-tree algorithm speeds up the data access process by minimizing the number of pages that are required to locate the key value accessed.

  6.2 Cost of nonclustered indexes

There is a cost to the performance benefits of indexing. Indexed tables require more storage and memory space to hold index pages outside of the data page. Adding or deleting data may take longer and require more processing time to maintain indexes on changing tables. If an INSERT statement adds a row to the table, it must also add a row to the index structure. If the index is a clustered index, the overhead may be greater because the rows must be added to the data page in the correct order (of course the Int clustered column and the string clustered column will be different). Update and delete are similar.

Although the index has some impact on additions and deletions, it is important to remember that the update or delete row must be found on a row, so the index is actually useful for update or delete with complex where conditions. The effectiveness of locating a row with an index can often compensate for the additional overhead of updating the index. Unless the index design is unreasonable.

  7. Nonclustered indexes on the heap

One point here is that, whether on a heap or on a clustered column, a nonclustered index is sorted and stored. Sort by nonclustered index column.

Nonclustered indexes and clustered indexes on the heap are tools in a similar way in most respects. The differences are as follows:

The leaf level is not data-instead, it is a level from which you can get pointers to that data. The pointer appears as a RID (the heap on which a RID appears, with a clustered key on the clustered table), which consists of the segment, page, and row offset of the particular row to which the index points. That is, the leaf level is not the actual data, and using the leaf level is only one step more than using a clustered index. Because the RID has all the information about the position of the row, it can reach the data directly.

One step down, the actual difference in system overhead is very large.

With a clustered index, the data is physically arranged in the order of the clustered index. This means that for a range of data, when you find a row on which you have the beginning of the data range, it is likely that there are other rows on the same page (that is, because they are stored together, the next record is physically almost reached).

With heaps, data is not connected by other methods than indexes. Physically, there is absolutely no sorting of any kind. This means that from a physical reading perspective, the system has to retrieve records from the entire file. In fact, it is likely that the data will eventually be fetched from the same page multiple times. SQL Server does not have a method to instruct it to return to that physical location because there is no connection between the data. Therefore, the nonclustered indexes on the heap work by scanning the nonclustered indexes on the heap, Locating (row_number), each finding a RID, and getting the data through the RID. If the search returns more than one record, performance may not be the same as scanning the entire table. shows that the nonclustered index on the heap performs the same query as the clustered index above:

  

Primarily through index navigation, everything works in the same way as before, starting with the same root node, and all over again, processing more and more concentrated pages. Until the leaf level of the index is reached. There's a difference here. With a clustered index, you can stop right here, and in a nonclustered index, you need to do more work. If the index is on a heap, you get rid of the leaf-level page as soon as you enter a level, and you navigate to the rid-until the actual data is available directly.

  8. Nonclustered indexes on clustered tables

There are some similarities when using non-clustered indexes on clustered tables-but there are also differences. As with non-clustered indexes on a heap, the index's non-leaf level and the single work are almost the same as when using a clustered index. The difference appears at the leaf level.

At the leaf level, there is a significant difference from what you see with the other two index structures. A non-clustered index on a clustered table has another index to find. With a clustered index, when you reach the leaf level, you can find the actual data, and when you use a non-clustered index on the heap, you cannot find the actual data, but you can find an identifier that directly obtains the data (just one more step). You can find the clustered key by using a nonclustered index on the clustered table. In other words, find enough information to continue and take advantage of the clustered index.

The understanding above is that when using a nonclustered index, the nonclustered index is traversed to find the clustered index, and finally the clustered index is used to find the data.

The end result is as follows:

  

The first is a range search. Perform a separate lookup in the index, and you can browse the nonclustered index to find a contiguous range of data that satisfies the criteria (t%). This lookup, which is able to reach a specific location in the index directly, is called Seek.

Then the second lookup-using a clustered index lookup, the second finding is very quick: the problem is that it must be executed multiple times. Can see. SQL Server looks up the search list from the first index (all lists that begin with "T"), but the list logically does not match the clustered key in any contiguous way-each record is looked up individually. Shown in the figure:

  

Naturally, this multiple lookup scenario introduces more overhead than the initial use of a clustered index. The first index lookup-through a nonclustered index method-requires very little logical read operations.

Note that using a nonclustered index on the clustered table, a list of clustered index keys is found. Then use this list to find the data that you want by using a clustered index.

Note that if the table does not have a clustered index and a nonclustered index is established, then the nonclustered index uses the row number, and if you add a clustered index at this point, all the RID that is referenced by the nonclustered index is changed to the clustered index key. This is very expensive for performance, so it's a good idea to build a clustered index before creating a nonclustered index.

A few points about indexing:

    1. Clustered indexes are usually faster than non-clustered indexes (bookmarks).
    2. Non-clustered indexes are placed only on columns that will get high-level selectivity (more than 90%).
    3. All data manipulation languages (Dml:insert, UPDATE, delete, SELECT) statements can benefit from indexing, but insertions, deletions, and updates are slower because of indexing.
    4. The index takes up space.
    5. The index is used only if the first column in the index is related to the query.
    6. The negative impact of the index is as much as its positive impact-so only the desired index is established.
    7. Indexes can provide structured data performance for unstructured XML data, but remember that, as with other indexes, overhead is involved.

SQL Server index (principle, storage) clustered index, nonclustered index, heap < first >

Related Article

Contact Us

The content source of this page is from Internet, which doesn't represent Alibaba Cloud's opinion; products and services mentioned on that page don't have any relationship with Alibaba Cloud. If the content of the page makes you feel confusing, please write us an email, we will handle the problem within 5 days after receiving your email.

If you find any instances of plagiarism from the community, please send an email to: info-contact@alibabacloud.com and provide relevant evidence. A staff member will contact you within 5 working days.

A Free Trial That Lets You Build Big!

Start building with 50+ products and up to 12 months usage for Elastic Compute Service

  • Sales Support

    1 on 1 presale consultation

  • After-Sales Support

    24/7 Technical Support 6 Free Tickets per Quarter Faster Response

  • Alibaba Cloud offers highly flexible support services tailored to meet your exact needs.