Database storage structure: page, clustered index, nonclustered index
To understand the database storage structure, understand how the database is accessed first and then understand the database storage structure in principle.
I. How data is accessed in SQL Server
Broadly speaking, there are only two ways for SQL Server to retrieve the required data:
(1) Use full table scan
(2) using the index
1. Use full table scan
Table scans are fairly straightforward. When a table scan is performed, SQL Server starts at the physical starting point of the table and browses through each row in the table. When a row is found that matches the query criteria, they are included in the result set. Many of the statements about table scanning are inefficient, but if the table data is reduced, the actual use of table scans is the fastest.
2. Use index
During query optimization, the optimizer looks at all available index structures and chooses the best one (this is based primarily on the information specified in the connection and where clauses, as well as the statistics that SQL Server saves in the index structure). Once the index is selected, SQL Server navigates to the data location that matches the criteria in the tree structure and extracts only the records it needs. The difference with table scanning is that because the data is sorted, the query engine knows when it reaches the lower bound of the current range being looked up. It can then end the query, or move to the next data range as needed. The way exists works is to find the matching records that SQL Server immediately stops. The performance of using an index is similar to or better than using exists, because the process of finding data works the same way, that is, the server may use an index to know when there is no relevant content and stop immediately. In addition, you can perform very fast lookups (called seek) on the data, rather than looking in the entire table.
3. Index type and index navigation
Although there are two kinds of index structures (clustered and nonclustered indexes) on the surface in SQL Server, there are 3 different index types internally: ① clustered index,
② nonclustered indexes.
The nonclustered indexes include the following two types: nonclustered indexes on the ① heap
② nonclustered indexes on clustered tables
How physical data is stored is different in clustered and nonclustered indexes. The way SQL Server traverses the balance tree to reach the end data is also different in all 3 types of indexes.
All SQL Server indexes have leaf-and non-leaf-level pages, and the leaf level is the "key" level that holds the identity record, and the non-leaf level is the leader of the leaf level.
The index is created on the clustered table (if the table has a clustered index) or on a heap (for tables that do not have a clustered index).
(1), Aggregation table
A clustered table is any table on which you have a clustered index. However, they mean that the data is physically stored in the specified order for the table. A column that defines a clustered index is defined by using the clustered index key to uniquely flag a separate row-clustered key.
What if the clustered index is not unique? How can a clustered index be used to uniquely flag a row if the index is not a unique index? SQL Server adds a suffix to the key internally to ensure that the row has a unique identifier.
(2), heap
a heap is a table on which there is no clustered index . In this case, a unique identifier, or row ID (RID), is created based on the combination of the row's segment, page, and row offset (the position at the top of the offset page) . If no clustered key is available (no clustered index), then the RID is the only necessary content. The heap table is not a B-tree structure.
4 , clustered index
A clustered index is unique to any given table, and a table can have only one clustered index. There is not necessarily a clustered index. A special aspect of a clustered index is that the leaf level of the clustered index is the actual data -that is, the data is reordered and stored in the same physical order as the clustered index ordering criteria declaration. This means that once the leaf level of the index is reached, the data is reached. Instead of a clustered index, reaching the leaf level just finds a reference to the data.
any new records are inserted into the clustered index according to the correct physical order of the clustered columns . The way you create a new page varies depending on where you want to insert the record. If the new record needs to be inserted in the middle of the index structure, normal page splits occur. The second half of the record from the old page is moved to a new page, and when appropriate, the new record is inserted into a new or old page. If the new record is logically at the end of the index structure, a new page is created, but only new records are added to the new page.
From the point of view of data insertion, it should be possible to see the benefits of using the int type as a clustered index.
To illustrate the order in which the indexes are tables, take a look at the table:
Then set up the clustered index in the ID column:
CREATE CLUSTERED
Index index_name on person (ID)--Build ID column clustered index
To execute a query statement:
Select Top 3 * from person
Drop INDEX person.index_name--Delete indexes
CREATE CLUSTERED
Index index_name on person (name)--again in the Rebuild Name column clustered index
Then execute the query statement:
Select Top 3 * from person
The output results are as follows:
Notice the same statement that the return has changed. A clustered index can be the order of the table, affecting the top statement.
5 , navigation tree
Even indexes are stored in the balance tree in SQL Server, and in theory, the balance tree always has general residual information in every possible direction as a branch of a tree. The balance tree of the clustered index is shown in the form.
Here, you perform a range query on the number 158-400 (things that the clustered index is good at), and you only need to navigate to the first record and include all the remaining records on that page. Knowing that the rest of the page is needed is because the information from the previous level node also needs data from some other pages. Because this is an ordered table, you can be sure that it is sequential-this means that if the next page has a qualifying record, the rest of the page must be included. No validation is required.
First, navigate to the root node. SQL Server is able to give the sys.indexes system metadata view to locate the root node in the Save record entry.
Light said no practice, purely fraud, the following is a 10,000-line Persontenthousand table to illustrate the B-tree structure of the data page read promotion.
The contents of the table are as follows:
At first this table does not have any indexes:
Because there is no index on this table, only the required data can be retrieved from the heap table scan, so the entire table is scanned once for either the ID or the name column. Therefore, read all the data pages in a read-ahead, logical read.
The following establishes a clustered index in the ID column of the table:
CREATE CLUSTERED INDEX index1 on Persontenthousand (ID)
To perform the same query again:
We see that because the ID column is a clustered index, the advantage of the B-tree structure is fully developed by the ID lookup, which requires only 2 physical reads to locate the data.
There is no index on the name column, so you still need to read it 838 times (or a clustered table scan) to locate the data.
The above examples fully illustrate the advantages of B-tree structure.
6 , nonclustered indexes
6.1 Nonclustered Index Benefits:
1, because one page in SQL Server is only 8K, page space is limited, so the fewer columns a row contains, the more rows it can save. A nonclustered index typically does not contain all the columns in a table, and it generally contains only a very small number of columns. Therefore, a single page will be able to contain a nonclustered index that is more rows than the table row (all columns). Therefore, you can also read a page that may contain 200 rows in a nonclustered index, but there may be only 10 rows in the table, with the size of the table rows and the size of the nonclustered columns determined.
2. Another benefit of nonclustered indexes is that it has a structure that is independent of the data table, so it can be placed in different filegroups, using different I/O paths, which means that SQL Server can access indexes and tables in parallel, making lookups faster.
Here's a look at the benefits of nonclustered indexes:
Suppose there is a single column table with 27 rows and 3 rows on each page. There is no order, and if we are looking for a row with a value of 5, we need to read 9, because it must scan to the last page to be able to determine that none of the pages has a value of 5.
If a nonclustered index is established:
Looking for a row with a value of 5 again, the number of reads required is 2, why? Because nonclustered indexes are sequential, when SQL Server reads the row with a value of 6, it knows that it will not have to be read anymore. So what if you want to read a page with a value of 25? Still requires 9 read operations. Because it happens to be on the last page. Exactly this thing can be optimized by the B-tree structure. The B-tree algorithm speeds up the data access process by minimizing the number of pages that are required to locate the key value accessed.
6.2 Cost of nonclustered indexes
There is a cost to the performance benefits of indexing. Indexed tables require more storage and memory space to hold index pages outside of the data page. Adding or deleting data may take longer and require more processing time to maintain indexes on changing tables. If an INSERT statement adds a row to the table, it must also add a row to the index structure. If the index is a clustered index, the overhead may be greater because the rows must be added to the data page in the correct order (of course the Int clustered column and the string clustered column will be different). Update and delete are similar.
Although the index has some impact on additions and deletions, it is important to remember that the update or delete row must be found on a row, so the index is actually useful for update or delete with complex where conditions. The effectiveness of locating a row with an index can often compensate for the additional overhead of updating the index. Unless the index design is unreasonable.
7 , nonclustered indexes on the heap
One point here is that, whether on a heap or on a clustered column, a nonclustered index is sorted and stored. Sort by nonclustered index column.
Nonclustered indexes and clustered indexes on the heap are tools in a similar way in most respects. The differences are as follows:
The leaf level is not data-instead, it is a level from which you can get pointers to that data. the pointer appears as a RID (the heap on which a RID appears, with a clustered key on the clustered table), which consists of the segment, page, and row offset of the particular row to which the index points. That is, the leaf level is not the actual data, and using the leaf level is only one step more than using a clustered index. Because the RID has all the information about the position of the row, it can reach the data directly.
One step down, the actual difference in system overhead is very large.
With a clustered index, the data is physically arranged in the order of the clustered index. This means that for a range of data, when you find a row on which you have the beginning of the data range, it is likely that there are other rows on the same page (that is, because they are stored together, the next record is physically almost reached).
With heaps, data is not connected by other methods than indexes. Physically, there is absolutely no sorting of any kind. This means that from a physical reading perspective, the system has to retrieve records from the entire file. In fact, it is likely that the data will eventually be fetched from the same page multiple times. SQL Server does not have a method to instruct it to return to that physical location because there is no connection between the data. Therefore, the nonclustered indexes on the heap work by scanning the nonclustered indexes on the heap, Locating (row_number), each finding a RID, and getting the data through the RID. If the search returns more than one record, performance may not be the same as scanning the entire table. shows that the nonclustered index on the heap performs the same query as the clustered index above:
Primarily through index navigation, everything works in the same way as before, starting with the same root node, and all over again, processing more and more concentrated pages. Until the leaf level of the index is reached. There's a difference here. With a clustered index, you can stop right here, and in a nonclustered index, you need to do more work. If the index is on a heap, you get rid of the leaf-level page as soon as you enter a level, and you navigate to the rid-until the actual data is available directly.
8 , nonclustered indexes on clustered tables
There are some similarities when using non-clustered indexes on clustered tables-but there are also differences. As with non-clustered indexes on a heap, the index's non-leaf level and the single work are almost the same as when using a clustered index. The difference appears at the leaf level.
At the leaf level, there is a significant difference from what you see with the other two index structures. A non-clustered index on a clustered table has another index to find. With a clustered index, when you reach the leaf level, you can find the actual data, and when you use a non-clustered index on the heap, you cannot find the actual data, but you can find an identifier that directly obtains the data (just one more step). You can find the clustered key by using a nonclustered index on the clustered table. In other words, find enough information to continue and take advantage of the clustered index.
The understanding above is that when using a nonclustered index, the nonclustered index is traversed to find the clustered index, and finally the clustered index is used to find the data.
The end result is as follows:
The first is a range search. Perform a separate lookup in the index, and you can browse the nonclustered index to find a contiguous range of data that satisfies the criteria (t%). This lookup, which is able to reach a specific location in the index directly, is called Seek.
Then the second lookup-using a clustered index lookup, the second finding is very quick: the problem is that it must be executed multiple times. Can see. SQL Server looks up the search list from the first index (all lists that begin with "T"), but the list logically does not match the clustered key in any contiguous way-each record is looked up individually. Shown in the figure:
Naturally, this multiple lookup scenario introduces more overhead than the initial use of a clustered index. The first index lookup-through a nonclustered index method-requires very little logical read operations.
Note that using a nonclustered index on the clustered table, a list of clustered index keys is found. Then use this list to find the data that you want by using a clustered index.
Note that if the table does not have a clustered index and a nonclustered index is established, then the nonclustered index uses the row number, and if you add a clustered index at this point, all the RID that is referenced by the nonclustered index is changed to the clustered index key. This is very expensive for performance, so it's a good idea to build a clustered index before creating a nonclustered index.
A few points about indexing:
1, clustered indexes are usually faster than non-clustered indexes (bookmarks).
2. Place non-clustered indexes only on columns that will get high-level selectivity (more than 90%).
3. All data manipulation languages (Dml:insert, UPDATE, delete, SELECT) statements can benefit from indexing, but insertions, deletions, and updates are slower because of indexing.
4, the index will occupy space.
5. Use indexes only when the first column in the index is relevant to the query.
6. The negative impact of the index is as much as its positive impact-so only the desired index is established.
7. Indexes can provide structured data performance for unstructured XML data, but keep in mind that, as with other indexes, overhead is involved.
Database storage structure: page, clustered index, nonclustered index