Definition: An index is a structure that sorts the values of one or more columns in a database table and uses an index to quickly access specific information in a database table.
For example such a query: SELECT * FROM table1 where id=10000. If there is no index, you must traverse the entire table until the row with the ID equals 10000 is found, and after the index (which must be an index established on the ID column), look in the index (for example, binary lookup, which will quickly locate the position of 10000 in the index table. If the index stores the ID and the address of the ID, it points directly to the address to get the corresponding data, but the index is optimized by some algorithm, and the number of lookups is much less. It is visible that the index is used for positioning.
What is stored in the index? How does an index speed up the query?
First, Introduction
The focus on database indexing has never faded from my discussions, so what is a database index? What is the difference between a clustered index and a nonclustered index? I hope this article to colleagues have some help. There are many places of doubt, sincerely hope that you do not hesitate to correct, common progress. [Recent front page controversy, also do not know this put in this suitable, elbow grease? Credit? ......]
Second, B-tree
Our common database systems, whose indexes use data structures are mostly b-tree or b+tree. For example, MSSQL uses b+tree,oracle and sysbase to use B-tree. So at the very beginning, simply introduce B-tree.
B-tree is different from binary tree (binary trees, up to two subtrees), and an M-order of B-tree satisfies the following conditions:
1) Each node has a maximum of M children;
2) root nodes and leaf nodes, each of the other nodes has at least M/2 children;
3) a root node has at least two children (unless the tree contains only one node);
4) All leaf nodes are at the same level, and leaf nodes do not contain any keyword information;
5) A non-leaf node with k keywords exactly contains k+1 children;
In addition, for a node, its internal keywords are from small to large sort. The following is a sample of B-tree (m=4):
For each node, it consists of a key word group key[], an array of pointers (pointing to Son) son[]. Within B-tree, the lookup process is: Use order lookup (when the array length is shorter) or binary Find method find key[] array, if the keyword K is found, then return the address of the node and the location of K in key[], otherwise, can be determined that k between a key[i] and key[i+1], continues to be found from a sub-node referred to by son[i] until it finds success in a node, or until a leaf node is found and the find in the leaf node is still unsuccessful, the lookup process fails.
Next, we use a slice to demonstrate how to generate B-tree (m=4, insert 1~6 in turn):
From the diagram, when we insert the keyword 4 o'clock, because the original node is full, so split, basically divided by half of the principle, and then take out the middle of the keyword 2, upgrade (this is the root node). Other analogies are such an approximate process.
Third, database index
1. What is an index
In the database, the meaning of the index is not much different from the term "index" in everyday sense (think of the dictionary as a child), which is the database object used to improve the speed of database table data access.
A) The index avoids A full table scan. Most queries can scan only a small number of index pages and data pages, rather than traversing all data pages.
B) for nonclustered indexes, some queries can even not access data pages.
C) A clustered index prevents the data insertion operation from concentrating on the last data page of the table.
D) In some cases, the index can also be used to avoid sorting operations.
Of course, it is well known that although indexes can improve query speed, they also cause performance degradation in database system Update data because most data updates require that indexes be updated at the same time.
2. Storage of Indexes
The basic information contained in an index record consists of a key value (that is, the value of all the fields specified when you define the index) + logical pointers (point to the data page or another index page).
When you create an index for an empty table, the database system assigns you an index page that is empty until you insert the data. This page is both a root node and a leaf node at this point. Whenever you insert a row of data into a table, the database system inserts a row of index records into the root node. When the root node is full, the database system is probably split as follows:
A) Creation of two sons knot points
B) The data in the original root node is roughly divided into two halves, and a new two sons node is written separately.
C) Add a pointer to the two-son node in the root node.
Typically, because index records contain only indexed field values (and 4-9-byte pointers), indexed entities are much smaller than real data rows, and index pages are much denser than data pages. An index page can store a larger number of index records, which means that there is a significant advantage in I/O when looking in the index, and understanding this helps to understand the benefits of using indexes in essence.
3. Type of index
A) clustered index, table data stored in the order of the index. For clustered indexes, the leaf nodes store the actual rows of data and no longer have separate data pages.
B) Nonclustered indexes, the table data storage order is independent of the index order. For a nonclustered index, the leaf node contains the indexed field value and a logical pointer to the data page data row, which is adjacent to the data page and has the same number of rows as the data table row data.
Only one clustered index can be created on a table, because the physical order of real data can only be one. If a table does not have a clustered index, it is called a heap. Data rows in such a table do not have a specific order, and all new rows are added at the end of the table.
4. Clustered index
In a clustered index, the leaf node is also the data node, and all data rows are stored in the same order as the index is stored.
1) clustered index and query operation
For example, if we establish a clustered index on a name field, the database system looks for the root of this index based on a particular system table and then finds the next one, based on the pointer, when it is necessary to find a particular record based on this field. For example we want to query "green", because it is between [Bennet,karsen], so we found the index page 1007, in the page "Green" between [Greane, Hunter], so we find the leaf node 1133 (also known as Data node), The target data row is finally found on this page.
The IO for this query includes queries for 3 index pages (the last time it was actually queried on a data page). The lookup here may be read from disk (physical read) or read from the cache (Logical read), and if this table is accessed more frequently, indexes in the index tree that are higher are likely to be found in the cache. So the real IO may be smaller than the above scenario.
2) clustered index and insert operation
In the simplest case, the insert operation finds the corresponding data page based on the index and then moves the existing records to make room for the new data and finally inserts the data.
If the data page is full, you need to split the data page (page splitting is a resource-intensive operation, and there are mechanisms in the general database system to minimize the number of page splits, usually by reserving space for each page):
A) A new data page is allocated on the data segment (extent) that is used, and if the data segment is full, a new segment needs to be allocated.
B) Adjust the index pointer, which needs to read the corresponding index page into memory and lock.
C) About half of the data rows are grouped into new data pages.
D) If the table also has nonclustered indexes, you need to update these indexes to point to the new data page.
Special cases:
A) If a newly inserted record contains large data, it is possible to allocate two new data pages, one of which is used to store the new record and the other to store the data that is split from the original page.
B) duplicate data records are typically stored on the same page in the database system.
C) similar to the self-increment column as a clustered index, the database system may not split the data page, and the page is simply a newly added data page.
3) clustered index and delete operation
Deleting rows causes the rows of data below it to move up to fill the whitespace caused by deleting records.
If the deleted row is the last row in the data page, the data page will be reclaimed and the records in the corresponding index page will be deleted. If the recovered data page is on the same segment as the other data pages of the table, it may be exploited in the subsequent time. If the data page is the only data page for that segment, the segment is also recycled.
The deletion of the data may result in only one record in the index page, at which point the record may be moved to a neighboring index page, and the original index page will be recycled, the so-called "index merge".
5. Nonclustered Indexes
A nonclustered index is compared to a clustered index:
A) leaf nodes are not data nodes.
B) The leaf node stores a "key-pointer" pair for each real data row.
C) The leaf node also stores a pointer offset, which can be positioned to a specific data row based on the page pointer and pointer offset.
D) Similarly, the other index nodes outside the leaf node are stored in a similar context, except that it refers to the index page at the next level.
A clustered index is a sparse index, and the index page on the top level of the data page stores the page pointer, not the row pointer. For a nonclustered index, it is a dense index that stores an index record for each row of data at the top-level index page of the data page.
For index records of the root and intermediate levels, its structure includes:
A) Indexed field values
B) RowId (that is, the page pointer + pointer offset of the corresponding data page). The index page in the upper level contains rowid to precisely locate the data row when the data is changed when the index allows duplicate values.
C) pointers to next-level index pages
For the Index object of the leaf layer, its structure includes:
A) Indexed field values
B) RowId
1) Nonclustered indexes and query operations
For, if we also look for "Green", then a query operation will contain the reading of +1 data pages read by the following Io:3 index pages. Similarly, due to the cached relationship, real IO may actually be smaller than those listed above.
2) nonclustered index and insert operation
If a table contains a nonclustered index but no clustered index, the new data is inserted into the last data page, and then the nonclustered index is updated. If you also include a clustered index, the clustered index will be used to find where the new row will be, and then the clustered index, and the nonclustered index, will be updated.
3) Nonclustered indexes and delete operations
If a nonclustered index is built on the column contained in the WHERE clause of the delete command, the nonclustered index is used to find the location of the data row, and the corresponding record on the leaf of the index is deleted after the data is deleted. If there are other nonclustered indexes on the table, the corresponding data on their leaf nodes will also be deleted.
If the deleted data is the only one in the page of that number, the page is also recycled, and the pointers on each index tree need to be updated.
Because there is no automatic merge feature, if you have frequent random deletions in your application, you may end up with a table that contains more than one data page, but only a small amount of data per page.
6. Index Overrides
Index overrides are an indexing strategy where the required fields contained in a query are included in an index, and the index greatly improves query performance.
An index that contains multiple fields, called a composite index. The index can contain up to 31 fields, and the index record has a maximum length of 600B. If you create a composite nonclustered index on several fields, and the fields involved in the required select field in your query and the Where,order by,group by,having clause are included in the index, only the index page is searched to satisfy the query, without the need to access the data page. Because the leaf nodes of a nonclustered index contain the indexed column values in all rows of data, using these nodes can return real data, which is called index overlay.
In the case of an index overlay, two index scans are included:
A) matching index scan
B) non-matching index scan
1) Matching index scan
Such an index scan allows us to omit the steps to access a data page, where performance gains are limited when the query returns only one row of data, but in the case of a range query, the performance increase will grow with the number of result sets.
For such scans, the index must contain all the fields involved in the query, as well as the leading column in the WHERE clause that contains the index, such as a composite index that contains a,b,c,d four columns, and A is the boot column. If the included column in the WHERE clause is a BCD or BD, you can only use a non-matching index scan.
2) non-configured index scan
As mentioned above, a non-configured index scan is used if the WHERE clause does not contain an indexed guide column. This eventually results in scanning all leaf nodes on the index tree, and, of course, its performance is usually better than scanning all data pages.
Database index Introduction SQL Index storage structure