I. Introduction
My discussions on Database indexes have never been removed. What is the database index? What is the difference between clustered index and non-clustered index? I hope this article will help you. I sincerely hope that you will give me some advice and make common progress.
II. B-Tree
Our common database systems use B-Tree or B + Tree data structures for indexing. For example, MsSql uses B + Tree, Oracle and Sysbase uses B-Tree. So in the beginning, we will briefly introduce B-Tree.
B-Tree is different from Binary Tree, and has a maximum of two Subtrees. a m-Level B-Tree meets the following conditions:
1) each node has up to M children;
2) Apart from the root node and leaf node, each other node has at least M/2 children;
3) There are at least two children except that the tree contains only one node );
4) All leaf nodes are on the same layer. Leaf nodes do not contain any keyword information;
5) Non-leaf nodes with K keywords exactly contain K + 1 child;
In addition, for a node, the internal keywords are sorted from small to large. Example of B-TreeM = 4:
Each node mainly contains a Key [] keyword array, and a pointer array points to Son []. In B-Tree, the search process is: Use the sequential search array length is short) or the semi-query method to find the Key [] array. If the Key K is found, returns the address of the node and the location of K in Key []. Otherwise, you can determine that K is between a Key [I] and a Key [I + 1, the query continues from the child node indicated by Son [I] until the query is successful at a node, or until the query in the leaf node is found and the query in the leaf node is still unsuccessful.
Next, we use a piece to demonstrate how to generate B-TreeM = 4, insert 1 ~ 6 ):
As you can see from the figure, when we insert the keyword 4, the original node is full, so the split is basically based on the half principle, and then the intermediate keyword 2 is taken out, upgrade to the root node ). And so on.
Iii. database index
1. What is an index?
In a database, the meaning of an index is no different from the word "Index" in the daily sense. Think about the dictionary when you are a child.) It is a database object used to increase the data access speed of database tables.
A) indexes can avoid full table scans. Most queries can only scan a small number of index pages and data pages, instead of traversing all data pages.
B) For non-clustered indexes, some queries may not even access data pages.
C) clustered indexes can prevent data insertion operations from concentrated on the last data page of the table.
D) In some cases, indexes can also be used to avoid sorting operations.
Of course, as we all know, although indexes can increase the query speed, they will also lead to a decline in the performance of database system update data, because most data updates need to update indexes at the same time.
2. Storage of Indexes
The basic information contained in an index record includes: key value (the value of all fields you specified when defining the index) + logical pointer pointing to the data page or another index page ).
When you create an index for an empty table, the database system will allocate an index page for you, which is always empty before you insert data. This page is both a root node and a leaf node. Every time you insert a row of data into the table, the database system inserts a row of index records into the root node. When the root node is full, the database system splits according to the following steps:
A) create two son nodes
B) split the data in the original root node into two halves and write the data to the new two sons respectively.
C) Add a pointer to two son nodes in the Root Node
Generally, because index records only contain index Field Values and 4-9 bytes of pointers), index entities are much smaller than real data rows, and index pages are much more intensive than data pages. An index page can store more index records, which means that I/O takes a significant advantage in index search, understanding this helps you understand the advantages of using indexes in essence.
3. Index type
A) clustered indexes. Table data is stored in the order of indexes. For clustered indexes, leaf nodes store real data rows, and there are no separate data pages.
B) Non-clustered indexes. The table data storage sequence is irrelevant to the index sequence. For non-clustered indexes, the leaf node contains the index field value and the logical pointer to the data row of the data page. This layer is close to the data page, and the number of rows is consistent with the data row data volume of the data table.
Only one clustered index can be created on a table, because the physical order of real data is only one. If a table has no clustered index, it is called "Heap ). There is no specific sequence of data rows in such a table, and all new rows will be added to the end of the table.
4. Clustered Index
In clustered indexes, leaf nodes are also data nodes. The storage order of all data rows is the same as that of indexes.
1) clustered index and query operations
For example, we create a clustered index on the name field. When we need to search for a specific record based on this field, the database system will find the root of this index based on a specific system table, search for the next one based on the pointer until it is found. For example, we want to query "Green". Because it is between [Bennet, Karsen], we have found index page 1007 where "Green" is between [Greane, Hunter, based on this, we find the leaf node 1133, that is, the data node), and finally find the target data row on this page.
I/O of this query includes three index pages. The last query is actually in the data page ). The search here may be from the disk (Physical Read) or from the cache (Logical Read). If the table is frequently accessed, higher indexes in the index tree may be found in the cache. Therefore, the actual IO may be smaller than the above.
2) clustered index and insert operations
In the simplest case, the insert operation finds the corresponding data page based on the index, moves the existing records to free up space for the new data, and finally inserts the data.
If the data page is full, splitting the data page is a resource-consuming operation. Generally, the database system has a mechanism to minimize the number of page splits, it is usually implemented by reserving space for each page ):
A) allocate A new data page on the Data Segment extent used. If the data segment is full, allocate A new segment.
B) Adjust the index pointer. Read the corresponding index page into the memory and lock it.
C) about half of data rows are classified into new data pages.
D) if the table has non-clustered indexes, update these indexes to point to the new data page.
Special Cases:
A) if A newly inserted record contains large data, two new data pages may be allocated, one of which is used to store new records and the other is used to store data split from the original page.
B) generally, duplicate data records are stored on the same page in the database system.
C) similar to the auto-increment column clustering index, the database system may not split the data page. The page is just a simple new data page.
3) clustered index and delete operations
The deletion row causes the data rows below to move up to fill in the gaps caused by the deletion record.
If the row to be deleted is the last row in the data page, the data page will be recycled and the records on the corresponding index page will be deleted. If the recycled data page is located in the same segment as other data pages of the table, it may be used later. If the data page is the only data page of the section, the Section is also recycled.
Data deletion may result in only one record on the index page. In this case, the record may be moved to the adjacent index page and the original index page will be recycled, this is the so-called "index merge ".
5. Non-clustered Index
Compared with clustered indexes, non-clustered indexes:
A) leaf nodes are not data nodes.
B) The leaf node stores a "key-pointer" pair for each real data row.
C) The leaf node also stores a pointer offset, which can be used to locate specific data rows based on the page pointer and pointer offset.
D) Similarly, other index nodes except leaf nodes store similar content, but it points to the next index page.
Clustered index is a sparse index. The index pages on the data page store page pointers instead of row pointers. For non-clustered indexes, the index is intensive. On the top-level index page of the data page, it stores an index record for each data row.