How database indexing works

Source: Internet
Author: User

Database index is a Sort data structure in the database management system. It helps you quickly query and update data in database tables. The implementation of indexes usually uses the B tree and Its Variant B + tree.

In addition to data, the database system also maintains data structures that meet specific search algorithms. These data structures reference (point to) data in a certain way, in this way, you can implement advanced search algorithms on these data structures. This data structure is an index.

The cost for setting indexes for a table is: first, the storage space of the database is increased, second, it takes a lot of time to insert and modify data (because the index also needs to change ).

Shows a possible indexing method. On the left is a data table with a total of seven records in two columns, and on the left is the physical address of the data records (note that logically adjacent records are not physically adjacent on the disk ). To speed up Col2 search, you can maintain a binary search tree shown on the right. Each node contains an index key value and a pointer to the physical address of the corresponding data record, in this way, you can use binary search to obtain the corresponding data in the complexity of O (log2n.

Creating indexes can greatly improve the system performance.

First, you can create a unique index to ensure the uniqueness of each row of data in the database table.

Second, it can greatly speed up data retrieval, which is also the main reason for creating an index.

Third, it can accelerate the connection between tables, especially in achieving Data Reference integrity.

Fourth, when you use grouping and sorting clauses to retrieve data, you can also significantly reduce the time for grouping and sorting in queries.

Fifth, by using indexes, you can use the optimizer during the query process to improve system performance.

Some may ask: why not create an index for each column in the table because increasing Indexes has so many advantages? This is because adding Indexes has many disadvantages.

First, it takes time to create and maintain indexes. This time increases with the increase of data volume.

Second, indexes occupy physical space. In addition to data tables, each index occupies a certain amount of physical space. To create a clustered index, the required space is larger.

Third, when adding, deleting, and modifying data in the table, the index must also be dynamically maintained, which reduces the Data Maintenance speed.

Indexes are created on certain columns in the database table. When creating an index, you should consider which columns can create an index and which Columns cannot create an index. In general, you should create an index on these columns: in columns that frequently need to be searched, the search speed can be accelerated; in columns that are used as the primary key, force the uniqueness of the column and the data arrangement structure in the organization table. These columns are usually used in connected columns and are mainly foreign keys, which can speed up the connection; create an index on a column that often needs to be searched by range. Because the index has been sorted, the specified range is continuous. Create an index on a column that frequently needs to be sorted because the index has been sorted, in this way, the sorting of indexes can be used to speed up the sorting query time. indexes are often created on the columns in the WHERE clause to accelerate the condition judgment speed.

Similarly, indexes should not be created for some columns. In general, these columns that should not be indexed have the following features:

First, indexes should not be created for columns that are rarely used or referenced in queries. This is because, since these columns are rarely used, there is an index or no index, and the query speed cannot be improved. On the contrary, the addition of indexes reduces the system maintenance speed and space requirements.

Second, indexes should not be added to columns with only few data values. This is because these columns have very few values, such as gender columns in the personnel table. In the query results, the data rows in the result set account for a large proportion of the data rows in the table, that is, the proportion of data rows to be searched in the table is large. Adding indexes does not significantly accelerate the search speed.

Third, indexes should not be added for columns defined as text, image, and bit data types. This is because the data volume of these columns is either large or small.

Fourth, when the modification performance is far greater than the retrieval performance, you should not create an index. This is because the modification performance and retrieval performance are inconsistent. When an index is added, the search performance is improved, but the modification performance is reduced. When the index is reduced, the modification performance is improved and the retrieval performance is reduced. Therefore, when the modification performance is much higher than the retrieval performance, you should not create an index.

Based on the functions of the database, you can create three indexes in the Database Designer: unique index, primary key index, and clustered index.

Unique Index

A unique index is an index that does not allow any two rows to have the same index value.

When duplicate key values exist in existing data, most databases do not allow you to save the newly created unique index with the table. The database may also prevent adding new data that will create duplicate key values in the table. For example, if the employee's last name (lname) in the employee table creates a unique index, neither employee can have the same name. A primary key index database table often has a combination of columns or columns. Its Values uniquely identify each row in the table. This column is called the primary key of the table. When you define a primary key for a table in the database relationship diagram, the primary key index is automatically created. The primary key index is a specific type of unique index. This index requires that each value in the primary key be unique. When a primary key index is used in a query, it also allows quick access to data. The physical order of the row in the table is the same as that of the key value in the clustered index. A table can contain only one clustered index.

If an index is not a clustered index, the physical sequence of the row in the table does not match the logical sequence of the key value. Compared with non-clustered indexes, clustered indexes generally provide faster data access speeds.

Local principle and disk pre-read

Because of the characteristics of the storage medium, the access to the disk itself is much slower than the primary storage, coupled with the cost of mechanical movement, the access speed of the disk is often one of the primary storage, so in order to improve efficiency, minimize disk I/O. To achieve this goal, the disk is usually not read strictly on demand, but preread every time. Even if only one byte is required, the disk starts from this location, read data of a certain length in sequence into the memory. This theory is based on the well-known local principle in Computer Science: when a data is used, the data nearby it is usually used immediately. The data required during the program running is usually concentrated.

Because sequential disk reading is highly efficient (with little rotation time required without seeking time), preread can improve I/O efficiency for local programs.

The preread length is generally an integer multiple of the page. Pages are logical blocks for computer memory management. Hardware and operating systems often divide primary and disk storage areas into contiguous blocks of the same size, each block is called a page (in many operating systems, the page size is usually 4 k). The primary storage and disk exchange data in pages. When the data to be read by the program is not in the primary storage, a page missing exception is triggered, and the system sends a disk reading signal to the disk, the disk finds the starting position of the data and reads one or more pages consecutively into the memory. If an exception is returned, the program continues to run.

B-/+ Tree index Performance Analysis

At last, we can analyze the performance of B-/+ Tree indexes.

As mentioned above, the index structure is evaluated by the number of disk I/O operations. From B-Tree analysis, according to the definition of B-Tree, we can see that a maximum of h nodes can be accessed at a time. The database system designer cleverly utilizes the disk pre-read principle to set the size of a node to equal to a page, so that each node can be fully loaded only once I/O. To achieve this goal, you also need to use the following techniques to implement B-Tree:

Each time you create a node, you can directly apply for a page space to ensure that a node is physically stored on a page. In addition, the computer storage allocation is page-aligned, A node only needs one I/O operation.

A B-Tree retrieval requires a maximum of H-1 I/O (root node resident memory), and the progressive complexity is O (h) = O (logdN ). Generally, in practice, the output degree d is a very large number, usually greater than 100, so h is very small (usually no more than 3 ).

The structure of the red and black trees is much deeper than h. Because logically close nodes (Parent and Child) may be far physically unable to use locality, the I/O complexity of the red and black trees is O (h ), the efficiency is much lower than that of B-Tree.

In conclusion, the efficiency of using B-Tree as the index structure is very high.

Time should be spent learning the B-tree and B +-tree data structures

========================================================== ========================================================== ======================

1) B tree

Each node in Tree B contains the address pointer of the Data Object for key-value and key-value pairs. Therefore, you can search for an object without reaching the leaf node of the tree.

A successful search includes a search in a node and a search along a specific path. The time for a successful search depends on the level of the key code and the number of key codes in the node.

The method to search for a given keyword in Tree B is: first obtain the root node, the key word K1,…, contained in the root node ,..., Kj searches for the given keywords (available sequential search or binary search). If the keyword is found to be equal to the given value, the search is successful. Otherwise, you must be sure that the keyword to be queried is between a certain Ki Or Ki + 1, so take the index node block of the next layer referred to by Pi to continue searching until it is found, or if the pointer Pi is null, the query fails.

2) B + tree

The key code stored in the non-leaf node of the B + tree does not indicate the address pointer of the data object. The non-leaf node is only the index part. On the same layer, all leaf nodes contain all key codes and corresponding data object storage address pointers, and leaf nodes are connected in ascending order by key codes. If the actual data objects are stored in the order they are added instead of the key code count, the leaf node index must be a dense index. If the actual data storage is stored in the key code order, sparse indexes are used for leaf node indexes.

The B + tree has two head pointers. One is the root node of the tree and the other is the leaf node with the minimum key code.

Therefore, the B + tree has two search methods:

One is to search by the linked list pulled by the leaf node.

One is to start searching from the root node, which is similar to tree B. However, if the key code of a non-leaf node is equal to the given value, the search does not stop, but continues along the right pointer, keep checking the key code on the leaf node. Therefore, no matter whether the search is successful or not, all layers of the tree are completed.

In the B + tree, data objects are inserted and deleted only on leaf nodes.

The differences between the two data structures that process indexes are as follows:

The same key value in tree a and Tree B does not appear multiple times, and it may appear in leaf nodes or non-leaf nodes. The keys of the B + tree will certainly appear in the leaf node, and may also appear repeatedly in non-leaf nodes to maintain the balance of the B + tree.

B. Because the key position of Tree B is not fixed and appears only once in the entire tree structure, although it can save storage space, it significantly increases the complexity of insert and delete operations. B + tree is a good compromise.

The query efficiency of the c and B trees is related to the key position in the tree. The maximum time complexity is the same as that of the B + tree (when the leaf node is used ), the minimum time complexity is 1 (at the root node ). The complexity of the B + tree is fixed for a constructed tree.

Contact Us

The content source of this page is from Internet, which doesn't represent Alibaba Cloud's opinion; products and services mentioned on that page don't have any relationship with Alibaba Cloud. If the content of the page makes you feel confusing, please write us an email, we will handle the problem within 5 days after receiving your email.

If you find any instances of plagiarism from the community, please send an email to: info-contact@alibabacloud.com and provide relevant evidence. A staff member will contact you within 5 working days.

A Free Trial That Lets You Build Big!

Start building with 50+ products and up to 12 months usage for Elastic Compute Service

  • Sales Support

    1 on 1 presale consultation

  • After-Sales Support

    24/7 Technical Support 6 Free Tickets per Quarter Faster Response

  • Alibaba Cloud offers highly flexible support services tailored to meet your exact needs.