Database indexes and their data structures

Source: Internet
Author: User

Source: http://blog.linezing.com/?p=798

Reprint: http://blog.csdn.net/kennyrose/article/details/7532032

Put plainly, the indexing problem is a lookup problem ...

A database index is a data structure in the database management system that helps to quickly query and update data in database tables. Indexes are typically implemented with a B-tree or one of its variants, such as the B+ tree.

In addition to the data itself, the database system maintains data structures that satisfy a particular lookup algorithm and reference (point to) the data in some way, so that advanced search algorithms can be run on them. Such a data structure is an index.

There is a cost to indexing a table: the database needs more storage space, and inserting and modifying data takes more time (because the index has to change as well).

The figure shows one possible way to index. On the left is the data table, with two columns and seven records; the leftmost value is the physical address of each data record (note that logically adjacent records are not necessarily physically adjacent on disk). To speed up searches on Col2, you can maintain the binary search tree shown on the right, in which each node contains an index key value and a pointer to the physical address of the corresponding data record. A binary search can then locate the corresponding data in O(log2 n) time, provided the tree is balanced.
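The idea can be sketched in a few lines of Python. This is only an illustration: the addresses and column values below are made up (the figure's actual values are not reproduced here), the names Node, insert, lookup and build_index are hypothetical, and the tree is a plain binary search tree with no rebalancing, so the O(log2 n) bound holds only when the insertion order happens to keep it balanced.

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical table: physical address -> (Col1, Col2). All values are made up.
TABLE = {
    0x07: (34, 15),
    0x56: (77, 23),
    0x6A: (5, 3),
    0xF3: (91, 42),
    0x90: (22, 8),
    0x77: (89, 30),
    0xF2: (23, 19),
}

@dataclass
class Node:
    key: int                          # indexed Col2 value
    address: int                      # physical address of the data record
    left: Optional["Node"] = None
    right: Optional["Node"] = None

def insert(root, key, address):
    """Insert (key, address) into the tree and return the (possibly new) root."""
    if root is None:
        return Node(key, address)
    if key < root.key:
        root.left = insert(root.left, key, address)
    else:
        root.right = insert(root.right, key, address)
    return root

def lookup(root, key):
    """Return the record address for key, or None if the key is not indexed."""
    node = root
    while node is not None:
        if key == node.key:
            return node.address
        node = node.left if key < node.key else node.right
    return None

def build_index(table):
    """Build an index on Col2 by inserting every record of the table."""
    root = None
    for address, (_col1, col2) in table.items():
        root = insert(root, col2, address)
    return root

INDEX = build_index(TABLE)
address = lookup(INDEX, 42)           # follow the index ...
record = TABLE[address]               # ... then fetch the record by address
```

The lookup touches only the nodes on one root-to-leaf path instead of scanning all seven records.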

Creating an index can greatly improve the performance of your system.

First, by creating a unique index, you can guarantee the uniqueness of each row of data in a database table.

Second, it can greatly speed up the retrieval of data, which is the main reason for creating indexes.

Third, joins between tables can be accelerated, which is particularly useful for enforcing referential integrity.

Fourth, when grouping and sorting clauses are used in data retrieval, an index can also significantly reduce the time spent grouping and sorting.

Finally, with indexes available, the query optimizer can choose more efficient plans during query processing, which improves the performance of the system.
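The effect on retrieval can be observed with any engine that exposes its query plan. The sketch below uses SQLite through Python's standard sqlite3 module purely as a convenient stand-in; the table t, its columns, and the index name idx_t_col2 are assumptions made for the example, and the exact wording of the plan text varies between SQLite versions.

```python
import sqlite3

def plan(conn, sql):
    """Return the 'detail' column of SQLite's EXPLAIN QUERY PLAN output."""
    return [row[3] for row in conn.execute("EXPLAIN QUERY PLAN " + sql)]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (col1 INTEGER, col2 INTEGER)")
conn.executemany(
    "INSERT INTO t VALUES (?, ?)",
    [(i, i * 7 % 1000) for i in range(1000)],
)

query = "SELECT * FROM t WHERE col2 = 42"

plan_without_index = plan(conn, query)     # full table scan
conn.execute("CREATE INDEX idx_t_col2 ON t (col2)")
plan_with_index = plan(conn, query)        # search through the index

print(plan_without_index)
print(plan_with_index)
```

Before the index exists the plan is a scan of the whole table; afterwards the optimizer switches to a search that names idx_t_col2.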

Perhaps someone will ask: with so many advantages, why not create an index on every column in the table? Because adding indexes also has many disadvantages.

First, it takes time to create and maintain indexes, and this time increases as the amount of data increases.

Second, indexes occupy physical space: besides the space taken by the data table itself, each index occupies a certain amount of physical space, and a clustered index requires even more.

Third, when data in the table is inserted, deleted, or modified, the indexes must be maintained dynamically, which slows down data maintenance.
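The space cost is easy to observe. The following sketch, again using SQLite via Python's sqlite3 module as an assumed stand-in, counts database pages before and after an index is created; the table layout, row count, and index name are arbitrary choices for the illustration.

```python
import sqlite3

def page_count(conn):
    """Number of pages currently allocated in the database."""
    return conn.execute("PRAGMA page_count").fetchone()[0]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (col1 INTEGER, col2 TEXT)")
conn.executemany(
    "INSERT INTO t VALUES (?, ?)",
    [(i, f"value-{i:06d}") for i in range(20000)],
)
conn.commit()

pages_before = page_count(conn)
conn.execute("CREATE INDEX idx_t_col2 ON t (col2)")
conn.commit()
pages_after = page_count(conn)

print(pages_before, pages_after)   # the index needs pages of its own
```

Every later insert, update, or delete on t also has to update idx_t_col2, which is where the maintenance cost comes from.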

Indexes are built on specific columns of a database table. When creating indexes, you should consider which columns are suitable for indexing and which are not. In general, indexes should be created on the following columns:

On columns that are searched frequently, to speed up searches.

On the primary key column, to enforce the uniqueness of the column and to organize the arrangement of the data in the table.

On columns that are often used in joins, which are mostly foreign keys, to speed up the joins.

On columns that are often searched by range, because the index is sorted and the specified range is therefore contiguous.

On columns that often need to be sorted, because the query can take advantage of the ordering of the index to reduce sorting time.

On columns that are often used in WHERE clauses, to speed up the evaluation of conditions.
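The range and sorting cases can be illustrated with a query plan. This is a sketch under assumptions: SQLite (through Python's sqlite3 module) stands in for the database, the table and index names are invented, and the plan wording depends on the SQLite version.

```python
import sqlite3

def plan(conn, sql):
    """Return the 'detail' column of SQLite's EXPLAIN QUERY PLAN output."""
    return [row[3] for row in conn.execute("EXPLAIN QUERY PLAN " + sql)]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (col1 INTEGER, col2 INTEGER)")
conn.executemany(
    "INSERT INTO t VALUES (?, ?)",
    [(i, i * 7 % 1000) for i in range(1000)],
)

# A range condition combined with a sort on the same column.
query = "SELECT col2 FROM t WHERE col2 BETWEEN 100 AND 200 ORDER BY col2"

plan_before = plan(conn, query)    # scan, then sort in a temporary structure
conn.execute("CREATE INDEX idx_t_col2 ON t (col2)")
plan_after = plan(conn, query)     # walk the contiguous range of the index

print(plan_before)
print(plan_after)
```

Without the index the plan includes a separate sorting step; with it, the rows come out of the index already in order, so that step disappears.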

Similarly, some columns should not be indexed. In general, columns that should not be indexed have the following characteristics:

First, indexes should not be created on columns that are seldom used or referenced in queries. Because these columns are seldom used, indexing them does not improve query speed. On the contrary, the added index slows down system maintenance and increases space requirements.

Second, you should not index columns that have only a few distinct values. Because such a column has very few values, such as the gender column of a personnel table, the rows in the result set make up a large proportion of the rows in the table; that is, a large share of the table has to be read anyway. Adding an index does not significantly speed up retrieval.

Third, columns defined with the text, image, or bit data types should not be indexed. This is because the data in these columns is either quite large or takes very few distinct values.

Fourth, indexes should not be created when modification performance matters far more than retrieval performance. Modification performance and retrieval performance conflict with each other: adding indexes improves retrieval but slows down modification, and removing indexes speeds up modification but slows down retrieval.
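A rough way to see the point about low-cardinality columns is to compute what fraction of the table an equality condition returns. The data below is hypothetical (an evenly split gender column and a unique employee id), and matched_fraction is a helper invented for this sketch.

```python
def matched_fraction(values, wanted):
    """Fraction of rows an equality condition on this column would return."""
    return sum(1 for v in values if v == wanted) / len(values)

# Hypothetical personnel table with 10,000 rows.
gender = ["F" if i % 2 == 0 else "M" for i in range(10_000)]
employee_id = list(range(10_000))

gender_fraction = matched_fraction(gender, "F")     # half of the table
id_fraction = matched_fraction(employee_id, 1234)   # a single row

print(gender_fraction, id_fraction)
```

When a condition matches half of the rows, following index pointers one by one saves little compared with simply reading the table.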

Depending on the capabilities of your database, you can create three kinds of indexes in the Database Designer: unique indexes, primary key indexes, and clustered indexes.

Unique index

A unique index is one that does not allow any two rows to have the same index value.

When the existing data contains duplicate key values, most databases do not allow a newly created unique index to be saved with the table. The database may also reject new data that would create duplicate key values in the table. For example, if a unique index is created on the employee's last name (lname) in the Employees table, no two employees can share the same last name.
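Both behaviors can be reproduced in a small sketch. SQLite via Python's sqlite3 module is assumed here only for convenience; the staff table and the index names are invented for the example, and other databases report the violation with their own error types.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Case 1: a unique index rejects new rows that would duplicate a key.
conn.execute("CREATE TABLE employees (emp_id INTEGER, lname TEXT)")
conn.execute("CREATE UNIQUE INDEX idx_employees_lname ON employees (lname)")
conn.execute("INSERT INTO employees VALUES (1, 'Smith')")
try:
    conn.execute("INSERT INTO employees VALUES (2, 'Smith')")
    duplicate_rejected = False
except sqlite3.IntegrityError:
    duplicate_rejected = True

# Case 2: a unique index cannot be created over existing duplicates.
conn.execute("CREATE TABLE staff (lname TEXT)")
conn.executemany("INSERT INTO staff VALUES (?)", [("Lee",), ("Lee",)])
try:
    conn.execute("CREATE UNIQUE INDEX idx_staff_lname ON staff (lname)")
    index_created = True
except sqlite3.IntegrityError:
    index_created = False

print(duplicate_rejected, index_created)
```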

Primary key index

Database tables often have one column or a combination of columns whose values uniquely identify each row in the table. This column (or combination of columns) is called the primary key of the table.

Defining a primary key for a table in a database diagram automatically creates a primary key index, which is a specific type of unique index. The index requires that each value in the primary key be unique. When a primary key index is used in a query, it also allows quick access to the data.
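As an illustration, SQLite (used here through Python's sqlite3 module as an assumed example engine) creates such an index automatically when a table declares a primary key. The column emp_code is hypothetical and deliberately of type TEXT, because in SQLite an INTEGER PRIMARY KEY is stored differently and gets no separate index.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE employees (emp_code TEXT PRIMARY KEY, lname TEXT)")

# Each row of index_list starts with (seq, name, unique, ...).
indexes = conn.execute("PRAGMA index_list('employees')").fetchall()
index_name = indexes[0][1]
is_unique = indexes[0][2]

print(index_name, is_unique)
```

No CREATE INDEX statement was issued, yet the table already has one index, and it is flagged as unique.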

Clustered index

In a clustered index, the physical order of the rows in the table is the same as the logical (indexed) order of the key values. A table can contain only one clustered index.

If an index is not a clustered index, the physical order of the rows in the table does not match the logical order of the key values. Clustered indexes typically provide faster data access than nonclustered indexes.

Principle of locality and disk pre-reading

Due to the characteristics of the storage medium, disk access is much slower than main memory access; with the cost of mechanical movement added, disk access is often hundreds of times slower than main memory. To improve efficiency, disk I/O should therefore be minimized. To this end, the disk usually does not read strictly on demand but reads ahead every time: even if only one byte is required, the disk starts from that position and sequentially reads a certain length of data into memory. The rationale is the well-known principle of locality in computer science: when a piece of data is used, the data near it is usually used soon afterwards. The data required while a program runs is usually relatively concentrated.

Because sequential disk reads are very efficient (no seek time and very little rotational latency), pre-reading can improve I/O efficiency for programs with locality.

The read-ahead length is generally an integer multiple of the page size. A page is the logical block in which the computer manages memory: hardware and operating systems tend to divide main memory and disk storage into contiguous blocks of equal size, each called a page (in many operating systems the page size is 4 KB), and main memory and disk exchange data in units of pages. When the data a program needs is not in main memory, a page fault is triggered; the system sends a read request to the disk, the disk finds the starting position of the data and sequentially reads one or more pages into memory, and then the fault handler returns and the program continues to run.
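The page arithmetic can be made concrete with a small sketch. The 4 KB page size comes from the text; the function pages_for_read and its read_ahead_pages parameter are hypothetical and only model which pages would be transferred, not how a real operating system schedules the I/O.

```python
PAGE_SIZE = 4096  # 4 KB, as mentioned above; real systems may use other sizes

def pages_for_read(offset, length, read_ahead_pages=0):
    """Return the page numbers fetched for a read of `length` bytes at `offset`.

    Whole pages are always transferred, and `read_ahead_pages` extra pages
    following the requested range are fetched as well.
    """
    first = offset // PAGE_SIZE
    last = (offset + length - 1) // PAGE_SIZE
    return list(range(first, last + 1 + read_ahead_pages))

one_byte = pages_for_read(5000, 1)                          # still a whole page
straddling = pages_for_read(4090, 10)                       # crosses a page boundary
with_read_ahead = pages_for_read(5000, 1, read_ahead_pages=2)

print(one_byte, straddling, with_read_ahead)
```

Even a one-byte request brings in a full page, and with read-ahead the neighboring pages arrive in the same sequential pass.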

Performance analysis of B-/+tree indexes

With this background, we can finally analyze the performance of B-/+tree indexes.

As mentioned above, index structures are generally evaluated by the number of disk I/O operations. Start with the B-tree: by the definition of a B-tree, a single lookup needs to visit at most h nodes, where h is the height of the tree. Designers of database systems cleverly exploit disk pre-reading by setting the size of a node equal to one page, so that each node can be fully loaded with a single I/O. To achieve this, the following technique is used when implementing a B-tree in practice:

Each time a new node is created, a full page of space is requested directly, so that the node is physically stored within one page; since storage allocation is page-aligned, reading a node takes only one I/O.

A single B-tree lookup requires at most h-1 I/Os (the root node stays resident in memory), and the asymptotic complexity is O(h) = O(log_d N). In practice, the out-degree d is a very large number, usually more than 100, so h is very small (usually no more than 3).

With a red-black tree, h is obviously much larger. Because logically close nodes (parent and child) may be physically far apart, locality cannot be exploited; so although the I/O asymptotic complexity of a red-black tree is also O(h), its efficiency is significantly worse than that of a B-tree.

In summary, the B-tree is a very efficient index structure.
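A quick calculation shows how strongly the out-degree affects the height. The figures here are assumptions for illustration: one million keys, an out-degree of 100 for the B-tree, and a perfectly balanced binary tree as an optimistic stand-in for a red-black tree (a real red-black tree can be taller still).

```python
def levels_needed(n_keys, fan_out):
    """Smallest h such that fan_out ** h >= n_keys, using integer arithmetic."""
    height, capacity = 0, 1
    while capacity < n_keys:
        capacity *= fan_out
        height += 1
    return height

N = 1_000_000                            # assumed number of keys
btree_height = levels_needed(N, 100)     # out-degree d = 100
binary_height = levels_needed(N, 2)      # balanced binary tree

print(btree_height, binary_height)       # 3 20
```

With one node per page, that is roughly 3 page reads against 20, before even counting the lost locality of the binary tree.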

It is worth taking the time to learn the B-tree and B+ tree data structures.

=============================================================================================================

1) B-Tree

Each node in a B-tree contains key values together with the address pointers of the data objects that hold those key values, so a successful search for an object can finish without reaching a leaf node of the tree.

A successful search consists of searches within nodes and a search along a path; the time for a successful search depends on the level at which the key is located and on the number of keys within each node.

To find a given key in a B-tree, first take the root node and look for the given key among the keys K1, ..., Kj stored in it (using sequential or binary search). If a key equal to the given value is found, the search succeeds; otherwise, the key must lie between some Ki and Ki+1, so take the next-level index node referred to by the pointer Pi and continue searching there, until the key is found or the pointer Pi is null, in which case the search fails.
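The search procedure can be sketched as follows. This is a minimal illustration rather than a full B-tree: the class BTreeNode and the hand-built tree are hypothetical, insertion and node splitting are left out, and an empty children list plays the role of null pointers in a leaf.

```python
from bisect import bisect_left
from dataclasses import dataclass, field

@dataclass
class BTreeNode:
    keys: list                                      # sorted keys K1..Kj
    addresses: list                                 # data pointer stored with each key
    children: list = field(default_factory=list)    # j+1 child pointers; empty in a leaf

def btree_search(node, key):
    """Return (address, nodes_visited); address is None if the key is absent."""
    visited = 0
    while node is not None:
        visited += 1
        i = bisect_left(node.keys, key)             # binary search within the node
        if i < len(node.keys) and node.keys[i] == key:
            return node.addresses[i], visited       # found, possibly above the leaves
        # Otherwise the key lies in the gap before keys[i]; descend via pointer i.
        node = node.children[i] if node.children else None
    return None, visited

# Hand-built two-level tree with made-up keys and addresses.
root = BTreeNode(
    keys=[20, 40],
    addresses=["row-20", "row-40"],
    children=[
        BTreeNode([5, 10], ["row-5", "row-10"]),
        BTreeNode([25, 30], ["row-25", "row-30"]),
        BTreeNode([50, 60], ["row-50", "row-60"]),
    ],
)

print(btree_search(root, 40))   # found in the root after one node
print(btree_search(root, 30))   # found in a leaf after two nodes
print(btree_search(root, 35))   # not present
```

Each visited node corresponds to one page read in the disk-based setting, so the second element of the result is the I/O cost of the lookup.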

2) B+ Tree

The keys stored in the non-leaf nodes of a B+ tree do not carry address pointers to data objects; the non-leaf nodes are just the index part. All leaf nodes are on the same level and contain all the keys together with the storage address pointers of the corresponding data objects, and the leaf nodes are linked in ascending key order. If the actual data objects are stored in insertion order rather than in key order, the leaf-level index must be a dense index; if the actual data is stored in key order, the leaf-level index can be a sparse index.

A B+ tree has two head pointers: one points to the root node of the tree, and the other to the leaf node with the smallest key.

So the B+ tree has two search methods:

One is to search sequentially along the linked list formed by the leaf nodes.

The other is to start from the root node, as in a B-tree; however, if a key in a non-leaf node equals the given value, the search does not stop but continues along the right pointer, all the way down to the key in a leaf node. So whether or not the search succeeds, every level of the tree is traversed.

In a B+ tree, data objects are inserted and deleted only at the leaf nodes.
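Both search methods can be sketched on a small hand-built tree. The classes Leaf and Inner, the helper names, and the sample keys are hypothetical, and insertion, deletion, and node splitting are omitted. bisect_right is used so that a key equal to a separator follows the right-hand pointer, as described above.

```python
from bisect import bisect_right
from dataclasses import dataclass
from typing import Optional

@dataclass
class Leaf:
    keys: list
    addresses: list                   # data pointers live only in the leaves
    next: Optional["Leaf"] = None     # link to the next leaf in key order

@dataclass
class Inner:
    keys: list                        # separator keys only, no data pointers
    children: list                    # len(keys) + 1 child nodes

def find_leaf(node, key):
    """Descend from the root to the leaf that would contain key."""
    while isinstance(node, Inner):
        node = node.children[bisect_right(node.keys, key)]
    return node

def search(root, key):
    """Point lookup: always walks every level down to a leaf."""
    leaf = find_leaf(root, key)
    if key in leaf.keys:
        return leaf.addresses[leaf.keys.index(key)]
    return None

def range_scan(root, low, high):
    """Yield (key, address) for low <= key <= high by following leaf links."""
    leaf = find_leaf(root, low)
    while leaf is not None:
        for k, a in zip(leaf.keys, leaf.addresses):
            if k > high:
                return
            if k >= low:
                yield k, a
        leaf = leaf.next

def scan_all(first_leaf):
    """Sequential search method: walk the leaf list from the smallest key."""
    keys, leaf = [], first_leaf
    while leaf is not None:
        keys.extend(leaf.keys)
        leaf = leaf.next
    return keys

# Two head pointers: the root and the leaf holding the smallest key.
l3 = Leaf([40, 50], ["row-40", "row-50"])
l2 = Leaf([20, 25], ["row-20", "row-25"], next=l3)
first_leaf = Leaf([5, 10], ["row-5", "row-10"], next=l2)
root = Inner(keys=[20, 40], children=[first_leaf, l2, l3])

print(search(root, 20))
print(list(range_scan(root, 10, 40)))
print(scan_all(first_leaf))
```

Note that the key 20 appears both as a separator in the root and in a leaf; the lookup passes through the separator and only stops at the leaf.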

The differences between the two index data structures are:
a. In a B-tree, the same key does not appear more than once; it may appear in a leaf node or in a non-leaf node. In a B+ tree, every key appears in a leaf node and may be repeated in non-leaf nodes to maintain the balance of the B+ tree.
b. Because the position of a key in a B-tree is not fixed and each key appears only once in the whole tree, storage space is saved, but insertion and deletion become significantly more complex. The B+ tree is the better compromise of the two.
c. The query efficiency of a B-tree depends on the position of the key in the tree: the maximum time complexity is the same as for the B+ tree (when the key is in a leaf node), and the minimum is 1 (when the key is in the root node). For a B+ tree, the complexity is fixed once the tree is built.
