A database index is a data structure maintained by a database management system to speed up querying and updating of database tables. Indexes are typically implemented using B-trees and their variants, such as B+ trees.
In addition to the data itself, the database system maintains a data structure that satisfies a particular lookup algorithm and references (points to) the data in some way, so that an efficient search algorithm can run over it. This data structure is the index.
Indexing a table has a cost: it increases the storage space used by the database, and it slows down inserts and updates (because the index must be updated as well).
Consider one possible way to build an index. Suppose a data table has two columns and seven records, and the leftmost value of each record is its physical address on disk (note that logically adjacent records are not necessarily physically adjacent on disk). To speed up searches on Col2, we can maintain a binary search tree in which each node holds an index key value and a pointer to the physical address of the corresponding data record; binary search then locates the matching record in O(log2 n) time.
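The idea above can be sketched with a sorted key list and binary search instead of a full tree (a minimal illustration; the table contents, column names, and `find_address` helper are invented for this example):

```python
import bisect

# A hypothetical table: records stored in arbitrary physical order.
# Each entry is (physical_address, col1, col2).
table = [
    (0x07, 34, 77), (0x0B, 12, 14), (0x10, 56, 83),
    (0x1C, 25, 5),  (0x25, 41, 91), (0x2F, 8, 34), (0x38, 19, 59),
]

# An index on Col2: (key, address) pairs kept in sorted key order,
# so lookups can use binary search instead of scanning every row.
index = sorted((col2, addr) for addr, _, col2 in table)
keys = [k for k, _ in index]

def find_address(col2_value):
    """Return the physical address of the record whose Col2 matches, or None."""
    i = bisect.bisect_left(keys, col2_value)
    if i < len(keys) and keys[i] == col2_value:
        return index[i][1]
    return None

print(hex(find_address(83)))  # -> 0x10, the address of the record with Col2 = 83
```

A real database index stores disk addresses rather than Python tuples, but the search principle is the same: keep the keys ordered so a lookup costs O(log2 n) comparisons instead of a full scan.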
Advantages of indexes:
First, by creating a unique index, you can guarantee the uniqueness of each row of data in a database table.
Second, indexes can greatly speed up data retrieval, which is the main reason for creating them.
Third, they can speed up joins between tables, which is particularly valuable for enforcing the referential integrity of the data.
Finally, when grouping and sorting clauses are used in data retrieval, an index can significantly reduce the time spent grouping and sorting in queries.
In short, indexes give the query optimizer more options during query processing and so improve overall system performance.
Disadvantages of indexes:
First, creating and maintaining indexes takes time, and this time grows with the amount of data.
Second, indexes occupy physical space. Beyond the space taken by the data table itself, each index occupies additional space, and a clustered index requires even more.
Third, whenever data in the table is inserted, deleted, or updated, the indexes must be maintained dynamically, which slows down data modification.
Indexes should be created on columns with these characteristics:
1. Columns that are frequently searched, to speed up lookups;
2. The primary key column, to enforce its uniqueness and to organize the arrangement of the data in the table;
3. Columns frequently used in joins, mostly foreign keys, to speed up joins;
4. Columns that are often searched by range, because an index is sorted and a given range is contiguous within it;
5. Columns that often need to be sorted, because a sorted index lets the query reuse that ordering and reduces sorting time;
6. Columns frequently used in WHERE clauses, to speed up evaluation of the condition.
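The effect described above can be observed directly with SQLite via Python's bundled `sqlite3` module. A minimal sketch, using an invented `employees` table: `EXPLAIN QUERY PLAN` shows the planner switching from a full table scan to an index search once the WHERE column is indexed (the exact wording of the plan text varies by SQLite version):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE employees (id INTEGER PRIMARY KEY, dept TEXT, salary INTEGER)")
conn.executemany("INSERT INTO employees (dept, salary) VALUES (?, ?)",
                 [("dept%d" % (i % 50), i) for i in range(10000)])

# Without an index, the planner must scan the whole table.
plan = conn.execute(
    "EXPLAIN QUERY PLAN SELECT * FROM employees WHERE dept = 'dept7'").fetchall()
print(plan[0][-1])  # e.g. "SCAN employees" (wording depends on SQLite version)

conn.execute("CREATE INDEX idx_employees_dept ON employees (dept)")
plan = conn.execute(
    "EXPLAIN QUERY PLAN SELECT * FROM employees WHERE dept = 'dept7'").fetchall()
print(plan[0][-1])  # now a SEARCH using idx_employees_dept
```

The same experiment scales to real workloads: equality and range predicates on the indexed column become index searches, while predicates on unindexed columns still scan.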
Conversely, columns with the following characteristics should not be indexed:
1. Columns that are seldom used or referenced in queries. Since these columns are rarely consulted, indexing them does not improve query speed; instead, the extra index slows down system maintenance and increases space requirements.
2. Columns with very few distinct values, such as the gender column of a personnel table. The rows matching any one value make up a large proportion of the rows in the table, so an index does not significantly speed up retrieval.
3. Columns defined with the text, image, or bit data types. The data in these columns is either very large or has very few distinct values.
4. Tables where modification performance matters far more than retrieval performance. The two are in conflict: adding indexes improves retrieval but degrades modification, and removing indexes does the opposite. When modifications dominate, indexes should not be created.
Depending on the capabilities of your database, you can create three types of indexes in the Database Designer: unique indexes, primary key indexes, and clustered indexes.
Because of the characteristics of the storage medium, disk access is much slower than main-memory access; with mechanical movement involved, a disk access often costs hundreds of times as much as a main-memory access, so to improve efficiency, disk I/O must be minimized. To this end, disks are usually not read strictly on demand but with read-ahead: even if only one byte is required, the disk starts from that position and sequentially reads a certain length of data into memory. The rationale is the well-known principle of locality in computer science: when a piece of data is used, the data around it is usually used soon afterwards, and the data required during a program run tends to be relatively concentrated.
Because sequential disk reads are very efficient (no seek time and minimal rotational delay), read-ahead improves I/O efficiency for programs with good locality.
The read-ahead length is generally an integer multiple of a page. A page is the logical block in which the computer manages memory: hardware and operating systems divide main memory and the disk storage area into contiguous, equal-sized blocks, each called a page (in many operating systems the page size is typically 4 KB), and main memory and disk exchange data in units of pages. When the data a program wants to read is not in main memory, a page fault is triggered; the system sends a read signal to the disk, the disk locates the start of the data and sequentially reads one or more pages into memory, the fault handler returns, and the program continues to run.
As mentioned above, index structures are generally evaluated by the number of disk I/Os. Start with the B-tree: by its definition, a single search visits at most h nodes. Database system designers skillfully exploit the disk read-ahead principle by setting the size of a node equal to one page, so that each node can be loaded fully with a single I/O. Achieving this requires the following technique when implementing a B-tree in practice:
Each time a new node is created, a full page of space is requested directly, so that a node is physically stored in one page; since storage allocation is page-aligned, reading a node costs only one I/O.
A single B-tree retrieval therefore needs at most h-1 I/Os (the root node is resident in memory), with an asymptotic complexity of O(h) = O(log_d N). In practice the out-degree d is a very large number, usually more than 100, so h is very small (usually no more than 3).
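The claim that h stays tiny can be checked with a few lines of arithmetic (a sketch; `btree_height` is a hypothetical helper that finds the smallest h with d**h >= N, i.e. the number of levels needed when every node fans out to roughly d children):

```python
def btree_height(n_keys, d):
    """Smallest h such that d**h >= n_keys: the number of levels needed
    when each node fans out to about d children."""
    h = 0
    capacity = 1
    while capacity < n_keys:
        capacity *= d
        h += 1
    return h

# With an out-degree around 100 (one node per disk page), even very
# large key counts fit in a tree only a few levels deep.
print(btree_height(10**6, 100))  # -> 3: a million keys, three levels
print(btree_height(10**8, 100))  # -> 4: a hundred million keys, four levels
```

This is exactly why node size is matched to page size: a larger page allows a larger d, which shrinks h and therefore the number of disk I/Os per lookup.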
In a red-black tree, by contrast, h is much larger. Because logically close nodes (parent and child) may be physically far apart, locality cannot be exploited. The I/O asymptotic complexity of a red-black tree is also O(h), so its efficiency is significantly worse than a B-tree's.
This is why using a B-tree as an index structure is so efficient.
A B-tree is a multi-way search tree (not binary):
1. By definition, any non-leaf node has at most m children, with m > 2;
2. The root node has between 2 and m children;
3. Every non-leaf node other than the root has between m/2 (rounded up) and m children;
4. Each node stores at least m/2 - 1 (rounded up) and at most m - 1 keys (so at least 2 keys);
5. For a non-leaf node, the number of keys equals the number of child pointers minus 1;
6. The keys of a non-leaf node are K[1], K[2], ..., K[m-1], with K[i] < K[i+1];
7. The pointers of a non-leaf node are P[1], P[2], ..., P[m], where P[1] points to the subtree whose keys are less than K[1], P[m] points to the subtree whose keys are greater than K[m-1], and each other P[i] points to the subtree whose keys lie in (K[i-1], K[i]);
8. All leaf nodes are on the same level.
Features of B-trees:
1. The set of keys is distributed throughout the entire tree;
2. Any key appears in exactly one node;
3. A search may end at a non-leaf node;
4. Its search performance is equivalent to a single binary search over the complete set of keys;
5. The tree's height is controlled automatically.
A B-tree search starts from the root node and performs a binary search over the (ordered) key sequence within the node. If the key is hit, the search ends; otherwise it descends into the child node covering the range that contains the query key. This repeats until the key is found, the corresponding child pointer is empty, or a leaf node has been exhausted.
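The search procedure just described can be sketched as follows (a minimal illustration; `BTreeNode` and the hand-built tree are invented for this example, and disk I/O is ignored since every node lives in memory here):

```python
import bisect

class BTreeNode:
    def __init__(self, keys, children=None):
        self.keys = keys          # sorted keys within this node
        self.children = children  # list of child nodes, or None for a leaf

def btree_search(node, key):
    """Search for `key` starting at `node`; return True on a hit.
    Within each node the ordered keys are binary-searched, then the
    search descends into the child covering the key's range."""
    while node is not None:
        i = bisect.bisect_left(node.keys, key)
        if i < len(node.keys) and node.keys[i] == key:
            return True  # hit -- possibly at a non-leaf node
        node = node.children[i] if node.children else None
    return False

# A tiny hand-built tree of out-degree 3:
root = BTreeNode([20, 40], [
    BTreeNode([5, 10]), BTreeNode([25, 30]), BTreeNode([45, 50]),
])
print(btree_search(root, 30), btree_search(root, 33))  # True False
```

Note that the search for 30 succeeds inside a leaf while a search for 20 would succeed at the root: in a B-tree, unlike a B+ tree, a hit can occur at any level.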
2) B+ Tree
The keys stored in a B+ tree's non-leaf nodes do not carry address pointers to data objects; the non-leaf nodes are purely the index part. All leaf nodes are on the same level and contain all the keys together with the storage-address pointers of the corresponding data objects, and the leaf nodes are linked in ascending key order. If the actual data objects are stored in the order they were added rather than in key order, the leaf-level index must be a dense index; if the data is stored in key order, the leaf level can use a sparse index.
All keys appear in the leaf nodes.
(MySQL uses B+ trees for its indexes.)
Features of B+ trees:
1. All keys appear in the linked list of leaf nodes (a dense index), and the keys in this list are ordered;
2. A search can never be satisfied at a non-leaf node;
3. The non-leaf nodes are effectively an index over the leaf nodes (a sparse index), while the leaf nodes form the data layer storing the (key, data) entries;
4. It is better suited to file indexing systems.
Adding to each leaf node a pointer to the adjacent leaf node yields a B+ tree with sequential-access pointers. The purpose of this optimization is to improve the performance of range (interval) queries.
A B+ tree has two head pointers: one to the root node of the tree, and one to the leaf node holding the smallest key.
So a B+ tree supports two search methods:
One is a sequential search along the linked list formed by the leaf nodes themselves.
The other starts from the root node, similar to a B-tree search; but even if a non-leaf node's key equals the given value, the search does not stop there. It continues along the right-hand pointer down to the leaf level and checks the key in the leaf node. Thus, whether or not the search succeeds, it always traverses every level of the tree.
In a B+ tree, insertion and deletion of data objects are performed only on leaf nodes.
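The sequential-access chain is what makes range queries cheap: descend once to the first qualifying leaf, then follow leaf pointers. A minimal sketch (the `LeafNode` class, `range_scan` helper, and sample data are invented; a real scan would first navigate from the root to the leaf containing the lower bound):

```python
class LeafNode:
    """A B+ tree leaf: sorted (key, value) pairs plus a pointer to the
    next leaf, forming the sequential-access chain."""
    def __init__(self, entries):
        self.entries = entries  # sorted list of (key, value)
        self.next = None

def range_scan(first_leaf, lo, hi):
    """Walk the leaf chain and yield every (key, value) with lo <= key <= hi."""
    leaf = first_leaf
    while leaf is not None:
        for key, value in leaf.entries:
            if key > hi:
                return          # past the upper bound: stop early
            if key >= lo:
                yield key, value
        leaf = leaf.next        # follow the sequential-access pointer

# Build three chained leaves:
a = LeafNode([(1, "a"), (3, "b")])
b = LeafNode([(5, "c"), (7, "d")])
c = LeafNode([(9, "e"), (11, "f")])
a.next, b.next = b, c

print(list(range_scan(a, 3, 9)))  # [(3, 'b'), (5, 'c'), (7, 'd'), (9, 'e')]
```

In a B-tree, by contrast, the same range query would require repeated in-order traversal across levels, since the keys are scattered throughout the tree rather than chained at the leaf level.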
The differences between these two index data structures are:
1. In a B-tree, the same key value does not appear more than once; it may be found in a leaf node or in a non-leaf node. In a B+ tree, every key necessarily appears in a leaf node, and may also be repeated in non-leaf nodes to maintain the balance of the tree.
2. Because a key's position in a B-tree is not fixed, and each key appears only once in the whole tree, storage space is saved, but the complexity of insertion and deletion increases significantly. The B+ tree is the better compromise in this respect.
3. A B-tree's query cost depends on where the key sits in the tree: the maximum cost equals that of a B+ tree (when the key is in a leaf node), and the minimum cost is 1 (when it is in the root node). For a given B+ tree, the query cost is fixed.
Why choose B-trees and B+ trees
The index itself is also large and cannot be kept entirely in memory, so it is often stored on disk in the form of an index file. Index lookups then incur disk I/O; compared with memory access, an I/O access costs several orders of magnitude more. Therefore, the most important metric when evaluating a data structure as an index is the asymptotic number of disk I/O operations during a lookup. In other words, the index should be structured to minimize the number of disk I/O accesses during a search.
A memory read works as follows: memory is composed of a series of storage cells, each storing a fixed-size piece of data and having a unique address. To read from memory, the address is placed on the address bus and sent to the memory; the memory decodes the signal, locates the storage cell, places the cell's data on the data bus, and returns it.
To write to memory, the system places the data and the cell address on the data and address buses; the memory reads both buses and performs the corresponding write operation.
Memory access efficiency depends only on the number of accesses: whether a piece of data is read earlier or later does not affect efficiency. Disk access is different, because disk I/O involves mechanical operation. A disk consists of coaxial circular platters of the same size, which rotate together (all platters must rotate simultaneously). On one side of the disk assembly is a head bracket securing a set of heads, with each head responsible for accessing the contents of one platter surface. The heads do not spin with the platters, but the actuator arm can move back and forth to read data on different tracks. Tracks are concentric rings centered on the platter, and each track is divided into small segments called sectors, the smallest storage unit of a disk.
When the disk is read, the system passes the logical address of the data to the disk, and the disk's control circuitry resolves it into a physical address: which sector of which track. The head must then move to the corresponding track; the time this takes is called the seek time. Next, the disk rotates until the corresponding sector is under the head; the time this takes is called the rotational latency. Arranging operations and data placement appropriately can therefore reduce both seek time and rotational latency.
To minimize I/O operations, every disk read uses read-ahead, typically in multiples of the page size. Even if only one byte is needed, the disk reads a whole page of data (typically 4 KB) into memory, and memory and disk exchange data in units of pages. By the principle of locality, when a piece of data is used, the data near it will usually be used soon afterwards.
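Page-granular access can be made concrete with a couple of lines of arithmetic (a sketch; `page_of` is a hypothetical helper and 4 KB is assumed as the page size):

```python
PAGE_SIZE = 4096  # 4 KB, a common page size

def page_of(address):
    """Which page a byte address falls in, and its offset within that page.
    Reading even one byte pulls the whole containing page into memory."""
    return address // PAGE_SIZE, address % PAGE_SIZE

print(page_of(1))     # (0, 1): byte 1 still costs a full read of page 0
print(page_of(8192))  # (2, 0): the first byte of page 2
```

This is the arithmetic behind matching a B-tree node to a page: any access inside the node's page costs the same single I/O, so the node might as well fill the page.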