Why the database should be implemented with B + tree structure--MYSQL index structure

Source: Internet
Author: User
Tags mysql index

The application of B + tree in database {Why use B + trees? Concise, because: 1. The file is large and cannot be stored in memory, so it is stored on disk2. The structure of the index minimizes the number of disk I/O accesses during the lookup process (why B-/+tree is used, but also the disk access principle. )3. The principle of locality and the disk pre-reading, the length of the pre-read is generally the full multiples of the page (page), (in many operating systems, the size of the pages is usually 4k) 4. The database system skillfully utilizes the principle of disk pre-reading, the size of a node is set equal to one page, so that each node needs only one I/O can be fully loaded (because there are two arrays in the node, so the address is contiguous). And the red-black tree structure,H is obviously much deeper. Because a logically close node (parent-child) can be physically far away, it cannot take advantage of locality

The difference between the structure of InnoDB and MyISAM

1.InnoDB primary key index, MYISAM index file and data file is detached, index file only saves the address of the data record. In InnoDB, the table data file itself is an index structure organized by B+tree, and the tree's leaf node data field holds the complete data record. The key of this index is the primary key of the data table, so the InnoDB table data file itself is the primary index, so there must be a primary key, if there is no display definition, automatically generate an implied field as the primary key, the field length is 6 bytes, the type is long shaping

The secondary index of the 2.InnoDB (secondary index, which is the non-primary key) also contains the primary key columns, such as the name index, the internal node contains the name, the leaf node contains the value of the primary key corresponding to the name, and if the primary key definition is larger, the other indexes will be large

The 3.MyISAM engine uses B+tree as the index structure, and the data domain of the index file leaf node holds the address of the record, pointing to the corresponding value in the data file, with each node having only the value of the indexed column

4. MyISAM Primary and secondary indexes (secondary key) have no difference in structure, except that the primary index requires that key is unique and secondary indexes can be duplicated .

(because the MyISAM secondary index stores the address of the data record on the leaf node, as with the primary key index, the InnoDB with respect to the B + is accessible through the secondary index

Quickly find all the data without having to iterate over the primary key index, so it works for OLAP)

InnoDB The difference between index and MyISAM index :

The first is the difference between the primary index, the InnoDB data file itself is the index file. The index and data of MyISAM are separate.

The second is the difference between secondary indexes: the secondary index of the InnoDB data field stores the value of the corresponding record primary key instead of the address. The secondary and primary indexes of MyISAM are not much different.


}

1. The role of indexes in the database

In the process of using the database system, the data query is one of the most frequently used data operations.

The most basic query algorithm, of course, is the sequential lookup (linear search), which iterates through the table and then matches row-by-row values equal to the keyword to be searched, with a time complexity of O (n).  But the time complexity is O (n) Algorithm small scale table, load light database, also can have good performance. But when the data grows, the algorithm with time complexity O (n) is obviously bad, and performance drops quickly.

Fortunately, the development of computer science provides a number of better search algorithms, such as binary search, binary tree search, binary search, and so on. If you look at it a little bit, you will find that each lookup algorithm can only be applied to a particular data structure, such as a binary lookup requires an orderly retrieval of data, while a binary tree lookup can only be applied to a binary lookup tree, but the data itself cannot be fully organized to meet a variety of data structures (for example, It is theoretically impossible to organize both columns sequentially, so in addition to the data, the database system maintains a data structure that satisfies a particular lookup algorithm that references (points to) data in some way, so that an advanced find algorithm can be implemented on those data structures. This data structure is the index.

An index is a structure that sorts the values of one or more columns in a database table. Compared to searching all rows in a table, the index uses pointers to data values stored in the specified columns in the table, and then arranges the pointers in the order specified to help get information faster. Typically, you need to create an index on a table only when you frequently query the data in an indexed column. The index consumes disk space and affects the speed of data updates. However, in most cases, the data retrieval speed advantage of index is much more than its disadvantage.

2. The application of B + tree in database index


At present, most database systems and file systems use B-tree or its variant B+tree as the index structure.

1) Application in database indexing

In the application of database indexing, the B + tree is organized in the following ways:

The way the leaf nodes are organized. The lookup key for a B + tree is the primary key of the data file, and the index is dense. In other words, in the leaf node, the first record of the data file has a key, refers to, the data file can be sorted by the primary key or not by the primary key, the data file is sorted by the primary key, and the B + tree is a sparse index, in the leaf node for each block of data files with a key, refers to; And this property is the B + Tree lookup key, the leaf node in the data file appears in each attribute K has a key, refers to, where the pointer performs a sort key value of K in the first of the record.

The organization of non-leaf nodes.  The non-leaf nodes in B + Trees form a multi-level sparse index on the leaf nodes. There are at least ceil (M/2) pointers in each non-leaf node, with a maximum of M pointers.

2) B + Tree index insertion and deletion

① when inserting new data into the database, you also need to insert the corresponding index key value into the database index, you need to insert a new key value into the B + tree. that is, the B-tree insertion algorithm we mentioned above.

② When you delete data from a database and you also need to remove the corresponding index key value from the database index, you need to remove the key value from the B + tree. that is, the B-Tree deletion algorithm

Why use B-tree (B+tree)

Binary search tree Evolutionary varieties of red and black trees and other data structures can also be used to implement the index, but the file system and database system generally adopt b-/+tree as the index structure.

In general, the index itself is large and cannot be stored in memory , so the index is often stored as an index file on the disk. In this way, the index lookup process will generate disk I/O consumption, relative to memory access, I/O access to the consumption of a few orders of magnitude, so the evaluation of a data structure as an index is the most important indicator of the number of disk I/O operations in the process of incremental complexity. In other words, the structural organization of the index minimizes the number of disk I/O accesses during the lookup process. Why using B-/+tree is also related to the principle of disk access.

principle of locality and disk pre-reading

Due to the characteristics of the storage media, the disk itself is much slower than main memory, coupled with mechanical movement, disk access speed is often one of the hundreds of of main memory, so in order to improve efficiency, to minimize disk I/O. To do this, the disk is often not read strictly on-demand, but is read-ahead every time, even if only one byte is required, and the disk starts from this location, sequentially reading a certain length of data into memory. The rationale for this is the well-known local principle in computer science:

When a data is used, the data around it is usually used immediately.

The data that is required during the program run is usually relatively centralized.

Due to the high efficiency of disk sequential reads (no seek time required and minimal rotational time), pre-reading can improve I/O efficiency for programs with locality.

The length of the read-ahead is generally the integer multiple of the page. Page is the logical block of Computer Management memory, hardware and operating system tend to divide main memory and disk storage area into contiguous size equal blocks, each storage block is called a page (in many operating systems, the page size is usually 4k), main memory and disk in the page to exchange data. When the program to read the data is not in main memory, will trigger a page fault, the system will send a read signal to the disk, the disk will find the starting position of the data and sequentially read one or several pages back into memory, and then return unexpectedly, the program continues to run.

The above analysis B-/+tree retrieves the maximum number of nodes to be accessed at a time:

h =

The database system skillfully utilizes the principle of disk pre-reading, setting the size of a node equal to one page, so that each node can be fully loaded with only one I/O. To achieve this, the following techniques are required to implement B-tree in practice:

Each time you create a new node, request a page space directly, so that a node is physically stored in a page, and the computer storage allocation is page-aligned, the implementation of a node only one time I/O.

B-tree requires a maximum of h-1 I/O (root node resident memory) in a single retrieval, and a progressive complexity of O (h) =o (LOGMN). In general practical applications, M is a very large number, usually more than 100, so H is very small (usually not more than 3).

In summary, using B-tree as index structure efficiency is very high.

And the red-black tree structure,H is obviously much deeper. Because the logically close node (parent-child) may be far away physically, it is not possible to take advantage of locality, so the I/O asymptotic complexity of the red-black tree is also O (h), and the efficiency is significantly worse than B-tree.

MySQL B-tree index (technically speaking b+tree)

In MySQL, there are four main types of indexes: B-tree index, Hash Index, Fulltext Index, and R-tree index. We mainly analyze the B-tree index.

The B-tree index is the most frequently used index type in a MySQL database, and all storage engines except the Archive storage engine support B-tree indexes. The Archive engine does not support indexing until MySQL 5.1, and only supports indexing a single auto_increment column.

Not only in MySQL, but in many other database management systems, the B-tree index is also the most important index type, mainly because the storage structure of the B-tree index has a very good performance in data retrieval of the database.

In general, the physical files of the B-tree index in MySQL are mostly stored in the structure of the Balance tree, that is, all the data that is actually needed is stored in the leaf node of the tree, and the shortest path to any of the leaf nodes is long Are exactly the same, so we all call it the B-tree index. Of course, it is possible that various databases (or MySQL's various storage engines) will slightly transform the storage structure when storing their own b-tree indexes. such as the B-tree index of the INNODB storage engine actually uses the storage structure is B+tree, that is, on the basis of the B-TREE data structure has made a small transformation, on each leaf Node above the information to hold the index key, Also stores pointer information to the next leafnode that is adjacent to the Leaf node (with the addition of sequential access pointers), primarily to speed up the efficiency of retrieving multiple neighboring leaf node.

The following is a discussion of the index implementations of the MyISAM and InnoDB two storage engines:

1. MyISAM Index Implementation:

1) primary key index:

The MyISAM engine uses B+tree as the index structure, and the data domain of the leaf node holds the address of the record. Is the schematic diagram of the MyISAM primary key index:


(Figure MYISAM1)

There are three columns in the table, assuming that we Col1 the primary key, figure myisam1 is a MyISAM table's main index (Primary key) schematic. You can see that the index file of MyISAM only stores the address of the data record.

2) Secondary index (secondary key)

In MyISAM, the primary index and secondary index (secondary key) are structurally indistinguishable, except that the primary index requires that key be unique, and the secondary index key can be duplicated. If we establish a secondary index on Col2, the structure of this index is as follows:

It is also the address of a b+tree,data field that holds data records. Therefore,the algorithm of index retrieval in MyISAM is to search the index according to the B+tree search algorithm first, if the specified key exists, the value of its data field is fetched, then the data record is read with the value of the data field.

The index of MyISAM is also called "non-aggregation", and the reason for this is to differentiate it from InnoDB's clustered index.

2. InnoDB Index Implementation

However InnoDB also uses B+tree as the index structure, but the concrete implementation way is different from MyISAM.

1) primary key index:

the MyISAM index file and the data file are detached, and the index file only holds the address of the data record. In InnoDB, the table data file itself is an index structure organized by B+tree, and the tree's leaf node data field holds the complete data record. The key of this index is the primary key of the data table, so the InnoDB table data file itself is the primary index.

(Figure Inndb Primary key index)

(Figure Inndb Primary key index) is the InnoDB primary index (also the data file), and you can see that the leaf node contains the complete data record. This index is called a clustered index . Because the InnoDB data file itself is clustered by the primary key, the INNODB requires that the table must have a primary key (MyISAM can not), and if it is not explicitly specified, the MySQL system automatically selects a column that uniquely identifies the data record as the primary key, and if no such column exists, Then MySQL automatically generates an implicit field for the InnoDB table as the primary key, which is 6 bytes long and has a length of type.

2). Secondary index of the InnoDB

All secondary indexes of InnoDB refer to the primary key as the data domain. For example, to define a secondary index on the Col3:

InnoDB tables are built on clustered indexes . Therefore, the InnoDB index provides a very fast primary key lookup performance. However, its secondary indexes (secondary index, or non-primary key indexes) also contain primary key columns, so if the primary key definition is larger, the other indexes will be large. If you want to define, many indexes on the table, try to make the primary key as small as possible. InnoDB does not compress the index.

The ASCII code of the text character as the comparison criterion. Clustered index This implementation makes search by primary key efficient, but secondary index search needs to retrieve two times index: first retrieves the secondary index to obtain the primary key, and then retrieves the record with the primary key to the primary index.

The way the index implementations of different storage engines is useful for proper use and optimization of indexes, such as knowing the InnoDB index implementation, makes it easy to understand why it is not recommended to use too-long fields as primary keys, because all secondary indexes refer to the primary index, and the long primary index makes the secondary index too large. For example, using non-monotonic fields as primary keys is not a good idea in InnoDB because the InnoDB data file itself is a b+tree, and a non-monotonic primary key causes the data file to be frequently split in order to maintain b+tree characteristics when inserting new records, which is inefficient, Using the self-increment field as the primary key is a good choice.

InnoDB The difference between index and MyISAM index :

The first is the difference between the primary index, the InnoDB data file itself is the index file. The index and data of MyISAM are separate.

The second is the difference between secondary indexes: the secondary index of the InnoDB data field stores the value of the corresponding record primary key instead of the address. The secondary and primary indexes of MyISAM are not much different.

Why the database should be implemented with B + tree structure--MYSQL index structure

Contact Us

The content source of this page is from Internet, which doesn't represent Alibaba Cloud's opinion; products and services mentioned on that page don't have any relationship with Alibaba Cloud. If the content of the page makes you feel confusing, please write us an email, we will handle the problem within 5 days after receiving your email.

If you find any instances of plagiarism from the community, please send an email to: info-contact@alibabacloud.com and provide relevant evidence. A staff member will contact you within 5 working days.

A Free Trial That Lets You Build Big!

Start building with 50+ products and up to 12 months usage for Elastic Compute Service

  • Sales Support

    1 on 1 presale consultation

  • After-Sales Support

    24/7 Technical Support 6 Free Tickets per Quarter Faster Response

  • Alibaba Cloud offers highly flexible support services tailored to meet your exact needs.