Principles of database indexing

Last Update:2018-05-23 Source: Internet

Author: User

This digest is excerpted from the following official-account article and keeps only the part on indexes: Micro-service database design and read-write separation, Link

III. Principles of database indexing

In the whole process, the point most likely to become the bottleneck is reading and writing data. This usually means sequential or random reads and writes on disk, and disk I/O is slow.

How can this process be sped up? As you have probably guessed, the answer is an index.

Why does the index speed up the process?

We have all been to a food court with many restaurants. If you are in no hurry and not hungry, search performance does not matter: you can stroll through the mall, looking at one restaurant after another until you find the one you want. But when you are hungry, or have a reservation at a particular restaurant, you want to go straight there. In that case you look at the floor directory, quickly find where your target restaurant is, and head directly to it, which saves a great deal of time. This is the role of an index.

So an index maps a value to its location, so that the data can be found and accessed quickly.

The other role of an index is that some questions can be answered without looking at the data at all. To find out whether the mall has a certain restaurant, you only need to look at the directory rather than walk a lap around the mall. To list all the Sichuan restaurants, looking at the directory is also enough; there is no need to visit each one.

So how does the index work in MySQL?

In MySQL, the index structure is usually a B+ tree.

A B+ tree of order M has the following properties:

Nodes are divided into index nodes and data nodes. The index nodes correspond to the internal nodes of a B-tree, and together they form a B-tree with all the properties of a B-tree. An index node stores only keys and pointers, not the actual elements. The data nodes correspond to the external nodes of a B-tree. In a B-tree the external nodes are empty, whereas in a B+ tree they hold the real data elements, including each element's key and other fields, but no pointers.
The B-tree formed by the index nodes is used only to find which external node holds the data element with a given key. Finding the key in an index node is not the end of the search: the lookup continues down to the data node, and the elements in that data node are then searched, by binary search or sequential scan, to find the real data element.
The order M only controls the degree of the index nodes; how many elements each data node holds is independent of M.
All data nodes are also chained in a linked list, so they can be accessed sequentially.

This definition is rather abstract, so let's look at a concrete example.

As we can see, this is a B+ tree of order 3 in which each external data node holds at most 5 items. When an item is inserted into a data node, the B-tree formed by the index nodes does not change as long as the insertion causes no split or merge.

If item 76 is inserted into the external node holding 71 to 75, the node splits: 71, 72, 73 become one data node and 74, 75, 76 become another. For the index nodes, this is equivalent to inserting the key 74.
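The split described here can be sketched in a few lines of Python. This is only an illustration, not InnoDB's actual algorithm: the leaf is modelled as a sorted list, the capacity of 5 comes from the example, and the function name `insert_into_leaf` and the choice to split in the middle are assumptions.

```python
import bisect

MAX_ITEMS = 5  # leaf capacity taken from the example in the text


def insert_into_leaf(leaf, key):
    """Insert key into a sorted leaf and split it on overflow.

    Returns (left, right, separator); right and separator are None
    when the leaf still fits and no split is needed.
    """
    items = list(leaf)
    bisect.insort(items, key)
    if len(items) <= MAX_ITEMS:
        return items, None, None
    mid = len(items) // 2  # assumed split point: the middle
    left, right = items[:mid], items[mid:]
    # The first key of the new right leaf is the key the index node gains.
    return left, right, right[0]


left, right, separator = insert_into_leaf([71, 72, 73, 74, 75], 76)
print(left, right, separator)
```

With the leaf from the text, the two halves and the separator key match the description: the index level only has to absorb one new key.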

If 43 is removed from the external node holding 41 to 43, a merge is triggered: 41, 42, 61, 62, 63 are merged into one node. For the index nodes, this is equivalent to removing the key 60.

Lookups are fast because a B+ tree has very few levels. For example, to find the value 62: at the root, 62 is greater than 40, so go right; it is less than 70, so go left; it is greater than 60, so go right; 62 is then found as the second item in the leaf node, and the lookup succeeds.
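A lookup of this kind can be sketched as follows. The tree below is a hypothetical two-level tree, not the figure from the article, and the class and function names (`IndexNode`, `DataNode`, `find`, `scan`) are made up for illustration. Index nodes hold only separator keys and child pointers, data nodes hold the keys and are chained together.

```python
import bisect


class DataNode:
    def __init__(self, keys):
        self.keys = keys   # sorted keys; the real rows are omitted here
        self.next = None   # link to the next data node


class IndexNode:
    def __init__(self, keys, children):
        self.keys = keys          # sorted separator keys
        self.children = children  # always one more child than keys


def find(node, key):
    """Return (found, levels_visited) for key."""
    levels = 1
    while isinstance(node, IndexNode):
        # Keys >= a separator live in the subtree to its right.
        node = node.children[bisect.bisect_right(node.keys, key)]
        levels += 1
    i = bisect.bisect_left(node.keys, key)
    return i < len(node.keys) and node.keys[i] == key, levels


def scan(first_leaf):
    """Walk the linked list of data nodes in key order."""
    out = []
    node = first_leaf
    while node is not None:
        out.extend(node.keys)
        node = node.next
    return out


# Hypothetical tree: one index node over three chained data nodes.
leaves = [DataNode([41, 42, 43]), DataNode([61, 62, 63]),
          DataNode([71, 72, 73, 74, 75])]
for a, b in zip(leaves, leaves[1:]):
    a.next = b
root = IndexNode([60, 70], leaves)

print(find(root, 62))
print(scan(leaves[0]))
```

The number of levels visited, not the number of stored keys, is what bounds the number of page reads for a point lookup.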

In MySQL's InnoDB there are two kinds of B+ tree index: the clustered index and the secondary index.

The leaf nodes of the clustered index are the data nodes, and the primary key is usually used as the clustered index. The leaf nodes of a secondary index hold the indexed key fields plus the primary key value. Accessing data through a secondary index therefore means traversing two indexes: the secondary index first, then the clustered index.

There is also a form of index called a composite (or compound) index, which is built on multiple columns.

Such an index is ordered by comparing the first column first, then the second column when the first columns are equal, and so on.
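This ordering is the same as lexicographic ordering of tuples, which Python provides out of the box. The sketch below uses hypothetical (a, b) pairs in a sorted list as a stand-in for the entries of a composite index; `contains` is an illustrative helper, not a database API.

```python
import bisect

# Hypothetical (a, b) pairs standing in for composite index entries.
entries = [(2, "y"), (1, "z"), (2, "x"), (1, "x"), (3, "y")]
index = sorted(entries)  # tuples compare column by column


def contains(index, a, b):
    """Binary-search the sorted entries for an exact (a, b) match."""
    i = bisect.bisect_left(index, (a, b))
    return i < len(index) and index[i] == (a, b)


print(index)
print(contains(index, 2, "y"), contains(index, 2, "z"))
```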

IV. Advantages and disadvantages of database indexes

The most obvious advantage of a database index is reduced I/O. Several scenarios are analyzed below.

For fields with an = condition, searching the B+ tree locates the data directly, reaching the leaf node with very few disk reads (equal to the height of the B+ tree).

For range conditions, because the B+ tree is ordered, the range can be located quickly through the tree.

Similarly, ORDER BY, GROUP BY, DISTINCT, MAX, and MIN can get results quickly because the B+ tree is ordered.

Another common scenario is the covering index. Suppose fields A and B are used as condition fields, often as A=a AND B=b, while the query selects C and D. A composite index (A, B) is often built for this. It is a secondary index, so a search quickly finds the matching leaf nodes and records through the secondary index's B+ tree. Those records, however, contain only the clustered index ID, so the B+ tree of the clustered index must also be searched to find the real rows in the table and read C and D from them. If the composite index is instead built on (A, B, C, D), all the needed data is already in the secondary index's B+ tree and can be returned directly, saving the second tree search.
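The difference between the two lookups can be sketched with plain dictionaries standing in for B+ trees, where each dictionary access counts as one tree search. The table contents and helper names are hypothetical, and (a, b) is assumed to be unique to keep the sketch short.

```python
# Dicts stand in for B+ trees; each dict access counts as one tree search.
# Table contents and helper names are hypothetical.
clustered = {  # primary key id -> full row
    1: {"a": 1, "b": 10, "c": "c1", "d": "d1"},
    2: {"a": 1, "b": 20, "c": "c2", "d": "d2"},
    3: {"a": 2, "b": 10, "c": "c3", "d": "d3"},
}
# Secondary index on (a, b): leaf entries hold the key fields plus the id.
index_ab = {(r["a"], r["b"]): pk for pk, r in clustered.items()}
# Secondary index on (a, b, c, d): the leaf entry already holds c and d.
index_abcd = {(r["a"], r["b"]): (r["c"], r["d"]) for r in clustered.values()}


def select_cd_via_ab(a, b):
    pk = index_ab[(a, b)]   # search 1: secondary index
    row = clustered[pk]     # search 2: clustered index
    return (row["c"], row["d"]), 2


def select_cd_via_abcd(a, b):
    return index_abcd[(a, b)], 1  # search 1 only, no lookup in the table


print(select_cd_via_ab(1, 20))
print(select_cd_via_abcd(1, 20))
```

Both functions return the same values for C and D; only the number of tree searches differs.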

Of course, indexes come at a price; there is no free lunch.

The benefit of an index is faster reads; the cost is slower writes.

Inserting and modifying data may require changes to the index.

On insert, the clustered index is usually built on the primary key, so the primary key is best auto-incremented: new rows are then always appended at the end, in order, which is more efficient. Avoid using a UUID as the primary key: its order is essentially random, which causes random writes and poor efficiency. Avoid business-related primary keys as well: a business-related key may be updated, and updating it means a delete and a re-insert, which is inefficient.
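A rough way to see the difference is to count how many inserts land somewhere other than the end of the ordered key sequence. The sketch below uses a sorted Python list in place of the clustered index and a seeded shuffle in place of random keys such as UUIDs; both substitutions, and the function name `count_mid_inserts`, are assumptions made for illustration.

```python
import bisect
import random


def count_mid_inserts(keys):
    """Count inserts that do not land at the end of the sorted sequence."""
    ordered = []
    mid_inserts = 0
    for key in keys:
        pos = bisect.bisect_right(ordered, key)
        if pos != len(ordered):
            mid_inserts += 1  # existing entries must make room
        ordered.insert(pos, key)
    return mid_inserts


sequential = list(range(1, 1001))  # auto-increment style keys
shuffled = sequential[:]
random.Random(0).shuffle(shuffled)  # stand-in for random keys

print(count_mid_inserts(sequential), count_mid_inserts(shuffled))
```

Sequential keys never disturb existing entries, while nearly every random key lands in the middle, which in a real B+ tree translates into random page writes and more splits.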

From the description of the B+ tree above, we can see that splitting is relatively expensive, and splits mostly happen during inserts.

Modifying indexed data is basically equivalent to a delete followed by a re-insert, so its cost is also relatively high.

For some string columns, a secondary index often causes random writes and reads, which puts more pressure on I/O.

V. Understanding the principles behind the database "military rules"

By understanding the principles of these two kinds of index, we can explain why the many so-called database "military rules" look the way they do. Let's go through them one by one.

Under what circumstances should a composite index be used instead of separate indexes?

Take the condition A=a AND B=b. If A and B have two separate indexes, typically only one index is used for the AND condition, and B is then checked row by row. With a composite index (A, B), a single tree traversal is enough, which greatly improves efficiency. For A=a OR B=b, however, the composite index does not help because of the OR relationship. Here separate indexes can be used, and the two indexes can work at the same time.

Why should indexed columns be selective, and why should the more selective column come first in a composite index?

If a column has low selectivity, such as gender, the index only divides the whole table into two parts. Finding the data still means traversing half the table, which makes the index pointless.
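Selectivity is commonly estimated as the number of distinct values divided by the number of rows. The sketch below applies that ratio to two hypothetical columns; the column data and the function name `selectivity` are made up for illustration.

```python
def selectivity(values):
    """Distinct values divided by total rows; 1.0 means every value is unique."""
    return len(set(values)) / len(values)


gender = ["M", "F"] * 500     # hypothetical low-selectivity column
user_id = list(range(1000))   # hypothetical unique column

print(selectivity(gender), selectivity(user_id))
```

A value close to 0 means an index lookup still leaves a large share of the table to scan; a value close to 1 means it narrows the search to a handful of rows.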

Do I need a single-column index if I have a composite index?

If the composite index is (A, B), the condition A=a can use it, because the composite index is sorted by the first column first, so there is no need for a separate index on A. The condition B=b alone cannot use it: the second column is compared only when the first column is equal, so rows with the same B value can be scattered across different nodes and cannot be located quickly.
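The point can be checked on a small sorted list of hypothetical (a, b) pairs: entries matching a condition on A form one contiguous run that binary search can find, while entries matching a condition on B alone are spread through the list. The data and helper names below are illustrative only.

```python
import bisect

# Hypothetical composite index entries, sorted by a and then by b.
index = sorted((a, b) for a in (1, 2, 3) for b in ("x", "y"))


def range_for_a(index, a):
    """Locate the contiguous run with the given a via two binary searches."""
    lo = bisect.bisect_left(index, (a,))      # (a,) sorts before any (a, b)
    hi = bisect.bisect_left(index, (a + 1,))
    return lo, hi


def positions_for_b(index, b):
    """No ordering on b alone, so every entry has to be checked."""
    return [i for i, (_, eb) in enumerate(index) if eb == b]


print(range_for_a(index, 2))
print(positions_for_b(index, "y"))
```

The run for A is a single slice of the list, whereas the positions for B alternate with non-matching entries, so only a full scan can collect them.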

Is it true that the more indexes, the better?

Of course not. Add an index only where it is necessary. Indexes not only make inserts and updates slower; at query time, too many indexes can also mislead the query optimizer, which may fail to find the best access path and end up choosing a slow index.

Why use an auto-increment primary key?

Because string primary keys and random primary keys cause rows to be inserted at random positions, which is inefficient. The primary key should also be updated rarely, to avoid frequent merges and splits in the B+ tree.

Why avoid NULL as much as possible?

NULL is harder to handle in a B+ tree and often requires special-case logic, which reduces efficiency.

Why not index frequently updated fields?

Updating a field means that the corresponding index must also be updated, and an update often means a delete followed by an insert. An index is a data structure built in advance at write time to make reads more efficient, so it is not recommended for a field that is written far more often than it is read.

Why not use functions in query conditions?

Take the condition id+1=10. The index is built in advance at write time, while id+1 is computed at query time, so the index cannot help: there is no way to apply the calculation to every index entry first and then compare, because the cost would be too high. Use id=10-1 instead.
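The cost difference can be sketched with a sorted list of ids standing in for the index. The rewritten condition becomes a binary search for a constant, while the original condition has to evaluate the expression for every entry. The function names and the data size are assumptions chosen for illustration.

```python
def index_lookup(sorted_ids, value):
    """Binary search for a constant; returns (found, comparisons)."""
    lo, hi, steps = 0, len(sorted_ids), 0
    while lo < hi:
        mid = (lo + hi) // 2
        steps += 1
        if sorted_ids[mid] < value:
            lo = mid + 1
        else:
            hi = mid
    found = lo < len(sorted_ids) and sorted_ids[lo] == value
    return found, steps


def expression_scan(sorted_ids, target):
    """Evaluate id + 1 for every entry; returns (matches, evaluations)."""
    matches = [v for v in sorted_ids if v + 1 == target]
    return matches, len(sorted_ids)


ids = list(range(1, 1025))           # hypothetical indexed id column
print(index_lookup(ids, 10 - 1))     # id = 10 - 1
print(expression_scan(ids, 10))      # id + 1 = 10
```

Both forms find the same row, but the first needs about log2(n) comparisons and the second needs n evaluations.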

Why not use negation in query conditions?

Imagine a B+ tree whose root node is 40. If your condition is "equals 20", search to the left; if it is "equals 50", search to the right. But if your condition is "not equal to 66", what should the index do? Nothing can be decided until the whole tree has been traversed.

Why should fuzzy queries not start with a wildcard?

Consider a B+ tree whose root is the string def. If the wildcard is at the end, as in abc%, search to the left; for efg%, search to the right. If the wildcard is at the front, as in %abc, there is no way to know which side to search, so everything has to be scanned.
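The same effect shows up on a sorted list of strings, used here as a stand-in for an index on a string column. A trailing wildcard becomes a binary search plus a short forward scan, while a leading wildcard forces a check of every entry. The word list and function names are hypothetical.

```python
import bisect

# Hypothetical indexed string values, kept in sorted order.
index = sorted(["abc", "abcd", "abd", "def", "efg", "xabc", "zabc"])


def like_prefix(index, prefix):
    """abc%: jump to the first candidate, then scan forward while it matches."""
    out = []
    i = bisect.bisect_left(index, prefix)
    while i < len(index) and index[i].startswith(prefix):
        out.append(index[i])
        i += 1
    return out


def like_suffix(index, suffix):
    """%abc: the sort order gives no starting point, so scan everything."""
    return [w for w in index if w.endswith(suffix)]


print(like_prefix(index, "abc"))
print(like_suffix(index, "abc"))
```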

Why change OR to IN, or use UNION?

For OR conditions the optimizer often has difficulty finding the best path, especially when there are many of them. For conditions on the same field, IN is better: the database sorts the values in the IN list and processes them uniformly with binary search. For conditions on different fields, use UNION so that each subquery can use an index.

Why should data types be as small as possible, integers be preferred over character types, and long character types use a prefix index?

Because the database stores data in pages, and every page has the same size. With a larger data type, each page holds fewer entries, more pages are needed, and the tree becomes taller, so a lookup needs more I/O reads and inserts are more likely to split nodes, all of which reduces efficiency. This is the reason for using integers instead of character types: integers are more efficient to index, for example for IP addresses. If a long character column must be queried through an index, consider indexing a prefix of the field rather than the whole field, so that the index does not grow too large.
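A back-of-the-envelope calculation shows how key size affects tree height. The sketch below assumes the InnoDB default page size of 16 KB and, purely for illustration, a 6-byte child pointer and an 8-byte primary key in each secondary index leaf entry; it ignores page headers and fill factor, so the numbers are only indicative.

```python
PAGE_SIZE = 16 * 1024  # InnoDB default page size in bytes
POINTER_SIZE = 6       # assumed size of a child pointer (illustrative)
PK_SIZE = 8            # assumed primary key size stored in each leaf entry


def fanout(key_bytes):
    """Index entries per page: each entry is a key plus a child pointer."""
    return PAGE_SIZE // (key_bytes + POINTER_SIZE)


def tree_height(rows, key_bytes):
    """Levels needed for a secondary index over `rows` entries."""
    per_leaf = PAGE_SIZE // (key_bytes + PK_SIZE)
    nodes = -(-rows // per_leaf)  # ceiling division: number of leaf pages
    height = 1
    while nodes > 1:
        nodes = -(-nodes // fanout(key_bytes))
        height += 1
    return height


for key_bytes in (4, 64):  # e.g. a 4-byte integer vs a 64-byte string key
    print(key_bytes, fanout(key_bytes), tree_height(10**8, key_bytes))
```

Under these assumptions the wider key lowers the fanout and adds a level for the same number of rows, and each extra level is one more page read per lookup.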

VI. Methodology for query optimization

To find the SQL statements that need optimizing, first collect the problematic ones.

MySQL provides a slow query log, enabled with the slow_query_log parameter, which records the SQL statements whose execution time exceeds a threshold (set by long_query_time).

SQL statements that do not use an index can be logged as well by enabling the log_queries_not_using_indexes parameter.

min_examined_row_limit: SQL statements that examine fewer rows than this value are not recorded in the slow query log.

Once a problematic statement is found, the next step is to get its execution plan with EXPLAIN followed by the SQL statement, and check the following. Are the rows located through an index? If not, creating an index may improve execution efficiency. Is the number of rows scanned too large? Is the lock time too long, which indicates lock contention? Is the number of rows returned too large?

Next, optimize case by case. If the fields in the filter are not covered by any index, create an index on a highly selective field; if multiple fields are involved, try to create a composite index.

If many rows are scanned but few are returned, selectivity is poor: re-evaluate the fields used in the SQL statement and create the index on more selective fields.

If many rows are scanned and many are also returned, the filter is not restrictive enough: add filter conditions to the SQL statement.

The sys schema view schema_redundant_indexes shows which indexes are redundant. If several indexes cover the same fields in a consistent order, they can be combined into one composite index.

The sys schema view schema_unused_indexes shows which indexes are never used.

This article is an English version of an article which is originally in the Chinese language on aliyun.com and is provided for information purposes only. This website makes no representation or warranty of any kind, either expressed or implied, as to the accuracy, completeness ownership or reliability of the article or any translations thereof. If you have any concerns or complaints relating to the article, please send an email, providing a detailed description of the concern or complaint, to info-contact@alibabacloud.com. A staff member will contact you within 5 working days. Once verified, infringing content will be removed immediately.
