Performance analysis of MySQL indexes

Source: Internet
Author: User

"Preface" The previous blog introduced the index mechanism of the INNODB engine, mainly around the establishment of B + tree, the primary key and page number in the Catalog item record, the binary location data in the page directory, the primary key and the index column in the secondary index, and the table operation. This article analyzes the performance of indexes around how to better use indexes.

 I. The cost of the index

After familiar with the B+ tree index principle, the topic of this article is how to better use the index, although the index is a good thing, can not be built, before the introduction of how to better use the index to understand the cost of using this thing, it in space and time will be dragged down:

    • The cost of space

      This is obvious, each set up an index for it to build a B+ tree, B+ each node of the tree is a data page, a page is to occupy 16KB the storage space, a large B+ tree is composed of many data pages, it is a large piece of storage space.

    • The cost of the time

      We have said that each layer of nodes is sorted by the values of the indexed columns in order from small to large, and the records within each page (whether user records or catalog item records) are formed as a one-way linked list in the order of the values of the indexed columns from small to large. When the data in the table to increase, delete, change the operation of the node and record may be corrupted, so the storage engine needs extra time to do some record shift, page splitting, page recycling what the operation to maintain good node and record sorting. If we built a lot of indexes, the tree of each index B+ should be related maintenance operation, this can not hinder performance?

Second, B + Tree index applicable conditions

Below we will be nagging many of the B+ techniques and considerations for maximizing the efficiency of the tree index, but it is important to understand that all of the techniques are derived from your understanding of the B+ tree index, so if you do not guarantee a full understanding of the tree index, B+ then go back and review it again, Otherwise reading the article is a torment to you. First, the B+ tree index is not omnipotent, and not all query statements can use the index we build. Here are a few things we can do with a B+ tree index to query. For the smooth development of the story, we need to first create a table, which is used to store some basic information about people:

The CREATE TABLE person_info (    name VARCHAR) is not null,    birthday DATE is not null, Phone_number CHAR (one-by-one) is not    null,    Country varchar (+) not NULL,    KEY idx_name_age_birthday (name, birthday, phone_number));

For this person_info table we need to note two points:

    • We did not define a primary key in the table, so the innoDB storage engine would default to our table by adding a row of row_id columns as the primary key.

    • Focus on the index we define idx_name_age_birthday , which is a federated index of 3 columns. So the records stored at the leaf nodes of the tree corresponding to this index are B+ reserved only name , the values of birthday phone_number these three columns, and the primary key values.

Below we draw the index idx_name_age_birthday , but since we have mastered InnoDB the B+ tree index principle, then we in the drawing time in order to make the diagram clearer, so in the record structure omit some unnecessary parts, such as the single-linked list of arrows, record the primary key value of what, only retain name , birthday, phone_number the values of these three columns, so it is long (the attention of the classmate to see, this actually and "high-performance MySQL" Example of the figure, I think this example is very good, so I borrowed a bit):

  

In order to make it easier for everyone to understand, we deliberately distinguish between the inner node and the leaf node with different colors. and a different color to distinguish between the contents of the node in the directory entry records and leaf nodes in the user records (because not clustered index, so the user record is incomplete, missing country column values). As you can see, the idx_name_age_birthday B+ sort of nodes and records in the tree corresponding to this index is as follows:

    • Sort by name the values of the columns first.

    • If name the values of the columns are the same, they are birthday sorted by the values of the columns.

    • If the birthday values of the columns are also the same, the phone_number values are sorted by.

This sort method is very, very special, very, huge, very very very important, because as long as the nodes and records are in order, we can use the dichotomy to quickly locate the search. The bottom of the content depends on this diagram, we understand the picture.

2.1 Full Value match

If the columns and index columns in our search criteria are consistent, this is called full-value matching, for example, the following lookup statement:

SELECT * from person_info WHERE name = ' Ashburn ' and birthday = ' 1990-09-27 ' and phone_number = ' 15123983239 ';

The index we build idx_name_age_birthday contains 3 columns that are shown in this query statement, and the order of the columns in the search criteria is consistent with the order in which the columns are defined. You can imagine this query process:

    • Because the B+ tree's data pages and records are name sorted first by column values, you can quickly locate the name Ashburn record location where the column's value is.

    • In the same column of the record is sorted by the value of the name birthday column, so in the name column value is Ashburn the record can quickly locate the birthday column value is a ‘1990-09-27‘ record.

    • If, unfortunately, the values of the columns are the name birthday same, the records are sorted by the values of the phone_number columns, so the three columns in the federated index may be used.

  if the order of the columns in the search criteria is inconsistent with the order of the indexed columns, let's say that the values of the birthday columns are matched first, because the B+ data pages and records in the tree are sorted first by name column values and cannot be 二分法 used directly Quickly locate records, so you can only scan all of the record pages.

2.2 Match to left column

In fact, in our search statements can also not include all the Union index columns, only the left side of the row, for example, the following query statement:

SELECT * from person_info WHERE name = ' Ashburn ';

Or you can include multiple left columns:

SELECT * from person_info WHERE name = ' Ashburn ' and birthday = ' 1990-09-27 ';

So why should the left column appear in the search criteria to use this B+ tree index? For example, does the statement below use this B+ tree index?

SELECT * from person_info WHERE birthday = ' 1990-09-27 ';

Yes, it does not, because the B+ tree's data pages and records are sorted by the name column values, you directly based on birthday the value of the search, the concubine can not do it ~ then if I want to use only birthday the value of the B+ tree index to find out what to do? This is good, you have to birthday build a B+ tree index on the line, the syntax to create the index do not have to nag me, if you do not go out left to see other articles.

However, it is important to note that the order of the columns in the search criteria must be the same as the order in which the indexed columns are defined, such as the order in which the indexed columns are defined, and, name birthday phone_number If we have only the and, if not, of the search criteria, name phone_number birthday It can only be used The index of the name column is birthday phone_number not used (because records with the name same value are sorted first, records with the same value birthday birthday phone_number are sorted by value).

Therefore, when using the index must be used in order to establish the index, so as to maximize the role. There is no skip.

2.3 Matching column prefixes

For indexed columns of string types, it is not necessary to match the value of the column exactly, but it is also possible to match its prefix, because the prefix itself is already well-sequenced. Let's say we want to query the ‘As‘ records that start with the name, so we can write the query statement:

SELECT * from Person_info WHERE name is like ' as% ';

  B+The data pages and records in the tree are name sorted by column, and only prefixes can be 二分法 quickly located. However, it is important to note that if you only give a suffix or a middle string , such as this:

SELECT * from Person_info WHERE name is like '%as% ';

  MySQLIt is not possible 二分法 to quickly locate the record position, because the string ‘As‘ in the middle of the string is not well-sequenced, so it can only be scanned in full table.

2.4 Matching range values

Looking back at our indexed tree, all the records are sorted by the values of the idx_name_age_birthday B+ indexed columns from small to large, so it's great for us to find the records of the indexed column values in a range. Let's say the following query statement:

SELECT * FROM person_info WHERE name > ‘Asa‘ AND name < ‘Barlow‘;

Since the B+ nodes in the tree and the data pages are name sorted by column first, the query process above is actually like this:

    • Locate name Asa the record for which the value is.

    • Locate name Barlow the record for which the value is.

    • Oh, because all the records are connected by Lian Lin (a single linked list between records, a double-linked list between data pages), so the records between them can be easily removed.

    • Locate the primary key values for these records, and 聚簇索引 then 回表 find the complete record in.

However, when searching for scopes, it is important to note that if you are scoping multiple columns at the same time, you can use the tree index only if you are scoping the leftmost column of the index B+ , for example:

SELECT * FROM person_info WHERE name > ‘Asa‘ AND name < ‘Barlow‘ AND birthday > ‘1980-01-01‘;

Why not? Because the above query can be divided into two parts:

    1. By criteria name > ‘Asa‘ AND name < ‘Barlow‘ , the name results of the lookup may have multiple name records with different values,

    2. Records that name differ from these values continue to be filtered through the birthday > ‘1980-01-01‘ condition.

However, it is important to note that only name the same value can be used to sort the values of the columns, that is, the result of a birthday name range lookup is not sorted by the birthday column, so it is not possible to continue to find columns in the search criteria birthday B+the tree index. To be told to take out a full scan of the records.

2.5 match a column exactly and the range matches another column

For the same federated index, although a range lookup for multiple columns can only be used for the leftmost index column, if the left column is an exact lookup, the right column can be scoped, for example:

SELECT * FROM person_info WHERE name = ‘Ashburn‘ AND birthday > ‘1980-01-01‘ AND birthday < ‘2000-12-31‘ AND phone_number > ‘15100000000‘;

The criteria for this query can be divided into 3 parts:

    1. name = ‘Ashburn‘, name you can use the tree index for exact lookup of columns B+ .

    2. birthday > ‘1980-01-01‘ AND birthday < ‘2000-12-31‘, because the name columns are exact lookups, the name = ‘Ashburn‘ values of the results obtained by the conditional lookup are the name same, and they are then birthday sorted by the values. Therefore, birthday it is possible to use a tree index for a range lookup on a column at this time B+ .

    3. phone_number > ‘15100000000‘, the values of the records that are birthday found through the range birthday may be different, so this condition can no longer take advantage B+ of the tree index, only the records obtained from the previous query can be traversed.

2.5 for sorting

When we write query statements, we often need to query the records according to some sort of rules, for B+ the case of non-applicable tree index, we can only load the records into memory, and then use some sort algorithm, such as fast sorting, merge sort, Oh, come on. Sort the records in the deposit, and then return the ordered result set back to the client. But if ORDER BY our index column is used in the clause, it is possible to omit the steps to sort in memory, such as this simple query statement below:

SELECT * from Person_info ORDER by name, birthday, Phone_number LIMIT 10;

The result set of this query needs to be sorted by value first, and if the values of the name records are the same, they need to be sorted name birthday , and if birthday the values are the same, they need to be phone_number sorted. You can go back and look at the index we built idx_name_age_birthday , because the B+ tree index itself is well-sequenced, so it's good to extract the data directly from the index. Simple, huh? Yes, the index is so diao~

However, there is a problem to note that ORDER BY the order of the columns behind the clauses must also be given in the order of the indexed columns, if given ORDER BY phone_number, birthday, name in the order, that is not used in the B+ tree index, the reverse order can not use the index reason we said above, this does not repeat. Similarly, ORDER BY name a ORDER BY name, birthday partial tree index can be used in the form of a column to the left of this matching index B+ . You can see that the form of the index is really very rigid!!!

2.6 for grouping

Sometimes in order to facilitate some of the information in the tables, the records in the table are grouped according to some columns. For example, the following group query:

SELECT name, birthday, phone_number, COUNT(*) FROM person_info GROUP BY name, birthday, phone_number

This query statement is equivalent to doing a 3-time grouping operation:

    1. The records are name grouped by value, and all name records with the same value are divided into a group.

    2. The records in the same grouping of each value are then grouped in the same value, and the name birthday birthday Records of the same value are placed in a small group, so it looks like a lot of small groupings in a large group.

    3. Then the small groups produced in the previous step are grouped phone_number into smaller groupings, so the overall look is like dividing the records into a large group, dividing them into 大分组 several pieces 小分组 , and then dividing several 小分组 into more 小小分组 .

Then the statistics for those 小小分组 , such as in our query statement is to count each 小小分组 of the number of records contained. If there is no index, this grouping process all need to be implemented in memory, and if there is an index, it happens that this grouping order B+ is consistent with the order of the index column in our tree, and our B+ tree index is sorted by index column, this is not exactly, so you can directly use B+The tree index is grouped.

It's a good idea to sort by using a tree index, and the B+ order of the grouped columns also needs to be consistent with the order of the indexed columns, or it can be grouped using only the columns to the left of the indexed column.

Iii. How to select an index

Above we take the idx_name_age_birthday index as an example of the index of the applicable conditions of the detailed nagging, below to see what we should be aware of when indexing or writing query statements should be taken into consideration.

1. Create indexes only for columns for searching, sorting, or grouping

That is, the index is created only for columns that appear in the clause, join WHERE columns in the JOIN clause, or columns that appear in the ORDER BY or GROUP BY clause. The columns that appear in the query list do not need to be indexed:

SELECT birthday, country from person_name WHERE name = ' Ashburn ';

Like in the query list birthday , country these two columns do not need to be indexed, we just need to create an index for the columns that appear in the WHERE clause name .

2, consider the cardinality of the column

  列的基数Refers to the number of distinct data in a column, say a column contains a value 2, 5, 8, 2, 5, 8, 2, 5, 8 , although there is a 9 record, but the cardinality of the column is 3 . That is, with a certain number of rows, the larger the cardinality of the column, the more dispersed the values in the column, and the smaller the cardinality of the column, the more concentrated the values in the column. This 列的基数 indicator is very important and directly affects whether we can effectively use the index. Assuming that the cardinality of a column is the same as that 1 of all records in that column, it is useless to index that column because all values are the same and cannot be sorted, and no dichotomy is found ~ So the conclusion is that it is best to index columns of those columns with large cardinality. Indexing for a base too small column may not be effective.

3. The type of index column is as small as possible

When we define the table structure, we explicitly specify the type of the column, take the integer type as an example, there are,,, and so on, and so on, TINYINT MEDIUMINT INT BIGINT They occupy the storage space in order increment, can represent the integer range of course also increment, we say 类型大小 here Refers to the size of the data range represented by the type. If we want to index an integer column, as long as the range of integers allowed, try to make the index columns use a smaller type, for example, if we can use it, do not use it INT BIGINT , you can use it MEDIUMINT instead of INT ~ this is because:

    • The smaller the data type, the faster the comparison operation at query time (this is the CPU level of the stuff)

    • The smaller the data type, the less storage space The index occupies, the more records can be dropped within a single data page, and thus reduce I/O the performance loss of the disk, which means that more data pages can be cached in memory, which speeds up read and write efficiency.

This recommendation is more appropriate for the primary key of the table, because not only is the primary key value stored in the clustered index, but the leaf nodes of all the other two-level indexes store a record's primary key value, which means more storage space is saved if the primary key applies to a smaller data type.

4. Prefix of index string value

We know that a string is actually made up of a number of characters, and if we apply a character set in the MySQL utf8 to store the string, it takes a byte to encode a character 1~3 . Assuming our string is very long, it takes a lot of storage space to store a string. When we need to index this string column, that means B+ there are two problems in the corresponding tree:

    • B+The records in the tree index need to store the full string of the column, and the longer the string, the greater the storage space in the index.

    • If B+ the indexed column in the tree index stores a long string, it takes more time to make a string comparison.

So the designer of the index proposed a scheme---only the first few characters of the string are indexed. Through the first few characters of the string we have been able to sort the string, the rest can be searched through the search ah, so only in the B+ tree to store the first few characters of the string encoding, saving space, but also reduce the string comparison time, but probably solve the problem of sorting, why not, Let's say we index only the name first 10 characters of a column in a table statement, so you can write:

The CREATE TABLE person_info (    name VARCHAR) is not null,    birthday DATE is not null, Phone_number CHAR (one-by-one) is not    null,    Country varchar (+) not NULL,    KEY Idx_name_age_birthday (Name (ten), Birthday, phone_number));    

  name(10)It means that only the B+ encoding of the first character of the record is preserved in the established tree index 10 , and this policy of indexing string-value prefixes is highly encouraged, especially if the string type can store more characters.

5. Use federated indexes as much as possible

If we have more than one column in our search criteria, it's best to create one for these columns instead 联合索引 of one for each column (because each index will maintain a B+ tree), like person_info the index of our table, idx_name_age_birthday which is name birthday phone_number , The federated Index of these three columns, so this federated index can be used to search for several combinations of columns below:

name, birthday, phone_number
name, birthday

If we do have a need for other column searches, then build an index for the other columns separately, although there is a cost to creating an index, but it is important to ensure efficient queries.

6. Let the index columns appear separately in the comparison expression

Let's say we have an integer column in the table my_col , and we have an index for that. The two clauses below WHERE , while semantically consistent, differ in efficiency:

    1. WHERE my_col * 2 < 4

    2. WHERE my_col < 4/2

The WHERE column in the 1th clause my_col does not appear as a separate column, but in the form of my_col * 2 such an expression, where the storage engine loops through all the records and evaluates whether the expression is less than the value 4 , so this is not the case with my_col columns created B+the tree index. The 2nd WHERE clause is listed in the form of a my_col separate column, which allows the tree index to be used directly B+ .

So the conclusion is that if the index column does not appear as a separate column in the comparison expression, but in the form of an expression or function call, it is not indexed.

8. Primary key Insertion Order

We know that for a InnoDB table that uses the storage engine, when we do not explicitly create an index, the data in the table is actually stored in 聚簇索引 the leaf node. And the records are stored in the data page, data pages and records are sorted according to the primary key value from small to large, so if we insert the primary key value of the record is increased sequentially, then we each fill a page of data into the next data page to continue to plug, and if we insert the primary key value is suddenly large and small, This is a bit of a hassle, assuming that a data page stores a record that is full, and that it stores the primary key values 1~100 between:

If you insert a record with a primary key value at this point 9 , it will be inserted in the same position as:

But this data page is full, and then inserted into it? We need to split the current page into two pages and move some of the records on this page to the newly created page. What does page splitting and record shifting mean? Means: Performance loss! So if we want to try to avoid this unnecessary loss of performance, it is best to let the inserted record of the primary key value increment, so that the performance loss will not occur. So we suggest: let the primary key have AUTO_INCREMENT , let the storage engine generate the primary key for the table itself, instead of manually inserting it, let's say we can define the table like this person_info :

CREATE TABLE person_info (    ID INT UNSIGNED NOT NULL auto_increment,    name VARCHAR (+) not NULL,    birthday DATE N OT null,    phone_number CHAR (one) not NULL,    country varchar (+) NOT NULL,    PRIMARY key (ID),    key idx_name_ Age_birthday (name (ten), Birthday, phone_number));    

Our custom primary key column id has AUTO_INCREMENT properties, and the storage engine automatically fills in the self-increment primary key value for us when inserting records.

9. Redundant and repetitive indexes

Sometimes some students intentionally or unintentionally create multiple indexes on the same column, for example, to write a table statement:

CREATE TABLE person_info (    ID INT UNSIGNED NOT NULL auto_increment,    name VARCHAR (+) not NULL,    birthday DATE N OT null,    phone_number CHAR (one) not NULL,    country varchar (+) NOT NULL,    PRIMARY key (ID),    key idx_name_ Age_birthday (name (ten), Birthday, Phone_number),    KEY idx_name (name (10)));    

We know that by idx_name_age_birthday indexing a column can be name quickly searched, and then creating a name column-specific index even if it is an 冗余 index, maintaining the index will only increase the cost of maintenance and will not benefit the search.

In another case, we might re-index a column, for example:

CREATE TABLE repeat_index_demo (
   c1 INT PRIMARY KEY,
   c2 INT,
   UNIQUE uidx_c1 (c1),
   INDEX idx_c1 (c1)
);  

We see that the primary key, which is c1 defined as a unique index, is also defined as a normal index, but the primary key itself will have a PLA clustered index, so the defined unique index and the normal index are duplicated, which is to be avoided.

10. Overlay Index

Let's say this query:

SELECT * from person_info WHERE name = ' Ashburn ';

  person_infoThere are idx_name_age_birthday only 3 index columns in the table, and there are 4 columns in the table, so in order to get the complete user record, the corresponding primary key value is obtained once by the idx_name_age_birthday index 聚簇索引 回表 . 回表operation is also a performance loss Ah, so we recommend: It is best to include only the index column in the query list, such as:

SELECT name, birthday, phone_number FROM person_info WHERE name = ‘Ashburn‘;

Because we only query name the birthday phone_number values of these three indexed columns, we idx_name_age_birthday do not have to find the remaining columns of the record, that is, the values of the columns, after getting the results from the index 聚簇索引 country . This eliminates the 回表 performance loss of the operation, which we call the query method that only needs to be indexed 索引覆盖 .

Of course, if the business needs to query a column other than the index, it is still to ensure that business needs are paramount. However, we do not encourage the use * of numbers as a query list, it is best to make the columns we need to query in sequence.

Summarize

Above is just some of the points we need to be aware of when creating and using B+ the tree index, and we will introduce more optimization methods and considerations in the back, so please look forward to it. The contents of this episode are summarized as follows:

    1. B+The tree index has a price in both space and time, so don't build indexes.

    2. B+The tree index applies to these situations:

      • Full value Matching

      • Match the column to the left

      • Match Range value

      • Match exactly one column and the range matches another column

      • Used for sorting

      • For grouping

    3. Here are some things to keep in mind when working with indexes:

      • Create indexes only for columns that are used for searching, sorting, or grouping

      • To create an index on a column with a large base column

      • The type of index column is as small as possible

      • You can index only the prefix of a string value

      • Indexes can be applied only if the index columns appear alone in the comparison expression

      • In order to minimize the 聚簇索引 occurrence of page splits and record shifts, it is recommended that the primary key have AUTO_INCREMENT properties.

      • Locating and deleting duplicate and redundant indexes in a table

      • As far as possible for the 覆盖索引 query, to avoid 回表 the performance loss.

Performance analysis of MySQL indexes

Contact Us

The content source of this page is from Internet, which doesn't represent Alibaba Cloud's opinion; products and services mentioned on that page don't have any relationship with Alibaba Cloud. If the content of the page makes you feel confusing, please write us an email, we will handle the problem within 5 days after receiving your email.

If you find any instances of plagiarism from the community, please send an email to: info-contact@alibabacloud.com and provide relevant evidence. A staff member will contact you within 5 working days.

A Free Trial That Lets You Build Big!

Start building with 50+ products and up to 12 months usage for Elastic Compute Service

  • Sales Support

    1 on 1 presale consultation

  • After-Sales Support

    24/7 Technical Support 6 Free Tickets per Quarter Faster Response

  • Alibaba Cloud offers highly flexible support services tailored to meet your exact needs.