"Preface" The previous blog introduced the index mechanism of the INNODB engine, mainly around the establishment of B + tree, the primary key and page number in the Catalog item record, the binary location data in the page directory, the primary key and the index column in the secondary index, and the table operation. This article analyzes the performance of indexes around how to better use indexes.
I. The cost of the index
After familiar with the B+
tree index principle, the topic of this article is how to better use the index, although the index is a good thing, can not be built, before the introduction of how to better use the index to understand the cost of using this thing, it in space and time will be dragged down:
The cost of space
This is obvious, each set up an index for it to build a B+
tree, B+
each node of the tree is a data page, a page is to occupy 16KB
the storage space, a large B+
tree is composed of many data pages, it is a large piece of storage space.
The cost of the time
We have said that each layer of nodes is sorted by the values of the indexed columns in order from small to large, and the records within each page (whether user records or catalog item records) are formed as a one-way linked list in the order of the values of the indexed columns from small to large. When the data in the table to increase, delete, change the operation of the node and record may be corrupted, so the storage engine needs extra time to do some record shift, page splitting, page recycling what the operation to maintain good node and record sorting. If we built a lot of indexes, the tree of each index B+
should be related maintenance operation, this can not hinder performance?
Second, B + Tree index applicable conditions
Below we will be nagging many of the B+
techniques and considerations for maximizing the efficiency of the tree index, but it is important to understand that all of the techniques are derived from your understanding of the B+
tree index, so if you do not guarantee a full understanding of the tree index, B+
then go back and review it again, Otherwise reading the article is a torment to you. First, the B+
tree index is not omnipotent, and not all query statements can use the index we build. Here are a few things we can do with a B+
tree index to query. For the smooth development of the story, we need to first create a table, which is used to store some basic information about people:
The CREATE TABLE person_info ( name VARCHAR) is not null, birthday DATE is not null, Phone_number CHAR (one-by-one) is not null, Country varchar (+) not NULL, KEY idx_name_age_birthday (name, birthday, phone_number));
For this person_info
table we need to note two points:
We did not define a primary key in the table, so the innoDB
storage engine would default to our table by adding a row of row_id
columns as the primary key.
Focus on the index we define idx_name_age_birthday
, which is a federated index of 3 columns. So the records stored at the leaf nodes of the tree corresponding to this index are B+
reserved only name
, the values of birthday
phone_number
these three columns, and the primary key values.
Below we draw the index idx_name_age_birthday
, but since we have mastered InnoDB
the B+
tree index principle, then we in the drawing time in order to make the diagram clearer, so in the record structure omit some unnecessary parts, such as the single-linked list of arrows, record the primary key value of what, only retain name
, birthday
, phone_number
the values of these three columns, so it is long (the attention of the classmate to see, this actually and "high-performance MySQL" Example of the figure, I think this example is very good, so I borrowed a bit):
In order to make it easier for everyone to understand, we deliberately distinguish between the inner node and the leaf node with different colors. and a different color to distinguish between the contents of the node in the directory entry records and leaf nodes in the user records (because not clustered index, so the user record is incomplete, missing country
column values). As you can see, the idx_name_age_birthday
B+
sort of nodes and records in the tree corresponding to this index is as follows:
Sort by name
the values of the columns first.
If name
the values of the columns are the same, they are birthday
sorted by the values of the columns.
If the birthday
values of the columns are also the same, the phone_number
values are sorted by.
This sort method is very, very special, very, huge, very very very important, because as long as the nodes and records are in order, we can use the dichotomy to quickly locate the search. The bottom of the content depends on this diagram, we understand the picture.
2.1 Full Value match
If the columns and index columns in our search criteria are consistent, this is called full-value matching, for example, the following lookup statement:
SELECT * from person_info WHERE name = ' Ashburn ' and birthday = ' 1990-09-27 ' and phone_number = ' 15123983239 ';
The index we build idx_name_age_birthday
contains 3 columns that are shown in this query statement, and the order of the columns in the search criteria is consistent with the order in which the columns are defined. You can imagine this query process:
Because the B+
tree's data pages and records are name
sorted first by column values, you can quickly locate the name
Ashburn
record location where the column's value is.
In the same column of the record is sorted by the value of the name
birthday
column, so in the name
column value is Ashburn
the record can quickly locate the birthday
column value is a ‘1990-09-27‘
record.
If, unfortunately, the values of the columns are the name
birthday
same, the records are sorted by the values of the phone_number
columns, so the three columns in the federated index may be used.
if the order of the columns in the search criteria is inconsistent with the order of the indexed columns, let's say that the values of the birthday
columns are matched first, because the B+
data pages and records in the tree are sorted first by name
column values and cannot be 二分法
used directly Quickly locate records, so you can only scan all of the record pages.
2.2 Match to left column
In fact, in our search statements can also not include all the Union index columns, only the left side of the row, for example, the following query statement:
SELECT * from person_info WHERE name = ' Ashburn ';
Or you can include multiple left columns:
SELECT * from person_info WHERE name = ' Ashburn ' and birthday = ' 1990-09-27 ';
So why should the left column appear in the search criteria to use this B+
tree index? For example, does the statement below use this B+
tree index?
SELECT * from person_info WHERE birthday = ' 1990-09-27 ';
Yes, it does not, because the B+
tree's data pages and records are sorted by the name
column values, you directly based on birthday
the value of the search, the concubine can not do it ~ then if I want to use only birthday
the value of the B+
tree index to find out what to do? This is good, you have to birthday
build a B+
tree index on the line, the syntax to create the index do not have to nag me, if you do not go out left to see other articles.
However, it is important to note that the order of the columns in the search criteria must be the same as the order in which the indexed columns are defined, such as the order in which the indexed columns are defined, and, name
birthday
phone_number
If we have only the and, if not, of the search criteria, name
phone_number
birthday
It can only be used The index of the name
column is birthday
phone_number
not used (because records with the name
same value are sorted first, records with the same value birthday
birthday
phone_number
are sorted by value).
Therefore, when using the index must be used in order to establish the index, so as to maximize the role. There is no skip.
2.3 Matching column prefixes
For indexed columns of string types, it is not necessary to match the value of the column exactly, but it is also possible to match its prefix, because the prefix itself is already well-sequenced. Let's say we want to query the ‘As‘
records that start with the name, so we can write the query statement:
SELECT * from Person_info WHERE name is like ' as% ';
B+
The data pages and records in the tree are name
sorted by column, and only prefixes can be 二分法
quickly located. However, it is important to note that if you only give a suffix or a middle string , such as this:
SELECT * from Person_info WHERE name is like '%as% ';
MySQL
It is not possible 二分法
to quickly locate the record position, because the string ‘As‘
in the middle of the string is not well-sequenced, so it can only be scanned in full table.
2.4 Matching range values
Looking back at our indexed tree, all the records are sorted by the values of the idx_name_age_birthday
B+
indexed columns from small to large, so it's great for us to find the records of the indexed column values in a range. Let's say the following query statement:
SELECT * FROM person_info WHERE name > ‘Asa‘ AND name < ‘Barlow‘;
Since the B+
nodes in the tree and the data pages are name
sorted by column first, the query process above is actually like this:
Locate name
Asa
the record for which the value is.
Locate name
Barlow
the record for which the value is.
Oh, because all the records are connected by Lian Lin (a single linked list between records, a double-linked list between data pages), so the records between them can be easily removed.
Locate the primary key values for these records, and 聚簇索引
then 回表
find the complete record in.
However, when searching for scopes, it is important to note that if you are scoping multiple columns at the same time, you can use the tree index only if you are scoping the leftmost column of the index B+
, for example:
SELECT * FROM person_info WHERE name > ‘Asa‘ AND name < ‘Barlow‘ AND birthday > ‘1980-01-01‘;
Why not? Because the above query can be divided into two parts:
By criteria name > ‘Asa‘ AND name < ‘Barlow‘
, the name
results of the lookup may have multiple name
records with different values,
Records that name
differ from these values continue to be filtered through the birthday > ‘1980-01-01‘
condition.
However, it is important to note that only name
the same value can be used to sort the values of the columns, that is, the result of a birthday
name
range lookup is not sorted by the birthday
column, so it is not possible to continue to find columns in the search criteria birthday
B+
the tree index. To be told to take out a full scan of the records.
2.5 match a column exactly and the range matches another column
For the same federated index, although a range lookup for multiple columns can only be used for the leftmost index column, if the left column is an exact lookup, the right column can be scoped, for example:
SELECT * FROM person_info WHERE name = ‘Ashburn‘ AND birthday > ‘1980-01-01‘ AND birthday < ‘2000-12-31‘ AND phone_number > ‘15100000000‘;
The criteria for this query can be divided into 3 parts:
name = ‘Ashburn‘
, name
you can use the tree index for exact lookup of columns B+
.
birthday > ‘1980-01-01‘ AND birthday < ‘2000-12-31‘
, because the name
columns are exact lookups, the name = ‘Ashburn‘
values of the results obtained by the conditional lookup are the name
same, and they are then birthday
sorted by the values. Therefore, birthday
it is possible to use a tree index for a range lookup on a column at this time B+
.
phone_number > ‘15100000000‘
, the values of the records that are birthday
found through the range birthday
may be different, so this condition can no longer take advantage B+
of the tree index, only the records obtained from the previous query can be traversed.
2.5 for sorting
When we write query statements, we often need to query the records according to some sort of rules, for B+
the case of non-applicable tree index, we can only load the records into memory, and then use some sort algorithm, such as fast sorting, merge sort, Oh, come on. Sort the records in the deposit, and then return the ordered result set back to the client. But if ORDER BY
our index column is used in the clause, it is possible to omit the steps to sort in memory, such as this simple query statement below:
SELECT * from Person_info ORDER by name, birthday, Phone_number LIMIT 10;
The result set of this query needs to be sorted by value first, and if the values of the name
records are the same, they need to be sorted name
birthday
, and if birthday
the values are the same, they need to be phone_number
sorted. You can go back and look at the index we built idx_name_age_birthday
, because the B+
tree index itself is well-sequenced, so it's good to extract the data directly from the index. Simple, huh? Yes, the index is so diao~
However, there is a problem to note that ORDER BY
the order of the columns behind the clauses must also be given in the order of the indexed columns, if given ORDER BY phone_number, birthday, name
in the order, that is not used in the B+
tree index, the reverse order can not use the index reason we said above, this does not repeat. Similarly, ORDER BY name
a ORDER BY name, birthday
partial tree index can be used in the form of a column to the left of this matching index B+
. You can see that the form of the index is really very rigid!!!
2.6 for grouping
Sometimes in order to facilitate some of the information in the tables, the records in the table are grouped according to some columns. For example, the following group query:
SELECT name, birthday, phone_number, COUNT(*) FROM person_info GROUP BY name, birthday, phone_number
This query statement is equivalent to doing a 3-time grouping operation:
The records are name
grouped by value, and all name
records with the same value are divided into a group.
The records in the same grouping of each value are then grouped in the same value, and the name
birthday
birthday
Records of the same value are placed in a small group, so it looks like a lot of small groupings in a large group.
Then the small groups produced in the previous step are grouped phone_number
into smaller groupings, so the overall look is like dividing the records into a large group, dividing them into 大分组
several pieces 小分组
, and then dividing several 小分组
into more 小小分组
.
Then the statistics for those 小小分组
, such as in our query statement is to count each 小小分组
of the number of records contained. If there is no index, this grouping process all need to be implemented in memory, and if there is an index, it happens that this grouping order B+
is consistent with the order of the index column in our tree, and our B+
tree index is sorted by index column, this is not exactly, so you can directly use B+
The tree index is grouped.
It's a good idea to sort by using a tree index, and the B+
order of the grouped columns also needs to be consistent with the order of the indexed columns, or it can be grouped using only the columns to the left of the indexed column.
Iii. How to select an index
Above we take the idx_name_age_birthday
index as an example of the index of the applicable conditions of the detailed nagging, below to see what we should be aware of when indexing or writing query statements should be taken into consideration.
1. Create indexes only for columns for searching, sorting, or grouping
That is, the index is created only for columns that appear in the clause, join WHERE
columns in the JOIN clause, or columns that appear in the ORDER BY
or GROUP BY
clause. The columns that appear in the query list do not need to be indexed:
SELECT birthday, country from person_name WHERE name = ' Ashburn ';
Like in the query list birthday
, country
these two columns do not need to be indexed, we just need to create an index for the columns that appear in the WHERE
clause name
.
2, consider the cardinality of the column
列的基数
Refers to the number of distinct data in a column, say a column contains a value 2, 5, 8, 2, 5, 8, 2, 5, 8
, although there is a 9
record, but the cardinality of the column is 3
. That is, with a certain number of rows, the larger the cardinality of the column, the more dispersed the values in the column, and the smaller the cardinality of the column, the more concentrated the values in the column. This 列的基数
indicator is very important and directly affects whether we can effectively use the index. Assuming that the cardinality of a column is the same as that 1
of all records in that column, it is useless to index that column because all values are the same and cannot be sorted, and no dichotomy is found ~ So the conclusion is that it is best to index columns of those columns with large cardinality. Indexing for a base too small column may not be effective.
3. The type of index column is as small as possible
When we define the table structure, we explicitly specify the type of the column, take the integer type as an example, there are,,, and so on, and so on, TINYINT
MEDIUMINT
INT
BIGINT
They occupy the storage space in order increment, can represent the integer range of course also increment, we say 类型大小
here Refers to the size of the data range represented by the type. If we want to index an integer column, as long as the range of integers allowed, try to make the index columns use a smaller type, for example, if we can use it, do not use it INT
BIGINT
, you can use it MEDIUMINT
instead of INT
~ this is because:
The smaller the data type, the faster the comparison operation at query time (this is the CPU level of the stuff)
The smaller the data type, the less storage space The index occupies, the more records can be dropped within a single data page, and thus reduce I/O
the performance loss of the disk, which means that more data pages can be cached in memory, which speeds up read and write efficiency.
This recommendation is more appropriate for the primary key of the table, because not only is the primary key value stored in the clustered index, but the leaf nodes of all the other two-level indexes store a record's primary key value, which means more storage space is saved if the primary key applies to a smaller data type.
4. Prefix of index string value
We know that a string is actually made up of a number of characters, and if we apply a character set in the MySQL
utf8
to store the string, it takes a byte to encode a character 1~3
. Assuming our string is very long, it takes a lot of storage space to store a string. When we need to index this string column, that means B+
there are two problems in the corresponding tree:
B+
The records in the tree index need to store the full string of the column, and the longer the string, the greater the storage space in the index.
If B+
the indexed column in the tree index stores a long string, it takes more time to make a string comparison.
So the designer of the index proposed a scheme---only the first few characters of the string are indexed. Through the first few characters of the string we have been able to sort the string, the rest can be searched through the search ah, so only in the B+
tree to store the first few characters of the string encoding, saving space, but also reduce the string comparison time, but probably solve the problem of sorting, why not, Let's say we index only the name
first 10 characters of a column in a table statement, so you can write:
The CREATE TABLE person_info ( name VARCHAR) is not null, birthday DATE is not null, Phone_number CHAR (one-by-one) is not null, Country varchar (+) not NULL, KEY Idx_name_age_birthday (Name (ten), Birthday, phone_number));
name(10)
It means that only the B+
encoding of the first character of the record is preserved in the established tree index 10
, and this policy of indexing string-value prefixes is highly encouraged, especially if the string type can store more characters.
5. Use federated indexes as much as possible
If we have more than one column in our search criteria, it's best to create one for these columns instead 联合索引
of one for each column (because each index will maintain a B+
tree), like person_info
the index of our table, idx_name_age_birthday
which is name
birthday
phone_number
, The federated Index of these three columns, so this federated index can be used to search for several combinations of columns below:
name, birthday, phone_number
name, birthday
If we do have a need for other column searches, then build an index for the other columns separately, although there is a cost to creating an index, but it is important to ensure efficient queries.
6. Let the index columns appear separately in the comparison expression
Let's say we have an integer column in the table my_col
, and we have an index for that. The two clauses below WHERE
, while semantically consistent, differ in efficiency:
WHERE my_col * 2 < 4
WHERE my_col < 4/2
The WHERE
column in the 1th clause my_col
does not appear as a separate column, but in the form of my_col * 2
such an expression, where the storage engine loops through all the records and evaluates whether the expression is less than the value 4
, so this is not the case with my_col
columns created B+
the tree index. The 2nd WHERE
clause is listed in the form of a my_col
separate column, which allows the tree index to be used directly B+
.
So the conclusion is that if the index column does not appear as a separate column in the comparison expression, but in the form of an expression or function call, it is not indexed.
8. Primary key Insertion Order
We know that for a InnoDB
table that uses the storage engine, when we do not explicitly create an index, the data in the table is actually stored in 聚簇索引
the leaf node. And the records are stored in the data page, data pages and records are sorted according to the primary key value from small to large, so if we insert the primary key value of the record is increased sequentially, then we each fill a page of data into the next data page to continue to plug, and if we insert the primary key value is suddenly large and small, This is a bit of a hassle, assuming that a data page stores a record that is full, and that it stores the primary key values 1~100
between:
If you insert a record with a primary key value at this point 9
, it will be inserted in the same position as:
But this data page is full, and then inserted into it? We need to split the current page into two pages and move some of the records on this page to the newly created page. What does page splitting and record shifting mean? Means: Performance loss! So if we want to try to avoid this unnecessary loss of performance, it is best to let the inserted record of the primary key value increment, so that the performance loss will not occur. So we suggest: let the primary key have AUTO_INCREMENT
, let the storage engine generate the primary key for the table itself, instead of manually inserting it, let's say we can define the table like this person_info
:
CREATE TABLE person_info ( ID INT UNSIGNED NOT NULL auto_increment, name VARCHAR (+) not NULL, birthday DATE N OT null, phone_number CHAR (one) not NULL, country varchar (+) NOT NULL, PRIMARY key (ID), key idx_name_ Age_birthday (name (ten), Birthday, phone_number));
Our custom primary key column id
has AUTO_INCREMENT
properties, and the storage engine automatically fills in the self-increment primary key value for us when inserting records.
9. Redundant and repetitive indexes
Sometimes some students intentionally or unintentionally create multiple indexes on the same column, for example, to write a table statement:
CREATE TABLE person_info ( ID INT UNSIGNED NOT NULL auto_increment, name VARCHAR (+) not NULL, birthday DATE N OT null, phone_number CHAR (one) not NULL, country varchar (+) NOT NULL, PRIMARY key (ID), key idx_name_ Age_birthday (name (ten), Birthday, Phone_number), KEY idx_name (name (10)));
We know that by idx_name_age_birthday
indexing a column can be name
quickly searched, and then creating a name
column-specific index even if it is an 冗余
index, maintaining the index will only increase the cost of maintenance and will not benefit the search.
In another case, we might re-index a column, for example:
CREATE TABLE repeat_index_demo (
c1 INT PRIMARY KEY,
c2 INT,
UNIQUE uidx_c1 (c1),
INDEX idx_c1 (c1)
);
We see that the primary key, which is c1
defined as a unique index, is also defined as a normal index, but the primary key itself will have a PLA clustered index, so the defined unique index and the normal index are duplicated, which is to be avoided.
10. Overlay Index
Let's say this query:
SELECT * from person_info WHERE name = ' Ashburn ';
person_info
There are idx_name_age_birthday
only 3 index columns in the table, and there are 4 columns in the table, so in order to get the complete user record, the corresponding primary key value is obtained once by the idx_name_age_birthday
index 聚簇索引
回表
. 回表
operation is also a performance loss Ah, so we recommend: It is best to include only the index column in the query list, such as:
SELECT name, birthday, phone_number FROM person_info WHERE name = ‘Ashburn‘;
Because we only query name
the birthday
phone_number
values of these three indexed columns, we idx_name_age_birthday
do not have to find the remaining columns of the record, that is, the values of the columns, after getting the results from the index 聚簇索引
country
. This eliminates the 回表
performance loss of the operation, which we call the query method that only needs to be indexed 索引覆盖
.
Of course, if the business needs to query a column other than the index, it is still to ensure that business needs are paramount. However, we do not encourage the use *
of numbers as a query list, it is best to make the columns we need to query in sequence.
Summarize
Above is just some of the points we need to be aware of when creating and using B+
the tree index, and we will introduce more optimization methods and considerations in the back, so please look forward to it. The contents of this episode are summarized as follows:
B+
The tree index has a price in both space and time, so don't build indexes.
B+
The tree index applies to these situations:
Here are some things to keep in mind when working with indexes:
Create indexes only for columns that are used for searching, sorting, or grouping
To create an index on a column with a large base column
The type of index column is as small as possible
You can index only the prefix of a string value
Indexes can be applied only if the index columns appear alone in the comparison expression
In order to minimize the 聚簇索引
occurrence of page splits and record shifts, it is recommended that the primary key have AUTO_INCREMENT
properties.
Locating and deleting duplicate and redundant indexes in a table
As far as possible for the 覆盖索引
query, to avoid 回表
the performance loss.
Performance analysis of MySQL indexes