Index Optimization for MySQL Performance
MySQL is free and efficient, which makes it a common first choice of database. It offers secure connections, built-in query parsing and SQL optimization, read/write locks (down to row level), transaction isolation, multi-version concurrency control (MVCC) to improve concurrency, complete transaction logging, and storage engines that answer queries efficiently even on tables with around a million records. With InnoDB, data can also be fully recovered after a crash. Even with all these advantages, some optimization still has to be done by people. After reading a book I like to write a summary to consolidate what I learned; please correct me if there are any mistakes.
Optimizing MySQL thoroughly requires deep knowledge. Large companies even maintain their own MySQL kernels. SQL optimization, server tuning, parameter settings, query optimization, master-slave replication, software and hardware upgrades, disaster-recovery backups, and SQL programming cannot be mastered with a smattering of knowledge or in a short time. For a beginner like me, swallowing that much without digesting it is useless. It is better to start from the business at hand, that is, from writing better SQL statements, and the optimization of many SQL statements is closely tied to indexes.
First, the MySQL query process is roughly as follows. The MySQL client connects to the MySQL server through the protocol and sends a query statement. The server first checks the query cache (in versions that have one; it was removed in MySQL 8.0) and, on a hit, returns the result directly. Otherwise the statement is parsed and goes through a series of preprocessing steps, such as checking that it is written correctly. The query optimizer then produces a query plan (deciding, for example, whether to use an index scan, and terminating early if the condition is impossible), and the execution engine runs the plan, calling the storage engine's API to fetch the data, which is returned to the client. How data is stored and retrieved depends on the storage engine. MySQL uses B-tree indexes by default, and as a rule of thumb, however the SQL is written, MySQL usually uses only one index per table in a query (index merge is the exception).
MySQL retrieves data through the storage engine, so indexes naturally depend on the storage engine. Different engines implement indexes differently (MyISAM's full-text index, for example), and even indexes with the same name may be organized differently internally. The most commonly used engine is InnoDB (MariaDB, which is fully compatible with MySQL, uses XtraDB as its default engine, which is similar to InnoDB), so this article is about InnoDB. By implementation, the two index types discussed here are BTREE and HASH (in InnoDB, hashing exists only as the internal adaptive hash index; explicit HASH indexes are a feature of the MEMORY engine). Unless stated otherwise, "index" here means the B-tree index. InnoDB's B-tree index is actually implemented with a B+ tree, but MySQL always prints BTREE when showing a table's indexes, so it is referred to as a B-tree index. As for the difference between a B-tree and a B+ tree, I did not learn data structures well and still need to catch up on that.
Using a B-tree index means that all index entries are stored in order (ascending). The abstract structure of a B-tree index in MySQL is as follows (see High Performance MySQL).
In the structure, the keys in each layer of nodes are arranged from left to right in ascending order, key1 < key2 < ... < keyN. A search value smaller than key1, or in [key1, key2), or in any other interval follows the corresponding pointer down, so descending toward the leaf nodes is a search over ranges. Nodes at the same layer can also be reached from one another directly, because they are linked by pointers (MyISAM's BTREE index does not have these links). Every lookup is a range search: if the value is present it is found, otherwise the result is empty. The index speeds up access because the whole table need not be scanned (though not always). The search value is typically compared with the values in a node by binary search, which is about as fast as it gets for sorted values.
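The range descent described above can be sketched in a few lines of Python. This is only a toy model, not InnoDB's implementation: the dictionary-based nodes and the names `child_slot` and `search` are made up for illustration. Slot 0 covers values below key1, and slot i covers [key_i, key_i+1).

```python
from bisect import bisect_right

def child_slot(keys, value):
    """Pick the child pointer to follow in a node whose keys are sorted.

    Slot 0 holds values < keys[0]; slot i holds values in [keys[i-1], keys[i]).
    """
    return bisect_right(keys, value)

def search(node, value):
    """Descend from the root to a leaf, then binary-search the leaf."""
    while not node["leaf"]:
        node = node["children"][child_slot(node["keys"], value)]
    i = bisect_right(node["keys"], value)
    return i > 0 and node["keys"][i - 1] == value

# A hypothetical two-level tree: one root node over three leaf nodes.
leaf1 = {"leaf": True, "keys": [3, 7]}
leaf2 = {"leaf": True, "keys": [10, 15]}
leaf3 = {"leaf": True, "keys": [20, 42]}
root = {"leaf": False, "keys": [10, 20], "children": [leaf1, leaf2, leaf3]}

print(search(root, 15), search(root, 8))
```

Each step touches one node and does a binary search inside it, which is why the lookup cost grows with the height of the tree rather than with the number of rows.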
Where val points depends on the engine. For InnoDB it points to the table data, because an InnoDB table's data file is itself an index file. This is a significant difference from MyISAM, whose indexes point to the address of the table data (val is something like 0x7DFF...). For example, a primary key index in InnoDB might look like this.
The val of an InnoDB index node points directly to the table data; that is, its leaf nodes are the table data. They are stored together, and row records are not kept separately elsewhere. Reaching the leaf node means reaching the data.
In the abstract B-tree structure, index nodes are placed on pages. There are two issues to note:
1. Leaf pages and pages in general, that is, what a "page" actually is. I added a node annotation to the figure: the smallest page here can be thought of as a single node. A computer's storage is allocated block by block, and normally the next block is used only after the current one is used up. If the current block has only 5 KB left but we need 8 KB, the space has to be allocated on a new block, and later allocations continue after those 8 KB as long as the block has room. The leftover 5 KB in the previous block is usually called a fragment. The more the computer is used, the more such scattered free space accumulates; this is fragmentation. In MySQL, a page can be understood as such a block of storage: the tree nodes of the index are stored in pages, and each page (called a logical page) has a fixed size, 16 KB by default in InnoDB. When a page is used up and further inserts into the table generate new index nodes, those nodes are stored on a new page, each new node placed after the previous one on that page.
2. Page splitting. When a page fills up, a new page is created to continue; this behavior is called page splitting. When to open a new page is governed by a split factor: once a page reaches 15/16 of its capacity, subsequent entries go to the next page. Page splitting can greatly affect the cost of maintaining an index, which is why it is usually recommended to define a meaningless auto-increment integer as the index (primary key); it is friendly to index storage.
With a non-auto-increment or non-integer index, such as arbitrary integers or MD5-like strings, the next value to insert is not necessarily larger than all the values on the current page; it may even be smaller. InnoDB then has to look back through earlier pages to find the right position and cannot simply append the new row after the previous one. After the entry is inserted, that page may reach the split-factor threshold and have to be split, which can in turn lead to further splitting and reordering of the index pages after it. With a small data set this may be fine; with a large one it can waste a lot of time and create many fragments.
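To make the cost difference concrete, here is a deliberately simplified Python model of page splitting. It is a sketch under stated assumptions, not InnoDB's algorithm: pages hold at most 4 keys, a full page is split in half, and a key larger than everything stored so far simply opens a new page. The function name `insert_all` and the `capacity` parameter are hypothetical.

```python
import bisect
import random

def insert_all(keys, capacity=4):
    """Insert keys one by one into ordered pages; return (pages, keys_moved)."""
    pages = [[]]      # each page is a sorted list; pages are ordered among themselves
    moved = 0         # keys relocated by page splits
    for k in keys:
        # Find the first page whose largest key is >= k, else use the last page.
        idx = len(pages) - 1
        for i, p in enumerate(pages):
            if p and k <= p[-1]:
                idx = i
                break
        page = pages[idx]
        if len(page) < capacity:
            bisect.insort(page, k)
            continue
        if idx == len(pages) - 1 and k > page[-1]:
            pages.append([k])          # sequential insert: open a new page, move nothing
            continue
        # The target page is full and k belongs inside it: split it in half.
        mid = capacity // 2
        right = page[mid:]
        del page[mid:]
        moved += len(right)
        pages.insert(idx + 1, right)
        bisect.insort(page if k <= page[-1] else right, k)
    return pages, moved

sequential = list(range(100))
shuffled = sequential[:]
random.Random(0).shuffle(shuffled)

for label, keys in (("auto-increment", sequential), ("random", shuffled)):
    pages, moved = insert_all(keys)
    print(label, "pages:", len(pages), "keys moved:", moved)
```

In this model ascending keys never trigger a split, while out-of-order keys keep relocating half a page at a time and leave pages partly empty, which is the fragmentation the text describes.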
A primary key is always unique and not null, and InnoDB automatically creates an index for it. Indexes created on non-primary-key fields are called secondary indexes. They are ordered too, but each entry carries only a data field holding the primary key value of the record, not a pointer to the row. A lookup through a secondary index first finds the entry for the column value and then uses the primary key value stored on that entry to fetch the row, so each row is located with two searches. The advantage of storing the primary key value rather than an address is that nothing needs to be updated when a page splits: the row's address changes, but its primary key value does not. For example, the index on the name field looks like this:
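The two-step lookup can be sketched with ordinary Python containers. This is only an analogy: the dictionary `clustered` stands in for the primary key index whose leaves hold the rows, the sorted list `name_index` stands in for the secondary index, and all names and sample rows are invented.

```python
from bisect import bisect_left

# Clustered index: primary key -> full row (the leaf holds the row itself).
clustered = {
    1: {"id": 1, "name": "july", "age": 23},
    2: {"id": 2, "name": "mary", "age": 30},
    3: {"id": 3, "name": "july", "age": 41},
}

# Secondary index on name: ordered (name, primary key) pairs, no row address.
name_index = sorted((row["name"], pk) for pk, row in clustered.items())

def find_by_name(name):
    """Return all rows with the given name using the secondary index."""
    rows = []
    i = bisect_left(name_index, (name,))      # search 1: locate the name entries
    while i < len(name_index) and name_index[i][0] == name:
        pk = name_index[i][1]
        rows.append(clustered[pk])            # search 2: fetch the row by primary key
        i += 1
    return rows

print(find_by_name("july"))
```

Because the secondary entries store primary key values, moving a row to another page would not require touching `name_index` at all.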
An index on one column is called a single-column index, and an index on multiple columns is called a composite index. Because a BTREE index is ordered, it is well suited to range queries. With composite indexes, however, pay attention to the number of columns, their order, and the effect that a range condition on an earlier column has on the columns after it.
For example, take this table:
create table staffs (
  id int primary key auto_increment,
  name varchar(24) not null default '' comment 'name',
  age int not null default 0 comment 'age',
  pos varchar(20) not null default '' comment 'position',
  add_time timestamp not null default current_timestamp comment 'onboarding time'
) charset utf8 comment 'employee record table';
Add a composite index on three columns:
alter table staffs add index idx_nap(name, age, pos);
With a B-tree index, all or part of the index can be used in the following situations (use explain to check the usage):
1. Full value matching
For example, select * from staffs where name = 'july' and age = 23 and pos = 'dev'. The key field of the explain output shows that the idx_nap index is used.
2. Match the leftmost columns
A query on a composite index need not match all of its columns, but it must match the leftmost ones.
For example, select * from staffs where name = 'july' and age = 23. The key field shows that the index is used. Note that the key_len field (the index length used by this statement) is smaller than in the previous example, meaning that not all index columns are used (this length can usually be used to estimate which index columns are used; more on that later). In fact, only the name and age columns are used.
Try again with select * from staffs where name = 'july'. This also uses the index; key_len is smaller still, and only the name column of the index is used.
3. Match a column prefix
That is, match the beginning of an indexed column's value, mainly for fuzzy matching. For example, select * from staffs where name like 'J%'. The key field of the explain output shows that the index is used. However, a MySQL B-tree index cannot serve a fuzzy match that is not anchored at the start of the column, such as select * from staffs where name like '%y' or like '%u%'. This is said to be a limitation of the underlying storage engine API.
4. Match a range
For example, select * from staffs where name > 'Mary'. In my tests > used the index but >= did not, at least on a string column (tested on MySQL 5.5.12), while on a time column (timestamp) both worked. Other operators were not tested.
Out of curiosity, I tested an index on an integer field (idx_cn(count, name), where count is an integer). Integer columns turned out to be far less restricted: all of the following statements could use the index, even the ones written like fuzzy matches:
select * from indexTest1 where count > '10';
select * from indexTest1 where count >= '10';
select * from indexTest1 where count > '10%';
select * from indexTest1 where count >= '10%';
select * from indexTest1 where count > '%10%';
select * from indexTest1 where count >= '%10%';
5. Match one column exactly and a range on the adjacent column to its right
That is, the earlier column is a fixed value and the next column is a range. This uses the name and age columns of the index (inferred from key_len).
For example, select * from staffs where name = 'july' and age > 25
6. Index-only queries
For example, the staffs table has an index on (name, age, pos), and the earlier examples all read every column. If a query selects only columns contained in the index it uses, it can be answered from the index alone, for example:
select name, age, pos from staffs where name = 'July' and age = 25 and pos = 'dev';
select name, age from staffs where name = 'July' and age > 25;
The first statement uses all index columns; the second uses only the first two, and its select list stays within the columns of the index. An index used this way is called a covering index in MySQL: the index contains (covers) all the queried fields. To confirm an index-only query, check the Extra column at the end of the explain output: Using index means that a covering index is used, and Using where means that where filtering is also applied.
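As a rough mental check, whether an index covers a query comes down to a subset test. The Python sketch below assumes, as described earlier for InnoDB secondary indexes, that every index entry also carries the primary key value; the function name `is_covering` is hypothetical and the real decision is made by the optimizer.

```python
def is_covering(index_cols, pk_cols, referenced_cols):
    """True if every column the query touches is available in the index entry.

    referenced_cols should include columns from the select list and the where clause.
    """
    available = set(index_cols) | set(pk_cols)
    return set(referenced_cols) <= available

idx_nap = ["name", "age", "pos"]
print(is_covering(idx_nap, ["id"], ["name", "age"]))           # select name, age ...
print(is_covering(idx_nap, ["id"], ["name", "age", "add_time"]))  # add_time is not in the index
```

When the test fails, MySQL has to go back to the table rows, and Using index no longer appears in the Extra column.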
7. Prefix index
A prefix index differs from matching a column prefix (a fuzzy match of the form like 'J%') and from matching the leftmost columns (querying the left columns of the index in order): it indexes only the first part of a column's value. One well-known difference between InnoDB and MyISAM is that MyISAM supports full-text indexes while InnoDB did not before MySQL 5.6. Even for very long strings or binary data such as text and blob, MyISAM takes the first few characters as the index, and an InnoDB prefix index is similar. Indexing a long string column in full greatly increases storage, and the index also has to be maintained, so for such columns you can take the first part (the prefix) to stand for the whole column. The question is how to make sure the prefix represents, or roughly represents, the column. For this MySQL has the concept of index selectivity: the ratio of the number of distinct values in the index (also called the cardinality) to the total number of records in the table (#T). For example, in a list of 4 values of which 3 are distinct, the selectivity is 3/4. The selectivity therefore lies in [1/#T, 1]; the larger it is, the more distinct values the column has and the better suited it is to a prefix index. A unique index (unique key) has a selectivity of 1.
For example, take a column a varchar(255) that needs a prefix index. Start testing with a length of, say, 7 and increase it one at a time; once the selectivity stops changing, that prefix can represent the whole column. If that length makes the index too large, make a trade-off. If the selectivity is low to begin with, though, a prefix index will be of little use.
select count(distinct left(a, 7))/count(*) as non_repeat from tab;
Once the prefix length is chosen, say 9, specify it when adding the index:
alter table tab add index idx_pn (name(9)); -- single-column prefix index
alter table tab add index idx_cpn (count, name(9)); -- composite prefix index
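The selectivity test can also be tried outside the database. The Python sketch below mirrors the count(distinct left(a, n)) / count(*) query on an in-memory list; the helper names `selectivity` and `shortest_good_prefix`, the `tolerance` parameter, and the sample values are all made up for illustration.

```python
def selectivity(values, prefix_len=None):
    """Distinct values divided by total values, optionally on a prefix only."""
    if not values:
        raise ValueError("selectivity is undefined for an empty column")
    keys = values if prefix_len is None else [v[:prefix_len] for v in values]
    return len(set(keys)) / len(values)

def shortest_good_prefix(values, tolerance=0.0):
    """Smallest prefix length whose selectivity is within tolerance of the full column."""
    full = selectivity(values)
    for n in range(1, max(len(v) for v in values) + 1):
        if full - selectivity(values, n) <= tolerance:
            return n

names = ["alibaba", "alipay", "aliyun", "amazon"]
for n in (1, 3, 4):
    print(n, selectivity(names, n))
print("chosen prefix length:", shortest_good_prefix(names))
```

A non-zero `tolerance` expresses the trade-off mentioned above: accept a slightly lower selectivity in exchange for a shorter, smaller index.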
The above are the common ways in which indexes are used. In some cases an index cannot be used, or only part of it is. Some counter-examples, given key (a, b, c):
1. Skipping columns. where a = 1 and c = 3 can use at most index column a. where b = 2 and c = 3 cannot use any column; matching must start from the leftmost column.
2. A range in the middle. where a = 1 and b > 2 and c = 3 uses at most the two index columns a and b;
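3. Reversed order in the where clause. Note that this one is not actually a problem: the optimizer reorders conditions joined by and, so where c = 3 and b = 2 and a = 1 can still use all three columns. What matters is which columns are constrained, not the order in which the conditions are written;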
4. Expressions on an indexed column, for example where substr(a, 1, 3) = 'hhh' or where a = a + 1. Expressions on indexed columns are a big taboo, because MySQL cannot use the index for them. When the data volume is not large enough to hurt speed, you can fetch the rows first and filter them in the program. For example, fetch all order records first and then pick out the orders starting with 'cp1001' in the program instead of filtering in SQL;
5. For fuzzy matches, try to write where a like 'J%', with the literal at the left. Otherwise the index on column a may be used poorly or not at all. This also depends on the data type, so test it.
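The leftmost-prefix rule behind cases 1 and 2 can be written down as a small Python function. It is a simplification for intuition only (the real decision belongs to the optimizer, and explain remains the authority); the name `usable_columns` and its parameters are hypothetical.

```python
def usable_columns(index_cols, equals, ranges=()):
    """Return the leading index columns a where clause can use.

    equals: columns compared with =; ranges: columns compared with >, <, like 'x%', etc.
    Matching walks the index left to right, continues past equalities,
    and stops after the first range column or at the first unconstrained column.
    """
    used = []
    for col in index_cols:
        if col in equals:
            used.append(col)
        elif col in ranges:
            used.append(col)
            break
        else:
            break
    return used

key = ["a", "b", "c"]
print(usable_columns(key, {"a", "c"}))          # a = 1 and c = 3
print(usable_columns(key, {"b", "c"}))          # b = 2 and c = 3
print(usable_columns(key, {"a", "c"}, {"b"}))   # a = 1 and b > 2 and c = 3
```

Comparing the returned columns with the key_len reported by explain is a quick way to check whether a query uses as much of a composite index as expected.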
Impact of sorting on Indexes
order by is used frequently, and sorting also follows the leftmost-prefix rule. Given key (a, b), the following statements can use the index (the test results were surprising):
select * from tab where a > 1 order by b;
select * from tab where a > 1 and b > '2015-12-01 00:00:00' order by b;
select * from tab order by a, b;
The index cannot be used in the following cases:
1. Sorting by a non-leftmost column: select * from tab order by b;
2. select * from tab where b > '2017-12-01 00:00:00' order by;
3. Sorting by multiple columns in inconsistent directions: select * from tab order by a asc, b desc.
Clustered index and covering index
As mentioned above, there are two types of MySQL index, B-tree and HASH. When the columns being queried are all contained in the index being used, MySQL scans the index and returns the data directly, which greatly improves efficiency because the original table does not have to be read or filtered. An index used this way is called a covering index. For example, with key (a, b), the query select a, b from tab where a = 1 and b > 2 is covered. The underlying reason is that the BTREE index stores the values of the indexed columns themselves.
A clustered index is not a separate type of index. As described earlier, the BTREE index places data in the index, namely in its leaf pages: the primary key is stored together with the table data. Because there is only one copy of the table data and the key must be stored next to each row, a table can have only one clustered index. In MySQL (InnoDB) this is the primary key by default.
A clustered index organizes the table data together (see the earlier primary key index diagram). Inserts depend heavily on primary key order, preferably continuous auto-increment; otherwise frequent page splits may occur and move a lot of data.
Now for the HASH index. Simply put, much like a basic hash table from a data structures course, a HASH index in MySQL computes a hash value for the index column (in the spirit of md5, sha1, or crc32), stores the hash values in order (ascending by default), and records with each hash value a pointer to the corresponding row in the table. Of course, this is just a simple simulation diagram.
For example, create a hash index on the name column. The hash values are stored in order, but that order has nothing to do with the order of the records in the table, as the address pointers show. A hash index can also be created on two or more columns, in which case the hash is computed over the combined column values. It does not store table data. A lookup first computes the hash of the search value, finds it among the hash values in the index, then compares the actual column data for equality (other column conditions may be involved), and finally returns the data. Hashes can of course conflict, that is, collide. Unless there are many collisions a hash index is generally very efficient; with many collisions its maintenance cost is high, so hash indexes are usually used on columns with high selectivity. The structure of the hash index determines its characteristics:
1. A hash index is ordered only by hash value, which has no relationship to the table data, so it cannot be used for order by;
2. A hash index computes the hash over all of its columns, so a query must include every one of them. For a hash index on (a, b), the query must be where a = 1 and b = 2; a condition on only one of them will not work;
3. A hash index can only be used for equality comparisons, = or IN. Range queries cannot use it, because hash values say nothing about the order of the column values and the index does not store the table data;
4. Once a collision occurs, the hash index must traverse all entries with the same hash value and compare the rows they point to until every row that meets the condition is found.
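These characteristics fall out naturally from a small Python model. It is a sketch, not MySQL's implementation: crc32 is just one of the hash functions the text mentions, buckets are kept in a dictionary rather than an ordered structure, and the class name `HashIndex` and the sample rows are invented.

```python
import zlib
from collections import defaultdict

class HashIndex:
    """Toy multi-column hash index: hash value -> positions of candidate rows."""

    def __init__(self, rows, cols):
        self.rows = rows
        self.cols = list(cols)
        self.buckets = defaultdict(list)
        for pos, row in enumerate(rows):
            key = tuple(row[c] for c in self.cols)
            self.buckets[self._hash(key)].append(pos)

    @staticmethod
    def _hash(key):
        # One hash over ALL indexed columns, as in characteristic 2.
        return zlib.crc32(repr(key).encode("utf-8"))

    def lookup(self, **conds):
        # Only equality on every indexed column can be answered (characteristics 2 and 3).
        if set(conds) != set(self.cols):
            raise ValueError("need an equality condition on every indexed column")
        key = tuple(conds[c] for c in self.cols)
        candidates = self.buckets.get(self._hash(key), [])
        # Re-check the real values to weed out hash collisions (characteristic 4).
        return [self.rows[p] for p in candidates
                if all(self.rows[p][c] == conds[c] for c in self.cols)]

rows = [{"a": 1, "b": 2, "v": "x"}, {"a": 1, "b": 3, "v": "y"}, {"a": 1, "b": 2, "v": "z"}]
idx = HashIndex(rows, ["a", "b"])
print(idx.lookup(a=1, b=2))
```

Calling `idx.lookup(a=1)` raises an error in this model, which corresponds to the rule that a condition on only part of the indexed columns cannot use the hash index.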
As mentioned above, the key_len field of explain gives a rough estimate of which index columns are used. The length of an index column is directly related to its data type. Generally, int is 4 bytes, bigint is 8 bytes, and char is 1 byte per character. Since a character set is specified when the table is created, such as utf8, the length also depends on the chosen character set: under utf8 a character takes 3 bytes. But is key_len then simply the number of bytes of the data types of the index columns used? In practice there is always a difference. The test method is rather mechanical (the following is based on MySQL 5.5.2).
Create the table and add an index on an int column:
-- test table
create table keyLenTest1 (id int primary key auto_increment, typeKey int default 0, add_time timestamp not null default current_timestamp) charset utf8;
-- add index
alter table keyLenTest1 add index idx_k (typeKey);
The key_len of the int index is 5: the 4 bytes of the type plus 1. The extra byte is the NULL flag, because typeKey allows NULL.
-- change to char type, 1 character
alter table keyLenTest1 modify typeKey char(1);
-- change to char type, 2 characters
alter table keyLenTest1 modify typeKey char(2);
For the char type, key_len starts at 4 bytes (3 + 1) and then grows by 3 bytes per character.
-- change to varchar type, 1 character
alter table keyLenTest1 modify typeKey varchar(1);
-- change to varchar type, 2 characters
alter table keyLenTest1 modify typeKey varchar(2);
For the varchar type, key_len is 6 for one character (3 bytes for the character, 2 bytes for the length, and 1 byte for the NULL flag) and then grows by 3 bytes per character.
So if a statement uses int, char, or varchar index columns, it should now be clear how key_len is computed and which index columns are used.
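The observations above can be folded into a small calculator. This Python sketch only encodes the rules seen in the tests (type size, 3 bytes per utf8 character, 2 length bytes for varchar, 1 byte if the column allows NULL); the function name `key_len` and its parameters are made up, and other types and character sets are not handled.

```python
FIXED_BYTES = {"int": 4, "bigint": 8}

def key_len(col_type, length=0, nullable=True, bytes_per_char=3):
    """Estimate the key_len contribution of one index column."""
    if col_type in FIXED_BYTES:
        size = FIXED_BYTES[col_type]
    elif col_type == "char":
        size = length * bytes_per_char
    elif col_type == "varchar":
        size = length * bytes_per_char + 2   # 2 bytes store the actual length
    else:
        raise ValueError("type not covered by this sketch: " + col_type)
    return size + (1 if nullable else 0)     # 1 byte NULL flag for nullable columns

# The cases tested above (typeKey is nullable):
print(key_len("int"), key_len("char", 1), key_len("char", 2),
      key_len("varchar", 1), key_len("varchar", 2))

# idx_nap(name, age, pos) on staffs, where all three columns are not null:
parts = [key_len("varchar", 24, nullable=False),
         key_len("int", nullable=False),
         key_len("varchar", 20, nullable=False)]
print(parts, sum(parts))
```

Under these rules, a key_len equal to only the first part suggests that just name is used, the first two parts that name and age are used, and the full sum that all three columns of idx_nap are used.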
For more details, including the meaning of each explain field, index internals, show profiles, slow query logs, and so on (not covered here), I recommend reading High Performance MySQL; after all, what I have written is too superficial.