This article uses a case to see how the MySQL optimizer selects the index and JOIN sequence. For Table Structure and data preparation, refer to the test environment at the end of this article. Here we will mainly introduce the main execution process of the MySQL optimizer, rather than the various components of an optimizer (this is another topic ). We know that the MySQL optimizer has only two degrees of freedom: sequential selection; single
This article uses a case to see how the MySQL optimizer selects the index and JOIN sequence. For Table Structure and data preparation, see "test environment" at the end of this article ". Here we will mainly introduce the main execution process of the MySQL optimizer, rather than the various components of an optimizer (this is another topic ). We know that the MySQL optimizer has only two degrees of freedom: sequential selection; single
This article uses a case to see how the MySQL optimizer selects the index and JOIN sequence. For Table Structure and data preparation, see "test environment" at the end of this article ". Here we will mainly introduce the main execution process of the MySQL optimizer, rather than the various components of an optimizer (this is another topic ).
We know that the MySQL optimizer has only two degrees of freedom: sequential selection and single-Table access. Here we will detail the following SQL statements to see how the MySQL optimizer makes each step of selection.
explain select * from employee as A,department as B where A.LastName = 'zhou' and B.DepartmentID = A.DepartmentID and B.DepartmentName = 'TBX';
1. Possible options
Here we can see that the JOIN order can be A | B or B | A. There are also multiple access methods for A single table. For Table A, you can select: full table scan and index 'ind _ l_d' (. lastName = 'zhou') or 'ind _ did' (B. required mentid =. departmentID ). B Also has three options: full table scan, index IND_D, and IND_DN.
2. MySQL optimizer overview 2.1
The MySQL optimizer mainly includes the following parts: Query Rewrite (including Outer Join conversion), const table detection, range analysis, JOIN optimization (sequence and access method selection), and plan refinement. This case starts with range analysis.
2.2 range analysis
This section includes all Range and index merge cost evaluations (refer to 1 for Reference 2 ). Here, the isoexpression is also a range, so the cost will be evaluated here, and found records will be calculated (representing the corresponding isoexpression, and probably how many records will be selected ).
In this case, range analysis analyzes the conditions of table A. LastName = 'zhou' and Table B. DepartmentName = 'tbx' respectively. Where:
Table A. LastName = 'zhou' found records: 51 Table B. DepartmentName = 'tbx' focords: 1
These two conditions are not range, but the calculated values are still stored and used for later ref Access Method Evaluation. The value here is returned Based on the records_in_range interface, and InnoDB performs an index page sampling every time it calls this function, which is a performance-consuming operation, many other relational databases use histogram statistics to avoid this operation (we believe MariaDB will also implement histogram statistics in subsequent versions ).
2.3 Selection of sequence and access mode: exhaustive
MySQL can find the optimal execution sequence and access method by enumerating all left-deep trees (or all left-deep trees are the search space of the entire MySQL Optimizer.
2.3.1 sorting
The optimizer first sorts all tables according to the found records, and places the first few records. Therefore, the order here is B and.
2.3.2 greedy search
When the number of tables is small (less than search_depth, the default value is 63), convert it into a exhaustive search, the optimizer will enumerate all left-deep trees to find the optimal execution plan. In addition, the optimizer uses the prune_level parameter (which is enabled by default) to reduce the huge effort consumption caused by large search space ", you can refer to the complexity of JOIN sequence selection. However, at least three or more tables are associated to be "lazy", so this case is not applicable.
2.3.3 poor performance
The first JOIN table can be A or B. If A is selected for the first table, B can be selected for the second table. If B is selected for the first table, you can select A for the second table;
Because of the preceding sorting, The found records of Table B is less, so when the JOIN order is poor, select B for the first table (this is exquisite ).
(*) Select the first JOIN table as B (**) to determine the access mode of Table B. Because table B is the first table, the index IND_D (B. required mentid =. departmentID), but can only use IND_DN (B. departmentName = 'tbx') cost of using the IND_DN index: 1.2; IO cost is 1. Whether to use full table scan: here we will compare the I/O cost of using indexes and the I/O cost of full table scan. The former is 1 and the latter is 2. Therefore, the full table scan is ignored, the access method ref of Table B is used. The index IND_D (**) is used to select the second JOIN table from the remaining table. The remaining table here is (**) add Table A to JOIN and determine the index that can be used for access: 'ind _ L_D '(. lastName = 'zhou') or 'ind _ did' (B. required mentid =. calculate the cost of using index IND_L_D and IND_DID in sequence: (***) IND_L_D. lastName = 'zhou' indicates A in the range analysis stage. the record corresponding to LastName = 'zhou' is about 51. Therefore, the IO cost is calculated as follows: 51; When ref is used for IO cost calculation, it will be corrected as worst_seek (reference). After the IO cost is changed to: 15, the total cost is: 25.2 (***) IND_DID B. required mentid =. departmentID this is a cost that can be calculated only when you know the results of the preceding table. Therefore, range analysis cannot be analyzed. Here, we can see that the preceding table is B, and found_record is 1, so. departmentID only needs to correspond to one record. Because the specific value is unknown and there is no histogram, it can only be calculated based on index statistics: column A of index IND_DID. the Cardinality of DepartmentID is 1349, and the number of records in the full table is 1349. Therefore, each value corresponds to one record, while Table B has only one record, therefore, the found_record here is calculated as 1*1 = 1, so the IO cost is: 1, the total cost is 1.2 (***) IND_L_D cost is 25.2, And the IND_DID cost is 1.2, therefore, select the latter as the access method of the current table (**) to determine that A uses the index IND_DID, And the access method is ref (**) JOIN order B | A. The total cost is: 1.2 + 1.2 = 2.4 (*) Select the first JOIN table as A (**) to determine the access method of Table A because table A is the first table, therefore, the index 'ind _ did' (B. required mentid =. then you can only use the index 'ind _ l_d' (. lastName = 'zhou') calculate the cost of using the IND_L_D index. The total cost is 25.2. refer to the previous calculation. (**) the cost of accessing Table A is already 25.2, this is 2.4 higher than the previous optimal cost. Therefore, this search is complete.
Simplify the process as follows:
(*) Select B (**) as the first JOIN table to determine the access mode of Table B (**). Select the table for the second JOIN from the left table, the remaining table here is: A (**) adds Table A to JOIN and determines its access method (***) IND_L_D. lastName = 'zhou' (***) IND_DID B. required mentid =. the cost of using mentid (***) IND_L_D is 25.2; the cost of IND_DID is 1.2, so select the latter as the access method of the current table (**) to determine that A uses the index IND_DID, the access method is ref (**) JOIN order B | A. The total cost is: 1.2 + 1.2 = 2.4 (*). Select the first JOIN table as (**) determine the access method (**) of Table A. The cost of accessing Table A is already 25.2, Which is 2.4 higher than the previous optimal cost. Ignore this order.
So far, the MySQL optimizer has determined the optimal JOIN sequence and access method for all tables.
3. Test Environment
MySQL: 5.1.48-debug-log innodb plugin 1.0.9CREATE TABLE `department` ( `DepartmentID` int(11) DEFAULT NULL, `DepartmentName` varchar(20) DEFAULT NULL, KEY `IND_D` (`DepartmentID`), KEY `IND_DN` (`DepartmentName`)) ENGINE=InnoDB DEFAULT CHARSET=gbk;CREATE TABLE `employee` ( `LastName` varchar(20) DEFAULT NULL, `DepartmentID` int(11) DEFAULT NULL, KEY `IND_L_D` (`LastName`), KEY `IND_DID` (`DepartmentID`)) ENGINE=InnoDB DEFAULT CHARSET=gbk;for i in `seq 1 1000` ; do mysql -vvv -uroot test -e 'insert into department values (600000*rand(),repeat(char(65+rand()*58),rand()*20))'; donefor i in `seq 1 1000` ; do mysql -vvv -uroot test -e 'insert into employee values (repeat(char(65+rand()*58),rand()*20),600000*rand())'; donefor i in `seq 1 50` ; do mysql -vvv -uroot test -e 'insert into employee values ("zhou",27760)'; donefor i in `seq 1 200` ; do mysql -vvv -uroot test -e 'insert into employee values (repeat(char(65+rand()*58),rand()*20),27760)'; donefor i in `seq 1 1` ; do mysql -vvv -uroot test -e 'insert into department values (27760,"TBX")'; doneshow index from employee;+----------+------------+----------+--------------+--------------+-----------+-------------+----------+--------+------+------------+---------+| Table | Non_unique | Key_name | Seq_in_index | Column_name | Collation | Cardinality | Sub_part | Packed | Null | Index_type | Comment |+----------+------------+----------+--------------+--------------+-----------+-------------+----------+--------+------+------------+---------+| employee | 1 | IND_L_D | 1 | LastName | A | 1349 | NULL | NULL | YES | BTREE | || employee | 1 | IND_DID | 1 | DepartmentID | A | 1349 | NULL | NULL | YES | BTREE | |+----------+------------+----------+--------------+--------------+-----------+-------------+----------+--------+------+------------+---------+show index from department;+------------+------------+----------+--------------+----------------+-----------+-------------+----------+--------+------+------------+---------+| Table | Non_unique | Key_name | Seq_in_index | Column_name | Collation | Cardinality | Sub_part | Packed | Null | Index_type | Comment |+------------+------------+----------+--------------+----------------+-----------+-------------+----------+--------+------+------------+---------+| department | 1 | IND_D | 1 | DepartmentID | A | 1001 | NULL | NULL | YES | BTREE | || department | 1 | IND_DN | 1 | DepartmentName | A | 1001 | NULL | NULL | YES | BTREE | |+------------+------------+----------+--------------+----------------+-----------+-------------+----------+--------+------+------------+---------+
4. Construct a Bad case
Because MySQL uses index statistics for Cost Estimation in association conditions, it is easy to make incorrect judgments when data distribution is uneven. Let's simply construct the following case:
The table and index structure remain unchanged, and the data is constructed as follows:
for i in `seq 1 10000` ; do mysql -uroot test -e 'insert into department values (600000*rand(),repeat(char(65+rand()*58),rand()*20))'; donefor i in `seq 1 10000` ; do mysql -uroot test -e 'insert into employee values (repeat(char(65+rand()*58),rand()*20),600000*rand())'; donefor i in `seq 1 1` ; do mysql -uroot test -e 'insert into employee values ("zhou",27760)'; donefor i in `seq 1 10` ; do mysql -uroot test -e 'insert into department values (27760,"TBX")'; donefor i in `seq 1 1000` ; do mysql -uroot test -e 'insert into department values (27760,repeat(char(65+rand()*58),rand()*20))';done
explain select * from employee as A,department as B where A.LastName = 'zhou' and B.DepartmentID = A.DepartmentID and B.DepartmentName = 'TBX';+----+-------------+-------+------+-----------------+---------+---------+---------------------+------+-------------+| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |+----+-------------+-------+------+-----------------+---------+---------+---------------------+------+-------------+| 1 | SIMPLE | A | ref | IND_L_D,IND_DID | IND_L_D | 43 | const | 1 | Using where || 1 | SIMPLE | B | ref | IND_D,IND_DN | IND_D | 5 | test.A.DepartmentID | 1 | Using where |+----+-------------+-------+------+-----------------+---------+---------+---------------------+------+-------------+
As you can see, the MySQL Execution Plan uses the index IND_D for the table department. Therefore, Table A hits A record (zhou, 27760). According to B. when mentid = 27760, 1010 records are returned, and then filtered Based on the condition DepartmentName = 'tbx.
Here we can see that if index IND_DN is selected for table B, the effect is better, because DepartmentName = 'tbx' only returns 10 records, and then filters out the results based on Condition A. Dimension mentid = B. Dimension mentid.
In this case, because the data volume is small, the performance is not much different, but if the data in the production environment is tens of millions or hundreds of millions of data, the performance will be very bad. A simple Hint can solve this problem.
Original article address: Case study: how to select the index and JOIN sequence of the MySQL optimizer, thanks to the original author for sharing.