Explore the Mysql Optimizer's choice of index and join order

Explore the Mysql Optimizer's choice of index and join order _mysql

Last Update:2017-01-19 Source: Internet

Author: User

Tags joins rand

Developer on Alibaba Coud: Build your first app with APIs, SDKs, and tutorials on the Alibaba Cloud. Read more ＞

This article looks at a case to see how the MySQL optimizer chooses the index and join order. Table structure and data preparation refer to the last section of this article, "Test environment." The main implementation process of the MySQL optimizer is described here, rather than the various components of an optimizer (this is another topic).

We know that the MySQL optimizer has only two degrees of freedom: Sequential selection, single table access, and here's a detailed analysis of the following SQL to see how the MySQL optimizer makes each step choice.

Explain
select *
from
 employee as a,department as B
where
   a.lastname = ' Zhou ' and
 B. DepartmentID = A.departmentid and
 b.departmentname = ' TBX ';

1. Possible options

Here you see the order of joins can be a| B or b| A, there are a number of Tanku access methods, for a table can choose: Full table scan and index ' ind_l_d ' (a.lastname = ' Zhou ') or ' ind_did ' (B.departmentid = A.departmentid). There are also three options for B: Full table scan, index ind_d, ind_dn.
2. How to do the MySQL optimizer
2.1 Overview

The MySQL optimizer's main work consists of the following sections: Query Rewrite (including outer JOIN conversion), const table detection, range analysis, JOIN optimization (Order and access selection) , plan refinement. This case begins with range analysis.
2.2 Range Analysis

This section includes all range and index merge cost evaluations (refer to 1 reference 2). Here, the equivalence expression is also a range, so the cost is evaluated here, and the found records (representing the corresponding equivalent expression, which will probably select how many records) is calculated.

In this case, range analysis will a.lastname = ' Zhou ' and B in table B for the condition of table A. Departmentname = ' TBX ' respectively do analysis. which

Table A a.lastname = ' Zhou ' found records:51
Table B b.departmentname = ' TBX ' found records:1

Neither of these conditions is a range, but the values computed here are still stored and used when the subsequent ref access method is evaluated. The value here is returned based on the Records_in_range interface, and for InnoDB to sample the index page each time the function is called, this is a very performance-consuming operation, and for many other relational databases it is using "histogram" statistics to avoid this operation ( I believe that the MARIADB subsequent version will also implement histogram statistics.
2.3 Choice of order and access mode: Poor lift

MySQL finds the optimal execution order and access mode by enumerating all the left-deep trees (or all the left-deep trees are the search space for the entire MySQL optimizer).
2.3.1 Sort

The optimizer first sorts all tables according to the found records, and records fewer fronts. So, the order here is B, a.
2.3.2 Greedy Search

When the number of tables is less (less than search_depth, default is 63), this directly degenerate into a exhaustive search, the optimizer will be exhaustive all left-deep tree to find the best execution plan. In addition, the optimizer in order to reduce the large amount of search space caused by the huge depletion, so using a "lazy" parameter prune_level (open by default), how to "lazy", you can refer to the complexity of the join sequence selection. This case does not apply, however, at least three more tables are required to have a "lazy" association.
2.3.3 Poor lift

The first table of the join can be: A or B; if the first table chooses a, the second table can choose B; if the first table chooses B, the second table can select A;

Because of the previous sort, B-table found records less, so the first table in the order of join is to choose b (this is fastidious).

(*) Select the first join table as B
(* *) determine how table B is accessed
Because table B is the first table, the index ind_d (b.departmentid = A.departmentid) cannot be used, and only ind_dn (b.departmentname = ' TBX ') can be used.
Cost calculation using IND_DN index: 1.2; IO cost is 1.
Whether to use full table scan: This compares the IO cost of using the index with the IO cost of the full table scan, the former is 1, the latter is 2; So ignore full table scan
So, b table access way ref, using index Ind_d

(* *) a table that selects a second join from the remaining tables, where the remaining tables are: A
(* *) joins a table and determines how it is accessed
The indexes that can be used are: ' ind_l_d ' (a.lastname = ' Zhou ') or ' ind_did ' (B.departmentid = A.departmentid)
Calculate the cost of using index ind_l_d and Ind_did in turn:
(***) Ind_l_d a.lastname = ' Zhou '
The A.lastname = ' Zhou ' corresponding record in the range analysis phase is approximately: 51.
So, calculate IO cost is: 51;ref do io cost calculation will make one time revise, revise it to Worst_seek (reference)
After the revised IO cost: 15, the total cost is: 25.2
(***) Ind_did B.departmentid = A.departmentid
This is a cost that needs to know the results of the preceding table in order to calculate. So range analysis is impossible to analyze.
Here, we see the previous table for B,found_record is 1, so a.departmentid just need to correspond to one record.
Because the specific values do not know, there is no histogram, so can only be based on index statistics to calculate:
The cardinality of the column A.departmentid of the index Ind_did is 1349, the total table record number is 1349
So, each value corresponds to a record, and the previous table B has only one record, so the found_record here is calculated as 1*1 = 1
So io cost is: 1, total cost is 1.2
(***) The ind_l_d cost is 1.2 for the 25.2;ind_did cost, so select the latter as the way to access the current table
(* *) determines that a uses index Ind_did to access the method ref
(**) Join Order b| A, the total cost is: 1.2+1.2 = 2.4

(*) Select the table for the first join as a
(* *) determine how a table is accessed
Because table A is the first table, you cannot use the index ' ind_did ' (B.departmentid = A.departmentid)
Then you can only use the index ' ind_l_d ' (a.lastname = ' Zhou ')
Cost calculation using Ind_l_d index, the total cost is 25.2; Refer to the preceding calculation;
(* * *) the cost of accessing table A is already 25.2, which is larger than the previous optimal cost of 2.4, ignoring the order
So this is the end of this exhaustive search.

Simplify the above process as follows:

(*) Select the first join table as B
(* *) determine how table B is accessed
(* *) a table that selects a second join from the remaining tables, where the remaining tables are: A
(* *) joins a table and determines how it is accessed
(***) Ind_l_d a.lastname = ' Zhou '
(***) Ind_did B.departmentid = A.departmentid
(***) The ind_l_d cost is 1.2 for the 25.2;ind_did cost, so select the latter as the way to access the current table
(* *) determines that a uses index Ind_did to access the method ref
(**) Join Order b| A, the total cost is: 1.2+1.2 = 2.4

(*) Select the table for the first join as a
(* *) determine how a table is accessed
(* * *) the cost of accessing table A is already 25.2, which is larger than the previous optimal cost of 2.4, ignoring the order

At this point, the MySQL optimizer determines the best join order and access for all tables.
3. Test environment

Mysql:5.1.48-debug-log InnoDB plugin 1.0.9 CREATE TABLE ' Department ' (' departmentid ' int () DEFAULT NULL, ' Departme Ntname ' varchar default NULL, key ' Ind_d ' (' DepartmentID '), key ' ind_dn ' (' Departmentname ')) Engine=innodb default

CHARSET=GBK; CREATE TABLE ' employee ' (' LastName ' varchar) default NULL, ' DepartmentID ' int (one) default null, KEY ' ind_l_d ' (' Las

Tname '), KEY ' Ind_did ' (' DepartmentID ')) Engine=innodb DEFAULT CHARSET=GBK; For i in ' seq 1 1000 '; Do mysql-vvv-uroot test-e ' inserts into department values (600000*rand (), Repeat (char (65+rand () *58), Rand () *20)) '; Done for i in ' seq 1 1000 '; Do mysql-vvv-uroot test-e ' insert into employee values (repeat (char (65+rand () *58), Rand () *20), 600000*rand ()) '; Done for i in ' seq 1 50 '; Do mysql-vvv-uroot test-e ' inserts into employee values ("Zhou", 27760) '; Done for i in ' seq 1 200 '; Do mysql-vvv-uroot test-e ' inserts into employee values (repeat (char (65+rand () *58), Rand () *20), 27760) '; Done for i in ' SEQ 1 1 '; Do mysql-vvv-uroot test-e ' inserts into department values (27760, "TBX") ';
Do show index from employee; +----------+------------+----------+--------------+--------------+-----------+-------------+----------+-------- +------+------------+---------+
| Table | Non_unique | Key_name | Seq_in_index | column_name | Collation | Cardinality | Sub_part | Packed | Null | Index_type |
Comment | +----------+------------+----------+--------------+--------------+-----------+-------------+----------+-------- +------+------------+---------+
|     Employee | 1 |      Ind_l_d | 1 | LastName |    A |   1349 | NULL | NULL | YES |     Btree | |
|     Employee | 1 |      Ind_did | 1 | DepartmentID |    A |   1349 | NULL | NULL | YES |     Btree |
| +----------+------------+----------+--------------+--------------+-----------+-------------+----------+--------
+------+------------+---------+ Show index from department; +------------+------------+----------+--------------+----------------+-----------+-------------+----------+--------+------+------------+---------+
| Table | Non_unique | Key_name | Seq_in_index | column_name | Collation | Cardinality | Sub_part | Packed | Null | Index_type |
Comment | +------------+------------+----------+--------------+----------------+-----------+-------------+----------+---- ----+------+------------+---------+
|     Department | 1 |      Ind_d | 1 | DepartmentID |    A |   1001 | NULL | NULL | YES |     Btree | |
|     Department | 1 |      ind_dn | 1 | Departmentname |    A |   1001 | NULL | NULL | YES |     Btree |
| +------------+------------+----------+--------------+----------------+-----------+-------------+----------+----

 ----+------+------------+---------+

4. Construct a bad case

Because MySQL uses index statistics to do cost estimation in the correlation condition, it is easy to make wrong judgment when the data distribution is uneven. Simple we construct the following case:

The table and index structures are unchanged, and the data is constructed in the following manner:

For i in ' seq 1 10000 '; Do mysql-uroot test-e ' inserts into department values (600000*rand (), Repeat (char (65+rand () *58), Rand () *20)) '; Done for i in ' seq 1 10000 '; Do mysql-uroot test-e ' insert into employee values (repeat (char (65+rand () *58), Rand () *20), 600000*rand ()) '; Done for i in ' SEQ 1 1 '; Do mysql-uroot test-e ' inserts into employee values ("Zhou", 27760) '; Done for i in ' seq 1 10 '; Do mysql-uroot test-e ' inserts into department values (27760, "TBX") '; Done for i in ' seq 1 1000 ';
Do mysql-uroot test-e ' inserts into department values (27760,repeat (char (65+rand () *58), Rand () *20)) '; Done explain SELECT * from employee as a,department as B where a.lastname = ' Zhou ' and B.departmentid = A.department
ID and b.departmentname = ' TBX '; +----+-------------+-------+------+-----------------+---------+---------+---------------------+------+--------- ----+
| ID | Select_type | Table | Type | Possible_keys | Key | Key_len | Ref | Rows |
Extra | +----+-------------+-------+------+-----------------+---------+---------+---------------------+------+-------------+
| 1 | Simple | A | Ref | Ind_l_d,ind_did | Ind_l_d | 43 |  Const | 1 | Using where | | 1 | Simple | B | Ref | ind_d,ind_dn | Ind_d | 5 | Test.  A.departmentid | 1 |
Using where | +----+-------------+-------+------+-----------------+---------+---------+---------------------+------+---------

 ----+

As you can see here, the MySQL execution plan uses index ind_d for table department, then a table hits a record as (zhou,27760); According to b.departmentid=27760, 1010 records are returned. Then filter according to the condition departmentname = ' TBX '.

Here you can see if Table B chooses index IND_DN, the effect is better because departmentname = ' TBX ' returns only 10 records, then according to condition A. Departmentid=b.departmentid Filtration.

This article is an English version of an article which is originally in the Chinese language on aliyun.com and is provided for information purposes only. This website makes no representation or warranty of any kind, either expressed or implied, as to the accuracy, completeness ownership or reliability of the article or any translations thereof. If you have any concerns or complaints relating to the article, please send an email, providing a detailed description of the concern or complaint, to info-contact@alibabacloud.com. A staff member will contact you within 5 working days. Once verified, infringing content will be removed immediately.

A Free Trial That Lets You Build Big!

Start building with 50+ products and up to 12 months usage for Elastic Compute Service

Get Started for Free

Sales Support

1 on 1 presale consultation

Chat Contact Sales
After-Sales Support

24/7 Technical Support 6 Free Tickets per Quarter Faster Response

Open a Ticket
Alibaba Cloud offers highly flexible support services tailored to meet your exact needs.

Learn More