Solve the problem that IN subqueries IN MySQL will cause indexing failure. mysql Index
Today, I saw an IN subquery optimization case for MySQL,
At first, I felt a bit skeptical (if it was done in SQL Server, this situation is absolutely impossible, and a simple test will be conducted later .)
Then I started to perform a test and verification according to what he said. I found that the IN subquery of MySQL is not doing well and the index cannot be used, the scenario is MySQL, and the end version is 5.7.18)
MySQL test environment
The test table is as follows:
create table test_table2( id int auto_increment primary key, pay_id int, pay_time datetime, other_col varchar(100))
Create a stored procedure to insert test data. The test data is characterized by repeated pay_id values. Here, the stored procedure is processed to insert million pieces of data cyclically, insert a duplicate pay_id for every 100 data records. The time field is random within a certain range.
CREATE DEFINER=`root`@`%` PROCEDURE `test_insert`(IN `loopcount` INT) LANGUAGE SQL NOT DETERMINISTIC CONTAINS SQL SQL SECURITY DEFINER COMMENT ''BEGIN declare cnt int; set cnt = 0; while cnt< loopcount do insert into test_table2 (pay_id,pay_time,other_col) values (cnt,date_add(now(), interval floor(300*rand()) day),uuid()); if (cnt mod 100 = 0) then insert into test_table2 (pay_id,pay_time,other_col) values (cnt,date_add(now(), interval floor(300*rand()) day),uuid()); end if; set cnt = cnt + 1; end while;END
Execute call test_insert (3000000); insert 303000 rows of data
Two-seed query statement
The query generally refers to querying data with a Business Id greater than 1 in a certain period of time, so there are two writing methods.
The first statement is as follows: IN subqueries are business IDs with business statistics greater than 1 IN a certain period of time, and the outer layer queries according to the results of IN subqueries, the pay_id column of the Business Id has an index, and the logic is relatively simple. This writing method is indeed inefficient when there is a large amount of data, and indexes are not used.
select * from test_table2 force index(idx_pay_id)where pay_id in ( select pay_id from test_table2 where pay_time>="2016-06-01 00:00:00" AND pay_time<="2017-07-03 12:59:59" group by pay_id having count(pay_id) > 1);
Execution result: 2.23 seconds
The second method is to join the subquery. This method is equivalent to the IN subquery method. The test below shows that the efficiency has indeed improved a lot.
select tpp1.* from test_table2 tpp1, ( select pay_id from test_table2 WHERE pay_time>="2016-07-01 00:00:00" AND pay_time<="2017-07-03 12:59:59" group by pay_id having count(pay_id) > 1) tpp2 where tpp1.pay_id=tpp2.pay_id
Execution result: 0.48 seconds
The execution plan of the In subquery finds that the outer query is a full table scan method and the index on the pay_id is not used.
The execution plan of join self-query. The outer layer (query of tpp1 alias) uses the index on pay_id.
The following describes how to use a forced index for the first query method. Although no error is reported, it is useless.
If the subquery is a direct value, the index can be used normally.
It can be seen that MySQL does not support IN subqueries very well.
IN addition, when a temporary table is added, although it is more efficient to query than many join queries, it is also more efficient to directly use IN subqueries. IN this case, you can also use indexes. However, in this simple case, there is no need to use temporary tables.
The following is a test case similar to the case in sqlserver 2014, where tens of thousands of identical test tables have the same structure and quantity. In this case, the two statements are used, SQL Server can be considered as the same (Execution Plan + efficiency), which is much better than MySQL.
The following is the test environment script in sqlserver.
create table test_table2( id int identity(1,1) primary key, pay_id int, pay_time datetime, other_col varchar(100))begin trandeclare @i int = 0while @i<300000begin insert into test_table2 values (@i,getdate()-rand()*300,newid()); if(@i%1000=0) begin insert into test_table2 values (@i,getdate()-rand()*300,newid()); end set @i = @i + 1endCOMMITGOcreate index idx_pay_id on test_table2(pay_id);create index idx_time on test_table2(pay_time);GOselect * from test_table2 where pay_id in ( select pay_id from test_table2 where pay_time>='2017-01-21 00:00:00' AND pay_time<='2017-07-03 12:59:59' group by pay_id having count(pay_id) > 1 );select tpp1.* from test_table2 tpp1, ( select pay_id from test_table2 WHERE pay_time>='2017-01-21 00:00:00' AND pay_time<='2017-07-30 12:59:59' group by pay_id having count(pay_id) > 1) tpp2 where tpp1.pay_id=tpp2.pay_id
Conclusion: For MySQL data, as of version 5.7.18, you must use this method with caution for IN subqueries.