Optimizing MySQL Deduplication to the Extreme, Part One of Three: Skillful Use of Indexes and Variables


During the New Year's Day holiday I got a call from Mr. Wu of Alibaba, who told me about a SQL deduplication task optimized to the extreme: out of 1 million raw rows, 500,000 of which are duplicates, the 500,000 deduplicated rows were written to the target table in only 9 seconds. That is a staggering number; merely inserting 500,000 records takes a while by itself. My interest piqued, I ran my own experiments, thought it through, and summarized the results below.

I. Problem Statement

The structure of the source table t_source is as follows:

item_id int,
created_time datetime,
modified_time datetime,
item_name varchar(20),
other varchar(20)

1. The source table contains 1 million rows, of which 500,000 have duplicate created_time and item_name values.
2. The 500,000 deduplicated rows must be written to the target table.
3. Of the rows sharing the same created_time and item_name, any one may be kept; no rule restricts which.

II. Experimental Environment

Linux virtual machine: CentOS release 6.4; 8 GB memory; 100 GB mechanical hard disk; two physical CPUs with two cores each, four processors in total; MySQL 5.6.14.

III. Creating the Test Tables and Data

1. Create the source table

CREATE TABLE t_source (
    item_id int,
    created_time datetime,
    modified_time datetime,
    item_name varchar(20),
    other varchar(20)
);

2. Create the target table
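The DDL for this step is missing from the source page; presumably the target table simply mirrors the source structure, which can be written as:

-- Assumed reconstruction: same structure as t_source, still without keys
CREATE TABLE t_target LIKE t_source;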

3. Generate 1 million rows of test data, of which 500,000 have duplicate created_time and item_name values

DELIMITER //
CREATE PROCEDURE sp_generate_data()
BEGIN
    SET @i := 1;
    WHILE @i <= 500000 DO
        SET @created_time := DATE_ADD('2017-01-01', INTERVAL @i SECOND);
        SET @modified_time := @created_time;
        SET @item_name := CONCAT('a', @i);
        INSERT INTO t_source VALUES (@i, @created_time, @modified_time, @item_name, 'other');
        SET @i := @i + 1;
    END WHILE;
    COMMIT;
    SET @last_insert_id := 500000;
    INSERT INTO t_source
    SELECT item_id + @last_insert_id,
           created_time,
           DATE_ADD(modified_time, INTERVAL @last_insert_id SECOND),
           item_name,
           'other'
      FROM t_source;
    COMMIT;
END //
DELIMITER ;

CALL sp_generate_data();

The source table has no primary key or unique constraint, so it may contain two completely identical rows; insert one more record to simulate this situation.

INSERT INTO t_source SELECT * FROM t_source WHERE item_id = 1;
COMMIT;

Query the total number of records and the number of records after deduplication, as shown in Figure 1.

SELECT COUNT(*), COUNT(DISTINCT created_time, item_name) FROM t_source;

Figure 1
As you can see, there are 1,000,001 records in the source table, and the target table should have 500,000 records.
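Assuming the data generated above, the output should look roughly like this (a sketch; exact formatting follows the mysql client):

+----------+------------------------------------------+
| COUNT(*) | COUNT(DISTINCT created_time, item_name)  |
+----------+------------------------------------------+
|  1000001 |                                   500000 |
+----------+------------------------------------------+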

IV. Comparison Tests Without an Index

1. Using a correlated subquery

TRUNCATE t_target;
INSERT INTO t_target
SELECT DISTINCT t1.*
  FROM t_source t1
 WHERE item_id IN
       (SELECT MIN(item_id)
          FROM t_source t2
         WHERE t1.created_time = t2.created_time
           AND t1.item_name = t2.item_name);
COMMIT;

This statement did not finish for a long time, so I only looked at the execution plan, shown in Figure 2. With nested scans on the order of 1 million × 1 million rows, the outcome is no surprise.

Figure 2
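Incidentally, the plans shown in these figures can be reproduced with EXPLAIN; a minimal sketch for this statement (running EXPLAIN on the SELECT part alone is enough to see the scans):

EXPLAIN
SELECT DISTINCT t1.*
  FROM t_source t1
 WHERE item_id IN
       (SELECT MIN(item_id)
          FROM t_source t2
         WHERE t1.created_time = t2.created_time
           AND t1.item_name = t2.item_name);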

2. Deduplicating with a table join

TRUNCATE t_target;
INSERT INTO t_target
SELECT DISTINCT t1.*
  FROM t_source t1,
       (SELECT MIN(item_id) item_id, created_time, item_name
          FROM t_source
         GROUP BY created_time, item_name) t2
 WHERE t1.item_id = t2.item_id;
COMMIT;
This method takes 35 seconds; the query plan is shown in Figure 3.


Figure 3

(1) The inner query scans all 1 million rows of t_source, builds a temporary table, uses filesort to find the minimum item_id of each group, and generates the derived table derived2, which has 500,000 rows.
(2) MySQL automatically creates an index auto_key0 on the item_id column of the temporary table derived2.
(3) The outer query also scans the 1 million rows of t_source. When joining to the temporary table, it uses the auto_key0 index to look up the matching row for each t_source row's item_id. The DISTINCT is optimized at this point: the search for identical values stops as soon as the first matching row is found.

3. Using variables

SET @a := '0000-00-00 00:00:00';
SET @b := '';
SET @f := 0;
TRUNCATE t_target;
INSERT INTO t_target
SELECT item_id, created_time, modified_time, item_name, other
  FROM
       (SELECT t0.*,
               IF(@a = created_time AND @b = item_name, @f := 0, @f := 1) f,
               @a := created_time,
               @b := item_name
          FROM
               (SELECT * FROM t_source ORDER BY created_time, item_name) t0) t1
 WHERE f = 1;
COMMIT;

This method takes 14 seconds; the query plan is shown in Figure 4.


Figure 4

(1) The innermost query scans the 1 million rows of t_source and uses filesort to generate the derived table derived3.
(2) The second-level query scans the 1 million rows of derived3, performs the variable comparisons and assignments, and generates the derived table derived2; an index auto_key0 is automatically created on the derived column f.
(3) The outermost layer uses the auto_key0 index to scan derived2 and pick out the deduplicated result rows.

Compared with method 2, the variable approach eliminates the table join, and the query is 2.5 times faster (35 seconds down to 14).
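To see why the flag f works, here is a toy walk-through consistent with the generated data (values picked for illustration only):

-- Rows arrive sorted by (created_time, item_name); @a and @b hold the previous row's key.
-- ('2017-01-01 00:00:01', 'a1')  differs from the initial (@a, @b)  -> f = 1, kept
-- ('2017-01-01 00:00:01', 'a1')  equals the previous key            -> f = 0, dropped
-- ('2017-01-01 00:00:02', 'a2')  differs from the previous key      -> f = 1, kept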

So far we have not created any index on the source table. Whichever approach is used, the created_time and item_name columns must be sorted, so it is natural to think that a composite index on these two columns would eliminate the filesort and improve query performance.

V. Comparison Tests with a Composite Index on created_time and item_name

1. Create a composite index covering the created_time, item_name, and item_id columns.

CREATE INDEX idx_sort ON t_source (created_time, item_name, item_id);
ANALYZE TABLE t_source;
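To confirm the index exists before rerunning the tests, a standard check is:

SHOW INDEX FROM t_source;  -- should list idx_sort on (created_time, item_name, item_id)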

2. Using a correlated subquery

TRUNCATE t_target;
INSERT INTO t_target
SELECT DISTINCT t1.*
  FROM t_source t1
 WHERE item_id IN
       (SELECT MIN(item_id)
          FROM t_source t2
         WHERE t1.created_time = t2.created_time
           AND t1.item_name = t2.item_name);
COMMIT;
This method takes 20 seconds; the query plan is shown in Figure 5.


Figure 5

(1) The t_source table in the outer query is the driving table; it must be scanned for all 1 million rows.
(2) For each row of the driving table, the correlated subquery locates its matching rows through the idx_sort index.

3. Deduplicating with a table join

TRUNCATE t_target;
INSERT INTO t_target
SELECT DISTINCT t1.*
  FROM t_source t1,
       (SELECT MIN(item_id) item_id, created_time, item_name
          FROM t_source
         GROUP BY created_time, item_name) t2
 WHERE t1.item_id = t2.item_id;
COMMIT;

This method takes 25 seconds; the query plan is shown in Figure 6.


Figure 6

Compared with the no-index case, the subquery changes from a full table scan to a full index scan, but it still has to scan 1 million rows. The query time therefore only drops from 35 seconds to 25 seconds, an improvement of about 29%, not much.

4. Using variables

SET @a := '0000-00-00 00:00:00';
SET @b := '';
SET @f := 0;
TRUNCATE t_target;
INSERT INTO t_target
SELECT item_id, created_time, modified_time, item_name, other
  FROM
       (SELECT t0.*,
               IF(@a = created_time AND @b = item_name, @f := 0, @f := 1) f,
               @a := created_time,
               @b := item_name
          FROM
               (SELECT * FROM t_source ORDER BY created_time, item_name) t0) t1
 WHERE f = 1;
COMMIT;

This method takes 14 seconds, and the query plan is the same as without the index, shown in Figure 4. Evidently the index has no effect on this form of the query. Can the nesting be eliminated, so that only a single query layer is needed?


5. Using variables and eliminating the nested query

SET @a := '0000-00-00 00:00:00';
SET @b := '';
TRUNCATE t_target;
INSERT INTO t_target
SELECT *
  FROM t_source FORCE INDEX (idx_sort)
 WHERE (@a != created_time OR @b != item_name)
   AND (@a := created_time) IS NOT NULL
   AND (@b := item_name) IS NOT NULL
 ORDER BY created_time, item_name;
COMMIT;

This method takes 8 seconds; the query plan is shown in Figure 7.


Figure 7

The statement has the following characteristics:
(1) The nested subqueries are gone; only a single full index scan of t_source is needed, so the query plan is already optimal.
(2) There is no need for a second DISTINCT deduplication pass.
(3) The variable comparisons and assignments appear only in the WHERE clause.
(4) The index eliminates the filesort.
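As a quick sanity check of the result (the expected count comes from the problem statement):

SELECT COUNT(*) FROM t_target;  -- should return 500000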

This statement is Mr. Wu's single-threaded solution. Careful analysis reveals that it skillfully exploits the logical query processing order of SQL statements and the characteristics of the index.
The logical steps of an SQL query are:
Step 1: Perform the Cartesian product (cross join)
Step 2: Apply the ON filter (join condition)
Step 3: Add outer rows (outer join)
Step 4: Apply the WHERE filter
Step 5: Group
Step 6: Apply CUBE or ROLLUP
Step 7: Apply the HAVING filter
Step 8: Process the SELECT list
Step 9: Apply the DISTINCT clause
Step 10: Apply the ORDER BY clause
Step 11: Apply the LIMIT clause
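For illustration only, here is a small runnable query against t_source with each clause mapped to its logical step (the query itself is not part of the benchmark):

SELECT item_name, COUNT(*) AS cnt      -- step 8: process the select list
  FROM t_source                        -- step 1: single table, no join
 WHERE created_time >= '2017-01-01'    -- step 4: where filter
 GROUP BY item_name                    -- step 5: grouping
HAVING COUNT(*) > 1                    -- step 7: having filter
 ORDER BY cnt DESC                     -- step 10: order by
 LIMIT 10;                             -- step 11: limit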

The logical execution of any given query uses a subset of these 11 steps. For the query above, the execution order is: locate the data rows through the forced index idx_sort, apply the WHERE filter, process the SELECT list, and apply the ORDER BY clause.

For the variables to be assigned and compared in created_time, item_name sort order, the data rows must be visited in index order. The FORCE INDEX (idx_sort) hint plays exactly this role and must be written for the whole deduplication statement to work: otherwise the table would be scanned first and sorted afterward, the order in which the variables are assigned could not be guaranteed, and neither could the correctness of the query result. The ORDER BY clause cannot be omitted either; without it, even with the FORCE INDEX hint, MySQL would use a full table scan rather than a full index scan, making the result wrong.

Meanwhile the index guarantees the ordering of created_time and item_name, avoiding a filesort. The FORCE INDEX (idx_sort) hint and the ORDER BY clause are inseparable, and the index idx_sort is exactly what both of them need, a double benefit.

Before the query begins, the variables are initialized to values that cannot occur in the data. The WHERE clause is then evaluated left to right for each row: the variables are first compared with the column values, and only afterward are the row's created_time and item_name assigned to the variables, processed row by row in created_time, item_name order. item_name is a string type, so (@b := item_name) by itself is not a valid boolean expression; hence it is written as (@b := item_name) IS NOT NULL.
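If you want to convince yourself of this behavior, a throwaway check (not from the original article) is:

SELECT (@x := 'abc') IS NOT NULL;  -- returns 1, and @x now holds 'abc'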
Finally, one more note: you cannot simply write "INSERT INTO t_target SELECT * FROM t_source GROUP BY created_time, item_name;", because it is rejected under sql_mode='ONLY_FULL_GROUP_BY' (the select list contains nonaggregated columns that are not named in the GROUP BY clause).

VI. Summary

A seemingly simple deduplication statement takes a clear understanding of many knowledge points to optimize perfectly, for example: the logical execution order of query statements, using an index to optimize sorting, forcing a table to be scanned in index order, covering indexes, semi-join query optimization, boolean expressions, and so on. Only with a solid foundation applied flexibly can you write efficient SQL statements.

