MySQL not in, left join, is NULL, not EXISTS efficiency problem record _mysql

Source: Internet
Author: User
Tags mysql in time 0
Not in, JOIN, is NULL, not exists efficiency comparison

Statement one: SELECT COUNT (*) from a where a.a to (select a from B)

Statement two: SELECT COUNT (*) from A left join B in A.A = B.A where B.A is null

Statement three: SELECT COUNT (*) from A where NOT exists (select A from B where a.a = B.A)

It has been a long time since the actual effects of the three statements are known to be the same, but there has been no scrutiny of efficiency comparisons. Always feel on the statement two is the fastest.
At work today, you need to delete more than 20 million rows of data because you want to purge data from a tens of millions of-line database. A lot of the above three statements are used to achieve the function. Originally used is statement one, but the result is the execution speed 1个小时32分, the log file occupies 21GB. Although the time is acceptable, but the footprint of hard disk space is indeed a problem. Therefore, all statements are replaced with statement two. We thought it would be quicker. After more than 40 minutes of execution, the first 50000 lines were not erased, but the SQL Server crashed and the results were surprising. Try to execute this statement alone, query nearly 10 million rows of the table, the statement used 4 seconds, statement two but used 18 seconds, the gap is very large. The efficiency of statement three is close to the statement.


The second type of writing is taboo and should be avoided as much as possible. The first and third are almost identical in nature.

Assuming that the buffer pool is large enough, writing two relative to the writing of the following several deficiencies:
(1) The left join itself consumes more resources (more resources are required to handle the resulting intermediate result set)
(2) The middle result set of the left join is not smaller than table a
(3) Writing II also requires that the intermediate results produced by the LEFT join are being null conditional filtering, while the writing one completes the filter at the same time as the two set join, which is an additional

These three points are combined to make a significant difference when dealing with massive amounts of data (mainly memory and CPU overhead). I suspect that when the landlord in the test buffer pool may have been saturated, so that the extra cost of writing two have to use virtual memory on the disk, when the SQL Server to do a page change, due to the slow I/O operation so that the gap is more obvious.

The log file is too large, this is also normal, because the deletion of a lot of records. You can consider setting the recovery model to simple based on the purpose of the database, or truncate the log off after the deletion and shrink the file.


Because there was a previous script to the library for unconditional deletion, which is to delete all the data in the table with a large amount of data, but because the customer request, can not use TRUNCATE TABLE, afraid of destroying the existing library structure. So can only delete delete, at that time also encountered a log file too big problem, at that time the method is to delete in batches, in SQL2K with SET ROWCOUNT @chunk, in the sql2k5 with the delete top @chunk. Such operations not only make the deletion time greatly reduced, but also let the log volume greatly reduced, only increased by 1G or so.
But the task of clearing the data this time is to add the condition that deletes a from a where .... There's a condition behind it. Again using the method of batch deletion, but has no effect.
I wonder if you know why.

MySQL not in and left join efficiency issues record

First, the function of this SQL is to query that collection A does not have data in collection B.
The wording of not in
Copy Code code as follows:

Select ADD_TB. Ruid
From (SELECT DISTINCT Ruid
From Usermsg
where Subjectid =12
and createtime> ' 2009-8-14 15:30:00 '
and createtime<= ' 2009-8-17 16:00:00 '
) ADD_TB
where ADD_TB. Ruid
Not in (SELECT DISTINCT Ruid
From Usermsg
where Subjectid =12
and createtime< ' 2009-8-14 15:30:00 '
)

Returns 444 Line records when 0.07SEC
Explain results
+----+--------------------+------------+----------------+---------------------------+------------+---------+--- ---+------+--

----------------------------+
| ID | Select_type | Table | Type | Possible_keys | Key | Key_len | Ref | Rows |

Extra |
+----+--------------------+------------+----------------+---------------------------+------------+---------+--- ---+------+--

----------------------------+
| 1 | PRIMARY | <derived2> | All | NULL | NULL | NULL | NULL | 452 |

Using where |
| 3 | DEPENDENT subquery | usermsg | Index_subquery | Ruid,subjectid,createtime | Ruid | 96 | Func | 2 |

Using index; Using where |
| 2 | DERIVED | usermsg | Range | Subjectid,createtime | Createtime | 9 | NULL | 1857 |

The Using where; Using Temporary |
+----+--------------------+------------+----------------+---------------------------+------------+---------+--- ---+------+--

----------------------------+
Analysis: This query fast reason for id=2 SQL query out of the results are relatively small, so id=1sql run faster, id=2 use of temporary table, do not know whether to use the index at this time?
One of the left joins
Copy Code code as follows:

Select A.ruid,b.ruid
From (SELECT DISTINCT Ruid
From Usermsg
where Subjectid =12
and Createtime >= ' 2009-8-14 15:30:00 '
and createtime<= ' 2009-8-17 16:00:00 '
) a LEFT join (
SELECT DISTINCT Ruid
From Usermsg
where Subjectid =12 and createtime< ' 2009-8-14 15:30:00 '
) b on a.ruid = B.ruid
where B.ruid is null

Returns 444 Line records when 0.39SEC
Explain results
+----+-------------+------------+-------+----------------------+------------+---------+------+------+---------- -------------

-------+
| ID | Select_type | Table | Type | Possible_keys | Key | Key_len | Ref | Rows | Extra

|
+----+-------------+------------+-------+----------------------+------------+---------+------+------+---------- -------------

-------+
| 1 | PRIMARY | <derived2> | All | NULL | NULL | NULL | NULL | 452 |

|
| 1 | PRIMARY | <derived3> | All | NULL | NULL | NULL | NULL | 1112 | The Using where; NOT EXISTS

|
| 3 | DERIVED | usermsg | Ref | Subjectid,createtime | Subjectid | 5 | | 6667 | The Using where; Using

Temporary |
| 2 | DERIVED | usermsg | Range | Subjectid,createtime | Createtime | 9 | NULL | 1838 | The Using where; Using

Temporary |
+----+-------------+------------+-------+----------------------+------------+---------+------+------+---------- -------------

-------+
Analysis: Two temporary tables were used, and two temporary tables made Cartesian product, resulting in the inability to use indexes and large amounts of data
Another kind of left join
Copy Code code as follows:

SELECT DISTINCT A.ruid
From Usermsg A
Left JOIN Usermsg b
On a.ruid = B.ruid
and B.subjectid =12 and B.createtime < ' 2009-8-14 15:30:00 '
where A.subjectid =12
and A.createtime >= ' 2009-8-14 15:30:00 '
and A.createtime <= ' 2009-8-17 16:00:00 '
and B.ruid is null;

Returns 444 Line records when 0.07SEC
Explain results
+----+-------------+-------+-------+---------------------------+------------+---------+--------------+------+-- -------------

--------------------+
| ID | Select_type | Table | Type | Possible_keys | Key | Key_len | Ref | Rows | Extra

|
+----+-------------+-------+-------+---------------------------+------------+---------+--------------+------+-- -------------

--------------------+
| 1 | Simple | A | Range | Subjectid,createtime | Createtime | 9 | NULL | 1839 | The Using where;

Using Temporary |
| 1 | Simple | B | Ref | Ruid,subjectid,createtime | Ruid | 96 | Dream.a.ruid | 2 | The Using where;

Not exists; Distinct |
+----+-------------+-------+-------+---------------------------+------------+---------+--------------+------+-- -------------

--------------------+
Analysis: Two times the query is indexed, and the query at the same time, so the query efficiency should be very high
SQL with NOT EXISTS
Copy Code code as follows:

SELECT DISTINCT A.ruid
From Usermsg A
where A.subjectid =12
and A.createtime >= ' 2009-8-14 15:30:00 '
and A.createtime <= ' 2009-8-17 16:00:00 '
And NOT EXISTS (
SELECT DISTINCT Ruid
From Usermsg
where Subjectid =12 and Createtime < ' 2009-8-14 15:30:00 '
and Ruid=a.ruid
)

Returns 444 Line records when 0.08SEC
Explain results
+----+--------------------+---------+-------+---------------------------+------------+---------+--------------+ ------+------

------------------------+
| ID | Select_type | Table | Type | Possible_keys | Key | Key_len | Ref | Rows | Extra

|
+----+--------------------+---------+-------+---------------------------+------------+---------+--------------+ ------+------

------------------------+
| 1 | PRIMARY | A | Range | Subjectid,createtime | Createtime | 9 | NULL | 1839 | Using

where Using Temporary |
| 2 | DEPENDENT subquery | usermsg | Ref | Ruid,subjectid,createtime | Ruid | 96 | Dream.a.ruid | 2 | Using

where |
+----+--------------------+---------+-------+---------------------------+------------+---------+--------------+ ------+------

------------------------+
Analysis: Ibid is basically the same, just decomposed 2 query order execution, query efficiency less than the 3rd

In order to verify the efficiency of the data query, the restriction conditions of the Subjectid =12 in the above query are removed, and the results of the statistic query are as follows
0.20s
21.31s
0.25s
0.43s

Laserhe Help analyze the problem summary
Copy Code code as follows:

Select A.ruid,b.ruid
From (SELECT DISTINCT Ruid
From Usermsg
where Createtime >= ' 2009-8-14 15:30:00 '
and createtime<= ' 2009-8-17 16:00:00 '
A LEFT join Usermsg b
On a.ruid = B.ruid
and B.createtime < ' 2009-8-14 15:30:00 '
where b.ruid is null;

Execution Time 0.13s
+----+-------------+------------+-------+-----------------+------------+---------+--------+------+------------- -------------

----+
| ID | Select_type | Table | Type | Possible_keys | Key | Key_len | Ref | Rows | Extra

|
+----+-------------+------------+-------+-----------------+------------+---------+--------+------+------------- -------------

----+
| 1 | PRIMARY | <derived2> | All | NULL | NULL | NULL | NULL | 1248 |

|
| 1 | PRIMARY | B | Ref | Ruid,createtime | Ruid | 96 | A.ruid | 2 | The Using where; NOT EXISTS

|
| 2 | DERIVED | usermsg | Range | Createtime | Createtime | 9 | NULL | 3553 | The Using where; Using

Temporary |
+----+-------------+------------+-------+-----------------+------------+---------+--------+------+------------- -------------

----+
Execution efficiency is similar to the efficiency of not in

The basic principle of database optimization: Let the Cartesian product occur between the smallest possible set, MySQL in the join can be scanned directly through the index, and embedded in the subquery, the query rules

The paddle device is not known with the appropriate index.
An SQL is optimized in a database: First, SQL is parsed into a bunch of parse trees, a tree-like data structure, and then in this data structure, the Query planner looks for any suitable

Index, and then make an arrangement based on the circumstances, and then compute the cost of each of the permutations (similar to the computer-readable version of the explain output) and compare

Minimum face cost, select and execute. So:
Explain select A.ruid,b.ruid from (select distinct ruid from usermsg where Createtime >= ' 2009-8-14 15:30:00 '

and createtime<= ' 2009-8-17 16:00:00 ') a LEFT join usermsg b on a.ruid = B.ruid and B.createtime < ' 2009-8-14 15:30: 00 '

where b.ruid is null;
And
Explain select ADD_TB. Ruid
-> from (SELECT distinct Ruid
-> from Usermsg
-> where createtime> ' 2009-8-14 15:30:00 '
-> and createtime<= ' 2009-8-17 16:00:00 '
->) ADD_TB
-> where ADD_TB. Ruid
-> not in (select DISTINCT Ruid
-> from Usermsg
-> where createtime< ' 2009-8-14 15:30:00 '
->);
Explain
+----+--------------------+------------+----------------+-----------------+------------+---------+------+------ +------------

------------------+
| ID | Select_type | Table | Type | Possible_keys | Key | Key_len | Ref | Rows | Extra

|
+----+--------------------+------------+----------------+-----------------+------------+---------+------+------ +------------

------------------+
| 1 | PRIMARY | <derived2> | All | NULL | NULL | NULL | NULL | 1248 | Using where

|
| 3 | DEPENDENT subquery | usermsg | Index_subquery | Ruid,createtime | Ruid | 96 | Func | 2 | Using index;

Using where |
| 2 | DERIVED | usermsg | Range | Createtime | Createtime | 9 | NULL | 3509 | The Using where;

Using Temporary |
+----+--------------------+------------+----------------+-----------------+------------+---------+------+------ +------------

------------------+
The overhead is exactly the same, and the cost can be derived from the field of rows (basically the product of the values of the rows in the rows, which is the Cartesian product).
But: here's the following:
Explain select A.ruid,b.ruid from (select distinct ruid from usermsg where Createtime >= ' 2009-8-14 15:30:00 '

and createtime<= ' 2009-8-17 16:00:00 ') a LEFT join (select distinct ruid from usermsg where Createtime < ' 2009-8-14

15:30:00 ') b on a.ruid = B.RUID where b.ruid is null;
Execution Time 21.31s
+----+-------------+------------+-------+---------------+------------+---------+------+-------+---------------- -------------

-+
| ID | Select_type | Table | Type | Possible_keys | Key | Key_len | Ref | Rows | Extra

|
+----+-------------+------------+-------+---------------+------------+---------+------+-------+---------------- -------------

-+
| 1 | PRIMARY | <derived2> | All | NULL | NULL | NULL | NULL | 1248 |

|
| 1 | PRIMARY | <derived3> | All | NULL | NULL | NULL | NULL | 30308 | The Using where; NOT EXISTS

|
| 3 | DERIVED | usermsg | All | Createtime | NULL | NULL | NULL | 69366 | The Using where; Using Temporary

|
| 2 | DERIVED | usermsg | Range | Createtime | Createtime | 9 | NULL | 3510 | The Using where; Using Temporary

|
+----+-------------+------------+-------+---------------+------------+---------+------+-------+---------------- -------------

-+
I've got some vague white
Why is it four lines?
and two lines in the middle are huge.
Logically speaking
The query planner should be able to optimize the query to be the same as the previous two
(At least I have the same confidence in my familiar Pgsql database)
But it's not inside MySQL.
So I feel like the query planner might still be a little rough.
As I said earlier, the basic principle of optimization is to have Cartesian product occur between the smallest possible sets.
Well, the last one, at least, doesn't violate that principle.
Although table B is very much in accordance with the conditions, the index is basically not used
However, the query optimizer should not be prevented from seeing the join on condition outside, and, like the previous two SQL, select a primary key to join
But I said earlier, the role of the query planner
Theoretically speaking.
Iterate through all the possibilities, calculate the overhead
It's reasonable.
I feel like the last way to do this is not to traverse the whole thing.
Is it possible that the subquery is implemented fairly easily?
Subqueries are a real challenge to the database
Because it's basically a recursive thing.
So it's not surprising that there's something wrong with this part.
In fact, you think carefully, the last one is nothing more than a variant of our first formulation, the key in table B where the where conditions put
If you put it in there, you don't use the index to join.
Put it on the outside.
This is, in itself, a possibility of permutations and combinations.
Related Article

Contact Us

The content source of this page is from Internet, which doesn't represent Alibaba Cloud's opinion; products and services mentioned on that page don't have any relationship with Alibaba Cloud. If the content of the page makes you feel confusing, please write us an email, we will handle the problem within 5 days after receiving your email.

If you find any instances of plagiarism from the community, please send an email to: info-contact@alibabacloud.com and provide relevant evidence. A staff member will contact you within 5 working days.

A Free Trial That Lets You Build Big!

Start building with 50+ products and up to 12 months usage for Elastic Compute Service

  • Sales Support

    1 on 1 presale consultation

  • After-Sales Support

    24/7 Technical Support 6 Free Tickets per Quarter Faster Response

  • Alibaba Cloud offers highly flexible support services tailored to meet your exact needs.