MySQL not in, left join, is NULL, not EXISTS efficiency problem record

Source: Internet
Author: User
Tags joins time 0

Original: MySQL not in, left join, is NULL, not EXISTS efficiency problem record

MySQL not in, left joins, is NULL, not EXISTS efficiency problem records, the need for friends can refer to.

No in , JOIN, is NULL, not exists efficiency comparison

Statement one: SELECT COUNT (*) from A where a.a not in (select A from B)

Statement two: SELECT COUNT (*) from A left join B in A.A = B.A where B.A is null

Statement three: SELECT COUNT (*) from A where NOT exists (select A from B where a.a = B.A)

It has been a long time since the actual effect of the above three statements has been known, but there has been no comparison of the efficiency between the two. Always feel that the sentence two is the fastest.
At work today, you need to delete more than 20 million rows of data in order to clean up a library of tens of millions of rows of data. A large number of uses of the above three statements to achieve the function. Originally used is statement one, but the result is the execution speed 1个小时32分, log file occupies 21GB. Although the time is acceptable, but the use of hard disk space is a problem. So replace all statements with statement two. Thought it would be quicker. Unexpectedly, after more than 40 minutes of execution, the first batch of 50000 lines did not delete, but let SQL Server crashed, the results are surprising. Try to execute this statement alone, query nearly 10 million rows of the table, the statement used for 4 seconds, statement two but used 18 seconds, the gap is very large. The efficiency of statement three is close to the statement.


The second way of writing is taboo and should be avoided as much as possible. The first and third formulations are almost identical in nature.

Assuming the buffer pool is large enough, there are a few disadvantages to writing two relative to the wording:
(1) The left join itself consumes more resources (more resources are needed to process the resulting intermediate result set)
(2) The size of the intermediate result set of the left join is not smaller than table a
(3) Writing two also need to the left join produced by the intermediate result is a null condition filter, and the wording of a two set of joins at the same time to complete the filter, this part of the overhead is additional

These three points combine to produce significant differences in processing large amounts of data (primarily memory and CPU overhead). I suspect that the landlord in the test when the buffer pool may have been saturated, so that the additional cost of writing two will have to rely on the virtual memory on disk, when the SQL Server to make a page change, because of the slow I/O operations involved so this gap is more obvious.

About the log file is too large, this is normal, because the deletion of more records. The recovery model can be set to simple based on the purpose of the database, or the log will be truncate and the file shrink down after the delete is finished.


Because there has been a script for the unconditional deletion of this library, is to delete all the data in a large table, but because of customer requirements, can not use TRUNCATE TABLE, afraid of destroying the existing library structure. So can only use delete delete, at that time also encountered a log file too large problem, then adopted the method is batch delete, in SQL2K with SET rowcount @chunk, in sql2k5 with delete top @chunk. Such operations not only greatly reduce the deletion time, but also greatly reduce the log volume, only about 1G growth.
But the job of erasing the data needs to be conditional, that is, delete a from a where .... There's a condition behind it. The method of batch deletion is used again, but it has no effect.
I wonder if you know why.

MySQL not in and left join efficiency issue record

First, the function of this SQL is to query the data of collection A that is not in collection B.
The notation of not in

Copy CodeThe code is as follows:
Select ADD_TB. Ruid
From (SELECT DISTINCT Ruid
From Usermsg
where Subjectid =12
and createtime> ' 2009-8-14 15:30:00 '
and createtime<= ' 2009-8-17 16:00:00 '
) ADD_TB
where ADD_TB. Ruid
Not in (SELECT DISTINCT Ruid
From Usermsg
where Subjectid =12
and createtime< ' 2009-8-14 15:30:00 '
)


Returns 444 rows elapsed Time 0.07sec
Explain results
+----+--------------------+------------+----------------+---------------------------+------------+---------+--- ---+------+--

----------------------------+
| ID | Select_type | Table | Type | Possible_keys | Key | Key_len | Ref | Rows |

Extra |
+----+--------------------+------------+----------------+---------------------------+------------+---------+--- ---+------+--

----------------------------+
| 1 | PRIMARY | <derived2> | All | NULL | NULL | NULL | NULL | 452 |

Using where |
| 3 | DEPENDENT subquery | usermsg | Index_subquery | Ruid,subjectid,createtime | Ruid | 96 | Func | 2 |

Using index; Using where |
| 2 | DERIVED | usermsg | Range | Subjectid,createtime | Createtime | 9 | NULL | 1857 |

Using where; Using Temporary |
+----+--------------------+------------+----------------+---------------------------+------------+---------+--- ---+------+--

----------------------------+
Analysis: The query speed fast reason for id=2 SQL query out less results, so id=1sql run faster, id=2 use a temporary table, do not know whether to use the index?
One of the left joins

Copy CodeThe code is as follows:
Select A.ruid,b.ruid
From (SELECT DISTINCT Ruid
From Usermsg
where Subjectid =12
and Createtime >= ' 2009-8-14 15:30:00 '
and createtime<= ' 2009-8-17 16:00:00 '
) a LEFT join (
SELECT DISTINCT Ruid
From Usermsg
where Subjectid =12 and createtime< ' 2009-8-14 15:30:00 '
) b on a.ruid = B.ruid
where B.ruid is null


Returns 444 rows elapsed Time 0.39sec
Explain results
+----+-------------+------------+-------+----------------------+------------+---------+------+------+---------- -------------

-------+
| ID | Select_type | Table | Type | Possible_keys | Key | Key_len | Ref | Rows | Extra

|
+----+-------------+------------+-------+----------------------+------------+---------+------+------+---------- -------------

-------+
| 1 | PRIMARY | <derived2> | All | NULL | NULL | NULL | NULL | 452 |

|
| 1 | PRIMARY | <derived3> | All | NULL | NULL | NULL | NULL | 1112 | Using where; NOT EXISTS

|
| 3 | DERIVED | usermsg | Ref | Subjectid,createtime | Subjectid | 5 | | 6667 | Using where; Using

Temporary |
| 2 | DERIVED | usermsg | Range | Subjectid,createtime | Createtime | 9 | NULL | 1838 | Using where; Using

Temporary |
+----+-------------+------------+-------+----------------------+------------+---------+------+------+---------- -------------

-------+
Analysis: Two temporary tables were used, and two temporal tables made Cartesian product, resulting in the inability to use indexes and large data volumes
Another kind of left join

Copy CodeThe code is as follows:
SELECT DISTINCT A.ruid
From Usermsg A
Left JOIN Usermsg b
On a.ruid = B.ruid
and B.subjectid =12 and B.createtime < ' 2009-8-14 15:30:00 '
where A.subjectid =12
and A.createtime >= ' 2009-8-14 15:30:00 '
and A.createtime <= ' 2009-8-17 16:00:00 '
and B.ruid is null;


Returns 444 rows of records when 0.07SEC
Explain results
+----+-------------+-------+-------+---------------------------+----- -------+---------+--------------+------+---------------

--------------------+
| id | select_type | table | Type | Possible_keys | Key | Key_len | Ref | Rows | Extra

|
+----+-------------+-------+-------+---------------------------+------------+---------+--------------+----- -+---------------

--------------------+
| 1 | Simple | A | Range | Subjectid,createtime | Createtime | 9 | NULL | 1839 | Using where;

Using Temporary |
| 1 | Simple | B | Ref | Ruid,subjectid,createtime | Ruid | 96 | Dream.a.ruid | 2 | Using where;

NOT exists; Distinct |
+----+-------------+-------+-------+---------------------------+------------+---------+--------------+----- -+---------------

--------------------+
Analysis: Two queries are indexed and queried at the same time, so query efficiency should be high
Use SQL with NOT EXISTS

Copy CodeThe code is as follows:
SELECT DISTINCT A.ruid
From Usermsg A
where A.subjectid =12
and A.createtime >= ' 2009-8-14 15:30:00 '
and A.createtime <= ' 2009-8-17 16:00:00 '
And NOT EXISTS (
SELECT DISTINCT Ruid
From Usermsg
where Subjectid =12 and Createtime < ' 2009-8-14 15:30:00 '
and Ruid=a.ruid
)


Returns 444 rows elapsed Time 0.08sec
Explain results
+----+--------------------+---------+-------+---------------------------+------------+---------+--------------+ ------+------

------------------------+
| ID | Select_type | Table | Type | Possible_keys | Key | Key_len | Ref | Rows | Extra

|
+----+--------------------+---------+-------+---------------------------+------------+---------+--------------+ ------+------

------------------------+
| 1 | PRIMARY | A | Range | Subjectid,createtime | Createtime | 9 | NULL | 1839 | Using

where Using Temporary |
| 2 | DEPENDENT subquery | usermsg | Ref | Ruid,subjectid,createtime | Ruid | 96 | Dream.a.ruid | 2 | Using

where |
+----+--------------------+---------+-------+---------------------------+------------+---------+--------------+ ------+------

------------------------+
Analysis: Ibid basically is the same, just decomposed 2 query order execution, query efficiency is less than the 3rd

In order to verify the efficiency of data query, the Subjectid =12 in the above query is removed, and the result is as follows
0.001
21.31s
0.25s
0.43s

Laserhe help to analyze the problem summary

Copy CodeThe code is as follows:
Select A.ruid,b.ruid
From (SELECT DISTINCT Ruid
From Usermsg
where Createtime >= ' 2009-8-14 15:30:00 '
and createtime<= ' 2009-8-17 16:00:00 '
) A LEFT join Usermsg b
On a.ruid = B.ruid
and B.createtime < ' 2009-8-14 15:30:00 '
where b.ruid is null;


Execution Time 0.13s
+----+-------------+------------+-------+-----------------+------------+---------+--------+------+------------- -------------

----+
| ID | Select_type | Table | Type | Possible_keys | Key | Key_len | Ref | Rows | Extra

|
+----+-------------+------------+-------+-----------------+------------+---------+--------+------+------------- -------------

----+
| 1 | PRIMARY | <derived2> | All | NULL | NULL | NULL | NULL | 1248 |

|
| 1 | PRIMARY | B | Ref | Ruid,createtime | Ruid | 96 | A.ruid | 2 | Using where; NOT EXISTS

|
| 2 | DERIVED | usermsg | Range | Createtime | Createtime | 9 | NULL | 3553 | Using where; Using

Temporary |
+----+-------------+------------+-------+-----------------+------------+---------+--------+------+------------- -------------

----+
Efficiency of execution similar to not in

The basic principle of database optimization: To make the Cartesian product occur between the smallest possible set, MySQL can be scanned directly through the index when the join, and embedded in the subquery, query rules

The scribe does not know to use the appropriate index.
A SQL is so optimized in the database: First SQL is parsed into a bunch of analysis trees, a tree-like data structure, and in this data structure, the query Planner will look for a suitable

Index, then make a permutation based on the situation, then calculate the cost of each of the permutations (similar to the computer-readable version of the explain output), and then compare the

The minimum cost of the polygon is selected and executed. So:
Explain select A.ruid,b.ruid from (select distinct ruid from usermsg where Createtime >= ' 2009-8-14 15:30:00 '

and createtime<= ' 2009-8-17 16:00:00 ') a LEFT join usermsg b on a.ruid = B.ruid and B.createtime < ' 2009-8-14 15:30: 00 '

where b.ruid is null;
And
Explain select ADD_TB. Ruid
-From (SELECT distinct Ruid
From Usermsg
, where createtime> ' 2009-8-14 15:30:00 '
createtime<= ' 2009-8-17 16:00:00 '
) ADD_TB
, where ADD_TB. Ruid
-Not IN (select DISTINCT Ruid
From Usermsg
, where createtime< ' 2009-8-14 15:30:00 '
);
Explain
+----+--------------------+------------+----------------+-----------------+------------+---------+------+------ +------------

------------------+
| ID | Select_type | Table | Type | Possible_keys | Key | Key_len | Ref | Rows | Extra

|
+----+--------------------+------------+----------------+-----------------+------------+---------+------+------ +------------

------------------+
| 1 | PRIMARY | <derived2> | All | NULL | NULL | NULL | NULL | 1248 | Using where

|
| 3 | DEPENDENT subquery | usermsg | Index_subquery | Ruid,createtime | Ruid | 96 | Func | 2 | Using index;

Using where |
| 2 | DERIVED | usermsg | Range | Createtime | Createtime | 9 | NULL | 3509 | Using where;

Using Temporary |
+----+--------------------+------------+----------------+-----------------+------------+---------+------+------ +------------

------------------+
The overhead is exactly the same, and the overhead can be derived from the Rows field (which is basically the product of the values of rows in that field, that is, the Cartesian product)
But: here's the following:
Explain select A.ruid,b.ruid from (select distinct ruid from usermsg where Createtime >= ' 2009-8-14 15:30:00 '

and createtime<= ' 2009-8-17 16:00:00 ') a LEFT join (select distinct ruid from usermsg where Createtime < ' 2009-8-14

15:30:00 ') b on a.ruid = B.RUID where b.ruid is null;
Execution Time 21.31s
+----+-------------+------------+-------+---------------+------------+---------+------+-------+---------------- -------------

-+
| ID | Select_type | Table | Type | Possible_keys | Key | Key_len | Ref | Rows | Extra

|
+----+-------------+------------+-------+---------------+------------+---------+------+-------+---------------- -------------

-+
| 1 | PRIMARY | <derived2> | All | NULL | NULL | NULL | NULL | 1248 |

|
| 1 | PRIMARY | <derived3> | All | NULL | NULL | NULL | NULL | 30308 | Using where; NOT EXISTS

|
| 3 | DERIVED | usermsg | All | Createtime | NULL | NULL | NULL | 69366 | Using where; Using Temporary

|
| 2 | DERIVED | usermsg | Range | Createtime | Createtime | 9 | NULL | 3510 | Using where; Using Temporary

|
+----+-------------+------------+-------+---------------+------------+---------+------+-------+---------------- -------------

-+
I'm just a little vague.
Why four lines?
And the middle two lines are huge.
Supposedly,
The query planner should be able to optimize the query to be the same as the previous two
(At least I have the same confidence in my familiar Pgsql database)
But not in MySQL.
So I think the query planner might be a little rough.
As I said before, the basic principle of optimization is to have the Cartesian product occur between the smallest possible set
So the last of the above is not a violation of this principle at least.
Although the B-table is very well-qualified, it is basically not indexed
However, it should not prevent the query optimizer from seeing the join on condition outside, which, like the previous two SQL, chooses a primary key for join
But as I said before, the role of the query planner
Theoretically speaking,
Iterate over all possible, calculate the overhead
It's reasonable.
I feel like the last way to do this is not to traverse all possible
Perhaps the reason is that the implementation of subqueries is still relatively simple?
Subqueries are a real challenge to the database.
Because it's basically a recursive thing.
So there's something wrong with this link, not surprising.
In fact, you think about it, the last one is nothing more than a variant of our first writing, the key in the Where condition of table B where
If you put it inside, you don't use the index to join.
Put it outside and you'll
This is, in itself, a possibility of permutations and combinations.

MySQL not in, left join, is NULL, not EXISTS efficiency problem record

Contact Us

The content source of this page is from Internet, which doesn't represent Alibaba Cloud's opinion; products and services mentioned on that page don't have any relationship with Alibaba Cloud. If the content of the page makes you feel confusing, please write us an email, we will handle the problem within 5 days after receiving your email.

If you find any instances of plagiarism from the community, please send an email to: info-contact@alibabacloud.com and provide relevant evidence. A staff member will contact you within 5 working days.

A Free Trial That Lets You Build Big!

Start building with 50+ products and up to 12 months usage for Elastic Compute Service

  • Sales Support

    1 on 1 presale consultation

  • After-Sales Support

    24/7 Technical Support 6 Free Tickets per Quarter Faster Response

  • Alibaba Cloud offers highly flexible support services tailored to meet your exact needs.