Comparing the Efficiency of NOT IN, LEFT JOIN ... IS NULL, and NOT EXISTS
Statement 1: select count(*) from A where A.a not in (select a from B)
Statement 2: select count(*) from A left join B on A.a = B.a where B.a is null
Statement 3: select count(*) from A where not exists (select a from B where A.a = B.a)
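Before comparing performance, it is worth sanity-checking that the three forms really return the same result, and noting the one semantic trap of the NOT IN form. The sketch below uses Python's sqlite3 purely for illustration; the tables A and B and their contents are toy data, not from the original test:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE A (a INTEGER)")
cur.execute("CREATE TABLE B (a INTEGER)")
cur.executemany("INSERT INTO A VALUES (?)", [(i,) for i in range(1, 6)])  # A.a = 1..5
cur.executemany("INSERT INTO B VALUES (?)", [(i,) for i in (2, 4)])       # B.a = 2, 4

q1 = "SELECT count(*) FROM A WHERE A.a NOT IN (SELECT a FROM B)"
q2 = "SELECT count(*) FROM A LEFT JOIN B ON A.a = B.a WHERE B.a IS NULL"
q3 = "SELECT count(*) FROM A WHERE NOT EXISTS (SELECT a FROM B WHERE A.a = B.a)"

results = [cur.execute(q).fetchone()[0] for q in (q1, q2, q3)]
print(results)  # [3, 3, 3] -- all three forms agree

# The NOT IN trap: a NULL in B.a silently empties the NOT IN result,
# while the other two forms are unaffected.
cur.execute("INSERT INTO B VALUES (NULL)")
results_with_null = [cur.execute(q).fetchone()[0] for q in (q1, q2, q3)]
print(results_with_null)  # [0, 3, 3]
```

The caveat matters in practice: if the subquery column can contain NULL, the statements are no longer equivalent, and only the LEFT JOIN and NOT EXISTS forms behave as intended.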
I have long known that these three statements produce the same result, but I had never looked into how their efficiency compares; I always assumed Statement 2 was the fastest.
At work today I had to clear out a database with tens of millions of rows, which meant deleting more than 20 million of them, and the job made heavy use of the three patterns above. With Statement 1, the run took 1 hour 32 minutes and the log file grew to 21 GB. The time was acceptable, but that much disk usage was a problem, so I replaced everything with Statement 2, expecting it to be faster. To my surprise, after more than 40 minutes it had not even finished the first batch of 50,000 rows; instead, SQL Server crashed. I then ran the statements separately against a table with nearly 10 million rows: Statement 1 took 4 seconds, Statement 2 took 18 seconds, a considerable gap. Statement 3 proved slightly more efficient still than Statement 1.
The second form is the one to avoid whenever possible; the first and third are essentially the same.
Assuming the buffer pool is large enough, form 2 has several shortcomings compared with form 1:
(1) The left join itself consumes more resources (more are needed to process the intermediate result set).
(2) The intermediate result set of the left join is no smaller than table A itself.
(3) Statement 2 must additionally apply the is null filter to the intermediate result produced by the left join, whereas Statement 1 completes its filtering while joining the two sets; that extra pass is pure overhead.
Together these three points produce an obvious difference (mainly memory and CPU overhead) when processing massive data. I suspect the buffer pool became saturated during the test; at that point the extra overhead of Statement 2 has to spill to virtual memory on disk, and once SQL Server starts paging, the slow I/O makes the gap even wider.
A large log file is normal when deleting that many records. Depending on how the database is used, you can set the recovery model to simple, or truncate the log after the delete and shrink the file.
We used to clear databases with an unconditional script, i.e. delete all the data from the large tables. But the customer's requirements ruled out truncate table, since it could damage the existing database structure, so only delete could be used, and back then the log file also grew huge. The fix at the time was batch deletion: set rowcount @chunk on SQL 2000 and delete top (@chunk) on SQL 2005. That not only cut the deletion time dramatically but also reduced the log volume by about 1 GB.
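The chunked-delete pattern described above (SET ROWCOUNT / DELETE TOP on SQL Server) can be sketched generically. This is an illustrative sqlite3 version of the idea, not the original script; the table name big, the flag condition, and the chunk size are all made up for the demo:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE big (id INTEGER PRIMARY KEY, flag INTEGER)")
cur.executemany("INSERT INTO big (flag) VALUES (?)",
                [(i % 2,) for i in range(10_000)])  # 5000 rows with flag = 1

CHUNK = 1000
deleted = 0
while True:
    # Delete one bounded batch, then commit, so each transaction
    # (and the log it generates) stays small.
    cur.execute("DELETE FROM big WHERE rowid IN "
                "(SELECT rowid FROM big WHERE flag = 1 LIMIT ?)", (CHUNK,))
    conn.commit()
    if cur.rowcount == 0:
        break
    deleted += cur.rowcount

remaining = cur.execute("SELECT count(*) FROM big").fetchone()[0]
print(deleted, remaining)  # 5000 5000
```

The point of the pattern is that each commit bounds the amount of undo/redo the engine must keep, which is why it keeps the transaction log small on SQL Server as well.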
This time, however, the clearing operation had to carry a condition, i.e. delete A from A where ... with a filter attached, and with that condition the batch-deletion trick was no longer effective.
I wonder why.
Notes on MySQL NOT IN vs LEFT JOIN efficiency
The job of this SQL is to find the rows that are in set A but not in set B.
The NOT IN statement
The code is as follows:
select add_tb.RUID
from (select distinct RUID
      from UserMsg
      where SubjectID = 12
        and CreateTime > '2017-8-14 15:30:00'
        and CreateTime <= '2017-8-17 16:00:00'
     ) add_tb
where add_tb.RUID not in
      (select distinct RUID
       from UserMsg
       where SubjectID = 12
         and CreateTime < '2017-8-14 15:30:00'
      )
444 rows returned in 0.07 sec.
Explain result
+----+--------------------+------------+----------------+---------------------------+------------+---------+------+------+------------------------------+
| id | select_type        | table      | type           | possible_keys             | key        | key_len | ref  | rows | Extra                        |
+----+--------------------+------------+----------------+---------------------------+------------+---------+------+------+------------------------------+
|  1 | PRIMARY            | <derived2> | ALL            | NULL                      | NULL       | NULL    | NULL |  452 | Using where                  |
|  3 | DEPENDENT SUBQUERY | UserMsg    | index_subquery | RUID,SubjectID,CreateTime | RUID       | 96      | func |    2 | Using index; Using where     |
|  2 | DERIVED            | UserMsg    | range          | SubjectID,CreateTime      | CreateTime | 9       | NULL | 1857 | Using where; Using temporary |
+----+--------------------+------------+----------------+---------------------------+------------+---------+------+------+------------------------------+
Analysis: the query is fast because the result of the id = 2 step is small, so the id = 1 step runs quickly. The id = 2 step uses a temporary table; I wonder whether an index is still used at that point.
The first LEFT JOIN version
The code is as follows:
select a.ruid, b.ruid
from (select distinct RUID
      from UserMsg
      where SubjectID = 12
        and CreateTime >= '2017-8-14 15:30:00'
        and CreateTime <= '2017-8-17 16:00:00'
     ) a
left join (
      select distinct RUID
      from UserMsg
      where SubjectID = 12 and CreateTime < '2017-8-14 15:30:00'
     ) b on a.ruid = b.ruid
where b.ruid is null
444 rows returned in 0.39 sec.
Explain result
+----+-------------+------------+-------+----------------------+------------+---------+-------+------+------------------------------+
| id | select_type | table      | type  | possible_keys        | key        | key_len | ref   | rows | Extra                        |
+----+-------------+------------+-------+----------------------+------------+---------+-------+------+------------------------------+
|  1 | PRIMARY     | <derived2> | ALL   | NULL                 | NULL       | NULL    | NULL  |  452 |                              |
|  1 | PRIMARY     | <derived3> | ALL   | NULL                 | NULL       | NULL    | NULL  | 1112 | Using where; Not exists      |
|  3 | DERIVED     | UserMsg    | ref   | SubjectID,CreateTime | SubjectID  | 5       | const | 6667 | Using where; Using temporary |
|  2 | DERIVED     | UserMsg    | range | SubjectID,CreateTime | CreateTime | 9       | NULL  | 1838 | Using where; Using temporary |
+----+-------------+------------+-------+----------------------+------------+---------+-------+------+------------------------------+
Analysis: two temporary tables are built and then joined in Cartesian fashion, so no index can be used and the data volume handled is large.
Another LEFT JOIN version
The code is as follows:
select distinct a.RUID
from UserMsg a
left join UserMsg b
  on a.ruid = b.ruid
  and b.subjectID = 12 and b.createTime < '2017-8-14 15:30:00'
where a.subjectID = 12
  and a.createTime >= '2017-8-14 15:30:00'
  and a.createTime <= '2017-8-17 16:00:00'
  and b.ruid is null;
444 rows returned in 0.07 sec.
Explain result
+----+-------------+-------+-------+---------------------------+------------+---------+--------------+------+-----------------------------------+
| id | select_type | table | type  | possible_keys             | key        | key_len | ref          | rows | Extra                             |
+----+-------------+-------+-------+---------------------------+------------+---------+--------------+------+-----------------------------------+
|  1 | SIMPLE      | a     | range | SubjectID,CreateTime      | CreateTime | 9       | NULL         | 1839 | Using where; Using temporary      |
|  1 | SIMPLE      | B     | ref   | RUID,SubjectID,CreateTime | RUID       | 96      | dream.a.RUID |    2 | Using where; Not exists; Distinct |
+----+-------------+-------+-------+---------------------------+------------+---------+--------------+------+-----------------------------------+
Analysis: both sides of the join use indexes and are evaluated together in a single pass, so query efficiency should be high.
The NOT EXISTS version
The code is as follows:
select distinct a.ruid
from UserMsg a
where a.subjectID = 12
  and a.createTime >= '2017-8-14 15:30:00'
  and a.createTime <= '2017-8-17 16:00:00'
  and not exists (
      select distinct RUID
      from UserMsg
      where subjectID = 12 and createTime < '2017-8-14 15:30:00'
        and ruid = a.ruid
  )
444 rows returned in 0.08 sec.
Explain result
+----+--------------------+---------+-------+---------------------------+------------+---------+--------------+------+------------------------------+
| id | select_type        | table   | type  | possible_keys             | key        | key_len | ref          | rows | Extra                        |
+----+--------------------+---------+-------+---------------------------+------------+---------+--------------+------+------------------------------+
|  1 | PRIMARY            | a       | range | SubjectID,CreateTime      | CreateTime | 9       | NULL         | 1839 | Using where; Using temporary |
|  2 | DEPENDENT SUBQUERY | UserMsg | ref   | RUID,SubjectID,CreateTime | RUID       | 96      | dream.a.RUID |    2 | Using where                  |
+----+--------------------+---------+-------+---------------------------+------------+---------+--------------+------+------------------------------+
Analysis: basically the same as above, except the two steps execute sequentially, so efficiency is slightly lower than that of the third version.
To stress the queries further, remove the SubjectID = 12 constraint from each of them. The results, in the order the queries were presented:
not in: 0.20 s
left join (two derived tables): 21.31 s
left join (on-clause filter): 0.25 s
not exists: 0.43 s
Summary of the problem, with analysis help from Laserhe
The code is as follows:
select a.ruid, b.ruid
from (select distinct RUID
      from UserMsg
      where CreateTime >= '2017-8-14 15:30:00'
        and CreateTime <= '2017-8-17 16:00:00'
     ) a
left join UserMsg b
  on a.ruid = b.ruid
  and b.createTime < '2017-8-14 15:30:00'
where b.ruid is null;
Execution time: 0.13 s
+----+-------------+------------+-------+-----------------+------------+---------+--------+------+------------------------------+
| id | select_type | table      | type  | possible_keys   | key        | key_len | ref    | rows | Extra                        |
+----+-------------+------------+-------+-----------------+------------+---------+--------+------+------------------------------+
|  1 | PRIMARY     | <derived2> | ALL   | NULL            | NULL       | NULL    | NULL   | 1248 |                              |
|  1 | PRIMARY     | B          | ref   | RUID,CreateTime | RUID       | 96      | a.RUID |    2 | Using where; Not exists      |
|  2 | DERIVED     | UserMsg    | range | CreateTime      | CreateTime | 9       | NULL   | 3553 | Using where; Using temporary |
+----+-------------+------------+-------+-----------------+------------+---------+--------+------+------------------------------+
Execution efficiency is similar to that of NOT IN.
The basic principle of database optimization: take Cartesian products between the smallest possible sets. MySQL can scan directly through an index during a join, but once the join partner is buried in a subquery, the query planner no longer knows how to use an appropriate index.
Inside the database, an SQL statement is optimized roughly like this: it is first parsed into an analysis tree, a tree-shaped data structure. Working on that structure, the query planner looks for applicable indexes, enumerates the possible orderings and combinations, estimates the cost of each candidate plan (similar to the machine-readable version of what EXPLAIN prints), and then picks and executes the cheapest one. So:
explain select a.ruid, b.ruid
from (select distinct RUID from UserMsg
      where CreateTime >= '2017-8-14 15:30:00'
        and CreateTime <= '2017-8-17 16:00:00') a
left join UserMsg b
  on a.ruid = b.ruid and b.createTime < '2017-8-14 15:30:00'
where b.ruid is null;

and

explain select add_tb.RUID
from (select distinct RUID
      from UserMsg
      where CreateTime > '2017-8-14 15:30:00'
        and CreateTime <= '2017-8-17 16:00:00'
     ) add_tb
where add_tb.RUID not in
      (select distinct RUID
       from UserMsg
       where CreateTime < '2017-8-14 15:30:00'
      );
Explain result
+----+--------------------+------------+----------------+-----------------+------------+---------+------+------+------------------------------+
| id | select_type        | table      | type           | possible_keys   | key        | key_len | ref  | rows | Extra                        |
+----+--------------------+------------+----------------+-----------------+------------+---------+------+------+------------------------------+
|  1 | PRIMARY            | <derived2> | ALL            | NULL            | NULL       | NULL    | NULL | 1248 | Using where                  |
|  3 | DEPENDENT SUBQUERY | UserMsg    | index_subquery | RUID,CreateTime | RUID       | 96      | func |    2 | Using index; Using where     |
|  2 | DERIVED            | UserMsg    | range          | CreateTime      | CreateTime | 9       | NULL | 3509 | Using where; Using temporary |
+----+--------------------+------------+----------------+-----------------+------------+---------+------+------+------------------------------+
The costs are essentially identical. The cost can be read off the rows column (it is basically the product of the rows values across the plan's rows, i.e. the Cartesian product).
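The rows-product heuristic just described can be made concrete. The numbers below are copied from the EXPLAIN outputs shown in this article; the product is only a back-of-envelope cost indicator, not MySQL's actual cost model:

```python
from math import prod

# rows estimates taken from the EXPLAIN plans above
plan_fast = [1248, 2, 3509]              # not in / on-clause left join plans
plan_slow = [1248, 30308, 69366, 3510]   # the 21-second two-derived-table plan

def rough_cost(rows):
    # joined row sets multiply, so the product of the per-step rows
    # estimates gives a crude upper bound on the work done
    return prod(rows)

print(rough_cost(plan_fast))
print(rough_cost(plan_slow))
```

Even this crude estimate puts the slow plan many orders of magnitude above the fast ones, which matches the measured 0.13 s vs 21.31 s gap.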
But the following:
explain select a.ruid, b.ruid
from (select distinct RUID from UserMsg
      where CreateTime >= '2017-8-14 15:30:00'
        and CreateTime <= '2017-8-17 16:00:00') a
left join (select distinct RUID from UserMsg
           where createTime < '2017-8-14 15:30:00') b
  on a.ruid = b.ruid
where b.ruid is null;
Execution time: 21.31 s
+----+-------------+------------+-------+---------------+------------+---------+------+-------+------------------------------+
| id | select_type | table      | type  | possible_keys | key        | key_len | ref  | rows  | Extra                        |
+----+-------------+------------+-------+---------------+------------+---------+------+-------+------------------------------+
|  1 | PRIMARY     | <derived2> | ALL   | NULL          | NULL       | NULL    | NULL |  1248 |                              |
|  1 | PRIMARY     | <derived3> | ALL   | NULL          | NULL       | NULL    | NULL | 30308 | Using where; Not exists      |
|  3 | DERIVED     | UserMsg    | ALL   | CreateTime    | NULL       | NULL    | NULL | 69366 | Using where; Using temporary |
|  2 | DERIVED     | UserMsg    | range | CreateTime    | CreateTime | 9       | NULL |  3510 | Using where; Using temporary |
+----+-------------+------------+-------+---------------+------------+---------+------+-------+------------------------------+
This I don't understand: why four rows, and why are the two middle ones so huge?

In principle, the query planner ought to be able to optimize this query the same way as the two above (at least I would have that confidence in PostgreSQL, which I know well), but MySQL does not, so I suspect its query planner is still rather rough.

As mentioned earlier, the basic principle of optimization is to take Cartesian products between the smallest possible sets, and this last statement at least does not violate that principle. Table B essentially cannot use an index because its subquery carries so many conditions; still, that should not stop the query optimizer from noticing the outer join's ON condition and choosing the key for the join, as in the previous two SQL statements. But recall what the query planner's job is: in theory it traverses all possible plans and computes the cost of each, which is the reasonable thing to do. My feeling is that for this last form it does not traverse all the possibilities. The likely cause is the implementation of subqueries, or perhaps the planner here is simply relatively naive. Subqueries really are a challenge for a database, because they are essentially recursive, so it is not surprising that something is lacking in this area.

In fact, if you think about it, the last form is just a variant of our first form; the key difference is where the WHERE condition on table B sits. Put it inside the subquery, and the join cannot use the index; put it outside, in the ON clause, and it can. That placement is exactly one of the permutations the planner should be weighing.
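To close the loop, here is a small sqlite3 sketch showing that the two placements of the B-side filter, in the ON clause versus inside a derived table, really do return the same rows even though MySQL prices them very differently. The data is a toy stand-in for UserMsg, with zero-padded dates so plain string comparison works:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE UserMsg (RUID INTEGER, CreateTime TEXT)")
cur.executemany("INSERT INTO UserMsg VALUES (?, ?)", [
    (1, "2017-08-13"), (1, "2017-08-15"),
    (2, "2017-08-15"), (3, "2017-08-16"),
])

# Filter on B in the ON clause (the fast form)
fast = cur.execute("""
    SELECT a.RUID
    FROM (SELECT DISTINCT RUID FROM UserMsg
          WHERE CreateTime >= '2017-08-14') a
    LEFT JOIN UserMsg b
      ON a.RUID = b.RUID AND b.CreateTime < '2017-08-14'
    WHERE b.RUID IS NULL
""").fetchall()

# Filter on B inside a derived table (the slow form)
slow = cur.execute("""
    SELECT a.RUID
    FROM (SELECT DISTINCT RUID FROM UserMsg
          WHERE CreateTime >= '2017-08-14') a
    LEFT JOIN (SELECT DISTINCT RUID FROM UserMsg
               WHERE CreateTime < '2017-08-14') b
      ON a.RUID = b.RUID
    WHERE b.RUID IS NULL
""").fetchall()

print(sorted(fast), sorted(slow))  # [(2,), (3,)] [(2,), (3,)]
```

RUID 1 also appears before the cutoff, so it is excluded; RUIDs 2 and 3 survive in both variants. The equivalence of results is exactly why the 160x runtime gap comes down to the planner, not the semantics.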