SQL optimization: efficiency of IN and EXISTS

Last Update:2014-12-08 Source: Internet

Author: User



The system needs its less efficient SQL statements optimized so that they run faster, which requires rewriting some IN / NOT IN statements as EXISTS / NOT EXISTS.

Here's how to modify it:

SQL statement using IN:

SELECT ID, category_id, htmlfile, title, CONVERT(varchar, begintime, 112) AS Pubtime
FROM Tab_oa_pub WHERE is_check = 1 AND
category_id IN (SELECT ID FROM tab_oa_pub_cate WHERE no = '1')
ORDER BY begintime DESC

SQL statement rewritten to use EXISTS:

SELECT ID, category_id, htmlfile, title, CONVERT(varchar, begintime, 112) AS Pubtime
FROM Tab_oa_pub WHERE is_check = 1 AND
EXISTS (SELECT ID FROM tab_oa_pub_cate WHERE Tab_oa_pub.category_id = CONVERT(int, no) AND no = '1')
ORDER BY begintime DESC

Analysis: is EXISTS really more efficient than IN?

Let's look at IN and EXISTS first.
SELECT * FROM t1 WHERE x IN (SELECT y FROM t2)
In effect, this can be understood as:
SELECT *
FROM t1, (SELECT DISTINCT y FROM t2) t2
WHERE t1.x = t2.y;
-- With some SQL tuning experience, it is natural to conclude from this that t2 must not be a big table, because the whole of t2 has to go through a "distinct sort"; if t2 is very large, the cost of that sort is intolerable. t1, however, can be very big. Why? The common explanation is that t1.x = t2.y can use an index, but that is not a good explanation. If both t1.x and t2.y are indexed, then, since an index is an ordered structure, the best plan for joining t1 and t2 is a merge join. In addition, an index on t2.y greatly improves the performance of the sort on t2.
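The equivalence described above can be checked on a toy data set. The following is a minimal sketch, not the author's test: it uses Python's built-in sqlite3 module as a stand-in database, and the table contents are made up for illustration. It only shows that the two forms return the same rows, and that DISTINCT is what keeps the join from repeating outer rows; it says nothing about how any particular optimizer actually executes IN.

```python
import sqlite3

# Hypothetical sample data; SQLite is used only to keep the example self-contained.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE t1 (x INTEGER);
    CREATE TABLE t2 (y INTEGER);
    INSERT INTO t1 VALUES (1), (2), (3), (4);
    INSERT INTO t2 VALUES (2), (2), (4), (5);  -- note the duplicate 2
""")

# The IN form from the article.
in_rows = conn.execute(
    "SELECT x FROM t1 WHERE x IN (SELECT y FROM t2) ORDER BY x"
).fetchall()

# The join against the de-duplicated subquery.
join_rows = conn.execute(
    "SELECT t1.x FROM t1, (SELECT DISTINCT y FROM t2) AS d "
    "WHERE t1.x = d.y ORDER BY t1.x"
).fetchall()

# Without DISTINCT, the join repeats x = 2 once per matching t2 row.
naive_rows = conn.execute(
    "SELECT t1.x FROM t1, t2 WHERE t1.x = t2.y ORDER BY t1.x"
).fetchall()

print(in_rows, join_rows, naive_rows)
```

The de-duplication step is the part the text refers to as the "distinct sort" on t2.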
SELECT * FROM t1 WHERE EXISTS (SELECT NULL FROM t2 WHERE y = x)
can be understood as:
FOR x IN (SELECT * FROM t1)
LOOP
    IF (EXISTS (SELECT NULL FROM t2 WHERE y = x.x))
    THEN
        OUTPUT THE RECORD
    END IF
END LOOP
-- This is easier to understand: t1 is always fully scanned. Therefore t1 must not be a large table, while t2 can be very large, because y = x.x can use the index on t2.y.
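The loop reading of EXISTS can be written out literally. The sketch below is an illustration under assumptions: it uses Python's sqlite3 module, made-up rows, and a hypothetical helper exists_by_loop that scans t1 and probes t2 once per outer row. It is a model of the pseudocode, not a description of a real execution plan.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE t1 (x INTEGER);
    CREATE TABLE t2 (y INTEGER);
    CREATE INDEX idx_t2_y ON t2 (y);  -- lets each probe be an index lookup
    INSERT INTO t1 VALUES (1), (2), (3), (4);
    INSERT INTO t2 VALUES (2), (2), (4), (5);
""")

def exists_by_loop(conn):
    """Scan every row of t1 and probe t2 once for each of them."""
    kept = []
    outer_rows = conn.execute("SELECT x FROM t1 ORDER BY x").fetchall()
    for (x,) in outer_rows:
        probe = conn.execute(
            "SELECT 1 FROM t2 WHERE y = ? LIMIT 1", (x,)
        ).fetchone()
        if probe is not None:  # at least one matching row exists
            kept.append((x,))
    return kept

# The declarative EXISTS form from the article.
exists_rows = conn.execute(
    "SELECT x FROM t1 WHERE EXISTS (SELECT NULL FROM t2 WHERE y = x) ORDER BY x"
).fetchall()
loop_rows = exists_by_loop(conn)

print(exists_rows, loop_rows)
```

The number of probes equals the number of rows in t1, which is why the text wants t1 small and t2.y indexed.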

Based on the above discussion of IN and EXISTS, we can draw a general conclusion: IN is suitable when the outer table is large and the inner table is small; EXISTS is suitable when the outer table is small and the inner table is large.

Optimization should be based on the actual situation. Neither form can be called slower in absolute terms; it is all relative.

Differences between IN and EXISTS, and an analysis of SQL execution efficiency

This article analyzes the differences between IN and EXISTS and their effect on SQL execution efficiency.

Many forums have recently been discussing the difference between IN and EXISTS and its effect on SQL execution efficiency. This article addresses some of those points.

In SQL, IN can be divided into three categories:

1. Statements of the form SELECT * FROM t1 WHERE f1 IN ('a', 'b'). These should be compared for efficiency with the following two forms:

SELECT * FROM t1 WHERE f1 = 'a' OR f1 = 'b'

or SELECT * FROM t1 WHERE f1 = 'a' UNION ALL SELECT * FROM t1 WHERE f1 = 'b'

This is probably not the category you have in mind, so it is not discussed here.
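For completeness, the three forms in the first category can be compared on a small table. This is a minimal sketch with made-up rows, run through Python's sqlite3 module; it only confirms that the three spellings select the same rows here, and makes no claim about their relative speed.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE t1 (id INTEGER PRIMARY KEY, f1 TEXT);
    INSERT INTO t1 (f1) VALUES ('a'), ('b'), ('c'), ('a');
""")

# IN with a literal list.
in_list = conn.execute(
    "SELECT id FROM t1 WHERE f1 IN ('a', 'b') ORDER BY id"
).fetchall()

# The same filter spelled with OR.
or_form = conn.execute(
    "SELECT id FROM t1 WHERE f1 = 'a' OR f1 = 'b' ORDER BY id"
).fetchall()

# The same filter spelled as two branches glued with UNION ALL.
union_form = conn.execute("""
    SELECT id FROM t1 WHERE f1 = 'a'
    UNION ALL
    SELECT id FROM t1 WHERE f1 = 'b'
    ORDER BY id
""").fetchall()

print(in_list, or_form, union_form)
```

The UNION ALL form matches only because the two literals are different; with a repeated literal it would return the matching rows twice.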

2. Statements of the form SELECT * FROM t1 WHERE f1 IN (SELECT f1 FROM t2 WHERE t2.fx = 'x'),

where the subquery's condition does not depend on the outer query. The optimizer generally converts such queries into EXISTS statements automatically, so their efficiency is the same as that of EXISTS.

3. Statements of the form SELECT * FROM t1 WHERE f1 IN (SELECT f1 FROM t2 WHERE t2.fx = t1.fx),

where the subquery's condition depends on the outer query. The efficiency of such queries depends on the indexes on the correlated columns and on the amount of data; they are generally considered less efficient than EXISTS.

Apart from the first category, IN statements can all be converted into EXISTS statements. The usual programming habit should be to use EXISTS rather than IN, after which the relative execution efficiency of the two rarely needs to be considered.

Analysis of the SQL execution efficiency of IN and EXISTS

Given two tables, A and B:

(1) When only one table's data is displayed, such as A, and there is a single join condition, such as ID, IN is faster:

SELECT * FROM A WHERE ID IN (SELECT ID FROM B)

(2) When only one table's data is displayed, such as A, and there is more than one join condition, such as id and col1, IN is inconvenient and you can use EXISTS:

SELECT * FROM A

WHERE EXISTS (SELECT 1 FROM B WHERE B.id = A.id AND B.col1 = A.col1)
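To see why a multi-column condition is awkward with IN, consider the sketch below. It is an illustration only: the rows are invented, SQLite (via Python's sqlite3) stands in for the real database, and the "two separate IN tests" query is a deliberately loose rewrite shown for contrast, not something the article proposes.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE A (id INTEGER, col1 TEXT);
    CREATE TABLE B (id INTEGER, col1 TEXT);
    INSERT INTO A VALUES (1, 'x'), (2, 'x'), (3, 'z');
    INSERT INTO B VALUES (1, 'x'), (2, 'q'), (3, 'z'), (3, 'z');
""")

# EXISTS requires one row of B to match on both columns at once.
exists_rows = conn.execute("""
    SELECT id, col1 FROM A
    WHERE EXISTS (SELECT 1 FROM B WHERE B.id = A.id AND B.col1 = A.col1)
    ORDER BY id
""").fetchall()

# Two independent IN tests let id and col1 match in different rows of B.
loose_rows = conn.execute("""
    SELECT id, col1 FROM A
    WHERE id IN (SELECT id FROM B) AND col1 IN (SELECT col1 FROM B)
    ORDER BY id
""").fetchall()

print(exists_rows)
print(loose_rows)
```

Row (2, 'x') passes the loose version because id 2 and value 'x' both occur somewhere in B, just not in the same row.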

(3) When data from both tables is displayed, neither IN nor EXISTS is appropriate; use a join:

SELECT * FROM A LEFT JOIN B ON B.id = A.id
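A small sketch of the join case, assuming two invented tables with a name column each and using Python's sqlite3 module: unlike IN or EXISTS, the join puts columns from both tables in the result, and LEFT JOIN keeps rows of A that have no partner in B.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE A (id INTEGER, name TEXT);
    CREATE TABLE B (id INTEGER, name TEXT);
    INSERT INTO A VALUES (1, 'a1'), (2, 'a2');
    INSERT INTO B VALUES (1, 'b1');
""")

# Columns from both tables appear in the output; unmatched B columns are NULL.
joined = conn.execute("""
    SELECT A.id, A.name, B.name
    FROM A LEFT JOIN B ON B.id = A.id
    ORDER BY A.id
""").fetchall()

print(joined)
```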

So which form to use depends on your requirements.

In practice this is usually settled by testing.

These are the results of my test:

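SET STATISTICS IO ON
-- Note: the unqualified id below binds to syscolumns.id, so this subquery is not correlated with sysobjects
SELECT * FROM sysobjects WHERE EXISTS (SELECT 1 FROM syscolumns WHERE id = syscolumns.id)
SELECT * FROM sysobjects WHERE ID IN (SELECT ID FROM syscolumns)
SET STATISTICS IO OFF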

(47 rows affected)

Table 'syscolpars'. Scan count 1, logical reads 3, physical reads 0, read-ahead reads 2, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.

Table 'sysschobjs'. Scan count 1, logical reads 3, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.

(1 row affected)

(44 rows affected)

Table 'syscolpars'. Scan count 47, logical reads 97, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.

Table 'sysschobjs'. Scan count 1, logical reads 3, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.

(1 row affected)
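The two statements in this first test return different row counts (47 and 44), which suggests they are not equivalent. One likely explanation, offered here as a reading of the query text rather than something the article states, is name resolution: inside the subquery, an unqualified id binds to the inner table, so id = syscolumns.id compares the column with itself. The sketch below reproduces the effect with two hypothetical tables, objs and cols, in SQLite via Python's sqlite3 module.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE objs (id INTEGER, name TEXT);
    CREATE TABLE cols (id INTEGER, name TEXT);
    INSERT INTO objs VALUES (1, 't1'), (2, 't2'), (3, 'no_columns');
    INSERT INTO cols VALUES (1, 'a'), (1, 'b'), (2, 'c');
""")

# Unqualified id resolves to cols.id, so the predicate is cols.id = cols.id.
unqualified = conn.execute("""
    SELECT id FROM objs
    WHERE EXISTS (SELECT 1 FROM cols WHERE id = cols.id)
    ORDER BY id
""").fetchall()

# Qualifying both sides makes the subquery genuinely correlated.
qualified = conn.execute("""
    SELECT id FROM objs
    WHERE EXISTS (SELECT 1 FROM cols WHERE cols.id = objs.id)
    ORDER BY id
""").fetchall()

# The IN form for comparison.
in_form = conn.execute(
    "SELECT id FROM objs WHERE id IN (SELECT id FROM cols) ORDER BY id"
).fetchall()

print(unqualified, qualified, in_form)
```

With the columns qualified, EXISTS and IN agree; the unqualified version returns every outer row as long as the inner table is not empty, so its I/O figures are not a like-for-like comparison.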

SET STATISTICS IO on
SELECT * from syscolumns where exists (select 1 from sysobjects where id=syscolumns.id)
SELECT * from syscolumns where ID in (select ID from sysobjects)
SET STATISTICS IO off

(419 rows affected)

Table 'syscolpars'. Scan count 1, logical reads 10, physical reads 0, read-ahead reads 15, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.

Table 'sysschobjs'. Scan count 1, logical reads 3, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.

(1 row affected)

(419 rows affected)

Table 'syscolpars'. Scan count 1, logical reads 10, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.

Table 'sysschobjs'. Scan count 1, logical reads 3, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.

(1 row affected)

Test results (overall, EXISTS is more efficient than IN):

Efficiency: the indexes on the condition columns are critical.

With syscolumns in the subquery (syscolumns holds more data than sysobjects):

With IN: scan count 47, logical reads 97

With EXISTS: scan count 1, logical reads 3

With sysobjects in the subquery (sysobjects holds less data than syscolumns):

EXISTS performed 15 more read-ahead reads than IN

I remember running the following test:

Table: test

Structure:

id int identity, -- ID, primary key, auto-increment

sort int, -- category; every 1000 rows form one category

sid int -- category ID

Insert 6 million rows.

To query the row with the maximum sid in each category,

SELECT * FROM test a
WHERE NOT EXISTS (SELECT 1 FROM test WHERE sort = a.sort AND sid > a.sid)

is more than three times as efficient to execute as

SELECT * FROM test a
WHERE sid IN (SELECT MAX(sid) FROM test WHERE sort = a.sort)

I have forgotten the exact execution times, but I remember the result very clearly. Until then I had been advocating the second form; afterwards I switched to the first.
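The two "maximum per category" queries can be checked for equivalence on a handful of rows. This is a sketch under assumptions: five invented rows instead of the 6 million in the test, SQLite via Python's sqlite3 module instead of the original database, and no timing, so it confirms only that both forms select the same rows.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE test (id INTEGER PRIMARY KEY, sort INTEGER, sid INTEGER);
    INSERT INTO test (sort, sid) VALUES (1, 10), (1, 30), (1, 20), (2, 5), (2, 7);
""")

# Keep a row when no other row in the same category has a larger sid.
not_exists_rows = conn.execute("""
    SELECT id, sort, sid FROM test a
    WHERE NOT EXISTS (SELECT 1 FROM test WHERE sort = a.sort AND sid > a.sid)
    ORDER BY id
""").fetchall()

# Keep a row when its sid equals the maximum sid of its category.
in_max_rows = conn.execute("""
    SELECT id, sort, sid FROM test a
    WHERE sid IN (SELECT MAX(sid) FROM test WHERE sort = a.sort)
    ORDER BY id
""").fetchall()

print(not_exists_rows, in_max_rows)
```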

To round off the analysis of IN and EXISTS execution efficiency, here is one more simple example:

DECLARE @t TABLE (id int identity, v varchar(10))
INSERT @t SELECT 'a'
UNION ALL SELECT 'b'
UNION ALL SELECT 'c'
UNION ALL SELECT 'd'
UNION ALL SELECT 'e'
UNION ALL SELECT 'b'
UNION ALL SELECT 'c'
-- Statement A: written with IN
SELECT * FROM @t WHERE v IN (SELECT v FROM @t GROUP BY v HAVING COUNT(*) > 1)
-- Statement B: written with EXISTS
SELECT * FROM @t a WHERE EXISTS (SELECT 1 FROM @t WHERE id != a.id AND v = a.v)

Both statements find the records in table variable @t whose v value is duplicated.
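A quick way to confirm that statements A and B select the same records is to run both on the same seven values. The sketch below assumes an ordinary table named t in place of the table variable @t, and uses SQLite through Python's sqlite3 module; both are substitutions made for the sake of a self-contained example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE t (id INTEGER PRIMARY KEY, v TEXT);
    INSERT INTO t (v) VALUES ('a'), ('b'), ('c'), ('d'), ('e'), ('b'), ('c');
""")

# Statement A: uncorrelated subquery, evaluated independently of the outer row.
rows_a = conn.execute("""
    SELECT id, v FROM t
    WHERE v IN (SELECT v FROM t GROUP BY v HAVING COUNT(*) > 1)
    ORDER BY id
""").fetchall()

# Statement B: correlated subquery, evaluated against each outer row.
rows_b = conn.execute("""
    SELECT id, v FROM t a
    WHERE EXISTS (SELECT 1 FROM t WHERE id != a.id AND v = a.v)
    ORDER BY id
""").fetchall()

print(rows_a, rows_b)
```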

The first statement uses IN, and its subquery has no reference to the outer query.

The second statement uses EXISTS, and its subquery is correlated with the outer query.

Look at the query plans and the difference is very clear.

SELECT v FROM @t GROUP BY v HAVING COUNT(*) > 1

The execution of this statement does not depend on the main query (I am not sure of the proper terms for the outer and inner parts of an IN, so I will call them the main query and the subquery).

So the statement is optimized at query time and its result set is cached.

That is, the cache holds:

v

---

b

c

For every subsequent step of the main query, the effect is as if it were evaluating WHERE v IN ('b', 'c'). The statement is of course not literally rewritten this way; this only illustrates the idea. As the main query processes each row (call it CurrentRow), the subquery does not scan the table again and only matches against the cached result.

By contrast,

SELECT 1 FROM @t WHERE id != a.id AND v = a.v

produces a result that depends on each row of the main query.

When the main query is on its first row, CurrentRow (id=1), the subquery runs as SELECT 1 FROM @t WHERE id != 1 AND v = 'a' and scans the entire table. It starts at the first row, CurrentSubRow (id=1): the id is the same, so the row is filtered out and the subquery moves down. At CurrentSubRow (id=2) the id is different but the v value does not match, so the subquery keeps moving down ... until, after CurrentSubRow (id=7), no match has been found. Subquery processing ends, the first row CurrentRow (id=1) is filtered out, and the main query moves down to the next row.

When processing the second row, CurrentRow (id=2), the subquery is SELECT 1 FROM @t WHERE id != 2 AND v = 'b'. On the first row, CurrentSubRow (id=1), the v value does not match, so the subquery moves down; on the second row the id is the same, so it is filtered out; the third row ... at the sixth row the id is different and the v value matches, so a match is found and the subquery returns immediately without processing any further rows. The main query moves down.

The third row is handled the same way, and so on.
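The walkthrough above can be mirrored in a few lines of plain Python. This is a toy model of the narrative only, under the assumption that the engine behaves exactly as described (cached subquery result for IN, row-by-row scan with early exit for EXISTS); the function names are hypothetical and real optimizers may choose different plans.

```python
# The seven rows of @t as (id, v) pairs.
rows = [(1, 'a'), (2, 'b'), (3, 'c'), (4, 'd'), (5, 'e'), (6, 'b'), (7, 'c')]

def in_with_cached_subquery(rows):
    """Evaluate the uncorrelated subquery once, then match each row against it."""
    counts = {}
    for _id, v in rows:  # one pass: GROUP BY v ... COUNT(*)
        counts[v] = counts.get(v, 0) + 1
    cached = {v for v, n in counts.items() if n > 1}  # HAVING COUNT(*) > 1
    kept = [row for row in rows if row[1] in cached]
    return kept, cached

def exists_with_row_by_row_scan(rows):
    """Re-scan the table for every outer row, stopping at the first match."""
    kept, visits = [], 0
    for cur_id, cur_v in rows:        # CurrentRow
        for sub_id, sub_v in rows:    # CurrentSubRow
            visits += 1
            if sub_id != cur_id and sub_v == cur_v:
                kept.append((cur_id, cur_v))
                break                 # match found, stop scanning
    return kept, visits

in_kept, cached = in_with_cached_subquery(rows)
exists_kept, visits = exists_with_row_by_row_scan(rows)
print(in_kept, sorted(cached))
print(exists_kept, visits)
```

In this model the EXISTS version visits 39 inner rows for 7 outer rows, while the IN version reads the table once to build the cache and once to filter.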

For SQL optimization, should you use IN or EXISTS? It mainly depends on whether your filter is on the main query or on the subquery.
