The difference between IN and NOT IN, EXISTS and NOT EXISTS in SQL

Source: Internet
Author: User

1, IN and EXISTS

IN joins the outer table and the inner table with a hash join, whereas EXISTS loops over the outer table and queries the inner table on each iteration. The long-standing claim that EXISTS is always more efficient than IN is therefore inaccurate. If the two tables in the query are about the same size, there is little difference between IN and EXISTS. If one table is small and the other large, use EXISTS when the subquery table is the large one and IN when the subquery table is the small one.

Example: Table A (small table), table B (large table)

select * from A where cc in (select cc from B);  --> inefficient: uses the index on the cc column of table A
select * from A where exists (select cc from B where cc = A.cc);  --> efficient: uses the index on the cc column of table B

On the contrary:

select * from B where cc in (select cc from A);  --> efficient: uses the index on the cc column of table B
select * from B where exists (select cc from A where cc = B.cc);  --> inefficient: uses the index on the cc column of table A
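The two query forms above differ in how they are executed, not in what they return. As a minimal sketch, assuming Python's built-in sqlite3 module and two hypothetical tables A (small) and B (large) filled with made-up values, the following checks that the IN form and the EXISTS form produce the same rows. It says nothing about which plan a particular engine will pick.

```python
import sqlite3

# Hypothetical tables: A is small, B is large, both have an index on cc.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE A (cc INTEGER);
    CREATE TABLE B (cc INTEGER);
    CREATE INDEX idx_a_cc ON A (cc);
    CREATE INDEX idx_b_cc ON B (cc);
""")
conn.executemany("INSERT INTO A VALUES (?)", [(i,) for i in range(0, 50, 5)])
conn.executemany("INSERT INTO B VALUES (?)", [(i,) for i in range(1000)])

# Form 1: uncorrelated subquery with IN.
in_rows = conn.execute(
    "SELECT cc FROM A WHERE cc IN (SELECT cc FROM B) ORDER BY cc"
).fetchall()

# Form 2: correlated subquery with EXISTS.
exists_rows = conn.execute(
    "SELECT cc FROM A WHERE EXISTS (SELECT 1 FROM B WHERE B.cc = A.cc) ORDER BY cc"
).fetchall()

same_result = in_rows == exists_rows
print(same_result, len(in_rows))
```

Because the results are identical, the choice between the two forms is purely a performance question, which is what the table-size rule of thumb above addresses.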

2, NOT IN and NOT EXISTS

The logic of NOT IN is not exactly the same as that of NOT EXISTS. Misusing NOT IN can introduce a serious bug into your program, as the following example shows:

create table #t1 (c1 int, c2 int);
create table #t2 (c1 int, c2 int);
insert into #t1 values (1, 2);
insert into #t1 values (1, 3);
insert into #t2 values (1, 2);
insert into #t2 values (1, null);
select * from #t1 where c2 not in (select c2 from #t2);  --> execution result: no rows
select * from #t1 where not exists (select 1 from #t2 where #t2.c2 = #t1.c2);  --> execution result: 1 3

As you can see, NOT IN produces an unexpected result set, which is a logic error: the subquery returns a NULL, and the comparison 3 <> NULL evaluates to UNKNOWN rather than TRUE, so the row (1, 3) is filtered out. In general, if any row returned by the subquery contains a NULL, the NOT IN query returns no rows at all. The execution plans of the two SELECT statements also differ; the latter uses HASH_AJ. So try to avoid NOT IN (which runs a plain subquery) and use NOT EXISTS (which runs a correlated subquery) instead. If the subquery column has a NOT NULL constraint, NOT IN is safe to use, and a hint can make it join with HASH_AJ or MERGE_AJ.
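The NULL behaviour can be reproduced outside SQL Server. The sketch below assumes Python's built-in sqlite3 module and uses ordinary tables named t1 and t2 in place of the temporary tables #t1 and #t2; SQLite applies the same three-valued logic to NOT IN. The last query shows one possible workaround, filtering NULLs out of the subquery, which is an addition for illustration and not part of the original example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE t1 (c1 INTEGER, c2 INTEGER);
    CREATE TABLE t2 (c1 INTEGER, c2 INTEGER);
    INSERT INTO t1 VALUES (1, 2), (1, 3);
    INSERT INTO t2 VALUES (1, 2), (1, NULL);
""")

# NOT IN: the NULL in t2.c2 makes the predicate UNKNOWN for c2 = 3.
not_in_rows = conn.execute(
    "SELECT * FROM t1 WHERE c2 NOT IN (SELECT c2 FROM t2)"
).fetchall()

# NOT EXISTS: the correlated equality simply never matches the NULL row.
not_exists_rows = conn.execute(
    "SELECT * FROM t1 WHERE NOT EXISTS (SELECT 1 FROM t2 WHERE t2.c2 = t1.c2)"
).fetchall()

# Workaround (illustrative): exclude NULLs from the subquery explicitly.
not_in_filtered = conn.execute(
    "SELECT * FROM t1 WHERE c2 NOT IN (SELECT c2 FROM t2 WHERE c2 IS NOT NULL)"
).fetchall()

# The scalar form of the same comparison yields NULL, which Python sees as None.
scalar = conn.execute("SELECT 3 NOT IN (2, NULL)").fetchone()[0]

print(not_in_rows)      # no rows
print(not_exists_rows)  # the row (1, 3)
print(not_in_filtered)  # the row (1, 3)
print(scalar)           # None
```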

If the query uses NOT IN, both the inner and the outer table are fully scanned and no index is used, whereas the subquery of NOT EXISTS can still use the index on the table. So no matter which table is larger, NOT EXISTS is faster than NOT IN.

3, The difference between IN and =

select ... from ... where name in ('Zhang', 'Wang', 'Zhao');

And

select ... from ... where name = 'Zhang' or name = 'Wang' or name = 'Zhao';

The result is the same.
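As a quick check of this equivalence, here is a minimal sketch using Python's built-in sqlite3 module. The table name people and the extra row 'Li' are hypothetical, chosen only so that the filter has something to exclude.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE people (name TEXT)")  # hypothetical table
conn.executemany(
    "INSERT INTO people VALUES (?)",
    [("Zhang",), ("Wang",), ("Zhao",), ("Li",)],
)

# IN with a literal list.
in_rows = conn.execute(
    "SELECT name FROM people WHERE name IN ('Zhang', 'Wang', 'Zhao') ORDER BY name"
).fetchall()

# The same filter written as a chain of equality tests joined by OR.
or_rows = conn.execute(
    "SELECT name FROM people "
    "WHERE name = 'Zhang' OR name = 'Wang' OR name = 'Zhao' ORDER BY name"
).fetchall()

print(in_rows == or_rows, in_rows)
```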

The following article is reproduced from: http://www.cnblogs.com/CareySon/archive/2013/01/09/2853094.html

The three physical join operations in SQL Server

Introduction

In SQL Server, the familiar logical joins between tables, such as INNER JOIN and OUTER JOIN, are converted by the engine into one of three physical joins: Loop Join, Merge Join, or Hash Join. The choice depends on the selected columns, whether the data is indexed, and the selectivity of the selected data. Understanding these three physical joins is the basis for understanding performance problems in table joins, so this article describes how each one works and the scenarios it suits.

Nested loops join (Nested Loop Join)

The nested loops join is the most basic join. As the name implies, it needs a nested loop, and the process can be illustrated simply by the following figures:

Figure 1: The first step of a nested loops join

Figure 2: The second step of a nested loops join

It is not hard to see from the two figures above that the number of lookups into the inner table equals the number of rows in the outer table, and that the join ends when the outer loop has no more rows. It also follows that this join type needs the inner table to be ordered (that is, indexed) and the outer table to have fewer rows than the inner table; otherwise the query optimizer is more inclined to choose a hash join (described later in this article).
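The process described above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not SQL Server's implementation: rows are (key, payload) tuples, the inner input is a list sorted by key that stands in for an index, and bisect_left plays the role of the index seek. The function name and the sample rows are hypothetical.

```python
from bisect import bisect_left


def nested_loop_join(outer_rows, inner_rows):
    """Join (key, payload) rows; inner_rows must be sorted by key.

    The sorted inner list stands in for an index on the inner table.
    Returns the joined rows and the number of lookups into the inner table.
    """
    inner_keys = [key for key, _ in inner_rows]
    output = []
    lookups = 0
    for outer_key, outer_payload in outer_rows:
        lookups += 1  # one lookup into the inner table per outer row
        pos = bisect_left(inner_keys, outer_key)  # stands in for an index seek
        while pos < len(inner_rows) and inner_keys[pos] == outer_key:
            output.append((outer_key, outer_payload, inner_rows[pos][1]))
            pos += 1
    return output, lookups


# Hypothetical sample data: a small outer input and a larger, ordered inner input.
products = [(870, "product A"), (871, "product B")]
order_details = sorted(
    [(870, "order 1"), (870, "order 2"), (871, "order 3"), (872, "order 4")]
)

joined, lookups = nested_loop_join(products, order_details)
print(joined)
print(lookups)  # equals the number of outer rows
```

The lookup counter makes the cost model visible: the work grows with the number of outer rows, which is why a small outer input and an indexed inner input suit this join type.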

The nested loops join also shows that its cost climbs steeply as the data volume grows, since every additional outer row triggers another lookup into the inner table. Once the data volume reaches a certain size, the query optimizer therefore tends to choose a different join type.

Let's look at a nested loops join, using Microsoft's AdventureWorks database:

Figure 3: A simple nested loops join

The ProductID column in Figure 3 is indexed, and 4,688 rows in the outer table of the loop (the Product table) satisfy ProductID = 870, so the SalesOrderDetail table has to be looked up 4,688 times. Now consider a variation on the query above, shown in Figure 4.

Figure 4: An extra column brings an additional bookmark lookup

As Figure 4 shows, selecting the additional UnitPrice column means the index used for the join no longer covers the query, so a bookmark lookup is required. This is why it is a good habit to select only the columns you need. To solve the problem, we can use a covering index or reduce the selected columns to avoid the bookmark lookup. In addition, only 5 rows match the ProductID condition above, so the query optimizer chooses a bookmark lookup. If the number of qualifying rows increases, the query optimizer leans toward a table scan instead (typically once more than about 1% of the rows in the table qualify, though this is not absolute), as shown in Figure 5.

Figure 5: The query optimizer chooses a table scan

As you can see, the query optimizer now chooses a table scan for the join, which is inefficient, so good covering indexes and avoiding SELECT * both deserve attention. Moreover, even with the table scan, the situation above is still a fairly ideal one. Worse, when several inequalities are used as the join condition, the query optimizer knows the statistical distribution of each column but not the joint distribution of the conditions, which leads to a wrong execution plan, as Figure 6 shows.

Figure 6: Deviation caused by the inability to estimate the joint distribution

Figure 6 shows a huge gap between the estimated and the actual number of rows. A table scan should have been used, but the query optimizer chose a bookmark lookup, which hurts performance even more than the table scan would. How much more? We can compare a forced table scan with the query optimizer's default plan, as shown in Figure 7.

Figure 7: The forced table scan performs better
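The estimation problem behind Figure 6 (per-column statistics are known, the joint distribution is not) can be illustrated with a small calculation. The sketch below assumes, purely for illustration, that an estimator multiplies the per-column selectivities as if the columns were independent; the table, its two perfectly correlated columns, and the predicates are all hypothetical and are not taken from the article's query.

```python
# Hypothetical table: 10,000 rows whose two columns always hold the same value.
rows = [(i % 100, i % 100) for i in range(10_000)]
total = len(rows)

# Per-column selectivities, which single-column statistics can provide.
sel_a = sum(1 for a, _ in rows if a < 10) / total
sel_b = sum(1 for _, b in rows if b < 10) / total

# Estimate under the independence assumption: multiply the selectivities.
estimated = round(sel_a * sel_b * total)

# Actual number of rows satisfying both inequalities.
actual = sum(1 for a, b in rows if a < 10 and b < 10)

print(estimated, actual)
```

With correlated columns the estimate is an order of magnitude below the actual row count, which is the kind of gap that can push a plan toward bookmark lookups when a scan would be cheaper.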

Merge join (Merge Join)

Speaking of merge joins, I am reminded of an evening at a bar in Seattle during the SQL PASS Summit. A friend and I were standing in the wrong place, so it looked as if we were cutting into the queue. I hastened to say, "I'm sorry, I thought here is end of the line." The other person replied with good humor, "It's OK, in SQL Server we call it a merge join."

As the little story suggests, a merge join connects two ordered queues. Both inputs must be ordered, so there is no need to keep looking up the inner table as a loop join does. In addition, the join condition must contain at least one equality predicate before the query optimizer will choose a merge join.

The merge join process can be described simply with the following figures:

Figure 8: The first step of a merge join

The merge join first takes the first row from each of the two input sets. If they match, it returns the matching row. If they do not match, the input set with the smaller value advances by one row, as shown in Figure 9.

Figure 9: The input set with the smaller value advances by one row

The merge join is expressed in C# in Code 1.

    // Relation is assumed to be a cursor-like type exposing Key, Advance(),
    // IsPastEnd() and Add(); it is not defined in this article.
    public class MergeJoin
    {
        // Assume that left and right are already sorted
        public static Relation Sort(Relation left, Relation right)
        {
            Relation output = new Relation();
            while (!left.IsPastEnd() && !right.IsPastEnd())
            {
                if (left.Key == right.Key)
                {
                    // Keys match: emit the row and advance both inputs
                    output.Add(left.Key);
                    left.Advance();
                    right.Advance();
                }
                else if (left.Key < right.Key)
                    left.Advance();
                else // (left.Key > right.Key)
                    right.Advance();
            }
            return output;
        }
    }

Code 1. C# representation of the merge join
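For readers who want to run the algorithm, here is a minimal Python sketch of the same logic as Code 1. It assumes plain lists of keys that are already sorted in ascending order and unique within each list, and it replaces the Relation cursor with two list indexes; the function name and sample keys are hypothetical.

```python
def merge_join(left, right):
    """Return the keys present in both inputs.

    Both lists are assumed to be sorted ascending with no duplicate keys,
    matching the simplification made in Code 1.
    """
    output = []
    i = j = 0
    while i < len(left) and j < len(right):
        if left[i] == right[j]:
            # Keys match: emit the key and advance both inputs.
            output.append(left[i])
            i += 1
            j += 1
        elif left[i] < right[j]:
            i += 1  # the left value is smaller, so the left input advances
        else:
            j += 1  # the right value is smaller, so the right input advances
    return output


matches = merge_join([1, 3, 4, 7, 9], [2, 3, 7, 8, 9, 10])
print(matches)  # [3, 7, 9]
```

Each input is read once from start to end, which is why the join is cheap when both sides arrive already sorted.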

Therefore, a merge join is usually very efficient when both inputs are already ordered. If an explicit sort is needed just to make the merge join possible, a hash join is the more efficient choice. There is one exception: when ORDER BY, GROUP BY, or DISTINCT in the query forces the query optimizer to sort explicitly anyway, the optimizer can kill two birds with one stone and reuse the sorted result for a cheaper merge join. In that case the merge join is the better choice.

In addition, the way a merge join works shows that it is the more efficient choice when the join condition is an inequality (not including !=), such as >, <, or >=.

Let's look at a simple merge join, in which a clustered index and a nonclustered index guarantee that both inputs of the merge join are ordered, as shown in Figure 10.

Figure 10: Both inputs are kept ordered by a clustered index and a nonclustered index

Of course, when ORDER BY or GROUP BY forces the query optimizer to sort explicitly, it can kill two birds with one stone and will also choose a merge join instead of a hash join, as shown in Figure 11.

Figure 11: A merge join that kills two birds with one stone

Hash match (Hash Join)

The hash match join is more complex than the previous two, but for large amounts of unordered data it performs better than the merge join and the loop join. The query optimizer tends to use a hash join when the join columns are not sorted (that is, not indexed).

A hash match has two phases: the build phase and the probe phase. The build phase comes first and is shown in Figure 12.

Figure 12: The first phase of a hash match

In Figure 12, each row of the input source is passed through a hash function and placed in the corresponding hash bucket. The choice of hash function and the number of hash buckets are a black box; Microsoft has not published the specific algorithm, but I believe it is a very good one. The rows within a hash bucket are unordered. Typically, the query optimizer uses the smaller of the two join inputs as the input source for the first phase.

Next comes the probe phase. For each row of the other input set, the same hash function determines the hash bucket the row belongs to; the row is then compared with each row in that bucket, and matching rows are returned.
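The build and probe phases can be sketched as follows. This is an illustration only: a Python dictionary of lists stands in for the hash function and hash buckets, whose real implementation in SQL Server is not published. Rows are (key, payload) tuples, and the function name and sample rows are hypothetical.

```python
from collections import defaultdict


def hash_join(build_rows, probe_rows):
    """Join (key, payload) rows using a build phase and a probe phase."""
    # Build phase: place every row of the smaller input into a bucket by key.
    buckets = defaultdict(list)
    for key, payload in build_rows:
        buckets[key].append(payload)

    # Probe phase: for each row of the other input, find its bucket and
    # return a joined row for every entry in that bucket.
    output = []
    for key, payload in probe_rows:
        for build_payload in buckets.get(key, []):
            output.append((key, build_payload, payload))
    return output


# Hypothetical sample data: the smaller input is used for the build phase.
products = [(1, "P1"), (2, "P2")]
order_details = [(2, "D1"), (1, "D2"), (2, "D3"), (3, "D4")]

joined = hash_join(products, order_details)
print(joined)
```

Neither input needs to be sorted or indexed, and the rows come out in probe order rather than key order, which matches the point that hash join output is not ordered by the join column.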

From the way hash matching works, it is clear that it involves a hash function, so CPU consumption is high. In addition, the rows in the hash buckets are unordered, so the output is unordered as well. Figure 13 shows a typical hash match, in which the query optimizer uses the smaller Product table as the build input and the SalesOrderDetail table, which holds far more data, as the probe input.

Figure 13: A typical hash match join

The case above assumes that memory can hold everything the build phase needs. If memory is tight, Grace hash match and recursive hash match come into play, which may use tempdb and consume a lot of IO. I won't go into detail here; interested readers can refer to http://msdn.microsoft.com/zh-cn/library/aa178403(v=sql.80).aspx.

Summary

The table below summarizes the cost and usage scenarios of these joins:

|                     | Nested loops join                                                | Merge join                | Hash join                      |
| Applicable scenario | Outer loop is small and the inner loop's join column is ordered | Both inputs are ordered   | Large data volume and no index |
| CPU                 | Low                                                              | Low (if no explicit sort) | High                           |
| Memory              | Low                                                              | Low (if no explicit sort) | High                           |
| IO                  | May be high or low                                               | Low                       | May be high or low             |

Understanding SQL Server's physical joins is essential for performance tuning. When a query joins many tables under complex filter conditions, the query optimizer is often not as intelligent as one might hope, so understanding these join types is especially important for locating problems. In addition, narrowing the query scope from the business side reduces the likelihood of a poorly performing join.
