SQL Server Join method

Source: Internet
Author: User
Tags data structures joins

1. Test data Preparation

Reference: tables in SQL Server access by Table Scan, index Scan, index Seek the experimental data in this blog to prepare. These two blogs use the same experimental data.

Three ways to join in 2.SQL server

In SQL Server, each join command, when executed internally, is run with three more specific join methods. The three join methods are: Nested loops join, merge join, and hash join. None of these three methods is always the best, but they all have the most appropriate context. SQL Server chooses the most appropriate join method based on the table structure on which the two result sets are based, and the size of the result set. Of course, the user can also specify the join method in the statement , that is, add join Hint,sql Server will try to respect your choice. However, some queries may not perform the execution plan according to the specified join method, and SQL Server will give an error. It is also recommended that you do not use SQL hint, which is basically correct for SQL Server selection

SQL Server has three join methods, then there are three join hint, as shown below are the junctions executed by the three join hint and their corresponding execution plans.

View Code

Execution Plan:

2.1Nested Loop Join

Nested Loops is the most basic method of joining and is widely used by SQL Server . for two tables to be joined together, SQL Server chooses one to do outer table (at the top of the execution plan, salesorderheader_test), and the other to do inner table ( At the lower end of the execution plan, salesorderdetail_test). As shown

The algorithm is:

foreach (Row R1 in outer table)--Try small    foreach (row R2 in inner table)        if (R1, R2 match criteria)            output (r1, r2);

Take the above query as an example, SQL ServerSelected the Salesorderheader_testas outer table, Salesorderdetail_testAs inner table。 First SQL ServerIn Salesorderheader_test goes into inner table). Since outer Table salesorderheader_test There are 000 In the 659 was scanned 000 times, the embodiment of the implementation plan is: Clustered index seek returned row has 10000, and executes the number is 1

Nested Loops join is a basic way to join. It does not require SQL Server to create additional data structures for joins, so it also saves memory space and does not need to use tempdb space . The type of join it applies to is very extensive. Some joins are not possible with merge and hash, but nexted loops can do it. So the advantages of this type of connection are obvious, but the drawbacks are obvious.

1. The complexity of the algorithm equals inner table multiplied by outer table.

If the two tables are larger, especially if the outer table is larger, Inner table will be scanned many times. This time the complexity of the algorithm increases very quickly, the total resource consumption will increase quickly. so nested Loops join is more suitable for two smaller result sets to join, or at least outer table's result set is relatively small.

As in the previous example, the inner table is scanned 10 000 times because the data set of the outer table salesorderheader_test has 10 000 records. It's not a good deal. If you let SQL Server choose for itself without joining Hint,sql server does not select nested loops join mode.

2. Outer table data sets should preferably be sorted beforehand in order to improve retrieval efficiency.

If the datasets can be sorted in advance, doing nested loops will certainly be quicker. Of course, if there is no sort, Nested Loops join can be done, that is, cost will greatly increase.

3. It is best to have an index on Inner table that supports retrieval.

The nested loop algorithm takes each value of outer table one at a time, looking for all the qualifying records in the inner table, so finding the speed in the inner table can also affect the overall velocity to a great extent. If there is an index on the field that is being retrieved, the lookup will be much faster, and the Inner table DataSet is slightly larger. Otherwise, it is a waste of resources to scan the entire data set every time.

In summary, the Nested Loops join has the highest efficiency for smaller joins on datasets, so it is widely used in SQL Server. When SQL Server discovers the ability to select a very small data set as outer table, it tends to choose nested Loops, which is also better performance. However, the nested Loops join is too sensitive to the size of the dataset. If SQL Server predicts an error and makes a outer table with a large data set, performance can drop sharply. Many statement performance problems are caused by this.

2.2Merge Join

As mentioned earlier, the Nested Loops join works only when the outer table dataset is relatively small. If the dataset is large, SQL Server uses two other joins, the Merge join and the hash join. If two tables that need to be connected are sorted on the join column (for example, if they are obtained by scanning a sorted index), then the merge join is the fastest join operation. If the two join inputs are large, and the two inputs are the same size , the pre-ordered merge join provides a similar performance as the hash join. However, if the size of the two inputs varies greatly, the hash join operation is usually much faster.

The Merge join algorithm is as follows:

Get first row R1 from input 1 get first row R2 from input 2 and not at the end of either input begin     if (R1 joins WI th R2)     begin     Output (R1, R2)     get next row R2 from input 2 end else if (R1 < R2)    get next row R1 from Put 1 Else     

In other words, take a value from each of the data sets on both sides and compare it. If they are equal, the two lines are joined together to return. If not equal, throw away the small value and take a larger one in order. The data set on both sides ends with one side traversal, and the entire join process ends. So the complexity of the algorithm is O (m+n), which is a lot smaller than the complexity O (m*n) that multiplies the nested Loops join two datasets. Therefore, in the case of large datasets, the advantage of the Merge join is very obvious.

But from the above merge join algorithm to see that its limitations are also very strong, so in the actual statement, the use is not so common. Its limitations include:

1. The two datasets to be joined must be ordered in advance according to the field of the join.

This precondition is the basis of the merge join algorithm, and sorting large datasets is a complex matter. However, some datasets are based on the index on the field of join, so the order can be sorted without additional resources, which may be appropriate when using the merge join. For example, the sample query, two datasets are based on the SalesOrderID field index on the seek out, so do not need to do the sorting. The execution plan for the sample query is as follows:

From the query plan we can see that the sample query for the merge join can be decomposed into two queries,

SELECT * FROM dbo. Salesorderheader_test where SalesOrderID >43659 and salesorderid< 53660select count (SalesOrderID) from dbo. Salesorderdetail_test where SalesOrderID >43659 and salesorderid< 53660

The first query uses clustered index seek, because there is a clustered index, so the query results are definitely sorted by the clustered index column SalesOrderID. The second query, although SalesOrderID is not a clustered index key for the Salesorderdetail_test table, has a nonclustered index on the Salesorderdetail_test table and only needs to query count (SalesOrderID ), so it is queried on the nonclustered index, and the query results are sorted by SalesOrderID. Thus, the final two result sets are sorted according to SalesOrderID.

2. The merge join can only be used as a join with a "value equal" condition, and if the data set may have duplicate data, the merge join uses Many-to-many as a very resource-based join.

When SQL Server scans a dataset, if DataSet 1 has two or more record values equal, SQL Server must temporarily set up a data structure to store the scanned data in DataSet 2, which is useful if the next record in DataSet 1 is the value. This temporary data structure is called "worktable" and will be placed in tempdb or in memory . This is very resource-intensive, so in the above execution plan, the subtree cost of the two sentence sentences of the Merge join is 0.202 and 0.109 respectively. But the Many-to-many join clause subtree cost is 5.051. That is, the cost of the join itself is 4.74 (5.051–0.202–0.109 =4.74)). This is a no-small cost.

If you add a unique index on the SalesOrderID column of the [salesorderheader_test] table (or change the original clustered index to a unique clustered index),

The--salesorderid column already has a clustered index and now adds a unique index--adding a unique index will fail if there are duplicates on the SalesOrderID column. Create unique index Idx_uniq_salesorderid on salesorderheader_test (SalesOrderID);

SQL Server knows that the value of DataSet 1(Salesorderheader_test) is not duplicated, and there is no need to do a many-to-many Join. The execution plan has changed, and the estimated cost has been reduced by an order of magnitude.

Summarize:

The above two restrictions affect the merge JoinRange of use. But the merge JoinA unique benefit of by the way, the order of the result set is a problem. When we use the same query, we find that the result sets are sometimes arranged in the order we want them to be, sometimes not. or in SQL Server 2000,merge join", SQL server Choose which Join algorithm is least expensive. If the amount of data and data distribution let SQL Server cost, it will switch to nested Loops in a word, if you want the result set to return in a certain order, explicitly specify "Order By". If you do not specify, even the exact same query, the result set order this time and the last time is not the same is normal. Because the data changes, or the parameters are different, SQL server

2.3Hash Join

As the name implies, a hash join is a matching join algorithm using a hashing algorithm . A specific hashing algorithm can refer to one of my other blogs: hashmap implementation principle. In short, the hashing algorithm is divided into two steps, "building a hash bucket (build hashbucket)" and "probing the value in the hash bucket (Probe hash bucket)". In the "Build" phase, SQL Server selects one of the two datasets to make a join , creating a hash table in memory based on the recorded values . Then, in the "Probe" phase, SQL Server selects another dataset, brings the values of the records inside, and returns the rows that meet the criteria that can be joined. The specific algorithm is:

For each row R1 in the build table   begin      Calculate hash value on join key (s) of R1      insert R1 into the APPROPRI ATE hash bucket   endfor each row R2 in the probe table   begin      Calculate hash value on join key (s) of R2      for Each row R1 in the corresponding hash bucket         if R1 joins with R2            output (R1, R2)   end

Algorithm Description:

    1. Select one of the two tables that need to join, R1 for each record in a, calculate the hash value of the Join column, and then insert the R1 into the hash bucket based on the hash value.
    2. Select two outer table B, for each record in B R2, we also calculate the hash value of its join column, and then go to the hash bucket to find. If there is R1 on the hash bucket that can be connected to the R2, then a long output (R1,R2) of the join result may have multiple R1 records.

The structure can be referenced as follows:

The above 0-15 is the hash bucket, and the right node is R1.

Compared with the other two join algorithms, the advantage of Hash join is obvious.

1. Its algorithm complexity is to traverse each side of the data set again and again.

This is a large join for datasets, and its complexity can be controlled within a reasonable range. Compared to the already sequenced merge Join,hash join, which computes the hash value a bit more, the complexity is higher than the merge Join, but it is much simpler than the nested loops.

2. It does not require the data set to be sorted in advance, nor does it require an index on it.

Because the join uses a hashing algorithm, there is no restriction on the input, and SQL Server is not required to prepare an ordered input in advance, as in the case of a merge join. Since the hash join always has to scan both sides of the data set, so there is no index actually help is not big. Without indexes, there is no significant impact on performance.

3. You can easily upgrade to a parallel execution plan that uses multiple processors.

Because the algorithm does not require any order of incoming data, it is relatively easy to do it in parallel with multiple CPUs.

In summary, Hash join is a case where the data set to join is large and there are no suitable indexes on it. As in the example, there is a join between a 10 000 record dataset and a data set of 50 577 records. Use nested loops to cycle 10 000 times, the cost is relatively high. The estimated cost for SQL Server is 2.233. When using the merge join, although both datasets are sorted, SQL Server has to use the Many-to-many join method, because there may be duplicate values, and the cost is high, estimated at 5.882. With a hash Join, the estimated cost is 0.727, which is smaller than the first two. So SQL Server uses a hash join for this sentence by default if you do not join hint.

However, Hash join is not an optimal join algorithm, just a choice for SQL Server when the input is not optimized (the data set of the join is larger, or there is no suitable index on it). This is because hash join is one of the most resource-intensive join algorithms. It creates a hash table in memory before making a join. The established process requires CPU resources, and the hash table needs to be stored in memory or tempdb. The join process also uses CPU resources to calculate ("Probe"). If there are many users at the same time using the hash algorithm to join, the overall burden on SQL Server is relatively heavy. From the point of view of reducing overall SQL Server load, try to minimize the size of the data set of the Join input, with the appropriate index, and guide SQL Server to use nested Loops join or merge join as much as possible.

The following table compares these three join methods.

Nested Loops Join

Merge Join

Hash Join

Best suited for

A relatively small two data set, inner table has an index on the field where the join is made

Input dataset size is medium or large, and index help is sorted on the join field, or the statement requires that a well-ordered result set be returned

The input data set is large. Especially suitable for complex query statements in the Data Warehouse environment

Concurrency of

Ability to support large numbers of concurrent users running concurrently

Many-to-one join with index support concurrency is better, many-to-many is bad

It's best to have only a few users running at the same time

Two fields are equal when join

No

to (except for full outer join)

To

Whether to use memory resources

Do not use

Not used (may be used if you want to sort the merge join)

Use

Whether to use tempdb

Do not use

Many-to-many Join to use

Use

Whether to sort the input datasets

No

To

No

Want to sort the input datasets No

Want outer input to be sorted

Yes

No

When SQL Server joins, it measures the possible indexes according to the structure of the table on which the input datasets are based, and, based on statistics, estimates the size of two input datasets and selects one of the three join methods. If the selection is not correct, it may cause the join to be very slow.

Practical application

This is an interview question, you can refer to a blog I have written before: query in a table is not a record of another table. Now found before the writing is not clear enough, I will revise the blog, the specific solution is also written in the blog.

The general meaning of the topic is:

If you want to query a record that exists in table A, but does not exist in table B, how to query. For illustrative purposes, we assume that both A and B tables have only one field id,a the record in the table is {2,4} in the {1,2,3,4,5},b table, then we need to get a result set of {1,3,5} from a SQL query. There is a and B table in the ID is not necessarily sorted, a table of large datasets, B table data set is small.

Source: https://www.cnblogs.com/xwdreamer/archive/2012/09/07/2674756.html

SQL Server Join method

Related Article

E-Commerce Solutions

Leverage the same tools powering the Alibaba Ecosystem

Learn more >

Apsara Conference 2019

The Rise of Data Intelligence, September 25th - 27th, Hangzhou, China

Learn more >

Alibaba Cloud Free Trial

Learn and experience the power of Alibaba Cloud with a free trial worth $300-1200 USD

Learn more >

Contact Us

The content source of this page is from Internet, which doesn't represent Alibaba Cloud's opinion; products and services mentioned on that page don't have any relationship with Alibaba Cloud. If the content of the page makes you feel confusing, please write us an email, we will handle the problem within 5 days after receiving your email.

If you find any instances of plagiarism from the community, please send an email to: info-contact@alibabacloud.com and provide relevant evidence. A staff member will contact you within 5 working days.