In SQL Server, inner and outer joins between tables are carried out by the execution engine, which converts each logical join into one of three physical join operators: Nested Loop Join, Merge Join, or Hash Join, based on the selected columns and on whether the data is indexed. Understanding these three physical joins is the foundation for solving performance problems in table joins. The following describes the principles and applicable scenarios of each.
Nested Loop Join
Nested loop join is the most basic form of join. As its name suggests, it nests one loop inside another. Nested loop is also the only one of the three methods that supports inequality join conditions. The process of this join can be described simply as follows:
Figure 1. Step 1 of nested loop join
Figure 2. Step 2 of nested loop join
From the two figures above it is not difficult to see that the number of inner-loop lookups equals the number of rows in the outer input, and the nested loop ends when the outer input runs out of rows. We can also see that this join method requires the inner table to be ordered on the join column (that is, to have an index), and that the outer input should have fewer rows than the inner one; otherwise the query analyzer is more inclined to choose a Hash Join (discussed later in this article).
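As a rough back-of-the-envelope cost model (my own simplification, not SQL Server's actual costing formula), the work a nested loop join does can be written as:

$$
\text{cost}_{\text{nested loop}} \;\approx\; |R_{\text{outer}}| \times \text{cost(one inner lookup)} \;\approx\;
\begin{cases}
|R_{\text{outer}}| \cdot \log|R_{\text{inner}}| & \text{inner join column indexed}\\
|R_{\text{outer}}| \cdot |R_{\text{inner}}| & \text{no index (scan per outer row)}
\end{cases}
$$

which is why a small outer input and an indexed inner input matter so much here.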
From the nested loop join we can see that as the data volume increases, the cost grows steeply, roughly with the product of the two input sizes. Therefore, once the data volume reaches a certain level, the query analyzer tends to stop choosing this method.
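Before turning to a concrete example, here is a minimal C# sketch of the algorithm, written in the same spirit as the merge join representation in Code 1 later in this article. The Relation and Row types and their members (Rows, Seek, Add) are hypothetical stand-ins for the engine's internal iterators, not a real API:

```csharp
public class NestedLoopJoin
{
    // Minimal sketch: for every row of the outer input, look up
    // matching rows in the inner input.
    public static Relation Join(Relation outer, Relation inner)
    {
        Relation output = new Relation();
        foreach (Row outerRow in outer.Rows)            // one pass over the outer input
        {
            // With an index on the join column this lookup is a seek;
            // without one it becomes a full scan of the inner table
            // for every single outer row.
            foreach (Row innerRow in inner.Seek(outerRow.Key))
                output.Add(outerRow, innerRow);         // emit each matching pair
        }
        return output;
    }
}
```

Note how the total number of inner lookups is exactly the number of rows the outer input produces, which matches the behavior described above.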
The following is an example of a nested loop join, using Microsoft's AdventureWorks sample database:
Figure 3. A simple nested loop join
In Figure 3, ProductID is indexed, and 4688 rows on the outer side of the loop satisfy ProductID = 870, so the table on the inner side has to be searched 4688 times. Now let's add one more column to the preceding query and look again, as shown in Figure 4.
Figure 4. Extra bookmark lookup caused by an additional column
As shown in Figure 4, because the UnitPrice column is also selected, the index used for the join can no longer cover the query, and a bookmark lookup is required for each row. This is why we should develop the good habit of selecting only the columns we need. To solve the problem above, we can either make the index a covering one or select fewer columns so that the bookmark lookup is avoided. In this case only five rows satisfy the ProductID condition, so the query analyzer chooses bookmark lookups; if we increase the number of qualifying rows, the query analyzer tends toward a table scan instead (as a rule of thumb, when more than roughly 1% of a table's rows qualify, a table scan is usually performed instead of bookmark lookups, though this is not absolute), as shown in Figure 5.
Figure 5. Table scan chosen by the query analyzer
We can see that the query analyzer now chooses a table scan for the join, which is far less efficient. This is why both good covering indexes and avoiding SELECT * matter so much. Even so, a table scan is still a comparatively benign outcome. Worse is when several inequality predicates serve as the join conditions: the query analyzer knows the statistical distribution of each individual column but not the joint distribution of the combined conditions, which can produce an incorrect execution plan, as shown in Figure 6.
Figure 6. Deviation caused by the inability to estimate the joint distribution
From Figure 6 we can see that the estimated number of rows differs wildly from the actual number. The query analyzer should therefore have chosen a table scan, but it chose bookmark lookups instead, which hurts performance even more than a table scan would. How much more? We can compare a forced table scan against the query analyzer's default plan, as shown in Figure 7.
Figure 7. Forced table scan performance is better
Merge Join
Speaking of merge joins, I am suddenly reminded of the SQL PASS Summit in Seattle, where I was queuing for a drink at the bar one evening. I accidentally cut in front of another guy because it looked to me as if we were in the same queue, so I apologized and said I had thought that was the end of the line. He replied humorously: "It's OK. In SQL Server, we call it a merge join."
From the story above it is not difficult to see that a merge join essentially merges two sorted queues. Both inputs must be sorted, so unlike a loop join it never has to repeatedly look rows up in a table. Also, the join condition must contain at least one equality predicate for the query analyzer to choose a merge join.
The Merge Join process can be described in the following figure:
Figure 8. Merge Join Step 1
Merge join first takes the first row from each of the two input sets. If they match, the matching rows are returned. If they do not match, the input with the smaller value advances one row, as shown in Figure 9.
Figure 9. The input with the smaller value advances one row
Code 1 shows how merge join can be represented in C# code.
```csharp
public class MergeJoin
{
    // Assumes that left and right are already sorted on Key
    public static Relation Join(Relation left, Relation right)
    {
        Relation output = new Relation();
        while (!left.IsPastEnd() && !right.IsPastEnd())
        {
            if (left.Key == right.Key)
            {
                output.Add(left.Key);    // keys match: emit and advance both sides
                left.Advance();
                right.Advance();
            }
            else if (left.Key < right.Key)
                left.Advance();          // left is behind: advance it
            else // left.Key > right.Key
                right.Advance();         // right is behind: advance it
        }
        return output;
    }
}
```
Code 1. C# code representation of merge join
Therefore, when both inputs of a merge join are already sorted, the merge join is very efficient. However, if an explicit Sort is needed first to put the inputs in order, a hash join is usually the more efficient choice. There is one exception: when ORDER BY, GROUP BY, DISTINCT, and the like force the query analyzer to perform an explicit sort anyway, it reasons that since the sort has to happen in any case, it might as well reuse the sorted result for the cheaper merge join. In that situation, merge join is the better choice.
In addition, from the principle of merge join we can see that when the join condition is an inequality (though not !=), such as >, <, or >=, a merge join can still be quite efficient.
Let's look at a simple merge join that guarantees the order of both inputs through a clustered index and a nonclustered index, as shown in Figure 10.
Figure 10. Order at both ends of the input guaranteed by a clustered index and a nonclustered index
Of course, when the presence of ORDER BY or GROUP BY forces the query analyzer to perform an explicit Sort anyway, it chooses merge join rather than hash join, as shown in Figure 11.
Figure 11. Merge Join
Hash Join
Hash matching is more complex than the previous two methods, but when the data volume is large and unsorted it often outperforms both merge join and loop join. When the join columns are not sorted (that is, not indexed), the query analyzer tends to use hash join.
Hash matching is divided into two phases: the build phase and the probe phase. The build phase comes first, and its process is shown in Figure 12.
Figure 12. The build phase of hash matching
In Figure 12, each entry of the input source is assigned to a hash bucket by computing the hash function on it. How the hash function is chosen and how many hash buckets there are remain black boxes; Microsoft has not published the specific algorithms, although they are presumably very good ones. Note that the entries within a hash bucket are unordered. Typically, the query optimizer uses the smaller of the two join inputs as the build input for this first phase.
Next comes the probe phase. For the other input set, the same hash function is applied to each row to determine which hash bucket it belongs to; the row is then compared against every row in that bucket, and the matching rows are returned.
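As with Code 1, the two phases can be sketched in C#. This is a minimal in-memory sketch using the same hypothetical Relation and Row stand-ins; a real hash join operator additionally handles spilling to TempDB, residual predicates, and so on:

```csharp
using System.Collections.Generic;

public class HashJoin
{
    // Minimal sketch: build is expected to be the smaller input.
    public static Relation Join(Relation build, Relation probe)
    {
        // Build phase: place every row of the build input into a hash bucket.
        var buckets = new Dictionary<int, List<Row>>();
        foreach (Row row in build.Rows)
        {
            int bucket = row.Key.GetHashCode();         // stand-in for the engine's hash function
            if (!buckets.TryGetValue(bucket, out List<Row> list))
                buckets[bucket] = list = new List<Row>();
            list.Add(row);
        }

        // Probe phase: hash each probe row and compare it only against
        // the rows that landed in the same bucket.
        Relation output = new Relation();
        foreach (Row row in probe.Rows)
        {
            if (buckets.TryGetValue(row.Key.GetHashCode(), out List<Row> candidates))
                foreach (Row match in candidates)
                    if (match.Key == row.Key)           // recheck: a bucket may hold colliding keys
                        output.Add(match, row);
        }
        return output;
    }
}
```

Note that nothing about the buckets preserves order, which is why the output of a hash join is unordered.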
From the principle of hash matching it is not hard to see that computing hash functions makes the operation fairly CPU-intensive, and because the rows in a hash bucket are unordered, the output is unordered as well. Figure 13 shows a typical hash match: the query analyzer uses the smaller Product table as the build input and the larger SalesOrderDetail table as the probe input.
Figure 13. A typical hash match join
The discussion above assumes that memory can hold everything the build phase requires. When memory is tight, grace hash matching and recursive hash matching come into play, which may spill to TempDB and consume a large amount of IO. I will not elaborate on those here; interested readers can refer to http://msdn.microsoft.com/zh-cn/library/aa178403(v=sql.80).aspx.
Summary
The following table briefly summarizes the cost characteristics and use cases of the three join methods:
| | Nested Loop Join | Merge Join | Hash Join |
| --- | --- | --- | --- |
| Applicable scenarios | Small outer input; inner input indexed (ordered) on the join column | Both inputs ordered on the join column | Large data volume, no index |
| CPU | Low | Low (when no explicit sort is needed) | High |
| Memory | Low | Low (when no explicit sort is needed) | High |
| IO | Can be high or low | Low | Can be high or low |
Understanding these physical join operators in SQL Server is essential for performance tuning. In many cases, when many tables are joined under many filtering conditions, the query analyzer may not be so intelligent, so understanding these join methods is particularly important for pinpointing problems. In addition, we can narrow the query scope from the business side to reduce the chance of a low-performance join.
References:
http://msdn.microsoft.com/zh-cn/library/aa178403(v=sql.80).aspx
http://www.dbsophic.com/SQL-Server-Articles/physical-join-operators-merge-operator.html
Article from: http://www.cnblogs.com/CareySon/