Nested loop join
Nested loop join is the most basic join method. As its name suggests, it nests one loop inside another. Of the three join methods, nested loop join is the only one that supports inequality join conditions. The process can be described simply as follows:
Figure 1. Step 1 of loop nested join
Figure 2. Step 2 of loop nested join
From the two figures above, it is not difficult to see that the number of times the nested loop join scans the inner input equals the number of rows in the outer input; when there are no more rows in the outer input, the nested loop ends. We can also see that this join method wants the inner table to be ordered (that is, indexed) and the outer input to have fewer rows than the inner input; otherwise, the query analyzer is more inclined to choose a hash join (discussed later in this article).
From the nested loop join we can see that as the data volume grows, the performance cost grows rapidly. Therefore, once the data volume reaches a certain size, the query analyzer tends not to choose this method.
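The steps above can be sketched in a few lines of Python. This is a minimal illustration, not SQL Server's actual implementation; the table contents and the inequality predicate are made up for the example:

```python
def nested_loop_join(outer, inner, predicate):
    """Nested loop join: for every row of the outer input, scan the
    entire inner input and emit pairs that satisfy the predicate.
    Cost is O(len(outer) * len(inner)), which is why the optimizer
    prefers a small outer input."""
    result = []
    for o in outer:            # one pass over the outer input
        for i in inner:        # full scan of the inner input per outer row
            if predicate(o, i):
                result.append((o, i))
    return result

# An inequality join -- the only one of the three methods that supports it:
orders = [("o1", 100), ("o2", 250)]
limits = [("small", 200), ("large", 500)]
pairs = nested_loop_join(orders, limits, lambda o, l: o[1] < l[1])
```

Note that the inner input is scanned once per outer row, which is exactly why an index on the inner table (turning each scan into a seek) matters so much in practice.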
Merge join
A merge join connects two ordered inputs; both sides must be sorted on the join columns, so there is no need to repeatedly scan a table the way a nested loop join does. In addition, merge join requires at least one equality predicate in the join condition for the query analyzer to choose it.
The merge join process can be described in the following figure:
Figure 8. Merge join Step 1
Merge join first retrieves the first row from each of the two inputs. If the rows match, the matching row is returned. If they do not match, the input with the smaller value advances by one row.
Therefore, if both inputs of a merge join are already ordered, the merge join is very efficient. However, if an explicit sort is needed just to enable the merge join, a hash join is usually the more efficient choice. There is one exception: the query analyzer may have to perform an explicit sort anyway because of ORDER BY, GROUP BY, DISTINCT, and so on. Since the sorted result already exists in that case, why not reuse it to perform a merge join at lower cost? In such cases, merge join is the better choice.
In addition, we can see from the merge join principle that when the join condition is an inequality (but not !=), such as >, <, or >=, merge join can still be quite efficient.
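The match-or-advance process described above can be sketched as follows. This is a simplified illustration that assumes the join keys are unique on each side (a real merge join must handle duplicate keys with a many-to-many rewind), and both inputs are assumed pre-sorted:

```python
def merge_join(left, right, key=lambda row: row[0]):
    """Merge join over two inputs already sorted on the join key.
    Each input is scanned exactly once: advance whichever side has
    the smaller key, and emit a pair when the keys are equal."""
    result = []
    i = j = 0
    while i < len(left) and j < len(right):
        lk, rk = key(left[i]), key(right[j])
        if lk == rk:
            result.append((left[i], right[j]))
            i += 1          # simplification: unique keys on each side
            j += 1
        elif lk < rk:
            i += 1          # advance the input with the smaller value
        else:
            j += 1
    return result
```

Because each input is read only once, the cost is linear in the total number of rows, which is why merge join is so cheap when the sort order comes for free (for example, from an index).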
Hash join
Hash join is more complex than the previous two methods, but it outperforms both merge join and nested loop join when the data volume is large and the inputs are unsorted. If the join columns are not sorted (that is, there is no index), the query analyzer tends to use a hash join.
Hash join is divided into two phases: build and probe. The build phase comes first; its process is shown in Figure 12.
Figure 12. The build phase of a hash join
In Figure 12, each row of the input is placed, via the hash function, into one of the hash buckets. The choice of hash function and the number of hash buckets are black boxes; Microsoft has not published the specific algorithms, but presumably they are well tuned. Note that rows within a hash bucket are unordered. The query optimizer generally uses the smaller of the two join inputs as the build-phase input.
Next comes the probe phase. For each row of the other input, the same hash function determines which hash bucket the row belongs to; the row is then compared with each row in that bucket, and matching rows are returned.
From this principle, it is not difficult to see that a hash join must compute hash values, so its CPU consumption is high. In addition, because the rows in each hash bucket are unordered, the output is unordered as well.
The discussion above assumes the build phase fits in memory. When memory is tight, grace hash join and recursive hash join come into play, which may spill to tempdb and incur a large amount of I/O. I will not elaborate here; interested readers can see: http://msdn.microsoft.com/zh-cn/library/aa178403(v=sql.80).aspx.
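The build-then-probe process can be sketched as below. This is an in-memory illustration only: a Python dict stands in for the engine's hash table, so the real hash function and bucket count (which, as noted, are internal to SQL Server) are not modeled:

```python
def hash_join(build_input, probe_input, key=lambda row: row[0]):
    """In-memory hash join. Build phase: hash the (smaller) build
    input into buckets keyed by the join column. Probe phase: hash
    each row of the other input and compare it only against rows in
    the matching bucket. Output order is not guaranteed, since rows
    within a bucket are unordered."""
    buckets = {}
    for row in build_input:                 # build phase
        buckets.setdefault(key(row), []).append(row)
    result = []
    for row in probe_input:                 # probe phase
        for match in buckets.get(key(row), []):
            result.append((match, row))
    return result
```

Each input is scanned once, so the cost is roughly linear in the total row count plus the hashing work, which is why hash join wins on large, unindexed inputs.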
Summary
The following table briefly summarizes the consumption and use cases of these connection methods:
| | Nested loop join | Merge join | Hash join |
|---|---|---|---|
| Applicable scenario | Small outer input; inner input ordered (indexed) on the join column | Both inputs sorted on the join column | Large data volume, no index |
| CPU | Low | Low (if no explicit sort is needed) | High |
| Memory | Low | Low (if no explicit sort is needed) | High |
| I/O | High or low | Low | High or low |
References:
http://msdn.microsoft.com/zh-cn/library/aa178403(v=sql.80).aspx
http://www.dbsophic.com/SQL-Server-Articles/physical-join-operators-merge-operator.html
The following blog post ran timing experiments on the three join types; its analysis is worth a read:
http://www.jasongj.com/2015/03/07/Join1/