Three Kinds of Physical Join Operations in SQL Server (with a Performance Comparison)

In SQL Server, the familiar inner and outer joins between tables are converted by the engine into one of three physical join operators: the loop join, the merge join, or the hash join. The choice depends on the selected columns, whether the data is indexed, and the selectivity of the predicates. Understanding these three physical joins is the foundation for solving performance problems when tables are joined, so let me describe the principles behind each of them.

Nested Loop Join
The nested loop join is the most basic join. As its name suggests, it works by nesting one loop inside another, and of the three it is the only one that supports inequality join conditions. The process of this join can be shown simply in the following illustrations:



Figure 1. The first step of the nested loop join

Figure 2. The second step of the nested loop join

It is easy to see from the two figures above that the number of times the nested loop join seeks into the inner table equals the number of rows in the outer input, and the loop ends when the outer input runs out of rows. It can also be seen that this kind of join works best when the inner table is ordered (i.e. indexed) on the join key and the outer input has fewer rows than the inner one; otherwise the query optimizer prefers a hash join (described later in this article).

The nested loop join also shows that as data volume grows, its cost grows quickly (in the worst case, proportionally to the product of the two inputs' sizes), so the query optimizer tends to avoid it once the data reaches a certain size.
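The process above can be sketched in Python as a toy model (illustrative only, not SQL Server's actual implementation; the dictionary here merely stands in for the inner table's index):

```python
# Toy model of a nested loop join. The inner input is assumed to have an
# index on the join key, so each outer row costs one index seek instead
# of a full scan of the inner table.
from collections import defaultdict

def nested_loop_join(outer_rows, inner_rows, outer_key, inner_key):
    # This lookup structure stands in for the inner table's index.
    index = defaultdict(list)
    for row in inner_rows:
        index[row[inner_key]].append(row)

    results = []
    for outer_row in outer_rows:                       # one pass over the outer input
        for match in index[outer_row[outer_key]]:      # one seek per outer row
            results.append({**outer_row, **match})
    return results

orders = [{"order_id": 1, "product_id": 870},
          {"order_id": 2, "product_id": 870},
          {"order_id": 3, "product_id": 871}]
products = [{"product_id": 870, "name": "Bike"},
            {"product_id": 871, "name": "Helmet"}]
print(nested_loop_join(orders, products, "product_id", "product_id"))
```

Note that the inner index is probed exactly once per outer row, which is why the cost tracks the size of the outer input.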

Let's take a look at a nested loop join using Microsoft's AdventureWorks sample database:

Figure 3. A simple nested loop join

The ProductID column in Figure 3 is indexed, and 4,688 rows in the outer input of the loop satisfy ProductID = 870, so the inner table has to be sought 4,688 times. Let's consider another example built on the previous query, as shown in Figure 4.

Figure 4. Additional Bookmark lookup for extra columns

As you can see from Figure 4, selecting the extra UnitPrice column means the index used for the join can no longer cover the query, so each row must be completed through a bookmark lookup. This is why selecting only the columns we need is a good habit. To solve the problem above, we could use a covering index, or trim the selected columns, to avoid the bookmark lookups. In addition, only 5 rows match this ProductID, so the query optimizer chooses bookmark lookups; if we increase the number of qualifying rows, the optimizer will tend toward a table scan instead (typically when more than about 1% of the table's rows qualify, though that is by no means absolute), as shown in Figure 5.

Figure 5. The query optimizer has selected a table scan
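The bookmark lookup described above can be modeled in Python (a simplified illustration with made-up data, not SQL Server's storage format): the nonclustered index holds only the key plus a row locator, so every column outside the index forces one extra read into the base table per qualifying row.

```python
# Simplified model of a bookmark lookup (illustrative only).
# The nonclustered index stores the key and a row locator; any column
# not covered by the index must be fetched from the base table.

base_table = {  # row_id -> full row (stands in for the clustered index/heap)
    1: {"ProductID": 870, "UnitPrice": 4.99, "OrderQty": 2},
    2: {"ProductID": 870, "UnitPrice": 3.50, "OrderQty": 1},
    3: {"ProductID": 871, "UnitPrice": 9.99, "OrderQty": 5},
}
nc_index = {870: [1, 2], 871: [3]}  # ProductID -> row locators

def seek_with_bookmark_lookup(product_id, wanted_columns):
    lookups = 0
    rows = []
    for row_id in nc_index.get(product_id, []):
        row = base_table[row_id]   # the bookmark lookup: one extra read per row
        lookups += 1
        rows.append({col: row[col] for col in wanted_columns})
    return rows, lookups

rows, lookups = seek_with_bookmark_lookup(870, ["ProductID", "UnitPrice"])
print(lookups)  # one extra read per qualifying row
```

As the number of qualifying rows grows, these per-row reads add up, which is why the optimizer eventually flips to a single table scan instead.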

As you can see, the query optimizer chooses a table scan for the join at this point, which is much less efficient, so a good covering index, and avoiding SELECT *, are things to watch for. Moreover, even though the situation above involves a table scan, it is still the ideal case. Worse is when multiple inequalities are used as join conditions: the query optimizer, although it knows the statistical distribution of each column, does not know the joint distribution of the combined conditions, which can lead to an incorrect execution plan, as shown in Figure 6.

Figure 6. Deviation caused by the inability to estimate the joint distribution

From Figure 6 we can see a huge deviation between the estimated and actual number of rows, so the query optimizer chooses a bookmark lookup where a table scan should have been used, which hurts performance even more than a table scan would. How much more? We can compare by forcing a table scan against the optimizer's default plan, as shown in Figure 7.

Figure 7. Forcing the table scan performs better

Merge Join

Speaking of merge joins, I suddenly remember an evening lining up for a drink at a bar in Seattle. Because another buddy and I were standing in the wrong place, it looked as if the two of us were cutting into the queue, so I hurried to say: "I'm sorry, I thought you were the end of the line." The other guy humorously replied: "It's OK. In SQL Server, we call it a merge join."

It is easy to see from the story above that a merge join actually joins two ordered queues, so it requires both inputs to be ordered; there is then no need to seek repeatedly into an inner table as with the nested loop join. Second, the join condition must contain at least one equality predicate for the query optimizer to select a merge join.

The process of the Merge join can be described simply in the following illustration:

Figure 8. The first step of the merge join

The merge join first takes the first row from each of the two input sets and compares them; if they match, the matching row is returned. If the two rows do not match, the input whose value is smaller advances to its next row, as shown in Figure 9.

Figure 9. The input with the smaller value advances one row

The merge join can be represented in C# code as shown in Code 1.


public class MergeJoin
{
    // Assumes left and right are already sorted on Key
    public static Relation Merge(Relation left, Relation right)
    {
        Relation output = new Relation();
        while (!left.IsPastEnd() && !right.IsPastEnd())
        {
            if (left.Key == right.Key)
            {
                output.Add(left.Key);
                left.Advance();
                right.Advance();
            }
            else if (left.Key < right.Key)
                left.Advance();
            else // left.Key > right.Key
                right.Advance();
        }
        return output;
    }
}

Code 1. C# representation of the merge join

Therefore, the merge join is generally efficient when both inputs are already ordered; if an explicit Sort is needed to make the inputs ordered, a hash join is usually the more efficient choice. There is one exception: when the query contains an ORDER BY, GROUP BY, or DISTINCT, the query optimizer may have to perform an explicit Sort anyway. In that case, since the sort has to happen regardless, why not reuse its output for a cheap merge join and kill two birds with one stone? In such a situation, the merge join is the better choice.

In addition, we can see from the principle of the merge join that it remains efficient when the join condition is an inequality such as >, <, or >= (but not !=).

Let's look at a simple merge join, in which a clustered index and a nonclustered index keep both inputs of the merge join ordered, as shown in Figure 10.

Figure 10. Both inputs kept ordered by a clustered index and a nonclustered index

Of course, when the query optimizer has to use an explicit Sort because of an ORDER BY or GROUP BY, it will also select a merge join instead of a hash join when it can thereby kill two birds with one stone, as shown in Figure 11.

Figure 11. A merge join that reuses the required sort

Hash Match (Hash Join)

The hash match join is more complex than the previous two, but for large volumes of unordered data it performs much better than the merge join and the nested loop join. The query optimizer prefers a hash join when the join columns are not sorted (that is, have no index).

Hash matching is divided into two phases: the build phase and the probe phase. First comes the build phase, whose process is shown in Figure 12.

Figure 12. First phase of hash matching

In Figure 12, each entry of the input source is run through a hash function and placed into one of several hash buckets. The choice of hash function and the number of hash buckets are a black box; Microsoft has not published the specific algorithm, but I believe it is a very good one. Note also that the entries within a hash bucket are unordered. Generally speaking, the query optimizer uses the smaller of the two join inputs as the input source for this build phase.

Next is the probe phase: for the other input set, the same hash function is applied to each row to find the hash bucket it should belong to; the row is then compared with every row in that bucket, and matching rows are returned.

From the principle of hash matching, it is easy to see that it involves computing hash functions, so CPU consumption is high; moreover, since rows within a hash bucket are unordered, the output is unordered as well. Figure 13 shows a typical hash match, in which the query optimizer uses the smaller Product table as the build input and the SalesOrderDetail table, with its larger amount of data, as the probe input.

Figure 13. A typical hash-matching connection
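The build and probe phases can be sketched in Python as a toy model (illustrative only; SQL Server's actual hash function and bucket count are unpublished):

```python
# Toy model of a hash match join (illustrative; not SQL Server's algorithm).
from collections import defaultdict

def hash_join(build_rows, probe_rows, build_key, probe_key):
    # Build phase: hash every row of the smaller input into buckets.
    buckets = defaultdict(list)
    for row in build_rows:
        buckets[hash(row[build_key])].append(row)

    # Probe phase: hash each row of the larger input and compare against
    # the rows in the matching bucket (rows inside a bucket are unordered).
    output = []
    for row in probe_rows:
        for candidate in buckets[hash(row[probe_key])]:
            if candidate[build_key] == row[probe_key]:  # recheck: hashes can collide
                output.append({**candidate, **row})
    return output

products = [{"ProductID": 870, "Name": "Bike"}]      # smaller input: build side
details = [{"ProductID": 870, "OrderID": 1},
           {"ProductID": 870, "OrderID": 2},
           {"ProductID": 999, "OrderID": 3}]         # larger input: probe side
print(hash_join(products, details, "ProductID", "ProductID"))
```

The smaller input is consumed entirely before any output is produced, which is why the build phase costs memory, while hashing every row on both sides is what drives up CPU.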

All of the above assumes that memory can hold the entire build input. If memory is tight, grace hash matching and recursive hash matching come into play, which may use tempdb and consume a great deal of IO. I won't elaborate here; interested readers can see: http://msdn.microsoft.com/zh-cn/library/aa178403(v=sql.80).aspx.


Summary

Finally, let's summarize the cost and applicable scenarios of these three kinds of joins in a single table:

                     Nested loop join                    Merge join                  Hash join
Applicable scenario  Small outer input; inner input      Both inputs are ordered     Large data volume, no index
                     ordered (indexed) on the join key
CPU                  Low                                 Low (if no explicit sort)   High
Memory               Low                                 Low (if no explicit sort)   High
IO                   May be high or low                  Low                         May be high or low

Understanding these physical joins in SQL Server is essential for performance tuning. Many times, the query optimizer may not be so intelligent when multiple tables are joined, so understanding these joins is particularly important for locating problems. In addition, we can reduce the likelihood of a low-performance join by narrowing the scope of the query from a business perspective.

Reference documents:

http://msdn.microsoft.com/zh-cn/library/aa178403(v=sql.80).aspx
http://www.dbsophic.com/SQL-Server-Articles/physical-join-operators-merge-operator.html

Article from: http://www.cnblogs.com/CareySon/