SQL Server: Principles of the Three Table Joins


Introduction

(Reference: http://msdn.microsoft.com/zh-cn/library/dn144699.aspx)

In SQL Server, the familiar logical joins between tables (inner join, outer join) are converted by the engine, based on the selected columns and whether the data is indexed, into one of three physical join operators: the nested loops join, the merge join, and the hash join. Understanding these three physical joins is the basis for understanding performance issues in table joins, so below I describe how each of them works and the scenarios each is suited to.
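Throughout this article, the quickest way to see which physical join the optimizer picked is to look at the plan. A minimal sketch (the query itself is just an example against the AdventureWorks sample database used below):

SQL
-- Show the estimated plan as text instead of executing the query
SET SHOWPLAN_TEXT ON;
GO
SELECT p.Name, sod.SalesOrderID
FROM Production.Product AS p
JOIN Sales.SalesOrderDetail AS sod
    ON p.ProductID = sod.ProductID;
GO
SET SHOWPLAN_TEXT OFF;
GO

The operator appears in the plan output as Nested Loops, Merge Join, or Hash Match.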

Nested Loops Join

The nested loops join is the most basic join. As the name implies, it nests one loop inside another, and it is the only one of the three that supports inequality join predicates. Its process can be demonstrated simply:

Figure 1: The first step of a nested loops join

Figure 2: The second step of a nested loops join

From the two figures above it is not hard to see that the nested loops join probes the inner table once for every row of the outer input, and the loop ends when the outer input runs out of rows. It is also apparent that this join wants the inner table to be ordered (that is, to have an index) and the outer input to have fewer rows than the inner one; otherwise the query optimizer leans toward a hash join (described later in this article).

Nested loops also make clear that the cost grows with the product of the two inputs' row counts, so performance degrades rapidly as the data volume grows; once the data reaches a certain size, the query optimizer tends to switch to the other join types.
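To observe this operator regardless of what the optimizer would otherwise pick, a join hint can force it; a sketch for experimentation only (hints like this are rarely a good idea in production):

SQL
SELECT p.Name, sod.SalesOrderID
FROM Production.Product AS p
JOIN Sales.SalesOrderDetail AS sod
    ON p.ProductID = sod.ProductID
OPTION (LOOP JOIN);   -- force nested loops for every join in the query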

Let's look at a nested loops join in action, using Microsoft's AdventureWorks database:

Figure 3: A simple nested loops join

In Figure 3, ProductID is indexed, and 4,688 rows in the loop's outer input match ProductID = 870, so the corresponding SalesOrderDetail side must be sought 4,688 times. Now consider a variation of the same query, shown in Figure 4.

Figure 4: An extra column brings an additional bookmark lookup

As Figure 4 shows, selecting one more column, UnitPrice, means the join's index no longer covers the query, so each match must be completed by a bookmark lookup. This is why it is a good habit to select only the columns you need. To solve the problem we can either use a covering index or trim the required columns, thereby avoiding the bookmark lookup.

Also, only 5 rows match the ProductID predicate here, so the query optimizer chooses the bookmark lookup; if we increase the number of qualifying rows, the optimizer tends toward a table scan instead (usually beyond roughly 1% of the table's rows it switches from bookmark lookups to a table scan, but this is not absolute), as shown in Figure 5.
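As a sketch of the covering-index fix just mentioned, assuming the query needs only ProductID and UnitPrice (the index name is illustrative):

SQL
-- Carries UnitPrice in the index leaf level so no bookmark lookup is needed
CREATE NONCLUSTERED INDEX IX_SalesOrderDetail_ProductID_Covering
ON Sales.SalesOrderDetail (ProductID)
INCLUDE (UnitPrice);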

Figure 5: The query optimizer selects a table scan

As you can see, the query optimizer now chooses a table scan to perform the join, which is inefficient; a good covering index and avoiding SELECT * are both things to watch for. Even so, the situation above, table scan and all, is still the ideal case. Worse, when multiple inequality predicates are used in a join, the query optimizer knows the statistical distribution of each column individually but not the joint distribution of the combined conditions, which can produce an erroneous execution plan, as Figure 6 shows.

Figure 6: A skewed plan caused by the inability to estimate the joint distribution

Figure 6 shows a huge deviation between the estimated and actual row counts: a table scan should have been used, but the query optimizer chose a bookmark lookup, which at this scale hurts performance more than a table scan would. How much more? We can compare a forced table scan against the optimizer's default plan, as shown in Figure 7.

Figure 7: The forced table scan performs better
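The comparison in Figure 7 can be reproduced with a table hint; a sketch (INDEX(0) forces a scan of the heap or clustered index, and the column list is illustrative):

SQL
SELECT sod.SalesOrderID, sod.OrderQty
FROM Sales.SalesOrderDetail AS sod WITH (INDEX(0))   -- force a full scan
WHERE sod.ProductID = 870;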

Merge Join

Speaking of merge joins, I'm reminded of an evening bar gathering at the SQL PASS Summit in Seattle. A buddy and I were standing in the wrong spot and looked as if we were cutting in line, so I hurried to say, "I'm sorry, I thought here is the end of the line." The other party humorously replied: "It's OK. In SQL Server, we call it a merge join."

The little story above captures it: a merge join in effect joins two sorted queues. It requires both inputs to be ordered, so unlike the nested loops join it never has to probe a table repeatedly inside a loop. Second, a merge join requires at least one equality predicate in the join condition for the query optimizer to choose it.

The Merge join process can be described simply in the following diagram:

Figure 8: The first step of a merge join

The merge join first takes the first row from each of the two inputs and, if they match, returns the matching rows. If the two rows do not match, the input with the smaller value advances one row, as shown in Figure 9.

Figure 9: The input with the smaller value advances one row

The merge join can be expressed in C# as shown in Code 1.

C#
public class MergeJoin
{
    // Assumes that left and right are already sorted on Key
    public static Relation Join(Relation left, Relation right)
    {
        Relation output = new Relation();
        while (!left.IsPastEnd() && !right.IsPastEnd())
        {
            if (left.Key == right.Key)
            {
                // Keys match: emit the row and advance both inputs
                output.Add(left.Key);
                left.Advance();
                right.Advance();
            }
            else if (left.Key < right.Key)
                left.Advance();
            else // left.Key > right.Key
                right.Advance();
        }
        return output;
    }
}

Code 1. C # code representation of the Merge join

Therefore, a merge join is usually very efficient when both inputs are already ordered; but if an explicit sort would be needed just to enable it, a hash join is usually the more efficient choice. There is one exception: when ORDER BY, GROUP BY, or DISTINCT in the query forces the query optimizer into an explicit sort anyway, then from the optimizer's point of view the sort is already paid for, so why not use its sorted output directly for a cheaper merge join, killing two birds with one stone? In that case the merge join is the better choice.
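A sketch of that situation: the ORDER BY below demands output ordered on the join column anyway, which can tip the optimizer toward a merge join (the actual plan depends on statistics, so treat this as illustrative):

SQL
SELECT p.ProductID, p.Name, sod.SalesOrderID
FROM Production.Product AS p
JOIN Sales.SalesOrderDetail AS sod
    ON p.ProductID = sod.ProductID
ORDER BY p.ProductID;   -- the required output order matches the join key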

In addition, from the merge join's principle we can see that it performs well when the join condition is an inequality such as >, <, or >= (though not !=).

Let's look at a simple merge join, where a clustered index and a nonclustered index guarantee that both inputs are ordered, as shown in Figure 10.

Figure 10: Both inputs kept in order by a clustered index and a nonclustered index

And of course, when ORDER BY or GROUP BY forces the query optimizer into an explicit sort anyway, it kills two birds with one stone and picks the merge join over the hash join, as shown in Figure 11.

Figure 11: A merge join that reuses the explicit sort

Hash Match (Hash Join)

The hash match is more complex than the previous two joins, but it outperforms both the merge join and the nested loops join on large volumes of unordered data. The query optimizer tends to use a hash join when the join columns are not sorted (that is, not indexed).

A hash match runs in two phases, build and probe. The build phase comes first and is illustrated in Figure 12.

Figure 12: The build phase of a hash match

In Figure 12, each row of the input source is run through the hash function and placed into its hash bucket. The choice of hash function and the number of buckets are a black box; Microsoft has not published the algorithm, but I trust it is a very good one. Note that the rows within a bucket are unordered. Typically the query optimizer uses the smaller of the two join inputs as the build input.

Next comes the probe phase: each row of the other input is run through the same hash function to determine which bucket it belongs in, and is then compared against every row in that bucket; matching rows are returned.
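To inspect the build and probe sides in a plan of your own, you can hint a hash join; an illustrative sketch:

SQL
SELECT p.Name, sod.OrderQty
FROM Production.Product AS p
JOIN Sales.SalesOrderDetail AS sod
    ON p.ProductID = sod.ProductID
OPTION (HASH JOIN);   -- the smaller input (Product here) typically becomes the build side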

From this principle it is not hard to see why the hash match is CPU-intensive: a hash must be computed for every row. And because rows within a bucket are unordered, the output is unordered too. Figure 13 shows a typical hash match, in which the query optimizer uses the smaller Product table as the build input and the much larger SalesOrderDetail table as the probe input.

Figure 13: A typical hash match join

The situation above assumes memory can hold everything the build phase needs. When memory is tight, SQL Server falls back to a grace hash match or a recursive hash match, which may spill to tempdb and consume a great deal of IO. I won't go into detail here; interested readers can see http://msdn.microsoft.com/zh-cn/library/aa178403(v=sql.80).aspx.

Summary

The table below summarizes the costs and usage scenarios of these joins:

                      Nested loops join                        Merge join                      Hash join
Applicable scenario   Small outer input; inner input           Both inputs ordered             Large data volume, no indexes
                      ordered (indexed) on the join column
CPU                   Low                                      Low (absent an explicit sort)   High
Memory                Low                                      Low (absent an explicit sort)   High
IO                    Can be high or low                       Low                             Can be high or low

Understanding SQL Server's physical joins is essential for performance tuning. When many tables are joined with filter conditions, the query optimizer is not always as smart as one might hope, so knowing these join types is especially important for pinpointing problems. We can also reduce the likelihood of a poorly performing join by narrowing the query's scope from the business side.

Reference documents:

http://msdn.microsoft.com/zh-cn/library/aa178403(v=sql.80).aspx

http://www.dbsophic.com/SQL-Server-Articles/physical-join-operators-merge-operator.html

In a SQL Server database, the query optimizer typically uses one of three join strategies when it processes a table join:

      • Nested loops join (Nested Loops Join)
      • Merge join (Merge Join)
      • Hash join (Hash Join)

Fully understanding how these three joins work gives us a basis for optimizing SQL Server joins at the code level and provides some ideas for tuning work. Let's get to know each of them.

1. Nested loops join (Nested Loops Join)

This join is often used with small data volumes and simple statements and is one of the more common strategies; for example, a query like the following (the filter value here is illustrative; any single order works):

SQL
SELECT H.SalesOrderID, D.SalesOrderDetailID, D.ProductID
FROM Sales.SalesOrderHeader H
JOIN Sales.SalesOrderDetail D
    ON H.SalesOrderID = D.SalesOrderID
WHERE H.SalesOrderID = 43659;   -- a single order; this one has 12 detail rows

AdventureWorks2008 is a SQL Server sample database; you can download it for free from Microsoft's official site: http://msftdbprodsamples.codeplex.com/releases/view/37109

We run this code against the database, and the execution plan shows that the optimizer used a nested loops join. The Sales.SalesOrderHeader table serves as the outer table because it contributes only one row, while Sales.SalesOrderDetail, with 12 matching rows, serves as the inner table.

How a nested loops join works:

Figure 1: Schematic of a nested loops join

The principle is to filter the outer table by its predicates first, then read one row at a time from the outer table and probe the inner table for matches. In pseudocode:

for (i = 0; i < outerTable.RowCount; i++)
{
    // compare the current outer row against every inner row
    for (j = 0; j < innerTable.RowCount; j++)
    {
        if (outerTable[i].JoinKey == innerTable[j].JoinKey)
            output.Add(Combine(outerTable[i], innerTable[j]));   // emit the joined row
    }
}

Knowing how the nesting works, it is not hard to see that this join has some limitations:

1. Because the algorithm is a loop over loops, it suits joins between tables with smaller data volumes, above all a small outer table.

2. Both tables should ideally be sorted, and the filter and join columns should be indexed; the inner table in particular must have an index, which can multiply efficiency.

Nested loops are efficient when the outer table is small, the inner table is large, and the join columns are indexed. Nested loops are also the only one of the three strategies that supports inequality joins.
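For point 2 above, indexing the inner table's join column is usually a one-liner; a sketch, assuming a join on ProductID (AdventureWorks in fact ships a similar index under this very name, so rename it before experimenting):

SQL
-- Lets each outer row be matched by an index seek instead of a table scan
CREATE NONCLUSTERED INDEX IX_SalesOrderDetail_ProductID
ON Sales.SalesOrderDetail (ProductID);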

2. Merge join (Merge Join)

In a SQL Server database, if the query optimizer finds that the two tables being joined are sorted on the join columns and indexed, it is highly likely to choose the merge join strategy. The conditions: both tables are sorted, and the join condition contains at least one equality predicate; the query optimizer will then consider a merge join.

Code example (illustrative: two tables whose clustered keys share the join column):

SQL
SELECT M.ProductModelID, M.Name, C.ProductDescriptionID
FROM Production.ProductModel M
JOIN Production.ProductModelProductDescriptionCulture C
    ON M.ProductModelID = C.ProductModelID;
-- both tables are clustered on ProductModelID, so both inputs arrive sorted

The execution plan shows that this operation used a merge join:

The two tables contain 128 and 762 rows respectively, the join columns are primary keys, and the data is ordered, so the database's query optimizer automatically chose the merge join. It works as follows:

Figure 2: How a merge join works

Having decided on a merge join, the database engine takes the first row from each of the two inputs (the term is "input sets") and compares them; a match is joined and returned. If they do not match, the input with the smaller value advances to its next row and the comparison is retried.

From how it works, we can see that a merge join behaves like two cursors advancing over the two inputs in step, fetching and comparing rows. This requires both inputs to be ordered, and sorted inputs greatly improve its efficiency.

On the sorting question: if the data would have to be explicitly sorted just for the join, the SQL Server optimizer prefers a hash join instead. But ORDER BY, GROUP BY, and DISTINCT do not rule a merge join out; when these clauses already force a sort, the query optimizer has a good chance of choosing a merge join, since the sorted output is available anyway.

When query predicates use inequalities such as >, <, or >= to bound the range of conditions, a merge join also performs well.

Qualifying conditions for a merge join:

1. The join columns of both tables must be sorted.

2. The join columns must be indexed.
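To check whether a join column meets these conditions, list the table's indexes; a quick sketch using one of the tables above:

SQL
EXEC sp_helpindex 'Production.ProductModel';   -- shows index names, descriptions, and key columns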

3. Hash join (Hash Join)

SQL Server's query optimizer tends toward a hash join when we join two sizeable tables that are neither sorted nor indexed on the join columns.

Code example (illustrative: joining the full tables with no restrictive predicate):

SQL
SELECT P.Name, D.OrderQty
FROM Production.Product P
JOIN Sales.SalesOrderDetail D
    ON P.ProductID = D.ProductID;
-- with no filter, hashing the smaller Product table and probing with
-- SalesOrderDetail beats performing over a hundred thousand index lookups

The execution plan shows that this time the operation used a hash match:

This strategy handles large volumes of unordered data more efficiently than the others, but it consumes more processor and memory resources. It executes as follows.

A hash join executes in two phases: build and probe.

Build refers to a series of operations on one input table. The optimizer first scans each row of the input into memory and computes its hash value with the built-in hash function; rows with the same hash value are grouped into one hash bucket. These hash values and row addresses are stored in a hash table for the probe phase to use. Normally the optimizer picks the table with less data as the build input.

Once the build is complete, probing begins. The other table in the join (which we call the probe input) is likewise scanned row by row, computing a hash value for each row. The join uses the probe row's hash value to locate the corresponding bucket in the build input's hash table, scans that bucket for matches, and joins the rows it finds.

That is the hash join workflow. Next let's look at how the hashing mechanism itself works; what follows is my personal understanding of the algorithm, so please correct me where I'm off.

A hash function takes a set of data and, through an algorithm, transforms it into a fixed-length output called the hash value. Hashing is used widely in security fields such as cryptography.

In SQL Server the hash function is a black box; there is no published algorithm to consult. In practice, many developers solving massive-data lookup problems use hashing with an algorithm of their own choosing; common ones include CRC checksums, MD2, MD4, MD5, SHA-1, and so on.

Because of how these algorithms work, different data can produce the same hash value. Hashing scatters a large amount of data into different heaps or lists according to a rule and establishes an internal mapping. We can think of it as a combination of an array and linked lists, yielding an easy-to-access data structure; a hash table is then a mapping between data content and data storage addresses.

The choice of hash function determines the hash table's bucket count and how much data each key value holds. That is a mathematical question and is not discussed further here.

If this still seems abstract, let's take an example.

Suppose there are two tables:

Table A {a, f, c, d, b, e ...}

Table B {f, b, e, d, a, f ...}

where table A holds less data than table B. The hash join of the two tables proceeds as follows:

1. The database first scans all of table A's data into memory.

2. Each of table A's rows in memory is passed through the hash function to obtain its hash value.

3. Rows of table A with the same hash value (key) are placed together into one hash bucket. Personally, I view the bucket's contents as an array-plus-linked-list arrangement: the hash key acts as the array subscript, and the rows in the bucket chain off it as a linked list:

Hash key --> Data 1 --> Data 2 ...

For instance, if rows a and c share the hash value 001, both are assigned to the bucket named 001.

4. The hash values and their corresponding rows are stored in a hash table, and the build is complete.

5. In the probe phase, the database scans table B row by row, computing each row's hash value with the same hash function.

6. Using that hash value, it matches against table A's keys in the hash table to find the corresponding bucket.

7. The row from table B is then compared against each row in that bucket; if a match succeeds, the rows are joined.

From this principle we can see that a hash join requires a large amount of computation, which puts some pressure on the CPU. The operation is normally done in memory; if memory is insufficient, the database writes data out to disk, which hurts performance.

4. Summary

Characteristics of the three join modes:

Type     Join column index                              Table size                    Sort           Join clause
Nested   Inner table: required                          Small                         Optional       All types
         Outer table: preferred
Merge    Inner: clustered or covering index required    Large                         Required       Equi-join
         Outer: clustered or covering index required
Hash     Inner table: not required                      Small outer (build) table,    Not required   Equi-join
         Outer table: optional, preferably indexed      large inner (probe) table

Resource pressure of the three strategies:

         Nested loops join      Merge join                      Hash join
CPU      Low                    Low (absent an explicit sort)   High
Memory   Low                    Low (absent an explicit sort)   High
IO       Can be high or low     Low                             Can be high or low

The above is my personal understanding of the three joins; please correct me where it falls short.

Off Topic:

In fact, we can compare these three joins to matchmaking.

A nested loops join is an introduction by acquaintances: relatives and friends search the resources around them according to your conditions, then arrange meetings with several candidates to see whether anyone matches. If your conditions are clear (the outer table is indexed) and the friend knows the candidates and their requirements well (the inner table is indexed), the success rate will be relatively high.

A merge join is a small matchmaking social organized by a community or website, like the eight-minute face-to-face rounds in the movie "33 Days in Love". The men and women talk one-on-one (the match judgment) and rotate every few minutes to talk again, and because everyone's conditions and purpose are clear (everything is indexed and sorted), the overall process is quite efficient.

A hash join is like a mass blind-date fair, such as the one in Shanghai's Zhongshan Park. The parents of single youths, once inside the park, split into small groups for various reasons (the hash function dividing rows into buckets). Participants then use their own judgment (confirming the hash key) to find the appropriate group (equal hash keys), exchange conditions and information (attempting the match), and see whether a suitable candidate emerges for further acquaintance (a successful match: the rows are joined).

November 14, 2013 Ralf Wang
