Introduction to SQL Server join operators (Nested Loops, Hash Match, and Merge Join)

Last Update: 2016-04-28 Source: Internet

Author: User



Today I'll cover the three join operators in SQL Server: Nested Loops, Hash Match, and Merge Join. This article uses examples to introduce the differences between the three joins and their complexity.

The sample database AdventureWorks is used in this article. It is available at: http://msftdbprodsamples.codeplex.com/releases/view/4004

Introduction: What is a join operator?

A join operator is an algorithm that the SQL Server optimizer chooses to implement a logical join between two data sets. The optimizer can select a different algorithm for each pair of data sets depending on the scenario: the query itself, the available indexes, the statistics, and the estimated row counts.

You can find out which operator is used by viewing the execution plan. Next, let's look at each operator in turn.

Nested Loops

Let's take a look at the following example (it queries the data for July 2001):

SELECT oh.OrderDate, od.OrderQty, od.ProductID, od.UnitPrice
FROM Sales.SalesOrderHeader AS oh
JOIN Sales.SalesOrderDetail AS od
    ON oh.SalesOrderID = od.SalesOrderID
WHERE oh.OrderDate BETWEEN '2001-07-01' AND '2001-07-31'

The execution plan is as follows:

The input at the top right of the plan is called the "outer input"; the one below it is called the "inner input".

Essentially, the Nested Loops operator finds, for each row of the outer input, the matching rows in the inner input.

Technically, this means that the clustered index of the outer table is scanned to obtain the relevant outer input rows, and then a clustered index seek on the inner table finds the rows that match each outer row.

We can verify this by hovering the mouse over the clustered index scan operator to see its tooltip:

The estimated number of executions for the scan is 1. The tooltip for the index seek is as follows:

This time the estimated number of executions is 179, which is very close to the number of rows returned by the outer input.

In terms of complexity (assuming N is the number of rows returned by the outer input and M is the total number of rows in the SalesOrderDetail table), the query costs O(N log M), where log M is the cost of each seek into the inner input table.

The SQL Server optimizer prefers the Nested Loops operator when the outer input is small and the inner input has an index on the join column. The larger the difference in row counts between the outer and inner inputs, the greater the benefit this operator provides.
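The behaviour described above can be sketched in a few lines of Python. This is only an illustration of the idea, not SQL Server's implementation: the function name `nested_loops_join` and the row format (lists of dicts) are assumptions made for the example, and a sorted list searched with `bisect` stands in for the index on the inner table.

```python
from bisect import bisect_left

def nested_loops_join(outer_rows, inner_rows, outer_key, inner_key):
    """For every outer row, seek the matching inner rows (hypothetical helper)."""
    # Stand-in for an index on the inner join column:
    # (key, position) pairs sorted by key, built once up front.
    index = sorted((row[inner_key], pos) for pos, row in enumerate(inner_rows))
    keys = [key for key, _ in index]

    result = []
    for outer in outer_rows:                        # N iterations
        wanted = outer[outer_key]
        pos = bisect_left(keys, wanted)             # O(log M) "index seek"
        while pos < len(keys) and keys[pos] == wanted:
            inner = inner_rows[index[pos][1]]
            result.append({**outer, **inner})       # emit the joined row
            pos += 1
    return result
```

Each outer row triggers one binary search into the inner input, which is where the O(N log M) cost comes from.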

Merge Join

The Merge algorithm is the most efficient way to join two large inputs that are both sorted on the join key. Take a look at the following query (it returns the customer ID and the sales order ID):

SELECT oc.CustomerID, oh.SalesOrderID
FROM Sales.SalesOrderHeader AS oh
JOIN Sales.Customer AS oc
    ON oh.CustomerID = oc.CustomerID

The query execution plan is as follows:

First, both data sets are ordered on CustomerID: the clustered index on the Customer table is ordered on it, and the column has a nonclustered index on the SalesOrderHeader table.
Second, hovering the mouse over the arrows between the operators shows that each input returns a large number of rows.
In addition, the ON clause uses the = operator.

These three factors lead the optimizer to select the Merge Join operator.

The biggest performance advantage of this join operator is that each of the two input operators is executed only once. Hovering the mouse over either input shows that its number of executions is 1, which is why the algorithm is so efficient.

The merge join reads one row from each input at a time and compares them. If they match, the joined row is returned. If they do not match, the row with the smaller key is discarded: because the inputs are ordered, a discarded row can no longer match any later row.

This continues until one of the tables has been read completely; any rows left in the other table cannot match, so they are not needed. In the worst case, for example when the two tables have completely different key values, both inputs are read to the end, so the maximum complexity is O(N+M).
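The comparison loop described above can be sketched as follows. This is a simplified illustration, not SQL Server's code: `merge_join` is a hypothetical name, rows are plain dicts, and both lists are assumed to be already sorted on the join key.

```python
def merge_join(left, right, key):
    """Join two lists of dicts that are BOTH already sorted on `key`."""
    result = []
    i = j = 0
    while i < len(left) and j < len(right):    # stop when either input runs out
        left_key, right_key = left[i][key], right[j][key]
        if left_key < right_key:
            i += 1                             # smaller key can never match later rows
        elif left_key > right_key:
            j += 1
        else:
            # Emit every right row sharing this key with the current left row.
            j_end = j
            while j_end < len(right) and right[j_end][key] == left_key:
                result.append({**left[i], **right[j_end]})
                j_end += 1
            i += 1
            # Skip the matched right rows only once the left key changes,
            # so duplicate left keys can match the same group again.
            if i == len(left) or left[i][key] != left_key:
                j = j_end
    return result
```

When the key is unique on at least one side, as CustomerID is in the Customer table, every row of each input is read once, which matches the O(N+M) bound. Duplicate keys on both sides force the sketch to revisit a group of rows.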

Hash Match

The Hash join is what we call the "go-to guy" operator. It is selected when a scenario is not supported by the other join operators, for example when a table is not sorted or has no usable index. When the optimizer chooses this operator, it often means that some basic work (such as indexing) has not been done. But there are cases (complex queries) where there is no other way and it is the only choice.

Take a look at the following query (it returns the first name, last name, and sales order ID for contacts whose first name starts with "John"):

SELECT oc.FirstName, oc.LastName, oh.SalesOrderID
FROM Sales.SalesOrderHeader AS oh
JOIN Person.Contact AS oc
    ON oh.ContactID = oc.ContactID
WHERE oc.FirstName LIKE 'john%'

The execution plan looks like this:

Because the ContactID column does not have an index, the hash operator is selected.

Before delving into this example, two important concepts need to be introduced: the hash function and the hash table.

A hash function is a function that receives one or more values and converts them to a single symbolic value (usually a number). The function is usually one-way, meaning the original value cannot be recovered from the result, but it is deterministic: if you provide the same value, the hash value is always exactly the same. Note, however, that several different input values can have the same hash value.

A hash table is a data structure that divides all the rows into buckets of the same size. Each bucket represents a hash value. This means that when you apply the function to a row, the result tells you which bucket the row belongs to.
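A small Python sketch may help make the two concepts concrete. The function `toy_hash` is invented for this illustration and has nothing to do with the hash function SQL Server actually uses. It adds up the character codes of a value and takes the remainder modulo the number of buckets.

```python
def toy_hash(value, bucket_count):
    """Deterministic toy hash: sum of character codes modulo the bucket count."""
    return sum(ord(ch) for ch in str(value)) % bucket_count

def build_buckets(values, bucket_count):
    """Place each value in the bucket selected by its hash value."""
    buckets = [[] for _ in range(bucket_count)]
    for value in values:
        buckets[toy_hash(value, bucket_count)].append(value)
    return buckets
```

The same input always lands in the same bucket, but the hash cannot be reversed: "ab" and "ba" contain the same characters, so they produce the same hash value and share a bucket.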

Using statistics, SQL Server chooses the smaller of the two inputs as the build input, which is used to create a hash table in memory. If there is not enough memory, physical disk space in tempdb is used. After the hash table is built, SQL Server reads the data from the larger table, called the probe input. The hash match function is applied to each probe row to compare it with the hash table, and the matching rows are returned. In the graphical execution plan, the build input is the upper branch and the probe input is the lower branch.

As long as the smaller table is very small, the algorithm is very efficient. But if both tables are very large, this can be a very inefficient execution plan.
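The build and probe phases can be sketched like this. It is a minimal illustration under stated assumptions: `hash_join` is a hypothetical name, the smaller input is chosen by comparing list lengths rather than statistics, everything stays in memory (no spill to tempdb), and Python's built-in dict plays the role of the hash table.

```python
from collections import defaultdict

def hash_join(left, right, left_key, right_key):
    """Join two unsorted lists of dicts with a build phase and a probe phase."""
    # Choose the smaller input as the build input (by length, as a stand-in
    # for the statistics the optimizer would consult).
    if len(left) <= len(right):
        build, probe, build_key, probe_key = left, right, left_key, right_key
    else:
        build, probe, build_key, probe_key = right, left, right_key, left_key

    # Build phase: hash every build row on its join key.
    table = defaultdict(list)
    for row in build:
        table[row[build_key]].append(row)

    # Probe phase: look up each probe row's key in the hash table.
    result = []
    for row in probe:
        for match in table.get(row[probe_key], []):
            result.append({**match, **row})
    return result
```

Neither input needs to be sorted or indexed, which is why this approach works when the other two operators are not suitable. The price is the memory needed to hold the hash table for the build input.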

Query hints

Using query hints, you can force SQL Server to use a specific join type. This is strongly discouraged, especially in production environments, because no choice stays the best forever (the data keeps changing), and the optimizer is usually right.

Add the OPTION clause at the end of the query with the keywords LOOP JOIN, MERGE JOIN, or HASH JOIN to force the join type.

Here is how it looks:

SELECT oc.CustomerID, oh.SalesOrderID
FROM Sales.SalesOrderHeader AS oh
JOIN Sales.Customer AS oc
    ON oh.CustomerID = oc.CustomerID
OPTION (HASH JOIN)

SELECT oc.FirstName, oc.LastName, oh.SalesOrderID
FROM Sales.SalesOrderHeader AS oh
JOIN Person.Contact AS oc
    ON oh.ContactID = oc.ContactID
WHERE oc.FirstName LIKE 'john%'
OPTION (LOOP JOIN)

SELECT oc.FirstName, oc.LastName, oh.SalesOrderID
FROM Sales.SalesOrderHeader AS oh
JOIN Person.Contact AS oc
    ON oh.ContactID = oc.ContactID
WHERE oc.FirstName LIKE 'john%'
OPTION (MERGE JOIN)

Summary

Nested Loops

Complexity: O(N log M).
Used when one of the tables is very small.
The larger table can be searched on the join column using an index.

Merge Join

Complexity: O(N+M).
Both inputs are ordered on the join column.
The join uses the = operator.
Suitable for very large tables.

Hash Match

Complexity: O(N*hc + M*hm + J).
The operator of last resort.
Matches rows using a hash table and a dynamic hash match function.

This article described the three join operators and the conditions that trigger them. These conditions are dynamic: as noted earlier, there is no best operator, only the most suitable one, and a different operator is chosen depending on the actual query.


This article is an English version of an article which is originally in the Chinese language on aliyun.com and is provided for information purposes only. This website makes no representation or warranty of any kind, either expressed or implied, as to the accuracy, completeness ownership or reliability of the article or any translations thereof. If you have any concerns or complaints relating to the article, please send an email, providing a detailed description of the concern or complaint, to info-contact@alibabacloud.com. A staff member will contact you within 5 working days. Once verified, infringing content will be removed immediately.
