Today I'll cover the three join operators in SQL Server: Nested Loops, Merge Join, and Hash Match. This article uses examples to show how the three differ and what their complexity is.
This article uses the AdventureWorks sample database, available here: http://msftdbprodsamples.codeplex.com/releases/view/4004
Introduction: What is a join operator
A join operator is an algorithm that the SQL Server optimizer chooses to implement a logical join between two data sets. The optimizer can pick a different algorithm for each join depending on the scenario: the query itself, the available indexes, the statistics, and the estimated row counts.
You can see which operators were chosen by viewing the execution plan. Let's look at each of them in turn.
Nested Loops
Let's take a look at the following example (it queries the data for July 2001):
SELECT OH.OrderDate, OD.OrderQty, OD.ProductID, OD.UnitPrice
FROM Sales.SalesOrderHeader AS OH
JOIN Sales.SalesOrderDetail AS OD
    ON OH.SalesOrderID = OD.SalesOrderID
WHERE OH.OrderDate BETWEEN '2001-07-01' AND '2001-07-31'
The resulting execution plan is as follows:
The operator at the top right of the plan is called the "outer input", and the one below it is called the "inner input".
Essentially, the Nested Loops operator looks for matching rows in the inner input for each row of the outer input.
Technically, this means the clustered index of the outer table is scanned for the rows that make up the outer input, and then a clustered index seek on the inner table finds the rows that match each outer row.
We can verify this by hovering the mouse over the Clustered Index Scan operator and looking at its tooltip:
The estimated number of executions for the scan is 1. The tooltip for the index seek is as follows:
The estimated number of executions for the seek is 179, which is very close to the number of rows returned by the outer input.
In terms of complexity (assuming N is the number of rows produced by the outer input and M is the total number of rows in the SalesOrderDetail table), the query costs O(N log M), where log M is the cost of each seek into the inner input.
The SQL Server optimizer prefers the Nested Loops operator when the outer input is small and the inner input has an index on the join column. The larger the size gap between the outer and inner inputs, the greater the benefit this operator provides.
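The idea behind Nested Loops can be sketched outside SQL Server. The following is a minimal Python illustration, not SQL Server's implementation: the function name, the (key, payload) row shape, and the sample rows are all hypothetical, and a sorted list searched with bisect stands in for the index seek on the inner input.

```python
from bisect import bisect_left, bisect_right

def nested_loops_join(outer_rows, inner_rows):
    """Join (key, payload) rows; a hypothetical sketch, not SQL Server code."""
    # Assumption: sorting once stands in for an index that already exists
    # on the inner input's join column.
    inner_sorted = sorted(inner_rows, key=lambda row: row[0])
    inner_keys = [row[0] for row in inner_sorted]

    result = []
    for key, outer_payload in outer_rows:        # one pass per outer row (N)
        lo = bisect_left(inner_keys, key)        # binary search: O(log M) "seek"
        hi = bisect_right(inner_keys, key)
        for _, inner_payload in inner_sorted[lo:hi]:
            result.append((key, outer_payload, inner_payload))
    return result

# Hypothetical sample data: two order headers, four detail rows.
orders = [(1, "2001-07-01"), (2, "2001-07-02")]
details = [(1, "bike"), (3, "helmet"), (1, "lock"), (2, "pump")]
joined = nested_loops_join(orders, details)
```

Each outer row triggers one binary search into the inner input, which is where the N log M cost comes from.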
Merge Join
The Merge algorithm is the most efficient way to join two large inputs that are both sorted on the join key. Take a look at the following query (it returns customer IDs and sales order IDs):
SELECT OC.CustomerID, OH.SalesOrderID
FROM Sales.SalesOrderHeader AS OH
JOIN Sales.Customer AS OC
    ON OH.CustomerID = OC.CustomerID
The query execution plan is as follows:
- First, both data sets are ordered on CustomerID: in the Customer table because it is the clustered index key, and in the SalesOrderHeader table because the column has a nonclustered index.
- From the arrows between the operators (hover the mouse over them), we can see that each input returns a large number of rows.
- In addition, the ON clause uses the = operator.
These three factors lead the optimizer to choose the Merge Join operator.
The biggest performance advantage of this join operator is that each of the two input operators is executed only once. Hovering the mouse over the two inputs shows that the number of executions is 1 for each, which is what makes the algorithm so efficient.
The merge join reads one row from each input and compares them. If they match, the row is returned. Otherwise, the row with the smaller value is discarded; because the inputs are ordered, a discarded row can no longer match any later row.
This repeats until one of the tables is exhausted, even if the other table still has rows. The worst case is when the two tables have completely different key values, in which case the maximum complexity is O(N+M).
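The read, compare, and discard steps described above can be sketched in Python. This is a minimal illustration under stated assumptions, not SQL Server's implementation: the function name, the (key, payload) row shape, and the sample rows are hypothetical, and both inputs are assumed to be already sorted on the join key.

```python
def merge_join(left, right):
    """Join two lists of (key, payload) rows that are sorted by key.

    Hypothetical sketch: each input is read once, front to back.
    """
    result = []
    i = j = 0
    # Stop as soon as either input is exhausted.
    while i < len(left) and j < len(right):
        left_key, right_key = left[i][0], right[j][0]
        if left_key < right_key:
            i += 1                      # smaller key can never match: discard
        elif left_key > right_key:
            j += 1
        else:
            # Find the run of right rows that share this key.
            j_end = j
            while j_end < len(right) and right[j_end][0] == left_key:
                j_end += 1
            # Pair every left row with this key against that run.
            while i < len(left) and left[i][0] == left_key:
                for k in range(j, j_end):
                    result.append((left_key, left[i][1], right[k][1]))
                i += 1
            j = j_end
    return result

# Hypothetical sample data, both lists sorted by key.
customers = [(1, "Ann"), (2, "Bob"), (4, "Dee")]
orders = [(1, "SO-1"), (1, "SO-2"), (3, "SO-3"), (4, "SO-4")]
merged = merge_join(customers, orders)
```

Neither index ever moves backwards, so the number of comparisons is bounded by N+M (the output itself can be larger when both inputs contain duplicate keys).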
Hash Match
The Hash join is what we call the "go-to guy" operator. It is selected when a scenario is not supported by the other join operators, for example when a table is not sorted or has no index. When the optimizer chooses this operator, it often means that some basic work has not been done (for example, indexing). But in some cases (complex queries) there is no other way, and it is the only option.
Take a look at the following query (it returns the first name, last name, and sales order ID for contacts in the Contact table whose first name starts with "John"):
SELECT OC.FirstName, OC.LastName, OH.SalesOrderID
FROM Sales.SalesOrderHeader AS OH
JOIN Person.Contact AS OC
    ON OH.ContactID = OC.ContactID
WHERE OC.FirstName LIKE 'john%'
The execution plan looks like this:
Because the ContactID column does not have an index, the hash operator is selected.
Before delving into this example, two important concepts need to be introduced: the "hashing" function and the "hash table".
A hashing function is a function that takes one or more values and converts them into a single symbolic value (usually a number). The function is usually one-way, meaning the original value cannot be recovered from the result, but it is deterministic: the same input always produces exactly the same hash value. Note that several different input values can have the same hash value.
A hash table is a data structure that distributes all the rows into buckets of equal size. Each bucket represents a hash value. This means that when you apply the function to a row, the result tells you which bucket the row belongs to.
Using statistics, SQL Server picks the smaller of the two inputs as the build input, which is used to create a hash table in memory. If there is not enough memory, physical disk space in tempdb is used. Once the hash table has been built, SQL Server reads the rows from the larger table, called the probe input. Each probe row is compared against the hash table using the hash match function, and the matching rows are returned. In the graphical execution plan, the build input is the upper one and the probe input is the lower one.
As long as the smaller table is very small, the algorithm is very efficient. But if both tables are very large, this can be a very inefficient execution plan.
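The build and probe phases can be sketched in Python. This is a minimal illustration under stated assumptions, not SQL Server's implementation: toy_hash, the bucket count of 8, the (key, payload) row shape, and the sample rows are all hypothetical, and keys are assumed to be non-negative integers.

```python
def toy_hash(key, bucket_count=8):
    # Toy hashing function (an assumption for illustration): deterministic,
    # and different keys such as 10 and 18 land in the same bucket.
    return key % bucket_count

def hash_join(build_rows, probe_rows, bucket_count=8):
    """Join (key, payload) rows; build_rows should be the smaller input."""
    buckets = [[] for _ in range(bucket_count)]

    # Build phase: place every row of the smaller input into its bucket.
    for key, payload in build_rows:
        buckets[toy_hash(key, bucket_count)].append((key, payload))

    # Probe phase: hash each row of the larger input and scan only its bucket.
    result = []
    for key, probe_payload in probe_rows:
        for build_key, build_payload in buckets[toy_hash(key, bucket_count)]:
            if build_key == key:   # same bucket does not imply same key
                result.append((key, build_payload, probe_payload))
    return result

# Hypothetical sample data: two contacts, four orders.
contacts = [(10, "John Smith"), (11, "Johnny Lee")]
orders = [(11, "SO-1"), (18, "SO-2"), (10, "SO-3"), (11, "SO-4")]
small, large = sorted([contacts, orders], key=len)  # smaller input builds
hashed = hash_join(small, large)
```

The key comparison inside the probe loop is needed because a bucket only groups rows with the same hash value, and order 18 shares a bucket with contact 10 without matching it.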
Query hints
Using hints, you can force SQL Server to use a specified join type. This is strongly discouraged, especially in production environments, because no choice stays the best forever (the data keeps changing) and the optimizer is usually right.
Add an OPTION clause at the end of the query with the keywords LOOP JOIN, MERGE JOIN, or HASH JOIN to force the join type.
See how it is used:
SELECT OC.CustomerID, OH.SalesOrderID
FROM Sales.SalesOrderHeader AS OH
JOIN Sales.Customer AS OC
    ON OH.CustomerID = OC.CustomerID
OPTION (HASH JOIN)

SELECT OC.FirstName, OC.LastName, OH.SalesOrderID
FROM Sales.SalesOrderHeader AS OH
JOIN Person.Contact AS OC
    ON OH.ContactID = OC.ContactID
WHERE OC.FirstName LIKE 'john%'
OPTION (LOOP JOIN)

SELECT OC.FirstName, OC.LastName, OH.SalesOrderID
FROM Sales.SalesOrderHeader AS OH
JOIN Person.Contact AS OC
    ON OH.ContactID = OC.ContactID
WHERE OC.FirstName LIKE 'john%'
OPTION (MERGE JOIN)
Summary
Nested Loops
- Complexity: O(N log M).
- Used when one of the tables is very small.
- The larger table can be searched on the join column using an index.
Merge Join
- Complexity: O(N+M).
- Both inputs are sorted on the join column.
- Uses the = operator.
- Suitable for very large tables.
Hash Match
- Complexity: O(N*HC+M*HM+J).
- The default operator of last resort.
- Matches rows using a hash table and a dynamic hash match function.
This article has described the three join operators and what triggers each of them. These choices are dynamic: as noted earlier, there is no best operator, only the most suitable one, and a different operator is chosen depending on the actual query.