Introduction to the Physical Join Algorithms in SQL Server


In the earlier posts In-Depth Clustered Indexes and Non-Clustered Indexes (1) and (2), we analyzed in detail how SQL Server uses heaps and B-trees to organize tables and how these two data structures support our queries.

 

Here we continue with the join algorithms in SQL Server.

 

A join algorithm is what the engine uses to physically join its inputs. SQL Server supports three physical join algorithms (a sketch after this list shows how to force each one so their plans can be compared):

1. Nested loops

2. Merge join

3. Hash join
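
If you want to observe each algorithm yourself, the following is a minimal sketch, assuming the dbo.Customers and dbo.Orders tables (joined on custid) that are used later in this article. The OPTION query hints force a particular physical join operator so the three plans can be compared side by side; normally you should let the optimizer choose.

-- Sketch: force each physical join operator to compare the resulting execution plans.
-- Assumes dbo.Customers and dbo.Orders joined on custid (the tables used later in this article).
SELECT C.custid, C.custname, O.orderid, O.orderdate
FROM dbo.Customers AS C
JOIN dbo.Orders AS O ON O.custid = C.custid
OPTION (LOOP JOIN);   -- nested loops

SELECT C.custid, C.custname, O.orderid, O.orderdate
FROM dbo.Customers AS C
JOIN dbo.Orders AS O ON O.custid = C.custid
OPTION (MERGE JOIN);  -- merge join

SELECT C.custid, C.custname, O.orderid, O.orderdate
FROM dbo.Customers AS C
JOIN dbo.Orders AS O ON O.custid = C.custid
OPTION (HASH JOIN);   -- hash match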

 

In fact, these algorithms are often used in programming and are not hard to understand.

 

1. Nested loops

In general, a nested loops join corresponds to a two-level for loop: for each row produced by the outer loop, the inner loop is executed once to look for matches.

Correspondingly, the outer loop corresponds to the outer input table, which appears on top in the graphical execution plan, and the inner loop corresponds to the inner input table, which appears below it.

It is worth noting that every row of the outer input must be processed, but not necessarily every row of the inner table. Suppose the outer input has N rows and the inner input has M rows; the worst-case time complexity is O(N * M). In that worst case the optimizer does not use nested loops but switches to the hash match algorithm, which is described later.

From this we can draw the following inferences:

1. The smaller the outer input, the better, because every row of the outer input must be matched; that work cannot be reduced.

2. For the inner input table, an index on the join column can narrow the matching range, so the matching rows can be found with a small number of seeks.

Therefore, the nested loops algorithm is most effective when the join condition is highly selective and a useful index exists on the inner input's join column.

 

In the following example, the number of records in the Customers table is far smaller than in the Orders table (there must be fewer customers than orders), so the optimizer chooses the Customers table as the outer input. We can see the Customers table above the Orders table in the execution plan.

When we run the query directly, SQL Server helpfully tells us which index is missing. In this example, what is missing is a non-clustered index on the inner input's join column custid together with the query column orderdate.
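
The original query and plan screenshots are not reproduced here, but judging from the tables, columns, and missing-index suggestion discussed above, the query is presumably of roughly the following shape (a hypothetical reconstruction; the literal values are placeholders):

-- Hypothetical reconstruction of the example query (the original screenshot is not shown here).
-- A selective filter on Customers (outer input) joined to Orders (inner input) on custid,
-- with an additional predicate on orderdate; literal values are placeholders.
SELECT O.orderid, O.orderdate, O.empid, O.shipperid
FROM dbo.Customers AS C
JOIN dbo.Orders AS O ON O.custid = C.custid
WHERE C.custname = N'Cust_1000'
  AND O.orderdate >= '20080401'
  AND O.orderdate < '20080402';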

 

 

After we run the following statement to add the index:

CREATE INDEX idx_nc_cid_od_i_oid_eid_sid
ON dbo.Orders (custid, orderdate)
INCLUDE (orderid, empid, shipperid);

 

SQL Server still reports a missing index, because we can also use an index on the outer input table to filter it first, reducing the size of the outer input before the join.

To achieve this, the index that is still missing is a non-clustered index on the Customers table, with custname as the filtering key column and custid as an included column so the index is covering.

CREATE INDEX idx_nc_cn_i_cid
ON dbo.Customers (custname) INCLUDE (custid);

With this index in place, the index scan on the Customers table becomes an index seek.

 

This is the execution plan we finally get after adding both indexes:

 

 

 

 

2. Merge join

When both inputs are sorted on the join column, the join operation can be performed very efficiently.

The importance of sorted order needs no argument: binary search and many other search algorithms rely on the input sequence being ordered.

So why bring up indexes first? Because an index gives us a data structure that is already sorted: the leaf level of a B-tree is maintained in logical key order. In other words, both clustered and non-clustered indexes can deliver ordered data at low cost through an ordered scan of the leaf level. In that case a merge join is quite powerful even when the inputs are fairly large. Furthermore, if the optimizer determines that the rows on one side of the join are unique on the join column, it can use one-to-many matching (the other side may contain duplicates); in that situation the merge join is usually the most efficient of the join algorithms.

However, if the required columns do not meet these conditions, sorting a large input is a very expensive operation (a comparison-based sort is at best O(n log n)), so the optimizer usually does not choose a merge join in that case. Sorting a small input is acceptable, though, and a small input can be obtained through filtering, as in the previous example.
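
As a quick way to see the effect of the B-tree's ordered leaf level, compare the two queries below. This is a sketch assuming the covering index on dbo.Orders created earlier in this article and assuming no index is keyed on shipperid: the first query can be answered by an ordered scan of the index and typically shows no Sort operator in the plan, while the second typically needs an explicit Sort.

-- Ordered output straight from the index leaf level (assumes idx_nc_cid_od_i_oid_eid_sid exists):
SELECT custid, orderdate, orderid
FROM dbo.Orders
ORDER BY custid, orderdate;  -- usually satisfied by an ordered scan, no Sort operator

-- Ordering by a column that is not an index key (assumed) requires an explicit Sort on a large input:
SELECT custid, orderdate, orderid
FROM dbo.Orders
ORDER BY shipperid;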

 

 

Analysis:

As for the join column custid: for the Customers table it is the clustering key of the table's clustered index, and for the Orders table we created the non-clustered covering index on custid above. Both tables can therefore return data ordered by custid at low cost via an ordered scan.

Since ordered data on the join column custid can be obtained cheaply from both tables, the optimizer chooses a merge join.
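
Concretely, the plain join below (sketched here because the original plan screenshots are not reproduced) lets both inputs be read in custid order, Customers via its clustered index and Orders via the covering index created above, so the optimizer tends to produce a Merge Join operator; OPTION (MERGE JOIN) can be appended if you want to pin the operator while comparing costs.

-- Both inputs can be scanned in custid order, so a merge join is typically chosen.
-- (A sketch; statistics and table sizes on your system may lead to a different plan.)
SELECT C.custid, C.custname, O.orderid, O.orderdate
FROM dbo.Customers AS C
JOIN dbo.Orders AS O ON O.custid = C.custid;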

 

As the plan shows, the scan of the Orders table accounts for the largest share of the total cost, 68%, because the Orders table contains a lot of data. What if we add another filter condition on the Orders table, for example a restriction on orderdate?

 

Because this filter is highly selective, the result contains only a little over 1,000 rows, about 0.1% of the 1,000,000 rows in the Orders table. Weighing the two options, the optimizer can abandon the full table scan in favor of filtering first and then sorting.
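
The filtered variant is presumably something like the sketch below (the date range is a placeholder); with such a selective predicate the optimizer can apply the orderdate filter first and sort the small intermediate result by custid before the merge.

-- Hypothetical filtered version of the join (placeholder dates): a highly selective orderdate predicate.
SELECT C.custid, C.custname, O.orderid, O.orderdate
FROM dbo.Customers AS C
JOIN dbo.Orders AS O ON O.custid = C.custid
WHERE O.orderdate >= '20080401'
  AND O.orderdate < '20080402';  -- roughly 0.1% of the 1,000,000 Orders rows, per the text above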

Of course, the optimizer's estimate can also be wrong. If the result turns out not to be highly selective after all, the cost of the sort is usually very large. This is something to watch out for when you see the optimizer generate a merge join plan.

 

 

3. Hash join

For the underlying principle, refer to the article I wrote earlier. To reduce memory usage, the table with the smaller amount of data is used to build the hash table; the other table is then scanned row by row, and the hash function is used to check whether each value already exists in the hash table.

A hash join is usually chosen because a suitable index is missing, which is especially common in OLTP applications.

In the first example, before any index had been created, a hash match was used: the Customers table is scanned row by row to build the hash table. Because the orderdate predicate in the WHERE clause reduces the matching range, the clustered index on the Orders table is used first to cut down the number of matching Orders rows, and those rows are then probed row by row against the hash table built above.

 

A hash match may also be used even when the appropriate indexes are not missing. Here we widened the orderdate range by dozens of times, querying several months instead of a single day, and the merge join algorithm is no longer the best fit.
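
For example, widening the range to several months (a sketch; the exact dates are placeholders) makes the intermediate result large enough that the optimizer typically switches to a hash match:

-- Widened orderdate range (placeholder dates): the much larger input makes hash match more attractive.
SELECT C.custid, C.custname, O.orderid, O.orderdate
FROM dbo.Customers AS C
JOIN dbo.Orders AS O ON O.custid = C.custid
WHERE O.orderdate >= '20080101'
  AND O.orderdate < '20080501';  -- several months instead of a single day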

 

 

Summary:

In terms of time complexity the hash join algorithm is optimal: joining a table of M records to a table of N records costs O(M + N), which is better than the O(M * log N) of a nested loops join that uses a clustered index on the inner side. However, if there is not enough memory to hold the hash table while it is being built, rows are spilled to temporary storage, generating a large amount of I/O that can cancel out the benefit of this join.
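
As a rough illustration with hypothetical sizes: joining M = 1,000,000 rows to N = 1,000,000 rows costs on the order of M + N = 2,000,000 row operations for a hash join, versus roughly M * log2(N) ≈ 1,000,000 × 20 = 20,000,000 seek steps for nested loops driven from the million-row side. With a small outer input of, say, 1,000 rows the comparison flips (about 1,001,000 versus 20,000), which is why nested loops wins when the outer input is small and hash join wins when both inputs are large.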

 

If you found this helpful, click "Recommend" or leave a comment to discuss your ideas.

Next, I plan to write a series of articles on Objective-C and Git.
