This article catalogs the most common operations that appear in SQL Server execution plans.
Data Access Operations
First, the most basic operation is accessing the data, which can be done either by reading the table directly or by going through an index. The data in a table is organized either as a heap or as a B-tree: when the table has no clustered index, it is stored as a heap, which is unordered; once a clustered index exists, the table data is organized as a B-tree, as is the data of every nonclustered index, and B-trees keep rows in key order. Typically a nonclustered index contains only a subset of the table's columns, and a filtered index contains only a subset of its rows.
Besides the difference in organization, data access itself comes in two forms: scan and seek. A scan reads the entire structure, while a seek retrieves only the qualifying rows. It follows that, because a heap is unordered, a seek cannot be performed on a heap, whereas the ordered B-tree does support seeks. When data is accessed in a table organized as a heap, a table scan is performed, as shown in Figure 1.
Figure 1. Table Scan
As you can see, the table scan icon clearly conveys its nature: the entire unordered structure is read from beginning to end.
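The contrast between scanning and seeking can be sketched in a few lines of Python (an illustrative model only, with made-up rows; SQL Server's actual storage structures are far more involved): a heap supports only a full scan, while an ordered structure supports a binary-search-style seek.

```python
import bisect

# Unordered heap: rows sit in insertion order.
heap = [("Bob", 2), ("Ann", 1), ("Cy", 3)]

def heap_scan(rows, predicate):
    """A heap scan must examine every row, even for a single match."""
    return [r for r in rows if predicate(r)]

# Ordered structure (stand-in for a B-tree leaf level): rows sorted by key.
btree = sorted(heap, key=lambda r: r[0])
keys = [r[0] for r in btree]

def index_seek(key):
    """A seek exploits the ordering to jump straight to the qualifying row."""
    i = bisect.bisect_left(keys, key)
    return btree[i] if i < len(keys) and keys[i] == key else None

print(heap_scan(heap, lambda r: r[0] == "Cy"))  # touches every row
print(index_seek("Cy"))                         # direct descent to the row
```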
Both clustered and nonclustered indexes, being B-tree structures, can also be scanned. A clustered index scan is usually chosen either to retrieve all the rows of the table or when the query needs a large enough share of the rows that a scan costs less than repeated seeks, as shown in Figure 2.
Figure 2. Clustered Index Scan
The clustered index scan icon likewise conveys its nature: SQL Server locates the leftmost leaf node and then reads all leaf nodes sequentially, scanning the entire structure. The same holds for nonclustered indexes, as shown in Figure 3.
Figure 3. Nonclustered index scan
When only some of the rows in a B-tree structure are needed, the index seek is what makes the B-tree worthwhile. Given the key value being sought, SQL Server follows a single path from the root of the B-tree down to a leaf, avoiding all the pages it does not need. Figure 4 shows an index seek in a query plan.
Figure 4. Clustered Index Seek
The index seek icon is also vivid: you can see the line running from the root node down to the leaf node, which is exactly how the page holding the sought data is found. It is not hard to see that if the rows we need are spread across many pages, the seek must be repeated many times; once that number grows large enough, SQL Server chooses a cheaper index scan instead of repeated seeks. The concept is the same for nonclustered index seeks, so no separate figure is shown.
Bookmark Lookup
You might think a nonclustered index can find rows quickly, but unfortunately a nonclustered index may not contain all the required columns. SQL Server then faces two choices: access the base table directly, or find the rows through the nonclustered index and then visit the base table to fetch the columns the index does not cover. The choice depends on statistics, such as the estimated number of rows, and the query optimizer picks whichever costs less.
A simple bookmark lookup is shown in Figure 5.
Figure 5. A simple bookmark lookup
As you can see from Figure 5, the desired rows are first found through the nonclustered index. Because that index does not contain all the required columns, those columns must then be fetched from the base table; this is the key lookup. If the base table is organized as a heap, the key lookup becomes a RID lookup instead. Key lookups and RID lookups are collectively called bookmark lookups.
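The two-step nature of a bookmark lookup can be sketched in Python (a simplified model with made-up tables and column names; the dicts stand in for the base table and the nonclustered index): the index maps a key to row locators, and the uncovered columns are then fetched from the base table, one lookup per row.

```python
# Base table keyed by a row locator (clustered key or RID).
base_table = {
    1: {"AddressID": 1, "City": "Seattle", "StateProvinceID": 79},
    2: {"AddressID": 2, "City": "Bothell", "StateProvinceID": 79},
    3: {"AddressID": 3, "City": "Dallas",  "StateProvinceID": 73},
}

# Nonclustered index on StateProvinceID: key -> list of row locators.
nc_index = {79: [1, 2], 73: [3]}

def seek_with_bookmark_lookup(state_id):
    """Step 1: seek the nonclustered index for the row locators.
    Step 2: one lookup per row into the base table to fetch the
    columns the index does not cover (City here)."""
    locators = nc_index.get(state_id, [])
    return [base_table[loc]["City"] for loc in locators]

print(seek_with_bookmark_lookup(79))
```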
However, when the index seek returns many rows, a bookmark lookup can perform far worse than a straight scan, in which case SQL Server chooses the scan instead of the bookmark lookup, as shown in Figure 6.
Figure 6. The StateProvinceID column has a nonclustered index, but because too many rows are returned, the optimizer chooses a scan instead of a bookmark lookup
This estimate is based on statistics. For more on statistics, see my earlier post: Talking about the impact of statistics on queries in SQL Server.
Aggregation Operations
Aggregate functions give rise to aggregation operations. Aggregation condenses a set of rows into a single row according to some rule, or into several rows grouped by that rule. Aggregate functions such as AVG, SUM, and MIN, as well as the DISTINCT keyword, can result in two types of aggregation operators: stream aggregation and hash aggregation.
Stream Aggregation
Before a stream aggregation applies the aggregate function, the input set must already be sorted. The ordered input can come either from an explicit Sort operator in the plan or directly from a clustered or nonclustered index. In addition, an aggregation without a GROUP BY clause is called a scalar aggregate, and this type of operation is always performed as a stream aggregation.
For example, Figure 7 shows a scalar aggregation performed directly.
Figure 7. Stream aggregation
For queries with a GROUP BY clause, however, a Sort is needed to guarantee the rows are ordered by the GROUP BY columns. Note that the Sort operation is memory-intensive and spills to tempdb when there is not enough memory. SQL Server always picks whichever of the sorted stream aggregation and the hash match costs less. An operation that requires a Sort is shown in Figure 8.
Figure 8. A stream aggregation that requires a sort
In Figure 8, the Sort operator orders the rows by ProductLine, and the stream aggregation then aggregates each group in turn.
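The logic of a stream aggregation over sorted input can be sketched in Python (an illustrative model, not SQL Server code; SUM is used as the example aggregate): because equal keys arrive together, each group can be emitted as soon as the key changes, holding only one running total in memory.

```python
def stream_aggregate(sorted_rows):
    """Aggregate (key, value) rows that are already sorted by key.
    Emits (key, sum) for each group as soon as the key changes."""
    results, current_key, total = [], None, 0
    for key, value in sorted_rows:
        if key != current_key:
            if current_key is not None:
                results.append((current_key, total))  # group finished: emit it
            current_key, total = key, 0
        total += value
    if current_key is not None:
        results.append((current_key, total))          # emit the last group
    return results

# Input sorted on a ProductLine-like grouping key.
rows = [("M", 10), ("M", 5), ("R", 7), ("S", 1), ("S", 2)]
print(stream_aggregate(rows))
```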
Hash Aggregation
The stream aggregation above suits smaller inputs; for a relatively large table, a hash aggregation can cost less than sorting. A hash aggregation implements the aggregate by building a hash table in memory, so the input set does not need to be sorted. The hash table built in memory uses the GROUP BY columns as its key, as shown in Figure 9.
Figure 9. Hash aggregation
The hash table is built in memory as the rows are processed: each row in the input set is examined in turn with its GROUP BY value as the key. When the key does not yet exist in the hash table, a new entry is added; when the key already exists, the entry's value is updated according to the rule, which is the aggregate function (SUM, AVG, and so on).
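That procedure can be sketched in Python (a dict stands in for the in-memory hash table, and SUM is the example aggregate rule):

```python
def hash_aggregate(rows):
    """Aggregate (key, value) rows arriving in any order by building a
    hash table keyed on the GROUP BY column; no sort is required."""
    hash_table = {}
    for key, value in rows:
        if key not in hash_table:
            hash_table[key] = value    # first time this key appears: new entry
        else:
            hash_table[key] += value   # key exists: apply the aggregate rule (SUM here)
    return hash_table

# Unsorted input is fine for a hash aggregation.
rows = [("R", 7), ("M", 10), ("S", 1), ("M", 5), ("S", 2)]
print(hash_aggregate(rows))
```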
Joins
When multiple tables are joined (bookmark lookups and index joins count as well), SQL Server uses three different join types: the nested loops join, the merge join, and the hash join. None of them is simply better than the others; each suits a particular scenario.
Nested Loops Join
A simple nested loops join can be seen in Figure 10.
Figure 10. An example of a nested loops join
The nested loops join icon is also vivid. The upper, outer input here is a clustered index scan, and the lower, inner input is a clustered index seek. The outer input is executed only once, while the inner input is sought once for each outer row that satisfies the join condition. Because the outer input here has 290 rows, the inner input is executed 290 times.
You can see this in the Properties window, as shown in Figure 11:
Figure 11. Number of executions of the inner input
From the principle of nested loops it is not hard to see that, because the outer input is scanned and the inner input is sought, the query optimizer tends to choose a nested loops join when one of the two joined tables yields a relatively small outer result set while the table accessed as the inner input is large.
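The algorithm can be sketched in Python (a simplified model: the inner input's "index seek" is a dict lookup, and a counter shows the inner input being executed once per outer row):

```python
def nested_loops_join(outer_rows, inner_index):
    """For each outer row, seek the inner input on the join key.
    Returns the joined rows and the number of inner-input executions."""
    results, inner_executions = [], 0
    for key, outer_val in outer_rows:        # outer input: read once
        inner_executions += 1                # inner input: once per outer row
        for inner_val in inner_index.get(key, []):
            results.append((key, outer_val, inner_val))
    return results, inner_executions

outer = [(1, "a"), (2, "b"), (3, "c")]
inner = {1: ["x"], 3: ["y", "z"]}            # join key -> inner rows (index-like structure)
rows, execs = nested_loops_join(outer, inner)
print(rows, execs)
```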
Merge Join
Unlike the nested loops join, a merge join reads each input only once. From this perspective, a merge join can be much faster than nested loops. Let's look at a typical merge join, as shown in Figure 12.
Figure 12. Merge join
From the principle of the merge join it is not hard to see that it requires both inputs to be sorted, and the join condition to be an equality. Because both inputs are ordered, one row can be taken from each input and compared: equal rows are returned, and the lesser row is discarded so the next one can be fetched. From here it is also easy to see why merge joins are only allowed for equality joins. The icon in Figure 12 conveys this principle.
If either input is unordered, the query optimizer will not choose a merge join on its own, but we can force one with a join hint; to make that work, the execution plan must add a Sort step to provide the required order, as shown in Figure 13.
Figure 13. Implementing the merge join by sorting
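The single-pass merge over two sorted inputs can be sketched in Python (an illustrative model that assumes join keys are unique within each input; handling duplicates on both sides requires extra buffering):

```python
def merge_join(left, right):
    """Equality join of two inputs sorted by key, each read exactly once.
    Assumes keys are unique within each input."""
    results, i, j = [], 0, 0
    while i < len(left) and j < len(right):
        lk, lv = left[i]
        rk, rv = right[j]
        if lk == rk:
            results.append((lk, lv, rv))  # equal keys: return the joined row
            i += 1
            j += 1
        elif lk < rk:
            i += 1                        # discard the lesser row, fetch the next
        else:
            j += 1
    return results

left  = [(1, "a"), (2, "b"), (4, "d")]    # both inputs sorted by join key
right = [(2, "x"), (3, "y"), (4, "z")]
print(merge_join(left, right))
```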
Hash Join
A hash join also reads each of its two inputs only once. It is implemented by building a hash table in memory, so it consumes more memory and spills to tempdb if there is not enough, but unlike the merge join it does not require the inputs to be sorted. A typical hash join is shown in Figure 14.
Figure 14. Hash join
Here I removed the clustered index on Customer; otherwise, with two sorted inputs, SQL Server would have chosen a lower-cost merge join. SQL Server uses the upper input to build the hash table and probes it with the lower input; you can see this information in the Properties window, as shown in Figure 15.
Figure 15. Hash key build and hash key probe
In general, a hash match is used when both inputs are large and the required data is not sorted on one or both sides.
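The build-and-probe procedure can be sketched in Python (an illustrative model: one input builds the hash table, the other probes it, and neither needs to be sorted):

```python
def hash_join(build_rows, probe_rows):
    """Equality join: build a hash table from one input, then probe it
    once per row of the other input. Neither input needs to be sorted."""
    hash_table = {}
    for key, val in build_rows:                  # build phase: hash the first input
        hash_table.setdefault(key, []).append(val)
    results = []
    for key, val in probe_rows:                  # probe phase: look up each row of the second
        for build_val in hash_table.get(key, []):
            results.append((key, build_val, val))
    return results

build = [(3, "c"), (1, "a"), (2, "b")]           # unsorted inputs are fine
probe = [(2, "x"), (3, "y"), (5, "w")]
print(hash_join(build, probe))
```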
Parallelism
When multiple tables are joined, SQL Server also allows the query to run in parallel on multiple CPUs or cores, which undoubtedly improves efficiency. A parallel example is shown in Figure 16.
Figure 16. Parallelism improving efficiency
Summary
This article briefly described the principles behind the most common operations in SQL Server execution plans; understanding these operations and their principles is a basic skill for query optimization.