Getting Started with the execution engine in SQL Server

Source: Internet
Author: User
Tags joins scalar

Introduction

When the query Optimizer parses the T-SQL statement and chooses the most low-consumption execution plan from the execution plan, the execution is carried out by the execution engine (execution engine). This article is intended to classify information about each of the operations in the execution plan.

Data Access Operations

The first and most basic operation is to access the data. This can be done either by directly accessing the table or by accessing the index. The data in the table is organized into heap (heap) and B-tree, where the data is organized by the heap when there is no clustered index in the table, and this is unordered, and the data in the table is organized in the form of a B-tree, which is stored in an orderly manner. Typically, a nonclustered index contains only a subset of the entire table, and for filtered indexes, it contains only a subset of the rows.

Apart from the way data is organized, access to data is also divided into two ways, scan and lookup (Seek), which scans all the data in the entire structure, and finds only some of the data in the entire structure. Therefore, it can be seen that because the heap is unordered, it is impossible to find (seek) operations on the heap, and relative to the B-tree order, making it possible to find in the B-tree. When data access is made to a table that is organized in a heap, a heap scan is performed, as shown in 1.

Figure 1: Table scan

As you can see, the icon of the table scan clearly indicates the nature of the table scan and scans it from beginning to end in an unordered organization table.

The clustered index and nonclustered index of the B-tree structure can also be scanned, generally, in order to get all the data in the index table or to get the index row tree accounted for most of the data to make the scan less than the cost of the search, the clustered index scan. As shown in 2.

Figure 2: Clustered Index Scan

The icon of the clustered index Scan can also clearly indicate the nature of the clustered index scan, and after finding the leftmost leaf node, scan all the leaf nodes in turn to achieve the function of scanning the whole structure. Of course, the same concept is true for nonclustered indexes, as shown in 3.

Figure 3: Scanning of nonclustered indexes

For the selection of only part of the data in the B-tree structure, index lookup (seek) makes the B-tree meaningful. Depending on the key value you are looking for, you can take a single path down from just the root of B, eliminating the need to scan for unnecessary pages, and Figure 4 is an index lookup in the query plan.

Figure 4: Clustered index Lookup

The index lookup icon is also very expressive, you can see the icon of the line from the root node all the way down to the leaf node. That is to find the page of the data, it is not difficult to see, if we need to find multiple data and scattered in different pages, this lookup operation needs to be repeated a lot of times, when the number is large enough, SQL Server chooses to consume a lower index scan than to repeat the index lookup. For nonclustered index lookups, the concept is the same, and it is no longer a piece.

Bookmark Lookup (Bookmark lookup)

You might think that if a nonclustered index can quickly find the data you're looking for, it's a pity that the nonclustered index doesn't include what you want to do with it. SQL Server then faces two choices, accesses the base table directly to fetch the data, or finds the data in a nonclustered index, then goes to the base table to get the columns that the nonclustered index does not overwrite. This choice depends on statistics such as the estimated number of rows. The Query Analyzer chooses the one that consumes less.

A simple bookmark is found in 5.

Figure 5. A Simple bookmark lookup

As can be seen from Figure 5, the desired row is first found through a nonclustered index, but the index does not contain all the columns, so it is also necessary to find the columns in the base table, so the key lookup, if the base table is organized as a heap, then the key lookup becomes the RID lookup (RID Lookup), the key lookup and the RID lookup are collectively known as bookmark lookups.

However, sometimes the number of rows returned by an index lookup makes the performance of the bookmark lookup far less than the direct scan operation, so SQL Server chooses to scan instead of bookmark lookup. As shown in 6.

Figure 6. The StateProvinceID column has a nonclustered index, but because the number of rows returned is too large, the parser chooses to scan instead of the bookmark lookup

This estimate is based on statistical information, for statistical information, you can read my previous blog: Talking about the effect of statistics on queries in SQL Server

aggregation Operations (Aggregation)

Aggregation functions can cause aggregation operations. An aggregate function is the process of aggregating data from a collection into 1 data according to a rule, or by grouping them into multiple data based on a rule. Some aggregation functions, such as avg,sum,min, and the DISTINCT keyword can lead to two types of aggregation operations: Stream Aggregation (stream Aggregation) and hash aggregation (hash Aggregation).

Stream Aggregation (Stream Aggregation)

Before a stream aggregation needs to execute an aggregate function, the aggregated data collection is ordered, and the ordered data can be obtained either directly from a clustered or nonclustered index, or directly to the ordered data, and the aggregate operation without group by is made a scalar aggregation. This type of operation must be performed by a stream aggregation.

For example, we do scalar aggregation directly, as shown in 7.

Figure 7. Stream aggregation

But for a GROUP BY clause, because the data needs to be ordered by the column following group by, sort is required to guarantee the sort. Note that the sort operation is a memory-intensive operation that consumes tempdb when there is not enough memory. SQL Server always chooses the lowest cost in the sort operation and hash matching. An operation that requires sort 8 is shown.

Figure 8: Stream aggregation that needs to be sorted

The sort operations in Figure 8 are sorted by ProductLine and then aggregated according to their respective groupings.

Hash aggregation (hash aggregation)

The above stream aggregation is suitable for less data, but for a relatively large table. The cost of using the hash collection is lower than the sort. The hash collection implements aggregations by creating a hash table in memory, so there is no need to sort the data collection. The hash list established in memory is shown as the key value in the column following group by, 9.

Figure 9. Hash aggregation

After the hash list is established in memory, the value following the group by is used as the key, then each data in the collection is processed sequentially, and when the key does not exist in the hash list, an entry is added to the hash table, and when the key is already present in the hash table, the rule is the aggregate function, such as SUM, Avg something) calculates the value in the hash list (value).

connection (join)

When multiple tables are connected (bookmark lookup, connection between indexes), SQL Server takes three different types of connections: loop nested joins (Nested Loops join), merge joins (merge join), Hash joins (hash join). These types of connections are not better than others, but each of these connections adapts to a particular scenario.

Looping nested joins (Nested Loops join)

A simple loop nesting connection can be seen from Figure 10.

Figure 10. An instance of a loop nested connection

The icon for looping nested connections is also very expressive, in the above external input (Outer input), which is the clustered index scan. And in the following internal input (Inner input), here is the clustered index lookup. The external input is executed only once, and the internal input is looked up based on each row of the external input that satisfies the join condition. This is done 290 times for the internal input because it is 290 rows.

You can see. 11 in the Properties window:

Figure 11: Number of executions of internal inputs

According to the principle of nested loops it is not difficult to see that because the external input is scanned, the internal input is lookup, when the two join table outside the input result set is relatively small, and the internal input to find the table is very large, the query optimizer prefers to choose the loop nesting method.

Merging joins (merge join)

Unlike loop nesting, a merge connection is a single access from each table. From this point of view, merging connections is much faster than looping nesting. Let's look at a typical merge connection, as shown in 12.

Figure 12: Merge Connection

From the principle of merging joins it is not difficult to imagine that merging connections first requires both parties to be orderly. And the condition of the join is equal to the number. Since two input conditions are already in order, it is not difficult to see why the merge join is only allowed to be equal to the join after a row is compared from each input set, equal returns, and unequal discards. We can see this principle from the icon in Figure 11.

If both sides of the input data are unordered, the Query Analyzer does not select the merge connection, and we can also force the merge connection with the index hint, in order to achieve this, the execution plan must be ordered with a sort step, as shown in 13.

Figure 13: Sorting to implement merge Join

Hash connection (hash join)

Hash joins also only need to access data from both sides of the 1 times. Hash connections are implemented by establishing a hash table in memory. This consumes more memory and consumes tempdb if memory is low. But it's not like merging connections that requires both sides to be orderly. A typical hash join of 14 is shown.

Figure 14. Hash connection

Here I removed the clustered index for Costomer, otherwise two sequential input SQL Server chooses a lower cost merge connection. SQL Server uses two of the above inputs to generate a hash table, the following input to probe, which can be seen in the Properties window, as shown in 15.

Figure 15: Hash key generation and hash key detection

Typically, a hash match is used when the two input data is large, and the data is reached when one or both of the conditions are not sorted.

Parallel

when multiple tables are connected, SQL Server also allows query parallelism in the case of multiple CPUs or multicore, which undoubtedly improves efficiency, as shown in a parallel example of 16.

Figure 16: Parallel increase in efficiency

Summary

This article provides a brief introduction to the most common operational principles of SQL Server execution planning, and understanding these steps and principles is the basic function of optimizing queries.

Getting Started with the execution engine in SQL Server

Related Article

Contact Us

The content source of this page is from Internet, which doesn't represent Alibaba Cloud's opinion; products and services mentioned on that page don't have any relationship with Alibaba Cloud. If the content of the page makes you feel confusing, please write us an email, we will handle the problem within 5 days after receiving your email.

If you find any instances of plagiarism from the community, please send an email to: info-contact@alibabacloud.com and provide relevant evidence. A staff member will contact you within 5 working days.

A Free Trial That Lets You Build Big!

Start building with 50+ products and up to 12 months usage for Elastic Compute Service

  • Sales Support

    1 on 1 presale consultation

  • After-Sales Support

    24/7 Technical Support 6 Free Tickets per Quarter Faster Response

  • Alibaba Cloud offers highly flexible support services tailored to meet your exact needs.