SQL Server query plan

Developer on Alibaba Coud: Build your first app with APIs, SDKs, and tutorials on the Alibaba Cloud. Read more ＞

For a slow query, we usually need to know which steps are expensive first, and then try some improvement methods. In general, if you cannot solve the problem by improving the hardware performance or adjusting the OS or SQL server settings, the remaining options are usually the following:
1. Add the index of the corresponding field for the "scan" operation.
2. Sometimes re-indexing may also be effective. For details, refer to the following document.
3. Adjust the statement structure to guide sqlserver to use other query solutions for execution.
4. Adjust the table structure (by table or partition ).

How to query records on SQL Server

Speaking of this, I have to say that SQL Server indexes are used. Sqlserver has two types of indexes: clustered index and non-clustered index. The difference between the two is that [clustered Index] directly determines the storage location of records, or you can directly obtain records based on Clustered indexes. [Non-clustered Index] stores two pieces of information: 1. The value of the corresponding index field, 2. records the location of the corresponding clustered index (if the table does not have a clustered index, the record pointer is saved ). Therefore, if you can use [clustered Index] To find records, it is obviously the fastest.

SQL Server provides the following methods to find the data records you need:
1. [Table scan]: traverses the entire table and searches for matched record rows. This operation will check one row at a time. Of course, the efficiency is also the worst.
2. [index scan]: according to the index, some records are filtered out from the table, and then the matched record rows are searched. The display range is smaller than that of the first method, therefore, it is faster than [Table scan.
3. [index seek]: locate (obtain) the location where the record is stored Based on the index and then obtain the record. Therefore, it is faster than the first two methods.
4. Clustered index scan: Same as table scan. Note: Do not think that there is an index here, so it is different. In fact, it means that each row of records is scanned row by clustered index, because records are stored in the order of clustered index. [Table scan] Only says: the table to be scanned has no clustered index, so these two operations are essentially the same.
5. [clustered index seek]: Obtain records directly based on the clustered index, the fastest!

Therefore, when you find that a query is slow, you can first check which operations are expensive, and then check which operations are to find records, whether it is table scan or clustered index scan. If it is related to the two operation types, you need to add an index to solve the problem. However, adding an index will also affect the modification of the data table, because the index of the corresponding field must be updated when the data table is modified. Therefore, too many indexes may affect the performance. Another case is that it is not suitable for adding an index: the status of a field expressed as 0 or 1. For example, if the vast majority is 1, adding an index is meaningless. In this case, only the values 0 or 1 can be considered for separate storage. Table sharding or partition is a good choice.

If you cannot solve this problem by adding indexes and adjusting tables, you can try to adjust the statement structure and guide sqlserver to use other query solutions for execution. Requirements for this method: 1. The functions to be completed by the statement are clear; 2. The structure of the data table to be queried is clear; 3. The business background is clear. If this method can be used to solve the problem, it is also a good solution. However, sometimes sqlserver is intelligent, and even if you adjust the statement structure, it will not affect its execution plan.

How can we compare the performance of two statements with the same function? I suggest two methods: 1. directly place the two query statements in [SQL Server Management Studio] and view their [execution plan]. sqlserver will tell you the [query overhead] of the two queries in percentage ]. This method is simple and can be used as a reference. However, it is sometimes inaccurate. For the specific reason, proceed to the next step (the index statistics may be too old ).
2. according to the actual program call, write the corresponding test code to call: This method is more troublesome, but it is more representative of the actual call situation, and the results are more valuable for reference, therefore, it is worthwhile.

SQL Server join

In SQL Server, each join command runs in three more specific ways:

1. [nested loops join ],If one join input is small, and the other join input is large and an index has been created in its join column, the index nested loops connection is the fastest join operation.Because they require the least I/O and comparison.

Nested loop join is also known as "nested iteration". It uses a join input as an external input table (displayed as the top input in the graphic Execution Plan) and another join input as an internal (bottom) input table. External loops process external input tables row by row. The internal loop is executed for each external row and searches for matching rows in the internal input table. The following pseudo code can be used for understanding:
Foreach (row R1 in outer table) foreach (row R2 in inner table) if (R1, R2 meet the matching conditions) output (R1, R2 );
The simplest case is to scan the entire table or index during a search. This is called "nested loop join ". If an index is used for search, it is called "nested index loop join ". If an index is generated as part of a query plan (and the index is damaged immediately after the query is completed), it is called "temporary index nested loop join ". The query optimizer considers all these different situations.

Nested loop join is particularly effective if the external input is small and the internal input is large and the index is created in advance.In many small transactions (such as those that only affect a small group of rows), index nested loop joins are better than merge joins and hash joins. In large queries, nested loop join is usually not the best choice.

2. [merge join ],If the join inputs are not small but sorted in the join columns (for example, if they are obtained by scanning the sorted indexes), the merge join operation is the fastest join operation. If both join inputs are large and the sizes of these two inputs are similar, the performance of the pre-sorted merge join is similar to that of the hash join. However, if the sizes of the two inputs differ greatly, the hash join operation is usually much faster.

Merge join requires that both inputs are sorted on the merge columns, and the merge columns are defined by the equivalent (on) clause of the join predicates. Generally, the query optimizer scans the index (if an index exists in an appropriate group of columns), or places a sort operator under the merged join. In rare cases, although there may be multiple equivalent clauses, only some of the available equivalent clauses can be used to obtain the merged columns.

Since each input is sorted, the merge join operator obtains and compares a row from each input. For example, for an inner join operation, if the rows are equal, return. If the rows are not equal, the rows with a smaller value are discarded and the other row is obtained from the input. This process repeats until all rows are processed.

The merge join operation can be a regular operation or multiple-to-multiple operation. Use temporary tables to store rows for multiple-to-multiple join operations (which may affect the efficiency ). If there are repeated values in each input, when processing each repeated item in one of the input, the other input must be rewound to the start position of the repeated item. You can create a unique index to tell sqlserver that no duplicate value exists.

If a resident predicate exists, all rows that meet the merging predicate will take the value of the resident predicate, and only the rows that satisfy the resident predicate will be returned.

The merge join operation is fast, but it takes a lot of time to select a merge join if you want to sort the join operation. However, if the data volume is large and pre-ordered data can be obtained from the existing B-tree index, the merge join is usually the fastest available join algorithm.

3. [hash join], hash join can effectively process unordered large non-index input. They are useful for intermediate results of complex queries, because: 1. the intermediate results are not indexed (unless they have been explicitly saved to the disk and then created for the index) and are generally not sorted properly for the next operation in the query plan. 2. the query optimizer only estimates the size of intermediate results. For complex queries, the estimation may have a large error. Therefore, if the intermediate result is much larger than expected, the algorithm used to process the intermediate result must be effective and moderately weakened.

Hash join can reduce the use of non-standardization. Non-Standardization generally improves performance by reducing join operations, despite the risk of redundancy (such as inconsistent updates ). Hash join reduces the need for non-standardization. Hash join enables vertical partitioning (represented by several groups of columns in a single table using a separate file or index) to become a viable option for physical database design.

Hash join has two types of input: generate input and test input. The query optimizer assigns these roles so that the smaller two inputs are used as the generated input.

Hash join is used for Multiple matching operations: Internal join; left Outer Join, right Outer Join and full outer join; left half join and right half join; intersection; Union and difference. In addition, some deformation of the hash join can be deleted and grouped repeatedly, such as sum (salary) group by department. These modifications only use one input for the production and test roles.

Hash join is divided into three types: Hash join in memory, Grace hash join, and recursive hash join.

Hash join in memory: Hash join scans or computes the entire generated input, and then generates a hash table in memory. Insert each row into a hash bucket based on the hash value of the calculated hash key. If the entire generated input is smaller than the available memory, all rows can be inserted into the hash table. The generation phase is the detection phase. Scan or compute the entire test input in one row, calculate the hash key value for each test row, scan the corresponding hash bucket, and generate matching items.

Grace hash join: If the generated input is greater than the memory size, hash join is performed in several steps. This is called "Grace hash join ". Each step is divided into the generation and detection phases. First, consume the entire generation and test input and partition It (using the hash function on the hash key) into multiple files. The hash function can be used to ensure that any two join records must be in the same file pair. Therefore, connecting two large input tasks is simplified to multiple smaller instances of the same task. Then, the hash join is applied to each pair of partition files.

Recursive hash join: If the generated input is so large that the standard external merging input requires multiple merging levels, multiple partition steps and multiple partition levels are required. If only some partitions are large, you only need to use the additional partition steps for those partitions. To make all the partition steps as fast as possible, a large asynchronous I/O operation will be used so that a single thread can make multiple disk drives busy.

During the optimization process, you cannot always determine which hash join to use. Therefore, at the beginning of SQL Server, the hash join in the memory is used, and then gradually converted to the grace hash join and recursive hash join Based on the generated input size.

If the optimizer incorrectly predicts which of the two inputs is smaller and thus determines which one is used as the generated input, the generated and probe roles will be dynamically reversed. Hash join ensures that a small overflow file is used as the input. This technology is called "role reversal ". After at least one file overflows to the disk, role inversion occurs only in the hash join.

Note: You can also explicitly specify the connection mode. sqlserver will try its best to respect your choice. For example, you can write: inner loop join, left outer merge join, and inner hash join.

However, I recommend that you do not do this because SQL Server has the correct choice. If you do not believe it, try it.

This article is an English version of an article which is originally in the Chinese language on aliyun.com and is provided for information purposes only. This website makes no representation or warranty of any kind, either expressed or implied, as to the accuracy, completeness ownership or reliability of the article or any translations thereof. If you have any concerns or complaints relating to the article, please send an email, providing a detailed description of the concern or complaint, to info-contact@alibabacloud.com. A staff member will contact you within 5 working days. Once verified, infringing content will be removed immediately.

A Free Trial That Lets You Build Big!

Start building with 50+ products and up to 12 months usage for Elastic Compute Service

Get Started for Free

Sales Support

1 on 1 presale consultation

Chat Contact Sales
After-Sales Support

24/7 Technical Support 6 Free Tickets per Quarter Faster Response

Open a Ticket
Alibaba Cloud offers highly flexible support services tailored to meet your exact needs.

Learn More