SQL Server statement Optimization

Last Update:2018-12-03 Source: Internet

Author: User

Tags sql server query

Developer on Alibaba Coud: Build your first app with APIs, SDKs, and tutorials on the Alibaba Cloud. Read more ＞

Core tips: About the execution plan
Select Show execution from the query menu of the SQL Server Query analyzer.
Plan: run the SQL query statement. The results window contains grid and execution.
Three tabs: plan and messages. The execution plan in graphical form is in the order from right to left, which is also the execution order. Each icon in the execution plan represents an operation

Description of the Execution Plan

Select Show execution from the query menu of the SQL Server Query analyzer.
Plan: run the SQL query statement. The results window contains grid and execution.
Three tabs: plan and messages. The execution plan in graphical form is in the order from right to left, which is also the execution order. Each icon in the execution plan represents an operation, and each operation will
There is one or more inputs, and one or more outputs. Input and Output may be a physical data table, index data structure, or intermediate result set/Data Structure during execution. Rat
Move to the icon to display the specific information of the operation, such as the logical and physical operation name, the number and size of records, I/O cost, CPU cost, and the specific expression of the operation (Parameter
Argument ). Move the cursor over the connection arrow to display the number of records and the size of records in the Operation output result set at the start of the arrow. Generally, this output result set can be understood as the end of the arrow.
Input
.

In addition, some supplementary instructions on the Execution Plan: 1.
The information displayed in the execution plan is an "evaluation" result, not 100% accurate information. For example, the number of records is taken from the statistical information, the I/O and CPU costs come from the basis of the Execution Plan generation process.
Statistical information and other evaluation results. 2. The execution plan is not necessarily accurate.
The statistical information accuracy maintained by the server, on the other hand, the SQL statement Compilation Time may be different from the execution time environment (memory usage, CPU status, etc.

I/O cost and CPU cost evaluation, SQL statement compilation and execution process are not discussed here. In addition, although the execution plan is not necessarily accurate, it is still the most important basis for SQL statement analysis, because you can understand that in most cases, SQL Server is executed in this way.

Join method description

In the database, how do SQL statements such as tablea inner join tableb and tablea left out join tableb execute the join operation? Which algorithm does SQL server use to join two tables?

SQL Server 2000 has three methods: nested loop, merge, and hash. Oracle also uses these three methods.

1. nested loop join

1.1 example SQL

Select... from tablea inner join tableb on tablea. col1 = tableb. col1 where tablea. col2 =? And tableb. col2 =?

No index is created in tablea, and a primary key (clustered index) is created on col1 ).

1.2 description of pseudo-code of the Algorithm

foreach rowA in tableA where tableA.col2=? { search rowsB from tableB where tableB.col1=rowA.col1 and tableB.col2=? ; if(rowsB.Count<=0) discard rowA ; else output rowA and rowsB ; }

The Join Operation has two inputs. In the preceding example, tablea is the outer input used for the outer loop, and tableb is the inner input used for the internal loop. The following describes the specific steps for SQL Server to complete this operation for the execution plan.

1.3 move the execution plan method to the beginning of the article.

1.4 procedure

The following figure shows the execution plan of the example SQL statement. On the Right of the nested loop operation, outer input is located on the top and inner is located on the bottom.
Input. You cannot determine the outer input and inner Based on the table in the join operation.
The input relationship must be determined from the execution plan, because the SQL server will automatically select which one is used as the inner input.

A) Perform table scan on tablea. The input of this operation is the data in Table A. The data is stored on the disk and loaded to the memory during the operation. The output is the outer input of the qualified record set, which will be used as B. In this operation, tablea. col1 =? .

B) Execute the nested loop operation described in the preceding pseudocode. Perform Step C for each output record in ).

C) Perform the clustered index seek operation on tableb. This operation is performed in the nested
The input is the clustered index data of Table B. It uses tableb. col1 = Rowa. col1 and tableb. col2 =? These two items
And select the matching results from the clustered index of tableb.

D) construct the returned result set. From the output of the nested loop, sort out the specified fields in the SELECT statement to construct the final output result set.

1.5 advanced description

The preceding example uses clustered indexes for inner input. The following describes non-clustered indexes to enhance the understanding and analysis capabilities of the execution plan.

Modify the primary key on table B col1 to a non-clustering mode. The execution plan of the SQL statement in the example is as follows:

The preceding steps a), B), and C are the same as those in 1.4. Note that step C is used to perform the index seek operation.
Index
Seek is different. The root node of the clustered index is each actual data record, and the root node of the non-clustered index is a reference to the root-node key value of the clustered index (if the table has a clustered index ), or the actual data
Reference of the recorded rowid (a table without clustered indexes, which is called a heap table ). Clustered Index
After seek is executed, the actual physical data records have been loaded into the memory, while the index
After the seek operation, the actual physical data records are not loaded, but only the root node data of non-clustered indexes. Only the index field data and the referenced clustered index key value or rowid are included.
In this step, SQL server uses the index field value in the root node data of the non-clustered index
The record (ROWA) associated fields in the input are matched to determine whether the results meet the conditions. If yes, the data structure of the root node of the non-clustered index is saved to nested.
In the output data structure of the loop operation, a bookmark is created, indicating that reference data needs to be obtained based on the bookmark when necessary.

D) execute the bookmark lookup operation. The output of the nested loop operation is a memory data structure. Before you sort the output result set of the entire query statement from the memory data structure, you need to handle the bookmark reference problem, the bookmark lookup operation is to obtain detailed record data based on the clustered index key value or rowid referenced in the bookmarks.

E) filter. Review the previous operations and perform the nested
During loop, only non-clustered index fields (tableb. col1) and outer
Tableb. col2 =? This operation uses tableb. col2 =? Pair
The output of bookmark lookup is filtered.

The person who looks at it carefully may have a few questions after coming here, 1. tablea. col2 =? Why is there no filter operation? 2.
Why didn't the filter operation appear in 1.4? The statement is as follows: 1. Table is executed on tablea.
Scan is to scan the actual data. During this scan, tablea. col2 =? This condition is filtered to avoid an additional filter operation.
Move the cursor to the table scan operation. tablea. col2 =? is displayed in the prompt parameter (argument? Has been applied. 2.
As mentioned above, the root node of the clustered index is the actual data record, and the clustered index is executed.
Seek eventually scanned the actual data records. In this process, tableb. col2 =? This condition also avoids an additional filter operation. This is
The reason for the filter operation is not found in 1.4.

F) construct the returned result set. Same as step d in step 2.

1.6 nested loop usage conditions

If any Join Operation meets the nested loop usage conditions, the SQL server will evaluate the cost of the nested loop (I/O cost, CPU cost, etc.) during the query optimization process, determine whether to use this join method based on the evaluation results.

The condition for using the nested loop method is: a) the number of records in the outer input is not large, preferably below 1000-2000. Generally, it is hard to say that the number of records exceeds 3000. The nested loop is basically not selected. B) as an inner input table, there are indexes that can be used for this query.

This is because the number of outer input records is small, which means that the number of Outer Loops is relatively small. The inner input has available indexes, which means to search inner in the loop.
When there are matching records in the input table, the efficiency is very high, even if the inner input table actually contains millions of records. Based on these two conditions, nested
Loop execution efficiency is very high. In the three join modes, memory and CPU consumption are the least (except for the improper forced designation of the nested loop method ).

Note: The number of records in outer input does not indicate the number of records in outer input.
The actual number of records in the input table. For example, in the example SQL, if tablea maintains statistics on col2 (the index of col2 exists or the statistical information maintained separately), and
Tablea. col2 =? The condition value of is in the form of Sarg (searchable parameters ).
The server can use statistical information and condition values to evaluate the number of records that meet the condition, and query the results that meet the condition tablea. col2 =? Is used for the outer loop. Inner
The input table has an available index, which is used by the inner input table
The fields associated with the input table (one or more fields) can hit an index (some or all of these fields appear before an index field ).

It does not mean that SQL Server 100% selects the nested loop. Because SQL
The server query optimizer is based on cost evaluation. If the cost evaluated by other solutions is better than this
Server selects other join methods. For example, if the index that meets the conditions on inner input is a non-clustered index
The server may need an additional bookmark lookup operation to obtain the actual record data.
The input table has a large amount of data and a high degree of index fragmentation, which may lead to a very high bookmark lookup cost.
Server will try to evaluate and select other join schemes.

1.7 force specify the nested loop method

If you use the loop keyword, for example, tablea inner loop join tableb, SQL Server is forced to use the nested loop method to perform this join operation. Or use the option, for example, tablea inner join tableb option (loop join)

The nested loop algorithm has its applicability. The efficiency within this range is the highest. Beyond this range, the efficiency is very poor. Unless you are very sure, do not forcibly specify the join method.

Next we will not go into detail like above.

2. Merge join

Merge
The first step of join is to ensure that the two joined tables are sorted by the joined fields. If the associated fields have available indexes and the sorting is consistent, you can directly perform merge
Join Operation; otherwise, SQL Server needs to sort the joined tables by the joined fields (that is, in the merge
On the two inputs before join, you may need to execute an sort operation before performing merge join ).

After the two tables are sorted by the joined fields, the merge join operation starts to match one record in each table. If the Join Operation meets the joined conditions, it is placed in the result set. Otherwise, discard a record with a small associated field value, and retrieve the next record from the corresponding table of this record to continue matching until the entire cycle ends.

When performing merge join on many-to-many joined tables, temporary tables are usually required. For example, a join B uses merge
When a join operation is performed, if a group of values of a joined field exist in both A and B, there are multiple records A1, a2... an, B1, B2... bn, the A1,
A2..... an, all equal records B1, B2.. bn must be matched once in B. In this way, the pointer needs to be moved from B1 multiple times
BN. Each time, the corresponding B1. .. bn record needs to be read. Pre-read B1.. BN records into a temporary memory table, which is faster than reading from the original data page or disk.

The merge join operation itself is very fast, but sorting before merge join may be quite time-consuming (SQL
The server consumes the most memory and CPU operations. One is big data sorting and the other is Big Data hash operations. These operations refer to the sort and hash operations in the query plan, such
Hash join, distinct operations implemented using the hash algorithm, and so on, instead of referring to order in your SQL
By keyword), especially for record sets with a large amount of data, resulting in a very high query cost for using merge join. For tables with a large data volume, if merge
Join joined fields can use clustered indexes, merge
Join is one of the fastest join methods. Therefore, the optimization scheme is to design the association relationship and the index structure of the table at the table structure design level. SQL statements make full use of the index to minimize merge
Sort operations before join to reduce bookmark lookup operations.

In general, if the nested loop condition cannot be met, the evaluation of the merge join method will be considered. Merge
The choice of join is mainly to consider the data size of two inputs and whether the associated fields can hit the index. For example, tablea join
Tableb, associated fields can hit indexes in both tables. If the data volume exceeds the Selection Range of the nested loop, merge will be considered.
Join method. Of course, if the amount of data in tablea and tableb is too large, resulting in an excessively high cost, it will give up merge join and evaluate hash join.

Use Inner merge join or option (merge join) to force the merge join method.

3. Hash join

For details about the hash algorithm, you can view some data about the data structure. Hash
Algorithms are mainly used for searching large amounts of data. To avoid matching all the data in the same way as merge join
Hash function, first give the data to be searched based on the hash
Key creates a hash value as an index. During the search, the hash value is first located in a small search range, and then the matching results in this range are searched to Improve the efficiency.

SQL Server uses a table with a small amount of data as build input. Try to use the hash constructed based on build input.
The table can be fully stored in the memory, so that the matching operation in the probe stage is fully performed in the memory. Such a hash join is called in-memory hash.
Join.

If the number of build input records is very large and the created hash table cannot be stored in the memory, SQL Server
Input and probe input are partitioned into multiple partitions. Each partition includes an independent and paired build.
Input and probe input. In this way, a large hash join is split into multiple independent hash joins that do not affect each other.
Join can be completed in the memory. SQL Server saves the split partition file on the disk and loads the build of a partition each time.
Input and probe input are in the memory for a hash join. This hash join is called Grace hash.
The grace hash join algorithm used for join.

With the hash Join Operation of big data, standard external merge sorts, multiple merge
Levels, multiple partitioning steps, multiple partitioning levels, SQL
Server may also use algorithms such as recursive hash join or other optimization methods.

Hash
Join is generally used for operations on a large amount of data. For example, if the data of a table in the join operation reaches a certain level or cannot be loaded to the memory at a time, if your joined field cannot be killed in both join tables
The index is also processed using hash join. Therefore, in general, hahs
The join processing cost is very high. It is one of the top killer of the memory and CPU of the database server, especially when it involves partitions (a large amount of data leads to insufficient memory or high concurrent access leads to the current processing line
If the data volume is not huge, partitions may be required.) to complete all the partition steps as soon as possible, a large number of asynchronous I/O operations will be used, therefore, a single
The thread may cause multiple disk drives to be busy, which may block the execution of other threads.

Use Inner hash join or option (hash join) to force the hash join method.

Suggestion:

All three join methods have two inputs. Basic Principles of optimization: 1. Avoid hash of big data
Join, convert it into efficient merge join and nested loop as much as possible
Join. Possible means include table structure design, index adjustment design, SQL optimization, and business design optimization. For example, when redundant fields are used, the statistical analysis results are periodically run to the static state using the service.
For tables, appropriate redundant tables are synchronously updated using AOP or similar mechanisms. 2.
Minimize the data volume at the two inputs of join. This is often because the condition does not comply with the Sarg (a lot of superb skills can be used in this case) and the internal conditions of subqueries are insufficient.
(When SQL statements are too complex, the SQL Server query optimizer is often silly. The conditions written outside the subquery are not used in subqueries.
Internal query, which affects the internal efficiency of the subquery or the efficiency of join with the subquery ). In addition, the design and business end try to limit the data size of these two inputs.

Supplement: SQL Server 2005

There are some differences in the display of the execution plan, but the main part or principle is similar and there will be no deviation. In the preceding example, when a non-clustered index is used on table B, the execution plan of SQL Server 2005 is shown as follows:

One major difference is that the bookmark lookup operation under SQL Server 2000 is displayed as an RID lookup operation under SQL Server 2005.
+ A nested
The implementation of the loops operation is actually quite understandable. It can be said that the execution plan is more reasonable. As soon as you see this operation, you will know that it uses a loop mechanism to obtain the real data from tableb.
Data.

Another point is that after you move the cursor over the execution plan icon, some changes will be prompted. For example, 2005 will display the output list of each operation (output
List), and in my previous article, I basically use the word "Output Data Structure" to express it. By viewing the output list, you can better understand the RID
The reason for the existence of operations such as Lookup (bookmark lookup.

Finally, you can save the execution plan displayed in the graph in 2005. You can enable it later and view and analyze it in a graphical way. This is not possible under 2000, 2000 only execution plan text can be saved. These small functions are very useful for analyzing SQL Performance and more intuitive analysis on the graphic interface.

This article is an English version of an article which is originally in the Chinese language on aliyun.com and is provided for information purposes only. This website makes no representation or warranty of any kind, either expressed or implied, as to the accuracy, completeness ownership or reliability of the article or any translations thereof. If you have any concerns or complaints relating to the article, please send an email, providing a detailed description of the concern or complaint, to info-contact@alibabacloud.com. A staff member will contact you within 5 working days. Once verified, infringing content will be removed immediately.

A Free Trial That Lets You Build Big!

Start building with 50+ products and up to 12 months usage for Elastic Compute Service

Get Started for Free

Sales Support

1 on 1 presale consultation

Chat Contact Sales
After-Sales Support

24/7 Technical Support 6 Free Tickets per Quarter Faster Response

Open a Ticket
Alibaba Cloud offers highly flexible support services tailored to meet your exact needs.

Learn More