Query optimization is one of the most common tasks in SQL Server tuning. Since database optimization is a broad topic in its own right, this article covers only one part of it: how to read a SQL Server query plan while optimizing a query. My knowledge of SQL Server is limited, so if you find a mistake, please point it out.
First, open SQL Server Management Studio and enter a query statement to see how SQL Server displays the query plan.
Note: the database used throughout this article is one I prepared for a demo program; it can be downloaded from this web page.
SELECT v.OrderID, v.CustomerID, v.CustomerName, v.OrderDate, v.SumMoney, v.Finished
FROM dbo.OrdersView AS v
WHERE v.OrderDate >= '2010-12-01' AND v.OrderDate < '2011-12-01';
Here OrdersView is a view, defined as follows:
CREATE VIEW dbo.OrdersView
AS
SELECT dbo.Orders.OrderID, dbo.Orders.CustomerID,
       ISNULL(dbo.Customers.CustomerName, N'') AS CustomerName,
       dbo.Orders.OrderDate, dbo.Orders.SumMoney, dbo.Orders.Finished
FROM dbo.Orders LEFT OUTER JOIN
     dbo.Customers ON dbo.Orders.CustomerID = dbo.Customers.CustomerID;
For the query above, SQL Server produces the following query plan (click the "Display Estimated Execution Plan" button on the toolbar):
From this diagram we can get at least three useful pieces of information:
1. Which steps cost the most. Clearly, the two steps on the right are relatively expensive.
2. Which steps produce the most data. The execution plan shows the row count flowing out of each step as the thickness of the connecting line, so the heavy producers are easy to spot.
3. What kind of operation each step performs.
For a slow query, we usually need to know which steps are expensive before we can try to improve things. In general, if the problem cannot be solved by upgrading hardware or adjusting the settings of the OS or SQL Server, the remaining options are usually the following:
1. Add indexes on the columns involved in operations such as "scan" (a sketch follows this list).
2. Sometimes rebuilding the indexes can also help; see the discussion later in this article.
3. Restructure the statement to steer SQL Server toward a different query plan.
4. Adjust the table structure (split the table, or partition it).
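As a minimal sketch of option 1, assuming the Orders table from the demo database and its unindexed OrderDate column (the index name is an assumption):

-- Lets the date-range query shown earlier use a seek instead of a scan.
CREATE NONCLUSTERED INDEX IX_Orders_OrderDate
ON dbo.Orders (OrderDate);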
Below is some important background theory that helps in understanding execution plans.
How SQL Server looks for records
This brings us to SQL Server indexes. SQL Server has two kinds of indexes: clustered and nonclustered. The difference is that a "clustered index" directly determines where a record is stored, so a record can be retrieved directly from the clustered index. A nonclustered index holds two pieces of information: 1. the value of the indexed column(s); 2. the location of the corresponding clustered-index key (or, if the table has no clustered index, a pointer to the row). Therefore, finding a record through a clustered index is obviously the fastest path.
SQL Server can find the records you need in the following ways:
1. "Table Scan": traverse the entire table to find all matching record rows. This operation will be checked on one line and, of course, the least efficient.
2. "Index Scan": According to the index, filter out some of the records from the table, and then find all the matching records row, obviously than the first way to find a smaller, so than "table Scan" faster.
3. "Index Seek": According to the index, locate (obtain) the location of the record, and then get the record, therefore, compared to the first two ways faster.
4. "Clustered Index Scan": Same as "Table Scan". Note: Do not think that there is a index, it is not the same. In fact, it means that each row of records is scanned by a clustered index, because the records are stored sequentially in the clustered index. and "Table Scan" just said: The table to be scanned does not have a clustered index, so these two operations are essentially the same.
5. "Clustered Index Seek": directly according to the clustered index to obtain records, the fastest!
So when you find that a query is slow, first check which steps are expensive, then see whether the record lookup is a "Table Scan" or a "Clustered Index Scan"; if it is, consider adding an index. However, adding an index also affects modifications, because every change to the table must also update the indexes on the affected columns, so too many indexes can hurt write performance. There are also cases where an index does not help: for example, a column that only holds 0 or 1, where most rows hold 1. Indexing such a column is essentially pointless; the only real option is to store the 0 rows and the 1 rows separately, and splitting or partitioning the table is a good way to do that.
If indexes and table adjustments cannot solve the problem, try restructuring the statement to steer SQL Server toward a different query plan. This requires that: 1. what the statement is supposed to do is clear; 2. the structure of the tables being queried is clear; 3. the relevant business background is clear. If the problem can be solved this way, all the better. Sometimes, however, SQL Server is smart enough that adjusting the statement structure will not change its execution plan at all.
How do you compare the performance of two SQL statements with the same functionality? I recommend two ways: 1. Put the two statements side by side in SQL Server Management Studio and look at their execution plans; SQL Server shows the query cost of each as a percentage of the batch. This is simple and usually a good reference, but it is sometimes inaccurate (for example, when the index statistics are out of date); the reasons are discussed further below.
2. Write test code that calls the statements the way the real program does. This is more trouble, but it better represents the actual invocation pattern and the results carry more weight, so it is worth doing.
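For the first approach, a simple way to get harder numbers in SSMS is to turn on I/O and timing statistics before running the two candidates (the second candidate below is a hypothetical rewrite of the first, shown only to illustrate the harness):

SET STATISTICS IO ON;
SET STATISTICS TIME ON;

-- Candidate 1: query through the view.
SELECT v.OrderID FROM dbo.OrdersView AS v
WHERE v.OrderDate >= '2010-12-01' AND v.OrderDate < '2011-12-01';

-- Candidate 2: query the base table directly.
SELECT o.OrderID FROM dbo.Orders AS o
WHERE o.OrderDate >= '2010-12-01' AND o.OrderDate < '2011-12-01';

SET STATISTICS IO OFF;
SET STATISTICS TIME OFF;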
SQL Server join methods
In SQL Server, every JOIN in a statement is carried out internally by one of three physical join algorithms:
1. "Nested Loops join", the index Nested Loops connection is the fastest join operation if one join is small and the other join is entered very large and the index is created on its join column, because they require the least I/O and the comparison.
A nested loops join, also called nested iteration, uses one join input as the outer input table (shown as the top input in a graphical execution plan) and the other as the inner (bottom) input table. The outer loop processes the outer input table row by row. For each outer row, the inner loop runs and searches the inner input table for matching rows. The following pseudocode may help:
foreach (row R1 in outer table)
    foreach (row R2 in inner table)
        if (R1 and R2 satisfy the join predicate)
            output(R1, R2);
In the simplest case, the whole table or index is scanned during the search; this is called a naive nested loops join. If an index is used for the search, it is called an index nested loops join. If the index is built as part of the query plan (and destroyed as soon as the query completes), it is called a temporary index nested loops join. The query optimizer considers all of these variants.
Nested loops joins are especially effective when the outer input is small and the inner input is large and indexed in advance. In many small transactions, such as those that touch only a small set of rows, an indexed nested loops join beats both merge joins and hash joins. In large queries, however, nested loops joins are often not the best choice.
2. "Merge join": if the two join inputs are not small but are already sorted on their join columns (for example, obtained by scanning sorted indexes), a merge join is the fastest join operation. If both inputs are large and roughly the same size, a pre-sorted merge join performs about as well as a hash join; if the two inputs differ greatly in size, though, a hash join is usually much faster.
A merge join requires both inputs to be sorted on the merge columns, which are defined by the equality (ON) clauses of the join predicate. Typically the query optimizer scans an index, if one exists on the appropriate set of columns, or places a sort operator below the merge join. In rare cases there may be several equality clauses, but only some of them are used as the merge columns.
Because both inputs are sorted, the Merge Join operator fetches one row from each input and compares them. For an inner join, for example, the rows are returned if they are equal; if they are not, the smaller row is discarded and the next row is fetched from that input. This repeats until all rows have been processed.
A merge join can be a one-to-many operation or a many-to-many operation. A many-to-many merge join uses a temporary table to store rows, which hurts efficiency: whenever one input contains duplicate values, the other input must be rewound to the start of the duplicates while each duplicate is processed. If one of the inputs is known to contain no duplicates, you can create a unique index to tell SQL Server so, as sketched below.
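A minimal sketch; the OrderNumber column and the index name are hypothetical, not part of the demo database:

-- Declaring uniqueness lets the optimizer avoid a many-to-many merge join.
CREATE UNIQUE NONCLUSTERED INDEX UX_Orders_OrderNumber
ON dbo.Orders (OrderNumber);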
If a residual predicate is present (conditions beyond the merge predicate), it is evaluated for every row that satisfies the merge predicate, and only the rows that satisfy it are returned.
The merge join itself is fast, but choosing a merge join can be expensive if a sort operation is required first. However, if the data volume is large and the sorted inputs can be obtained from existing B-tree indexes, a merge join is often the fastest join algorithm available.
3. "Hash join", hash join can effectively handle unsorted large non-index input. They are useful for the intermediate results of complex queries, because: 1. Intermediate results are not indexed (unless they are explicitly saved to disk and then indexed), and are typically not sorted appropriately for the next action in the query plan. 2. The query optimizer estimates only the size of intermediate results. Since there may be large errors in the estimation of complex queries, if the intermediate results are much larger than expected, the algorithms for processing intermediate results must be effective and moderately weakened.
Hash joins reduce the need for denormalization. Denormalization usually buys performance by reducing join operations, at the risk of redundancy (such as inconsistent updates); because hash joins lower the cost of joins, they reduce that need. They also make vertical partitioning (representing groups of columns of a single table in separate files or indexes) a more viable option for physical database design.
A hash join has two inputs: the build input and the probe input. The query optimizer assigns these roles so that the smaller of the two inputs becomes the build input.
Hash joins are used for many kinds of set-matching operations: inner join; left, right, and full outer join; left and right semi-join; intersection; union; and difference. Moreover, variants of the hash join can perform duplicate removal and grouping, such as SUM(Salary) GROUP BY Department; these variants use a single input for both the build and probe roles.
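For example, a grouping query like the following may appear in the plan as a hash aggregate when its input is large and unsorted (the Employees table here is hypothetical, not part of the demo database):

-- Likely executed as a hash aggregate on a large, unsorted input.
SELECT Department, SUM(Salary) AS TotalSalary
FROM dbo.Employees
GROUP BY Department;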
Hash joins come in three flavors: in-memory hash join, grace hash join, and recursive hash join.
In-memory hash join: the hash join first scans or computes the entire build input and builds a hash table in memory, inserting each row into a hash bucket according to the hash value computed for its hash key. If the entire build input fits in the available memory, all rows can be inserted into the hash table. This build phase is followed by the probe phase: the entire probe input is scanned or computed one row at a time, the hash value of each probe row is computed, the corresponding bucket is scanned, and matches are produced.
Grace hash join: if the build input does not fit in memory, the hash join proceeds in steps; this is known as a grace hash join. Each step has a build phase and a probe phase. First, the entire build and probe inputs are consumed and partitioned into multiple files, using a hash function on the hash keys. Using a hash function on the keys guarantees that any two joining rows must land in the same pair of files, so the task of joining two large inputs is reduced to multiple smaller instances of the same task. The hash join is then applied to each pair of partition files.
Recursive hash join: if the build input is so large that a standard external merge would require multiple merge levels, then multiple partitioning steps and multiple partitioning levels are required. If only some of the partitions are oversized, additional partitioning steps are applied only to those. To keep all partitioning steps as fast as possible, large asynchronous I/O operations are used so that a single thread can keep several disk drives busy.
It is not always possible to determine during optimization which kind of hash join will be used. Therefore, SQL Server starts with an in-memory hash join and gradually falls back to a grace hash join and then a recursive hash join depending on the actual size of the build input.
If the optimizer mispredicts which of the two inputs is smaller, and therefore which should be the build input, the build and probe roles are swapped dynamically. A hash join always makes sure the smaller overflow file is used as the build input. This technique is called "role reversal"; it occurs in a hash join only after at least one file has spilled to disk.
Note: you can also specify the join algorithm explicitly, and SQL Server will respect your choice as far as possible. The syntax looks like: INNER LOOP JOIN, LEFT OUTER MERGE JOIN, INNER HASH JOIN.
However, I recommend that you not do this, because SQL Server's choice is almost always right; try it yourself if you don't believe me.
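A minimal sketch of the hint syntax, using the demo Orders and Customers tables (the selected column list is an assumption):

-- Forces a hash join for this statement; omitting HASH lets the optimizer choose.
SELECT o.OrderID, c.CustomerName
FROM dbo.Orders AS o
INNER HASH JOIN dbo.Customers AS c
    ON o.CustomerID = c.CustomerID;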
That's enough theory; let's move on to a practical example.
A more detailed view of the execution process
Earlier I showed a picture of SQL Server's execution plan for a query, but it shows limited detail. Of course, you can hover the mouse pointer over a node, and the following information appears:
Since my installation is the Chinese edition, the tooltip above is in Chinese characters, so I won't dwell on it. Instead, I will show another way to view the execution process, one that contains more information than the graphical plan and reflects the actual execution. (You can, of course, keep using the graphical method: click the "Include Actual Execution Plan" button on the toolbar before running the query.)
Let's return to SQL Server Management Studio, enter the following statements, and execute them.
SET STATISTICS PROFILE ON;

SELECT v.OrderID, v.CustomerID, v.CustomerName, v.OrderDate, v.SumMoney, v.Finished
FROM dbo.OrdersView AS v
WHERE v.OrderDate >= '2010-12-01' AND v.OrderDate < '2011-12-01';

Note: "SET STATISTICS PROFILE ON" has now been added before the query. The results are as follows:
As you can see from the picture, executing the query now returns two grids: the upper one is the query's result set, and the lower one shows the execution process. Compared with the first picture in this article this view may be less visually friendly, but it conveys more information, and for complex queries it is arguably easier to read, because the graphical plan of a complex query has so many steps that it becomes too large to take in. Also, the first two columns of this execution-process grid hold especially valuable numbers.
Let's look at this execution-process grid; I'll pick out a few important columns.
"Rows": the number of rows actually produced by this execution step (actual data, not an estimate).
"Executes": the number of times this execution step was executed (actual data, not an estimate).
"StmtText": a description of this step.
"EstimateRows": the number of rows this step was estimated to return.
In this "Execute Process table", I think the first three columns are more important for optimizing queries. For the first two columns, I also explained that the meaning is clear. The first two columns also roughly reflect the cost of those steps, and you should keep an eye out for slower queries. "Stmt Text" will tell you what each step does. For this table, it is actually a tree-type information (one row represents a node in the graph), so I suggest reading them from the bottom. As an example, let me explain the execution of this form as it is expressed.
Row 5: "Clustered Index Seek(OBJECT:([MyNorthwind].[dbo].[Customers].[PK_Customers]), SEEK:([MyNorthwind].[dbo].[Customers].[CustomerID]=[MyNorthwind].[dbo].[Orders].[CustomerID]) ORDERED FORWARD)" means SQL Server performs a seek on the Customers table, using a "Clustered Index Seek" on the index PK_Customers, with the seek value coming from [Orders].[CustomerID].
Row 4: "Clustered Index Scan(OBJECT:([MyNorthwind].[dbo].[Orders].[PK_Orders]), WHERE:([MyNorthwind].[dbo].[Orders].[OrderDate]>='2010-12-01 00:00:00.000' AND [MyNorthwind].[dbo].[Orders].[OrderDate]<'2011-12-01 00:00:00.000'))" means SQL Server scans the Orders table. This is the worst case, equivalent to a table scan: there is no index on the OrderDate column, so it has no other option.
Row 3: "Nested Loops(Left Outer Join, OUTER REFERENCES:([MyNorthwind].[dbo].[Orders].[CustomerID]))" means SQL Server joins the data produced by rows 4 and 5 using a "Nested Loops" join, with Orders as the outer table; the matching operation of the join is the seek described in row 5.
Row 2: "Compute Scalar(DEFINE:([Expr1006]=isnull([MyNorthwind].[dbo].[Customers].[CustomerName],N'')))" means an ISNULL() function call is evaluated; for the reason, see the view definition given earlier in this article.
Row 1: "SELECT [v].[OrderID],[v].[CustomerID],[v].[CustomerName],[v].[OrderDate],[v].[SumMoney],[v].[Finished] FROM [OrdersView] [v] WHERE [v].[OrderDate]>=@1 AND [v].[OrderDate]<@2". Row 1 is usually the entire query itself, representing its return value.
Index statistics: the basis for choosing a query plan
The previous "Implementation Plan", since it is a plan, indicates that it will be able to determine the operational options before implementation. So how does SQL Server choose an execution plan? How does SQL Server know when to index or which index to use? For SQL Server, whenever you want to execute a query, you first check that the execution plan for the query exists in the cache, and if not, generate an execution plan that, when it comes to the execution plan, is not to see which indexes are available (randomly chosen), but rather to refer to what is called "index statistics" "Data. If you take a closer look at the previous execution plan or execute the process table, you will find that SQL Server can estimate the amount of data generated by each step, precisely because SQL Server can estimate the amount of data that SQL Server can choose the method that it deems most appropriate to execute the query process, at which point " Index statistics to tell SQL Server this information. Here, you are not a little curious, in order to give you a sense of "index statistics", we look at "index statistics" is what it looks like. In SQL Server Management Studio, enter the following statement, and then execute.
DBCC SHOW_STATISTICS (Products, IX_CategoryID);
The results obtained are as follows:
First, the command itself: "DBCC SHOW_STATISTICS" displays the index statistics we want to inspect; it takes two parameters: 1. the table name, 2. the index name.
Now look at the output of the command, which consists of three tables:
1. The first table lists the key attributes of this set of index statistics:
Name: the name of the statistics.
Updated: the date and time the statistics were last updated.
Rows: the number of rows in the table.
Rows Sampled: the number of rows sampled for the statistics.
Steps: the number of groups the data is divided into; these correspond to the third table.
Density: the selectivity of the first index column prefix (excluding EQ_ROWS).
Average key length: the average length of all the index columns.
String Index: if YES, the statistics include a string summary index to support estimating result-set sizes for LIKE conditions; applies only to leading columns of type char, varchar, nchar, nvarchar, varchar(max), nvarchar(max), text, and ntext.
2. The second table lists the selectivity of the various column combinations; the smaller the value, the less duplication and the higher the selectivity:
All density: the selectivity of each index column prefix set (including EQ_ROWS). Note: the smaller the value, the higher the selectivity. If the value is below 0.1, the index is highly selective; otherwise its selectivity is poor.
Average length: the average length of the index column prefix set.
Columns: the names of the index column prefix for which All density and Average length are shown.
3. The third table is a histogram of the data distribution; this is what SQL Server relies on to estimate the row counts of execution steps:
RANGE_HI_KEY: the upper bound value of each group.
RANGE_ROWS: the estimated number of rows inside each group, excluding the upper bound.
EQ_ROWS: the estimated number of rows equal to the upper bound.
To help you understand this data, especially the third table, look at the following figure:
When I filled in the test data, I deliberately spread CategoryID over the values 1 through 8 (10 was added temporarily later), with each group getting 78 rows. So the numbers in the third table of the index statistics are correct, and it is on the basis of these statistics that SQL Server can estimate the amount of data in each execution step, which in turn influences choices such as the join algorithm. Of course, when choosing a join algorithm it also refers to the selectivity of the columns in the second table. When SQL Server builds an execution plan for a query, the query optimizer uses these statistics, together with the available indexes, to evaluate the cost of each alternative and pick the best plan.
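To see the connection concretely, a sketch against the demo Products table: for a query like the one below, the optimizer should take its EstimateRows value from EQ_ROWS of the histogram step whose RANGE_HI_KEY matches.

-- With 78 rows per CategoryID group, EstimateRows for this predicate
-- should come out near 78.
SELECT * FROM dbo.Products WHERE CategoryID = 5;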
Another example illustrates how important statistics are to the query plan. First, add a lot more data with the following code:
DECLARE @newCategoryId int;
INSERT INTO dbo.Categories (CategoryName) VALUES (N'Test statistics');
SET @newCategoryId = SCOPE_IDENTITY();

DECLARE @count int;
SET @count = 0;
WHILE (@count < 100000)
BEGIN
    -- Column list and the literal 100 are assumed; adjust to the actual Products schema.
    INSERT INTO dbo.Products (ProductName, CategoryID, Unit, UnitPrice, Quantity, Remark)
    VALUES (CAST(NEWID() AS nvarchar(50)), @newCategoryId, N'A', 100, @count + 1, N'');
    SET @count = @count + 1;
END
GO

UPDATE STATISTICS Products;
GO
Now look at the index statistics again:
Then look at the same query again: because the parameter values differ, SQL Server chooses different execution plans:
WHERE p.CategoryID = 26;   -- 26 is the newly added CategoryID, so this query returns 100,000 rows
WHERE p.CategoryID = 6;    -- this query returns 95 rows
As you can see from the figures above, SQL Server chooses completely different execution plans for different CategoryID parameter values. The importance of statistics is evident here.
After statistics are created, the database engine sorts the values of the column on which the statistics are built and creates a histogram from them (with at most 200 steps, separated by interval boundaries). The histogram specifies how many rows exactly match each boundary value, how many rows fall within each interval, and the density, or rate of duplication, of the values in each interval.
SQL Server 2005 collects additional information for statistics created on char, varchar, varchar(max), nchar, nvarchar, nvarchar(max), text, and ntext columns. This information, called a string summary, helps the query optimizer estimate the selectivity of string-pattern predicates. When a query contains a LIKE condition, the string summary allows a more accurate estimate of the result-set size and continual refinement of the query plan. These include conditions such as WHERE ProductName LIKE '%Bike' and WHERE Name LIKE '[CS]heryl'.
Since "index statistics" is so important, when will it be generated or updated? In fact, "index statistics" are not maintained by hand, and SQL Server automatically maintains them. And there is also a parameter in SQL Server to control this update:
How the automatic statistics features work
When an index is created, the query optimizer automatically stores statistics about the indexed columns. In addition, when the AUTO_CREATE_STATISTICS database option is set to ON (the default), the database engine automatically creates statistics for non-indexed columns that appear in predicates.
As the data in a column changes, its index and column statistics can become stale, causing the query optimizer to make less-than-optimal choices. For example, if you create a table with one indexed column and 1,000 rows of data, each with a unique value in that column, the query optimizer regards the indexed column as a good way to locate query data. If the data is later updated so the column contains many duplicate values, the column is no longer an ideal query candidate; but the query optimizer, still relying on the outdated distribution statistics for the index (based on the pre-update data), continues to treat it as one.
When the AUTO_UPDATE_STATISTICS database option is set to ON (the default), the query optimizer automatically updates statistics periodically as the data in a table changes. An update is triggered whenever statistics used in a query's execution plan fail a test against the current data. Sampling is performed on random data pages, drawn from the table or from the smallest nonclustered index on the columns the statistics require. Once a data page has been read from disk, all rows on it are used to update the statistics. As a rule of thumb, statistics are updated after roughly 20% of the rows have changed; however, the query optimizer always keeps the number of sampled rows as small as possible. For tables smaller than 8 MB, a full scan is always performed to gather the statistics.
Sampling the data, rather than analyzing all of it, minimizes the cost of automatic statistics updates. In some cases, however, sampling fails to capture the precise characteristics of the data in the table. You can use the SAMPLE and FULLSCAN clauses of the UPDATE STATISTICS statement to control how much data is sampled when updating statistics manually, table by table: FULLSCAN scans all data in the table to gather the statistics, while SAMPLE specifies either a percentage or an absolute number of rows to sample, as sketched below.
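A minimal sketch against the demo Products table and the IX_CategoryID index used earlier:

-- Read every row to rebuild the statistics exactly.
UPDATE STATISTICS dbo.Products WITH FULLSCAN;

-- Or update just one index's statistics from a 50 percent sample.
UPDATE STATISTICS dbo.Products (IX_CategoryID) WITH SAMPLE 50 PERCENT;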
In SQL Server 2005, the database option AUTO_UPDATE_STATISTICS_ASYNC provides asynchronous statistics updating. When this option is set to ON, queries do not wait for the statistics update before compiling; the out-of-date statistics are placed on a queue and updated by a worker thread in a background process. The query, and any other concurrent queries, compile immediately using the existing out-of-date statistics. Query response time becomes predictable because there is no delay waiting for updated statistics, but the stale statistics may cause the query optimizer to choose an inefficient plan. Queries that start after the updated statistics are ready will use them; this may cause cached plans to be recompiled (where they depend on the older statistics). Asynchronous statistics cannot be updated if a data definition language (DDL) statement (for example, CREATE, ALTER, or DROP) appears in the same explicit user transaction.
The AUTO_UPDATE_STATISTICS_ASYNC option is set at the database level and determines the update mode for all statistics in the database. It applies only to updating statistics; it cannot be used to create statistics asynchronously. Setting it to ON has effect only if AUTO_UPDATE_STATISTICS is also set to ON. By default, AUTO_UPDATE_STATISTICS_ASYNC is OFF.
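These options are set per database with ALTER DATABASE; a sketch, assuming the demo database is named MyNorthwind as in the plan output shown earlier:

-- Both options are database-wide settings.
ALTER DATABASE MyNorthwind SET AUTO_UPDATE_STATISTICS ON;
ALTER DATABASE MyNorthwind SET AUTO_UPDATE_STATISTICS_ASYNC ON;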
From the description above we can see that, for large tables, statistics may not be updated in a timely fashion, and when that happens the query optimizer's judgment can suffer.
Some people have had this experience: a query is slow, they rebuild the index, and the problem goes away. That can actually make sense. When a query suddenly slows down, the cause may be a lagging statistics update that is misleading the query optimizer; rebuilding the index at that point gives the optimizer an up-to-date picture of the data distribution and can avoid the problem. Remember the execution-process grid I displayed with "SET STATISTICS PROFILE ON"? It shows both the actual and the estimated row counts for each step. So before rebuilding an index, you can use "SET STATISTICS PROFILE ON" to check whether the actual and estimated row counts diverge significantly; if they do, consider manually updating the statistics and trying again.
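Both remedies, sketched against the demo Products table; note that rebuilding an index also refreshes that index's statistics with the equivalent of a full scan:

-- Rebuild all indexes on Products (their statistics are refreshed as a side effect).
ALTER INDEX ALL ON dbo.Products REBUILD;

-- Or refresh out-of-date statistics across the current database.
EXEC sp_updatestats;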
Optimizing view queries
Now let's talk about optimizing view queries. Although a view is also defined by a query statement and is essentially a query, optimizing it differs from optimizing an ordinary query. The main difference is that, although a view is defined by a query statement, analyzing that definition in isolation may not mean much, because a view is rarely used directly: before use, a WHERE clause is almost always added, or the view is placed in the FROM clause of another statement. Take the OrdersView view in my demo database, whose definition was shown earlier. Let's see what execution plan results if we use the view directly:
As you can see, SQL Server performs a full scan of the Orders table, which should be quite inefficient. Now look at the following query:
The execution plan shows it is not the same as the one above. The previous query located rows in the Orders table with a "Clustered Index Scan"; this one uses a "Clustered Index Seek", and the cost percentages of the two right-hand steps have changed as well. This is enough to show that when optimizing a view, it is best to apply the filtering conditions of the actual use cases first, and then decide how to optimize.
Now let's look at the execution plans of this view under a set of three queries:
SELECT * FROM dbo.OrdersView WHERE OrderID = 1;
SELECT * FROM dbo.OrdersView WHERE CustomerID = 1;
SELECT * FROM dbo.OrdersView WHERE OrderDate >= '2010-12-01' AND OrderDate < '2011-12-01';
Clearly, for the same view, the execution plans differ markedly under different filtering conditions.
Recommended reading - MSDN articles
Index statistics: http://msdn.microsoft.com/zh-cn/library/ms190397(sql.90).aspx
Query optimization recommendations: http://msdn.microsoft.com/zh-cn/library/ms188722(sql.90).aspx
Checklist for analyzing slow-running queries: http://msdn.microsoft.com/zh-cn/library/ms177500(sql.90).aspx
Logical and physical operators reference: http://msdn.microsoft.com/zh-cn/library/ms191158(sql.90).aspx