Objective
In the previous article we looked at how to view the query plan, this article will introduce the analysis techniques in the query plan we looked at, as well as several of our commonly used operator optimization techniques, the same emphasis on the master of basic knowledge.
In this article, we can understand how we normally write T-SQL statements, how they are decomposed in SQL Server database systems, and how data results are organized by individual operators.
Technical preparation
Based on the SQL SERVER2008R2 version, a more concise case library (Northwind) from Microsoft was used for parsing.
One, the data connection
Data connections are the most common when we write T-SQL statements, and we get the data we want by associating two of tables.
SQL Server supports three physical connection operators by default: Nested loop joins, merge connections, and hash joins. Three kinds of connection each have the use, each has the characteristic, the different scene database will choose the best connection way for us.
A, nested loops join (nested loops join)
Nested loops are the simplest and most basic way to connect. Two tables are associated with a keyword, and then the rows of two tables are associated by a double loop, followed by a keyword filter.
Can be referenced for understanding analysis
In fact, nested scanning is a very simple way to get data, the simple point is the two-layer loop filter out the result value.
We can deepen our understanding with the following code
For each row R1 in the outer table for each row R2 int the inner table if R1 join with R2 return (R1,R2)
Give me an example.
SELECT o.orderidfrom Customers C JOIN Orders oon c.customerid=o.customeridwhere c.city=n ' London '
The above icon is a nested loop-linked icon. And the explanation is clear.
The consumption of this method is the product of the appearance and the inner table, in fact, we call the Cartesian product. So the size of the consumption is increased with the data volume of two tables, especially the internal table, because it is repeated scanning multiple times, so in practice we take measures to reduce each appearance or the number of rows in the table to reduce consumption.
For this algorithm there is a way to improve the performance, because the two tables are related by the keyword, so at the time of the query for the underlying data acquisition speed directly related to the performance of this algorithm, here optimize the way to use two table keyword as index query, improve query speed.
Another point is that in nested loops, when two tables are associated, there are filters for appearances, such as "WHERE c.city=n '" in the example above is a filter on the appearance (Customers), and here the City column has an index in the table. Therefore, the two subqueries of the statement are index lookups (index Seek).
However, in some cases our query conditions are not covered by the index, and at this point the sub-operators under the nested loop join become index Scans (index Scan) or RID lookups.
As an example,
SELECT E1. Employeeid,count (*) from Employees E1 JOIN Employees e2on E1. Hiredate>e2. Hiredategroup by E1. EmployeeID
The above code is obtained from the staff table of each employee before the number of personnel. Let's take a look at the execution plan for the query
It is obvious here that the association of two tables is carried out by the HireDate column, and this column is not covered by the index entry, so the two tables can only be obtained through the full table of the clustered index scan, if the two table data volume is particularly large, it is undoubtedly a very performance-consuming query.
As can be seen from the text, the query result of the T-SQL is obtained by filtering the Cartesian product formed after the full table scan of two tables in the nested loop operator. This is not an optimal approach, because we can actually get the results by filtering through two tables and then getting the results by nesting the loop operators, so that performance can be much better.
Let's try to change this statement.
SELECT E1. Employeeid,ecnt. CNT from Employees E1 cross APPLY ( SELECT COUNT (*) CNT from Employees E2 WHERE E1. Hiredate<e2. HireDate) ecnt
The result item queried by the above code is the same as above, except that we filter the internal table based on the results of the external table, so that we do not need to get all the data items.
We look at the text execution plan
We compare the execution consumption of the two statements before and after, and compare the execution efficiency.
The execution time is reduced from 1 seconds to 179 milliseconds to 93 milliseconds. The effect is obvious.
Compared to CPU consumption, memory, compile time and other overall consumption has been reduced, reference.
Therefore, the optimization of nested loops connection is focused on these points: the reduction of the data volume of two tables, the indexing on the connection key, and the overriding index on the predicate query condition is best to reduce the number of records that conform to the predicate condition.
B. Merge Join
There are a number of problems with the nested loop connection mentioned above, especially if the two tables are large tables, because it produces n multiple full-table scans, it is obvious that this approach can seriously consume resources.
For these reasons, another way of connecting is provided in the database: Merge connection. Remember that this is not what SQL Server provides, because this connection algorithm is a connection algorithm used by all the RDBMS in the market.
A merge connection is a row of two tables that is read in turn to compare. If two rows are the same, output a connected row and continue reading on the next line. If the rows are not the same, discard the less of the two inputs and continue to read, until the end of the row scan of a table in two tables, the execution is complete, so the algorithm executes only one scan per table and does not require the entire table to be scanned to stop.
The algorithm requires two tables to scan and contrast sequentially, but there are two prerequisites: 1, the corresponding column of two tables must be sorted beforehand, 2, the conditions for merging two tables must have an equivalent connection.
We can use the following code to understand
Get first row R1 from Input1get first row R2 from Input2while to the end of either Inputbegin if R1 joins with r2< C1/>begin Output (R1,R2) get next row R2 from Input2 end Else if R1<R2 get next row R1 from Input1
else Get next row R2 from Input2end
The total consumption of the merge Join operator is proportional to the number of rows in the input table and is read at most once for the table, which is not the same as a nested loop connection. Therefore, the join operation for a large table is a good choice for a merge connection.
For a merge connection, you can improve performance from the following points:
- The Connection Value content column type between two tables, if the associated column in both tables is a unique column, there is no duplicate value, this association performance is best, or there is a table with a unique column, which is associated with a one-to-many association, which is also our most commonly used, than we often use the Master-Slave Table Association query If there are duplicate values for the associated columns in the two tables, the third table is required to stage the duplicate values when the two tables are associated, and the third table is called "worktable", which is stored in tempdb or in memory, and this performance is affected. So in view of this, we often do the optimization method is: The association to use clustered index (uniqueness) as far as possible
- We know that this algorithm assumes that both tables are sorted, so when we apply it, it is preferable to use the sorted table Association. If there is no sort, also the associated item to select is the index overlay, because the sorting of the large table is a resource-intensive process, and we choose to sort the index overlay columns much better than the regular column sort.
Let's give an example.
SELECT o.customerid,c.customerid,c.contactname from Orders O JOIN Customers CON O.customerid=c.customerid
We know that the associated item in this T-SQL statement is CustomerID, and this column is the primary key clustered index, which is unique and ordered, so there is no sort operation displayed.
and all output result items that use a merge join are already sorted.
We're looking for a slightly more complicated case, with no pre-ordered T-SQL using merge queries
SELECT o.orderid,c.customerid,c.contactnamefrom Orders O JOIN Customers CON O.customerid=c.customerid and O.shipcity <>c.cityorder by C.customerid
The above code returns those customers whose shipping orders are not local to the customer.
The above query plan can be seen, the consumption of the order is always huge, in fact, the above statement according to the logic should be after the merge connection to obtain data, only the display according to CustomerID to sort.
But because the merge Join operator needs to be sorted before itself, SQL Server here takes a prioritized strategy, putting the sort operations ahead of the merge connection, and after merging the connections, there is no need to do extra sorting.
This is actually where we want to sort the results of the query, and also take advantage of the features of the merge connection.
C, hash connection (hash join)
We analyzed the above two connection algorithms, the two algorithms have their own characteristics, but also each has its own application scenario: Nested loops are suitable for a relatively small data set connection, the merge connection should be a medium-sized data set, but it has its own shortcomings, such as requirements must have an equivalent connection, and need to pre-order and so on.
What about the connection database for large data sets? That is the application scenario for the hash join algorithm.
Hash joins are much better for parallel operations of large data collections than others, especially for OLAP data warehouse scenarios.
Hash joins are similar in many places to merge connections, such as requiring at least one equivalent connection, and also supporting all outer join operations. However, unlike merge connections, hash joins do not need to be pre-ordered for the input data collection, and we know that sorting operations on large tables is a significant drain, so the hash operation will undoubtedly improve the performance of hashing operations.
Hash joins are divided into two stages at execution time:
During the build phase, the hash join reads all rows from a table, processes the line model hash of the equivalent connection key, and then creates a memory hash table, which in turn puts the rows in the original column into different hash buckets.
After the first phase is complete, you begin the second phase of the exploratory phase, in which the hash join reads all rows from the second data table and also hashes on the same equivalent connection key. The hash process buckets the previous stage, and then explores the matching rows from the hash table.
In the above process, the first stage of the build phase is blocked, that is, the hash connection must read in and process all the build input before the row can be returned. And this process requires a piece of memory storage to provide support, and the use of a hash function, so the corresponding will also consume CPU and so on.
And the above process is generally used in the process of concurrency, the full use of resources, of course, the system will be limited to the number of hashes, if the amount of data is very large, there will be memory overflow and other issues, and for these problems, SQL Server has its own way of processing.
We can use the following code to understand
--build phase for each row R1 in the build tablebegin calculate hash value on R1 join key (s) insert R1 into the Appropriat E Hash bucketend--Exploratory phase for each row R2 in the probe tablebegin calculate hash value on R2 join key (s) for each row R1 in the corresponding hash bucket if R1 joins with R2 output (R1,R2) end
Before the hash connection is executed, SQL Server estimates how much memory is needed to build the hash table. The basic estimation method is estimated by the statistical information of the table, so sometimes the statistical information is inaccurate, it will directly affect its computing performance.
SQL Server will do its best to reserve enough memory to ensure a successful build of the hash connection, but sometimes it is necessary to allocate a small portion of the hash table to the hard disk, which is deposited in the tempdb library, and this process is repeated multiple times.
Give me a Lezilai look.
SELECT o.orderid,o.orderdate,c.customerid,c.contactnamefrom Orders O JOIN Customers CON O.customerid=c.customerid
We analyze the above execution statement, the above execution results are related by the CustomerID column, the theory will be most appropriate to take the merge join operation, but the merge connection needs to sort, but we do not specify the order by option in the statement, so after evaluation, This statement is connected by a hash connection.
We give it a sort of display, it chooses the merge connection as the best way to connect
Let's sum up the features of this algorithm
- As with merge joins, the complexity of the algorithm is basically to traverse the data sets of each side separately.
- It does not require pre-ordering of the dataset, nor does it require any index above it, and is processed by a hashing algorithm
- The basic way to take a parallel execution plan
However, the algorithm also has its own shortcomings, because it is using a hash function, so the runtime CPU consumption is high, the same memory is also relatively large, but it can be used in parallel processing, so the algorithm for the large data table connection query on the display of their own unique advantages.
The hashing algorithm in the hashing process when the memory of the use and allocation method, there is its own unique hashing method, such as: Left depth tree, right depth tree, dense hash connection tree, and so on, here do not do a detailed introduction, just need to know how to use it.
Hash join is not an optimal connection algorithm, but it does not optimize the input, because the input data set is particularly large, and there is no index on the connector is not required. In fact, this is also a forced choice, but the algorithm has its adaptation to the scene, especially in the OLAP data Warehouse, in a relatively adequate system resources environment, the algorithm has been the scene of its play.
Of course, the two algorithms described above are not useless, in the business of the OLTP system library, the two lightweight connection algorithm, its own superiority has also been recognized.
So these three kinds of algorithms, no one good who bad, only the right scenario to apply the appropriate connection algorithm, so as to play its own strengths, and this happens to be the skills we have to master.
These three kinds of connection algorithms we can also display the designation, but it is generally not recommended, because the default SQL Server for us to evaluate the best way to operate the connection, of course, sometimes it is wrong to evaluate the time we need to specify, the method is as follows:
Second, the aggregation operation
Aggregations are also frequently encountered when writing T-SQL statements, and we analyze the characteristics and optimizations of some commonly used aggregation operator operators.
A, scalar aggregation
Scalar aggregation is a common way of aggregating data, such as the following aggregate functions used in the statements we write: MAX (), MIN (), AVG (), COUNT (), SUM ()
The output of these data result items above is basically generated by the flow aggregation, and this operator is also known as: Scalar aggregation
Let's look at a column first.
SELECT COUNT (*) from Orders
The above chart is the operator of the flow aggregation.
There is also an operator that computes a scalar, because the result item data type that is produced in the stream aggregation is the bigint type, and the default output is the int type, so an operator for the type conversion is added.
Let's look at one that doesn't need to be converted.
SELECT MIN (OrderDate), MAX (OrderDate) from Orders
Look at the operator for averaging
SELECT AVG (Freight) from Orders
When averaging, a case-when classification was added to SQL Server when it was executed, preventing the denominator from occurring at 0.
Let's see distinct under the circumstances, the execution plan
SELECT count (DISTINCT ShipCity) from Ordersselect count (DISTINCT OrderID) from Orders
The same statement above, but produces a different execution plan, just because the number of different columns is summarized, because OrderID does not have duplicate columns, so SQL Server does not need to sort direct flow aggregation can produce summary values, and shipcity different it will have duplicate values, So the aggregate value can be obtained only after sorting and then streaming aggregations.
In fact, the most common way for this algorithm to flow aggregation is the grouping (group by) calculation, the above scalar calculation also uses this feature, but the whole formation of a large group of aggregation.
I understand it through the following code
Clear the current aggredate resultsclear the current group by columnsfor all input rowbegin if the input row does not Match the current group by columns begin to output the current aggreagate results (if any) clear the current Aggr Eagate results set the current group by columns to the input row end Update the aggregate results with the INP UT rowend
The flow aggregation operator is a simple process that maintains an aggregation group and aggregated values, scans the data in the table sequentially, ignores if the aggregation group can not match, and if so, joins the aggregation group and updates the aggregated value result item.
As an example,
SELECT Shipaddress,shipcity,count (*) from Ordersgroup by shipaddress,shipcity
Flow aggregation is used here, and the first two columns are sorted, and the ordering is always very expensive.
The following code will not produce a sort
SELECT Customerid,count (*) from Ordersgroup by CustomerID
So here we have summed up an optimization method for flow aggregation: Avoid sorting as much as possible, and to avoid sorting you need to have the grouping (group by) field within the index coverage.
B, Hash aggregation
The method of flow aggregation above needs to be ordered in advance, we know that sorting is a very large consumption process, so it is not suitable for large table grouping aggregation operation, in order to solve this problem, another kind of aggregation operation is introduced: hash aggregation
The so-called hash aggregation internal method is the same as the hash connection mechanism mentioned earlier in this article.
Hash aggregation does not require sorting and excessive memory consumption, and it is easy to execute the plan in parallel, using multi-CPU synchronization, but one drawback is that the process is blocked, and that the hash aggregation does not produce any results until the full input.
Therefore, it is a good application scenario to use hash aggregation in large data tables.
Deepen your understanding with the following code
For each input Rowbegin calculate hash value on GROUP by columns Check for a matching row in the hash table if Maching row not found insert a new row to the hash table else update the matching row with the input rowend --Final output ouput all rows in the hash table
The simple point is to hash the grouped columns, assign them to a different hash bucket, and then match them in order before outputting the results before the operation is matched.
As an example,
SELECT Shipcountry,count (*) from Ordersgroup by ShipCountry
This statement is very interesting, we use the ShipCountry to group, we know that the column is not covered by the index, as a matter of fact, the choice of stream aggregation should be a good way, with the examples we enumerated above, first sort this field, and then use flow aggregation to form the output of the result item.
But why does this statement SQL Server choose Hash Match as the best algorithm for us?!!
I'm here to compare two group fields: ShipCountry and the previous shipaddress
The front is the country, followed by the address, the country is a lot of duplicates, and only a few unique values. And the address is not the same, discrete distribution, we know that sequencing is a very resource-intensive thing, but the use of hash matching only need to extract the different column values can be, so compared to performance, there is no doubt that the hash matching algorithm here is notch above algorithm.
And how does SQL Server know about these two columns of content distribution type? This is where the powerful statistics of SQL Server are supported.
In SQL Server, it is not a fixed statement that will form a specific plan, and the resulting specific plan is not always optimal, which is related to the content distribution, the amount of data, the type of data in the database's existing data tables, and the statistics that are used to record these details.
All optimal planning choices are based on existing statistical information, and if our statistics are not updated in a timely manner, then the optimal execution plan will not be the best, and sometimes the worst.
Reference documents
- Microsoft Books Online logical operators and physical operator references
- Refer to the book SQL. server.2005. Technical Insider "series
Conclusion
This article is here first, this article mainly describes the T-SQL statement tuning from the execution plan, and introduces three common connection operators and aggregation operators, the next one will focus on some of our other most commonly used operators and tuning techniques, including: Curd operators, Union operators, index operations, Parallel computing and so on, the content of SQL Server performance tuning involves a wide range of topics, followed by the analysis in the next article.
At the end of the article, we give a link to the previous article
SQL Server Tuning Series Basics
SQL Server Tuning Series Basics (Summary of common operators-three ways to analyze physical connections)