Technical preparation
This article is based on SQL Server 2008 R2 and uses Microsoft's simple sample database, Northwind, for the analysis.
First, joins
Joins are the operation we use most often when writing T-SQL statements: we get the data we want by relating two tables.
SQL Server supports three physical join operators: the nested loops join, the merge join, and the hash join. Each of the three has its own characteristics, and in different scenarios the database will choose the best join method for us.
A. Nested loops join
The nested loops join is the simplest and most basic join method. The two tables are related on a key, the rows of the two tables are matched through a double loop, and the result is then filtered on the join key.
The following diagram can be used to understand the process.
In fact, the nested loops join is a very simple way to get data; simply put, a two-level loop filters out the result rows.
We can deepen our understanding through the following pseudocode.
for each row R1 in the outer table
    for each row R2 in the inner table
        if R1 joins with R2
            return (R1, R2)
Take an example.
SELECT O.OrderID
FROM Customers C
JOIN Orders O ON C.CustomerID = O.CustomerID
WHERE C.City = N'London'
The diagram above shows the nested loops join operator icon, and the execution plan explains it clearly.
The cost of this method is the product of the outer table and the inner table, which is what we usually call the Cartesian product, so the cost grows with the amount of data in the two tables, especially the inner table, because it is scanned repeatedly. The measure we take in practice is therefore to reduce the number of rows in the outer or the inner table to reduce the cost.
There is another way to improve the performance of this algorithm. Because the two tables are joined through a key, the speed at which the underlying data is fetched in the query is directly related to the performance of this algorithm, so the usual optimization is to build indexes on the join keys of the two tables to improve query speed.
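A minimal sketch of that idea (the index name is hypothetical, and Northwind may already have an index here): index the join key of the inner table so the nested loops join can seek rather than scan.

-- Hypothetical example: index the inner table's join key so that each
-- outer row can be matched with an index seek instead of a scan.
CREATE NONCLUSTERED INDEX IX_Orders_CustomerID
ON Orders (CustomerID);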
Another point: in a nested loops join, when the two tables are joined there is usually a filter condition on the outer table. For example, WHERE C.City = N'London' in the example above is a filter on the outer table (Customers), and the City column is indexed in that table, so both child operators of this statement are index seeks.
However, in some cases our query conditions are not covered by an index, and then the child operators under the nested loops join become an index scan or a RID lookup.
As an example
SELECT E1.EmployeeID, COUNT(*)
FROM Employees E1
JOIN Employees E2 ON E1.HireDate < E2.HireDate
GROUP BY E1.EmployeeID
The above code gets, from the employees table, the number of employees hired after each employee. Let's take a look at the execution plan for this query.
It is obvious here that the two tables are joined on the HireDate column, and this column is not covered by any index entry, so both tables can only be fetched through a full clustered index scan. If the data in the two tables were particularly large, this would be a very performance-intensive query.
As the plan shows, the result of this T-SQL query is obtained by filtering, inside the nested loops operator, the Cartesian product formed by the full scans of the two tables. This is actually not optimal: we could filter each of the two tables first and then produce the result through the nested loops operator, which performs much better.
Let's try rewriting this statement.
SELECT E1.EmployeeID, ECnt.Cnt
FROM Employees E1
CROSS APPLY
(
    SELECT COUNT(*) Cnt
    FROM Employees E2
    WHERE E1.HireDate < E2.HireDate
) ECnt
The above code returns the same result items as before, except that we filter the inner table according to each row of the outer table, so we no longer need to fetch all the data items.
Let's look at the execution plan of the rewritten statement.
Let's compare the execution costs of the two statements, before and after the rewrite, to compare their efficiency.
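One common way to make such a comparison is to enable time and I/O statistics for the session before running each statement (standard session options, sketched here for illustration):

-- Report CPU/elapsed time and logical reads for every statement that follows
SET STATISTICS TIME ON;
SET STATISTICS IO ON;
-- run the original query, then the CROSS APPLY rewrite, and compare the output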
The execution time drops from 1 second 179 milliseconds to 93 milliseconds. The effect is obvious.
Comparing CPU consumption, memory, compile time, and other overall costs, all of them are reduced; refer to the figure above.
Therefore, optimization of the nested loops join focuses on these points: reduce the amount of data in the two tables, index the join keys, and, ideally, use covering indexes for the predicate conditions, which best reduces the number of records that satisfy the predicates.
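A minimal sketch of the covering-index point (the index name is hypothetical, and Customers.City may already be indexed in Northwind):

-- Hypothetical covering index for the City predicate on the outer table;
-- the clustering key (CustomerID) travels with every nonclustered index,
-- so the earlier query (SELECT O.OrderID ... WHERE C.City = N'London') is covered.
CREATE NONCLUSTERED INDEX IX_Customers_City
ON Customers (City);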
B. Merge join
The nested loops join described above has a number of problems, especially when both tables are large, because it produces N scans of the inner table, which is obviously a serious drain on resources.
For these reasons, the database provides another join method: the merge join. Note that this is not something only SQL Server offers; this join algorithm is used by every RDBMS on the market.
A merge join reads rows from the two tables in turn. If two rows match, it outputs a joined row and continues with the next rows. If they do not match, it discards the row with the smaller value and reads the next row from that input, continuing until the scan of one of the two tables reaches the end. So the algorithm makes only one pass over each table, and it does not even need to scan either table to the end before it can stop.
The algorithm requires a sequential scan and comparison of the two tables, but there are two prerequisites: 1) the join columns of the two tables must be sorted beforehand; 2) the merge condition between the two tables must include an equality (equijoin) predicate.
We can use the following pseudocode to understand it.
get a row R1 from input1
get a row R2 from input2
while not at the end of either input
begin
    if R1 joins with R2
    begin
        output (R1, R2)
        get next row R2 from input2
    end
    else if R1 < R2
        get next row R1 from input1
    else
        get next row R2 from input2
end
The total cost of the merge join operator is proportional to the number of rows in the input tables; each table is read at most once, which is quite different from the nested loops join. Therefore, the merge join is a good choice for join operations on large tables.
For a merge join, you can improve performance from the following points:
The content of the join columns of the two tables: if the join columns in both tables are unique columns, there are no duplicate values and this join performs best. Alternatively, if one table has a unique join column, it can be joined one-to-many; this is what we use most often, like the master-detail table queries we run all the time. If the join columns in both tables contain duplicate values, then when the two tables are joined a third table is needed to hold the duplicate values; this third table, called a "worktable", is stored in tempdb or in memory, and performance suffers. So, with this in mind, the optimization we usually apply is: wherever possible, make the join column a clustered index (for uniqueness).
We know that the prerequisite for this algorithm is that both inputs are sorted, so when we use it, it is best to join on columns that are already sorted. If there is no existing sort order, the join column we choose should preferably be covered by an index, because sorting a large table is a resource-intensive process, and sorting on an index-covered column performs much better than sorting on an ordinary column.
Let's give you an example.
SELECT O.CustomerID, C.CustomerID, C.ContactName
FROM Orders O
JOIN Customers C ON O.CustomerID = C.CustomerID
We know that the join column in this T-SQL statement is CustomerID, and this column is the clustered primary key, which is unique and already sorted, so no sort operation appears here.
Also, all of the output items of a merge join are already sorted.
Let's look at a slightly more complicated case: a merge join whose inputs are not sorted in advance.
SELECT O.OrderID, C.CustomerID, C.ContactName
FROM Orders O
JOIN Customers C ON O.CustomerID = C.CustomerID AND O.ShipCity <> C.City
ORDER BY C.CustomerID
The above code returns the customers whose orders were shipped to a city other than the customer's own.
From the query plan above we can see that the cost of the sort is always huge. Logically, the statement above should first get the data through the merge join and only then perform the explicit sort by CustomerID.
However, because the merge join operator itself needs sorted input, SQL Server takes the smarter approach here of moving the sort operation ahead of the merge join, so no additional sort is needed after the join.
This way, the sort we wanted on the query results is obtained by taking advantage of a characteristic of the merge join.
C. Hash join
We have analyzed the two join algorithms above; each has its own characteristics and its own scenario: nested loops are suitable for joining relatively small data sets, and the merge join suits medium-sized data sets, but it has its own drawbacks, such as requiring an equijoin and needing the inputs to be sorted beforehand.
What about joins over large data sets? That is the scenario the hash join algorithm is made for.
Hash joins are much better than the others at parallel operation over large data sets, especially in OLAP data warehouse scenarios.
Hash joins are similar to merge joins in many ways: for example, both require at least one equijoin predicate, and both support all outer join operations. Unlike the merge join, however, the hash join does not need the input data sets to be sorted beforehand, and since we know that sorting a large table is a great drain, removing the sort operation definitely improves the performance of the hash operation.
A hash join is executed in two stages:
Build phase
In the build phase, the hash join reads all the rows from the first table, hashes each row on the equijoin key, creates an in-memory hash table, and places the original rows into the corresponding hash buckets.
Probe phase
After the first phase completes, we enter the second, probe phase: the hash join reads all the rows from the second table, hashes them on the same equijoin key into the buckets built in the previous phase, and then probes the hash table for matching rows.
In the process above, the build phase of the first stage is blocking; that is, the hash join must read and process the entire build input before it can return any rows. This process needs a block of memory to support it, and it uses a hash function, so it consumes CPU accordingly as well.
The process above usually runs with concurrent processing to make full use of resources; of course the degree of parallelism is limited by the system, and if the data volume is too large, memory overflow problems can occur. SQL Server has its own ways of handling these problems.
We can use the following pseudocode to understand it.
-- Build phase
for each row R1 in the build table
begin
    calculate hash value on R1 join key(s)
    insert R1 into the appropriate hash bucket
end

-- Probe phase
for each row R2 in the probe table
begin
    calculate hash value on R2 join key(s)
    for each row R1 in the corresponding hash bucket
        if R1 joins with R2
            output (R1, R2)
end
Before the hash join executes, SQL Server estimates how much memory is needed to build the hash table. The estimate is based on the table's statistics, so if the statistics are inaccurate, the join's runtime performance is directly affected.
SQL Server does its best to reserve enough memory to ensure the hash join succeeds, but sometimes there is not enough, and a small portion of the hash table has to be placed on disk, stored in tempdb, and this process may be repeated many times.
Let's look at an example.
SELECT O.OrderID, O.OrderDate, C.CustomerID, C.ContactName
FROM Orders O
JOIN Customers C ON O.CustomerID = C.CustomerID
Let's analyze the statement above. Its result is joined on the CustomerID column, and in theory the most suitable choice would be a merge join, but a merge join needs sorted input; since we did not specify an ORDER BY option in the statement, the evaluation ends up joining this statement with a hash join.
If we add an explicit sort to it, it chooses the merge join as the optimal join.
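A minimal sketch of that change, assuming we simply append an ORDER BY on the join key to the previous statement:

-- With an explicit sort requested on the join key, the optimizer can favor
-- a merge join, since the sorted output is needed anyway.
SELECT O.OrderID, O.OrderDate, C.CustomerID, C.ContactName
FROM Orders O
JOIN Customers C ON O.CustomerID = C.CustomerID
ORDER BY C.CustomerID;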
Let's summarize the characteristics of this algorithm.
Like the merge join, its complexity is basically a single pass over each input data set.
It does not need the data sets to be sorted in advance, nor does it require any indexes on them; it relies on the hashing algorithm instead.
It readily takes a parallel execution plan.
However, the algorithm also has its drawbacks: because it uses a hash function, it consumes a lot of CPU at run time, and its memory consumption is also relatively large. But it can be processed in parallel, so for join queries over large data tables this algorithm shows its own unique advantages.
As for how the hash algorithm uses and allocates memory during hashing, it has its own specific techniques, such as the left-deep tree, the right-deep tree, the bushy hash join tree, and so on; we will not go into detail here, we only need to know how to use it.
The hash join is not an optimal join algorithm; rather, it makes no demands on its input, because the input data sets are particularly large and there is no requirement for indexes on the join keys. In fact it is something of a forced choice, but the algorithm has the scenarios it suits, especially OLAP data warehouses: in an environment with relatively ample system resources, this algorithm has room to shine.
Of course, the two algorithms described earlier are not useless: in OLTP business systems these two lightweight join algorithms, with their own advantages, have also earned their recognition.
So among these three algorithms, none is simply better or worse; only by applying the right join algorithm in the right scenario can it play to its strengths, and it happens that these are exactly the skills we have to master.
We can also explicitly specify any of these three join algorithms, but this is generally not recommended, because by default SQL Server evaluates and chooses the best join method for us. Of course, sometimes its evaluation is off and we need to specify one ourselves; the method is as follows:
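A minimal sketch of the standard T-SQL syntax for this (shown for illustration only, not as a recommendation): the physical join operator can be forced either on the join itself or with a query-level hint.

-- Hint on the join itself: force a merge join
SELECT O.OrderID, C.ContactName
FROM Orders O
INNER MERGE JOIN Customers C ON O.CustomerID = C.CustomerID;

-- Query-level hint: force a hash join
SELECT O.OrderID, C.ContactName
FROM Orders O
JOIN Customers C ON O.CustomerID = C.CustomerID
OPTION (HASH JOIN);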
Second, aggregation operations
Aggregation is also something we often encounter when writing T-SQL statements. Here we analyze the properties and optimization of some commonly used aggregation operators.
A. Scalar aggregation
Scalar aggregation is a commonly used way of aggregating data, for example the following aggregate functions used in the statements we write: MAX(), MIN(), AVG(), COUNT(), SUM().
The output of these result items is basically produced by the stream aggregate operator, and in this form the operator is also known as a scalar aggregate.
Let's look at an example first.
SELECT COUNT(*) FROM Orders
The figure above shows the stream aggregate operator.
The figure above also contains a compute scalar operator, because the result item produced by the stream aggregate is of type bigint while the default output of COUNT is the int type, so an operator is added to perform the type conversion.
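As an illustrative aside (my own addition, not from the original article): if a bigint result is acceptable, COUNT_BIG returns bigint directly, so that extra conversion step is unnecessary.

-- COUNT_BIG already returns bigint, so no conversion to int is needed
SELECT COUNT_BIG(*) FROM Orders;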
Let's look at one that doesn't need to be converted.
SELECT MIN(OrderDate), MAX(OrderDate) FROM Orders
Take a look at the operator for computing an average.
SELECT AVG(Freight) FROM Orders
For the average, when SQL Server executes it, a CASE WHEN expression is also added to guard against a denominator of 0.
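Conceptually, the plan computes something roughly equivalent to the following hand-written form (a sketch of the idea only, not the literal internal expression):

-- AVG is effectively SUM/COUNT, with a CASE guarding against division by zero
SELECT CASE
           WHEN COUNT(Freight) = 0 THEN NULL
           ELSE SUM(Freight) / COUNT(Freight)
       END AS AvgFreight
FROM Orders;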
Let's look at the execution plan in the DISTINCT case.
SELECT COUNT(DISTINCT ShipCity) FROM Orders
SELECT COUNT(DISTINCT OrderID) FROM Orders
The two statements above look the same but produce different execution plans, simply because the COUNT is over different columns: OrderID has no duplicate values, so SQL Server can produce the aggregate value directly with a stream aggregate without sorting; ShipCity is different, it does have duplicate values, so the values can only be aggregated by the stream aggregate in order after a sort.
In fact, the most common use of the stream aggregate is grouped (GROUP BY) calculation; the scalar calculation above also relies on this feature, it just aggregates the entire input as one big group.
We can understand it through the following pseudocode.
clear the current aggregate results
clear the current group by columns
for each input row
begin
    if the input row does not match the current group by columns
    begin
        output the current aggregate results (if any)
        clear the current aggregate results
        set the current group by columns to the input row
    end
    update the aggregate results with the input row
end
The stream aggregate operator is a simple process: it maintains the current aggregation group and aggregate value while scanning the input rows; if a row does not match the current group, it outputs the current result and starts a new aggregation group, otherwise it keeps updating the aggregate value result items.
As an example
SELECT ShipAddress, ShipCity, COUNT(*)
FROM Orders
GROUP BY ShipAddress, ShipCity
Stream aggregation is used here, and the two columns are sorted first; sorting always consumes a great deal.
The following code will not produce a sort
SELECT CustomerID, COUNT(*)
FROM Orders
GROUP BY CustomerID
So here we have a way to optimize stream aggregation: avoid generating a sort, and to avoid the sort the grouping (GROUP BY) fields need to be covered by an index.
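A minimal, hypothetical sketch of that idea for the earlier ShipAddress/ShipCity grouping (the index name is illustrative):

-- Hypothetical index: rows arrive already ordered by the GROUP BY columns,
-- so the stream aggregate needs no separate sort step.
CREATE NONCLUSTERED INDEX IX_Orders_ShipAddress_ShipCity
ON Orders (ShipAddress, ShipCity);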
B. Hash aggregation
The stream aggregation above needs its input sorted in advance, and we know that sorting is a very expensive process, so it is not suitable for grouped aggregation over large tables. To solve this problem, another aggregation operator is introduced: hash aggregation.
The hash aggregation method uses the same mechanism as the hash join described earlier in this article.
Hash aggregation requires neither sorting nor excessive memory consumption, and it easily runs as a parallel plan, using multiple CPUs simultaneously; but one drawback is that the process is blocking, that is, the hash aggregate does not produce any results until it has consumed its full input.
So using hash aggregation on large data tables is a good scenario for it.
Deepen your understanding with the following code
for each input row
begin
    calculate hash value on the group by columns
    check for a matching row in the hash table
    if no matching row is found
        insert a new row into the hash table
    else
        update the matching row with the input row
end
-- final output
output all rows in the hash table
Simply put, it hashes the grouping columns and assigns the rows to different hash buckets; matching rows within a bucket are combined by the aggregate operation, and the results are output at the end.
As an example
SELECT ShipCountry, COUNT(*)
FROM Orders
GROUP BY ShipCountry
This statement is very interesting. We group by ShipCountry, and we know that this column is not covered by an index; going by what we listed above, stream aggregation should actually be a good choice, that is, sort on this field first and then use the stream aggregate to produce the result items.
But why does SQL Server choose a hash match as the optimal algorithm for this statement?
Let's compare the two grouping fields: ShipCountry and the earlier ShipAddress.
The former is a country, the latter an address. A country contains many duplicates and only a few distinct values, whereas an address does not, being discretely distributed. We know that sorting is very resource-intensive, while hash matching only needs to extract the distinct column values, so when performance is compared, there is no doubt that the hash match algorithm comes out slightly ahead here.
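As an illustrative aside (my own addition, not part of the original analysis), the two strategies can be compared directly by forcing each one with a query hint:

-- Force a hash aggregate
SELECT ShipCountry, COUNT(*) FROM Orders GROUP BY ShipCountry OPTION (HASH GROUP);

-- Force a sort plus stream aggregate for comparison
SELECT ShipCountry, COUNT(*) FROM Orders GROUP BY ShipCountry OPTION (ORDER GROUP);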
And how does SQL Server know how the contents of these two columns are distributed? That is thanks to the powerful statistics SQL Server maintains.
In SQL Server it is not the case that a fixed statement always produces a specific plan, nor is the plan it produces always optimal. This is related to many factors, such as the distribution of the content in the database's existing tables, the data volume, the data types, and so on, and the statistics record exactly these details.
The choice of every optimal plan is evaluated on the basis of the existing statistics. If our statistics are not updated in time, the "optimal" execution plan chosen by the evaluation will not be the best; sometimes it may even be the worst.
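A brief sketch of how statistics can be refreshed manually (standard maintenance commands, shown only as an illustration):

-- Refresh the statistics of a single table
UPDATE STATISTICS Orders;

-- Or refresh out-of-date statistics across the whole database
EXEC sp_updatestats;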
Reference documents
Microsoft SQL Server Books Online: logical and physical operators reference
The book series Inside Microsoft SQL Server 2005
Conclusion
This article stops here for now. It has focused on tuning T-SQL statements from the perspective of the execution plan, introducing the three common join operators and the aggregation operators. The next article will cover some of the other operators we use most often and their tuning techniques, including CRUD operators, union operators, index operations, parallel operations, and so on. SQL Server performance tuning covers a very wide range of topics, which subsequent articles will analyze and expand one by one.