Introduction
In SQL Server, using the TOP clause in complex queries can have performance implications, both positive and negative, depending on the situation.
A SQL statement in a relational database is only an abstract description of a result; it does not dictate an implementation. Much of the metadata (statistics, indexes, and so on) affects how the execution plan is built, and the SQL text itself is normally not an input to plan generation (except for hints). The TOP keyword, however, is a keyword that directly affects the execution plan, so using TOP can have a performance impact in some cases. Let's look at the different scenarios.
Single-table case
For a single-table query (where the single table is a physical table, not a view or table-valued function), TOP has no negative performance impact. In fact, TOP itself can be regarded as a query hint telling the optimizer to "return only n rows". Let's look at a simple example, as shown in Figure 1:
Figure 1: Execution plan for a single-table query with the TOP keyword
As the execution plan comparison in Figure 1 shows, the TOP clause can often improve performance for single-table queries with index support: the n in TOP n prompts the query optimizer to estimate exactly n returned rows instead of relying on the data distribution in the statistics, and this estimate of n is reasonable for the optimizer.
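As a minimal sketch (the table and column names below are assumptions, not taken from the figure), a TOP query ordered on an indexed key lets the scan simply stop after n rows, with no Sort operator:

-- Minimal sketch, assuming a hypothetical table dbo.Orders
-- with a clustered index on OrderID.
SELECT TOP (10) OrderID, OrderDate
FROM dbo.Orders
ORDER BY OrderID;  -- ordering on the indexed key: the scan stops after 10 rows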
However, there are times when the memory grant (SQL Server estimates the memory a plan needs each time it is executed; if the granted memory turns out to be less than what execution actually requires, the operation spills to tempdb, which has a very large performance impact; the formula that predicts memory varies greatly between versions, so it is not explained in detail here) is insufficient, and the performance impact is severe. Before we get to that, let's first talk about two operators:
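If you want to observe memory grants while a query runs, the sys.dm_exec_query_memory_grants DMV exposes the requested and granted amounts. A minimal sketch, meant to be run from a second session while the test query executes:

-- Sketch: watch memory grants from another session while a query runs.
SELECT session_id,
       requested_memory_kb,
       granted_memory_kb,
       used_memory_kb,
       max_used_memory_kb
FROM sys.dm_exec_query_memory_grants;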
Sort
The Sort operator is a very general sorting operator and may appear in many places in an execution plan, for example before a merge join or because of an ORDER BY. The algorithm is generic and can sort very large result sets. It is blocking (meaning that no data can flow to the next operator until the sort finishes) and requires a lot of memory and CPU resources. Another problem with this operator is that when the memory grant is insufficient, tempdb is needed to complete the sort, which carries a significant performance overhead.
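One way to catch these spills in production is with Extended Events. The sketch below assumes SQL Server 2016 or later, where the sort_warning event is available:

-- Sketch: capture sort spills to tempdb with Extended Events
-- (assumes SQL Server 2016+, where sqlserver.sort_warning exists).
CREATE EVENT SESSION SortSpills ON SERVER
ADD EVENT sqlserver.sort_warning
ADD TARGET package0.ring_buffer;

ALTER EVENT SESSION SortSpills ON SERVER STATE = START;
-- ... run the workload, inspect the ring buffer, then stop the session:
-- ALTER EVENT SESSION SortSpills ON SERVER STATE = STOP;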
Top N Sort
Top N Sort is a sort algorithm adapted to small scenarios, specifically queries that select only a few rows. When only a few rows are needed, the cost of sorting the entire result set is too high, so the Top N algorithm instead takes the first row, compares it against the other rows to see whether it is the maximum (or minimum), then takes the second row and compares, and so on, until the first n rows are found. This algorithm is significantly better than the Sort operator when the number of rows is small, but if the value of n is too large the algorithm is no longer appropriate, for the following reasons:
1. The algorithm does not support spilling to tempdb, so it cannot handle very large result sets.
2. The algorithm needs to traverse the data n times; if n is too large, the cost is too high.
For SQL Server, the threshold above which this n is considered too large is 100. Let's look at an example; the test data and code are shown in Code Listing 1.
CREATE TABLE Testtop
(ID int, sortkey int, somevalue CHAR(1000))
DECLARE @i INT = 1
WHILE @i < 300000
BEGIN
    INSERT INTO Testtop VALUES (@i, @i, 'a')
    SET @i = @i + 1
END
CREATE CLUSTERED INDEX pk_id ON Testtop (ID)
--test 1
SELECT TOP (100) * FROM Testtop
ORDER BY sortkey
--test 2
SELECT TOP (101) * FROM Testtop
ORDER BY sortkey
Code Listing 1. Test data and test code
The first test is TOP 100, which just qualifies for the Top N Sort algorithm; the second test is TOP 101, which can only use the ordinary Sort algorithm, as shown in Figure 2.
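To confirm which variant was chosen, capture the actual plan and look at the sort operator's logical operation. A quick sketch, assuming the plan XML reports the operator as LogicalOp "TopN Sort" versus "Sort":

-- Sketch: capture the actual plan and check the sort variant.
SET STATISTICS XML ON;
SELECT TOP (100) * FROM Testtop ORDER BY sortkey;  -- expect a "TopN Sort"
SELECT TOP (101) * FROM Testtop ORDER BY sortkey;  -- expect a full "Sort"
SET STATISTICS XML OFF;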
Figure 2: TOP 101 requires more memory for the sort; the insufficient memory grant results in a spill to tempdb
Now let's look at the execution time, as shown in Figure 3.
Figure 3: Very large difference in execution time
As can be seen from Figure 3, the spill to tempdb makes the execution times differ greatly.
Therefore, when using TOP, try to keep the value at or below TOP 100.
Multi-table case
Because the TOP statement acts as a hint to the optimizer's cardinality estimate, multi-table queries can, in extreme cases, suffer from a severe underestimate of the number of rows, which in turn hurts performance.
For example, consider the sample query shown in Figure 4.
Figure 4: Query using TOP 1 on joined tables
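Since Figure 4 is an image, the query below is only a hypothetical reconstruction of that shape (all table and column names are assumptions): a very selective filter combined with TOP 1 across a join.

-- Hypothetical reconstruction of the query shape in Figure 4
-- (dbo.Orders, dbo.Customers and their columns are assumed names).
SELECT TOP (1) o.*
FROM dbo.Orders AS o
JOIN dbo.Customers AS c
    ON c.CustomerID = o.CustomerID
WHERE c.Region = N'US'      -- highly selective predicate
ORDER BY o.OrderDate;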
In this case, the presence of TOP 1 makes the query optimizer use 1 as the estimated number of rows, while the actual number of rows is very different, so using TOP here can actually lead to a much higher cost (although Figure 4 shows an estimate of 0% vs. 100%, the actual difference is huge). The reason the cost is higher is not only that the optimizer estimates 1 row: a loop join could indeed end as soon as it finds 1 row, but in the example above the filter condition is so selective that finding even the first row requires too many random lookups (a loop join performs random IO), as the cost in Figure 5 shows.
Figure 5: Using TOP results in performance degradation
The root cause is that the estimated number of rows is only 1, and in most cases this estimate is far below the number of rows that actually have to be examined.
For the above scenario, we usually have the following workarounds (sketches of options 1 and 3 follow the list):
1. Use hints. Since we know the problem is that the actual number of rows is much larger than the estimate, we can try hints such as HASH JOIN or FORCESCAN.
2. Add WHERE conditions so that the rows returned have higher selectivity.
3. Do not use TOP 1; use TOP 10 or a larger value so that the estimated number of rows is bigger. Compared with the query in Figure 5, after we change TOP 1 to TOP 10, the execution plan becomes the one shown in Figure 6.
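As hedged sketches of workarounds 1 and 3, reusing the hypothetical query shape from above (names are still assumptions):

-- Workaround 1: force a hash join and a scan instead of random lookups.
SELECT TOP (1) o.*
FROM dbo.Orders AS o WITH (FORCESCAN)
JOIN dbo.Customers AS c ON c.CustomerID = o.CustomerID
WHERE c.Region = N'US'
ORDER BY o.OrderDate
OPTION (HASH JOIN);

-- Workaround 3: a larger n raises the row estimate.
SELECT TOP (10) o.*
FROM dbo.Orders AS o
JOIN dbo.Customers AS c ON c.CustomerID = o.CustomerID
WHERE c.Region = N'US'
ORDER BY o.OrderDate;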
Figure 6: Execution plan for TOP 10
This is because a loop join can return a limited number of rows more quickly when the row count is low, which is equivalent to a FAST n hint on the query; but as the number of rows grows, the optimizer prefers a merge or hash join to complete the operation, which performs better in the extreme case where many rows are returned (low selectivity). The result is shown in Figure 7.
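The FAST n query hint mentioned above can express the same intent explicitly without changing TOP. A sketch, again with assumed names:

-- Sketch: ask the optimizer to optimize for returning the first 10 rows quickly.
SELECT o.*
FROM dbo.Orders AS o
JOIN dbo.Customers AS c ON c.CustomerID = o.CustomerID
WHERE c.Region = N'US'
ORDER BY o.OrderDate
OPTION (FAST 10);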
Figure 7: TOP 10 has better performance than TOP 1 in this special case
Therefore, combining this with the single-table example, a value between 10 and 100 is recommended when using the TOP keyword.
Summary
This article described the possible impact of the TOP keyword on the execution plan in single-table and multi-table cases. TOP affects the execution plan mainly in the following two areas:
- Memory Grant
- Estimated number of rows
Therefore, in these special cases, consider adjusting the TOP statement according to the actual situation and the recommendations in this article.