Qu Yan miscellany - the pain-inducing row_number function


Using row_number for paging is common knowledge, and it is easy to use - practically a must-have tool in every programmer's kit. But row_number is not some invincible presence; I have run into a few small problems with it recently, which I share here for your entertainment.

-- ============================================================

Question 1: Why does the query get slower after adding a WHERE condition?

Query SQL:

WITH Temp AS(
    SELECT *, ROW_NUMBER() OVER(ORDER BY T2.C6 DESC) AS RID
    FROM TB001 AS T1
    INNER JOIN TB002 AS T2
        ON T1.C1 = T2.C1
    WHERE T1.C2 > 1000
      AND T2.C3 < 99999
      AND T1.C4 = 5
)
SELECT * FROM Temp
WHERE RID BETWEEN 0 AND 10

A developer colleague came to me, puzzled: for a query like the one above, if the WHERE RID BETWEEN 0 AND 10 clause is removed, the query finishes within 1 second; with the clause, it still has not finished after 30 seconds. Without the clause the query returns about 300 rows; with it, only 10 rows come back, and the rows are narrow, so we can ignore the effect of result size on network transfer and display. What is going on? Anyone with a little DBA experience will quickly spot the root cause: a poor execution plan.

To analyze it, let's simplify the SQL:

WITH Temp AS(
    SELECT *, ROW_NUMBER() OVER(ORDER BY T1.C1 DESC) AS RID
    FROM TB001 AS T1
    WHERE T1.C2 > 1000
)
SELECT * FROM Temp
WHERE RID BETWEEN 0 AND 10

Let's speculate on how the query above might be executed. Assume T1.C1 has an index IX_C1 and T1.C2 has an index IX_C2.

Implementation method 1:

A => For the CTE's inner query, use index IX_C2 to locate the rows satisfying T1.C2 > 1000, producing result set U1.

B => Sort U1 by T1.C1 and compute the RID value for each of its rows, producing result set U2.

C => Filter U2 for the rows satisfying RID BETWEEN 0 AND 10, producing result set U3.

D => Return result set U3.
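
To make implementation method 1 concrete, here is a rough T-SQL sketch of the same steps written out explicitly with a temporary table (the #U1 table is purely illustrative; SQL Server does not literally build the plan this way):

-- Step A: use IX_C2 to pull the rows satisfying T1.C2 > 1000 into U1
SELECT T1.*
INTO #U1
FROM TB001 AS T1
WHERE T1.C2 > 1000;

-- Steps B-D: sort U1 by C1 DESC, number the rows, filter on RID and return
SELECT *
FROM (
    SELECT *, ROW_NUMBER() OVER(ORDER BY C1 DESC) AS RID
    FROM #U1
) AS U2
WHERE RID BETWEEN 0 AND 10;

DROP TABLE #U1;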

Implementation method 2:

A => Use index IX_C1 to access T1's data in ORDER BY T1.C1 DESC order.

B => Check whether each row obtained in step A satisfies T1.C2 > 1000; put the qualifying rows into result set U1, incrementing RID by one for each row added.

C => Check the result set U1 produced by step B; once enough rows have been collected (RID BETWEEN 0 AND 10), stop steps A and B.

D => Return result set U1.
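
Written as a single query, implementation method 2 corresponds roughly to the TOP form below; this is only a sketch of the intended access pattern, and the optimizer still decides for itself whether to walk IX_C1 this way:

-- Walk T1 in C1 DESC order (ideally via IX_C1), test T1.C2 > 1000 on each row,
-- and stop once the first 10 qualifying rows have been numbered
SELECT TOP (10)
       T1.*,
       ROW_NUMBER() OVER(ORDER BY T1.C1 DESC) AS RID
FROM TB001 AS T1
WHERE T1.C2 > 1000
ORDER BY T1.C1 DESC;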

Both methods produce the correct result, but which one is better?

For implementation method 1, assume table T1 has 10 million rows. If only 20 rows satisfy T1.C2 > 1000, index IX_C2 quickly finds those 20 rows and sorting them costs almost no CPU. But if a huge number of rows satisfy T1.C2 > 1000, the sort consumes a lot of CPU and the query is slow.

For implementation method 2, again assume T1 has 10 million rows and the C1 values are traversed in descending order via index IX_C1. If 10 rows satisfying T1.C2 > 1000 turn up within the first 50 rows traversed, the query finishes quickly at the cost of only a handful of logical reads. But if a very large portion of the table has to be traversed before 10 rows satisfying T1.C2 > 1000 are found, a huge number of logical reads are consumed and the query is slow.

So it is not hard to draw a conclusion: there is no absolutely good execution plan, only a relatively good one.

-- ============================================================

We know that when SQL Server generates an execution plan, it estimates the number of rows and the cost of each step from the input parameters and the statistics, and then picks the plan with the lower estimated cost. For the query at the beginning of this article, SQL Server succumbed to the temptation of RID BETWEEN 0 AND 10 and chose a plan similar to implementation method 2, while the actual data distribution happened to be the worst case for that method, so the query crawled.
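
When you suspect this kind of mis-estimation, it helps to compare the optimizer's estimated row counts against the actual cost. A minimal way to collect the actual numbers (settings of this kind produce the STATISTICS IO/TIME style of output quoted later in this article) is:

SET STATISTICS IO ON;    -- per-table scan counts and logical/physical reads
SET STATISTICS TIME ON;  -- CPU time and elapsed time

-- run the suspect query here, then compare the estimated row counts in the
-- execution plan against the rows each operator actually produced

SET STATISTICS IO OFF;
SET STATISTICS TIME OFF;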

Similar cases include:

1. A query returns 20 rows of data; add ORDER BY plus TOP (10) on top of it and it runs far more slowly. Why should sorting 20 rows and taking the top 10 be slow?

2. A query returns 20 rows of data; add SELECT TOP (20) and SELECT TOP (10000) to it respectively, and the TOP (10000) version is far faster than the TOP (20) one. In my case, SELECT TOP (10000) finished in 5 ms, while SELECT TOP (20) had returned nothing after 10 minutes.

 

These cases all share the same pair of operations, ORDER BY + TOP, and ROW_NUMBER is essentially ORDER BY + TOP. CPU is among the most precious of a server's resources, and sorting a result set is a CPU-hungry operation, so SQL Server picks what it considers an appropriate plan that spares the CPU. The result is a tragedy.
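
To make the "ROW_NUMBER is essentially ORDER BY + TOP" point concrete, the two forms below ask the engine for essentially the same work - sort by C6 and return the first 10 rows (table and column names are borrowed from the example above purely for illustration):

-- ORDER BY + TOP form
SELECT TOP (10) *
FROM TB002
ORDER BY C6 DESC;

-- ROW_NUMBER form: same sort, same first-10-rows cut-off
SELECT *
FROM (
    SELECT *, ROW_NUMBER() OVER(ORDER BY C6 DESC) AS RID
    FROM TB002
) AS T
WHERE RID <= 10;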

-- ============================================================

To deal with my developer colleague's problem I tried all kinds of rewrites, but I could not fix this SQL without resorting to temporary tables or index hints, so I came up with an evil trick:

WITH Temp AS(
    SELECT *, ROW_NUMBER() OVER(ORDER BY T2.C6 DESC) AS RID
    FROM TB001 AS T1
    INNER JOIN TB002 AS T2
        ON T1.C1 = T2.C1
    WHERE T1.C2 > 1000
      AND T2.C3 < 99999
      AND T1.C4 = 5
)
SELECT * FROM Temp
WHERE RID + 0 BETWEEN 0 AND 10

The purists will be screaming by now: this RID + 0 BETWEEN 0 AND 10 style is unscientific and inefficient, the kind of bad SQL written by junior programmers who don't understand SQL...

The RID + 0 is there to cheat the query optimizer: it can no longer use BETWEEN 0 AND 10 to estimate how few rows need to be returned, so it has no choice but to honestly execute the CTE's inner query first and only then filter on RID.

PS: I managed to fool the query optimizer, but not the developer. He kept insisting the trick was too freakish, checked with several other DBAs, and in the end never accepted my suggestion. Sad.
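
For comparison, the temporary-table rewrite I was trying to avoid achieves the same "compute first, filter later" effect explicitly. A sketch, listing only the columns that appear in the example above (in practice you would list whatever the page actually needs):

-- Materialize the CTE's inner query first ...
SELECT T1.C1, T1.C2, T1.C4, T2.C3, T2.C6,
       ROW_NUMBER() OVER(ORDER BY T2.C6 DESC) AS RID
INTO #Temp
FROM TB001 AS T1
INNER JOIN TB002 AS T2
    ON T1.C1 = T2.C1
WHERE T1.C2 > 1000
  AND T2.C3 < 99999
  AND T1.C4 = 5;

-- ... and only then page over the materialized result
SELECT * FROM #Temp WHERE RID BETWEEN 0 AND 10;

DROP TABLE #Temp;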

-- ============================================================

A small suggestion:

When we see something like WHERE C1 + 10 > 20, don't immediately yell that it can't use an index. Look at the scenario first: what if there is no index on C1 at all?
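
A tiny illustration of the point: the two predicates below return the same rows (for numeric C1), but only the second form can be answered with an index seek on C1 when such an index exists; if C1 has no index, both end up scanning anyway, so the "bad" form costs little extra.

-- Expression on the column: hides C1 from an index seek
SELECT * FROM TB001 WHERE C1 + 10 > 20;

-- Equivalent sargable rewrite: seekable via an index on C1
SELECT * FROM TB001 WHERE C1 > 10;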

-- ============================================================

ROW_NUMBER really is handy for implementing paging, but it does not fit every scenario. Here is a real example.

A paging query takes only two parameters, @P1 and @P2, and returns the rows numbered between row @P1 and row @P2.
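
The query presumably followed the usual ROW_NUMBER paging pattern, roughly like the sketch below (a reconstruction only; the sort key and the join to the detail table are placeholders, not the real schema):

DECLARE @P1 INT = 0, @P2 INT = 1000;

WITH Paged AS (
    SELECT *,
           ROW_NUMBER() OVER(ORDER BY SomeKey DESC) AS RID  -- SomeKey is a placeholder
    FROM xxx                                                -- joined to xxxdetail in the real query
)
SELECT *
FROM Paged
WHERE RID BETWEEN @P1 AND @P2;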

When @P1 = 0 and @P2 = 1000, the cost is as follows:

Table 'xxxdetail'. Scan count 186, logical reads 4922, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
Table 'xxx'. Scan count 1, logical reads 809, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
SQL Server Execution Times: CPU time = 0 ms, elapsed time = 73 ms.

When @P1 = 7241284 and @P2 = 7240285, the cost is as follows:

Table 'xxxdetail'. Scan count 1468817, logical reads 35838994, physical reads 1, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
Table 'xxx'. Scan count 1, logical reads 5983509, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
SQL Server Execution Times: CPU time = 45926 ms, elapsed time = 56816 ms.

Paging that deep into the table - I am speechless!!!
And since I am speechless I will not explain any further; anything more would just be tears. Just look at the numbers.

-- ============================================================

