Recently, I encountered a sudden demand: Grouping A data table, and then retrieving the first n data entries in each group sorted by certain rules. At first glance, this is an ordinary query, with no difficulty. I can write it in a pen. After all, I got stuck for a while. It was a classic topic in the SQL field, and I had never thought about it. Today, I will get several method ColumnsOutputTo discuss with the public.
The structure of the example table is described before the body.
TableSectiontransactionlog, a log table used to record activities of each department
Sectionid, department ID
Sectiontransactiontype, Activity Type
Totaltransactionvalue, Activity Cost
Transactiondate, activity time
We set the scenario to: select the last two activities of each department (sectionid.
The data in the sectiontransactionlog table used by the author exceeds 3,000,000.
1. nested subqueries
1
1 Select * From Sectiontransactionlog mlog 2 Where 3 ( Select Count (* ) From Sectiontransactionlog sublog 4 Where Sublog. sectionid = Mlog. sectionid And Sublog. transactiondate > = Mlog. transactiondate) <= 2 5 Order By Sectionid, transactiondate Desc
Running time: 34 seconds
This method is simple. It only determines in the subquery whether the record is one of the two most recent records in its section.
2
1 Select * From Sectiontransactionlog mlog 2 Where Mlog. ID In 3 ( Select Top 2 ID 4 From Sectiontransactionlog sublog 5 Where Sublog. sectionid = Mlog. sectionid 6 Order By Transactiondate Desc ) 7 Order By Sectionid, transactiondate Desc
Run time: 1 minute 25 seconds
In the subquery, use transactiondate for sorting, and the value is Top 2. Then, the in keyword is used to determine whether the record matches the subquery.
Ii. Self-join Mode
1 Select Mlog. * From Sectiontransactionlog mlog 2 Inner Join 3 ( Select Rankleft. ID, Count ( * ) As RanknumFrom Sectiontransactionlog rankleft 4 Inner Join Sectiontransactionlog rankright 5 On Rankleft. sectionid = Rankright. sectionid And Rankleft. transactiondate <= Rankright. transactiondate 6 Group By Rankleft. ID 7 Having Count ( * ) <= 2 ) Sublog On Mlog. ID = Sublog. ID 8 Order By Mlog. sectionid, mlog. transactiondate Desc
Running time: 56 seconds
This implementation method is more clever, but it is a little more complicated than the previous method. The sublog part constructed based on the sectiontransactionlog table self-join is calculated for each activity (identified by ID) to calculate its sort ranknum within the Section (by time transactiondate ).
In the self-join condition rankleft. sectionid = rankright. sectionid and rankleft. transactiondate <= rankright. under the filtering of transactiondate, for an activity (identified by ID) in the query results, it is connected only when it is in a section and later than or at the same time (including its own, of course ). Id = 1:
As you can see from this, the Count calculation based on this result is the ranknum of the id = 1 activity in section 9022.
Then having count (*) <= 2 selects the rows within 2, and then joins the SELECT statement to obtain the required information.
3. Apply row_number () (SQL Server 2005 and later)
1 Select * From 2 ( 3 Select * , Row_number () Over (Partition By Sectionid Order By Transactiondate Desc ) As Rownum 4 From Sectiontransactionlog 5 ) Ranked 6 Where Ranked. rownum <= 2 7 Order By Ranked. sectionid, ranked. transactiondate Desc
Running time: 20 seconds
This is the most efficient implementation method up to now. Row_number () over (partition by sectionid order by transactiondate DESC) completes the entire process of grouping, sorting, and row number retrieval.
Efficiency
Next we will make an efficiency statistics on the above four methods.
Method |
Time consumed (seconds) |
Ranking |
Apply row_number () |
20 |
1 |
Nested subquery method 1 |
34 |
2 |
Self-join Mode |
56 |
3 |
Nested subquery method 2 |
85 |
4 |
In the four methods, nested subqueries take the longest time. Where is the efficiency loss? Is it because the in keyword is used? Execute Plan ):
We can see that the optimizer resolves in to left semi join, and its loss is extremely low. Most of the performance of this query is consumed at the order by (Top N sort) of the subquery ). If the order by transactiondate DESC clause in the subquery is deleted (of course, the result is incorrect), it takes only 8 seconds.
Adding a valid index can improve the performance of this query method.