SQL Server Group after top N (GO)
Recently, the work of Tuyu a demand: To Group a data table, and then take out each group in accordance with a certain sequence of the first n data. At first glance, this is an ordinary inquiry, no difficulty. Can write a pen, after all, is trapped in the author for a while. Think about it, check the network, and never thought it was a classic topic in the SQL world. Today, I would like to make a list of some of the methods I have come up with in order to discuss with them.
The sample table structure is described before the text.
table Sectiontransactionlog, a log table used to record various activities of each department SectionID, Department ID sectiontransactiontype, activity type Totaltransactionvalue, activity spending transactiondate, event time
We set the scene to select the last two events held by each department (SectionID).
I used to test the data in the Sectiontransactionlog table is over 3,000,000.
One, nested subqueries query mode
1
1SELECT*From Sectiontransactionlog mlog2 where 3 (select count (from Sectiontransactionlog sublog4 where Sublog.sectionid Span style= "color: #808080;" >= mlog.sectionid and sublog.transactiondate >= mlog.transactiondate) <=2< Span style= "color: #008080;" >5 order by SectionID, transactiondate desc
Running time: 34 seconds
The principle of this method is relatively simple, only in the subquery to determine whether the record is one of the newly occurred in its section 2.
2
1SELECT*FromSectiontransactionlog Mlog2where mlog.idInch3 (select top 2 Id 4 from Sectiontransactionlog sublog5 where Sublog.sectionid = Mlog.sectionid6 Span style= "color: #0000ff;" >order by transactiondate desc ) 7 order by SectionID, transactiondate desc
Running time: 1 minutes 25 seconds
Use the TransactionDate sort in the subquery to take top 2. and apply the In keyword to determine whether the record conforms to the subquery.
Second, the self-connection mode
1Select Mlog.*FromSectiontransactionlog Mlog2InnerJoin3 (SELECT Rankleft.id,COUNT (*)As RanknumFromSectiontransactionlog Rankleft4InnerJoinSectiontransactionlog Rankright5On Rankleft.sectionid= Rankright.sectionidand rankleft.transactiondate <= Rankright.transactiondate6 group by Rankleft.id7 Having count (*) <= Span style= "color: #800000; Font-weight:bold; " >2) sublog on mlog.id =8 order by Mlog.sectionid, mlog.transactiondate desc
Running time: 56 seconds
This is a clever implementation, but it is slightly more complex than the previous method. The Sublog section, which is based on the Sectiontransactionlog table self-join, calculates the sort ranknum (by Time transactiondate) of each activity (identified by ID).
In the self-join condition Rankleft.sectionid = Rankright.sectionid and Rankleft.transactiondate <= Rankright.transactiondate filter, the query results are associated with an activity (identified by ID) with only one section and later or at the same time as the activity (including itself, of course). Active self-joins for id=1:
It can be seen from a glance that the count calculation based on this result is the Id=1 activity in section 9022 of the row ranknum.
Then have COUNT (*) <= 2 To select the row within 2, and then do a join select the required information.
Third, apply Row_number () (SQL SERVER 2005 and later)
1Select*From2(3Select*, Row_number ()over (Partition by SectionID order by transactiondate desc) as Rownum4 from Sectiontransactionlog5 ) Ranked6 where Ranked.rownum <= 27 order by ranked. SectionID, ranked. TransactionDate desc
Running time: 20 seconds
This is by far the most efficient way to achieve this. Row_number () over (partition by SectionID ORDER BY transactiondate Desc) completes the entire process of grouping, sorting, and fetching line numbers.
Efficiency considerations
Below we make an efficient statistic on the above 4 methods.
Method |
Time-consuming (seconds) |
Ranking |
Application Row_number () |
20 |
1 |
Nested subqueries 1 |
34 |
2 |
Self-Join mode |
56 |
3 |
Nested subqueries 2 |
85 |
4 |
Of the 4 methods, the nesting of subqueries 2 is the longest, and where is the efficiency loss? Did you really use the IN keyword? For its execution plan (execute plan):
, we can see that the optimizer will parse in to the left Semi Join, and its loss is very low. The majority of the query's performance is consumed by the subquery's order by (Top N Sort). Sure enough, if you delete the ORDER BY transactiondate DESC clause in the subquery (which, of course, is incorrect), it takes only 8 seconds.
Adding a valid index can improve the performance of the Query method.
SQL Server is grouped and then takes top N