Top N After grouping SQL Server

Source: Internet
Author: User

Recently, I encountered a sudden demand: Grouping A data table, and then retrieving the first n data entries in each group sorted by certain rules. At first glance, this is an ordinary query, with no difficulty. I can write it in a pen. After all, I got stuck for a while. It was a classic topic in the SQL field, and I had never thought about it. Today, I will get several method ColumnsOutputTo discuss with the public.

The structure of the example table is described before the body.

TableSectiontransactionlog, a log table used to record activities of each department
Sectionid, department ID
Sectiontransactiontype, Activity Type
Totaltransactionvalue, Activity Cost
Transactiondate, activity time

We set the scenario to: select the last two activities of each department (sectionid.

The data in the sectiontransactionlog table used by the author exceeds 3,000,000.

1. nested subqueries

1

 1   Select   *   From  Sectiontransactionlog mlog  2   Where   3 ( Select   Count (* ) From  Sectiontransactionlog sublog  4       Where Sublog. sectionid = Mlog. sectionid And Sublog. transactiondate > = Mlog. transactiondate) <=  2  5   Order   By Sectionid, transactiondate Desc 

Running time: 34 seconds

This method is simple. It only determines in the subquery whether the record is one of the two most recent records in its section.

2

 1   Select   *   From  Sectiontransactionlog mlog  2   Where Mlog. ID In  3 ( Select   Top   2  ID  4      From  Sectiontransactionlog sublog  5       Where Sublog. sectionid =  Mlog. sectionid  6       Order   By Transactiondate Desc  )  7   Order   By Sectionid, transactiondate Desc 

Run time: 1 minute 25 seconds

In the subquery, use transactiondate for sorting, and the value is Top 2. Then, the in keyword is used to determine whether the record matches the subquery.

Ii. Self-join Mode

 1   Select Mlog. *   From  Sectiontransactionlog mlog  2   Inner   Join  3 ( Select Rankleft. ID, Count ( * ) As RanknumFrom  Sectiontransactionlog rankleft  4   Inner   Join  Sectiontransactionlog rankright  5   On Rankleft. sectionid = Rankright. sectionid And Rankleft. transactiondate <=  Rankright. transactiondate  6   Group  By  Rankleft. ID  7   Having   Count ( * ) <=   2 ) Sublog On Mlog. ID =  Sublog. ID  8   Order   By Mlog. sectionid, mlog. transactiondate Desc 

Running time: 56 seconds

This implementation method is more clever, but it is a little more complicated than the previous method. The sublog part constructed based on the sectiontransactionlog table self-join is calculated for each activity (identified by ID) to calculate its sort ranknum within the Section (by time transactiondate ).

In the self-join condition rankleft. sectionid = rankright. sectionid and rankleft. transactiondate <= rankright. under the filtering of transactiondate, for an activity (identified by ID) in the query results, it is connected only when it is in a section and later than or at the same time (including its own, of course ). Id = 1:

As you can see from this, the Count calculation based on this result is the ranknum of the id = 1 activity in section 9022.

Then having count (*) <= 2 selects the rows within 2, and then joins the SELECT statement to obtain the required information.

3. Apply row_number () (SQL Server 2005 and later)

 1   Select   *   From  2   ( 3   Select   * , Row_number () Over (Partition By Sectionid Order   By Transactiondate Desc ) As  Rownum  4   From  Sectiontransactionlog  5   ) Ranked 6   Where Ranked. rownum <=   2  7   Order   By Ranked. sectionid, ranked. transactiondate Desc 

Running time: 20 seconds

This is the most efficient implementation method up to now. Row_number () over (partition by sectionid order by transactiondate DESC) completes the entire process of grouping, sorting, and row number retrieval.

 

Efficiency

Next we will make an efficiency statistics on the above four methods.

Method Time consumed (seconds) Ranking
Apply row_number () 20 1
Nested subquery method 1 34 2
Self-join Mode 56 3
Nested subquery method 2 85 4

In the four methods, nested subqueries take the longest time. Where is the efficiency loss? Is it because the in keyword is used? Execute Plan ):

 

We can see that the optimizer resolves in to left semi join, and its loss is extremely low. Most of the performance of this query is consumed at the order by (Top N sort) of the subquery ). If the order by transactiondate DESC clause in the subquery is deleted (of course, the result is incorrect), it takes only 8 seconds.

Adding a valid index can improve the performance of this query method.

Related Article

Contact Us

The content source of this page is from Internet, which doesn't represent Alibaba Cloud's opinion; products and services mentioned on that page don't have any relationship with Alibaba Cloud. If the content of the page makes you feel confusing, please write us an email, we will handle the problem within 5 days after receiving your email.

If you find any instances of plagiarism from the community, please send an email to: info-contact@alibabacloud.com and provide relevant evidence. A staff member will contact you within 5 working days.

A Free Trial That Lets You Build Big!

Start building with 50+ products and up to 12 months usage for Elastic Compute Service

  • Sales Support

    1 on 1 presale consultation

  • After-Sales Support

    24/7 Technical Support 6 Free Tickets per Quarter Faster Response

  • Alibaba Cloud offers highly flexible support services tailored to meet your exact needs.