[Algorithm Summary: Top K] Heaps: Finding the k Smallest (or Largest) Elements

Last updated: 2018-12-04. Source: Internet


Contents

1. The quicksort approach:
2. Heaps.

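Top K is a classic problem.

It is stated as follows: given n integers, output the k smallest of them. For example, for the input 1, 2, 3, 4, 5, 6, 7, 8, the 4 smallest elements are 1, 2, 3, 4.

The name "top K" also covers a related family of problems that come up often: finding the K most frequent items in a massive data set, or finding the K largest values in a massive data set. Examples include a search engine reporting its 10 most popular queries, or a music library reporting its 10 most downloaded songs.

For top K problems of the first kind, two ideas usually come to mind: quicksort and heaps. Why these two? The reasons are: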

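1. The quicksort approach:

Given a pivot element, the partition step splits the array into two parts around it. How does that help with top K? Partition returns the pivot's index, which immediately tells us how many elements lie in front of the pivot. Compare that count with K and recurse into the side that must contain the boundary; the recursion ends with a partition whose front part holds exactly k elements, and those are the k smallest.

An implementation of this idea can be found at: http://blog.csdn.net/ohmygirl/article/details/7846544 (using quicksort's partition to find the K-th element of an array).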
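The partition idea above can be sketched roughly as follows. This is a minimal illustration, not the code from the linked post: the names `swap_int`, `partition` and `smallest_k_by_partition` are made up for this example, the pivot is simply the last element of the current range, and the loop is iterative rather than recursive. With this pivot choice the running time is linear on average but quadratic in the worst case.

```c
#include <assert.h>
#include <stddef.h>

/* Exchange two ints. */
static void swap_int(int *a, int *b) {
    int t = *a;
    *a = *b;
    *b = t;
}

/* Lomuto partition of a[lo..hi] around the pivot a[hi].
 * Returns the pivot's final index p: every element of a[lo..p-1] is
 * smaller than a[p], and every element of a[p+1..hi] is >= a[p]. */
static size_t partition(int *a, size_t lo, size_t hi) {
    int pivot = a[hi];
    size_t store = lo;
    for (size_t i = lo; i < hi; i++) {
        if (a[i] < pivot) {
            swap_int(&a[i], &a[store]);
            store++;
        }
    }
    swap_int(&a[store], &a[hi]);
    return store;
}

/* Rearrange a[0..n-1] in place so that a[0..k-1] holds the k smallest
 * values, in no particular order. Requires 1 <= k <= n. */
void smallest_k_by_partition(int *a, size_t n, size_t k) {
    assert(a != NULL && k >= 1 && k <= n);
    size_t lo = 0, hi = n - 1;
    while (lo < hi) {
        size_t p = partition(a, lo, hi);
        if (p == k - 1) {
            return;          /* exactly k elements sit at or before the pivot */
        } else if (p < k - 1) {
            lo = p + 1;      /* too few in front: continue in the right part */
        } else {
            hi = p - 1;      /* too many in front: continue in the left part */
        }
    }
}
```

Calling `smallest_k_by_partition(a, 8, 4)` on an array holding the values 1 to 8 in any order leaves 1, 2, 3, 4 in the first four cells, though not necessarily sorted.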

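2. Heaps.

A heap is a complete binary tree whose nodes satisfy the heap ordering property, and it gives good solutions to two kinds of problems: a. Sorting: because a heap is a complete binary tree, sorting an n-element array with a heap takes at most O(n lg n) time and only a constant amount of extra space. b. Priority queues: inserting a new element and re-adjusting the structure to maintain the heap property each take O(lg n) time.

A common implementation stores the elements in an array and leaves cell 0 unused, so n elements occupy cells 1 to n and the array needs n + 1 cells. Number the elements level by level, from top to bottom and from left to right. Then, for the element numbered i: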

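    a: if the left child exists, its number is 2*i
    b: if the right child exists, its number is 2*i + 1
    c: if the parent exists, its number is i/2 (integer division)
    d: a node is a leaf if it has neither a left nor a right child; a number i refers to no node (an empty node) if i < 1 or i > n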
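The numbering rules above translate directly into index arithmetic. The helper names below (`heap_left`, `heap_right`, `heap_parent`, `heap_exists`, `heap_is_leaf`) are hypothetical and chosen only for this sketch; they assume the 1-based layout described in the text, with `n` the number of elements in the heap.

```c
#include <assert.h>
#include <stddef.h>

/* Index arithmetic for a binary heap stored in heap[1..n] (cell 0 unused). */
static inline size_t heap_left(size_t i) {
    assert(i >= 1);
    return 2 * i;
}

static inline size_t heap_right(size_t i) {
    assert(i >= 1);
    return 2 * i + 1;
}

/* For the root (i == 1) this returns 0, which is not a valid node. */
static inline size_t heap_parent(size_t i) {
    assert(i >= 1);
    return i / 2;
}

/* A number refers to a real node only if it lies in 1..n. */
static inline int heap_exists(size_t i, size_t n) {
    return i >= 1 && i <= n;
}

/* In a complete tree a node without a left child has no right child
 * either, so checking the left child is enough. */
static inline int heap_is_leaf(size_t i, size_t n) {
    return heap_exists(i, n) && !heap_exists(heap_left(i), n);
}
```

For a heap with n = 7 elements, node 3 has children 6 and 7, while node 4 is a leaf because its left child would be number 8.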

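The heap is a convenient structure for top K. Set up a heap of size K (use a min-heap to find the K largest elements, and a max-heap to find the K smallest), then scan the array and compare each element with the root of the heap. An element that qualifies goes into the heap, taking the root's place once the heap is full, and the heap is adjusted so that it satisfies the heap property again. When the scan finishes, the elements left in the heap are the answer. Adjusting a heap relies on two operations:

downward adjustment (shiftDown) and upward adjustment (shiftUp).

Taking a min-heap as the example:

The code for upward adjustment is: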

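    /* Exchange the two ints that a and b point to. */
    void swap(int *a, int *b) {
        int t = *a;
        *a = *b;
        *b = t;
    }

    /* Min-heap stored in heap[1..n] (heap[0] unused). heap[n] has just been
       added; move it up until its parent is no larger than it. */
    void shiftUp(int *heap, int n) {
        int i = n;
        for (;;) {
            if (i == 1) {                 /* reached the root */
                break;
            }
            int p = i / 2;                /* parent of i */
            if (heap[p] <= heap[i]) {     /* heap property holds */
                break;
            }
            swap(&heap[p], &heap[i]);
            i = p;
        }
    }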

The code for downward adjustment is:

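    /* Min-heap stored in heap[1..n] (heap[0] unused). heap[1] has just been
       replaced; move it down, always swapping with the smaller child, until
       neither child is smaller. swap(a, b) exchanges the two ints pointed to. */
    void shiftDown(int *heap, int n) {
        int i = 1;
        for (;;) {
            int c = 2 * i;                /* left child of i */
            if (c > n) {                  /* i is a leaf */
                break;
            }
            if (c + 1 <= n) {             /* a right child exists */
                if (heap[c + 1] <= heap[c]) {
                    c++;                  /* pick the smaller child */
                }
            }
            if (heap[i] <= heap[c]) {     /* heap property holds */
                break;
            }
            swap(&heap[c], &heap[i]);
            i = c;
        }
    }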
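To show how the two adjustments support the priority-queue use mentioned earlier, here is a small sketch of insert and extract-min on a min-heap. The type `MinPQ`, the constant `PQ_CAPACITY` and the functions `pq_insert` and `pq_extract_min` are assumptions made for this illustration and do not come from the original post; the upward and downward loops are inlined so the block stands on its own.

```c
#include <assert.h>

/* Fixed-capacity min-heap of ints stored in data[1..size]; data[0] is unused. */
#define PQ_CAPACITY 64

typedef struct {
    int data[PQ_CAPACITY + 1];
    int size;
} MinPQ;

static void pq_swap(int *a, int *b) {
    int t = *a;
    *a = *b;
    *b = t;
}

/* Insert: append x as the last leaf, then adjust upward while it is
 * smaller than its parent. */
void pq_insert(MinPQ *pq, int x) {
    assert(pq->size < PQ_CAPACITY);
    int i = ++pq->size;
    pq->data[i] = x;
    while (i > 1 && pq->data[i / 2] > pq->data[i]) {
        pq_swap(&pq->data[i / 2], &pq->data[i]);
        i /= 2;
    }
}

/* Extract-min: take the root, move the last leaf into its place, then
 * adjust downward, always swapping with the smaller child. */
int pq_extract_min(MinPQ *pq) {
    assert(pq->size > 0);
    int min = pq->data[1];
    pq->data[1] = pq->data[pq->size--];
    int i = 1;
    for (;;) {
        int c = 2 * i;
        if (c > pq->size) break;                         /* i is a leaf */
        if (c + 1 <= pq->size && pq->data[c + 1] < pq->data[c]) c++;
        if (pq->data[i] <= pq->data[c]) break;           /* heap property holds */
        pq_swap(&pq->data[c], &pq->data[i]);
        i = c;
    }
    return min;
}
```

Both operations walk a single root-to-leaf path, which is where the O(lg n) cost per operation comes from. Inserting n values and extracting them all returns them in ascending order, which is heap sort in its simplest form.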

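With these basic heap operations in place, we have a foundation for top K (the problem can of course be solved without a heap at all). Take the K smallest elements as the example, which calls for a max-heap of size k. The procedure is: scan the original array, put its first K elements into the heap, and adjust so that the heap property holds. For every element after the first k, if it is smaller than the element at the top of the heap, replace the top with it and adjust the heap. When the scan of the array is complete, the elements stored in the heap are the answer.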
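The procedure just described might look like the sketch below. The names `max_sift_down` and `smallest_k_by_heap` are invented for this example, and the caller is assumed to supply a scratch array `heap` with at least k + 1 cells, following the 1-based layout used earlier. Note that this needs a max-heap, so the comparisons are the mirror image of the min-heap routines shown above.

```c
#include <assert.h>
#include <stddef.h>

/* Restore the max-heap property in heap[1..n], starting from node i and
 * moving down, always swapping with the larger child. */
static void max_sift_down(int *heap, size_t n, size_t i) {
    for (;;) {
        size_t c = 2 * i;
        if (c > n) break;                              /* i is a leaf */
        if (c + 1 <= n && heap[c + 1] > heap[c]) c++;  /* pick the larger child */
        if (heap[i] >= heap[c]) break;                 /* heap property holds */
        int t = heap[i];
        heap[i] = heap[c];
        heap[c] = t;
        i = c;
    }
}

/* Collect the k smallest values of a[0..n-1] into heap[1..k].
 * heap must have room for k + 1 ints; heap[0] is left unused.
 * The result is in heap order, not sorted. Requires 1 <= k <= n. */
void smallest_k_by_heap(const int *a, size_t n, int *heap, size_t k) {
    assert(a != NULL && heap != NULL && k >= 1 && k <= n);

    /* Step 1: the first k elements form the initial heap. */
    for (size_t i = 0; i < k; i++) {
        heap[i + 1] = a[i];
    }
    for (size_t i = k / 2; i >= 1; i--) {
        max_sift_down(heap, k, i);
    }

    /* Step 2: an element smaller than the current maximum replaces it. */
    for (size_t i = k; i < n; i++) {
        if (a[i] < heap[1]) {
            heap[1] = a[i];
            max_sift_down(heap, k, 1);
        }
    }
}
```

After the call, `heap[1]` is the largest of the k smallest values, that is, the k-th smallest element overall. The scan touches each input element once and does at most one O(lg k) adjustment per element, so only k values ever need to be held in memory.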

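Going a step further: how should top K be handled for massive data? The heap algorithm still works, since only K elements have to be kept in memory at a time. Are there other approaches?

On processing massive data, I recommend July's blog: http://blog.csdn.net/v_july_v/article/details/7382693 ("How to crack 99% of massive-data processing questions").

Another blog worth reading: http://dongxicheng.org/big-data/select-ten-from-billions/

I have been studying Hadoop recently, so my thought is to solve top K with Hadoop MapReduce. It may well be more efficient for massive data, since Hadoop is strong at large-scale data processing and parallel computation.

The MapReduce approach is also simple. In code, you only need to define a job class and, inside it, static Mapper and Reducer classes. Each mapper keeps the K largest values it has seen and emits them when it finishes; the reducer merges these candidates into the final K.

Here is a reposted MapReduce top K implementation (the code is untested; original at http://www.linuxidc.com/Linux/2012-05/60234.htm):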

    package jtlyuan.csdn;

    import java.io.IOException;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.conf.Configured;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
    import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;
    import org.apache.hadoop.util.Tool;
    import org.apache.hadoop.util.ToolRunner;

    // Use MapReduce to find the K largest numbers in a massive data set
    public class TopKNum extends Configured implements Tool {

        public static class MapClass extends Mapper<LongWritable, Text, IntWritable, IntWritable> {
            public static final int K = 100;
            // The K largest values seen so far, in ascending order (top[0] is the smallest).
            // The array starts as all zeros, so values <= 0 are never kept.
            private int[] top = new int[K];

            @Override
            public void map(LongWritable key, Text value, Context context)
                    throws IOException, InterruptedException {
                String[] str = value.toString().split(",", -2);
                if (str.length <= 8) {
                    return; // the number is read from the 9th comma-separated field
                }
                try { // ignore non-numeric fields
                    int temp = Integer.parseInt(str[8]);
                    add(temp);
                } catch (NumberFormatException e) {
                    // skip this record
                }
            }

            private void add(int temp) { // insert into the sorted array, dropping the smallest
                if (temp > top[0]) {
                    top[0] = temp;
                    int i = 0;
                    for (; i < K - 1 && temp > top[i + 1]; i++) {
                        top[i] = top[i + 1];
                    }
                    top[i] = temp;
                }
            }

            @Override
            protected void cleanup(Context context) throws IOException, InterruptedException {
                // emit this mapper's local top K once all of its input has been seen
                for (int i = 0; i < K; i++) {
                    context.write(new IntWritable(top[i]), new IntWritable(top[i]));
                }
            }
        }

        public static class Reduce extends Reducer<IntWritable, IntWritable, IntWritable, IntWritable> {
            public static final int K = 100;
            private int[] top = new int[K];

            @Override
            public void reduce(IntWritable key, Iterable<IntWritable> values, Context context)
                    throws IOException, InterruptedException {
                for (IntWritable val : values) {
                    add(val.get());
                }
            }

            private void add(int temp) { // insert into the sorted array, dropping the smallest
                if (temp > top[0]) {
                    top[0] = temp;
                    int i = 0;
                    for (; i < K - 1 && temp > top[i + 1]; i++) {
                        top[i] = top[i + 1];
                    }
                    top[i] = temp;
                }
            }

            @Override
            protected void cleanup(Context context) throws IOException, InterruptedException {
                for (int i = 0; i < K; i++) {
                    context.write(new IntWritable(top[i]), new IntWritable(top[i]));
                }
            }
        }

        @Override
        public int run(String[] args) throws Exception {
            Configuration conf = getConf();
            // Deprecated since Hadoop 2, where Job.getInstance(conf, "TopKNum") replaces it
            Job job = new Job(conf, "TopKNum");
            job.setJarByClass(TopKNum.class);
            FileInputFormat.setInputPaths(job, new Path(args[0]));
            FileOutputFormat.setOutputPath(job, new Path(args[1]));
            job.setMapperClass(MapClass.class);
            job.setCombinerClass(Reduce.class);
            job.setReducerClass(Reduce.class);
            job.setNumReduceTasks(1); // a single reducer, so the output is the global top K
            job.setInputFormatClass(TextInputFormat.class);
            job.setOutputFormatClass(TextOutputFormat.class);
            job.setOutputKeyClass(IntWritable.class);
            job.setOutputValueClass(IntWritable.class);
            return job.waitForCompletion(true) ? 0 : 1;
        }

        public static void main(String[] args) throws Exception {
            int res = ToolRunner.run(new Configuration(), new TopKNum(), args);
            System.exit(res);
        }
    }

    /*
     * Part of the output, as listed in the original post:
     * 306 306
     * 307 307
     * 309 309
     * 313 313
     * 320 320
     * 346 346
     * 348 348
     * 393 393
     * 394 394
     * 472 472
     * 642 642
     * 706 706
     * 868 868
     */

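At this point we have a new approach to processing massive data: MapReduce + Hadoop.

Once again, I have to say that MapReduce really is a powerful tool for massive data processing.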
