The classic Top K algorithm in Spark programming


Tags: spark, topk, algorithm

The Top K algorithm has two steps: first, count the frequency of each word; second, find the k words with the highest frequency.

1. Example description

Assume k = 1, with the following input and output.

Input:

hello world bye world
hello hadoop bye hadoop
bye hadoop hello hadoop

Output:

Word: hadoop    Frequency: 4

2. Design ideas

First run a WordCount to turn the data into (word, frequency) pairs. The second phase follows a divide-and-conquer idea: find the top k within each partition of the RDD, then merge the per-partition top k results into a new collection and compute the top k of that collection. Each partition is held on a single machine, so a single-machine method can be used to find its top k. This example uses a heap; you could also simply maintain an array of k elements. Interested readers can consult other material on heap implementations (a hedged sketch of one possible implementation, and a built-in alternative, follow section 4 below).

3. Code example

Sample code for the Top K algorithm is as follows (the bodies of putToHeap and getHeap are left elided, as in the original):

import org.apache.spark.SparkContext
import org.apache.spark.SparkContext._

object TopK {

  def main(args: Array[String]) {
    /* Run WordCount to count word frequencies */
    val spark = new SparkContext("local", "TopK",
      System.getenv("SPARK_HOME"), SparkContext.jarOfClass(this.getClass))
    val count = spark.textFile("data")
      .flatMap(line => line.split(" "))
      .map(word => (word, 1))
      .reduceByKey(_ + _)

    /* Find the top k within each partition of the RDD */
    val topk = count.mapPartitions(iter => {
      while (iter.hasNext) {
        putToHeap(iter.next())
      }
      getHeap().iterator
    }).collect()

    /* Merge the per-partition top k results into a new collection
       and compute the overall top k */
    val iter = topk.iterator
    while (iter.hasNext) {
      putToHeap(iter.next())
    }
    val outIter = getHeap().iterator

    /* Print the top k values */
    println("topk value:")
    while (outIter.hasNext) {
      val (word, freq) = outIter.next() // read each pair only once
      println("\nword: " + word + "    frequency: " + freq)
    }
    spark.stop()
  }

  def putToHeap(item: (String, Int)) {
    /* Insert the item into a heap holding k elements */
    ...
  }

  def getHeap(): Array[(String, Int)] = {
    /* Return the elements of the heap holding k elements */
    ...
  }
}

4. Scenario

The Top K model applies to scenarios such as finding the queries that consumed the most time over a past period, the most frequently accessed IP addresses, and the most recent, most updated, or most frequently forwarded microblog posts.
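The article leaves the bodies of putToHeap and getHeap elided. Purely as an illustration, here is a minimal sketch of one way they might look, using scala.collection.mutable.PriorityQueue as a bounded min-heap; the object name TopKHeap and the fixed k are assumptions of this sketch, not part of the original code:

import scala.collection.mutable

object TopKHeap {
  val k = 1 // number of top words to keep (assumed fixed here)

  // Min-heap on frequency: the reversed ordering keeps the smallest
  // frequency of the current top k at the head, ready to be evicted.
  private val heap = mutable.PriorityQueue.empty[(String, Int)](
    Ordering.by[(String, Int), Int](_._2).reverse)

  def putToHeap(item: (String, Int)): Unit = {
    if (heap.size < k) {
      heap.enqueue(item)
    } else if (item._2 > heap.head._2) {
      heap.dequeue() // evict the entry with the smallest frequency
      heap.enqueue(item)
    }
  }

  def getHeap(): Array[(String, Int)] = heap.toArray
}

Each insertion costs O(log k), so scanning a partition of n pairs costs O(n log k) time with O(k) memory. Note that the article's code shares one heap between the per-partition phase and the driver-side merge phase; a production version would build a fresh heap inside mapPartitions and another for the merge, rather than relying on shared mutable state.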
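For comparison, and not part of the original article: Spark's RDD API ships a built-in top(num) action that internally follows essentially this divide-and-conquer strategy, reducing each partition and merging the results on the driver. Assuming the count RDD from the example above and a value k, the whole second phase collapses to one call, where Ordering.by(_._2) ranks pairs by frequency:

val k = 1
val topkPairs: Array[(String, Int)] = count.top(k)(Ordering.by(_._2))
topkPairs.foreach { case (word, freq) =>
  println("word: " + word + "    frequency: " + freq)
}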


This article is from the "Star Moon Love" blog; please be sure to keep this source: http://xuegodxingyue.blog.51cto.com/5989753/1949780
