Top K

The Top K algorithm has two steps: first count the word frequencies, then find the k words with the highest frequency.

1. Example description

Assume k = 1 (top 1), with the following input and output.

Input:

    hello world bye world
    hello hadoop bye hadoop
    bye hadoop hello hadoop

Output:

    Word: hadoop, frequency: 4

2. Design ideas

First, run WordCount to count the frequencies, turning the data into (word, frequency) pairs. The second phase uses divide and conquer: find the top k within each partition of the RDD, merge the per-partition top-k results into a new collection, and then compute the top k of that collection. Each partition is stored on a single machine, so a single-machine method can be used to find its top k. This example uses a heap. You can also directly maintain an array of k elements; interested readers can consult other references to understand how the heap is implemented.

3. Code example

The Top K sample code is as follows (the bodies of putToHeap and getHeap are elided):

    import org.apache.spark.SparkContext
    import org.apache.spark.SparkContext._

    object TopK {
      def main(args: Array[String]): Unit = {
        /* Run WordCount to count the frequency of each word */
        val spark = new SparkContext("local", "TopK",
          System.getenv("SPARK_HOME"), SparkContext.jarOfClass(this.getClass))
        val count = spark.textFile("data")
          .flatMap(line => line.split(" "))
          .map(word => (word, 1))
          .reduceByKey(_ + _)

        /* Compute the top k within each partition of the RDD */
        val topk = count.mapPartitions(iter => {
          while (iter.hasNext) {
            putToHeap(iter.next())
          }
          getHeap().iterator
        }).collect()

        /* Merge the per-partition top-k results into a new collection
           and compute the overall top k */
        val iter = topk.iterator
        while (iter.hasNext) {
          putToHeap(iter.next())
        }
        val outIter = getHeap().iterator

        /* Output the top-k values. next() is called once per iteration;
           _1 is the word and _2 is its frequency. */
        println("topk value:")
        while (outIter.hasNext) {
          val (word, frequency) = outIter.next()
          println("\n word: " + word + " frequency: " + frequency)
        }
        spark.stop()
      }

      def putToHeap(iter: (String, Int)) {
        /* Put the data into a heap containing k elements */
        ...
      }

      def getHeap(): Array[(String, Int)] = {
        /* Get the elements in the heap containing k elements */
        val a = new Array[(String, Int)]()
        ...
      }
    }

4. Application scenarios

The Top K model can be applied to finding the applications that consumed the most time over a past period, the most frequently accessed IP addresses, and the most recent, most frequently updated tweets.
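The heap step from the design ideas is left elided in the sample, so here is a minimal sketch of one way it could look, using plain Scala collections and no Spark dependency. The names TopKSketch, topK and mergeTopK are hypothetical and are not the author's putToHeap/getHeap. The sketch assumes the (word, frequency) pairs have already been produced by the WordCount phase, and it models partitions as ordinary sequences.

```scala
import scala.collection.mutable

object TopKSketch {
  /* Keep the k pairs with the highest frequency using a bounded min-heap. */
  def topK(pairs: Iterator[(String, Int)], k: Int): List[(String, Int)] = {
    // Reversing the ordering on frequency turns PriorityQueue into a min-heap:
    // dequeue() removes the pair with the smallest frequency currently kept.
    val minFirst = Ordering.by[(String, Int), Int](_._2).reverse
    val heap = mutable.PriorityQueue.empty[(String, Int)](minFirst)
    pairs.foreach { pair =>
      heap.enqueue(pair)
      if (heap.size > k) heap.dequeue() // evict the current minimum
    }
    // Return the survivors sorted by descending frequency.
    heap.toList.sortBy(p => -p._2)
  }

  /* Divide and conquer: top k per partition, then top k of the merged results. */
  def mergeTopK(partitions: Seq[Seq[(String, Int)]], k: Int): List[(String, Int)] = {
    val localTops = partitions.flatMap(p => topK(p.iterator, k))
    topK(localTops.iterator, k)
  }

  def main(args: Array[String]): Unit = {
    // Word counts from the example input, split into two made-up partitions.
    val partitions = Seq(
      Seq("hello" -> 3, "world" -> 2),
      Seq("bye" -> 3, "hadoop" -> 4)
    )
    mergeTopK(partitions, 1).foreach { case (word, frequency) =>
      println("word: " + word + ", frequency: " + frequency)
    }
    // prints "word: hadoop, frequency: 4"
  }
}
```

Because topK takes an Iterator and keeps its heap local to the call, a function of this shape could in principle be passed to mapPartitions (for example, count.mapPartitions(it => topK(it, k).iterator)) without sharing mutable state between partitions and the driver. Ties in frequency are broken arbitrarily in this sketch.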
This article is from the "Star Moon Love" blog; please be sure to keep this source attribution: http://xuegodxingyue.blog.51cto.com/5989753/1949780
Top K, a classic algorithm in Spark programming