I have been learning MapReduce programming for a while now, and as someone who looks to programming for confidence and fun, my hands were itching to try something of my own. So I wrote a TopK program. TopK means finding the K words with the highest frequency in the original file. Analyzing the problem first gives us a lead: to rank the top K words by frequency, don't we first have to count the frequency of every word? That brings to mind a classic example: WordCount. Yes, that is exactly it; counting the occurrences of each word in the original file is its job.
Once the word frequencies have been counted, the next task is to pick out the K words with the highest counts. How? The WordCount result contains the frequency of every word and is already sorted, but by word (the key), not by frequency. So what we need to do next is:
1. Gather together all words that share the same frequency. This step comes for free from the shuffle that follows the map phase: the mapper of the second job simply emits the frequency as the key and the word as the value.
2. Pick out the K words with the highest frequency, in descending order of frequency. Many people implement this step with a TreeMap, but note that a TreeMap overwrites the existing value when the same key is inserted twice, so words with the same frequency cannot coexist in it; some people work around this by wrapping the key in a custom class so that equal counts no longer collide. The method I use here instead stores all words with the same frequency in an ArrayList keyed by that frequency, and finally writes the contents of the ArrayLists to HDFS (see the standalone sketch after this list).
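Before the full job code, here is a minimal standalone sketch of that bookkeeping; the class name TopKSketch, the helper add(), and the sample words are illustrative inventions, not part of the jobs below:

import java.util.ArrayList;
import java.util.Comparator;
import java.util.Map;
import java.util.TreeMap;

public class TopKSketch {
    public static void main(String[] args) {
        // A TreeMap ordered by descending count, whose values are lists of words.
        TreeMap<Integer, ArrayList<String>> byFreq =
                new TreeMap<Integer, ArrayList<String>>(new Comparator<Integer>() {
                    public int compare(Integer v1, Integer v2) {
                        return v2.compareTo(v1); // larger counts first
                    }
                });

        // Two words share the frequency 3; with a plain TreeMap<Integer, String>
        // the second put() would overwrite the first. The list keeps both.
        add(byFreq, 3, "hadoop");
        add(byFreq, 3, "mapreduce");
        add(byFreq, 5, "the");

        int k = 2;
        int emitted = 0;
        for (Map.Entry<Integer, ArrayList<String>> e : byFreq.entrySet()) {
            for (String word : e.getValue()) {
                if (emitted++ >= k) {
                    return;
                }
                System.out.println(word + "\t" + e.getKey()); // prints: the 5, hadoop 3
            }
        }
    }

    private static void add(TreeMap<Integer, ArrayList<String>> m, int freq, String word) {
        ArrayList<String> list = m.get(freq);
        if (list == null) {
            list = new ArrayList<String>();
            m.put(freq, list);
        }
        list.add(word); // words with equal frequency accumulate instead of overwriting
    }
}

The descending comparator makes the TreeMap iterate from the highest count downward, so taking the first K entries across the lists yields the top K.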
To sum up, producing the TopK result requires two MapReduce jobs: a WordCount job and a TopK job.
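Concretely, with the paths used in the driver at the end of this post, the data flows like this: raw text in /input feeds the WordCount job, which writes per-word counts to /out; the TopK job then reads /out and writes the fully sorted word list to /sort and the top K records to /topk.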
The code is as follows:
1. The WordCount part:
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class MyTopK {

    public static class MyMap extends Mapper<LongWritable, Text, Text, IntWritable> {
        private final IntWritable one = new IntWritable(1);
        private Text word = new Text();

        public void map(LongWritable ikey, Text ivalue, Context context)
                throws IOException, InterruptedException {
            StringTokenizer str = new StringTokenizer(ivalue.toString());
            while (str.hasMoreTokens()) {
                word.set(str.nextToken());
                context.write(word, one);
            }
        }
    }

    public static class MyReduce extends Reducer<Text, IntWritable, Text, IntWritable> {
        private IntWritable result = new IntWritable();

        public void reduce(Text ikey, Iterable<IntWritable> ivalue, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable val : ivalue) {
                sum += val.get();
            }
            result.set(sum);
            context.write(ikey, result);
        }
    }

    // A static method so that main() can call it directly through the class name.
    public static boolean run(String in, String out)
            throws IOException, ClassNotFoundException, InterruptedException {
        Configuration conf = new Configuration();
        Job job = new Job(conf, "wordcount");
        job.setJarByClass(MyTopK.class);
        job.setMapperClass(MyMap.class);
        job.setReducerClass(MyReduce.class);
        // Set the map output types.
        job.setMapOutputKeyClass(Text.class);
        job.setMapOutputValueClass(IntWritable.class);
        // Set the reduce output types.
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        // Set the input/output paths.
        FileInputFormat.addInputPath(job, new Path(in));
        FileOutputFormat.setOutputPath(job, new Path(out));
        return job.waitForCompletion(true);
    }
}
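For reference, every line in the WordCount output directory has the form word, a tab, then the count, for example "hadoop	3". That format is the reason the TopK mapper in the next section can tokenize each line and tell the count apart from the word with the regular expression \d+; one observation of my own, not from the original post: this heuristic would misfire on a word consisting only of digits.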
2. The TopK implementation:
import java.io.IOException;
import java.util.ArrayList;
import java.util.Comparator;
import java.util.Map.Entry;
import java.util.Set;
import java.util.StringTokenizer;
import java.util.TreeMap;
import java.util.regex.Pattern;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.MultipleOutputs;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;

public class MyTopK1 {

    public static class MyMap extends Mapper<LongWritable, Text, IntWritable, Text> {
        IntWritable outKey = new IntWritable();
        Text outValue = new Text();

        public void map(LongWritable ikey, Text ivalue, Context context)
                throws IOException, InterruptedException {
            // Each input line holds a word and its count; they are not separated
            // for us in ivalue, so split them apart below.
            StringTokenizer str = new StringTokenizer(ivalue.toString());
            while (str.hasMoreTokens()) {
                String element = str.nextToken();
                if (Pattern.matches("\\d+", element)) { // a regular expression picks out the count
                    outKey.set(Integer.parseInt(element)); // the count becomes the key
                } else {
                    outValue.set(element); // the word becomes the value
                }
            }
            context.write(outKey, outValue); // the counts are sorted while being written
        }
    }

    // Used to select the top K: a TreeMap ordered by descending count.
    public static TreeMap<Integer, ArrayList<String>> hm =
            new TreeMap<Integer, ArrayList<String>>(new Comparator<Integer>() {
                public int compare(Integer v1, Integer v2) {
                    return v2.compareTo(v1);
                }
            });

    // Used for multi-file output.
    private static MultipleOutputs<Text, IntWritable> mos = null;
    private static String path = null;

    // After the shuffle, all words with the same count arrive together and
    // become the input of a single reduce() call.
    public static class MyReduce extends Reducer<IntWritable, Text, Text, IntWritable> {
        public void reduce(IntWritable ikey, Iterable<Text> ivalue, Context context)
                throws IOException, InterruptedException {
            ArrayList<String> tmp = new ArrayList<String>();
            for (Text val : ivalue) {
                context.write(val, ikey); // output the fully sorted content
                // tmp.add(val.toString()); // unbounded, this would use too much memory;
                // the optimization is to cap the list length: since this is top K,
                // about 10 words per count is enough.
                if (tmp.size() <= 10) {
                    tmp.add(val.toString());
                }
            }
            hm.put(ikey.get(), tmp);
        }

        private static int topKNum = 10; // the maximum number of results to keep

        protected void cleanup(Context context) throws IOException, InterruptedException {
            // String path = context.getConfiguration().get("topkout");
            mos = new MultipleOutputs<Text, IntWritable>(context);
            Set<Entry<Integer, ArrayList<String>>> set = hm.entrySet();
            for (Entry<Integer, ArrayList<String>> entry : set) {
                ArrayList<String> al = entry.getValue();
                if (topKNum - al.size() > 0) { // skip groups already holding topKNum or more words
                    for (String word : al) {
                        // if (topKNum-- > 0) {
                        mos.write("TopKMOS", // this argument names a named output
                                new Text(word), new IntWritable(entry.getKey()), path);
                        // }
                    }
                }
            }
            mos.close();
        }
    }

    @SuppressWarnings("deprecation")
    public static void run(String in, String out, String topkout)
            throws IOException, ClassNotFoundException, InterruptedException {
        Configuration conf = new Configuration();
        // Create the job and assign its map and reduce classes.
        Job job = new Job(conf);
        job.setJarByClass(MyTopK1.class);
        job.setMapperClass(MyMap.class);
        job.setReducerClass(MyReduce.class);
        // The output path for the top K results.
        path = topkout;
        // conf.set("topkout", topkout);
        // Set the map output types.
        job.setMapOutputKeyClass(IntWritable.class);
        job.setMapOutputValueClass(Text.class);
        // Set the reduce output types.
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        // Set up the MultipleOutputs format; the second argument "TopKMOS"
        // must be the same as the name used in the write() call above.
        MultipleOutputs.addNamedOutput(job, "TopKMOS", TextOutputFormat.class,
                Text.class, IntWritable.class);
        // Set the input/output paths.
        FileInputFormat.addInputPath(job, new Path(in));
        FileOutputFormat.setOutputPath(job, new Path(out));
        // Submit the job.
        job.waitForCompletion(true);
    }
}
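Two reading notes on this job (my observations, not claims from the original post): first, the string "TopKMOS" passed to MultipleOutputs.addNamedOutput() must be exactly the one used in mos.write(), or the named output will not be found at runtime; the fourth argument of mos.write() is a base output path, which is why the top K records end up under the topkout path while the fully sorted list goes to the job's regular output path. Second, hm and mos are static fields, so this scheme only yields a global top K when the job runs with a single reduce task; with several reducers, each task JVM would keep its own TreeMap and write its own partial top K.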
3. The entry point of the program:
import java.io.IOException;

import org.apache.log4j.PropertyConfigurator;

public class TopKMain {
    public static void main(String[] args)
            throws ClassNotFoundException, IOException, InterruptedException {
        // One thing to keep in mind here: load the log4j configuration manually.
        // It removes the startup warning and, more importantly, shows the
        // details of any error when something goes wrong.
        String rootPath = System.getProperty("user.dir");
        PropertyConfigurator.configure(rootPath + "\\log4j.properties");
        // The input whose words are to be counted and ranked.
        String in = "hdfs://192.168.1.21:9000/input";
        // Where the word count results go.
        String wordCountOutput = "hdfs://192.168.1.21:9000/out";
        // Where the re-sorted counts go.
        String sort = "hdfs://192.168.1.21:9000/sort";
        // The file name for the top K output.
        String topK = "hdfs://192.168.1.21:9000/topk";
        // Run the sort only once the word count job has completed successfully.
        if (MyTopK.run(in, wordCountOutput)) {
            MyTopK1.run(wordCountOutput, sort, topK);
        }
    }
}
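Note how the chaining works: MyTopK.run() returns the value of job.waitForCompletion(true), so the TopK job is launched only after the WordCount job has finished successfully. One practical caveat from my side: FileOutputFormat refuses to start a job whose output path already exists, so the /out, /sort and /topk directories must be deleted from HDFS before re-running.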