Reprint: http://www.cnblogs.com/sharpxiajun/p/5205496.html
Recently I wrote a small MapReduce program whose main job is to compute period-over-period values and pick out the top 5. I originally intended to do the calculation in Spark, but since I have only taken a cursory look at Spark so far, I did it with MapReduce first. Today I'd like to share this example with you; it also serves as a summary of the program I wrote.
First, let me explain the period-over-period (环比) comparison. For example, to compute this week's week-over-week change, we take the difference between this week's number and last week's number and divide it by last week's number; likewise, the month-over-month change is the difference between this month's number and last month's number divided by last month's number. This MapReduce example, however, does not compute the ratio directly; it only finds the difference between the values of the two time periods, and the final ratio is computed by the business system.
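To make the formula concrete, here is a minimal standalone sketch (the class name and the numbers are mine, made up for illustration; this is not part of the MapReduce code):

public class HBExample {
    public static void main(String[] args) {
        int thisWeek = 95, lastWeek = 80;
        int diff = thisWeek - lastWeek;           // what the MapReduce jobs below compute
        double ratio = (double) diff / lastWeek;  // what the business system computes afterwards
        System.out.println(diff + " " + ratio);   // prints: 15 0.1875
    }
}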
Let's look at the test data I constructed. The test data is split across two files; the contents of the first file are as follows:
guanggu,1;90
hongshan,1;80
xinzhou,1;70
wuchang,1;95
hankou,1;85
hanyang,1;75
The test data for the second file is as follows:
guanggu,2;66
hongshan,2;68
xinzhou,2;88
wuchang,2;59
hankou,2;56
hanyang,2;38
In each row, the first field is the key; the key and the value are separated by a comma. The value, e.g. 1;90, contains two parts: 1 is the time-period marker and 90 is the actual value. As you can see, the same key appears in two different time periods, marked 1 and 2.
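A minimal sketch of how one test line decomposes (the class and variable names are mine, for illustration only):

public class LineFormatExample {
    public static void main(String[] args) {
        String line = "guanggu,1;90";
        String[] kv = line.split(",");   // kv[0] = "guanggu", the key
        String[] tv = kv[1].split(";");  // tv[0] = "1", the time-period marker; tv[1] = "90", the value
        System.out.println(kv[0] + " period " + tv[0] + " value " + tv[1]);
    }
}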
The MapReduce logic works as follows: the first step computes the period-over-period value, and the second step sorts the results. When designing this algorithm I spent a long time trying to merge the difference calculation and the sorting into a single pass, but eventually found it too hard to do, so I had to split the whole computation into two separate MapReduce jobs: the first computes the period-over-period value, the second sorts, and the two are chained, with the output of the first serving as the input of the second. The main reason two jobs are needed is that it is difficult to merge the two time periods of a key in the raw data into a single row, so the computation must include a phase that groups records with the same key; that forces the calculation into two steps.
Next comes the concrete code. First, the MapReduce job that computes the period-over-period value; its map implementation is as follows:
import java.io.IOException;

import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// The input is <Object, Text> and the output is <Text, Text>. The Object key is in fact
// the line offset, which means little in this calculation; the Text value is the content
// of each line.
public class MrByAreaHBMap extends Mapper<Object, Text, Text, Text> {

    // The key and value on each line are separated by a comma
    private static String firstSeparator = ",";

    @Override
    protected void map(Object key, Text value, Context context)
            throws IOException, InterruptedException {
        /*
         * The logic of this map is very simple: it splits each line into key and value.
         * Beginners may wonder how records with the same key end up together; that is
         * done by the framework before the reduce runs.
         */
        Text areaKey = new Text();  // the reduce input key is of type Text
        Text areaVal = new Text();  // the reduce input value is of type Text
        String line = value.toString();
        if (line != null && !line.equals("")) {
            String[] arr = line.split(firstSeparator);
            areaKey.set(arr[0]);
            areaVal.set(arr[1]);
            context.write(areaKey, areaVal);
        }
    }
}
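For the sample line guanggu,1;90, this map emits the pair (guanggu, 1;90); the shuffle then delivers both time periods of guanggu to a single reduce call.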
The reduce code is as follows:
import java.io.IOException;

import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public class MrByAreaHBReduce extends Reducer<Text, Text, Text, Text> {

    private static String firstSeparator = ";";
    private static String preFlag = "1";
    private static String nextFlag = "2";

    /*
     * The input of reduce is also in key/value form, but by now the values of the same
     * key from the map phase have been merged together; the reduce method processes
     * them through an iterator.
     */
    @Override
    protected void reduce(Text key, Iterable<Text> values, Context context)
            throws IOException, InterruptedException {
        int num1 = 0, num2 = 0, hbNum = 0;
        for (Text value : values) {
            String inVal = value.toString();
            String[] arr = inVal.split(firstSeparator);
            // The logic below picks out the value of each time period by its marker
            if (arr[0].equals(preFlag)) {
                num1 = Integer.valueOf(arr[1]);
            }
            if (arr[0].equals(nextFlag)) {
                num2 = Integer.valueOf(arr[1]);
            }
        }
        hbNum = num1 - num2;  // the period-over-period difference is computed here
        Text valueText = new Text();
        valueText.set(hbNum + "");
        Text retKey = new Text();
        /*
         * The output key is modified here: the original key and the values of the two
         * time periods are concatenated, so that whoever reads the result can also see
         * the original input data. This is a workaround forced by the key/value
         * computation model.
         */
        retKey.set(key.toString() + firstSeparator + num1 + firstSeparator + num2);
        context.write(valueText, retKey);
    }
}
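Given the test data above, the output of this first job should look roughly like the following (the difference first, then the original key with both period values; the columns are tab-separated and the rows are not yet sorted):

24	guanggu;90;66
12	hongshan;80;68
-18	xinzhou;70;88
36	wuchang;95;59
29	hankou;85;56
37	hanyang;75;38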
That completes the MapReduce code for the period-over-period value. Next is the sorting algorithm, which is simpler. When writing out the results of the first job I swapped the period-over-period value with the original key, and that result file becomes the input of the second MapReduce, which sorts on the new key. In the MapReduce computation model, a default sort happens between map and reduce: if the map output key is a string it is sorted lexicographically, and if it is a number it is sorted in ascending order. Here is the sorting MapReduce; the map code is as follows:
import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class MrByAreaSortMap extends Mapper<LongWritable, Text, IntWritable, Text> {

    /*
     * The sort we need is by the numeric value of the key, and this sort applies to the
     * map output, so the key written out in this code uses the IntWritable type. The map
     * logic is very simple: just make sure the map output key is a numeric type.
     */
    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        String line = value.toString();
        /*
         * The output file of the previous reduce is separated by whitespace, but we don't
         * know whether that is several spaces or a tab, so the regular expression \s+ is
         * used: it doesn't care how many spaces or tabs there are.
         */
        String[] arr = line.split("\\s+");
        IntWritable outputKey = new IntWritable(Integer.valueOf(arr[0]));
        Text outputValue = new Text();
        outputValue.set(arr[1]);
        context.write(outputKey, outputValue);
    }
}
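For an input line such as 24 guanggu;90;66 produced by the first job, this map emits the numeric key 24 and the text value guanggu;90;66.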
The reduce code is as follows:
import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

/* The reduce code may be surprising: it simply writes the map results out unchanged */
public class MrByAreaSortReduce extends Reducer<IntWritable, Text, IntWritable, Text> {

    @Override
    protected void reduce(IntWritable key, Iterable<Text> values, Context context)
            throws IOException, InterruptedException {
        for (Text text : values) {
            context.write(key, text);
        }
    }
}
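With a single reduce task, the final output of the sort job for this test data should look roughly like this, from the largest period-over-period value down:

37	hanyang;75;38
36	wuchang;95;59
29	hankou;85;56
24	guanggu;90;66
12	hongshan;80;68
-18	xinzhou;70;88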
The comments in the code explain its logic in detail, so I won't describe it again here.
Next is the main function that drives the two MapReduce jobs, i.e. how we execute them. This main function has two notable characteristics. The first is that the two jobs have a dependency relationship: the second MapReduce executes only after the first has finished, or put differently, the output of the first MapReduce is the input of the second. The second is that the sort relies on the default sorting mechanism in the shuffle between map and reduce, and that mechanism is not as simple as the MapReduce code makes it look; it requires a deeper understanding of how MapReduce works. Let's look directly at the code:
import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.jobcontrol.ControlledJob;
import org.apache.hadoop.mapreduce.lib.jobcontrol.JobControl;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class MrByAreaJob {
    public static void main(String[] args) throws IOException {
        // One MapReduce is one job, and each job needs its own Configuration. At first I
        // let the two jobs share one Configuration, and in the end the MR jobs reported errors.
        Configuration conf01 = new Configuration();
        ControlledJob conJobHB = new ControlledJob(conf01);

        // The code below is covered in many articles, so I won't say much about it here
        Job jobHB = new Job(conf01, "HB");
        jobHB.setJarByClass(MrByAreaJob.class);
        jobHB.setMapperClass(MrByAreaHBMap.class);
        jobHB.setReducerClass(MrByAreaHBReduce.class);
        jobHB.setMapOutputKeyClass(Text.class);
        jobHB.setMapOutputValueClass(Text.class);
        jobHB.setOutputKeyClass(Text.class);
        jobHB.setOutputValueClass(Text.class);

        conJobHB.setJob(jobHB);

        FileInputFormat.addInputPath(jobHB, new Path(args[0]));
        FileOutputFormat.setOutputPath(jobHB, new Path(args[1]));

        Configuration conf02 = new Configuration();
        Job jobSort = new Job(conf02, "sort");
        jobSort.setJarByClass(MrByAreaJob.class);
        jobSort.setMapperClass(MrByAreaSortMap.class);
        jobSort.setReducerClass(MrByAreaSortReduce.class);
        // Partitioning is one step of the shuffle; one partition corresponds to one reduce.
        // If this MapReduce has multiple reduces, how do we guarantee a globally consistent
        // sort order? That is what this custom partitioner handles.
        jobSort.setPartitionerClass(PartitionByArea.class);
        // The default numeric sort from map to reduce is ascending, but the requirement is
        // descending, so we have to change the sort
        jobSort.setSortComparatorClass(IntKeyComparator.class);
        jobSort.setMapOutputKeyClass(IntWritable.class);
        jobSort.setMapOutputValueClass(Text.class);
        jobSort.setOutputKeyClass(IntWritable.class);
        jobSort.setOutputValueClass(Text.class);

        ControlledJob conJobSort = new ControlledJob(conf02);
        conJobSort.setJob(jobSort);

        // The job dependency is added here
        conJobSort.addDependingJob(conJobHB);

        // As you can see, the output of the first MapReduce is the input of the second
        FileInputFormat.addInputPath(jobSort, new Path(args[1]));
        FileOutputFormat.setOutputPath(jobSort, new Path(args[2]));

        // The master control job
        JobControl mainJobControl = new JobControl("mainHBSort");

        mainJobControl.addJob(conJobHB);
        mainJobControl.addJob(conJobSort);

        Thread t = new Thread(mainJobControl);
        t.start();

        while (true) {
            if (mainJobControl.allFinished()) {
                System.out.println(mainJobControl.getSuccessfulJobList());
                mainJobControl.stop();
                break;
            }
        }
    }
}
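One small caveat about the polling loop above: it spins at full speed while the jobs run. A minimal refinement of my own (not in the original code) is to sleep between checks:

        while (!mainJobControl.allFinished()) {
            try {
                Thread.sleep(500);  // check every half second instead of busy-waiting
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                break;
            }
        }
        System.out.println(mainJobControl.getSuccessfulJobList());
        mainJobControl.stop();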
Two classes in the driver have not been introduced yet. One is IntKeyComparator, which makes the sorting MapReduce sort numbers from large to small; its code is as follows:
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.WritableComparable;
import org.apache.hadoop.io.WritableComparator;

public class IntKeyComparator extends WritableComparator {

    protected IntKeyComparator() {
        super(IntWritable.class, true);
    }

    @Override
    public int compare(WritableComparable a, WritableComparable b) {
        // Negate the default comparison so that keys sort from large to small
        return -super.compare(a, b);
    }
}
The other class is PartitionByArea, which ensures that the sort remains globally consistent even when multiple reduce tasks are configured. Its code is as follows:
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Partitioner;

public class PartitionByArea<IntWritable, Text> extends Partitioner<IntWritable, Text> {

    @Override
    public int getPartition(IntWritable key, Text value, int numReduceTasks) {
        // NOTE: 100 here is an assumed placeholder; maxValue should be an upper bound
        // on the key values in your data.
        int maxValue = 100;
        int keySection = 0;
        // numReduceTasks is the number of reduce tasks configured for the job;
        // IntWritable.hashCode() returns the raw int value of the key
        if (numReduceTasks > 1 && key.hashCode() < maxValue) {
            int sectionValue = maxValue / (numReduceTasks - 1);
            int count = 0;
            while ((key.hashCode() - sectionValue * count) > sectionValue) {
                count++;
            }
            keySection = numReduceTasks - 1 - count;
        }
        return keySection;
    }
}
I particularly want to explain PartitionByArea; it took me a long time to understand this principle. Partitioning assigns the map output to the corresponding reduces; generally one partition corresponds to one reduce. If we set the number of reduce task slots to one, we don't need to change the partitioner class, but in production there are often multiple reduces, and then keeping the data globally sorted matters. So how do we guarantee the global order of the data? We take the maximum value of the input data, divide it by the number of partitions, and use the quotient as the boundary for splitting the data; that way the data stays globally ordered across the reduces.
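To make the partition math concrete, here is a small standalone sketch of the same logic, assuming maxValue = 100 and three reduce tasks (both values are mine, chosen for illustration):

public class PartitionExample {

    // Same arithmetic as PartitionByArea.getPartition, extracted for illustration
    static int partition(int key, int maxValue, int numReduceTasks) {
        int keySection = 0;
        if (numReduceTasks > 1 && key < maxValue) {
            int sectionValue = maxValue / (numReduceTasks - 1);
            int count = 0;
            while ((key - sectionValue * count) > sectionValue) {
                count++;
            }
            keySection = numReduceTasks - 1 - count;
        }
        return keySection;
    }

    public static void main(String[] args) {
        // With sectionValue = 100 / 2 = 50:
        System.out.println(partition(37, 100, 3));   // 2: small keys go to the last partition
        System.out.println(partition(80, 100, 3));   // 1: larger keys go to an earlier partition
        System.out.println(partition(120, 100, 3));  // 0: keys >= maxValue land in partition 0
    }
}

Combined with the descending comparator, this puts the largest keys in the lowest-numbered partition, so concatenating the reduce output files in order yields a globally descending result.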
Now that all the code has been introduced, here is how to get it running. I write code in the Eclipse IDE, but I don't use the MapReduce plugin; instead I run the job directly on the server. Let me describe how, as follows:
First, as the root user, I created my own folder on the server running the Hadoop service; the folder is named xiajun. I uploaded the source files via FTP into the javafile folder under the xiajun directory, then executed the following commands:
mkdir /xiajun/javafile
javac -classpath /home/hadoop/hadoop/hadoop-core-0.20.2-cdh3u4.jar -d /xiajun/javaclass /xiajun/javafile/*.java
The commands above compile the Java source files in the javafile folder and put the compiled classes into the javaclass directory.
jar -cvf /xiajun/mymr.jar -C /xiajun/javaclass/ .
This packages the class files in the javaclass directory into a jar and places it in the xiajun directory.
Next we log in as the hadoop user:
su - hadoop
The reason I compiled the jar as root is that my hadoop user does not have permission to upload files.
We first upload the test data to HDFS, and then execute the following command:
cd /hadoop/bin
This switches to the bin directory; then run the following:
hadoop jar mymr.jar cn.com.TestMain <input directory> <output directory>
The input can be either a specific file or a directory. The output directory must not already exist on HDFS; if it does, Hadoop cannot be sure whether the task has already been run, and the task will be aborted.
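So if you need to re-run a job, delete the old output directory first, for example (using the old-style fs command of this Hadoop version):

hadoop fs -rmr <output directory>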
The execution logs of the two chained MapReduce jobs are quite disappointing, so when a task fails to run properly, my current approach is to execute the MapReduce jobs one at a time and inspect the error logs.
Finally, let's look at how an application service can invoke this MapReduce program. In the code below I execute the shell command remotely over SSH:
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;

import ch.ethz.ssh2.Connection;
import ch.ethz.ssh2.Session;

public class TestMain {

    public static void main(String[] args) {
        String hostName = "192.168.1.200";
        String userName = "hadoop";
        String pwd = "hadoop";

        Connection conn = new Connection(hostName);
        Session sess = null;
        long begin = System.currentTimeMillis();
        try {
            conn.connect();
            boolean isAuthenticated = conn.authenticateWithPassword(userName, pwd);
            sess = conn.openSession();
            sess.execCommand("cd hadoop/bin && hadoop jar /xiajun/mymr.jar com.test.mr.MrByAreaJob /xiajun/areaHBinput /xiajun/areaHBoutput58 /xiajun/areaHBoutput68");

            InputStream stdout = sess.getStdout();
            BufferedReader br = new BufferedReader(new InputStreamReader(stdout));
            StringBuilder sb = new StringBuilder();
            while (true) {
                String line = br.readLine();
                if (line == null) break;
                sb.append(line);
            }
            System.out.println(sb.toString());

            long end = System.currentTimeMillis();
            System.out.println("Time consumed: " + (end - begin) / 1000 + " sec");
        } catch (IOException e) {
            e.printStackTrace();
        } finally {
            sess.close();
            conn.close();
        }
    }
}
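One note on dependencies: the Connection and Session classes above come from the Ganymed SSH-2 library (the ch.ethz.ssh2 package), so that jar has to be on the application's classpath.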
Well, here's the end of the article.