One: Background
In the MapReduce model, the reduce function mostly performs aggregation: counting by category, summing, taking maximum or minimum values, and so on. For these operations we can apply a combiner to the map output, which reduces the amount of data sent over the network and lightens the load on the reduce tasks. The combiner runs on each node and only affects that node's local map output; its input is the local map output. In many cases the combiner's logic is the same as the reducer's, so the two can share the same reducer class.
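In the driver code this sharing is literal: the same class is registered for both roles, as the full word-count program in section Three does.

    // The same Reducer implementation is used twice: as the combiner, which pre-sums
    // counts on each map task's local output, and as the final reducer.
    job.setCombinerClass(MyReducer.class);
    job.setReducerClass(MyReducer.class);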
Two: When the combiner runs
(1): If the job has a combiner set and the number of spill files reaches min.num.spills.for.combine (default 3), the combiner is executed before the merge (see the configuration sketch after this list).
(2): If the merge has already started but the number of spill files has not reached that threshold, the combiner may instead run after the merge.
(3): The combiner may also not run at all; the framework takes the cluster's load into account when deciding whether to invoke it.
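A minimal sketch of wiring this up in the driver; min.num.spills.for.combine is the property name read by the classic (Hadoop 1.x) map task, and newer releases expose the same threshold as mapreduce.map.combine.minspills. MyReducer here is the reducer class defined in the program in section Three.

    Configuration conf = new Configuration();
    // Run the combiner during spill/merge once at least this many spill files exist (default 3).
    conf.setInt("min.num.spills.for.combine", 3);

    Job job = new Job(conf, "wordcount-with-combiner");
    job.setCombinerClass(MyReducer.class);   // combiner logic shared with the reducer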
Three: Program code
import java.io.IOException;
import java.net.URI;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;
import org.apache.hadoop.mapreduce.lib.partition.HashPartitioner;

public class WordCountTest {

    // Input path
    private static final String INPUT_PATH = "hdfs://liaozhongmin:9000/hello";
    // Output path
    private static final String OUT_PATH = "hdfs://liaozhongmin:9000/out";

    public static void main(String[] args) {
        try {
            // Create the configuration
            Configuration conf = new Configuration();
            /**********************************************/
            // Compress the map-side output
            // conf.setBoolean("mapred.compress.map.output", true);
            // Set the compression codec used for the map-side output
            // conf.setClass("mapred.map.output.compression.codec", GzipCodec.class, CompressionCodec.class);
            // Compress the reduce-side output
            // conf.setBoolean("mapred.output.compress", true);
            // Set the compression codec used for the reduce-side output
            // conf.setClass("mapred.output.compression.codec", GzipCodec.class, CompressionCodec.class);
            // Add config files (so the job can be configured in code without changing the cluster by hand)
            /*
             * conf.addResource("classpath://hadoop/core-site.xml");
             * conf.addResource("classpath://hadoop/hdfs-site.xml");
             */

            // Create the file system
            FileSystem fileSystem = FileSystem.get(new URI(OUT_PATH), conf);
            // If the output directory exists, delete it
            if (fileSystem.exists(new Path(OUT_PATH))) {
                fileSystem.delete(new Path(OUT_PATH), true);
            }

            // Create the job
            Job job = new Job(conf, WordCountTest.class.getName());

            // 1.1 Set the input directory and the input format class
            FileInputFormat.setInputPaths(job, INPUT_PATH);
            job.setInputFormatClass(TextInputFormat.class);

            // 1.2 Set the custom mapper class and the key/value types of the map output
            job.setMapperClass(MyMapper.class);
            job.setMapOutputKeyClass(Text.class);
            job.setMapOutputValueClass(LongWritable.class);

            // 1.3 Set the partitioner and the number of reduce tasks
            //     (one partition here, so one reduce task)
            job.setPartitionerClass(HashPartitioner.class);
            job.setNumReduceTasks(1);

            // 1.4 Sorting and grouping
            // 1.5 Set the combiner (it can share the reducer class)
            job.setCombinerClass(MyReducer.class);

            // 2.1 Shuffle copies the data from the map side to the reduce side
            // 2.2 Set the reducer class and the key/value types of its output
            job.setReducerClass(MyReducer.class);
            job.setOutputKeyClass(Text.class);
            job.setOutputValueClass(LongWritable.class);

            // 2.3 Set the output path and the output format class
            FileOutputFormat.setOutputPath(job, new Path(OUT_PATH));
            job.setOutputFormatClass(TextOutputFormat.class);

            // Submit the job and exit
            System.exit(job.waitForCompletion(true) ? 0 : 1);
        } catch (Exception e) {
            e.printStackTrace();
        }
    }

    public static class MyMapper extends Mapper<LongWritable, Text, Text, LongWritable> {
        // LongWritable constant used as the value of the map output
        private final LongWritable oneTime = new LongWritable(1);
        // Text object used as the key of the map output
        private final Text word = new Text();

        @Override
        protected void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
            // Split each line on tab (\t)
            String[] splits = value.toString().split("\t");
            // Iterate over the words and emit each one
            for (String str : splits) {
                // Set the word
                word.set(str);
                // Write the result
                context.write(word, oneTime);
            }
        }
    }

    public static class MyReducer extends Reducer<Text, LongWritable, Text, LongWritable> {
        // LongWritable used as the value of the reduce output
        private final LongWritable result = new LongWritable();

        @Override
        protected void reduce(Text key, Iterable<LongWritable> values, Context context) throws IOException, InterruptedException {
            int sum = 0;
            // Iterate over the values, counting the occurrences of each word
            for (LongWritable s : values) {
                sum += s.get();
            }
            // Set the result
            result.set(sum);
            // Write the result
            context.write(key, result);
        }
    }
}
Data used by the program (the contents of the input file hdfs://liaozhongmin:9000/hello; the words in each line are tab-separated):
hello   you
hello   me
you     me      love
Program Run Process:
(1): The map input consists of 3 records: <0, hello you>, <10, hello me>, <19, you me love>
(2): The map output consists of 7 records: <hello,1>, <you,1>, <hello,1>, <me,1>, <you,1>, <me,1>, <love,1>
(3): After sorting and grouping there are 4 groups: <hello,{1,1}>, <love,{1}>, <me,{1,1}>, <you,{1,1}>
(4): 7 records enter the combiner; after the combiner there are 4 records: <hello,2>, <love,1>, <me,2>, <you,2>
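Since there is only one reduce task, the reducer receives those 4 records and simply re-sums each group, so the file written to /out should contain (key and value separated by a tab, TextOutputFormat's default):
hello   2
love    1
me      2
you     2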
Four: Restrictions on the use of the combiner
Not every job can use a combiner. Combiners are suitable for aggregations such as summing and taking maximum or minimum values, but not for computing averages. If a combiner is applied in an averaging program, each map task's output is averaged locally first, and the reduce side then averages those per-map averages; in general that is not the same as the true average of all the values.
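As a hypothetical illustration: suppose one map task emits the values 1, 2, 3 and another emits 4, 5. An averaging combiner would produce 2 and 4.5, and averaging those on the reduce side gives (2 + 4.5) / 2 = 3.25, while the true average is (1 + 2 + 3 + 4 + 5) / 5 = 3. An average can still be computed correctly if the combiner emits partial sums and counts instead of partial averages.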