Hadoop Combiner Component

Source: Internet
Author: User

One: Background

In the MapReduce model, the reduce function mostly performs aggregations such as classified totals, maximums, and minimums. For these operations, a combiner can be applied to the map output, which reduces the amount of data sent over the network and lightens the load on the reduce tasks. The combiner runs on each node and only affects the output of the local map task: its input is the local map output. In many cases the combiner logic is identical to the reduce logic, so the two can share the same reducer implementation.
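Because the word-count reduce logic is a plain sum, the same class can be registered for both roles. A minimal wiring sketch (job and MyReducer refer to the program in section Three):

[Java]
// The summing reducer doubles as the map-side combiner
job.setCombinerClass(MyReducer.class);
job.setReducerClass(MyReducer.class);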

Two: When to run combiner

(1): If the job has a combiner set and the number of spill files reaches min.num.spills.for.combine (default 3), the combiner is executed before the spill files are merged (see the configuration sketch after this list).

(2): However, in some cases the merge starts before the number of spill files meets that threshold, and the combiner may then run after the merge.

(3): The combiner may also not run at all; the framework takes the cluster load into account when deciding whether to run it.
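The spill threshold itself can be set on the job configuration. A minimal sketch, assuming the classic (MRv1) property name from point (1); newer MRv2 releases read the equivalent mapreduce.map.combine.minspills:

[Java]
// Configuration object, as created in the program in section Three
Configuration conf = new Configuration();
// Classic MapReduce reads "min.num.spills.for.combine" (default 3)
conf.setInt("min.num.spills.for.combine", 5);
// Equivalent property on MRv2/YARN releases:
// conf.setInt("mapreduce.map.combine.minspills", 5);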

Three: Program code

[Java]
import java.io.IOException;
import java.net.URI;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;
import org.apache.hadoop.mapreduce.lib.partition.HashPartitioner;

public class WordCountTest {

    // Input path
    private static final String INPUT_PATH = "hdfs://liaozhongmin:9000/hello";
    // Output path
    private static final String OUT_PATH = "hdfs://liaozhongmin:9000/out";

    public static void main(String[] args) {
        try {
            // Create the configuration
            Configuration conf = new Configuration();

            // Compress the map output
            // conf.setBoolean("mapred.compress.map.output", true);
            // Set the codec used to compress the map output
            // conf.setClass("mapred.map.output.compression.codec", GzipCodec.class, CompressionCodec.class);
            // Compress the reduce output
            // conf.setBoolean("mapred.output.compress", true);
            // Set the codec used to compress the reduce output
            // conf.setClass("mapred.output.compression.codec", GzipCodec.class, CompressionCodec.class);

            // Add config files (so the settings can be supplied in code instead of changing the cluster manually)
            // conf.addResource("classpath://hadoop/core-site.xml");
            // conf.addResource("classpath://hadoop/hdfs-site.xml");

            // Create the file system and delete the output directory if it already exists
            FileSystem fileSystem = FileSystem.get(new URI(OUT_PATH), conf);
            if (fileSystem.exists(new Path(OUT_PATH))) {
                fileSystem.delete(new Path(OUT_PATH), true);
            }

            // Create the job
            Job job = new Job(conf, WordCountTest.class.getName());

            // 1.1 Set the input directory and the input format class
            FileInputFormat.setInputPaths(job, INPUT_PATH);
            job.setInputFormatClass(TextInputFormat.class);

            // 1.2 Set the custom mapper class and the key/value types of the map output
            job.setMapperClass(MyMapper.class);
            job.setMapOutputKeyClass(Text.class);
            job.setMapOutputValueClass(LongWritable.class);

            // 1.3 Set the partitioner and the number of reduce tasks
            //     (one partition, so one reduce task)
            job.setPartitionerClass(HashPartitioner.class);
            job.setNumReduceTasks(1);

            // 1.4 Sorting and grouping (defaults)
            // 1.5 Combiner (can share the reducer implementation)
            job.setCombinerClass(MyReducer.class);

            // 2.1 Shuffle copies the data from the map side to the reduce side
            // 2.2 Set the reducer class and the key/value types of its output
            job.setReducerClass(MyReducer.class);
            job.setOutputKeyClass(Text.class);
            job.setOutputValueClass(LongWritable.class);

            // 2.3 Set the output path and the output format class
            FileOutputFormat.setOutputPath(job, new Path(OUT_PATH));
            job.setOutputFormatClass(TextOutputFormat.class);

            // Submit the job and exit
            System.exit(job.waitForCompletion(true) ? 0 : 1);
        } catch (Exception e) {
            e.printStackTrace();
        }
    }

    public static class MyMapper extends Mapper<LongWritable, Text, Text, LongWritable> {

        // Value emitted for every word (count of 1)
        private final LongWritable oneTime = new LongWritable(1);
        // Key emitted for every word
        private final Text word = new Text();

        @Override
        protected void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
            // Split each line on tab (\t)
            String[] splits = value.toString().split("\t");
            // Emit <word, 1> for every word in the line
            for (String str : splits) {
                word.set(str);
                context.write(word, oneTime);
            }
        }
    }

    public static class MyReducer extends Reducer<Text, LongWritable, Text, LongWritable> {

        // Value emitted by the reduce (and combine) step
        private final LongWritable result = new LongWritable();

        @Override
        protected void reduce(Text key, Iterable<LongWritable> values, Context context) throws IOException, InterruptedException {
            long sum = 0;
            // Sum the occurrences of each word
            for (LongWritable s : values) {
                sum += s.get();
            }
            result.set(sum);
            context.write(key, result);
        }
    }
}
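Note that a combiner must consume the map output key/value types and emit those same types, which is why MyReducer (Text/LongWritable in, Text/LongWritable out) can be registered for both roles; and because the framework may apply it zero, one, or several times, its logic must produce the same final result however often it runs.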



Data used by the program (the input file /hello contains three lines of tab-separated words):

hello	you
hello	me
you	me	love

Program Run Process:

(1): The map task receives 3 input records, i.e. <0, hello you>, <10, hello me>, <19, you me love>.

(2): The map output contains 7 records: <hello,1>, <you,1>, <hello,1>, <me,1>, <you,1>, <me,1>, <love,1>.

(3): After sorting and grouping, the records are <hello,{1,1}>, <love,{1}>, <me,{1,1}>, <you,{1,1}>.

(4): 7 records enter the combiner; after the combiner there are 4 records: <hello,2>, <love,1>, <me,2>, <you,2>, which is what the reduce task receives (final output shown below).
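Assuming the job runs to completion, the file written to the output directory (part-r-00000) should therefore contain:

hello	2
love	1
me	2
you	2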

Four: Restrictions on the use of the combiner

Not every job can use a combiner. The combiner is suitable for scenarios such as summing totals or finding maximum/minimum values, but it is not suitable for computing averages. If a combiner were used in an averaging program, each map task would first average its own output, and the reduce side would then average those per-map averages; the result would generally differ from the true average.
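A small, self-contained illustration of the pitfall (plain Java with made-up numbers, not part of the program above): averaging each map task's values locally and then averaging those averages only matches the true average when every map task sees the same number of records.

[Java]
public class AverageCombinerPitfall {

    public static void main(String[] args) {
        // Hypothetical values seen by two map tasks
        double[] map1 = {1, 2, 3};  // local average = 2.0
        double[] map2 = {4, 5};     // local average = 4.5

        // "Combiner" result: the average of the per-map averages
        double avgOfAvgs = (avg(map1) + avg(map2)) / 2;   // 3.25

        // True average over all five values
        double trueAvg = (1 + 2 + 3 + 4 + 5) / 5.0;       // 3.0

        System.out.println(avgOfAvgs + " != " + trueAvg);
    }

    private static double avg(double[] xs) {
        double sum = 0;
        for (double x : xs) {
            sum += x;
        }
        return sum / xs.length;
    }
}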
