The bandwidth available on the cluster limits the number of MapReduce jobs it can run, so the most important optimization is to minimize the data transferred between the map tasks and the reduce tasks. Hadoop allows users to specify a merge function, usually called a combiner, to be run on the map output; like the map and reduce functions, it is supplied by the user.
The combiner's output forms the input to the reduce function. Because the combiner is only an optimization, Hadoop makes no guarantee about how many times it will be called for any given record in the map output: zero, one, or many times. Therefore the reducer must produce the same output no matter how many times the combiner runs, and this constraint restricts which functions can be used as combiners: they must be commutative and associative (finding a maximum qualifies, for example, but computing a mean does not).
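The property above can be checked with a small plain-Java sketch (no Hadoop dependency; the temperature values are made up for illustration): because `max` is commutative and associative, pre-reducing each map task's output with the same function does not change the final reduce result.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

public class CombinerDemo {
    // The reduce function: maximum of the values for one key.
    // Max is commutative and associative, so it is safe as a combiner.
    static int reduce(List<Integer> values) {
        return Collections.max(values);
    }

    public static void main(String[] args) {
        // Hypothetical outputs of two map tasks for the same key.
        List<Integer> map1 = Arrays.asList(0, 20, 10);
        List<Integer> map2 = Arrays.asList(25, 15);

        // Without a combiner the reducer sees all five values.
        int noCombiner = reduce(Arrays.asList(0, 20, 10, 25, 15));

        // With a combiner each map's output is pre-reduced locally,
        // so only one value per map task crosses the network.
        int withCombiner = reduce(Arrays.asList(reduce(map1), reduce(map2)));

        // Both paths must agree, or max could not be used as a combiner.
        System.out.println(noCombiner + " " + withCombiner);
    }
}
```

Running the combiner here shrinks the reducer's input from five values to two without changing the answer, which is exactly the bandwidth saving the combiner exists for.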
The combiner does not replace the reducer: we still need the reduce function to handle records with the same key coming from different map outputs. But because a combiner can greatly reduce the amount of data transferred between map and reduce, it is worth considering whether one can be used in a MapReduce job.
In a MapReduce program, the combiner is defined using the Reducer interface, and it is set on the JobConf via the setCombinerClass() method.
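A minimal configuration sketch using Hadoop's old `mapred` API, which matches the JobConf and setCombinerClass() calls mentioned above; the MaxTemperature, MaxTemperatureMapper, and MaxTemperatureReducer class names are assumed for illustration. Because max satisfies the combiner rules, the same Reducer class can serve as both combiner and reducer.

```java
// Configuration fragment only; assumes the job, mapper, and reducer
// classes named below exist elsewhere in the program.
import org.apache.hadoop.mapred.JobConf;

JobConf conf = new JobConf(MaxTemperature.class);
conf.setMapperClass(MaxTemperatureMapper.class);
// The combiner reuses the reducer implementation, since max is
// commutative and associative.
conf.setCombinerClass(MaxTemperatureReducer.class);
conf.setReducerClass(MaxTemperatureReducer.class);
```

Reusing the reducer as the combiner is the common pattern whenever the reduce function itself already satisfies the combiner constraints.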
Sinsing's notes on Hadoop: The Definitive Guide, part 3: the combiner.