The problem
Normally, one benefit of MapReduce is that the data arriving at each reducer is sorted by the reducer's input key. With a single reducer the output is therefore totally sorted, but jobs that can get by with one reducer are rare. With more than one reducer, the framework only guarantees that the contents of each reducer are sorted by key; there is no guarantee of any order between the reducers.
Solving the problem with total ordering
The technique for total ordering lies in the Partitioner implementation. We need to convert the value range of the key into an index range. Here the keys are all English words, so the first letter gives an index from 0 to 25. We then split the index range into several sub-ranges and assign each sub-range to a reducer.
If we can allocate 2 reducers, we can assign indexes 0-12 to the first reducer and indexes 13-25 to the second.
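The first-letter scheme above can be sketched in plain Java without any Hadoop dependency. The class and method names here (FirstLetterIndex, indexOf, reducerFor) are made up for illustration, and the sketch assumes every key is a non-empty word starting with a lowercase English letter.

```java
// Illustrative sketch only; these names are not part of Hadoop.
class FirstLetterIndex {
    // Maps a word to an index in 0..25 using its first letter.
    // Assumption: the word is non-empty and starts with 'a'..'z'.
    static int indexOf(String word) {
        return word.charAt(0) - 'a';
    }

    // With 2 reducers: indexes 0-12 go to reducer 0, indexes 13-25 to reducer 1.
    static int reducerFor(String word) {
        return indexOf(word) <= 12 ? 0 : 1;
    }

    public static void main(String[] args) {
        for (String w : new String[] {"apple", "mango", "nut", "zebra"}) {
            System.out.println(w + " -> index " + indexOf(w) + ", reducer " + reducerFor(w));
        }
    }
}
```

Because every key on reducer 0 starts with a letter that sorts before every key on reducer 1, concatenating the two sorted outputs gives one globally sorted result.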
Note that so far we index by the first letter only. If we now have 30 reducers, an index range based on the first letter alone has only 26 values, so 4 reducers would be wasted. In that case we need to redefine the index range and how the index is computed. For example, we can use the first two letters: 0*26^1 + 0*26^0 = 0 represents aa, 0*26^1 + 1*26^0 = 1 represents ab, and so on. We then assign the first 1/30 of the index range to Reducer0, the next 1/30 to Reducer1, and so on. If the range is not evenly divisible, we could leave the remainder to the last reducer, but this is not the best solution, because the last reducer may be assigned too much data, which hurts the performance of the job. A better approach is the following:
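As a rough sketch of the two-letter index, the computation is a base-26 number built from the first two letters. The name TwoLetterIndex is hypothetical, and treating a one-letter word as if its second letter were 'a' is an assumption made here so that such a word lands in the same range as the words it sorts next to.

```java
// Illustrative sketch only; assumes keys are non-empty lowercase English words.
class TwoLetterIndex {
    // index = (first letter) * 26^1 + (second letter) * 26^0, giving 0..675.
    static int indexOf(String word) {
        int first = word.charAt(0) - 'a';
        // Assumption: a missing second letter counts as 'a' (value 0).
        int second = word.length() > 1 ? word.charAt(1) - 'a' : 0;
        return first * 26 + second;
    }

    public static void main(String[] args) {
        for (String w : new String[] {"aa", "ab", "ba", "zz", "a"}) {
            System.out.println(w + " -> " + indexOf(w));
        }
    }
}
```

The index grows with the sort order of the two-letter prefix, which is what lets contiguous index ranges map to contiguous key ranges.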
Suppose the index range is 0-82 (83 values) and we have 30 reducers. Each reducer then gets 83/30 = 2 indexes, with a remainder of 23. Under the naive scheme, Reducer29 would have to handle indexes 58-82, that is 25 indexes, while every other reducer handles only 2, which is clearly unbalanced. Instead, we should hand out the 23 leftover indexes one each to Reducer0 through Reducer22, so that those reducers handle 3 indexes each and the remaining reducers handle 2.
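One way to express this balanced assignment in code is shown below. This is a sketch under stated assumptions, not the partitioner used later in the article: the name BalancedRanges is hypothetical, and the index is assumed to lie in [0, indexRange). It relies on 83 = 30 * 2 + 23, so the first 23 reducers take 3 indexes each and the other 7 take 2.

```java
// Illustrative sketch only; not part of Hadoop.
class BalancedRanges {
    // Returns the reducer for an index in [0, indexRange). The first
    // (indexRange % numReducers) reducers each receive one extra index.
    static int reducerFor(int index, int indexRange, int numReducers) {
        if (index < 0 || index >= indexRange) {
            throw new IllegalArgumentException("index out of range: " + index);
        }
        int base = indexRange / numReducers;   // e.g. 83 / 30 = 2
        int extra = indexRange % numReducers;  // e.g. 83 % 30 = 23
        int boundary = extra * (base + 1);     // indexes below this go to the larger reducers
        if (index < boundary) {
            return index / (base + 1);
        }
        // Reached only when base >= 1, so this division is safe.
        return extra + (index - boundary) / base;
    }

    public static void main(String[] args) {
        int[] sizes = new int[30];
        for (int index = 0; index < 83; index++) {
            sizes[reducerFor(index, 83, 30)]++;
        }
        for (int r = 0; r < sizes.length; r++) {
            System.out.println("Reducer" + r + " handles " + sizes[r] + " indexes");
        }
    }
}
```

The same function also covers the case of more reducers than indexes: each index then gets its own reducer and the remaining reducers stay empty.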
The next question is:
We may need to specify the index range of the reducer's input key dynamically. To do this, our Partitioner has to implement the Configurable interface, because the Hadoop framework loads our custom Partitioner instance during initialization. When the framework instantiates the class through reflection, it checks whether the instance is Configurable, and if it is, it calls setConf and passes in the job's Configuration object. This lets us read configuration variables inside the Partitioner.
Java code implementation
import java.io.IOException;

import org.apache.hadoop.conf.Configurable;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Partitioner;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class GlobalSort {
    public static void main(String[] args) throws Exception {
        Configuration configuration = new Configuration();
        // Size of the index range: 26 means "index by the first letter only".
        configuration.set("key.indexrange", "26");
        Job job = Job.getInstance(configuration);
        job.setNumReduceTasks(2);
        job.setJarByClass(GlobalSort.class);
        job.setMapperClass(GlobalSortMapper.class);
        job.setMapOutputKeyClass(Text.class);
        job.setMapOutputValueClass(LongWritable.class);
        job.setReducerClass(GlobalSortReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(LongWritable.class);
        job.setPartitionerClass(GlobalSortPartitioner.class);
        FileInputFormat.setInputPaths(job, new Path("F:\\wc\\input"));
        FileOutputFormat.setOutputPath(job, new Path("F:\\wc\\output"));
        job.waitForCompletion(true);
    }
}

class GlobalSortMapper extends Mapper<LongWritable, Text, Text, LongWritable> {
    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        // value holds the content of one input line; split it into words.
        String[] splits = value.toString().split(" ");
        for (String str : splits) {
            if (str.isEmpty()) {
                continue; // skip empty tokens so the partitioner always sees a first letter
            }
            context.write(new Text(str), new LongWritable(1L));
        }
    }
}

class GlobalSortPartitioner extends Partitioner<Text, LongWritable> implements Configurable {
    private Configuration configuration = null;
    private int indexRange = 0;

    @Override
    public int getPartition(Text text, LongWritable longWritable, int numPartitions) {
        char[] chars = text.toString().toCharArray();
        int index = 0;
        if (indexRange == 26) {
            // A range of 26 means the index is taken from the first letter only.
            index = chars[0] - 'a';
        } else if (indexRange == 26 * 26) {
            // Index by the first two letters; a one-letter key counts as if
            // its second letter were 'a'.
            index = (chars[0] - 'a') * 26;
            if (chars.length > 1) {
                index += chars[1] - 'a';
            }
        }
        // At least one index per reducer. A partition number must stay below
        // numPartitions, so the case indexRange < numPartitions is handled here
        // instead of returning numPartitions.
        int perReducerCount = Math.max(1, indexRange / numPartitions);
        for (int i = 0; i < numPartitions; i++) {
            int min = i * perReducerCount;
            int max = (i + 1) * perReducerCount - 1;
            if (index >= min && index <= max) {
                return i;
            }
        }
        // This uses the first, less balanced scheme: leftovers go to the last reducer.
        return numPartitions - 1;
    }

    @Override
    public void setConf(Configuration conf) {
        this.configuration = conf;
        indexRange = configuration.getInt("key.indexrange", 26 * 26);
    }

    @Override
    public Configuration getConf() {
        return configuration;
    }
}

class GlobalSortReducer extends Reducer<Text, LongWritable, Text, LongWritable> {
    @Override
    protected void reduce(Text key, Iterable<LongWritable> values, Context context)
            throws IOException, InterruptedException {
        long count = 0;
        for (LongWritable value : values) {
            count += value.get();
        }
        context.write(key, new LongWritable(count));
    }
}
Input file:
hello a
hello abc
helo jyw
he lq
mo no m n
zz za
Output:
part-r-00000
a 1
abc 1
he 1
hello 2
helo 1
jyw 1
lq 1
m 1
mo 1
part-r-00001
n 1
no 1
za 1
zz 1
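To check that the two part files really form one global order, the job can be imitated in plain Java without a cluster. The sketch below is an assumption-laden stand-in, not the Hadoop job itself: TotalOrderCheck is a hypothetical name, a TreeMap per reducer plays the role of the sorted reducer input, and the partition rule is the first-letter rule for an index range of 26.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Illustrative simulation only; assumes lowercase English words
// and 1 <= numReducers <= 26.
class TotalOrderCheck {
    // First-letter partitioning for an index range of 26.
    static int partition(String word, int numReducers) {
        int index = word.charAt(0) - 'a';
        int perReducer = 26 / numReducers;
        // Leftover indexes fall to the last reducer.
        return Math.min(index / perReducer, numReducers - 1);
    }

    // Counts words per reducer, then concatenates the reducers' sorted outputs.
    static List<String> run(String[] lines, int numReducers) {
        List<TreeMap<String, Long>> reducers = new ArrayList<>();
        for (int i = 0; i < numReducers; i++) {
            reducers.add(new TreeMap<>());
        }
        for (String line : lines) {
            for (String word : line.split(" ")) {
                if (!word.isEmpty()) {
                    reducers.get(partition(word, numReducers)).merge(word, 1L, Long::sum);
                }
            }
        }
        List<String> out = new ArrayList<>();
        for (TreeMap<String, Long> reducer : reducers) {
            for (Map.Entry<String, Long> e : reducer.entrySet()) {
                out.add(e.getKey() + " " + e.getValue());
            }
        }
        return out;
    }

    public static void main(String[] args) {
        String[] input = {"hello a", "hello abc", "helo jyw", "he lq", "mo no m n", "zz za"};
        for (String line : run(input, 2)) {
            System.out.println(line);
        }
    }
}
```

Reading the concatenated list from top to bottom, the keys never go backwards, which is the property a total sort needs.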
How to determine the index range
List all possible values of the key; the number of values is the size of the index range.
If the key can take infinitely many values, do as in this example: enumerate all possible values of a part of the key that determines the sort order, such as the leading letters.
Summary
Step 1: Implement the sort logic in a WritableComparable key by implementing the compareTo method, or write a custom comparator, so that any two keys can be compared.
Step 2: Define a method that converts a reducer key instance to an index value.
Step 3: Implement a custom Partitioner.
The index range covering all reducer keys must be known.
Use the index of each key to assign that key to the appropriate reducer.