Total sorting in MapReduce

Last Update:2018-07-26 Source: Internet

Author: User


The problem

One of the benefits of MapReduce is that the data delivered to each reducer is always sorted by the reducer's input key. With a single reducer the whole output is therefore sorted, but jobs that can run with only one reducer are rare. With more than one reducer, MapReduce only guarantees that the records within each reducer are sorted by key; it gives no guarantee about the order between reducers.
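To illustrate the problem, here is a small plain-Java sketch (no Hadoop required) of hash-style partitioning with two reducers. It is only an assumption-laden illustration: the class and method names are hypothetical, and String.hashCode stands in for the hash of the real key type.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.TreeSet;

// Illustrative sketch only; not part of the original article's code.
final class HashPartitionDemo {
    /** Hash-style partitioning; String.hashCode stands in for the key's hash (assumption). */
    static int partitionOf(String key, int numReducers) {
        return (key.hashCode() & Integer.MAX_VALUE) % numReducers;
    }

    /** Returns one sorted key set per reducer, mimicking per-reducer sorting. */
    static List<TreeSet<String>> partition(String[] keys, int numReducers) {
        List<TreeSet<String>> parts = new ArrayList<>();
        for (int i = 0; i < numReducers; i++) {
            parts.add(new TreeSet<>());
        }
        for (String key : keys) {
            parts.get(partitionOf(key, numReducers)).add(key);
        }
        return parts;
    }

    public static void main(String[] args) {
        List<TreeSet<String>> parts = partition(new String[] {"a", "b", "c", "d"}, 2);
        System.out.println("reducer 0: " + parts.get(0)); // prints reducer 0: [b, d]
        System.out.println("reducer 1: " + parts.get(1)); // prints reducer 1: [a, c]
    }
}
```

Each reducer's keys are sorted, but reading reducer 0 and then reducer 1 gives b, d, a, c, which is not globally sorted.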

Solving the problem with total sorting

The technique for total sorting lives in the Partitioner implementation. We convert the value range of the key into a range of indexes; for example, if the keys are all lowercase English words, the first letter gives an index from 0 to 25. We then divide the indexes into contiguous ranges and assign each range to a reducer, so that every key handled by one reducer sorts before every key handled by the next.

If we can allocate 2 reducers, we can assign indexes 0-12 to the first reducer and indexes 13-25 to the second.
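A minimal sketch of this mapping, assuming keys are non-empty lowercase English words. The class and method names are hypothetical and are not taken from the article's own code.

```java
// Illustrative sketch only; names are hypothetical.
final class FirstLetterPartition {
    /** Maps a lowercase English word to an index in 0..25 using its first letter. */
    static int indexOf(String word) {
        return word.charAt(0) - 'a';
    }

    /** Splits the 26 indexes between two reducers: 0..12 go to 0, 13..25 go to 1. */
    static int reducerFor(String word) {
        return indexOf(word) <= 12 ? 0 : 1;
    }

    public static void main(String[] args) {
        for (String w : new String[] {"apple", "mango", "nut", "zebra"}) {
            System.out.println(w + " -> index " + indexOf(w) + ", reducer " + reducerFor(w));
        }
    }
}
```

Because the index ranges are contiguous and ordered, every word sent to reducer 0 sorts before every word sent to reducer 1.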
Note that here the index is computed from the first letter only. If we had 30 reducers, determining the index range from the first letter alone would be a problem: there are only 26 indexes, so 4 reducers would sit idle. In that case we need to redefine the index range and how the index is computed. For example, we can index by the first two letters: 0*26^1 + 0*26^0 = 0 represents "aa", 0*26^1 + 1*26^0 = 1 represents "ab", and so on. We then assign the first one-thirtieth of the index range to Reducer0, the second one-thirtieth to Reducer1, and so on. If the range is not evenly divisible, we could leave the remainder to the last reducer, but this is not the best solution, because the last reducer may be assigned too much data, which hurts the performance of the job. The best practice should be:
Suppose the index range is 0 to 82 (83 indexes) and we have 30 reducers, so each reducer should get 83/30 = 2 indexes, with 23 left over. Following the naive approach, Reducer29 would have to handle indexes 58 to 82, which is 25 indexes and clearly unbalanced. Instead, we should give one of the 23 leftover indexes to each of Reducer0 through Reducer22, so that those reducers handle 3 indexes each and the remaining reducers handle 2 each.
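The two-letter index and the balanced assignment can be sketched together as follows. This is an illustration under stated assumptions, not the article's implementation: keys are non-empty lowercase English words, a one-letter word is treated as if its second letter were 'a', and all names are hypothetical.

```java
// Illustrative sketch only; names are hypothetical.
final class BalancedIndexPartition {
    /** Index from the first two letters: "aa" is 0, "ab" is 1, ..., "zz" is 675. */
    static int twoLetterIndex(String word) {
        int first = word.charAt(0) - 'a';
        // Assumption: a one-letter word is indexed as if its second letter were 'a'.
        int second = word.length() > 1 ? word.charAt(1) - 'a' : 0;
        return first * 26 + second;
    }

    /**
     * Assigns an index in 0..indexRange-1 to a reducer so that the leftover
     * indexes are spread one each over the first reducers instead of all
     * landing on the last one.
     */
    static int reducerFor(int index, int indexRange, int numReducers) {
        int base = indexRange / numReducers;      // indexes every reducer gets
        int remainder = indexRange % numReducers; // reducers that get one extra
        int boundary = remainder * (base + 1);    // first index handled by a "base"-sized reducer
        if (index < boundary) {
            return index / (base + 1);
        }
        // Only reached when base > 0, because index < indexRange.
        return remainder + (index - boundary) / base;
    }

    public static void main(String[] args) {
        // The example from the text: 83 indexes spread over 30 reducers.
        for (int index : new int[] {0, 2, 3, 68, 69, 82}) {
            System.out.println("index " + index + " -> reducer " + reducerFor(index, 83, 30));
        }
        System.out.println("index of \"ab\": " + twoLetterIndex("ab"));
    }
}
```

With 83 indexes and 30 reducers, Reducer0 through Reducer22 each receive 3 indexes and Reducer23 through Reducer29 each receive 2, so no reducer handles more than one index above the average.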

The next question is:
We may need to specify the index range of the reducer input key dynamically. To do so, our Partitioner implements the Configurable interface. The Hadoop framework loads our custom Partitioner during initialization: when it instantiates the class through reflection, it checks whether the instance is Configurable and, if so, calls setConf with the job's Configuration object. The partitioner can then read configuration variables from it.

Java code implementation

import java.io.IOException;
import org.apache.hadoop.conf.Configurable;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Partitioner;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class GlobalSort {
    public static void main(String[] args) throws Exception {
        Configuration configuration = new Configuration();
        // Size of the key index range: 26 means "index by the first letter only".
        configuration.set("key.indexRange", "26");
        Job job = Job.getInstance(configuration);
        job.setNumReduceTasks(2);
        job.setJarByClass(GlobalSort.class);
        job.setMapperClass(GlobalSortMapper.class);
        job.setMapOutputKeyClass(Text.class);
        job.setMapOutputValueClass(LongWritable.class);
        job.setReducerClass(GlobalSortReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(LongWritable.class);
        job.setPartitionerClass(GlobalSortPartitioner.class);
        FileInputFormat.setInputPaths(job, new Path("F:\\wc\\input"));
        FileOutputFormat.setOutputPath(job, new Path("F:\\wc\\output"));
        job.waitForCompletion(true);
    }
}

class GlobalSortMapper extends Mapper<LongWritable, Text, Text, LongWritable> {
    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        // value holds one line of input; split it into words on spaces.
        String[] splits = value.toString().split(" ");
        for (String str : splits) {
            if (!str.isEmpty()) { // skip empty tokens produced by repeated spaces
                context.write(new Text(str), new LongWritable(1L));
            }
        }
    }
}

class GlobalSortPartitioner extends Partitioner<Text, LongWritable> implements Configurable {
    private Configuration configuration = null;
    private int indexRange = 0;

    @Override
    public int getPartition(Text text, LongWritable longWritable, int numPartitions) {
        // Assumes keys are non-empty lowercase English words.
        char[] chars = text.toString().toCharArray();
        int index = 0;
        if (indexRange == 26) {
            // Index by the first letter only.
            index = chars[0] - 'a';
        } else if (indexRange == 26 * 26) {
            // Index by the first two letters; a one-letter word counts as "<letter>a".
            index = (chars[0] - 'a') * 26 + (chars.length > 1 ? chars[1] - 'a' : 0);
        }
        if (indexRange < numPartitions) {
            // Fewer indexes than reducers: one index per reducer, the rest stay idle.
            return index;
        }
        int perReducerCount = indexRange / numPartitions;
        for (int i = 0; i < numPartitions; i++) {
            int min = i * perReducerCount;
            int max = (i + 1) * perReducerCount - 1;
            if (index >= min && index <= max) {
                return i;
            }
        }
        // This uses the first, less balanced method: leftovers go to the last reducer.
        return numPartitions - 1;
    }

    @Override
    public void setConf(Configuration conf) {
        this.configuration = conf;
        indexRange = configuration.getInt("key.indexRange", 26 * 26);
    }

    @Override
    public Configuration getConf() {
        return configuration;
    }
}

class GlobalSortReducer extends Reducer<Text, LongWritable, Text, LongWritable> {
    @Override
    protected void reduce(Text key, Iterable<LongWritable> values, Context context)
            throws IOException, InterruptedException {
        long count = 0;
        for (LongWritable value : values) {
            count += value.get();
        }
        context.write(key, new LongWritable(count));
    }
}

Input file:

hello a
hello abc
helo jyw
he lq
mo no m n
zz za

Output:

part-r-00000
a   1
abc 1
he  1
hello   2
helo    1
jyw 1
lq  1
m   1
mo  1

part-r-00001
n   1
no  1
za  1
zz  1
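
The run above can be approximated without a cluster. The following plain-Java sketch is only a simulation under assumptions: it applies the same first-letter rule to the sample input (all lowercase, at most 26 partitions) and uses a TreeMap per partition to stand in for the sorted reducer input. The class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.TreeMap;

// Illustrative simulation only; it does not use Hadoop.
final class SampleRunSimulation {
    static final String[] SAMPLE = {
        "hello a", "hello abc", "helo jyw", "he lq", "mo no m n", "zz za"
    };

    /** First-letter index, 26 indexes split evenly; assumes numPartitions <= 26. */
    static int partitionOf(String word, int numPartitions) {
        int index = word.charAt(0) - 'a';
        int perReducer = 26 / numPartitions;
        // Leftover indexes fall to the last partition, as in the simple method.
        return Math.min(index / perReducer, numPartitions - 1);
    }

    /** Counts words per partition; each TreeMap keeps its keys sorted. */
    static List<TreeMap<String, Long>> run(String[] lines, int numPartitions) {
        List<TreeMap<String, Long>> parts = new ArrayList<>();
        for (int i = 0; i < numPartitions; i++) {
            parts.add(new TreeMap<>());
        }
        for (String line : lines) {
            for (String word : line.split(" ")) {
                if (!word.isEmpty()) {
                    parts.get(partitionOf(word, numPartitions)).merge(word, 1L, Long::sum);
                }
            }
        }
        return parts;
    }

    public static void main(String[] args) {
        List<TreeMap<String, Long>> parts = run(SAMPLE, 2);
        for (int i = 0; i < parts.size(); i++) {
            System.out.println("partition " + i + ": " + parts.get(i));
        }
    }
}
```

Reading partition 0 and then partition 1 yields the words in one globally sorted sequence, which is the point of total sorting.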

How to determine the index range

List all possible values of the key; their count is the number of indexes.
If the key has infinitely many possible values, do as in this example: enumerate all possible values of a part of the key, such as its first letters, chosen so that it still distinguishes keys in the sort order.

Summary

Step 1: Implement the sort logic in the WritableComparable key by implementing its compareTo method, or write a custom comparator, so that keys can be compared by size.
Step 2: Define a method that converts a reducer input key instance to an index value.
Step 3: Implement a custom Partitioner:
    The index range of all reducer input keys should be known.
    Use the index of each key to assign it to the appropriate reducer.

This article is an English version of an article which is originally in the Chinese language on aliyun.com and is provided for information purposes only. This website makes no representation or warranty of any kind, either expressed or implied, as to the accuracy, completeness ownership or reliability of the article or any translations thereof. If you have any concerns or complaints relating to the article, please send an email, providing a detailed description of the concern or complaint, to info-contact@alibabacloud.com. A staff member will contact you within 5 working days. Once verified, infringing content will be removed immediately.
