Hadoop MapReduce custom grouping with RawComparator



This article is published on my blog.

This article follows up on my previous one, "Hadoop mapreduce custom sorting WritableComparable". Continuing in that order, this time I will explain how to implement custom grouping. I will not repeat the job execution sequence here; for details, see the notes on my blog.

First, look at the Job class and find the setGroupingComparatorClass() method. Its source code is as follows:

  /**
   * Define the comparator that controls which keys are grouped together
   * for a single call to
   * {@link Reducer#reduce(Object, Iterable,
   *                       org.apache.hadoop.mapreduce.Reducer.Context)}
   * @param cls the raw comparator to use
   * @throws IllegalStateException if the job is submitted
   */
  public void setGroupingComparatorClass(Class<? extends RawComparator> cls
                                         ) throws IllegalStateException {
    ensureState(JobState.DEFINE);
    conf.setOutputValueGroupingComparator(cls);
  }

From this method's source we can see that it sets a custom grouping function for keys. The custom grouping class must implement RawComparator. Let's look at the source of that interface:

/**
 * <p>
 * A {@link Comparator} that operates directly on byte representations of
 * objects.
 * </p>
 * @param <T>
 * @see DeserializerComparator
 */
public interface RawComparator<T> extends Comparator<T> {
  public int compare(byte[] b1, int s1, int l1, byte[] b2, int s2, int l2);
}

RawComparator itself extends the generic Comparator interface. With that understood, we define a class that implements RawComparator. The code is as follows:

public class MyGrouper implements RawComparator<SortAPI> {

    @Override
    public int compare(SortAPI o1, SortAPI o2) {
        return (int) (o1.first - o2.first);
    }

    @Override
    public int compare(byte[] b1, int s1, int l1, byte[] b2, int s2, int l2) {
        // Compare only the first 8 bytes, i.e. the serialized `first` long:
        // keys with the same `first` value fall into the same group.
        return WritableComparator.compareBytes(b1, s1, 8, b2, s2, 8);
    }
}
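To see why comparing only the first 8 bytes of each serialized key is equivalent to comparing the `first` field, here is a hedged, Hadoop-free sketch. The class name and the local `compareBytes` are my own stand-ins: `serialize` writes two big-endian longs the same way SortAPI.write() does, and `compareBytes` mimics the unsigned lexicographic comparison of WritableComparator.compareBytes (for non-negative longs, byte order matches numeric order).

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;

public class RawCompareSketch {
    // Serialize (first, second) as SortAPI.write() does: two big-endian longs.
    static byte[] serialize(long first, long second) {
        try {
            ByteArrayOutputStream bos = new ByteArrayOutputStream();
            DataOutputStream out = new DataOutputStream(bos);
            out.writeLong(first);
            out.writeLong(second);
            return bos.toByteArray();
        } catch (IOException e) {
            throw new RuntimeException(e); // never happens for an in-memory stream
        }
    }

    // Simplified stand-in for WritableComparator.compareBytes:
    // lexicographic comparison of unsigned bytes.
    static int compareBytes(byte[] b1, int s1, int l1, byte[] b2, int s2, int l2) {
        int n = Math.min(l1, l2);
        for (int i = 0; i < n; i++) {
            int a = b1[s1 + i] & 0xff;
            int b = b2[s2 + i] & 0xff;
            if (a != b) return a - b;
        }
        return l1 - l2;
    }

    public static void main(String[] args) {
        byte[] k1 = serialize(1L, 9L);
        byte[] k2 = serialize(1L, 2L);
        byte[] k3 = serialize(3L, 0L);
        // Comparing only the first 8 bytes ignores the second field entirely:
        System.out.println(compareBytes(k1, 0, 8, k2, 0, 8) == 0); // same `first` -> grouped
        System.out.println(compareBytes(k1, 0, 8, k3, 0, 8) < 0);  // 1 < 3
    }
}
```

Both lines print `true`: (1, 9) and (1, 2) are equal as far as the grouping comparator is concerned, because their first 8 serialized bytes are identical.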

Here SortAPI is the key class defined in the previous article on custom sorting. As the comment suggests, the first method compares two deserialized objects and returns an int; the second is called before deserialization, so it must compare the raw bytes. Next, the custom MyMapper class:

public class MyMapper extends Mapper<LongWritable, Text, SortAPI, LongWritable> {

    @Override
    protected void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
        String[] splied = value.toString().split("\t");
        try {
            long first = Long.parseLong(splied[0]);
            long second = Long.parseLong(splied[1]);
            context.write(new SortAPI(first, second), new LongWritable(1));
        } catch (Exception e) {
            System.out.println(e.getMessage());
        }
    }
}

Custom MyReduce class:

public class MyReduce extends Reducer<SortAPI, LongWritable, LongWritable, LongWritable> {

    @Override
    protected void reduce(SortAPI key, Iterable<LongWritable> values, Context context) throws IOException, InterruptedException {
        context.write(new LongWritable(key.first), new LongWritable(key.second));
    }
}

Custom SortAPI class:

public class SortAPI implements WritableComparable<SortAPI> {
    public Long first;
    public Long second;

    public SortAPI() {}

    public SortAPI(long first, long second) {
        this.first = first;
        this.second = second;
    }

    @Override
    public int compareTo(SortAPI o) {
        return (int) (this.first - o.first);
    }

    @Override
    public void write(DataOutput out) throws IOException {
        out.writeLong(first);
        out.writeLong(second);
    }

    @Override
    public void readFields(DataInput in) throws IOException {
        this.first = in.readLong();
        this.second = in.readLong();
    }

    @Override
    public int hashCode() {
        return this.first.hashCode() + this.second.hashCode();
    }

    @Override
    public boolean equals(Object obj) {
        if (obj instanceof SortAPI) {
            SortAPI o = (SortAPI) obj;
            // first/second are boxed Longs, so compare with equals(), not ==
            return this.first.equals(o.first) && this.second.equals(o.second);
        }
        return false;
    }

    @Override
    public String toString() {
        return "output: " + this.first + ";" + this.second;
    }
}

Next, prepare the data as follows:

1	2
1	1
3	0
3	2
2	2
1	2

Upload it to hdfs://hadoop-master:9000/grouper/input/test.txt. The main driver code is as follows:

public class Test {
    static final String OUTPUT_DIR = "hdfs://hadoop-master:9000/grouper/output/";
    static final String INPUT_DIR = "hdfs://hadoop-master:9000/grouper/input/test.txt";

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = new Job(conf, Test.class.getSimpleName());
        job.setJarByClass(Test.class);
        deleteOutputFile(OUTPUT_DIR);
        // 1 set the input path
        FileInputFormat.setInputPaths(job, INPUT_DIR);
        // 2 set the input format class
        job.setInputFormatClass(TextInputFormat.class);
        // 3 set the custom Mapper and its key-value types
        job.setMapperClass(MyMapper.class);
        job.setMapOutputKeyClass(SortAPI.class);
        job.setMapOutputValueClass(LongWritable.class);
        // 4 partitioning
        job.setPartitionerClass(HashPartitioner.class);
        job.setNumReduceTasks(1);
        // 5 sorting and grouping
        job.setGroupingComparatorClass(MyGrouper.class);
        // 6 set the custom Reducer and its key-value types
        job.setReducerClass(MyReduce.class);
        job.setOutputKeyClass(LongWritable.class);
        job.setOutputValueClass(LongWritable.class);
        // 7 set the output directory
        FileOutputFormat.setOutputPath(job, new Path(OUTPUT_DIR));
        // 8 submit the job
        job.waitForCompletion(true);
    }

    static void deleteOutputFile(String path) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(new URI(INPUT_DIR), conf);
        if (fs.exists(new Path(path))) {
            fs.delete(new Path(path));
        }
    }
}

Run the code, then on a cluster node enter hadoop fs -text /grouper/output/part-r-00000 to view the result:

1	2
2	2
3	0

Next, modify the compareTo () method of the SortAPI class:

    @Override
    public int compareTo(SortAPI o) {
        long mis = (this.first - o.first) * -1;
        if (mis != 0) {
            return (int) mis;
        } else {
            return (int) (this.second - o.second);
        }
    }
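One caveat of my own, not from the original code: casting the difference of two longs to int can overflow and flip the sign for values far apart. A safer sketch of the same "reverse on first, ascending on second" ordering uses Long.compare (SafeCompareSketch and its fields are hypothetical stand-ins mirroring SortAPI):

```java
public class SafeCompareSketch {
    // Hypothetical fields mirroring SortAPI.first / SortAPI.second.
    long first, second;

    SafeCompareSketch(long first, long second) {
        this.first = first;
        this.second = second;
    }

    // Reverse order on `first`, ascending on `second`, with no subtraction overflow.
    int compareTo(SafeCompareSketch o) {
        int byFirst = Long.compare(o.first, this.first); // arguments swapped = reversed
        return byFirst != 0 ? byFirst : Long.compare(this.second, o.second);
    }

    public static void main(String[] args) {
        SafeCompareSketch a = new SafeCompareSketch(Long.MAX_VALUE, 0);
        SafeCompareSketch b = new SafeCompareSketch(-1, 0);
        // A subtraction-based compare overflows on these inputs; Long.compare does not.
        System.out.println(a.compareTo(b) < 0); // reverse order: MAX_VALUE sorts first
    }
}
```

For the small values in this example data the subtraction version behaves identically, so this is only a robustness note.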

Execute again and view the/grouper/output/part-r-00000 file:

3	0
2	2
1	1

This shows that, for the same data, the grouping result is affected by the sort order: if the sort is reversed, grouping is performed over the reverse-ordered data. By printing records in the map and reduce functions (process omitted) and comparing the runs, we can describe the grouping stage: when two keys compare as equal, that is, compare(byte[] b1, int s1, int l1, byte[] b2, int s2, int l2) returns 0, they belong to the same group, and the first key encountered in sort order is the one written to the output buffer (which may be spilled to disk); the remaining keys of the same group are not output. An article I found on Baidu describes the same behavior from the values side: grouping gathers all values that the map function emitted for equal keys under one key, producing {key, values} pairs such as {1, {2, 1, 2}}, which is the same result as when no custom grouping is set at the beginning. This is exactly what the Iterable<LongWritable> values parameter of the reduce function exposes; in effect, grouping works much like a GROUP BY in a data query.
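To make the grouping behavior concrete, here is a hedged, Hadoop-free simulation of what happens between sort and reduce. The class and method names are my own: the keys are assumed already sorted (reverse on first, ascending on second, as in the second run above), and consecutive keys that the grouping comparator considers equal, i.e. same first field, collapse into one reduce call whose key is the first key of the group:

```java
import java.util.ArrayList;
import java.util.List;

public class GroupingSketch {
    // Keep only the first key of each run of keys with an equal first field,
    // mimicking which key object reaches each reduce() call.
    static List<long[]> firstOfEachGroup(long[][] sortedKeys) {
        List<long[]> reduceKeys = new ArrayList<>();
        for (int i = 0; i < sortedKeys.length; i++) {
            // Group boundary: the grouping comparator looks only at the first field.
            if (i == 0 || sortedKeys[i][0] != sortedKeys[i - 1][0]) {
                reduceKeys.add(sortedKeys[i]);
            }
        }
        return reduceKeys;
    }

    public static void main(String[] args) {
        // Keys sorted as in the second run: reverse on first, ascending on second.
        long[][] sortedKeys = { {3, 0}, {3, 2}, {2, 2}, {1, 1}, {1, 2}, {1, 2} };
        for (long[] k : firstOfEachGroup(sortedKeys)) {
            System.out.println(k[0] + "\t" + k[1]);
        }
    }
}
```

This prints 3/0, 2/2, 1/1, matching the second part-r-00000 result above; with the original ascending sort it would reproduce the first result.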

Here we should also understand the difference between grouping and partitioning. Partitioning classifies and splits the output into separate result files for easier consumption: for example, if one output file would otherwise contain HTTP requests of every status, partitioning by request status splits them into several result files. Grouping, on the other hand, reduces the number of key-value pairs with the same key that produce reduce output. After partitioning, all the data still reaches the reduce side; grouping is what shrinks it. The two steps also run at different stages.
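The partitioning side of that distinction fits in a few lines. This sketch uses the same formula as Hadoop's HashPartitioner.getPartition() (mask off the hash code's sign bit, then take the modulo over the reduce-task count), though the class here is my own stand-in; partitioning decides which reduce task a key travels to, before grouping ever runs:

```java
public class PartitionSketch {
    // Same formula Hadoop's HashPartitioner uses:
    // clear the sign bit of hashCode(), then modulo by the reduce-task count.
    static int getPartition(Object key, int numReduceTasks) {
        return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
    }

    public static void main(String[] args) {
        // With a single reducer (setNumReduceTasks(1) in the job above),
        // every key lands in partition 0, so partitioning changes nothing;
        // grouping still merges equal keys inside that one reducer.
        System.out.println(getPartition(Long.valueOf(1L), 1)); // 0
        // With several reducers, keys spread across partitions:
        System.out.println(getPartition(Long.valueOf(1L), 3));
        System.out.println(getPartition(Long.valueOf(2L), 3));
    }
}
```

Note the real job partitions on the whole SortAPI key (via its hashCode()); boxed longs are used here only to keep the sketch self-contained.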


That's all for this time. Keep recording!

