MapReduce sorting and examples

Sorting in MapReduce falls into four categories:
General sort
Partial sort
Global sort
Secondary sort (for example, given two columns of data, where the first column is equal, sort by the second column)

General sort

The general sort is the sorting that MapReduce performs by itself: the framework sorts map output keys during the shuffle.
The Text type is not suitable as a sort key for numeric data; IntWritable, LongWritable, and the other types that implement WritableComparable sort correctly.
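To see the difference concretely, here is a small standalone check, not from the original post: Text compares bytes lexicographically, while IntWritable compares numerically.

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;

public class CompareDemo {
    public static void main(String[] args) {
        // Lexicographic byte comparison: "10" sorts before "9" because '1' < '9'
        System.out.println(new Text("10").compareTo(new Text("9")));            // < 0
        // Numeric comparison: 10 sorts after 9, as a number should
        System.out.println(new IntWritable(10).compareTo(new IntWritable(9)));  // > 0
    }
}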

Partial sort

Keys are sorted by default during the map and reduce process. If full ordering is not required, the result can be output directly; each output file then contains records sorted by key, although there is no ordering across files.
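As a rough sketch of what partial sort amounts to in driver code (assuming the MyMapper and MyReducer classes from the Demo program later in this article, and the usual driver imports):

Configuration conf = new Configuration();
Job job = Job.getInstance(conf, "partial sort");
job.setJarByClass(Demo.class);
job.setMapperClass(Demo.MyMapper.class);
job.setReducerClass(Demo.MyReducer.class);
job.setMapOutputKeyClass(LongWritable.class);
job.setMapOutputValueClass(NullWritable.class);
job.setNumReduceTasks(3);   // three part files, each sorted by key internally
// No setPartitionerClass(...): the default HashPartitioner scatters keys
// across reducers, so files are sorted inside but unordered across files.
FileInputFormat.addInputPath(job, new Path("/input"));
FileOutputFormat.setOutputPath(job, new Path("/output"));
System.exit(job.waitForCompletion(true) ? 0 : 1);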

Global sort

The Hadoop platform does not provide global sorting of data out of the box, yet globally ordered output is a very common requirement in large-scale data processing. The most intuitive way to sort a large amount of data with Hadoop is to have the map do no processing and output everything directly to a single reduce task (a single reduce task is not well suited to large-scale data and is not efficient), letting Hadoop's own shuffle mechanism sort all the data, which the reduce then writes out directly.
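Reduced to configuration, this naive approach is the partial-sort sketch above with a single reduce task; one line carries the whole technique:

// Same job setup as the partial-sort sketch, but with one reducer: every key
// funnels through a single merge-sorted stream, so the lone output file is
// globally ordered.
job.setNumReduceTasks(1);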

To sort data globally in large-scale processing, the main idea is to divide the data into intervals: for example, to sort integers,
put [0,10000] in partition 0, (10000,20000] in partition 1, and so on.
When the data is uniformly distributed, each partition holds roughly the same amount of data; that is the ideal case. In practice, the distribution is often uneven and data skew appears, so partitioning by fixed intervals is no longer appropriate, and some help is needed: a sampler. The program below implements this interval-partitioning approach.

package sort;

import java.io.IOException;
import java.net.URI;
import java.net.URISyntaxException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Partitioner;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class Demo {
    private final static String INPUT_PATH = "hdfs://liguodong:8020/input";
    private final static String OUTPUT_PATH = "hdfs://liguodong:8020/output";

    // Emit each number as the map output key; sorting happens in the shuffle.
    public static class MyMapper
            extends Mapper<LongWritable, Text, LongWritable, NullWritable> {
        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            String[] values = value.toString().split("\\s+");
            context.write(new LongWritable(Long.parseLong(values[0])),
                    NullWritable.get());
        }
    }

    // Write the sorted keys back out.
    public static class MyReducer
            extends Reducer<LongWritable, NullWritable, LongWritable, NullWritable> {
        @Override
        protected void reduce(LongWritable key, Iterable<NullWritable> values,
                Context context) throws IOException, InterruptedException {
            context.write(key, NullWritable.get());
        }
    }

    // Hand-made interval partitioning: keys <= 100 go to partition 0,
    // keys in (100, 1000) go to partition 1, everything else to partition 2.
    public static class MyPartitioner
            extends Partitioner<LongWritable, NullWritable> {
        @Override
        public int getPartition(LongWritable key, NullWritable value,
                int numPartitions) {
            if (key.get() <= 100) {
                return 0 % numPartitions;
            }
            if (key.get() > 100 && key.get() < 1000) {
                return 1 % numPartitions;
            }
            return 2;
        }
    }

    public static void main(String[] args) throws ClassNotFoundException,
            IOException, InterruptedException, URISyntaxException {
        Configuration conf = new Configuration();
        final FileSystem fileSystem = FileSystem.get(new URI(INPUT_PATH), conf);
        if (fileSystem.exists(new Path(OUTPUT_PATH))) {
            fileSystem.delete(new Path(OUTPUT_PATH), true);
        }

        Job job = Job.getInstance(conf, "shuffle sort");
        job.setJarByClass(Demo.class);

        FileInputFormat.addInputPath(job, new Path(INPUT_PATH));
        job.setMapperClass(MyMapper.class);
        job.setMapOutputKeyClass(LongWritable.class);
        job.setMapOutputValueClass(NullWritable.class);

        job.setPartitionerClass(MyPartitioner.class);
        job.setReducerClass(MyReducer.class);
        job.setOutputKeyClass(LongWritable.class);
        job.setOutputValueClass(NullWritable.class);

        job.setNumReduceTasks(3);

        FileOutputFormat.setOutputPath(job, new Path(OUTPUT_PATH));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
[root@liguodong file]# vi sortsum
[root@liguodong file]# hdfs dfs -put sortsum /input
[root@liguodong file]# hdfs dfs -cat /input
43
6546
65
787
879
98
...

Run the jar:

[root@liguodong file]# yarn jar NumSort.jar

View the execution results:

[root@liguodong file]# hdfs dfs -ls /output/
Found 4 items
-rw-r--r--   1 root supergroup    0 2015-06-16 10:55 /output/_SUCCESS
-rw-r--r--   1 root supergroup   28 2015-06-16 10:55 /output/part-r-00000
-rw-r--r--   1 root supergroup      2015-06-16 10:55 /output/part-r-00001
-rw-r--r--   1 root supergroup      2015-06-16 10:55 /output/part-r-00002
[root@liguodong file]# hdfs dfs -cat /output/part-r-00000
2
7
98
[root@liguodong file]# hdfs dfs -cat /output/part-r-00001
543
567
675
787
879
[root@liguodong file]# hdfs dfs -cat /output/part-r-00002
5423
6546
6554

Defect of the above program: the partitions are chosen by hand, so the data may be split very unevenly, and the job is prone to data skew.
For this reason, Hadoop provides a sampler interface (InputSampler) that returns a set of samples, along with a TotalOrderPartitioner class that can be used for global sorting.

Hadoop 2.2.0 source (abridged):

package org.apache.hadoop.mapreduce.lib.partition;

/**
 * Utility for collecting samples and writing a partition file for
 * {@link TotalOrderPartitioner}.
 */
public class InputSampler<K,V> extends Configured implements Tool {

  /**
   * Interface to sample using an
   * {@link org.apache.hadoop.mapreduce.InputFormat}.
   */
  public interface Sampler<K,V> {
    /**
     * For a given job, collect and return a subset of the keys from the
     * input data.
     */
    K[] getSample(InputFormat<K,V> inf, Job job)
        throws IOException, InterruptedException;
  }

  /**
   * Samples the first n records from s splits.
   * Inexpensive way to sample random data.
   */
  public static class SplitSampler<K,V> implements Sampler<K,V> { }

  /**
   * Sample from random points in the input.
   * General-purpose sampler. Takes numSamples/maxSplitsSampled inputs from
   * each split.
   */
  public static class RandomSampler<K,V> implements Sampler<K,V> { }

  /**
   * Sample from s splits at regular intervals.
   * Useful for sorted data.
   */
  public static class IntervalSampler<K,V> implements Sampler<K,V> { }
}
package org.apache.hadoop.mapreduce.lib.partition;

/**
 * Partitioner effecting a total order by reading split points from
 * an externally generated source.
 */
public class TotalOrderPartitioner<K extends WritableComparable<?>,V>
    extends Partitioner<K,V> implements Configurable {

}

This Partitioner implementation can consume the partition file produced by the sampler; a usage sketch follows.
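Below is a minimal driver sketch of how the two classes fit together; the class name TotalSort, the paths, and the sampling parameters are illustrative assumptions, not taken from the original post. The sampler writes split points to a partition file, and TotalOrderPartitioner reads them back so that each reduce task receives a contiguous, ordered key range.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.SequenceFileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.partition.InputSampler;
import org.apache.hadoop.mapreduce.lib.partition.TotalOrderPartitioner;

public class TotalSort {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "total order sort");
        job.setJarByClass(TotalSort.class);

        // The sampler samples the InputFormat's keys, while the partitioner
        // partitions on the map output key, so the two types must match;
        // SequenceFile input with the identity mapper is the usual setup.
        job.setInputFormatClass(SequenceFileInputFormat.class);
        job.setMapOutputKeyClass(Text.class);
        job.setMapOutputValueClass(Text.class);
        job.setNumReduceTasks(3);

        FileInputFormat.addInputPath(job, new Path("/input"));
        FileOutputFormat.setOutputPath(job, new Path("/output"));

        // Tell the partitioner where its partition file lives.
        TotalOrderPartitioner.setPartitionFile(job.getConfiguration(),
                new Path("/tmp/partitions.lst"));
        job.setPartitionerClass(TotalOrderPartitioner.class);

        // Sample 10% of the records, up to 10000 samples from at most
        // 10 splits, then write numReduceTasks - 1 split points.
        InputSampler.Sampler<Text, Text> sampler =
                new InputSampler.RandomSampler<Text, Text>(0.1, 10000, 10);
        InputSampler.writePartitionFile(job, sampler);

        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}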

Secondary sort

Example:

key1 1
key2 2
key3 3
key2 1
key1 3

Intermediate Result:

<key1,1>  1
<key1,3>  3
<key2,1>  1
<key2,2>  2
<key3,3>  3

Sort Result:

key1 1
key1 3
key2 1
key2 2
key3 3

1. MapReduce sorts keys by default;
2. Main ideas:
Override the Partitioner to partition by key, producing the first order;
Refer to the following:
http://blog.csdn.net/scgaliguodong123_/article/details/46489357
Implement a WritableComparator to supply your own comparison logic, producing the second order on the key;
Refer to the following:
http://blog.csdn.net/scgaliguodong123_/article/details/46010947

The official example provided by Hadoop:

package org.apache.hadoop.examples;

import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.RawComparator;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.WritableComparable;
import org.apache.hadoop.io.WritableComparator;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Partitioner;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.util.GenericOptionsParser;

/**
 * This is an example Hadoop Map/Reduce application. It reads text input
 * files that must contain two integers per line. The output is sorted by
 * the first and second number and grouped on the first number.
 * To run: bin/hadoop jar hadoop-examples.jar secondarysort <in-dir> <out-dir>
 */
public class SecondarySort {

  /** A pair of integers, serialized in a byte-comparable format. */
  public static class IntPair implements WritableComparable<IntPair> {
    private int first = 0;
    private int second = 0;

    /** Set the left and right values. */
    public void set(int left, int right) {
      first = left;
      second = right;
    }
    public int getFirst() {
      return first;
    }
    public int getSecond() {
      return second;
    }
    /** Encoded as: MIN_VALUE -> 0, 0 -> -MIN_VALUE, MAX_VALUE -> -1 */
    @Override
    public void readFields(DataInput in) throws IOException {
      first = in.readInt() + Integer.MIN_VALUE;
      second = in.readInt() + Integer.MIN_VALUE;
    }
    @Override
    public void write(DataOutput out) throws IOException {
      out.writeInt(first - Integer.MIN_VALUE);
      out.writeInt(second - Integer.MIN_VALUE);
    }
    @Override
    public int hashCode() {
      return first * 157 + second;
    }
    @Override
    public boolean equals(Object right) {
      if (right instanceof IntPair) {
        IntPair r = (IntPair) right;
        return r.first == first && r.second == second;
      } else {
        return false;
      }
    }
    /** A Comparator that compares serialized IntPair. */
    public static class Comparator extends WritableComparator {
      public Comparator() {
        super(IntPair.class);
      }
      public int compare(byte[] b1, int s1, int l1,
                         byte[] b2, int s2, int l2) {
        return compareBytes(b1, s1, l1, b2, s2, l2);
      }
    }
    static {    // register this comparator
      WritableComparator.define(IntPair.class, new Comparator());
    }
    @Override
    public int compareTo(IntPair o) {
      if (first != o.first) {
        return first < o.first ? -1 : 1;
      } else if (second != o.second) {
        return second < o.second ? -1 : 1;
      } else {
        return 0;
      }
    }
  }

  /** Partition based on the first part of the pair. */
  public static class FirstPartitioner
      extends Partitioner<IntPair, IntWritable> {
    @Override
    public int getPartition(IntPair key, IntWritable value,
                            int numPartitions) {
      return Math.abs(key.getFirst() * 127) % numPartitions;
    }
  }

  /** Compare only the first int, so reduce is called once per first value. */
  public static class FirstGroupingComparator
      implements RawComparator<IntPair> {
    @Override
    public int compare(byte[] b1, int s1, int l1,
                       byte[] b2, int s2, int l2) {
      return WritableComparator.compareBytes(b1, s1, Integer.SIZE/8,
                                             b2, s2, Integer.SIZE/8);
    }
    @Override
    public int compare(IntPair o1, IntPair o2) {
      int l = o1.getFirst();
      int r = o2.getFirst();
      return l == r ? 0 : (l < r ? -1 : 1);
    }
  }

  /** Read two ints from each line and emit ((left, right), right). */
  public static class MapClass
      extends Mapper<LongWritable, Text, IntPair, IntWritable> {
    private final IntPair key = new IntPair();
    private final IntWritable value = new IntWritable();

    @Override
    public void map(LongWritable inKey, Text inValue, Context context)
        throws IOException, InterruptedException {
      StringTokenizer itr = new StringTokenizer(inValue.toString());
      int left = 0;
      int right = 0;
      if (itr.hasMoreTokens()) {
        left = Integer.parseInt(itr.nextToken());
        if (itr.hasMoreTokens()) {
          right = Integer.parseInt(itr.nextToken());
        }
        key.set(left, right);
        value.set(right);
        context.write(key, value);
      }
    }
  }

  /** A reducer that emits each group's values behind a separator line. */
  public static class Reduce
      extends Reducer<IntPair, IntWritable, Text, IntWritable> {
    private static final Text SEPARATOR =
        new Text("------------------------------------------------");
    private final Text first = new Text();

    @Override
    public void reduce(IntPair key, Iterable<IntWritable> values,
                       Context context) throws IOException, InterruptedException {
      context.write(SEPARATOR, null);
      first.set(Integer.toString(key.getFirst()));
      for (IntWritable value : values) {
        context.write(first, value);
      }
    }
  }

  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    String[] otherArgs = new GenericOptionsParser(conf, args).getRemainingArgs();
    if (otherArgs.length != 2) {
      System.err.println("Usage: secondarysort <in> <out>");
      System.exit(2);
    }
    Job job = new Job(conf, "secondary sort");
    job.setJarByClass(SecondarySort.class);
    job.setMapperClass(MapClass.class);
    job.setReducerClass(Reduce.class);
    // group and partition by the first int in the pair
    job.setPartitionerClass(FirstPartitioner.class);
    job.setGroupingComparatorClass(FirstGroupingComparator.class);
    // the map output is IntPair, IntWritable
    job.setMapOutputKeyClass(IntPair.class);
    job.setMapOutputValueClass(IntWritable.class);
    // the reduce output is Text, IntWritable
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(otherArgs[0]));
    FileOutputFormat.setOutputPath(job, new Path(otherArgs[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }

}

Operation Result:

[root@liguodong mapreduce]# hdfs dfs -put sort hdfs://liguodong:8020/input
[root@liguodong mapreduce]# hdfs dfs -cat hdfs://liguodong:8020/input
1 1
2 2
3 3
2 1
1 3

[root@liguodong mapreduce]# yarn jar hadoop-mapreduce-examples-2.6.0.jar secondarysort /input /output
...

[root@liguodong mapreduce]# hdfs dfs -cat /output/p*
------------------------------------------------
1       1
1       3
------------------------------------------------
2       1
2       2
------------------------------------------------
3       3
