Global ordering in MapReduce with TotalOrderPartitioner


We know that the MapReduce framework sorts the map output keys before feeding the data to the reducers, so every reducer's input is locally ordered. Hadoop's default partitioner is HashPartitioner, which partitions on the hash code of the output key, so the same key always goes to the same reducer, but it does not guarantee a global order. If you want globally sorted results (for example, to get the top N or bottom N records), you need TotalOrderPartitioner, which ensures both that the same key goes to the same reducer and that the output is globally ordered.

public class HashPartitioner<K, V> extends Partitioner<K, V> {
  /** Use {@link Object#hashCode()} to partition. */
  public int getPartition(K key, V value,
                          int numReduceTasks) {
    return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
  }
}
/**
 * Partitioner effecting a total order by reading split points from
 * an externally generated source.
 */
@InterfaceAudience.Public
@InterfaceStability.Stable
public class TotalOrderPartitioner<K extends WritableComparable<?>, V>
    extends Partitioner<K, V> implements Configurable {

  // By construction, we know if our keytype
  @SuppressWarnings("unchecked") // is memcmp-able and uses the trie
  public int getPartition(K key, V value, int numPartitions) {
    return partitions.findPartition(key);
  }
  // ...

TotalOrderPartitioner relies on a partition file to distribute keys. The partition file is a SequenceFile: if the number of reducers is set to N, the file contains N-1 split-point keys, ordered by the key comparator. TotalOrderPartitioner looks up which range a key falls into and routes it to the corresponding reducer.
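To make the routing concrete, here is a minimal, hypothetical sketch in plain Java (not the Hadoop implementation): with N = 4 reducers there are 3 split points, and a key is assigned to the reducer whose range it falls into.

import java.util.Arrays;

public class SplitPointLookup {
  // Conceptual version of the lookup: given N-1 sorted split points,
  // map a key to one of N reducers.
  static int findPartition(String key, String[] splitPoints) {
    // binarySearch returns (-(insertion point) - 1) when the key is not
    // itself a split point; keys equal to split point i go to reducer i+1.
    int pos = Arrays.binarySearch(splitPoints, key);
    return pos < 0 ? -pos - 1 : pos + 1;
  }

  public static void main(String[] args) {
    // 3 split points => 4 ranges: (-inf,"f"), ["f","m"), ["m","t"), ["t",+inf)
    String[] splits = {"f", "m", "t"};
    System.out.println(findPartition("apple", splits)); // 0
    System.out.println(findPartition("kiwi", splits));  // 1
    System.out.println(findPartition("pear", splits));  // 2
    System.out.println(findPartition("zebra", splits)); // 3
  }
}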

The writePartitionFile method of the InputSampler class samples the input files and creates the partition file. There are three sampling strategies (a construction sketch follows the list):

1. RandomSampler: random sampling

2. IntervalSampler: samples records at a fixed interval from each split; usually applied to already sorted data

3. SplitSampler: takes the first n records from each split
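For reference, the three samplers can be constructed as shown below. This is only a small sketch against the public InputSampler API; the numeric arguments (frequency, sample count, maximum number of splits) are illustrative values, not recommendations.

import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.lib.partition.InputSampler;

public class SamplerExamples {
  public static void main(String[] args) {
    // RandomSampler(freq, numSamples, maxSplitsSampled): select each key with
    // probability 0.1, keep at most 10000 samples, read at most 10 splits.
    InputSampler.Sampler<Text, Text> random =
        new InputSampler.RandomSampler<Text, Text>(0.1, 10000, 10);

    // IntervalSampler(freq, maxSplitsSampled): take records at a fixed
    // frequency (here roughly every 10th record) from at most 10 splits;
    // suited to input that is already sorted.
    InputSampler.Sampler<Text, Text> interval =
        new InputSampler.IntervalSampler<Text, Text>(0.1, 10);

    // SplitSampler(numSamples, maxSplitsSampled): take the first records of
    // each split, up to 10000 samples from at most 10 splits.
    InputSampler.Sampler<Text, Text> split =
        new InputSampler.SplitSampler<Text, Text>(10000, 10);
  }
}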


The partition file can be set with TotalOrderPartitioner.setPartitionFile(conf, partitionFile). When a TotalOrderPartitioner instance is created, its setConf method is called and the split keys are read from the partition file. If the key type is BinaryComparable (which, roughly speaking, covers string-like keys such as Text), a trie is constructed and a lookup costs O(n), where n is the depth of the trie. If the key type is not BinaryComparable, a BinarySearchNode is constructed and binary search is used, with lookup cost O(log n), where n is the number of reducers.

boolean natOrder =
    conf.getBoolean(NATURAL_ORDER, true);
if (natOrder && BinaryComparable.class.isAssignableFrom(keyClass)) {
  partitions = buildTrie((BinaryComparable[])splitPoints, 0,
      splitPoints.length, new byte[0],
      // Now that blocks of identical splitless trie nodes are
      // represented reentrantly, and we develop a leaf for any trie
      // node with only one split point, the only reason for a depth
      // limit is to refute stack overflow or bloat in the pathological
      // case where the split points are long and mostly look like bytes
      // iii...iixii...iii   .  Therefore, we make the default depth
      // limit large but not huge.
      conf.getInt(MAX_TRIE_DEPTH, 200));
} else {
  partitions = new BinarySearchNode(splitPoints, comparator);
}
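The branch above is driven by configuration, so the lookup strategy can be tuned from the job setup. A minimal sketch, assuming the NATURAL_ORDER and MAX_TRIE_DEPTH constants referenced in the excerpt are public (as they appear to be in Hadoop 2.x):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.lib.partition.TotalOrderPartitioner;

public class PartitionerTuning {
  public static void main(String[] args) {
    Configuration conf = new Configuration();
    // Setting natural order to false forces the binary-search path even for
    // BinaryComparable keys (useful when the comparator order differs from
    // raw byte order).
    conf.setBoolean(TotalOrderPartitioner.NATURAL_ORDER, false);
    // Cap the depth of the trie built for BinaryComparable keys.
    conf.setInt(TotalOrderPartitioner.MAX_TRIE_DEPTH, 64);
  }
}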

Sample program

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.KeyValueTextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.partition.InputSampler;
import org.apache.hadoop.mapreduce.lib.partition.InputSampler.RandomSampler;
import org.apache.hadoop.mapreduce.lib.partition.TotalOrderPartitioner;

public class TotalSortMR {

    public static int runTotalSortJob(String[] args) throws Exception {
        Path inputPath = new Path(args[0]);
        Path outputPath = new Path(args[1]);
        Path partitionFile = new Path(args[2]);
        int reduceNumber = Integer.parseInt(args[3]);

        // RandomSampler: the first parameter is the probability that a key
        // is selected, the second is the number of samples to take, and the
        // third is the maximum number of input splits to read.
        RandomSampler<Text, Text> sampler =
                new InputSampler.RandomSampler<Text, Text>(0.1, 10000, 10);

        Configuration conf = new Configuration();
        // Set the full path of the partition file in the configuration.
        TotalOrderPartitioner.setPartitionFile(conf, partitionFile);

        Job job = new Job(conf);
        job.setJobName("Total-sort");
        job.setJarByClass(TotalSortMR.class);
        job.setInputFormatClass(KeyValueTextInputFormat.class);
        job.setMapOutputKeyClass(Text.class);
        job.setMapOutputValueClass(Text.class);
        job.setNumReduceTasks(reduceNumber);

        // Set the partitioner class to TotalOrderPartitioner.
        job.setPartitionerClass(TotalOrderPartitioner.class);

        FileInputFormat.setInputPaths(job, inputPath);
        FileOutputFormat.setOutputPath(job, outputPath);
        outputPath.getFileSystem(conf).delete(outputPath, true);

        // Write the partition file to the path set in
        // mapreduce.totalorderpartitioner.path.
        InputSampler.writePartitionFile(job, sampler);

        return job.waitForCompletion(true) ? 0 : 1;
    }

    public static void main(String[] args) throws Exception {
        System.exit(runTotalSortJob(args));
    }
}

The above example uses InputSampler to create the partition file. In fact, you can also create it with a MapReduce job of your own: write a custom InputFormat that samples the input and sends the sampled keys to a single reducer, which writes out the split points.
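A hedged sketch of that alternative (all class names and the totalsort.* configuration keys are hypothetical, and it samples in a mapper rather than a custom InputFormat, but the idea is the same): a mapper emits a random subset of keys, a single reducer receives them already sorted, and in cleanup it writes N-1 evenly spaced keys as (key, NullWritable) pairs to a SequenceFile, which is the format TotalOrderPartitioner reads.

import java.io.IOException;
import java.util.ArrayList;
import java.util.List;
import java.util.Random;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class SamplingPartitionFileJob {

  /** Emits roughly 1% of the input keys; the sampling rate is illustrative. */
  public static class SamplingMapper
      extends Mapper<Text, Text, Text, NullWritable> {
    private final Random rand = new Random();

    @Override
    protected void map(Text key, Text value, Context context)
        throws IOException, InterruptedException {
      if (rand.nextDouble() < 0.01) {
        context.write(key, NullWritable.get());
      }
    }
  }

  /** Single reducer: keys arrive sorted, so evenly spaced samples become split points. */
  public static class SplitPointReducer
      extends Reducer<Text, NullWritable, Text, NullWritable> {
    private final List<Text> samples = new ArrayList<Text>();

    @Override
    protected void reduce(Text key, Iterable<NullWritable> values, Context context) {
      samples.add(new Text(key));
    }

    @Override
    protected void cleanup(Context context)
        throws IOException, InterruptedException {
      Configuration conf = context.getConfiguration();
      // Hypothetical configuration keys: where to write the partition file
      // and how many reducers the final sort job will use.
      Path dst = new Path(conf.get("totalsort.partition.file"));
      int reducers = conf.getInt("totalsort.reduce.number", 4);

      SequenceFile.Writer writer = SequenceFile.createWriter(conf,
          SequenceFile.Writer.file(dst),
          SequenceFile.Writer.keyClass(Text.class),
          SequenceFile.Writer.valueClass(NullWritable.class));
      try {
        // Write reducers-1 split points, evenly spaced over the sorted samples
        // (assumes there are at least `reducers` samples).
        for (int i = 1; i < reducers; i++) {
          writer.append(samples.get(i * samples.size() / reducers),
              NullWritable.get());
        }
      } finally {
        writer.close();
      }
    }
  }
}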

PS: Hive 0.12 implements parallel ORDER BY (https://issues.apache.org/jira/browse/HIVE-1402) based on TotalOrderPartitioner, a very solid new feature.

Author: CSDN blog lalaguozhe
