Using Hadoop MapReduce for sorting

Source: Internet
Author: User
Tags: split, hadoop, mapreduce

Hadoop's TeraSort example demonstrates sorting with MapReduce. This article draws on and simplifies that example.

The basic idea is to take advantage of MapReduce's automatic sorting. In Hadoop, between the map and reduce phases, each map output record is assigned to a reducer according to the hash of its key, and within each reducer the keys arrive in sorted order. If we used a single reducer, we could simply write out its output and be done, but that would not exploit the distributed nature of the framework, so we want to run multiple reducers.
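The default assignment described above can be sketched as plain arithmetic. The following is a self-contained illustration (the class name is made up for this article; the formula mirrors the logic of Hadoop's default HashPartitioner):

```java
public class HashPartitionSketch {

    // Mirrors the arithmetic of Hadoop's default HashPartitioner:
    // mask off the sign bit, then take the remainder modulo the reducer count.
    static int getPartition(Object key, int numReduceTasks) {
        return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
    }

    public static void main(String[] args) {
        // Integer.hashCode() is the value itself, so the mapping is easy to follow.
        System.out.println(getPartition(42, 10)); // 2
        System.out.println(getPartition(7, 10));  // 7
    }
}
```

Note that hash partitioning balances load but scatters key ranges: adjacent keys land on different reducers, which is exactly why simply concatenating the reducer outputs does not yield a sorted file.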

Suppose we have 1,000 values in the range 1 to 10,000 and run 10 reduce tasks. With a custom partitioner we can send the values 1-1000 to the first reducer, 1001-2000 to the second, and so on: every key handled by the nth reducer is greater than every key handled by the (n-1)th. Each reducer's output is then sorted on its own, so we only need to concatenate (cat) the output files in order to obtain one large, fully sorted file.
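The range-partitioning arithmetic just described can be sketched as follows (a minimal illustration assuming the key range 1-10,000 is known in advance; the class and method names are hypothetical):

```java
public class RangePartitionSketch {

    // Split the known key range [1, 10000] evenly across the reducers:
    // keys 1-1000 go to reducer 0, 1001-2000 to reducer 1, and so on.
    static int getPartition(int key, int numReduceTasks) {
        int min = 1, max = 10000;
        int span = (max - min + 1) / numReduceTasks; // 1000 keys per reducer
        int p = (key - min) / span;
        return Math.min(p, numReduceTasks - 1);      // clamp the top edge
    }

    public static void main(String[] args) {
        System.out.println(getPartition(1, 10));     // 0
        System.out.println(getPartition(1000, 10));  // 0
        System.out.println(getPartition(1001, 10));  // 1
        System.out.println(getPartition(10000, 10)); // 9
    }
}
```

Because partition index increases monotonically with the key, reducer n's output file follows reducer n-1's, and concatenation preserves the global order.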

That is the basic idea, but one question remains: how do we choose the interval boundaries? The data volume is large and we do not know the key distribution in advance. A relatively simple approach is sampling: if there are 100 million records, we can sample, say, 10,000 of them and derive the interval boundaries from the sample. In Hadoop, we can replace the default partitioner with TotalOrderPartitioner and pass it the sampling result to get the partitioning we want. For the sampling itself, Hadoop's InputSampler provides several samplers, including RandomSampler and IntervalSampler.

This lets us sort large amounts of data on a distributed file system. We can also supply a custom comparator for the keys (for example via JobConf.setOutputKeyComparatorClass) to define the comparison rule, so that strings or other non-numeric types can be sorted, or so that keys are compared on two or more fields.
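As a plain-Java illustration of such a two-level comparison rule (the same idea you would implement in a custom comparator for the job's keys; the class name and ordering are invented for this sketch), here strings are compared first by length and then lexicographically:

```java
import java.util.Arrays;
import java.util.Comparator;

public class TwoLevelSort {

    // Primary key: string length; secondary key: lexicographic order.
    static final Comparator<String> BY_LENGTH_THEN_TEXT =
        Comparator.<String>comparingInt(String::length)
                  .thenComparing(Comparator.naturalOrder());

    public static void main(String[] args) {
        String[] keys = {"bb", "a", "ab", "c"};
        Arrays.sort(keys, BY_LENGTH_THEN_TEXT);
        System.out.println(Arrays.toString(keys)); // [a, c, ab, bb]
    }
}
```

In a real job the comparison would be written against the key's serialized bytes (a RawComparator) for speed, but the ordering logic is the same.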

Reference: "Hadoop: The Definitive Guide" contains a detailed explanation.

CxfInputFormat.java

package com.alibaba.cxf.sort;

import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileSplit;
import org.apache.hadoop.mapred.InputSplit;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.LineRecordReader;
import org.apache.hadoop.mapred.RecordReader;
import org.apache.hadoop.mapred.Reporter;

public class CxfInputFormat extends FileInputFormat<IntWritable, Text> {

    @Override
    public RecordReader<IntWritable, Text> getRecordReader(InputSplit split,
            JobConf job, Reporter reporter) throws IOException {
        return new CxfRecordReader(job, (FileSplit) split);
    }

    // Wraps LineRecordReader and parses the leading digits of each line
    // into an IntWritable key, so the framework can sort on it.
    static class CxfRecordReader implements RecordReader<IntWritable, Text> {

        private final LineRecordReader in;
        private final LongWritable junk = new LongWritable(); // byte offset, discarded
        private final Text line = new Text();
        private static final int KEY_LENGTH = 10;

        public CxfRecordReader(JobConf job, FileSplit split) throws IOException {
            in = new LineRecordReader(job, split);
        }

        @Override
        public void close() throws IOException {
            in.close();
        }

        @Override
        public IntWritable createKey() {
            return new IntWritable();
        }

        @Override
        public Text createValue() {
            return new Text();
        }

        @Override
        public long getPos() throws IOException {
            return in.getPos();
        }

        @Override
        public float getProgress() throws IOException {
            return in.getProgress();
        }

        @Override
        public boolean next(IntWritable key, Text value) throws IOException {
            if (in.next(junk, line)) {
                if (line.getLength() < KEY_LENGTH) {
                    // The whole line is the key.
                    key.set(Integer.parseInt(line.toString().trim()));
                } else {
                    // The first KEY_LENGTH characters are the key.
                    key.set(Integer.parseInt(line.toString().substring(0, KEY_LENGTH).trim()));
                }
                // Only the key matters for sorting; leave the value empty.
                // (Note: reassigning the value parameter, as the original
                // code did, has no effect outside this method.)
                value.clear();
                return true;
            }
            return false;
        }
    }
}

SortByMapReduce.java

package com.alibaba.cxf.sort;

import java.io.IOException;
import java.net.URI;
import java.net.URISyntaxException;

import org.apache.hadoop.filecache.DistributedCache;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.TextOutputFormat;
import org.apache.hadoop.mapred.lib.InputSampler;
import org.apache.hadoop.mapred.lib.TotalOrderPartitioner;

public class SortByMapReduce {

    /**
     * @param args input path and output path
     * @throws URISyntaxException
     * @throws IOException
     */
    public static void main(String[] args) throws IOException, URISyntaxException {
        runJob(args);
    }

    private static void runJob(String[] args) throws IOException, URISyntaxException {

        JobConf conf = new JobConf(SortByMapReduce.class);

        FileInputFormat.setInputPaths(conf, new Path(args[0]));
        FileOutputFormat.setOutputPath(conf, new Path(args[1]));
        conf.setJobName("SortByMapReduce");

        conf.setInputFormat(CxfInputFormat.class);
        conf.setOutputKeyClass(IntWritable.class);
        conf.setOutputFormat(TextOutputFormat.class);
        conf.setNumReduceTasks(5);
        conf.setPartitionerClass(TotalOrderPartitioner.class);

        // Sample 10% of the records (at most 10,000 samples from at most
        // 10 splits) to compute the partition boundaries.
        InputSampler.RandomSampler<IntWritable, NullWritable> sampler =
            new InputSampler.RandomSampler<IntWritable, NullWritable>(0.1, 10000, 10);

        Path input = FileInputFormat.getInputPaths(conf)[0];
        input = input.makeQualified(input.getFileSystem(conf));
        Path partitionFile = new Path(input, "_partitions");
        TotalOrderPartitioner.setPartitionFile(conf, partitionFile);
        InputSampler.writePartitionFile(conf, sampler);

        // Ship the partition file to every task via the distributed cache.
        URI partitionUri = new URI(partitionFile.toString() + "#_partitions");
        DistributedCache.addCacheFile(partitionUri, conf);
        DistributedCache.createSymlink(conf);

        JobClient.runJob(conf);
    }
}
