Using Hadoop MapReduce for sorting

Source: Internet
Author: User
Tags: split, hadoop, mapreduce

Hadoop's TeraSort example demonstrates sorting with MapReduce. This article draws on and simplifies that example.

The basic idea is to take advantage of MapReduce's automatic sorting. In Hadoop, between the map and reduce phases, each map output record is assigned to a reducer according to the hash of its key, and within each reducer the keys arrive in sorted order. If we used a single reducer, we could simply write out its output and be done, but that would not exploit the distributed nature of the framework, so we want to run multiple reducers.
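The default assignment described above can be sketched as plain arithmetic. The following is a self-contained illustration (the class name is made up for this article; the formula mirrors the logic of Hadoop's default HashPartitioner):

```java
public class HashPartitionSketch {

    // Mirrors the arithmetic of Hadoop's default HashPartitioner:
    // mask off the sign bit, then take the remainder modulo the reducer count.
    static int getPartition(Object key, int numReduceTasks) {
        return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
    }

    public static void main(String[] args) {
        // Integer.hashCode() is the value itself, so the mapping is easy to follow.
        System.out.println(getPartition(42, 10)); // 2
        System.out.println(getPartition(7, 10));  // 7
    }
}
```

Note that hash partitioning balances load but scatters key ranges: adjacent keys land on different reducers, which is exactly why simply concatenating the reducer outputs does not yield a sorted file.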

Suppose we have 1,000 values in the range 1 to 10,000 and run 10 reduce tasks. With a custom partitioner we can send the values 1-1000 to the first reducer, 1001-2000 to the second, and so on: every key handled by the nth reducer is greater than every key handled by the (n-1)th. Each reducer's output is then sorted on its own, so we only need to concatenate (cat) the output files in order to obtain one large, fully sorted file.
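The range-partitioning arithmetic just described can be sketched as follows (a minimal illustration assuming the key range 1-10,000 is known in advance; the class and method names are hypothetical):

```java
public class RangePartitionSketch {

    // Split the known key range [1, 10000] evenly across the reducers:
    // keys 1-1000 go to reducer 0, 1001-2000 to reducer 1, and so on.
    static int getPartition(int key, int numReduceTasks) {
        int min = 1, max = 10000;
        int span = (max - min + 1) / numReduceTasks; // 1000 keys per reducer
        int p = (key - min) / span;
        return Math.min(p, numReduceTasks - 1);      // clamp the top edge
    }

    public static void main(String[] args) {
        System.out.println(getPartition(1, 10));     // 0
        System.out.println(getPartition(1000, 10));  // 0
        System.out.println(getPartition(1001, 10));  // 1
        System.out.println(getPartition(10000, 10)); // 9
    }
}
```

Because partition index increases monotonically with the key, reducer n's output file follows reducer n-1's, and concatenation preserves the global order.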

That is the basic idea, but one question remains: how do we choose the interval boundaries? The data volume is large and we do not know the key distribution in advance. A relatively simple approach is sampling: if there are 100 million records, we can sample, say, 10,000 of them and derive the interval boundaries from the sample. In Hadoop, we can replace the default partitioner with TotalOrderPartitioner and pass it the sampling result to get the partitioning we want. For the sampling itself, Hadoop's InputSampler provides several samplers, including RandomSampler and IntervalSampler.

This lets us sort large amounts of data on a distributed file system. We can also supply a custom comparator for the keys (for example via JobConf.setOutputKeyComparatorClass) to define the comparison rule, so that strings or other non-numeric types can be sorted, or so that keys are compared on two or more fields.
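As a plain-Java illustration of such a two-level comparison rule (the same idea you would implement in a custom comparator for the job's keys; the class name and ordering are invented for this sketch), here strings are compared first by length and then lexicographically:

```java
import java.util.Arrays;
import java.util.Comparator;

public class TwoLevelSort {

    // Primary key: string length; secondary key: lexicographic order.
    static final Comparator<String> BY_LENGTH_THEN_TEXT =
        Comparator.<String>comparingInt(String::length)
                  .thenComparing(Comparator.naturalOrder());

    public static void main(String[] args) {
        String[] keys = {"bb", "a", "ab", "c"};
        Arrays.sort(keys, BY_LENGTH_THEN_TEXT);
        System.out.println(Arrays.toString(keys)); // [a, c, ab, bb]
    }
}
```

In a real job the comparison would be written against the key's serialized bytes (a RawComparator) for speed, but the ordering logic is the same.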

Reference: "Hadoop: The Definitive Guide" contains a detailed explanation.

CxfInputFormat.java

package com.alibaba.cxf.sort;

import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileSplit;
import org.apache.hadoop.mapred.InputSplit;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.LineRecordReader;
import org.apache.hadoop.mapred.RecordReader;
import org.apache.hadoop.mapred.Reporter;

public class CxfInputFormat extends FileInputFormat<IntWritable, Text> {

    @Override
    public RecordReader<IntWritable, Text> getRecordReader(InputSplit split,
            JobConf job, Reporter reporter) throws IOException {
        return new CxfRecordReader(job, (FileSplit) split);
    }

    // Wraps LineRecordReader and parses the leading digits of each line
    // into an IntWritable key, so the framework can sort on it.
    static class CxfRecordReader implements RecordReader<IntWritable, Text> {

        private final LineRecordReader in;
        private final LongWritable junk = new LongWritable(); // byte offset, discarded
        private final Text line = new Text();
        private static final int KEY_LENGTH = 10;

        public CxfRecordReader(JobConf job, FileSplit split) throws IOException {
            in = new LineRecordReader(job, split);
        }

        @Override
        public void close() throws IOException {
            in.close();
        }

        @Override
        public IntWritable createKey() {
            return new IntWritable();
        }

        @Override
        public Text createValue() {
            return new Text();
        }

        @Override
        public long getPos() throws IOException {
            return in.getPos();
        }

        @Override
        public float getProgress() throws IOException {
            return in.getProgress();
        }

        @Override
        public boolean next(IntWritable key, Text value) throws IOException {
            if (in.next(junk, line)) {
                if (line.getLength() < KEY_LENGTH) {
                    // The whole line is the key.
                    key.set(Integer.parseInt(line.toString().trim()));
                } else {
                    // The first KEY_LENGTH characters are the key.
                    key.set(Integer.parseInt(line.toString().substring(0, KEY_LENGTH).trim()));
                }
                // Only the key matters for sorting; leave the value empty.
                // (Note: reassigning the value parameter, as the original
                // code did, has no effect outside this method.)
                value.clear();
                return true;
            }
            return false;
        }
    }
}

SortByMapReduce.java

package com.alibaba.cxf.sort;

import java.io.IOException;
import java.net.URI;
import java.net.URISyntaxException;

import org.apache.hadoop.filecache.DistributedCache;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.TextOutputFormat;
import org.apache.hadoop.mapred.lib.InputSampler;
import org.apache.hadoop.mapred.lib.TotalOrderPartitioner;

public class SortByMapReduce {

    /**
     * @param args input path and output path
     * @throws URISyntaxException
     * @throws IOException
     */
    public static void main(String[] args) throws IOException, URISyntaxException {
        runJob(args);
    }

    private static void runJob(String[] args) throws IOException, URISyntaxException {

        JobConf conf = new JobConf(SortByMapReduce.class);

        FileInputFormat.setInputPaths(conf, new Path(args[0]));
        FileOutputFormat.setOutputPath(conf, new Path(args[1]));
        conf.setJobName("SortByMapReduce");

        conf.setInputFormat(CxfInputFormat.class);
        conf.setOutputKeyClass(IntWritable.class);
        conf.setOutputFormat(TextOutputFormat.class);
        conf.setNumReduceTasks(5);
        conf.setPartitionerClass(TotalOrderPartitioner.class);

        // Sample 10% of the records (at most 10,000 samples from at most
        // 10 splits) to compute the partition boundaries.
        InputSampler.RandomSampler<IntWritable, NullWritable> sampler =
            new InputSampler.RandomSampler<IntWritable, NullWritable>(0.1, 10000, 10);

        Path input = FileInputFormat.getInputPaths(conf)[0];
        input = input.makeQualified(input.getFileSystem(conf));
        Path partitionFile = new Path(input, "_partitions");
        TotalOrderPartitioner.setPartitionFile(conf, partitionFile);
        InputSampler.writePartitionFile(conf, sampler);

        // Ship the partition file to every task via the distributed cache.
        URI partitionUri = new URI(partitionFile.toString() + "#_partitions");
        DistributedCache.addCacheFile(partitionUri, conf);
        DistributedCache.createSymlink(conf);

        JobClient.runJob(conf);
    }
}
