Hadoop MapReduce Sorting Principle


Hadoop Case 3: Sorting Data (Entry Level)


"Data Sorting" is the first work to be done when many actual tasks are executed,
such as student performance appraisal, data indexing and so on. This example and data deduplication is similar to the original data is initially processed, for further data operations to lay a good foundation. Enter this example below.


1. Requirements Description
Sort the data in the input files. Each line of an input file is a number, that is, one datum.
Each line of the output should contain two numbers separated by a space: the first is the rank of the datum within the sorted dataset, and the second is the original datum itself.


2. Raw Data


1) File1:

2
32
654
32
15
756
65223

2) File2:

5956
22
650
92


3) File3:

26
54
6

Sample output:

1 2
2 6
3 15
4 22
5 26
6 32
7 32
8 54
9 92
10 650
11 654
12 756
13 5956
14 65223


3. Design Thinking
This example simply requires sorting the input data. Readers familiar with the MapReduce process will quickly recall that MapReduce already sorts by key between the map and reduce phases, and will wonder whether this default sort can be exploited instead of implementing a sort by hand. The answer is yes.


But before exploiting the default sort, you need to know how it works. MapReduce sorts by the key value: if the key is an IntWritable (the Writable wrapper for int), the keys are sorted in numeric order; if the key is a Text (the wrapper for String), the keys are sorted in dictionary order.
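As a quick illustration (a standalone sketch, not part of the job itself), the comparisons below show why the numbers must be wrapped in IntWritable rather than left as Text: as strings, "15" sorts before "2" because '1' precedes '2' in dictionary order.

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;

public class SortOrderDemo {
    public static void main(String[] args) {
        // Numeric order: 15 > 2, so compareTo returns a positive value
        System.out.println(new IntWritable(15).compareTo(new IntWritable(2)));
        // Dictionary order: "15" < "2" because '1' precedes '2', so the result is negative
        System.out.println(new Text("15").compareTo(new Text("2")));
    }
}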


Knowing this detail, we know the job should use IntWritable as its key type. That is, the mapper converts each number it reads into an IntWritable and emits it as the key (the value can be anything).
When the reducer receives a <key, value-list> pair, it writes the input key out as the output value, once per element in the value-list; the duplicated value 32, for example, arrives as <32, [1, 1]> and is therefore written twice, under consecutive ranks. The output key (the num field in the code below) is a field of the reducer that tracks the rank of the current key. Note that no Combiner is configured in this program, that is, no combiner runs in the MapReduce process; map and reduce alone are enough to complete the task. Note also that this approach produces a globally sorted result only when a single reduce task is used, which is the default.


4. Map code
package com.wy.hadoop.sort;

import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class IntSortMapper extends Mapper<Object, Text, IntWritable, IntWritable> {

    // Fixed value emitted with every key; only the key matters for sorting
    private final IntWritable val = new IntWritable(1);

    @Override
    protected void map(Object key, Text value, Context context)
            throws IOException, InterruptedException {
        // Parse the line as an integer and emit it as the key,
        // so the shuffle phase sorts the data numerically
        context.write(new IntWritable(Integer.valueOf(value.toString().trim())), val);
    }
}


5. Reduce Code
package com.wy.hadoop.sort;

import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.mapreduce.Reducer;

public class IntSortReducer extends Reducer<IntWritable, IntWritable, IntWritable, IntWritable> {

    // Rank of the current key; keys arrive at the reducer already sorted
    private IntWritable num = new IntWritable(1);

    @Override
    protected void reduce(IntWritable key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        // Write one line per occurrence of the key (rank, then the key itself),
        // so duplicate values are preserved under consecutive ranks
        for (IntWritable tmp : values) {
            context.write(num, key);
            num = new IntWritable(num.get() + 1);
        }
    }
}




6. Main code
package com.wy.hadoop.sort;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;

public class IntSortJob extends Configured implements Tool, Runnable {

    private String inputPath = null;
    private String outputPath = null;

    public IntSortJob(String inputPath, String outputPath) {
        this.inputPath = inputPath;
        this.outputPath = outputPath;
    }

    public IntSortJob() {}

    @Override
    public void run() {
        try {
            String[] args = {this.inputPath, this.outputPath};
            start(args);
        } catch (Exception e) {
            e.printStackTrace();
        }
    }

    private void start(String[] args) throws Exception {
        // ToolRunner parses generic Hadoop options, then calls run(String[])
        ToolRunner.run(new IntSortJob(), args);
    }

    @Override
    public int run(String[] args) throws Exception {
        Configuration configuration = new Configuration();
        // Delete a leftover output directory so the job can be rerun
        FileSystem fs = FileSystem.get(configuration);
        fs.delete(new Path(args[1]), true);

        Job job = new Job(configuration, "intsortjob");
        job.setJarByClass(IntSortJob.class);

        job.setMapperClass(IntSortMapper.class);
        job.setReducerClass(IntSortReducer.class);

        job.setOutputKeyClass(IntWritable.class);
        job.setOutputValueClass(IntWritable.class);

        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));

        boolean success = job.waitForCompletion(true);

        return success ? 0 : 1;
    }
}

package com.wy.hadoop.sort;

public class JobMain {

    /**
     * @param args the HDFS input path and output path
     */
    public static void main(String[] args) {
        if (args.length == 2) {
            // Launch the job on a background thread via the Runnable wrapper
            new Thread(new IntSortJob(args[0], args[1])).start();
        }
    }
}






7. Create three files locally, put the test data in them, and upload the files to Hadoop HDFS.

8. Package the program as a jar, copy it to the Linux system, and run it with the hadoop jar command.

9. View the results. Example commands for these three steps are sketched below.
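The exact paths and jar name depend on your setup; assuming a jar named sort.jar and illustrative HDFS directories sort_in and sort_out, the three steps look roughly like this:

# Upload the three local test files to HDFS (file names and paths are illustrative)
hadoop fs -mkdir sort_in
hadoop fs -put file1.txt file2.txt file3.txt sort_in

# Run the packaged job with the input and output directories as arguments
hadoop jar sort.jar com.wy.hadoop.sort.JobMain sort_in sort_out

# View the sorted result
hadoop fs -cat sort_out/part-r-00000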











