Hadoop MapReduce Sorting Principle


Hadoop Case 3: Sorting Data (Entry Level)


"Data Sorting" is the first work to be done when many actual tasks are executed,
such as student performance appraisal, data indexing and so on. This example and data deduplication is similar to the original data is initially processed, for further data operations to lay a good foundation. Enter this example below.


1. Requirements Description
Sort the data in the input files. Each line of an input file is a number, that is, one datum.
Each line of the output should contain two numbers separated by a space: the first is the rank of the datum within the sorted dataset, and the second is the original datum itself.


2. Raw Data


1) File1:

2
32
654
32
15
756
65223

2) File2:

5956
22
650
92


3) File3:

26
54
6

Sample output:

1 2
2 6
3 15
4 22
5 26
6 32
7 32
8 54
9 92
10 650
11 654
12 756
13 5956
14 65223


3. Design Thinking
This example simply requires sorting the input data. Readers familiar with the MapReduce process will quickly recall that MapReduce already sorts by key between the map and reduce phases, and will wonder whether this default sort can be exploited instead of implementing a sort by hand. The answer is yes.


But before exploiting the default sort, you need to know how it works. MapReduce sorts by the key value: if the key is an IntWritable (the Writable wrapper for int), the keys are sorted in numeric order; if the key is a Text (the wrapper for String), the keys are sorted in dictionary order.
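As a quick illustration (a standalone sketch, not part of the job itself), the comparisons below show why the numbers must be wrapped in IntWritable rather than left as Text: as strings, "15" sorts before "2" because '1' precedes '2' in dictionary order.

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;

public class SortOrderDemo {
    public static void main(String[] args) {
        // Numeric order: 15 > 2, so compareTo returns a positive value
        System.out.println(new IntWritable(15).compareTo(new IntWritable(2)));
        // Dictionary order: "15" < "2" because '1' precedes '2', so the result is negative
        System.out.println(new Text("15").compareTo(new Text("2")));
    }
}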


Knowing this detail, we know the job should use IntWritable as its key type. That is, the mapper converts each number it reads into an IntWritable and emits it as the key (the value can be anything).
When the reducer receives a <key, value-list> pair, it writes the input key out as the output value, once per element in the value-list; the duplicated value 32, for example, arrives as <32, [1, 1]> and is therefore written twice, under consecutive ranks. The output key (the num field in the code below) is a field of the reducer that tracks the rank of the current key. Note that no Combiner is configured in this program, that is, no combiner runs in the MapReduce process; map and reduce alone are enough to complete the task. Note also that this approach produces a globally sorted result only when a single reduce task is used, which is the default.


4. Map code
package com.wy.hadoop.sort;

import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class IntSortMapper extends Mapper<Object, Text, IntWritable, IntWritable> {

    // Fixed value emitted with every key; only the key matters for sorting
    private final IntWritable val = new IntWritable(1);

    @Override
    protected void map(Object key, Text value, Context context)
            throws IOException, InterruptedException {
        // Parse the line as an integer and emit it as the key,
        // so the shuffle phase sorts the data numerically
        context.write(new IntWritable(Integer.valueOf(value.toString().trim())), val);
    }
}


5. Reduce Code
package com.wy.hadoop.sort;

import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.mapreduce.Reducer;

public class IntSortReducer extends Reducer<IntWritable, IntWritable, IntWritable, IntWritable> {

    // Rank of the current key; keys arrive at the reducer already sorted
    private IntWritable num = new IntWritable(1);

    @Override
    protected void reduce(IntWritable key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        // Write one line per occurrence of the key (rank, then the key itself),
        // so duplicate values are preserved under consecutive ranks
        for (IntWritable tmp : values) {
            context.write(num, key);
            num = new IntWritable(num.get() + 1);
        }
    }
}




6. Main code
package com.wy.hadoop.sort;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;

public class IntSortJob extends Configured implements Tool, Runnable {

    private String inputPath = null;
    private String outputPath = null;

    public IntSortJob(String inputPath, String outputPath) {
        this.inputPath = inputPath;
        this.outputPath = outputPath;
    }

    public IntSortJob() {}

    @Override
    public void run() {
        try {
            String[] args = {this.inputPath, this.outputPath};
            start(args);
        } catch (Exception e) {
            e.printStackTrace();
        }
    }

    private void start(String[] args) throws Exception {
        // ToolRunner parses generic Hadoop options, then calls run(String[])
        ToolRunner.run(new IntSortJob(), args);
    }

    @Override
    public int run(String[] args) throws Exception {
        Configuration configuration = new Configuration();
        // Delete a leftover output directory so the job can be rerun
        FileSystem fs = FileSystem.get(configuration);
        fs.delete(new Path(args[1]), true);

        Job job = new Job(configuration, "intsortjob");
        job.setJarByClass(IntSortJob.class);

        job.setMapperClass(IntSortMapper.class);
        job.setReducerClass(IntSortReducer.class);

        job.setOutputKeyClass(IntWritable.class);
        job.setOutputValueClass(IntWritable.class);

        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));

        boolean success = job.waitForCompletion(true);

        return success ? 0 : 1;
    }
}

package com.wy.hadoop.sort;

public class JobMain {

    /**
     * @param args the HDFS input path and output path
     */
    public static void main(String[] args) {
        if (args.length == 2) {
            // Launch the job on a background thread via the Runnable wrapper
            new Thread(new IntSortJob(args[0], args[1])).start();
        }
    }
}






7. Create three files locally, put the test data in them, and upload the files to Hadoop HDFS.

8. Package the program as a jar, copy it to the Linux system, and run it with the hadoop jar command.

9. View the results. Example commands for these three steps are sketched below.
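The exact paths and jar name depend on your setup; assuming a jar named sort.jar and illustrative HDFS directories sort_in and sort_out, the three steps look roughly like this:

# Upload the three local test files to HDFS (file names and paths are illustrative)
hadoop fs -mkdir sort_in
hadoop fs -put file1.txt file2.txt file3.txt sort_in

# Run the packaged job with the input and output directories as arguments
hadoop jar sort.jar com.wy.hadoop.sort.JobMain sort_in sort_out

# View the sorted result
hadoop fs -cat sort_out/part-r-00000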











