How to use Hadoop's ChainMapper and ChainReducer

A MapReduce job in Hadoop supports chained processing. Think of a milk production line: each stage handles one specific task, such as supplying cartons, filling the milk, sealing the cartons, printing the date, and so on, and this finer division of labor improves productivity. Chained processing in Hadoop MapReduce works the same way: like a Linux pipeline, the output of one mapper is redirected directly into the input of the next mapper, forming a pipeline. This is very similar to the filter mechanism in Lucene and Solr; since the Hadoop project grew out of Lucene, it naturally borrows some of Lucene's processing techniques.

Typical uses include filtering forbidden or sensitive words out of text. Chained operation in Hadoop is supported in the form of the regular expression MAP+ REDUCE MAP*, which means the whole job may contain only a single reduce, but before and after that reduce there can be any number of mappers doing preprocessing or post-processing work, as sketched below.
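As a rough illustration of that pattern, the driver wiring with the old (org.apache.hadoop.mapred) API looks something like the skeleton below. The class names FirstMapper, SecondMapper, SumReducer and PostMapper are placeholders for your own Mapper and Reducer implementations, so this is a sketch rather than a complete program; the full, runnable version appears later in this article.

<pre name="code" class="Java">
// Skeleton of the MAP+ REDUCE MAP* chain with the old mapred API
// (JobConf, ChainMapper, ChainReducer and JobClient come from
// org.apache.hadoop.mapred and org.apache.hadoop.mapred.lib).
// FirstMapper, SecondMapper, SumReducer and PostMapper are hypothetical
// placeholders for your own old-API Mapper/Reducer classes.
JobConf conf = new JobConf(MyChainJob.class);

// Any number of mappers before the reduce, each piped into the next:
ChainMapper.addMapper(conf, FirstMapper.class,
        LongWritable.class, Text.class, Text.class, Text.class, false, new JobConf(false));
ChainMapper.addMapper(conf, SecondMapper.class,
        Text.class, Text.class, Text.class, Text.class, false, new JobConf(false));

// Exactly one reducer for the whole job ...
ChainReducer.setReducer(conf, SumReducer.class,
        Text.class, Text.class, Text.class, Text.class, false, new JobConf(false));

// ... optionally followed by more mappers that post-process the reduce output.
ChainReducer.addMapper(conf, PostMapper.class,
        Text.class, Text.class, Text.class, Text.class, true, new JobConf(false));

JobClient.runJob(conf);
</pre>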

Let's look at today's test example and walk through the data and the requirements.

The data are as follows:


<pre name= "code" class= "Java" > Mobile 5000
Computer 2000
Clothes 300
Shoes 1200
Skirt 434
Gloves 12
Book 12510
Commodity 5
Commodity 3
Order 2</pre>
The requirements are:

<pre name= "code" class= "Java" >/**
Needs
* Filter data greater than 100 million in the first mapper
* The second mapper filter out more than 100-10000 of the data
* Reduce inside to subtotal and output
* Reduce data in the mapper of the product name greater than 3
*/</pre>
<pre name= "code" class= "Java" >
The results are expected to be processed:
Gloves 12
Order 2

</pre>
The Hadoop version used here is 1.2. Although 1.2 supports the new API, the ChainMapper and ChainReducer classes do not work with the new API in this release; the new-API versions only become usable in Hadoop 2.x, and the differences between the two are small. The code given in this article uses the old API, which is worth keeping in mind.
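For comparison, here is a rough sketch of what the same wiring might look like against the new API in Hadoop 2.x, using org.apache.hadoop.mapreduce.lib.chain.ChainMapper and ChainReducer with a Job object. The class name ChainNewApiSketch is hypothetical, and the sketch assumes new-API versions of the mapper and reducer classes exist; it is not part of the article's original code.

<pre name="code" class="Java">
// Rough sketch only: assumes new-API (org.apache.hadoop.mapreduce) versions
// of AMapper01, AMapper02, AReducer03 and AMapper04 are defined elsewhere.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.chain.ChainMapper;
import org.apache.hadoop.mapreduce.lib.chain.ChainReducer;

public class ChainNewApiSketch {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "chain-demo");

        // Pre-reduce mappers; the per-stage configuration is a plain Configuration,
        // and there is no byValue flag in the new API.
        ChainMapper.addMapper(job, AMapper01.class,
                LongWritable.class, Text.class, Text.class, Text.class, new Configuration(false));
        ChainMapper.addMapper(job, AMapper02.class,
                Text.class, Text.class, Text.class, Text.class, new Configuration(false));

        // The single reducer, then any post-reduce mappers.
        ChainReducer.setReducer(job, AReducer03.class,
                Text.class, Text.class, Text.class, Text.class, new Configuration(false));
        ChainReducer.addMapper(job, AMapper04.class,
                Text.class, Text.class, Text.class, Text.class, new Configuration(false));

        // Input/output paths and formats would be set here as in the old-API example.
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
</pre>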
The code is as follows:


<pre name= "code" class= "Java" >
Package com.qin.test.hadoop.chain;

Import java.io.IOException;
Import Java.util.Iterator;

Import org.apache.hadoop.conf.Configuration;
Import Org.apache.hadoop.fs.FileSystem;
Import Org.apache.hadoop.fs.Path;
Import org.apache.hadoop.io.LongWritable;
Import Org.apache.hadoop.io.Text;
Import org.apache.hadoop.mapred.JobClient;
Import org.apache.hadoop.mapred.JobConf;
Import Org.apache.hadoop.mapred.MapReduceBase;
Import Org.apache.hadoop.mapred.Mapper;
Import Org.apache.hadoop.mapred.OutputCollector;
Import Org.apache.hadoop.mapred.Reducer;
Import Org.apache.hadoop.mapred.Reporter;
Import Org.apache.hadoop.mapred.lib.ChainMapper;
Import Org.apache.hadoop.mapred.lib.ChainReducer;




Import Org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
Import Org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
Import Org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
Import Org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;

Import com.qin.reducejoin.NewReduceJoin2;


/**
*
* Test the inside of Hadoop
* Use of Chainmapper and Reducemapper
*
* @author Qindongliang
* @date May 7, 2014
*
* Big Data exchange Group: 376932160
*
*
*
*
* ***/
public class Haoopchain {

/**
Needs
* Filter data greater than 100 million in the first mapper
* The second mapper filter out more than 100-10000 of the data
* Reduce inside to subtotal and output
* Reduce data in the mapper of the product name greater than 3
*/




/**
*
* Filter out more than 100 million of the data
*
* */
private static class AMapper01 extends Mapreducebase implements mapper< longwritable, text, text, text> {


@Override
public void Map (longwritable key, Text value, outputcollector< Text, text> Output, Reporter Reporter)
Throws IOException {
String text=value.tostring ();
String texts[]=text.split ("");

SYSTEM.OUT.PRINTLN ("The Data inside the AMapper01:" +text);
if (Texts[1]!=null&&texts[1].length () >0) {
int Count=integer.parseint (texts[1]);
if (count>10000) {
System.out.println ("AMapper01 filters out more than 10000 data:" +value.tostring ());
Return
}else{
Output.collect (new text (Texts[0]), new text (texts[1]));

}

}
}
}


/**
*
* Filter out more than 100-10000 of the data
*
* */
private static class AMapper02 extends Mapreducebase implements mapper< text, text, text, text> {

@Override
public void Map (text key, text value,
outputcollector< Text, text> Output, Reporter Reporter)
Throws IOException {

int Count=integer.parseint (value.tostring ());
if (count>=100&&count<=10000) {
System.out.println ("AMapper02 filter out less than 10000 more than 100 of the data:" +key+ "" +value ");
Return
} else{

Output.collect (key, value);
}

}
}


/**
* Reuduce inside of the same product
* The amount of data can be added
*
* **/
private static class AReducer03 extends Mapreducebase implements reducer< text, text, text, text> {

@Override
public void reduce (Text key, iterator< Text> Values
outputcollector< Text, text> Output, Reporter Reporter)
Throws IOException {
int sum=0;
SYSTEM.OUT.PRINTLN ("into reduce");

while (Values.hasnext ()) {

Text T=values.next ();
Sum+=integer.parseint (T.tostring ());

}

A collection of legacy APIs that do not support foreach iterations
for (Text t:values) {
Sum+=integer.parseint (T.tostring ());
// }

Output.collect (Key, New Text (sum+ ""));

}

}


/***
*
* Mapper filter after reduce
* Filter out product names that are longer than 3
*
* **/

private static class AMapper04 extends Mapreducebase implements mapper< text, text, text, text> {

@Override
public void Map (text key, text value,
outputcollector< Text, text> Output, Reporter Reporter)
Throws IOException {


int len=key.tostring (). Trim (). length ();

if (len>=3) {
SYSTEM.OUT.PRINTLN ("Reduce the mapper filter out the product name is greater than 3:" + key.tostring () + "" +value.tostring ());
return;
}else{
Output.collect (key, value);
}

}


}



/***
* Drive Main class
* **/
public static void Main (string[] args) throws exception{
Job Job=new Job (conf, "Myjoin");
jobconf conf=new jobconf (haoopchain.class);
Conf.set ("Mapred.job.tracker", "192.168.75.130:9001");
Conf.setjobname ("T7");
Conf.setjar ("Tt.jar");
Conf.setjarbyclass (Haoopchain.class);

Job Job=new Job (conf, "2222222");
Job.setjarbyclass (Haoopchain.class);
System.out.println ("mode:" +conf.get ("Mapred.job.tracker"));;

Job.setmapoutputkeyclass (Text.class);
Job.setmapoutputvalueclass (Text.class);


Filtration of MAP1
Jobconf mapa01=new jobconf (false);
Chainmapper.addmapper (conf, Amapper01.class, Longwritable.class, Text.class, Text.class, Text.class, False, mapA01);

Filtration of MAP2
Jobconf mapa02=new jobconf (false);
Chainmapper.addmapper (conf, Amapper02.class, Text.class, Text.class, Text.class, Text.class, False, mapA02);


Set up reduce
Jobconf recducefinallyconf=new jobconf (false);
Chainreducer.setreducer (conf, Areducer03.class, Text.class, Text.class, Text.class, Text.class, False, RECDUCEFINALLYCONF);


Mapper filtering after reduce
Jobconf reducea01=new jobconf (false);
Chainreducer.addmapper (conf, Amapper04.class, Text.class, Text.class, Text.class, Text.class, True, reduceA01);


Conf.setoutputkeyclass (Text.class);
Conf.setoutputvalueclass (Text.class);

Conf.setinputformat (Org.apache.hadoop.mapred.TextInputFormat.class);
Conf.setoutputformat (Org.apache.hadoop.mapred.TextOutputFormat.class);


FileSystem fs=filesystem.get (conf);
//
Path op=new path ("Hdfs://192.168.75.130:9000/root/outputchain");
if (fs.exists (OP)) {
Fs.delete (OP, True);
SYSTEM.OUT.PRINTLN ("This output path exists, deleted ... ");
}
//
//

Org.apache.hadoop.mapred.FileInputFormat.setInputPaths (conf, new Path ("hdfs://192.168.75.130:9000/root/ Inputchain "));
Org.apache.hadoop.mapred.FileOutputFormat.setOutputPath (conf, op);
//
System.exit (Conf.waitforcompletion (true)? 0:1);
Jobclient.runjob (conf);


}





}

</pre>



The run log is as follows:

<pre name= "code" class= "Java" >
Mode: 192.168.75.130:9001
This output path exists, deleted ...
Warn-jobclient.copyandconfigurefiles (746) | Use Genericoptionsparser for parsing the arguments. Applications should implement Tool for the same.
Warn-nativecodeloader.<clinit> (52) | Unable to load Native-hadoop library for your platform ... using Builtin-java classes where applicable
Warn-loadsnappy.<clinit> (46) | Snappy Native Library not loaded
Info-fileinputformat.liststatus (199) | Total input paths to process:1
Info-jobclient.monitorandprintjob (1380) | Running job:job_201405072054_0009
Info-jobclient.monitorandprintjob (1393) | Map 0% Reduce 0%
Info-jobclient.monitorandprintjob (1393) | Map 50% Reduce 0%
Info-jobclient.monitorandprintjob (1393) | Map 100% Reduce 0%
Info-jobclient.monitorandprintjob (1393) | Map 100% Reduce 33%
Info-jobclient.monitorandprintjob (1393) | Map 100% Reduce 100%
Info-jobclient.monitorandprintjob (1448) | Job complete:job_201405072054_0009
Info-counters.log (585) | Counters:30
Info-counters.log (587) | Job Counters
Info-counters.log (589) | launched reduce Tasks=1
Info-counters.log (589) | slots_millis_maps=11357
Info-counters.log (589) | Total time spent by all reduces waiting after reserving slots (ms) =0
Info-counters.log (589) | Total time spent by all maps waiting after reserving slots (ms) =0
Info-counters.log (589) | Launched Map tasks=2
Info-counters.log (589) | Data-local Map tasks=2
Info-counters.log (589) | slots_millis_reduces=9972
Info-counters.log (587) | File Input Format Counters
Info-counters.log (589) | Bytes read=183
Info-counters.log (587) | File Output Format Counters
Info-counters.log (589) | Bytes written=19
Info-counters.log (587) | Filesystemcounters
Info-counters.log (589) | file_bytes_read=57
Info-counters.log (589) | hdfs_bytes_read=391
Info-counters.log (589) | file_bytes_written=174859
Info-counters.log (589) | Hdfs_bytes_written=19
Info-counters.log (587) | Map-reduce Framework
Info-counters.log (589) | Map output materialized bytes=63
Info-counters.log (589) | Map input records=10
Info-counters.log (589) | Reduce Shuffle bytes=63
Info-counters.log (589) | Spilled records=8
Info-counters.log (589) | Map Output bytes=43
Info-counters.log (589) | Total committed heap usage (bytes) =336338944
Info-counters.log (589) | CPU Time Spent (ms) =1940
Info-counters.log (589) | Map input bytes=122
Info-counters.log (589) | split_raw_bytes=208
Info-counters.log (589) | Combine input Records=0
Info-counters.log (589) | Reduce input records=4
Info-counters.log (589) | Reduce input groups=3
Info-counters.log (589) | Combine Output Records=0
Info-counters.log (589) | Physical memory (bytes) snapshot=460980224
Info-counters.log (589) | Reduce Output records=2
Info-counters.log (589) | Virtual memory (bytes) snapshot=2184105984
Info-counters.log (589) | Map Output records=4

</pre>


The resulting data matches the expected output shown above: Gloves 12 and Order 2.
To summarize: during testing I found that if mappers are to run after the reduce, you must first register the single global reducer in ChainReducer (with setReducer) and only then add the mappers (with addMapper); otherwise a NullPointerException is thrown at runtime. This requires special attention.
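Expressed in code, the ordering constraint observed in this test looks like the following (using the same classes and per-stage JobConf objects as in the example above):

<pre name="code" class="Java">
// Correct order: register the single global reducer first ...
ChainReducer.setReducer(conf, AReducer03.class, Text.class, Text.class, Text.class, Text.class, false, new JobConf(false));
// ... and only then add any mappers that run after the reduce.
ChainReducer.addMapper(conf, AMapper04.class, Text.class, Text.class, Text.class, Text.class, true, new JobConf(false));

// In this test, calling ChainReducer.addMapper(...) before ChainReducer.setReducer(...)
// caused a NullPointerException at runtime.
</pre>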
