Reprinted from http://www.cnblogs.com/shapherd/archive/2012/12/21/2827860.html
Hadoop streaming supports multiple outputs from reduce: a single reducer can write to several part-xxxxx-X files, where X is one of the letters A-Z. When the program outputs a <key,value> pair, it appends a "#X" suffix to the value. For example, with #A the pair is written to part-00000-A. Different suffixes send <key,value> pairs to different files, which makes it easy to classify output by type. The #X suffix is used only to select the output file and does not appear in the output content.
How to use
In the startup script, specify -outputformat org.apache.hadoop.mapred.lib.SuffixMultipleTextOutputFormat or -outputformat org.apache.hadoop.mapred.lib.SuffixMultipleSequenceFileOutputFormat. The job output is then written as multiple output files.
Every value written to standard output must end with a #X suffix, where X is a letter A-Z; otherwise an invalid suffix error is reported.
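As a minimal sketch of the suffix convention just described, the helper below appends #A or #B to each record depending on its last field. The function name tag_by_flag and the sample records are hypothetical; the only behavior taken from this article is that the suffix goes at the very end of the value.

```shell
#!/bin/sh
# Hypothetical helper: append "#A" to records whose last field is 0 and
# "#B" to all others, so a suffix-based output format could route them.
tag_by_flag() {
    awk -F'\t' '{
        if ($NF == "0")
            print $0 "#A"
        else
            print $0 "#B"
    }'
}

# Example: two tab-separated records, one flagged 0 and one flagged 1.
printf 'k1\tv1\t0\nk2\tv2\t1\n' | tag_by_flag
```

Every line leaves the helper with exactly one suffix, so no record reaches standard output untagged.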
$HADOOP_HOME_PATH/bin/hadoop streaming \
    -Dhadoop.job.ugi="$HADOOP_JOB_UGI" \
    -file ./map.sh \
    -file ./red.sh \
    -file ./config.sh \
    -mapper "sh -x map.sh" \
    -reducer "sh -x red.sh" \
    -input $NEW_INPUT_PATH \
    -input $OLD_INPUT_PATH \
    -output $OUTPUT_PATH \
    -jobconf stream.num.map.output.key.fields=1 \
    -partitioner org.apache.hadoop.mapred.lib.KeyFieldBasedPartitioner \
    -outputformat org.apache.hadoop.mapred.lib.SuffixMultipleTextOutputFormat \
    -jobconf mapred.job.name="Test-shapherd-dist-diff" \
    -jobconf mapred.job.priority=HIGH \
    -jobconf mapred.job.map.capacity= \
    -jobconf mapred.job.reduce.capacity= \
    -jobconf mapred.reduce.tasks=3
In the reduce script, append a suffix to each output record to route it to the corresponding part file. The example here is a pair of scripts that diff two large data sets.
map.sh is as follows:
source ./config.sh
awk 'BEGIN{}
{
    # Tag each record with its origin: 0 = old input, 1 = new input.
    if (match("'${map_input_file}'", "'$OLD_INPUT_PATH'")) {
        print $0"\t"0
        next
    }
    if (match("'${map_input_file}'", "'$NEW_INPUT_PATH'"))
        print $0"\t"1
}'
exit 0
red.sh is as follows:
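awk -F"\t" 'BEGIN {
    key = ""
    flag = 0
    num = 0
    old_num = 0
    new_num = 0
    diff_num = 0
}
{
    # The last field is the origin tag added by map.sh: 0 = old, 1 = new.
    if ($NF == "0")
        old_num++
    else
        new_num++
    if ($1 != key) {
        # Key changed: a previous key seen only once exists in one input only.
        if (key != "") {
            if (num <= 1) {
                diff_num++
                if (flag == "0")
                    print line"#A"
                else
                    print line"#B"
            }
        }
        key = $1
        flag = $NF
        line = $0    # keep the record so it can be printed once the key changes
        num = 1
        next
    }
    if (key == $1) {
        num++
        next
    }
}
END {
    # Apply the same check to the final key.
    if (num == 1) {
        diff_num++
        if (flag == "0")
            print line"#A"
        else
            print line"#B"
    }
    print old_num"\tshapherd#C"
    print new_num"\tshapherd#D"
    print diff_num"\tshapherd#E"
}'
exit 0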
My two data sets have no differences, so the output files are:
part-00000-C
part-00000-D
part-00000-E
part-00001-C
part-00001-D
part-00001-E
part-00002-C
part-00002-D
part-00002-E
There are no files ending in A or B.
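To check a reduce script before submitting a job, the routing can be imitated locally. The sketch below is not the Hadoop output format itself. It is a hypothetical helper, split_by_suffix, which assumes every line ends in #X with X in A-Z, strips that marker, and appends the rest of the line to part-00000-X in a directory of your choice.

```shell
#!/bin/sh
# Hypothetical local stand-in for suffix-based routing (not Hadoop code).
# Reads lines from stdin; each must end in "#X" with X in A-Z.
# Usage: some_reducer | split_by_suffix OUTPUT_DIR
split_by_suffix() {
    outdir=$1
    mkdir -p "$outdir" || return 1
    awk -v dir="$outdir" '{
        n = length($0)
        suffix = substr($0, n, 1)          # last character: the letter X
        if (n < 2 || substr($0, n - 1, 1) != "#" ||
            index("ABCDEFGHIJKLMNOPQRSTUVWXYZ", suffix) == 0) {
            print "invalid suffix: " $0 > "/dev/stderr"
            bad = 1
            next
        }
        # Write the record without its "#X" marker to the matching file.
        print substr($0, 1, n - 2) >> (dir "/part-00000-" suffix)
    }
    END { exit bad }'
}
```

Piping the reduce script's output through split_by_suffix ./out should produce file names of the same shape as the listing above, which makes it easy to see which suffixes a given input actually triggers.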
Precautions
- Multiple outputs support at most 26 paths, because the suffix letter must be in the range A-Z.
- The key and value are separated by \t. If a line output by the reduce script contains no \t, the whole line is treated as the key and the value is empty. Appending #X then causes an invalid suffix error, because #X becomes part of the key. To avoid this, make sure your key and value are separated by \t, or explicitly specify the separator you want.
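One way to guard against both pitfalls is to route all reducer output through a small wrapper. The sketch below uses a hypothetical function, emit_with_suffix, that rejects any suffix outside A-Z. As an assumption about what you want for key-only lines, it also adds a \t so that #X attaches to an (otherwise empty) value rather than to the key.

```shell
#!/bin/sh
# Hypothetical wrapper: tag every stdin line with "#X" safely.
# Usage: emit_with_suffix X   (X must be a single letter A-Z)
emit_with_suffix() {
    case $1 in
        [ABCDEFGHIJKLMNOPQRSTUVWXYZ]) ;;   # exactly one uppercase letter
        *) echo "invalid suffix: $1" >&2; return 1 ;;
    esac
    awk -v sfx="$1" '{
        # A line with no tab would be treated entirely as the key,
        # so add a tab to create a value for the suffix to attach to.
        if (index($0, "\t") == 0)
            $0 = $0 "\t"
        print $0 "#" sfx
    }'
}
```

For example, printf 'key_only\n' | emit_with_suffix C emits the key, a tab, and then #C, while emit_with_suffix c fails before reading any input.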
Hadoop streaming multi-output [reprint]