Code test environment: Hadoop 2.4
Application scenario: this technique customizes the output data format, including how records are rendered, the output path, and the output file name.
Hadoop's built-in output file formats include:
1) FileOutputFormat<K, V>: the common parent class of the file output formats;
2) TextOutputFormat<K, V>: the default output format; writes each record as a line of text;
3) SequenceFileOutputFormat<K, V>: writes records to a serialized sequence file;
4) MultipleOutputs<K, V>: can deliver output data to different directories;
5) NullOutputFormat<K, V>: sends output to /dev/null, i.e., nothing is written; useful when the job does its processing inside MR and writes any files itself, so no framework-managed output is needed (a one-line usage sketch follows this list);
6) LazyOutputFormat<K, V>: creates an output file only when the write method is called, so no empty file is generated if write is never called.
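For example, opting a job into NullOutputFormat takes a single line in the driver. A minimal sketch (assuming a Job instance named job that is configured elsewhere):

import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.output.NullOutputFormat;

// Discard all framework-managed output; suitable when the job
// writes its results itself or produces none at all.
job.setOutputFormatClass(NullOutputFormat.class);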
Steps:
As with custom input formats, a custom output format can be built with the following steps:
1) Define a class that inherits from OutputFormat; in practice it usually extends FileOutputFormat;
2) Implement the getRecordWriter method, returning a RecordWriter instance;
3) Define a class that extends RecordWriter and implement its write method, which writes each <key, value> pair to the output file.
Instance 1 (change the default output file name and the default key/value separator):
Input data:
Custom CustomOutputFormat (replaces the default file name prefix):
package fz.outputformat;

import java.io.IOException;

import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.RecordWriter;
import org.apache.hadoop.mapreduce.TaskAttemptContext;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class CustomOutputFormat extends FileOutputFormat<LongWritable, Text> {

    private String prefix = "custom_";

    @Override
    public RecordWriter<LongWritable, Text> getRecordWriter(TaskAttemptContext job)
            throws IOException, InterruptedException {
        // Build the output file path: <output dir>/custom_<last 5 digits of the task id>,
        // so each reduce task still gets a unique file name (like the default part-r-00000).
        Path outputDir = FileOutputFormat.getOutputPath(job);
        String subfix = job.getTaskAttemptID().getTaskID().toString();
        Path path = new Path(outputDir.toString() + "/" + prefix
                + subfix.substring(subfix.length() - 5, subfix.length()));
        FSDataOutputStream fileOut = path.getFileSystem(job.getConfiguration()).create(path);
        return new CustomRecordWriter(fileOut);
    }
}
Custom CustomRecordWriter (specifies the key/value separator):
package fz.outputformat;

import java.io.IOException;
import java.io.PrintWriter;

import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.RecordWriter;
import org.apache.hadoop.mapreduce.TaskAttemptContext;

public class CustomRecordWriter extends RecordWriter<LongWritable, Text> {

    private PrintWriter out;
    // Use a comma instead of TextOutputFormat's default tab separator.
    private String separator = ",";

    public CustomRecordWriter(FSDataOutputStream fileOut) {
        out = new PrintWriter(fileOut);
    }

    @Override
    public void write(LongWritable key, Text value) throws IOException, InterruptedException {
        out.println(key.get() + separator + value.toString());
    }

    @Override
    public void close(TaskAttemptContext context) throws IOException, InterruptedException {
        out.close();
    }
}
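Note that if only the separator needs to change, a fully custom RecordWriter is not strictly necessary: in Hadoop 2.x the built-in TextOutputFormat reads its separator from a configuration property. A minimal sketch (property name as used by Hadoop 2.x; verify against your version):

// In the driver, before creating the Job; the default separator is "\t".
Configuration conf = getConf();
conf.set("mapreduce.output.textoutputformat.separator", ",");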
Main class:
package fz.outputformat;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;

public class FileOutputFormatDriver extends Configured implements Tool {

    public static void main(String[] args) throws Exception {
        ToolRunner.run(new Configuration(), new FileOutputFormatDriver(), args);
    }

    @Override
    public int run(String[] arg0) throws Exception {
        if (arg0.length != 3) {
            System.err.println("Usage:\nfz.outputformat.FileOutputFormatDriver <in> <out> <numReducer>");
            return -1;
        }
        Configuration conf = getConf();
        Path in = new Path(arg0[0]);
        Path out = new Path(arg0[1]);
        // Remove any previous output so the job can rerun cleanly.
        boolean delete = out.getFileSystem(conf).delete(out, true);
        System.out.println("deleted " + out + "?" + delete);

        Job job = Job.getInstance(conf, "fileoutputformat test job");
        job.setJarByClass(getClass());
        job.setInputFormatClass(TextInputFormat.class);
        // Plug in the custom output format defined above.
        job.setOutputFormatClass(CustomOutputFormat.class);
        job.setMapperClass(Mapper.class);
        job.setMapOutputKeyClass(LongWritable.class);
        job.setMapOutputValueClass(Text.class);
        job.setOutputKeyClass(LongWritable.class);
        job.setOutputValueClass(Text.class);
        job.setNumReduceTasks(Integer.parseInt(arg0[2]));
        job.setReducerClass(Reducer.class);

        FileInputFormat.setInputPaths(job, in);
        FileOutputFormat.setOutputPath(job, out);
        return job.waitForCompletion(true) ? 0 : -1;
    }
}
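The driver might be launched as follows (the jar name and HDFS paths are placeholders for illustration):

hadoop jar custom-outputformat.jar fz.outputformat.FileOutputFormatDriver /user/fz/input /user/fz/output 1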
View output:
The output confirms that the file name prefix and the key/value separator are produced as expected.
Instance 2 (output data to different directories based on the key and value):
Custom main class (only the output configuration actually changes):
package fz.multipleoutputformat;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.MultipleOutputs;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;

public class FileOutputFormatDriver extends Configured implements Tool {

    public static void main(String[] args) throws Exception {
        ToolRunner.run(new Configuration(), new FileOutputFormatDriver(), args);
    }

    @Override
    public int run(String[] arg0) throws Exception {
        if (arg0.length != 3) {
            System.err.println("Usage:\nfz.multipleoutputformat.FileOutputFormatDriver <in> <out> <numReducer>");
            return -1;
        }
        Configuration conf = getConf();
        Path in = new Path(arg0[0]);
        Path out = new Path(arg0[1]);
        boolean delete = out.getFileSystem(conf).delete(out, true);
        System.out.println("deleted " + out + "?" + delete);

        Job job = Job.getInstance(conf, "fileoutputformat test job");
        job.setJarByClass(getClass());
        job.setInputFormatClass(TextInputFormat.class);
        // The job keeps the default output format; instead, register two
        // named outputs that the reducer writes to selectively.
        // job.setOutputFormatClass(CustomOutputFormat.class);
        MultipleOutputs.addNamedOutput(job, "ignore", TextOutputFormat.class,
                LongWritable.class, Text.class);
        MultipleOutputs.addNamedOutput(job, "other", TextOutputFormat.class,
                LongWritable.class, Text.class);
        job.setMapperClass(Mapper.class);
        job.setMapOutputKeyClass(LongWritable.class);
        job.setMapOutputValueClass(Text.class);
        job.setOutputKeyClass(LongWritable.class);
        job.setOutputValueClass(Text.class);
        job.setNumReduceTasks(Integer.parseInt(arg0[2]));
        job.setReducerClass(MultipleReducer.class);

        FileInputFormat.setInputPaths(job, in);
        FileOutputFormat.setOutputPath(job, out);
        return job.waitForCompletion(true) ? 0 : -1;
    }
}
Custom Reducer (custom logic is needed because data must be routed to different outputs based on the key and value):
package fz.multipleoutputformat;

import java.io.IOException;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.output.MultipleOutputs;

public class MultipleReducer extends Reducer<LongWritable, Text, LongWritable, Text> {

    private MultipleOutputs<LongWritable, Text> out;

    @Override
    public void setup(Context cxt) {
        out = new MultipleOutputs<LongWritable, Text>(cxt);
    }

    @Override
    public void reduce(LongWritable key, Iterable<Text> value, Context cxt)
            throws IOException, InterruptedException {
        for (Text v : value) {
            if (v.toString().startsWith("ignore")) {
                // Route to the "ignore" named output; "ign" is the base file
                // name, producing files such as ign-r-00000.
                out.write("ignore", key, v, "ign");
            } else {
                out.write("other", key, v, "oth");
            }
        }
    }

    @Override
    public void cleanup(Context cxt) throws IOException, InterruptedException {
        // MultipleOutputs must be closed, or the named output files lose data.
        out.close();
    }
}
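The fourth argument of out.write is a base output path relative to the job output directory; as used above ("ign"/"oth") it only changes the file name prefix. To route records into genuinely different subdirectories, the base path may contain "/" (a sketch; the directory names are illustrative):

out.write("ignore", key, v, "ignore/part"); // -> <outdir>/ignore/part-r-00000
out.write("other", key, v, "other/part");   // -> <outdir>/other/part-r-00000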
View output:
The output shows that the data is indeed written to different files according to the value prefix. However, a default part file is also generated, and its size is 0; this post does not solve that (a possible workaround is sketched below).
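A possible workaround, not verified in this post's test environment, is the LazyOutputFormat listed earlier: it wraps the real output format so the default part file is only created when something is actually written to it. In the driver:

import org.apache.hadoop.mapreduce.lib.output.LazyOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;

// Instead of relying on the default TextOutputFormat directly:
LazyOutputFormat.setOutputFormatClass(job, TextOutputFormat.class);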
Conclusion: a custom output format can meet special requirements, but Hadoop's built-in output formats cover most cases, so fully custom formats are rarely needed in practice. Hadoop's built-in MultipleOutputs, however, can route output to different directories based on characteristics of the data, which has real practical value.
Share, grow, and be happy
For reprints, please cite the blog address: http://blog.csdn.net/fansy1990