Hadoop programming tips (7) --- customizing the output file format and writing output to different directories


Code test environment: Hadoop 2.4

Application scenario: this technique lets you customize the output data format, including the rendering of each record, the output path, and the output file name.

Hadoop's built-in output file formats include:

1) FileOutputFormat<K, V>: the common parent class of the file-based output formats;

2) TextOutputFormat<K, V>: the default output format; writes each record as a line of text;

3) SequenceFileOutputFormat<K, V>: writes records as a serialized sequence file;

4) MultipleOutputs<K, V>: can deliver output data to different directories;

5) NullOutputFormat<K, V>: sends the output to /dev/null, that is, nothing is written; this suits jobs that do their real work (including any writing) inside the map/reduce logic itself, so no framework output is needed;

6) LazyOutputFormat<K, V>: creates an output file only when write() is actually called, so no empty file is generated for a task that never writes (a minimal wiring sketch for selecting these formats follows this list).
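For concreteness, here is a minimal wiring sketch, assuming a Job that is already configured elsewhere, showing how the built-in formats above are selected on a job; the class name OutputFormatWiring is illustrative, and only one option would be active at a time:

package fz.outputformat;

import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.output.NullOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.SequenceFileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;

public class OutputFormatWiring {
    // The job is assumed to be fully configured elsewhere; only the
    // output-format choice is shown. Uncomment exactly one option.
    static void chooseOutputFormat(Job job) {
        // Default: one "key<TAB>value" text line per record.
        job.setOutputFormatClass(TextOutputFormat.class);

        // Binary key/value pairs, convenient as input for a follow-up MR job:
        // job.setOutputFormatClass(SequenceFileOutputFormat.class);

        // Discard all framework output (the job writes its results itself):
        // job.setOutputFormatClass(NullOutputFormat.class);
    }
}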

Steps:

As with custom input formats, you can customize the output data format by following the steps below; the two examples later in this article implement them in full.

1) Define a class that inherits from OutputFormat; in practice it usually extends FileOutputFormat;

2) Implement the getRecordWriter method so it returns a RecordWriter instance;

3) Define a class that inherits from RecordWriter and implement its write method, which writes each <key, value> pair to the file;


Example 1 (changing the default output file name and the default key/value separator):

Input data:


Custom CustomOutputFormat (replaces the default file name prefix):

package fz.outputformat;

import java.io.IOException;

import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.RecordWriter;
import org.apache.hadoop.mapreduce.TaskAttemptContext;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class CustomOutputFormat extends FileOutputFormat<LongWritable, Text> {

    private String prefix = "custom_";

    @Override
    public RecordWriter<LongWritable, Text> getRecordWriter(TaskAttemptContext job)
            throws IOException, InterruptedException {
        // Build the output file path: <output dir>/<prefix> followed by the
        // last five characters of the task id (e.g. custom_00000).
        Path outputDir = FileOutputFormat.getOutputPath(job);
        String subfix = job.getTaskAttemptID().getTaskID().toString();
        Path path = new Path(outputDir.toString() + "/" + prefix
                + subfix.substring(subfix.length() - 5, subfix.length()));
        // Create the file on the output file system and hand it to the writer.
        FSDataOutputStream fileOut = path.getFileSystem(job.getConfiguration()).create(path);
        return new CustomRecordWriter(fileOut);
    }
}
Custom CustomRecordWriter (specifies the key and value separator):

package fz.outputformat;

import java.io.IOException;
import java.io.PrintWriter;

import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.RecordWriter;
import org.apache.hadoop.mapreduce.TaskAttemptContext;

public class CustomRecordWriter extends RecordWriter<LongWritable, Text> {

    private PrintWriter out;
    private String separator = ",";

    public CustomRecordWriter(FSDataOutputStream fileOut) {
        out = new PrintWriter(fileOut);
    }

    @Override
    public void write(LongWritable key, Text value) throws IOException,
            InterruptedException {
        out.println(key.get() + separator + value.toString());
    }

    @Override
    public void close(TaskAttemptContext context) throws IOException,
            InterruptedException {
        out.close();
    }
}
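A hedged variant, not in the original code: rather than hard-coding the comma, the separator can be injected by the output format, which in turn can read it from the job configuration, much as the built-in TextOutputFormat reads mapreduce.output.textoutputformat.separator. The class name ConfigurableRecordWriter and the key custom.output.separator below are illustrative:

package fz.outputformat;

import java.io.PrintWriter;

import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.RecordWriter;
import org.apache.hadoop.mapreduce.TaskAttemptContext;

// Variant of CustomRecordWriter with the separator injected by the caller.
// CustomOutputFormat.getRecordWriter() could then read it from the job, e.g.
//   String sep = job.getConfiguration().get("custom.output.separator", ",");
//   return new ConfigurableRecordWriter(fileOut, sep);
// ("custom.output.separator" is an illustrative key name, not a Hadoop one.)
public class ConfigurableRecordWriter extends RecordWriter<LongWritable, Text> {

    private final PrintWriter out;
    private final String separator;

    public ConfigurableRecordWriter(FSDataOutputStream fileOut, String separator) {
        this.out = new PrintWriter(fileOut);
        this.separator = separator;
    }

    @Override
    public void write(LongWritable key, Text value) {
        out.println(key.get() + separator + value.toString());
    }

    @Override
    public void close(TaskAttemptContext context) {
        out.close();
    }
}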

Main class:

package fz.outputformat;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;

public class FileOutputFormatDriver extends Configured implements Tool {

    public static void main(String[] args) throws Exception {
        ToolRunner.run(new Configuration(), new FileOutputFormatDriver(), args);
    }

    @Override
    public int run(String[] arg0) throws Exception {
        if (arg0.length != 3) {
            System.err.println("Usage:\nfz.outputformat.FileOutputFormatDriver <in> <out> <numReducer>");
            return -1;
        }
        Configuration conf = getConf();
        Path in = new Path(arg0[0]);
        Path out = new Path(arg0[1]);
        // Remove any previous output so the job can be rerun.
        boolean delete = out.getFileSystem(conf).delete(out, true);
        System.out.println("deleted " + out + "?" + delete);

        Job job = Job.getInstance(conf, "fileoutputformat test job");
        job.setJarByClass(getClass());
        job.setInputFormatClass(TextInputFormat.class);
        // Plug in the custom output format defined above.
        job.setOutputFormatClass(CustomOutputFormat.class);
        // Identity mapper and reducer: the point here is only the output format.
        job.setMapperClass(Mapper.class);
        job.setMapOutputKeyClass(LongWritable.class);
        job.setMapOutputValueClass(Text.class);
        job.setOutputKeyClass(LongWritable.class);
        job.setOutputValueClass(Text.class);
        job.setNumReduceTasks(Integer.parseInt(arg0[2]));
        job.setReducerClass(Reducer.class);
        FileInputFormat.setInputPaths(job, in);
        FileOutputFormat.setOutputPath(job, out);
        return job.waitForCompletion(true) ? 0 : -1;
    }
}

View output:

The output confirms that the file name and the key/value format are as customized: each reducer produces a file such as custom_00000 (the prefix plus the last five characters of the task id), containing lines of the form <byte offset>,<input line>.


Example 2 (writing output to different directories based on the key and value):
Main class (compared with Example 1, only the output wiring changes: named outputs are registered instead of a custom output format):

package fz.multipleoutputformat;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.MultipleOutputs;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;

public class FileOutputFormatDriver extends Configured implements Tool {

    public static void main(String[] args) throws Exception {
        ToolRunner.run(new Configuration(), new FileOutputFormatDriver(), args);
    }

    @Override
    public int run(String[] arg0) throws Exception {
        if (arg0.length != 3) {
            System.err.println("Usage:\nfz.multipleoutputformat.FileOutputFormatDriver <in> <out> <numReducer>");
            return -1;
        }
        Configuration conf = getConf();
        Path in = new Path(arg0[0]);
        Path out = new Path(arg0[1]);
        // Remove any previous output so the job can be rerun.
        boolean delete = out.getFileSystem(conf).delete(out, true);
        System.out.println("deleted " + out + "?" + delete);

        Job job = Job.getInstance(conf, "fileoutputformat test job");
        job.setJarByClass(getClass());
        job.setInputFormatClass(TextInputFormat.class);
        // No custom output format this time; register named outputs instead.
        // job.setOutputFormatClass(CustomOutputFormat.class);
        MultipleOutputs.addNamedOutput(job, "ignore", TextOutputFormat.class,
                LongWritable.class, Text.class);
        MultipleOutputs.addNamedOutput(job, "other", TextOutputFormat.class,
                LongWritable.class, Text.class);
        job.setMapperClass(Mapper.class);
        job.setMapOutputKeyClass(LongWritable.class);
        job.setMapOutputValueClass(Text.class);
        job.setOutputKeyClass(LongWritable.class);
        job.setOutputValueClass(Text.class);
        job.setNumReduceTasks(Integer.parseInt(arg0[2]));
        job.setReducerClass(MultipleReducer.class);
        FileInputFormat.setInputPaths(job, in);
        FileOutputFormat.setOutputPath(job, out);
        return job.waitForCompletion(true) ? 0 : -1;
    }
}
Custom reducer (the routing logic lives here: each record must be sent to a named output based on its key and value):

package fz.multipleoutputformat;

import java.io.IOException;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.output.MultipleOutputs;

public class MultipleReducer extends
        Reducer<LongWritable, Text, LongWritable, Text> {

    private MultipleOutputs<LongWritable, Text> out;

    @Override
    public void setup(Context cxt) {
        out = new MultipleOutputs<LongWritable, Text>(cxt);
    }

    @Override
    public void reduce(LongWritable key, Iterable<Text> value, Context cxt)
            throws IOException, InterruptedException {
        for (Text v : value) {
            // Route each record to a named output; the last argument is the
            // base name of the generated file (e.g. ign-r-00000, oth-r-00000).
            if (v.toString().startsWith("ignore")) {
                out.write("ignore", key, v, "ign");
            } else {
                out.write("other", key, v, "oth");
            }
        }
    }

    @Override
    public void cleanup(Context cxt) throws IOException, InterruptedException {
        out.close();
    }
}

View output:


As expected, the data is written to different files according to the value of each record (the ign and oth base names). However, the default part-r-xxxxx files are still generated as well, each with size 0; this is left unsolved here, though a likely fix is sketched below.
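A hedged sketch of that fix: because every record goes through a named output and nothing is ever written to the job's default output, wrapping the default format in the LazyOutputFormat listed at the start of this article should stop the empty part files from being created. In the driver's run() method this would be one extra call (the helper class name below is illustrative):

package fz.multipleoutputformat;

import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.output.LazyOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;

public class LazyOutputWiring {
    // Call this from FileOutputFormatDriver.run() after creating the job:
    // the default part file is then only created if the default collector
    // actually receives a write(), which never happens when all records go
    // through MultipleOutputs named outputs.
    static void suppressEmptyPartFiles(Job job) {
        LazyOutputFormat.setOutputFormatClass(job, TextOutputFormat.class);
    }
}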


Conclusion: a custom output format can meet special formatting requirements, but since Hadoop's built-in output formats cover most cases, it is rarely needed in practice. Hadoop's built-in MultipleOutputs, on the other hand, can write records to different directories based on the characteristics of the data, which is of real practical value.


Share, grow, and be happy

Reprinted please indicate blog address: http://blog.csdn.net/fansy1990

