Hadoop advanced programming (ii) --- custom input/output format

Hadoop provides a wide range of data input and output formats that cover most designs. In some cases, however, you need to customize the input or output format.

A data input format (InputFormat) describes the input specification of a MapReduce job. The framework uses it to check the input specification once the job is submitted (for example, that the input directory exists), to split the input files into logical chunks (InputSplit), and to supply the RecordReader that reads each split record by record and turns it into the key/value pairs consumed by the map function. Hadoop ships with many input formats, such as TextInputFormat and KeyValueTextInputFormat, and each has a corresponding RecordReader (LineRecordReader and KeyValueLineRecordReader, respectively). To customize an input format you implement InputFormat's createRecordReader() and getSplits() methods (when you extend FileInputFormat, getSplits() is inherited, so usually only createRecordReader() needs to be overridden), and the RecordReader you return must implement getCurrentKey(), getCurrentValue(), nextKeyValue(), initialize(), getProgress(), and close().

For example:

package com.rpc.nefu;

import java.io.IOException;

import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.InputSplit;
import org.apache.hadoop.mapreduce.RecordReader;
import org.apache.hadoop.mapreduce.TaskAttemptContext;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.FileSplit;
import org.apache.hadoop.util.LineReader;

// A custom input format must extend FileInputFormat (or implement InputFormat directly).
public class ZInputFormat extends FileInputFormat<IntWritable, IntWritable> {

    @Override // supply the custom RecordReader
    public RecordReader<IntWritable, IntWritable> createRecordReader(InputSplit split,
            TaskAttemptContext context) throws IOException, InterruptedException {
        return new ZRecordReader();
    }

    // The custom RecordReader that parses each split into <IntWritable, IntWritable> pairs.
    public static class ZRecordReader extends RecordReader<IntWritable, IntWritable> {
        private LineReader in;          // input stream
        private boolean more = true;    // becomes false once there is no more data
        private IntWritable key = null;
        private IntWritable value = null;
        // these three track the current read position within the file
        private long start;
        private long end;
        private long pos;
        // private Log log = LogFactory.getLog(ZRecordReader.class); // optional logging

        @Override
        public void initialize(InputSplit split, TaskAttemptContext context)
                throws IOException, InterruptedException {
            FileSplit inputsplit = (FileSplit) split;
            start = inputsplit.getStart();          // start offset of this split
            end = start + inputsplit.getLength();   // end offset of this split
            final Path file = inputsplit.getPath(); // open the file
            FileSystem fs = file.getFileSystem(context.getConfiguration());
            FSDataInputStream fileIn = fs.open(inputsplit.getPath());
            // Move the file pointer to this split; a freshly opened stream points
            // to the beginning of the file.
            fileIn.seek(start);
            in = new LineReader(fileIn, context.getConfiguration());
            if (start != 0) {
                // This is not the first split, so the (possibly partial) first line was
                // already handled by the previous split; skip it, otherwise it would be
                // read incorrectly. For example, if the first split covers bytes 0-4,
                // those 4 bytes have already been consumed.
                start += in.readLine(new Text(), 0, maxBytesToConsume(start));
            }
            pos = start;
        }

        private int maxBytesToConsume(long pos) {
            return (int) Math.min(Integer.MAX_VALUE, end - pos);
        }

        @Override
        public boolean nextKeyValue() throws IOException, InterruptedException {
            // Advance to the next key/value pair.
            // Tip: avoid printing or logging in this method; it runs once per record.
            // log.info("reading the next record");
            if (null == key) {
                key = new IntWritable();
            }
            if (null == value) {
                value = new IntWritable();
            }
            Text nowline = new Text();      // holds the content of the current line
            int readsize = in.readLine(nowline);
            pos += readsize;                // update the current read position
            // If pos has reached the end of the split, the split has been fully read.
            if (pos >= end) {
                more = false;
                return false;
            }
            if (0 == readsize) {
                key = null;
                value = null;
                more = false;               // end of file reached
                return false;
            }
            String[] keyandvalue = nowline.toString().split(",");
            // Skip the header line.
            if (keyandvalue[0].endsWith("\"citing\"")) {
                readsize = in.readLine(nowline);
                pos += readsize;            // update the current read position
                if (0 == readsize) {
                    more = false;           // end of file reached
                    return false;
                }
                keyandvalue = nowline.toString().split(",");
            }
            // Set the key and value.
            // log.info("key is: " + key + " value is: " + value);
            key.set(Integer.parseInt(keyandvalue[0]));
            value.set(Integer.parseInt(keyandvalue[1]));
            return true;
        }

        @Override
        public IntWritable getCurrentKey() throws IOException, InterruptedException {
            return key;
        }

        @Override
        public IntWritable getCurrentValue() throws IOException, InterruptedException {
            return value;
        }

        @Override
        public float getProgress() throws IOException, InterruptedException {
            // progress through the current split
            if (false == more || end == start) {
                return 0f;
            } else {
                return Math.min(1.0f, (pos - start) / (float) (end - start));
            }
        }

        @Override
        public void close() throws IOException {
            if (null != in) {
                in.close();
            }
        }
    }
}

package reverseIndex;

import java.io.IOException;

import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.InputSplit;
import org.apache.hadoop.mapreduce.RecordReader;
import org.apache.hadoop.mapreduce.TaskAttemptContext;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.FileSplit;
import org.apache.hadoop.mapreduce.lib.input.LineRecordReader;

// Wraps LineRecordReader and prefixes each key with the name of the file it came from,
// which is convenient when building an inverted index.
public class FileNameLocInputFormat extends FileInputFormat<Text, Text> {

    @Override
    public RecordReader<Text, Text> createRecordReader(InputSplit split, TaskAttemptContext context)
            throws IOException, InterruptedException {
        return new FileNameLocRecordReader();
    }

    public static class FileNameLocRecordReader extends RecordReader<Text, Text> {
        String fileName;
        LineRecordReader line = new LineRecordReader(); // delegate that does the actual line reading

        /**
         * ......
         */
        @Override
        public void initialize(InputSplit split, TaskAttemptContext context)
                throws IOException, InterruptedException {
            line.initialize(split, context);
            FileSplit inputsplit = (FileSplit) split;
            fileName = inputsplit.getPath().getName();
        }

        @Override
        public boolean nextKeyValue() throws IOException, InterruptedException {
            return line.nextKeyValue();
        }

        @Override
        public Text getCurrentKey() throws IOException, InterruptedException {
            // key = "(fileName@byteOffset)"
            return new Text("(" + fileName + "@" + line.getCurrentKey() + ")");
        }

        @Override
        public Text getCurrentValue() throws IOException, InterruptedException {
            return line.getCurrentValue();
        }

        @Override
        public float getProgress() throws IOException, InterruptedException {
            return line.getProgress();
        }

        @Override
        public void close() throws IOException {
            line.close();
        }
    }
}
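As a usage sketch (this mapper is illustrative and not part of the original code): when FileNameLocInputFormat is set with job.setInputFormatClass(FileNameLocInputFormat.class), each map() call receives key = "(fileName@offset)" and value = the line's text, which is exactly what an inverted index needs in order to emit <word, location> pairs:

package reverseIndex;

import java.io.IOException;

import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Illustrative mapper: emits <word, "(fileName@offset)"> pairs for an inverted index.
public class InvertedIndexMapper extends Mapper<Text, Text, Text, Text> {
    @Override
    protected void map(Text key, Text value, Context context)
            throws IOException, InterruptedException {
        for (String word : value.toString().split("\\s+")) {
            if (!word.isEmpty()) {
                context.write(new Text(word), key); // key already carries file name and offset
            }
        }
    }
}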
Hadoop likewise ships with many built-in output formats and their corresponding RecordWriters. A data output format describes the output specification of a MapReduce job: it checks the output specification (for example, that the output directory does not already exist) and supplies the RecordWriter that writes the job's result data.

Custom output format:

public static class AlphaOutputFormat extends MultiFormat<Text, IntWritable> {

    @Override
    protected String generateFileNameForKeyValue(Text key, IntWritable value, Configuration conf) {
        // route each record to a file named after the first letter of its key
        char c = key.toString().toLowerCase().charAt(0);
        if (c >= 'a' && c <= 'z') {
            return c + ".txt";
        } else {
            return "other.txt";
        }
    }
}

// Set the output format in the job driver:
job.setOutputFormatClass(AlphaOutputFormat.class);

package com.rpc.nefu;

import java.io.DataOutputStream;
import java.io.IOException;
import java.util.HashMap;
import java.util.Iterator;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Writable;
import org.apache.hadoop.io.WritableComparable;
import org.apache.hadoop.io.compress.CompressionCodec;
import org.apache.hadoop.io.compress.GzipCodec;
import org.apache.hadoop.mapreduce.OutputCommitter;
import org.apache.hadoop.mapreduce.RecordWriter;
import org.apache.hadoop.mapreduce.TaskAttemptContext;
import org.apache.hadoop.mapreduce.lib.output.FileOutputCommitter;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.util.ReflectionUtils;

public abstract class MultiFormat<K extends WritableComparable<?>, V extends Writable>
        extends FileOutputFormat<K, V> {

    private MultiRecordWriter writer = null;

    public RecordWriter<K, V> getRecordWriter(TaskAttemptContext job)
            throws IOException, InterruptedException {
        if (writer == null) {
            writer = new MultiRecordWriter(job, getTaskOutputPath(job));
        }
        return writer;
    }

    private Path getTaskOutputPath(TaskAttemptContext conf) throws IOException {
        Path workPath = null;
        OutputCommitter committer = super.getOutputCommitter(conf);
        if (committer instanceof FileOutputCommitter) {
            workPath = ((FileOutputCommitter) committer).getWorkPath();
        } else {
            Path outputPath = super.getOutputPath(conf);
            if (outputPath == null) {
                throw new IOException("Undefined job output-path");
            }
            workPath = outputPath;
        }
        return workPath;
    }

    /** Use the key, value, and conf to determine the output file name (including the extension). */
    protected abstract String generateFileNameForKeyValue(K key, V value, Configuration conf);

    public class MultiRecordWriter extends RecordWriter<K, V> {
        /** cache of RecordWriters, one per output file name */
        private HashMap<String, RecordWriter<K, V>> recordWriters = null;
        private TaskAttemptContext job = null;
        /** output directory */
        private Path workPath = null;

        public MultiRecordWriter(TaskAttemptContext job, Path workPath) {
            super();
            this.job = job;
            this.workPath = workPath;
            recordWriters = new HashMap<String, RecordWriter<K, V>>();
        }

        @Override
        public void close(TaskAttemptContext context) throws IOException, InterruptedException {
            Iterator<RecordWriter<K, V>> values = this.recordWriters.values().iterator();
            while (values.hasNext()) {
                values.next().close(context);
            }
            this.recordWriters.clear();
        }

        @Override
        public void write(K key, V value) throws IOException, InterruptedException {
            // determine the output file name for this record
            String baseName = generateFileNameForKeyValue(key, value, job.getConfiguration());
            RecordWriter<K, V> rw = this.recordWriters.get(baseName);
            if (rw == null) {
                rw = getBaseRecordWriter(job, baseName);
                this.recordWriters.put(baseName, rw);
            }
            rw.write(key, value);
        }

        // ${mapred.out.dir}/_temporary/_${taskid}/${nameWithExtension}
        private RecordWriter<K, V> getBaseRecordWriter(TaskAttemptContext job, String baseName)
                throws IOException, InterruptedException {
            Configuration conf = job.getConfiguration();
            boolean isCompressed = getCompressOutput(job);
            String keyValueSeparator = ",";
            RecordWriter<K, V> recordWriter = null;
            // LineRecordWriter writes one "key<separator>value" line per record; the article
            // assumes a class equivalent to TextOutputFormat.LineRecordWriter is available
            // (that class is protected, so in practice you copy it into your own code).
            if (isCompressed) {
                Class<? extends CompressionCodec> codecClass =
                        getOutputCompressorClass(job, GzipCodec.class);
                CompressionCodec codec = ReflectionUtils.newInstance(codecClass, conf);
                Path file = new Path(workPath, baseName + codec.getDefaultExtension());
                FSDataOutputStream fileOut = file.getFileSystem(conf).create(file, false);
                recordWriter = new LineRecordWriter<K, V>(
                        new DataOutputStream(codec.createOutputStream(fileOut)), keyValueSeparator);
            } else {
                Path file = new Path(workPath, baseName);
                FSDataOutputStream fileOut = file.getFileSystem(conf).create(file, false);
                recordWriter = new LineRecordWriter<K, V>(fileOut, keyValueSeparator);
            }
            return recordWriter;
        }
    }
}
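Putting it together, a driver might look like the sketch below (the class name AlphaOutputDriver and the argument handling are illustrative; it assumes AlphaOutputFormat from the earlier snippet is visible to the driver, for example declared as a nested class of it). Because getBaseRecordWriter() honours the standard FileOutputFormat compression settings, gzip output can be switched on from the driver:

package com.rpc.nefu;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.compress.GzipCodec;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

// Illustrative driver sketch: registers AlphaOutputFormat and enables gzip output.
public class AlphaOutputDriver {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "alpha-output-example");
        job.setJarByClass(AlphaOutputDriver.class);
        // mapper/reducer classes omitted; they must produce <Text, IntWritable> pairs
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        job.setOutputFormatClass(AlphaOutputFormat.class);
        // getBaseRecordWriter() reads these settings and appends ".gz" to each file name
        FileOutputFormat.setCompressOutput(job, true);
        FileOutputFormat.setOutputCompressorClass(job, GzipCodec.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}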



