Hadoop programming tips (5) --- custom input file format class (InputFormat)


Hadoop test environment: Hadoop 2.4

Application: a custom input file format class can be used to filter and pre-process data that meets certain conditions.

Hadoop built-in input file formats include:

1) FileInputFormat<K, V> is the basic parent class; a user-defined input format usually extends it;

2) TextInputFormat<LongWritable, Text> is the default format and the one generally used in programming; the key is the byte offset of the current line from the start of the file, and the value is the current line as a string;

3) SequenceFileInputFormat<K, V> is the input format for sequence files, which improves efficiency but is not convenient for viewing results; it is recommended to use sequence files for intermediate data and a human-readable format for the final output;

4) KeyValueTextInputFormat<Text, Text> reads data separated by a tab (\t). If every line is tab-separated, this format automatically treats the part before the first \t as the key and the rest as the value (a minimal job using this format is sketched right after this list);

5) CombineFileInputFormat<K, V> is used to combine large numbers of small files;

6) MultipleInputs supports multiple inputs and lets you specify a separate Mapper for each input.
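As a quick illustration of item 4, here is a minimal sketch of a map-only job that reads tab-separated lines with the built-in KeyValueTextInputFormat (the class name KeyValueInputExample and the use of the identity Mapper are illustrative assumptions, not part of the original post); a line such as "user1<tab>35" arrives in the mapper with key "user1" and value "35":

package fz.inputformat;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.KeyValueTextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

// Sketch only: a map-only job that relies on KeyValueTextInputFormat to split each
// tab-separated line into a Text key (before the first \t) and a Text value (the rest).
public class KeyValueInputExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "keyvaluetextinputformat example");
        job.setJarByClass(KeyValueInputExample.class);
        job.setInputFormatClass(KeyValueTextInputFormat.class);
        job.setMapperClass(Mapper.class);          // identity mapper: writes key/value pairs unchanged
        job.setMapOutputKeyClass(Text.class);
        job.setMapOutputValueClass(Text.class);
        job.setNumReduceTasks(0);                  // map-only, so the key/value split is visible in the output
        FileInputFormat.setInputPaths(job, new Path(args[0]));   // input and output paths from the command line
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}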

Principle:

The InputFormat class has two important methods:

1) getSplits() determines how the input is divided into splits; when we extend FileInputFormat we can ignore this method and rely on FileInputFormat's implementation;

2) createRecordReader() returns the RecordReader that defines how the data is actually read; this is the class we really have to define.

For each record passed to the map function, nextKeyValue() is called first. This method is defined in the RecordReader (we customize it with our own implementation), so the nextKeyValue() of our RecordReader is invoked here; it reads and prepares the key and value and returns true to indicate that a record is ready. getCurrentKey() and getCurrentValue() are then called to fetch the current key and value, which are finally passed to map(), and the map logic continues, as the simplified loop below shows.
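This driving loop is implemented in Hadoop's Mapper.run(); in simplified form (a sketch of the Hadoop 2.x behaviour, not a verbatim copy of the source) it looks like this:

// Simplified sketch of Mapper.run(): the framework repeatedly asks the RecordReader
// (through the context) for the next record and hands it to map().
public void run(Context context) throws IOException, InterruptedException {
    setup(context);
    try {
        while (context.nextKeyValue()) {               // forwards to RecordReader.nextKeyValue()
            map(context.getCurrentKey(),               // forwards to RecordReader.getCurrentKey()
                context.getCurrentValue(),             // forwards to RecordReader.getCurrentValue()
                context);
        }
    } finally {
        cleanup(context);
    }
}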

Custom input file format:

package fz.inputformat;

import java.io.IOException;

import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.InputSplit;
import org.apache.hadoop.mapreduce.RecordReader;
import org.apache.hadoop.mapreduce.TaskAttemptContext;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;

/**
 * Custom input file reading class
 * @author fansy
 */
public class CustomInputFormat extends FileInputFormat<Text, Text> {

    @Override
    public RecordReader<Text, Text> createRecordReader(InputSplit split,
            TaskAttemptContext context) throws IOException, InterruptedException {
        return new CustomReader();
    }
}
When FileInputFormat is extended, getSplits() does not need to be overridden; only the RecordReader has to be defined.

Custom RecordReader:

package fz.inputformat;

//import java.io.BufferedReader;
import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.InputSplit;
import org.apache.hadoop.mapreduce.RecordReader;
import org.apache.hadoop.mapreduce.TaskAttemptContext;
import org.apache.hadoop.mapreduce.lib.input.FileSplit;
import org.apache.hadoop.util.LineReader;

public class CustomReader extends RecordReader<Text, Text> {
//  private BufferedReader in;
    private LineReader lr;
    private Text key = new Text();
    private Text value = new Text();
    private long start;
    private long end;
    private long currentPos;
    private Text line = new Text();

    @Override
    public void initialize(InputSplit inputSplit, TaskAttemptContext cxt)
            throws IOException, InterruptedException {
        FileSplit split = (FileSplit) inputSplit;
        Configuration conf = cxt.getConfiguration();
        Path path = split.getPath();
        FileSystem fs = path.getFileSystem(conf);
        FSDataInputStream is = fs.open(path);
        lr = new LineReader(is, conf);

        // work out the start and end of this split
        start = split.getStart();
        end = start + split.getLength();
        is.seek(start);
        if (start != 0) {
            start += lr.readLine(new Text(), 0,
                    (int) Math.min(Integer.MAX_VALUE, end - start));
        }
        currentPos = start;
    }

    // process each row of data
    @Override
    public boolean nextKeyValue() throws IOException, InterruptedException {
        if (currentPos > end) {
            return false;
        }
        currentPos += lr.readLine(line);
        if (line.getLength() == 0) {
            return false;
        }
        if (line.toString().startsWith("ignore")) {
            currentPos += lr.readLine(line);
        }

        String[] words = line.toString().split(",");
        // exception handling
        if (words.length < 2) {
            System.err.println("line:" + line.toString() + ".");
            return false;
        }
        key.set(words[0]);
        value.set(words[1]);
        return true;
    }

    @Override
    public Text getCurrentKey() throws IOException, InterruptedException {
        return key;
    }

    @Override
    public Text getCurrentValue() throws IOException, InterruptedException {
        return value;
    }

    @Override
    public float getProgress() throws IOException, InterruptedException {
        if (start == end) {
            return 0.0f;
        } else {
            return Math.min(1.0f, (currentPos - start) / (float) (end - start));
        }
    }

    @Override
    public void close() throws IOException {
        lr.close();
    }
}
There are two main methods here: initialize() and nextKeyValue().

initialize() handles setup: it opens the file and determines the start and end positions used to track reading progress;

nextKeyValue() processes each row of data (a LineReader is used here, so records are read line by line; a different reading method could be defined to read different content) and produces the corresponding key/value pair, returning true if nothing goes wrong. As you can see, one rule is applied: if a line starts with "ignore", it is skipped. In addition, each line only contributes its first two comma-separated fields, as the key and the value respectively.

Practice:

Input data:

ignore,2
a,3
ignore,4
c,1
c,2,3,2
4,3,2
ignore,3
4,2
Define the driver class. It uses the default Mapper and no Reducer (a sample command for running it follows the code).

package fz.inputformat;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;

public class FileInputFormatDriver extends Configured implements Tool {

    /**
     * @param args
     * @throws Exception
     */
    public static void main(String[] args) throws Exception {
        ToolRunner.run(new Configuration(), new FileInputFormatDriver(), args);
    }

    @Override
    public int run(String[] arg0) throws Exception {
        if (arg0.length != 2) {
            System.err.println("Usage:\nfz.inputformat.FileInputFormatDriver <in> <out>");
            return -1;
        }
        Configuration conf = getConf();

        Path in = new Path(arg0[0]);
        Path out = new Path(arg0[1]);
        out.getFileSystem(conf).delete(out, true);

        Job job = Job.getInstance(conf, "fileinputformat test job");
        job.setJarByClass(getClass());

        job.setInputFormatClass(CustomInputFormat.class);
        job.setOutputFormatClass(TextOutputFormat.class);

        job.setMapperClass(Mapper.class);
        job.setMapOutputKeyClass(Text.class);
        job.setMapOutputValueClass(Text.class);
//      job.setOutputKeyClass(LongWritable.class);
//      job.setOutputValueClass(VectorWritable.class);
        job.setNumReduceTasks(0);
//      System.out.println(job.getConfiguration().get("mapreduce.job.reduces"));
//      System.out.println(conf.get("mapreduce.job.reduces"));

        FileInputFormat.setInputPaths(job, in);
        FileOutputFormat.setOutputPath(job, out);

        return job.waitForCompletion(true) ? 0 : -1;
    }
}
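A typical way to submit the job is shown below (the jar name and the HDFS paths are assumptions for illustration; the driver expects exactly the <in> and <out> arguments checked above):

hadoop jar fz-inputformat.jar fz.inputformat.FileInputFormatDriver /user/fansy/input /user/fansy/output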

View output:



From the output we can see that the lines starting with "ignore" have been dropped, and each remaining line only contributes the fields before and after the first comma as key and value.

Note: one row of the output comes out as an empty string; the reason has not been found yet.


Summary: a custom input format can be used to filter data and perform some simple logic, similar to what can be done in a map function. If that were all it offered, it could simply be replaced by a map function; but input formats have other capabilities as well, such as combining large numbers of small files to improve efficiency.


Share, grow, and be happy

If you reprint this post, please credit the blog: http://blog.csdn.net/fansy1990



