Hadoop programming tips (5) --- custom input file format class (InputFormat)


Hadoop test environment: Hadoop 2.4

Application: a custom input file format class can be used to filter and pre-process data that meets certain conditions.

Hadoop built-in input file formats include:

1) FileInputFormat<K, V> is the basic parent class; a user-defined input format usually extends it;

2) TextInputFormat<LongWritable, Text> is the default format and the one generally used in programming; the key is the byte offset of the current line from the start of the file, and the value is the current line as a string;

3) SequenceFileInputFormat<K, V> is the input format for sequence files, which improves efficiency but is not convenient for viewing results; it is recommended to use sequence files for intermediate data and a human-readable format for the final output;

4) KeyValueTextInputFormat<Text, Text> reads data separated by a tab (\t). If every line is tab-separated, this format automatically treats the part before the first \t as the key and the rest as the value (a minimal job using this format is sketched right after this list);

5) CombineFileInputFormat<K, V> is used to combine large numbers of small files;

6) MultipleInputs supports multiple inputs and lets you specify a separate Mapper for each input.
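As a quick illustration of item 4, here is a minimal sketch of a map-only job that reads tab-separated lines with the built-in KeyValueTextInputFormat (the class name KeyValueInputExample and the use of the identity Mapper are illustrative assumptions, not part of the original post); a line such as "user1<tab>35" arrives in the mapper with key "user1" and value "35":

package fz.inputformat;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.KeyValueTextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

// Sketch only: a map-only job that relies on KeyValueTextInputFormat to split each
// tab-separated line into a Text key (before the first \t) and a Text value (the rest).
public class KeyValueInputExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "keyvaluetextinputformat example");
        job.setJarByClass(KeyValueInputExample.class);
        job.setInputFormatClass(KeyValueTextInputFormat.class);
        job.setMapperClass(Mapper.class);          // identity mapper: writes key/value pairs unchanged
        job.setMapOutputKeyClass(Text.class);
        job.setMapOutputValueClass(Text.class);
        job.setNumReduceTasks(0);                  // map-only, so the key/value split is visible in the output
        FileInputFormat.setInputPaths(job, new Path(args[0]));   // input and output paths from the command line
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}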

Principle:

The InputFormat class has two important methods:

1) getSplits() determines how the input is divided into splits; when we extend FileInputFormat we can ignore this method and rely on FileInputFormat's implementation;

2) createRecordReader() returns the RecordReader that defines how the data is actually read; this is the class we really have to define.

For each record passed to the map function, nextKeyValue() is called first. This method is defined in the RecordReader (we customize it with our own implementation), so the nextKeyValue() of our RecordReader is invoked here; it reads and prepares the key and value and returns true to indicate that a record is ready. getCurrentKey() and getCurrentValue() are then called to fetch the current key and value, which are finally passed to map(), and the map logic continues, as the simplified loop below shows.
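This driving loop is implemented in Hadoop's Mapper.run(); in simplified form (a sketch of the Hadoop 2.x behaviour, not a verbatim copy of the source) it looks like this:

// Simplified sketch of Mapper.run(): the framework repeatedly asks the RecordReader
// (through the context) for the next record and hands it to map().
public void run(Context context) throws IOException, InterruptedException {
    setup(context);
    try {
        while (context.nextKeyValue()) {               // forwards to RecordReader.nextKeyValue()
            map(context.getCurrentKey(),               // forwards to RecordReader.getCurrentKey()
                context.getCurrentValue(),             // forwards to RecordReader.getCurrentValue()
                context);
        }
    } finally {
        cleanup(context);
    }
}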

Custom input file format:

package fz.inputformat;

import java.io.IOException;

import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.InputSplit;
import org.apache.hadoop.mapreduce.RecordReader;
import org.apache.hadoop.mapreduce.TaskAttemptContext;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;

/**
 * Custom input file reading class
 * @author fansy
 */
public class CustomInputFormat extends FileInputFormat<Text, Text> {

    @Override
    public RecordReader<Text, Text> createRecordReader(InputSplit split,
            TaskAttemptContext context) throws IOException, InterruptedException {
        return new CustomReader();
    }
}
When FileInputFormat is extended, getSplits() does not need to be overridden; only the RecordReader has to be defined.

Custom RecordReader:

package fz.inputformat;

//import java.io.BufferedReader;
import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.InputSplit;
import org.apache.hadoop.mapreduce.RecordReader;
import org.apache.hadoop.mapreduce.TaskAttemptContext;
import org.apache.hadoop.mapreduce.lib.input.FileSplit;
import org.apache.hadoop.util.LineReader;

public class CustomReader extends RecordReader<Text, Text> {
//  private BufferedReader in;
    private LineReader lr;
    private Text key = new Text();
    private Text value = new Text();
    private long start;
    private long end;
    private long currentPos;
    private Text line = new Text();

    @Override
    public void initialize(InputSplit inputSplit, TaskAttemptContext cxt)
            throws IOException, InterruptedException {
        FileSplit split = (FileSplit) inputSplit;
        Configuration conf = cxt.getConfiguration();
        Path path = split.getPath();
        FileSystem fs = path.getFileSystem(conf);
        FSDataInputStream is = fs.open(path);
        lr = new LineReader(is, conf);

        // work out the start and end of this split
        start = split.getStart();
        end = start + split.getLength();
        is.seek(start);
        if (start != 0) {
            start += lr.readLine(new Text(), 0,
                    (int) Math.min(Integer.MAX_VALUE, end - start));
        }
        currentPos = start;
    }

    // process each row of data
    @Override
    public boolean nextKeyValue() throws IOException, InterruptedException {
        if (currentPos > end) {
            return false;
        }
        currentPos += lr.readLine(line);
        if (line.getLength() == 0) {
            return false;
        }
        if (line.toString().startsWith("ignore")) {
            currentPos += lr.readLine(line);
        }

        String[] words = line.toString().split(",");
        // exception handling
        if (words.length < 2) {
            System.err.println("line:" + line.toString() + ".");
            return false;
        }
        key.set(words[0]);
        value.set(words[1]);
        return true;
    }

    @Override
    public Text getCurrentKey() throws IOException, InterruptedException {
        return key;
    }

    @Override
    public Text getCurrentValue() throws IOException, InterruptedException {
        return value;
    }

    @Override
    public float getProgress() throws IOException, InterruptedException {
        if (start == end) {
            return 0.0f;
        } else {
            return Math.min(1.0f, (currentPos - start) / (float) (end - start));
        }
    }

    @Override
    public void close() throws IOException {
        lr.close();
    }
}
There are two main methods here: initialize() and nextKeyValue().

initialize() handles setup: it opens the file and determines the start and end positions used to track reading progress;

nextKeyValue() processes each row of data (a LineReader is used here, so records are read line by line; a different reading method could be defined to read different content) and produces the corresponding key/value pair, returning true if nothing goes wrong. As you can see, one rule is applied: if a line starts with "ignore", it is skipped. In addition, each line only contributes its first two comma-separated fields, as the key and the value respectively.

Practice:

Input data:

ignore,2
a,3
ignore,4
c,1
c,2,3,2
4,3,2
ignore,3
4,2
Define the driver class. It uses the default Mapper and no Reducer (a sample command for running it follows the code).

package fz.inputformat;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;

public class FileInputFormatDriver extends Configured implements Tool {

    /**
     * @param args
     * @throws Exception
     */
    public static void main(String[] args) throws Exception {
        ToolRunner.run(new Configuration(), new FileInputFormatDriver(), args);
    }

    @Override
    public int run(String[] arg0) throws Exception {
        if (arg0.length != 2) {
            System.err.println("Usage:\nfz.inputformat.FileInputFormatDriver <in> <out>");
            return -1;
        }
        Configuration conf = getConf();

        Path in = new Path(arg0[0]);
        Path out = new Path(arg0[1]);
        out.getFileSystem(conf).delete(out, true);

        Job job = Job.getInstance(conf, "fileinputformat test job");
        job.setJarByClass(getClass());

        job.setInputFormatClass(CustomInputFormat.class);
        job.setOutputFormatClass(TextOutputFormat.class);

        job.setMapperClass(Mapper.class);
        job.setMapOutputKeyClass(Text.class);
        job.setMapOutputValueClass(Text.class);
//      job.setOutputKeyClass(LongWritable.class);
//      job.setOutputValueClass(VectorWritable.class);
        job.setNumReduceTasks(0);
//      System.out.println(job.getConfiguration().get("mapreduce.job.reduces"));
//      System.out.println(conf.get("mapreduce.job.reduces"));

        FileInputFormat.setInputPaths(job, in);
        FileOutputFormat.setOutputPath(job, out);

        return job.waitForCompletion(true) ? 0 : -1;
    }
}
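A typical way to submit the job is shown below (the jar name and the HDFS paths are assumptions for illustration; the driver expects exactly the <in> and <out> arguments checked above):

hadoop jar fz-inputformat.jar fz.inputformat.FileInputFormatDriver /user/fansy/input /user/fansy/output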

View output:



From the output we can see that the lines starting with "ignore" have been dropped, and each remaining line only contributes the fields before and after the first comma as key and value.

Note: one row of the output comes out as an empty string; the reason has not been found yet.


Summary: a custom input format can be used to filter data and perform some simple logic, similar to what can be done in a map function. If that were all it offered, it could simply be replaced by a map function; but input formats have other capabilities as well, such as combining large numbers of small files to improve efficiency.


Share, grow, and be happy

If you reprint this post, please credit the blog: http://blog.csdn.net/fansy1990



