MapReduce Input Format


Files are the initial storage place for a MapReduce job's data, and input files are normally stored in HDFS. The format of these files can be arbitrary: row-based log files, binary files, multi-line input records, or some other format. The files are often very large, up to dozens of GB or more. So how does MapReduce read this data? Let's start with the InputFormat interface.

1. InputFormat interface

The InputFormat interface determines how an input file is divided into Hadoop splits and consumed. InputFormat obtains a collection of splits (InputSplit[]) from the job and then matches each split with a suitable RecordReader (via createRecordReader()) to read the data in that split. Let's look at the abstract methods the InputFormat interface consists of.

2. Abstract methods of InputFormat

The InputFormat class contains two abstract methods, as shown below:

    public abstract class InputFormat<K, V> {

        public abstract List<InputSplit> getSplits(JobContext context)
                throws IOException, InterruptedException;

        public abstract RecordReader<K, V> createRecordReader(InputSplit split,
                TaskAttemptContext context) throws IOException, InterruptedException;
    }

1) The getSplits(JobContext context) method is responsible for logically dividing a large data set into many slices. For example, suppose a database table has 100 rows stored in ascending order of primary key ID. If each slice holds 20 rows, the split list has size 5, and each InputSplit records two parameters: the starting ID of the slice and the size of the slice (here, 20). Clearly an InputSplit does not actually store any data; it merely describes how to slice it.

2) The createRecordReader(InputSplit split, TaskAttemptContext context) method returns a RecordReader that reads the records of the split defined by the InputSplit. To recap: getSplits() computes the InputSplits from the input files (taking into account factors such as whether a file can be divided, the size of the blocks the file is stored in, and the file size); createRecordReader() supplies the RecordReader implementation mentioned above, which reads key-value pairs correctly out of the InputSplit. LineRecordReader, for instance, uses the byte offset as the key and each line of data as the value, so every InputFormat whose createRecordReader() returns a LineRecordReader reads its input splits in this offset-as-key, line-as-value fashion.

In fact, in most cases we do not need to implement InputFormat ourselves to read data: Hadoop ships with many input formats that already implement the InputFormat interface.

3. InputFormat interface implementation classes

There are many implementation classes of the InputFormat interface, organized in a class hierarchy.

(Figure omitted: hierarchy of InputFormat implementation classes.)

1. FileInputFormat

FileInputFormat is the base class for all InputFormat implementations that use files as their data source. Its primary role is to indicate the location of the job's input files. Because a job's input is specified as a set of paths, this provides a lot of flexibility when specifying job input. FileInputFormat provides four static methods to set a Job's input paths:

    public static void addInputPath(Job job, Path path);

    public static void addInputPaths(Job job, String commaSeparatedPaths);

    public static void setInputPaths(Job job, Path... inputPaths);

    public static void setInputPaths(Job job, String commaSeparatedPaths);

The addInputPath() and addInputPaths() methods append one or more paths to the list of paths; you can call them repeatedly to build up the list. The setInputPaths() methods set the complete list of paths in one call, replacing any paths previously set on the Job. Their use is shown in the following example:

    // Set a single source path
    FileInputFormat.addInputPath(job, new Path("hdfs://ljc:9000/buaa/inputpath1"));

    // Add multiple source paths, separated by commas
    FileInputFormat.addInputPaths(job,
        "hdfs://ljc:9000/buaa/inputpath1,hdfs://ljc:9000/buaa/inputpath2,...");

    // inputPaths is an array of Path objects that can hold multiple source paths,
    // e.g. hdfs://ljc:9000/buaa/inputpath1, hdfs://ljc:9000/buaa/inputpath2, etc.
    FileInputFormat.setInputPaths(job, inputPaths);

    // Set multiple source paths at once, separated by commas
    FileInputFormat.setInputPaths(job,
        "hdfs://ljc:9000/buaa/inputpath1,hdfs://ljc:9000/buaa/inputpath2,...");

The add and set methods let you specify which files are included. If you need to exclude specific files, you can use FileInputFormat's setInputPathFilter() method to set a filter:

    public static void setInputPathFilter(Job job, Class<? extends PathFilter> filter);

Filters are not discussed further here. Note that even if you do not set a filter, FileInputFormat applies a default filter that excludes hidden files. If you set a filter by calling setInputPathFilter(), it filters on top of the default filter; in other words, a custom filter only ever sees non-hidden files.
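As a brief illustration, a minimal custom filter might look like the sketch below (the class name and the ".log" suffix are hypothetical; PathFilter itself is the real Hadoop interface):

    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.fs.PathFilter;

    // Hypothetical filter that accepts only files ending in ".log".
    // It runs on top of the default filter, so hidden files (names
    // starting with "." or "_") are already excluded before it is asked.
    public class LogFileFilter implements PathFilter {
        @Override
        public boolean accept(Path path) {
            return path.getName().endsWith(".log");
        }
    }

It would then be registered with FileInputFormat.setInputPathFilter(job, LogFileFilter.class).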

When the input data source is a set of files, Hadoop is good at handling not only unstructured text data but also data in binary formats; in both cases the base class is FileInputFormat. Below we introduce several common input formats, all of which extend the FileInputFormat base class.

1. TextInputFormat

TextInputFormat is the default InputFormat. Each record is a line of input. The key is a LongWritable storing the byte offset of the line within the whole file. The value is the contents of the line, excluding any line terminators (newline, carriage return), packaged as a Text object.

For example, suppose a split contains the following five text records, where a tab (horizontal tab) separates the two fields within each record:

1 22

2 17

3 17

4 11

5 11

Each record is represented by the following key/value pairs:

(0, 1 22)

(5, 2 17)

(10, 3 17)

(15, 4 11)

(20, 5 11)

Clearly, the keys are not line numbers. In general it is difficult to obtain line numbers, because a file is divided into splits by byte, not by line.
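In practice this means a mapper over TextInputFormat declares LongWritable/Text input types. A minimal sketch (the EchoMapper class is hypothetical and simply passes each pair through):

    import java.io.IOException;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;

    public class EchoMapper extends Mapper<LongWritable, Text, LongWritable, Text> {
        @Override
        protected void map(LongWritable offset, Text line, Context context)
                throws IOException, InterruptedException {
            // key = byte offset of the line in the file, value = line contents
            context.write(offset, line);
        }
    }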

2. KeyValueTextInputFormat

Each line is a record, split by a separator (tab, \t, by default) into a key (Text) and a value (Text). You can set the separator with the mapreduce.input.keyvaluelinerecordreader.key.value.separator property (or key.value.separator.in.input.line in the old-API). Its default value is a tab character.

For example, suppose a split contains the following five text records, where a tab (horizontal tab) separates the key from the value in each record:

1 22

2 17

3 17

4 11

5 11

Each record is represented by the following key/value pairs:

(1,22)

(2,17)

(3,17)

(4,11)

(5,11)

The key in each case is the Text sequence before the tab character on each line.
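For example, a job that reads comma-separated key/value lines could be configured as in this sketch (the property string is the new-API name quoted above; the snippet would sit in the driver code):

    Configuration conf = new Configuration();
    // Use a comma instead of the default tab to split key from value
    conf.set("mapreduce.input.keyvaluelinerecordreader.key.value.separator", ",");
    Job job = Job.getInstance(conf, "kv-example");
    job.setInputFormatClass(KeyValueTextInputFormat.class);
    // The mapper then declares Text/Text input types: Mapper<Text, Text, ...>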

3. NLineInputFormat

With TextInputFormat and KeyValueTextInputFormat, each mapper receives a variable number of input lines; the number depends on the size of the input split and the length of the lines. If you want your mappers to receive a fixed number of lines of input, use NLineInputFormat as the InputFormat. As with TextInputFormat, the key is the byte offset of the line within the file and the value is the line itself. N is the number of input lines each mapper receives. With N set to 1 (the default), each mapper receives exactly one line of input. The value of N is set via the mapreduce.input.lineinputformat.linespermap property (mapred.line.input.format.linespermap in the older API).

Here is an example, still using the five-line input above:

1 22

2 17

3 17

4 11

5 11

If N is 3, for example, then each input split contains three lines. One mapper receives the first three lines as key-value pairs:

1 22

2 17

3 17

The second mapper receives the remaining lines (since there are only five lines in total, it receives just two):

4 11

5 11

The keys and values here are the same as those produced by TextInputFormat.
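To reproduce the N = 3 behavior above, the driver might contain something like this sketch (both the property and the convenience setter exist in the new API):

    Job job = Job.getInstance(new Configuration(), "nline-example");
    job.setInputFormatClass(NLineInputFormat.class);
    // Each mapper receives 3 lines; equivalent to setting the
    // mapreduce.input.lineinputformat.linespermap property to 3
    NLineInputFormat.setNumLinesPerSplit(job, 3);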

4. SequenceFileInputFormat

Used to read sequence files. The keys and values are defined by the user. A sequence file is a Hadoop-specific compressed binary file format, well suited to transferring data between one MapReduce job and another (for chains of MapReduce operations).
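A sketch of reading a previous job's sequence-file output inside the driver (the path and the Text/IntWritable types are assumptions; the mapper's input types must match whatever the writing job stored):

    Job job = Job.getInstance(new Configuration(), "read-seq");
    job.setInputFormatClass(SequenceFileInputFormat.class);
    // addInputPath is inherited from FileInputFormat
    SequenceFileInputFormat.addInputPath(job, new Path("prev-job-output"));
    // The mapper then declares the stored key/value types,
    // e.g. Mapper<Text, IntWritable, ...>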

2. Multiple inputs

Although the input to a MapReduce job may consist of multiple input files, all of the files are interpreted by a single InputFormat and a single Mapper. However, data formats tend to evolve over time, so you may have to write your own Mapper to cope with legacy formats in your application. Alternatively, some data sources may provide the same data in different formats.

These problems can be handled cleanly with the MultipleInputs class, which lets you specify an InputFormat and a Mapper for each input path. For example, if we want to combine the UK Met Office weather station data with the NCDC weather station data to compute average temperatures, we can set up the input paths as follows:

    MultipleInputs.addInputPath(job, ncdcInputPath,
        TextInputFormat.class, NcdcTemperatureMapper.class);
    MultipleInputs.addInputPath(job, metOfficeInputPath,
        TextInputFormat.class, MetOfficeTemperatureMapper.class);

This code replaces the usual calls to FileInputFormat.addInputPath() and job.setMapperClass(). Both the Met Office and NCDC data are text files, so TextInputFormat is used for both. But because the two data sources use different line formats, we use two different Mappers: NcdcTemperatureMapper and MetOfficeTemperatureMapper. What matters is that the outputs of the two Mappers have the same types, so the reducers see the aggregated map outputs without knowing which Mapper produced them.

The MultipleInputs class also has an overloaded version of addInputPath() that takes no Mapper parameter. It is useful when you have multiple input formats but only one Mapper (set via the Job's setMapperClass() method). Its signature is shown below:

    public static void addInputPath(Job job, Path path,
        Class<? extends InputFormat> inputFormatClass);
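For instance (a sketch; the path variables and CommonMapper are hypothetical), two differently formatted sources could feed one Mapper like this, provided both formats deliver compatible key/value types:

    MultipleInputs.addInputPath(job, textPath, TextInputFormat.class);
    // Works only if the sequence file also stores LongWritable/Text pairs,
    // so that both sources match CommonMapper's input types
    MultipleInputs.addInputPath(job, seqPath, SequenceFileInputFormat.class);
    job.setMapperClass(CommonMapper.class);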

3. DBInputFormat

This input format uses JDBC to read data from a relational database. Because it has no sharding capability, you must be very careful when accessing the database: running too many mappers that read from it can overwhelm the database. For this reason, DBInputFormat is best used for loading relatively small data sets. The corresponding output format is DBOutputFormat, which is suitable for writing job output (moderate amounts of data) to a database.
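A sketch of a typical DBInputFormat setup using the new mapreduce.lib.db API (connection details, the table, and the StudentRecord class, which must implement Writable and DBWritable, are hypothetical):

    DBConfiguration.configureDB(job.getConfiguration(),
        "com.mysql.jdbc.Driver",              // JDBC driver class
        "jdbc:mysql://localhost/school",      // connection URL
        "user", "password");
    // Read columns id, name, score from table "students", ordered by id
    DBInputFormat.setInput(job, StudentRecord.class,
        "students", null /* conditions */, "id" /* orderBy */,
        "id", "name", "score");
    job.setInputFormatClass(DBInputFormat.class);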

4. Custom InputFormat

Sometimes the input formats that ship with Hadoop do not fully meet the needs of the business, so we need to customize an InputFormat class for the situation at hand. Since the data source is usually files, it is most convenient to inherit from the FileInputFormat class when customizing an InputFormat, so that complex operations such as splitting need not be handled. Customizing an input format breaks down into the following steps:

1. Inherit the FileInputFormat base class.

2. Override FileInputFormat's isSplitable() method.

3. Override FileInputFormat's createRecordReader() method.

How do we customize an input format following these steps? Let's work through an example to deepen understanding.

We have the final exam scores of five students, and we want to compute each student's total and average score. Sample data is shown below; each line has the format: student ID, name, Chinese score, math score, English score, physics score, chemistry score.

19020090040 Qin Core 123 131 100 95 100

19020090006 Li Lei 99 92 100 90 100

...

Below we write a program that implements the custom input format and computes each student's total and average score. Each step is described in turn, followed by a sketch of the corresponding code.

Step one: To make it easy to compute each student's scores, we first define a custom ScoreWritable class that implements the WritableComparable interface and encapsulates a student's results.
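A sketch of what ScoreWritable might look like (the field names and the float type are assumptions; write() and readFields() must handle the fields in the same order):

    import java.io.DataInput;
    import java.io.DataOutput;
    import java.io.IOException;
    import org.apache.hadoop.io.WritableComparable;

    public class ScoreWritable implements WritableComparable<ScoreWritable> {
        private float chinese, math, english, physics, chemistry;

        public ScoreWritable() {}   // Hadoop requires a no-argument constructor

        public void set(float chinese, float math, float english,
                        float physics, float chemistry) {
            this.chinese = chinese;
            this.math = math;
            this.english = english;
            this.physics = physics;
            this.chemistry = chemistry;
        }

        public float total() {
            return chinese + math + english + physics + chemistry;
        }

        public float average() {
            return total() / 5;
        }

        @Override
        public void write(DataOutput out) throws IOException {    // serialize
            out.writeFloat(chinese);
            out.writeFloat(math);
            out.writeFloat(english);
            out.writeFloat(physics);
            out.writeFloat(chemistry);
        }

        @Override
        public void readFields(DataInput in) throws IOException { // deserialize
            chinese = in.readFloat();
            math = in.readFloat();
            english = in.readFloat();
            physics = in.readFloat();
            chemistry = in.readFloat();
        }

        @Override
        public int compareTo(ScoreWritable other) {
            return Float.compare(total(), other.total());
        }
    }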

Step two: Define the custom input format class ScoreInputFormat. It first inherits FileInputFormat, then overrides the isSplitable() method and the createRecordReader() method. Note that overriding createRecordReader() really amounts to writing the ScoreRecordReader object it returns: the ScoreRecordReader class inherits RecordReader and implements the actual data reading.
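A sketch of the two classes (a minimal version that delegates the line reading to Hadoop's LineRecordReader; the choice of "student ID + name" as the key and ScoreWritable as the value is an assumption):

    import java.io.IOException;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.InputSplit;
    import org.apache.hadoop.mapreduce.JobContext;
    import org.apache.hadoop.mapreduce.RecordReader;
    import org.apache.hadoop.mapreduce.TaskAttemptContext;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.input.LineRecordReader;

    public class ScoreInputFormat extends FileInputFormat<Text, ScoreWritable> {

        @Override
        protected boolean isSplitable(JobContext context, Path filename) {
            return false;   // step 2: treat each file as a single, unsplit block
        }

        @Override
        public RecordReader<Text, ScoreWritable> createRecordReader(
                InputSplit split, TaskAttemptContext context) {
            return new ScoreRecordReader();   // step 3
        }
    }

    class ScoreRecordReader extends RecordReader<Text, ScoreWritable> {
        private final LineRecordReader lineReader = new LineRecordReader();
        private final Text key = new Text();
        private final ScoreWritable value = new ScoreWritable();

        @Override
        public void initialize(InputSplit split, TaskAttemptContext context)
                throws IOException, InterruptedException {
            lineReader.initialize(split, context);
        }

        @Override
        public boolean nextKeyValue() throws IOException, InterruptedException {
            if (!lineReader.nextKeyValue()) {
                return false;
            }
            // Line format: id name ... chinese math english physics chemistry;
            // the last five fields are scores, everything before them is id + name
            String[] f = lineReader.getCurrentValue().toString().trim().split("\\s+");
            int n = f.length;
            StringBuilder idAndName = new StringBuilder(f[0]);
            for (int i = 1; i < n - 5; i++) {
                idAndName.append(' ').append(f[i]);
            }
            key.set(idAndName.toString());
            value.set(Float.parseFloat(f[n - 5]), Float.parseFloat(f[n - 4]),
                      Float.parseFloat(f[n - 3]), Float.parseFloat(f[n - 2]),
                      Float.parseFloat(f[n - 1]));
            return true;
        }

        @Override
        public Text getCurrentKey() { return key; }

        @Override
        public ScoreWritable getCurrentValue() { return value; }

        @Override
        public float getProgress() throws IOException {
            return lineReader.getProgress();
        }

        @Override
        public void close() throws IOException {
            lineReader.close();
        }
    }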

Step three: Write the MapReduce program that computes each student's total and average score. Note that our custom input format must be set on the job: job.setInputFormatClass(ScoreInputFormat.class); // set the custom input format
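A sketch of the job (the default identity Mapper passes records straight through; the reducer computes the total and average; the class and path names are assumptions):

    import java.io.IOException;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.Reducer;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class ScoreDriver {

        public static class ScoreReducer
                extends Reducer<Text, ScoreWritable, Text, Text> {
            @Override
            protected void reduce(Text student, Iterable<ScoreWritable> scores,
                    Context context) throws IOException, InterruptedException {
                for (ScoreWritable s : scores) {
                    // Emit: student -> total <tab> average
                    context.write(student, new Text(s.total() + "\t" + s.average()));
                }
            }
        }

        public static void main(String[] args) throws Exception {
            Job job = Job.getInstance(new Configuration(), "student scores");
            job.setJarByClass(ScoreDriver.class);

            job.setInputFormatClass(ScoreInputFormat.class); // set the custom input format

            // No setMapperClass() call: the default identity Mapper passes
            // each (Text, ScoreWritable) record through unchanged
            job.setReducerClass(ScoreReducer.class);
            job.setMapOutputKeyClass(Text.class);
            job.setMapOutputValueClass(ScoreWritable.class);
            job.setOutputKeyClass(Text.class);
            job.setOutputValueClass(Text.class);

            FileInputFormat.addInputPath(job, new Path(args[0]));
            FileOutputFormat.setOutputPath(job, new Path(args[1]));
            System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
    }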

In general, though, we rarely need to customize an input format: the variety of input formats that ship with Hadoop covers most practical needs.
