Learning Hadoop 2.4.1: InputFormat and Source Code Analysis

When you submit a job to a Hadoop cluster, you specify the format of the job's input (when none is specified, the default is TextInputFormat). Hadoop describes the specification of a MapReduce job's input with the InputFormat class or interface; it is "class or interface" because InputFormat was defined as an interface in the old API (hadoop-0.x), while in the new API (hadoop-1.x and hadoop-2.x) it is an abstract class. This article is mainly concerned with the InputFormat abstract class and its subclasses. InputFormat has three roles: it verifies that the job's input conforms to the specification; it divides the input files into logical InputSplits, each of which is assigned to one mapper task; and it provides the RecordReader implementation responsible for collecting records from an InputSplit and handing them to the mapper task. Different InputFormat subclasses provide different InputSplits and RecordReaders, for example FileSplit for files and DBInputSplit for databases, LineRecordReader for text files and SequenceFileRecordReader for sequence files, and so on. (The original article shows a diagram of InputFormat and its subclasses at this point.)
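
To make this concrete, here is a minimal sketch of selecting the input format while configuring a job with the new API; the class name InputFormatDemo and the input path are invented for the example:

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;

public class InputFormatDemo {
  public static void main(String[] args) throws IOException {
    Configuration conf = new Configuration();
    Job job = Job.getInstance(conf, "input-format-demo");
    // TextInputFormat is the default anyway; it is set explicitly here
    // only to show where the choice of InputFormat is made.
    job.setInputFormatClass(TextInputFormat.class);
    // Hypothetical input path; replace with a real HDFS or local path.
    FileInputFormat.addInputPath(job, new Path("/user/demo/input"));
  }
}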

The InputFormat abstract class declares only two abstract methods, one for obtaining the InputSplits and one for obtaining the RecordReader. An InputSplit is only a logical division of the input file rather than a physical split into blocks; in other words, an InputSplit merely designates the range of the input file that will be fed to a particular mapper. In practice FileInputFormat is used as the input in most cases, so this article concentrates on FileInputFormat and its subclasses; other classes such as DBInputFormat receive only a brief description.
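
For reference, this is how the two methods are declared by org.apache.hadoop.mapreduce.InputFormat in the new API (the explanatory comments are added here):

import java.io.IOException;
import java.util.List;
import org.apache.hadoop.mapreduce.InputSplit;
import org.apache.hadoop.mapreduce.JobContext;
import org.apache.hadoop.mapreduce.RecordReader;
import org.apache.hadoop.mapreduce.TaskAttemptContext;

public abstract class InputFormat<K, V> {

  // Logically divide the job's input into InputSplits; each split is
  // handed to one mapper task.
  public abstract List<InputSplit> getSplits(JobContext context)
      throws IOException, InterruptedException;

  // Create the RecordReader that extracts key/value records from a split.
  public abstract RecordReader<K, V> createRecordReader(InputSplit split,
      TaskAttemptContext context) throws IOException, InterruptedException;
}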

    • DBInputFormat: an InputFormat that reads data from a SQL table. DBInputFormat uses a LongWritable containing the record number as the key and a DBWritable as the value. Its subclass DataDrivenDBInputFormat divides InputSplits with a different mechanism from the parent class.
    • ComposableInputFormat: an abstract class whose subclasses must provide a ComposableRecordReader instead of a plain RecordReader; ComposableRecordReader offers additional methods relative to RecordReader.
    • CompositeInputFormat: an InputFormat that performs a join over a set of data sources sharing the same sorting and partitioning.
    • FileInputFormat: an abstract class that is the base class for all file-based InputFormats. It provides a generic implementation of getSplits(JobContext), and subclasses can override isSplitable(JobContext, Path) to control whether an input file may be split or must be processed whole by a single mapper task, as the sketch after this list shows.
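
As a sketch of that hook, the following hypothetical subclass (the name WholeFileTextInputFormat is invented) disables splitting so that every file is read whole:

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.JobContext;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;

// A TextInputFormat variant whose files are never split, so each file
// becomes exactly one InputSplit and is processed by a single mapper task.
public class WholeFileTextInputFormat extends TextInputFormat {
  @Override
  protected boolean isSplitable(JobContext context, Path file) {
    return false;
  }
}

TextInputFormat's own isSplitable only refuses to split files compressed with a non-splittable codec; returning false unconditionally extends that behavior to every file.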

The main subclasses of FileInputFormat are TextInputFormat, SequenceFileInputFormat, NLineInputFormat, KeyValueTextInputFormat, FixedLengthInputFormat and CombineFileInputFormat. The following is a brief overview of their uses and features, with a configuration sketch after the list.

  • TextInputFormat: the InputFormat for plain text files and the default InputFormat. Input files are broken into lines, with carriage return or line feed marking the end of a line; the key is the position (byte offset) of the line in the file and the value is the content of the line. It uses LineRecordReader to read the contents of an InputSplit.
  • NLineInputFormat: makes every N lines of the input one InputSplit, where N is specified by the parameter mapreduce.input.lineinputformat.linespermap and defaults to 1. The RecordReader is again LineRecordReader.
  • KeyValueTextInputFormat: an InputFormat for plain text files in which each line is divided into a key and a value by a separator byte. The separator is specified by the parameter mapreduce.input.keyvaluelinerecordreader.key.value.separator and defaults to \t. If the separator is absent from a line, the whole line becomes the key and the value is empty. The RecordReader is KeyValueLineRecordReader.
  • FixedLengthInputFormat: an InputFormat for reading files of fixed-length records; the contents need not be text and may be binary data. The user must set the record length through the parameter fixedlengthinputformat.record.length, whose default value is 0; if the value is less than or equal to 0, an IOException is thrown. The RecordReader is FixedLengthRecordReader.
  • CombineFileInputFormat: an abstract class whose getSplits(JobContext) returns a List<CombineFileSplit> rather than a List<FileSplit>. A CombineFileSplit is constructed from files in the input paths, cannot mix files from different pools, and may contain blocks of different files. The class has two subclasses, CombineSequenceFileInputFormat for sequence files and CombineTextInputFormat for plain text files, whose RecordReaders are SequenceFileRecordReaderWrapper and TextRecordReaderWrapper respectively.
  • SequenceFileInputFormat: the InputFormat for sequence files. Its method for obtaining InputSplits is inherited from FileInputFormat and not overridden, and its RecordReader is SequenceFileRecordReader. The class has three subclasses: SequenceFileAsBinaryInputFormat, which reads the keys and values of a sequence file in their raw binary form; SequenceFileAsTextInputFormat, which converts the keys and values to strings; and SequenceFileInputFilter, which samples records from a sequence file before they are processed by the MapReduce job, the sample being determined by a filter class. Their RecordReaders are SequenceFileAsBinaryRecordReader, SequenceFileAsTextRecordReader and FilterRecordReader respectively.
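
To make the parameters above concrete, here is a minimal configuration sketch; the class name and the values 10, ',' and 64 are invented for illustration, while the property names are the ones quoted in the list:

import org.apache.hadoop.conf.Configuration;

public class InputFormatConfigDemo {
  public static void main(String[] args) {
    Configuration conf = new Configuration();
    // NLineInputFormat: put 10 lines into each InputSplit (default is 1).
    conf.setInt("mapreduce.input.lineinputformat.linespermap", 10);
    // KeyValueTextInputFormat: split each line on ',' instead of the
    // default '\t'.
    conf.set("mapreduce.input.keyvaluelinerecordreader.key.value.separator", ",");
    // FixedLengthInputFormat: every record is exactly 64 bytes; leaving
    // this at its default of 0 (or any value <= 0) causes an IOException.
    conf.setInt("fixedlengthinputformat.record.length", 64);
  }
}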

With the overview of InputFormat and its subclasses complete, the following analyzes in detail, against the source code, how FileInputFormat divides the input into logical InputSplits and how a RecordReader reads the data. Here is the source code of getSplits:

public List<InputSplit> getSplits(JobContext job) throws IOException {
  // Stopwatch measures the execution time in nanoseconds
  Stopwatch sw = new Stopwatch().start();
  // The larger of the format's minimum split size and the
  // mapreduce.input.fileinputformat.split.minsize setting; defaults to 1
  long minSize = Math.max(getFormatMinSplitSize(), getMinSplitSize(job));
  // The value of mapreduce.input.fileinputformat.split.maxsize,
  // which defaults to Long.MAX_VALUE
  long maxSize = getMaxSplitSize(job);

  // generate splits
  List<InputSplit> splits = new ArrayList<InputSplit>();
  // Get the input files of the job
  List<FileStatus> files = listStatus(job);
  for (FileStatus file : files) {
    Path path = file.getPath();
    long length = file.getLen();
    if (length != 0) {
      BlockLocation[] blkLocations;
      // Get the locations of the blocks that make up the file
      if (file instanceof LocatedFileStatus) {
        blkLocations = ((LocatedFileStatus) file).getBlockLocations();
      } else {
        FileSystem fs = path.getFileSystem(job.getConfiguration());
        blkLocations = fs.getFileBlockLocations(file, 0, length);
      }
      if (isSplitable(job, path)) {
        // Get the file's block size
        long blockSize = file.getBlockSize();
        // Compute the InputSplit size; typically this returns dfs.blocksize
        long splitSize = computeSplitSize(blockSize, minSize, maxSize);

        long bytesRemaining = length;
        // While the remainder exceeds 1.1 * splitSize the file keeps being
        // divided; once it is less than or equal to that, the remainder
        // becomes one InputSplit. Thus each InputSplit is at most
        // 1.1 * splitSize, and the last one is larger than 0.1 * splitSize.
        while (((double) bytesRemaining) / splitSize > SPLIT_SLOP) {
          // Compute the index of the block in which the InputSplit starts;
          // only the start position matters, regardless of whether the
          // InputSplit extends beyond that block (an InputSplit is a
          // logical concept)
          int blkIndex = getBlockIndex(blkLocations, length - bytesRemaining);
          // Start positions are 0, splitSize, 2 * splitSize, ...
          splits.add(makeSplit(path, length - bytesRemaining, splitSize,
                               blkLocations[blkIndex].getHosts()));
          bytesRemaining -= splitSize;
        }

        if (bytesRemaining != 0) {
          int blkIndex = getBlockIndex(blkLocations, length - bytesRemaining);
          splits.add(makeSplit(path, length - bytesRemaining, bytesRemaining,
                     blkLocations[blkIndex].getHosts()));
        }
      } else { // not splitable
        splits.add(makeSplit(path, 0, length, blkLocations[0].getHosts()));
      }
    } else {
      // Create empty hosts array for zero length files
      splits.add(makeSplit(path, 0, length, new String[0]));
    }
  }
  // Save the number of input files for metrics/loadgen: set
  // mapreduce.input.fileinputformat.numinputfiles to the file count
  job.getConfiguration().setLong(NUM_INPUT_FILES, files.size());
  sw.stop();
  if (LOG.isDebugEnabled()) {
    LOG.debug("Total # of splits generated by getSplits: " + splits.size()
        + ", TimeTaken: " + sw.elapsedMillis());
  }
  return splits;
}
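
A detail worth isolating from the listing above is computeSplitSize, which in FileInputFormat simply clamps the block size between the configured minimum and maximum:

// FileInputFormat.computeSplitSize: the split size is the block size
// clamped to the interval [minSize, maxSize].
protected long computeSplitSize(long blockSize, long minSize,
                                long maxSize) {
  return Math.max(minSize, Math.min(maxSize, blockSize));
}

With the defaults (minSize = 1, maxSize = Long.MAX_VALUE) the split size is exactly the block size; lowering mapreduce.input.fileinputformat.split.maxsize below the block size shrinks the splits, while raising mapreduce.input.fileinputformat.split.minsize above it enlarges them.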

In getSplits, if splitSize equals blockSize, the start position of each InputSplit coincides with a block boundary of the file; if splitSize is smaller than blockSize, that is, if the parameter mapreduce.input.fileinputformat.split.maxsize is used to cap the size of each InputSplit, the start positions fall inside file blocks. The splitting logic is purely size-based, so could a single line of text be divided across two InputSplits? The answer is yes. Yet the code above does nothing about this, which would seem to allow incomplete records to reach the map phase. Hadoop of course does not permit that; handling it is the RecordReader's job. Taking LineRecordReader as an example, let us see how Hadoop handles records that cross InputSplit boundaries. LineRecordReader's initialize method is called once at initialization; for any split other than the first, it locates the first newline character in the InputSplit and positions the actual read to start immediately after that first line break (the first InputSplit needs no such adjustment). The skipped content is read by the LineRecordReader of the previous split when that split is processed. The code fragment that locates the first newline character (assuming the input is an uncompressed file) is:

// Seek to the start position of this FileSplit within the input file
fileIn.seek(start);
in = new SplitLineReader(fileIn, job, this.recordDelimiterBytes);
filePosition = fileIn;
// If this is not the first split, we always throw away the first record
// because we always (except for the last split) read one extra line in
// the next() method.
// The start of the first split is offset 0, the beginning of the input
// file, so it needs no adjustment. For every other split, skip everything
// up to and including the first newline: that data may be a whole line,
// the tail of a line, or just the line delimiter itself.
if (start != 0) {
  // readLine returns the number of bytes consumed up to and including
  // the line delimiter (CR, LF, or CRLF), so this advances start to the
  // beginning of the next line
  start += in.readLine(new Text(), 0, maxBytesToConsume(start));
}
this.pos = start;

The code that reads past the end of the split, into the beginning of the next InputSplit, is in nextKeyValue:

int newSize = 0;
// We always read one extra line, which lies outside the upper
// split limit, i.e. (end - 1).
// This reads into the next InputSplit up to its first newline character,
// which is why initialize must discard everything up to the first line
// break when it processes that next InputSplit.
while (getFilePosition() <= end || in.needAdditionalRecordAfterSplit()) {
  newSize = in.readLine(value, maxLineLength,
      Math.max(maxBytesToConsume(pos), maxLineLength));
  pos += newSize;
  if (newSize < maxLineLength) {
    break;
  }
  // line too long. try again
  LOG.info("Skipped line of size " + newSize + " at pos " + (pos - newSize));
}

The function of readLine is to read one line of data into a Text object. Because FileSplits are divided purely by size, a line may straddle two splits, say S1 and S2; when the last line of S1 is read, reading continues into S2 up to its first newline, which is what the nextKeyValue method implements. When S2 is subsequently processed, the data that has already been read is skipped to avoid reading it twice, which is what initialize implements. For example, if S1 ends at byte offset 100 but the line containing that byte ends at offset 120, S1's reader consumes the line through offset 120, and S2's reader begins at offset 121. Readers interested in how other RecordReader types read their data can consult the corresponding source code.
