Sometimes you may want to read input data in a way the built-in formats do not support. In that case you need to create your own InputFormat class.

InputFormat is an interface with only two methods:
```java
public interface InputFormat<K, V> {
  InputSplit[] getSplits(JobConf job, int numSplits) throws IOException;
  RecordReader<K, V> getRecordReader(InputSplit split, JobConf job,
                                     Reporter reporter) throws IOException;
}
```
- getSplits(): identifies all the input data and divides it into small input splits; each map task processes one split.
- getRecordReader(): provides a RecordReader that iterates over a given split and parses the data into records of the <key, value> form.
Since no one wants to worry about how to divide the input into splits themselves, you should subclass FileInputFormat, which handles the splitting for you. Most of the well-known InputFormat classes are subclasses of FileInputFormat.
| InputFormat | Description |
| --- | --- |
| TextInputFormat | Each line of the input file is a record; the key is the byte offset of the line and the value is the contents of the line. Key: LongWritable, Value: Text |
| KeyValueTextInputFormat | Each line of the input file is a record, split at the first separator character: the text before the separator is the key and the text after it is the value. The separator is set by the key.value.separator.in.input.line property and defaults to the tab character (\t). Key: Text, Value: Text |
| SequenceFileInputFormat<K,V> | An InputFormat for reading sequence files, Hadoop's compressed binary format for serialized <key, value> pairs. It is typically used to pass the output of one MapReduce job to the input of another. Key: K (user-defined), Value: V (user-defined) |
| NLineInputFormat | Same as TextInputFormat, but each split is guaranteed to contain exactly N lines. N is set by the mapred.line.input.format.linespermap property, which defaults to 1. Key: LongWritable, Value: Text |
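A job driver selects one of these formats through JobConf. Below is a minimal sketch, assuming the old org.apache.hadoop.mapred API; the Driver class name and the input path are hypothetical:

```java
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.KeyValueTextInputFormat;

public class Driver {
    public static void main(String[] args) {
        JobConf conf = new JobConf(Driver.class);
        // Read each line as a <Text, Text> record, split at the first separator.
        conf.setInputFormat(KeyValueTextInputFormat.class);
        // Override the default tab separator with a comma.
        conf.set("key.value.separator.in.input.line", ",");
        FileInputFormat.setInputPaths(conf, new Path("/user/data/input"));  // hypothetical path
        // ... set the mapper, reducer, and output format, then run the job
    }
}
```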
FileInputFormat implements the getSplits() method but leaves getRecordReader() abstract for its subclasses to implement. FileInputFormat's getSplits() implementation divides the input into roughly numSplits splits, keeping each split no larger than an HDFS block.

Subclasses of FileInputFormat can override its protected methods, such as isSplitable(), which determines whether a given file may be split. It returns true by default, meaning that any file larger than an HDFS block will be split. Sometimes you do not want a file split at all, for example when a binary sequence file cannot be processed in pieces; in that case you override the method to return false (see the sketch below). When using FileInputFormat, your primary focus should therefore be on how a split is decomposed into records, that is, on the RecordReader that produces the <key, value> pairs.
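The override itself is short. Here is a minimal sketch, assuming the old org.apache.hadoop.mapred API; the class name WholeFileInputFormat is hypothetical, and it reuses Hadoop's stock LineRecordReader rather than a custom reader:

```java
import java.io.IOException;

import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileSplit;
import org.apache.hadoop.mapred.InputSplit;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.LineRecordReader;
import org.apache.hadoop.mapred.RecordReader;
import org.apache.hadoop.mapred.Reporter;

// Hypothetical format for files that must not be split:
// every map task reads one whole file.
public class WholeFileInputFormat extends FileInputFormat<LongWritable, Text> {

    @Override
    protected boolean isSplitable(FileSystem fs, Path filename) {
        return false;  // never split, regardless of the HDFS block size
    }

    @Override
    public RecordReader<LongWritable, Text> getRecordReader(
            InputSplit split, JobConf job, Reporter reporter) throws IOException {
        // Reuse the stock line-oriented reader over the (unsplit) file.
        return new LineRecordReader(job, (FileSplit) split);
    }
}
```

Its getRecordReader() simply hands back a LineRecordReader. The RecordReader interface that every such reader implements looks like this: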
```java
public interface RecordReader<K, V> {
  boolean next(K key, V value) throws IOException;
  K createKey();
  V createValue();
  long getPos() throws IOException;
  void close() throws IOException;
  float getProgress() throws IOException;
}
```
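If none of the stock readers fits, you implement this interface yourself. The sketch below is illustrative only: the class name CommaRecordReader and the comma convention are invented for the example. It wraps LineRecordReader and splits each line at the first comma to produce <Text, Text> records:

```java
import java.io.IOException;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileSplit;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.LineRecordReader;
import org.apache.hadoop.mapred.RecordReader;

// Hypothetical reader: wraps LineRecordReader and splits each
// line at the first comma into a <Text, Text> record.
public class CommaRecordReader implements RecordReader<Text, Text> {
    private final LineRecordReader lineReader;
    private final LongWritable lineKey = new LongWritable();
    private final Text lineValue = new Text();

    public CommaRecordReader(JobConf job, FileSplit split) throws IOException {
        lineReader = new LineRecordReader(job, split);
    }

    public boolean next(Text key, Text value) throws IOException {
        if (!lineReader.next(lineKey, lineValue)) {
            return false;  // this split is exhausted
        }
        String line = lineValue.toString();
        int sep = line.indexOf(',');
        if (sep < 0) {
            key.set(line);       // no comma: the whole line is the key
            value.set("");
        } else {
            key.set(line.substring(0, sep));
            value.set(line.substring(sep + 1));
        }
        return true;
    }

    public Text createKey()    { return new Text(); }
    public Text createValue()  { return new Text(); }
    public long getPos()       throws IOException { return lineReader.getPos(); }
    public void close()        throws IOException { lineReader.close(); }
    public float getProgress() throws IOException { return lineReader.getProgress(); }
}
```

A matching FileInputFormat subclass would return this reader from its getRecordReader() method.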