Hadoop MapReduce InputFormat Basics

Source: Internet
Author: User
Tags: hadoop mapreduce

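Sometimes you may want to read your input data differently from the way the built-in formats read it. In that case you need to create your own InputFormat class. InputFormat is an interface with only two methods.

    public interface InputFormat<K, V> {
        InputSplit[] getSplits(JobConf job, int numSplits) throws IOException;
        RecordReader<K, V> getRecordReader(InputSplit split, JobConf job, Reporter reporter) throws IOException;
    }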

getSplits(): identifies all the input data and divides it into smaller input splits; each map task processes one split.
getRecordReader(): provides a RecordReader that iterates over the data in a given split and turns it into <key, value> records.
Since few people want to deal with how the input is divided into splits, you should extend the FileInputFormat class, which handles the splitting for you. Most of the commonly used InputFormat classes are subclasses of FileInputFormat.
InputFormat    Description

TextInputFormat
    Each line in the input file is a record. The key is the byte offset of the line and the value is the contents of the line.
    Key: LongWritable
    Value: Text

KeyValueTextInputFormat
    Each line in the input file is a record, and the first delimiter character splits each line. The content before the delimiter is the key; the content after it is the value.
    The delimiter is set by the key.value.separator.in.input.line property, which defaults to the tab (\t) character.
    Key: Text
    Value: Text

SequenceFileInputFormat<K,V>
    An InputFormat for reading sequence files, with user-defined <key, value> types. A sequence file is Hadoop's own compressed binary file format.
    It is used to optimize passing data from the output of one MapReduce job to the input of another MapReduce job.
    Key: K (user-defined)
    Value: V (user-defined)

NLineInputFormat
    Same as TextInputFormat, but each split is guaranteed to contain exactly N lines. N is set by the mapred.line.input.format.linespermap property, which defaults to 1.
    Key: LongWritable
    Value: Text
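The difference between the first two formats in the table above can be illustrated with a small sketch in plain Java. It does not use Hadoop at all; the class RecordParsingSketch and its methods are hypothetical names made up for this illustration, and keys are kept as strings rather than LongWritable or Text. It only models how a line becomes a <key, value> record.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

/** Hadoop-free model of how two input formats turn lines into records. */
final class RecordParsingSketch {

    /** One record; key and value are plain strings for illustration. */
    static final class Record {
        final String key;
        final String value;

        Record(String key, String value) {
            this.key = key;
            this.value = value;
        }

        @Override
        public String toString() {
            return "<" + key + ", " + value + ">";
        }
    }

    /** TextInputFormat style: key = byte offset of the line start, value = the line. */
    static List<Record> textRecords(String data) {
        List<Record> out = new ArrayList<>();
        long byteOffset = 0;
        int start = 0;
        while (start < data.length()) {
            int end = data.indexOf('\n', start);
            if (end < 0) {
                end = data.length(); // last line without a trailing newline
            }
            String line = data.substring(start, end);
            out.add(new Record(Long.toString(byteOffset), line));
            // Advance by the encoded length of the line plus one byte for '\n'.
            byteOffset += line.getBytes(StandardCharsets.UTF_8).length + 1;
            start = end + 1;
        }
        return out;
    }

    /** KeyValueTextInputFormat style: split the line at the FIRST separator only. */
    static Record keyValueRecord(String line, char separator) {
        int pos = line.indexOf(separator);
        if (pos < 0) {
            // Assumption of this sketch: with no separator, the whole line is the key.
            return new Record(line, "");
        }
        return new Record(line.substring(0, pos), line.substring(pos + 1));
    }
}
```

For example, textRecords("ab\ncde\n") yields the records <0, ab> and <3, cde>, while keyValueRecord("k1\tv1\tv2", '\t') yields key k1 and value "v1\tv2", because only the first tab acts as the delimiter.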
 
FileInputFormat implements the getSplits() method but leaves getRecordReader() abstract for its subclasses to implement. FileInputFormat's getSplits() implementation treats numSplits as a hint for the number of splits and normally keeps each split no larger than an HDFS block. Subclasses of FileInputFormat can also override some protected methods, such as isSplitable(), which determines whether a file may be split. It returns true by default, meaning that any file larger than the HDFS block size will be split. Sometimes you do not want a file to be split, for example because some binary files cannot be split; in that case, override the method to return false.
When using FileInputFormat, your main focus should be on the RecordReader, which breaks a split into records and produces the <key, value> pairs.
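To make the split behaviour described above more concrete, here is a simplified sketch in plain Java with no Hadoop dependency. The class SplitSizeSketch and its methods are hypothetical names for this illustration. The sizing rule, max(minSize, min(totalSize / numSplits, blockSize)), follows what the older org.apache.hadoop.mapred FileInputFormat does as far as I know; the real implementation also handles multiple files, block locations, and other details that are left out here.

```java
import java.util.ArrayList;
import java.util.List;

/** Simplified, Hadoop-free model of how splits are sized and laid out. */
final class SplitSizeSketch {

    /**
     * numSplits is only a hint: it sets a goal size, which is then capped
     * by the block size and raised to at least the minimum split size.
     */
    static long splitSize(long totalSize, int numSplits, long minSize, long blockSize) {
        long goalSize = totalSize / Math.max(numSplits, 1);
        return Math.max(minSize, Math.min(goalSize, blockSize));
    }

    /** Returns {offset, length} pairs that cover a file of fileLength bytes. */
    static List<long[]> splits(long fileLength, long splitSize, boolean splitable) {
        if (splitSize <= 0) {
            throw new IllegalArgumentException("splitSize must be positive");
        }
        List<long[]> result = new ArrayList<>();
        if (fileLength <= 0) {
            return result; // nothing to read
        }
        if (!splitable) {
            // Models isSplitable() returning false: the whole file is one split.
            result.add(new long[] {0L, fileLength});
            return result;
        }
        for (long offset = 0; offset < fileLength; offset += splitSize) {
            long length = Math.min(splitSize, fileLength - offset);
            result.add(new long[] {offset, length});
        }
        return result;
    }
}
```

With a total size of 1000 bytes, numSplits of 4, a minimum of 1 byte and a block size of 128 bytes, splitSize returns 128, since the goal of 250 bytes is capped by the block size. A 300-byte file then gives three splits when it is splittable and a single 300-byte split when it is not.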
    public interface RecordReader<K, V> {
        boolean next(K key, V value) throws IOException;
        K createKey();
        V createValue();

        long getPos() throws IOException;
        public void close() throws IOException;
        float getProgress() throws IOException;
    }
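The calling pattern behind this interface may be easier to see in a small, Hadoop-free sketch. Everything here is a stand-in invented for illustration: the nested RecordReader interface only mirrors the method list above, LongHolder plays the role of a mutable key such as LongWritable, and InMemoryLineReader reads from a string instead of a file split. The point is that the caller creates one key object and one value object, and next() refills them for every record.

```java
import java.io.IOException;

/** Hadoop-free illustration of the RecordReader calling pattern. */
final class RecordReaderSketch {

    /** Local stand-in that mirrors the interface above; not the Hadoop type. */
    interface RecordReader<K, V> {
        boolean next(K key, V value) throws IOException;
        K createKey();
        V createValue();
        long getPos() throws IOException;
        void close() throws IOException;
        float getProgress() throws IOException;
    }

    /** Mutable holder for a long, so next() can overwrite the key in place. */
    static final class LongHolder {
        long value;
    }

    /** Reads lines from a string; key = character offset, value = line text. */
    static final class InMemoryLineReader implements RecordReader<LongHolder, StringBuilder> {
        private final String data;
        private int pos = 0;

        InMemoryLineReader(String data) {
            this.data = data;
        }

        @Override
        public boolean next(LongHolder key, StringBuilder value) {
            if (pos >= data.length()) {
                return false; // no more records
            }
            int end = data.indexOf('\n', pos);
            if (end < 0) {
                end = data.length();
            }
            key.value = pos;            // reuse the caller's key object
            value.setLength(0);         // reuse the caller's value object
            value.append(data, pos, end);
            pos = end + 1;
            return true;
        }

        @Override
        public LongHolder createKey() {
            return new LongHolder();
        }

        @Override
        public StringBuilder createValue() {
            return new StringBuilder();
        }

        @Override
        public long getPos() {
            return Math.min(pos, data.length());
        }

        @Override
        public void close() {
            // Nothing to release for an in-memory reader.
        }

        @Override
        public float getProgress() {
            return data.isEmpty() ? 1.0f : Math.min(1.0f, (float) pos / data.length());
        }
    }

    /** Drives a reader the way a caller would: one key, one value, reused per record. */
    static <K, V> int countRecords(RecordReader<K, V> reader) throws IOException {
        K key = reader.createKey();
        V value = reader.createValue();
        int count = 0;
        try {
            while (reader.next(key, value)) {
                count++; // a real map task would process (key, value) here
            }
        } finally {
            reader.close();
        }
        return count;
    }
}
```

Reusing the key and value objects is why createKey() and createValue() exist alongside next(): the caller allocates them once and the reader overwrites their contents for each record.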

    
