Sometimes you may want to read input data in a way the built-in formats do not support. In that case you need to create your own InputFormat class.

InputFormat is an interface with only two methods:
```java
public interface InputFormat<K, V> {
  InputSplit[] getSplits(JobConf job, int numSplits) throws IOException;
  RecordReader<K, V> getRecordReader(InputSplit split, JobConf job,
                                     Reporter reporter) throws IOException;
}
```
- getSplits(): identifies all the input data and divides it into small input splits; each map task processes one split.
- getRecordReader(): provides a RecordReader that iterates over a given split and parses the data into records of the <key, value> form.
Since no one wants to worry about how to divide the input into splits themselves, you should subclass FileInputFormat, which handles the splitting for you. Most of the well-known InputFormat classes are subclasses of FileInputFormat.
| InputFormat | Description |
| --- | --- |
| TextInputFormat | Each line of the input file is a record; the key is the byte offset of the line and the value is the contents of the line. Key: LongWritable, Value: Text |
| KeyValueTextInputFormat | Each line of the input file is a record, split at the first separator character: the text before the separator is the key and the text after it is the value. The separator is set by the key.value.separator.in.input.line property and defaults to the tab character (\t). Key: Text, Value: Text |
| SequenceFileInputFormat<K,V> | An InputFormat for reading sequence files, Hadoop's compressed binary format for serialized <key, value> pairs. It is typically used to pass the output of one MapReduce job to the input of another. Key: K (user-defined), Value: V (user-defined) |
| NLineInputFormat | Same as TextInputFormat, but each split is guaranteed to contain exactly N lines. N is set by the mapred.line.input.format.linespermap property, which defaults to 1. Key: LongWritable, Value: Text |
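A job driver selects one of these formats through JobConf. Below is a minimal sketch, assuming the old org.apache.hadoop.mapred API; the Driver class name and the input path are hypothetical:

```java
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.KeyValueTextInputFormat;

public class Driver {
    public static void main(String[] args) {
        JobConf conf = new JobConf(Driver.class);
        // Read each line as a <Text, Text> record, split at the first separator.
        conf.setInputFormat(KeyValueTextInputFormat.class);
        // Override the default tab separator with a comma.
        conf.set("key.value.separator.in.input.line", ",");
        FileInputFormat.setInputPaths(conf, new Path("/user/data/input"));  // hypothetical path
        // ... set the mapper, reducer, and output format, then run the job
    }
}
```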
FileInputFormat implements the getSplits() method but leaves getRecordReader() abstract for its subclasses to implement. FileInputFormat's getSplits() implementation divides the input into roughly numSplits splits, keeping each split no larger than an HDFS block.

Subclasses of FileInputFormat can override its protected methods, such as isSplitable(), which determines whether a given file may be split. It returns true by default, meaning that any file larger than an HDFS block will be split. Sometimes you do not want a file split at all, for example when a binary sequence file cannot be processed in pieces; in that case you override the method to return false (see the sketch below). When using FileInputFormat, your primary focus should therefore be on how a split is decomposed into records, that is, on the RecordReader that produces the <key, value> pairs.
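The override itself is short. Here is a minimal sketch, assuming the old org.apache.hadoop.mapred API; the class name WholeFileInputFormat is hypothetical, and it reuses Hadoop's stock LineRecordReader rather than a custom reader:

```java
import java.io.IOException;

import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileSplit;
import org.apache.hadoop.mapred.InputSplit;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.LineRecordReader;
import org.apache.hadoop.mapred.RecordReader;
import org.apache.hadoop.mapred.Reporter;

// Hypothetical format for files that must not be split:
// every map task reads one whole file.
public class WholeFileInputFormat extends FileInputFormat<LongWritable, Text> {

    @Override
    protected boolean isSplitable(FileSystem fs, Path filename) {
        return false;  // never split, regardless of the HDFS block size
    }

    @Override
    public RecordReader<LongWritable, Text> getRecordReader(
            InputSplit split, JobConf job, Reporter reporter) throws IOException {
        // Reuse the stock line-oriented reader over the (unsplit) file.
        return new LineRecordReader(job, (FileSplit) split);
    }
}
```

Its getRecordReader() simply hands back a LineRecordReader. The RecordReader interface that every such reader implements looks like this: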
```java
public interface RecordReader<K, V> {
  boolean next(K key, V value) throws IOException;
  K createKey();
  V createValue();
  long getPos() throws IOException;
  void close() throws IOException;
  float getProgress() throws IOException;
}
```
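If none of the stock readers fits, you implement this interface yourself. The sketch below is illustrative only: the class name CommaRecordReader and the comma convention are invented for the example. It wraps LineRecordReader and splits each line at the first comma to produce <Text, Text> records:

```java
import java.io.IOException;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileSplit;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.LineRecordReader;
import org.apache.hadoop.mapred.RecordReader;

// Hypothetical reader: wraps LineRecordReader and splits each
// line at the first comma into a <Text, Text> record.
public class CommaRecordReader implements RecordReader<Text, Text> {
    private final LineRecordReader lineReader;
    private final LongWritable lineKey = new LongWritable();
    private final Text lineValue = new Text();

    public CommaRecordReader(JobConf job, FileSplit split) throws IOException {
        lineReader = new LineRecordReader(job, split);
    }

    public boolean next(Text key, Text value) throws IOException {
        if (!lineReader.next(lineKey, lineValue)) {
            return false;  // this split is exhausted
        }
        String line = lineValue.toString();
        int sep = line.indexOf(',');
        if (sep < 0) {
            key.set(line);       // no comma: the whole line is the key
            value.set("");
        } else {
            key.set(line.substring(0, sep));
            value.set(line.substring(sep + 1));
        }
        return true;
    }

    public Text createKey()    { return new Text(); }
    public Text createValue()  { return new Text(); }
    public long getPos()       throws IOException { return lineReader.getPos(); }
    public void close()        throws IOException { lineReader.close(); }
    public float getProgress() throws IOException { return lineReader.getProgress(); }
}
```

A matching FileInputFormat subclass would return this reader from its getRecordReader() method.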