Common inputformat and outputformat for hadoop Development

Source: Internet
Author: User

When you use streaming of hadoop to read data, if the input is sequence file, if you use "-inputformat Org. apache. hadoop. mapred. if sequencefileinputformat is configured for read, garbled characters are displayed in the read data because the read data is still in the sequence file format, including the sequencefile header information. change to "inputformat Org. apache. hadoop. mapred. sequencefileastextinputformat "can be read normally.

The following is a rough introduction to inputformat and outputformat from other places:

The map reduce framework in hadoop relies on inputformat to provide data and outputformat to output data. Each map reduceProgramThey are inseparable. Hadoop provides a series of inputformat and outputformat for convenient development. This article introduces several common ones.

 

Textinputformat
Used to read plain text files. Files are divided into a series of rows ending with LF or Cr. The key is the position of each row (offset, longwritable type), and the value is the content of each row, text type.

Keyvaluetextinputformat
It is also used to read files. If the row is split into two parts by the separator (Tab by default), the first part is the key, and the remaining part is the value. If there is no separator, the entire row is used as the key, value is empty

Sequencefileinputformat
Used to read Sequence File. Sequence File is a binary file used by hadoop to store custom data formats. It has two subclasses: sequencefileasbinaryinputformat, which reads the key and value in byteswritable type;

Sequencefileastextinputformat, which reads the key and value in the text type.

Sequencefileinputfilter
According to the filter, partial matching data is obtained from the sequence file, and filter is specified through setfilterclass. Three filters are built in, and record where the key value of regexfilter satisfies the specified regular expression; percentfilter uses the specified parameter F to retrieve records with the number of record rows % F = 0; md5filter uses the specified parameter F to retrieve records with MD5 (key) % F = 0.

Nlineinputformat
0.18.x is added to split the file in the unit of action. For example, each row of the file corresponds to a map. The obtained key is the position of each row (offset, longwritable type), value is the content of each row, and text type.

Compositeinputformat is used to join multiple data sources.

Textoutputformat: output to a plain text file in the format of key + "+ value.

Nulloutputformat,/dev/null in hadoop, sends the output to the black hole.

Sequencefileoutputformat, Which is output to the Sequence File Format File.

Multiplesequencefileoutputformat, multipletextoutputformat, output records to different files based on keys.

Dbinputformat and dboutputformat are read from the DB and output to the DB. It is expected to be added in version 0.19.

From http://www.cnblogs.com/xuxm2007/archive/2011/09/01/2161974.html

Contact Us

The content source of this page is from Internet, which doesn't represent Alibaba Cloud's opinion; products and services mentioned on that page don't have any relationship with Alibaba Cloud. If the content of the page makes you feel confusing, please write us an email, we will handle the problem within 5 days after receiving your email.

If you find any instances of plagiarism from the community, please send an email to: info-contact@alibabacloud.com and provide relevant evidence. A staff member will contact you within 5 working days.

A Free Trial That Lets You Build Big!

Start building with 50+ products and up to 12 months usage for Elastic Compute Service

  • Sales Support

    1 on 1 presale consultation

  • After-Sales Support

    24/7 Technical Support 6 Free Tickets per Quarter Faster Response

  • Alibaba Cloud offers highly flexible support services tailored to meet your exact needs.