When you use Hadoop Streaming to read data and the input is a sequence file, specifying "-inputformat org.apache.hadoop.mapred.SequenceFileInputFormat" produces garbled characters in the data that is read. The data is still in the sequence file format, including the SequenceFile header information. Changing the option to "-inputformat org.apache.hadoop.mapred.SequenceFileAsTextInputFormat" lets the data be read normally.
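As a rough sketch of the mapper side, assume a Python script is passed to streaming with -mapper. With SequenceFileAsTextInputFormat, each record should arrive on standard input as one line of text with the key and the value separated by a tab. The names parse_record and run_mapper and the sample records are made up for illustration and are not part of Hadoop.

```python
import io

def parse_record(line):
    """Split one streaming input line into (key, value) at the first tab."""
    key, _, value = line.rstrip("\n").partition("\t")
    return key, value

def run_mapper(stdin, stdout):
    """Emit each key with the length of its value, tab-separated."""
    for line in stdin:
        key, value = parse_record(line)
        stdout.write(f"{key}\t{len(value)}\n")

# In a real job this would be run_mapper(sys.stdin, sys.stdout);
# here two made-up records stand in for the streaming input.
demo_out = io.StringIO()
run_mapper(io.StringIO("k1\thello\nk2\tworld!\n"), demo_out)
print(demo_out.getvalue(), end="")
```

Splitting at the first tab only means a value that itself contains tabs is kept intact.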
The following is a brief introduction to InputFormat and OutputFormat, taken from another source:
The MapReduce framework in Hadoop relies on InputFormat to supply data and on OutputFormat to write data out. Every MapReduce program depends on them. Hadoop provides a series of InputFormat and OutputFormat implementations to make development easier. This article introduces several common ones.
TextInputFormat
Used to read plain text files. The file is divided into a series of lines ending with LF or CR. The key is the position of each line (the offset, of type LongWritable), and the value is the content of the line, of type Text.
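The (offset, line) pairing described here can be simulated in plain Python. This is only an illustration of the record shape, not Hadoop code; the function name text_records is hypothetical, and the input is assumed to be UTF-8.

```python
def text_records(data: bytes):
    """Yield (byte_offset, line_text) pairs, one per line of the input."""
    offset = 0
    # keepends=True keeps the line terminator so offsets count every byte.
    for raw in data.splitlines(keepends=True):
        yield offset, raw.rstrip(b"\r\n").decode("utf-8")
        offset += len(raw)

for key, value in text_records(b"ab\ncde\r\nf"):
    print(key, value)
```

The key is the byte position where the line starts, so it grows by the length of each line including its terminator.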
KeyValueTextInputFormat
Also used to read text files. Each line is split into two parts at the separator (Tab by default): the first part is the key and the remainder is the value. If the line contains no separator, the entire line is used as the key and the value is empty.
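The splitting rule above, including the case where the separator is missing, can be written as a small sketch. The function name split_key_value and its separator parameter are illustrative only.

```python
def split_key_value(line, separator="\t"):
    """Return (key, value); value is empty when the separator is absent."""
    # partition splits at the first occurrence only.
    key, _, value = line.partition(separator)
    return key, value

print(split_key_value("user1\tclick\tpage7"))
print(split_key_value("no-separator-here"))
print(split_key_value("a=b=c", separator="="))
```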
SequenceFileInputFormat
Used to read sequence files. A sequence file is a binary file that Hadoop uses to store data in custom formats. It has two subclasses: SequenceFileAsBinaryInputFormat, which reads the key and value as BytesWritable, and SequenceFileAsTextInputFormat, which reads the key and value as Text.
SequenceFileInputFilter
Retrieves only the records of a sequence file that pass a filter. The filter is specified through setFilterClass. Three filters are built in: RegexFilter keeps records whose key matches the specified regular expression; PercentFilter takes a parameter f and keeps records whose record number % f == 0; MD5Filter takes a parameter f and keeps records where MD5(key) % f == 0.
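The three filter rules can be sketched as plain Python predicates. This follows the description in the text rather than Hadoop's source: the full-match behaviour of the regular expression and the treatment of the whole MD5 digest as one integer are assumptions, and all function names are hypothetical.

```python
import hashlib
import re

def regex_filter(pattern):
    """Accept records whose key fully matches the pattern (assumed)."""
    compiled = re.compile(pattern)
    return lambda index, key: compiled.fullmatch(key) is not None

def percent_filter(f):
    """Accept records whose record number is a multiple of f."""
    return lambda index, key: index % f == 0

def md5_filter(f):
    """Accept records where the MD5 digest of the key, read as an integer
    (an assumption of this sketch), is a multiple of f."""
    def accept(index, key):
        digest = hashlib.md5(key.encode("utf-8")).digest()
        return int.from_bytes(digest, "big") % f == 0
    return accept

def apply_filter(records, accept):
    """Keep the (key, value) records that the predicate accepts."""
    return [(k, v) for i, (k, v) in enumerate(records) if accept(i, k)]

records = [("a1", "x"), ("b2", "y"), ("a3", "z"), ("c4", "w")]
print(apply_filter(records, regex_filter(r"a\d")))
print(apply_filter(records, percent_filter(2)))
```

PercentFilter samples by position, while MD5Filter samples by key, so the same key is always kept or always dropped regardless of where it appears.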
NLineInputFormat
Added in 0.18.x. It splits the file in units of lines, so that, for example, each line of the file corresponds to one map. The key is the position of each line (the offset, of type LongWritable), and the value is the content of the line, of type Text.
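The line-based splitting can be illustrated with a short simulation in which every group of N lines becomes one split, and each split would go to one map. The function name n_line_splits and the default of one line per split are assumptions of this sketch.

```python
def n_line_splits(data: bytes, lines_per_split=1):
    """Group (offset, line) records into splits of at most N lines each."""
    splits, current, offset = [], [], 0
    for raw in data.splitlines(keepends=True):
        current.append((offset, raw.rstrip(b"\r\n").decode("utf-8")))
        offset += len(raw)
        if len(current) == lines_per_split:
            splits.append(current)
            current = []
    if current:  # the last split may hold fewer than N lines
        splits.append(current)
    return splits

for split in n_line_splits(b"a\nbb\nccc\n", lines_per_split=2):
    print(split)
```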
CompositeInputFormat: used to join multiple data sources.
TextOutputFormat: writes to a plain text file, one record per line in the form key + "\t" + value.
NullOutputFormat: Hadoop's /dev/null; it sends the output to a black hole.
SequenceFileOutputFormat: writes the output as a sequence file.
MultipleSequenceFileOutputFormat, MultipleTextOutputFormat: write records to different files based on their keys.
DBInputFormat and DBOutputFormat: read from and write to a database. They are expected to be added in version 0.19.
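Two of the output behaviours listed above, the tab-separated text line and the routing of records to different files by key, can be combined in one small in-memory sketch. The naming scheme (one output name per key, with a .txt suffix) and the function names are hypothetical; nothing is written to disk.

```python
from collections import defaultdict

def format_line(key, value):
    """Format one record as key, a tab, then value."""
    return f"{key}\t{value}"

def route_by_key(records):
    """Map a per-key output name to the lines that would be written to it."""
    outputs = defaultdict(list)
    for key, value in records:
        outputs[f"{key}.txt"].append(format_line(key, value))
    return dict(outputs)

routed = route_by_key([("us", "1"), ("de", "2"), ("us", "3")])
for name, lines in routed.items():
    print(name, lines)
```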
From http://www.cnblogs.com/xuxm2007/archive/2011/09/01/2161974.html