How Hadoop splits and reads input files is defined by an implementation of the InputFormat interface. TextInputFormat is the default implementation: it fetches one line of content at a time as input data, and since a line has no natural key, the key TextInputFormat returns is the byte offset of the line within the file, which has little apparent use on its own.
The Mapper used earlier declared LongWritable (key) and Text (value), which matches TextInputFormat: because its key is a byte offset, it can be a LongWritable. With KeyValueTextInputFormat, each line is split at the first separator into a Text key and a Text value, so the Mapper class and its map() method must be modified to accommodate the new key type.
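As a sketch of that signature change (using the old org.apache.hadoop.mapred API that matches the getRecordReader()/JobTracker terminology in these notes; the class name and the swap logic are illustrative, not from the original program):

```java
import java.io.IOException;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;

// With KeyValueTextInputFormat the key is no longer a LongWritable byte
// offset; both key and value arrive as Text, split at the first tab.
public class SwapKeyValueMapper extends MapReduceBase
        implements Mapper<Text, Text, Text, Text> {
    @Override
    public void map(Text key, Text value,
                    OutputCollector<Text, Text> output, Reporter reporter)
            throws IOException {
        // Example logic: emit (value, key), i.e. invert each pair.
        output.collect(value, key);
    }
}
```

The only required change from the TextInputFormat version is the first type parameter (Text instead of LongWritable) and the matching map() parameter type.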
The input to a MapReduce job is not necessarily external data; it is often the output of another MapReduce job, and the output format can likewise be customized. The default output format is consistent with the format that KeyValueTextInputFormat can read back in (each record is a line containing a tab-separated key and value). Hadoop also provides a more efficient binary, compressible file format called a sequence file. Sequence files are optimized for Hadoop processing and are the preferred format when chaining multiple MapReduce jobs; the class that reads them is SequenceFileInputFormat. The key and value types of a sequence file can be user-defined, but the output types of one job must match the input types of the next.
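A minimal sketch of chaining via sequence files, using the old mapred API (the class name and the "intermediate" path are placeholders; the point is that the (Text, IntWritable) types must match on both sides):

```java
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.SequenceFileInputFormat;
import org.apache.hadoop.mapred.SequenceFileOutputFormat;

public class ChainedJobConfig {
    // First job writes its (Text, IntWritable) output as a sequence file.
    public static JobConf firstJob() {
        JobConf conf = new JobConf(ChainedJobConfig.class);
        conf.setOutputFormat(SequenceFileOutputFormat.class);
        conf.setOutputKeyClass(Text.class);
        conf.setOutputValueClass(IntWritable.class);
        FileOutputFormat.setOutputPath(conf, new Path("intermediate"));
        return conf;
    }

    // Second job reads the same sequence file back in.
    public static JobConf secondJob() {
        JobConf conf = new JobConf(ChainedJobConfig.class);
        conf.setInputFormat(SequenceFileInputFormat.class);
        FileInputFormat.setInputPaths(conf, new Path("intermediate"));
        return conf;
    }
}
```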
To customize an InputFormat, two methods must be implemented:
getSplits() identifies all files used as input data and splits the input into input splits; each map task processes one split
getRecordReader() loops over a given split, extracting records and parsing each one into key and value objects of the predefined types
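The two methods above correspond to the old org.apache.hadoop.mapred.InputFormat interface, whose shape is roughly:

```java
import java.io.IOException;
import org.apache.hadoop.mapred.InputSplit;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.RecordReader;
import org.apache.hadoop.mapred.Reporter;

// Shape of the old-API InputFormat contract (declared here for illustration).
public interface InputFormat<K, V> {
    // Splits the input files into InputSplits, roughly numSplits of them.
    InputSplit[] getSplits(JobConf job, int numSplits) throws IOException;

    // Returns a reader that extracts (key, value) records from one split.
    RecordReader<K, V> getRecordReader(InputSplit split, JobConf job,
                                       Reporter reporter) throws IOException;
}
```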
In practice, splits are sized in units of blocks, and the default block size in HDFS is 64 MB
FileInputFormat has an isSplitable() method that checks whether a given file can be split; it returns true by default. Sometimes you may want a file to remain whole, as its own single split; in that case override it to return false
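A minimal sketch of such an override (the class name is illustrative; KeyValueTextInputFormat is used as the base here, but any FileInputFormat subclass works the same way):

```java
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapred.KeyValueTextInputFormat;

// A whole-file input format: overriding isSplitable() to return false
// forces each input file to be processed as a single split by one mapper.
public class NonSplittableTextInputFormat extends KeyValueTextInputFormat {
    @Override
    protected boolean isSplitable(FileSystem fs, Path file) {
        return false;
    }
}
```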
LineRecordReader implements the RecordReader interface; the implementation is a thin wrapper, with most of the work done in next()
We build our own InputFormat class by extending FileInputFormat and implementing a factory method that returns our RecordReader
In addition to that class, TimeUrlRecordReader implements the six methods of the RecordReader interface; it is essentially a wrapper around KeyValueTextInputFormat's record reader, except that it converts each record's Text value into a URLWritable
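A sketch of the whole arrangement under those assumptions (the class names TimeUrlTextInputFormat, TimeUrlLineRecordReader, and URLWritable follow the notes above, but the bodies here are my reconstruction, not the original code):

```java
import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.Writable;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileSplit;
import org.apache.hadoop.mapred.InputSplit;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.KeyValueLineRecordReader;
import org.apache.hadoop.mapred.RecordReader;
import org.apache.hadoop.mapred.Reporter;

// A custom value type holding a URL string.
class URLWritable implements Writable {
    protected Text url = new Text();
    public void set(String s) { url.set(s); }
    public void write(DataOutput out) throws IOException { url.write(out); }
    public void readFields(DataInput in) throws IOException { url.readFields(in); }
    public String toString() { return url.toString(); }
}

// The InputFormat itself is just a factory for the record reader.
public class TimeUrlTextInputFormat extends FileInputFormat<Text, URLWritable> {
    public RecordReader<Text, URLWritable> getRecordReader(
            InputSplit split, JobConf job, Reporter reporter) throws IOException {
        return new TimeUrlLineRecordReader(job, (FileSplit) split);
    }
}

// Wraps KeyValueLineRecordReader, converting the Text value to URLWritable.
class TimeUrlLineRecordReader implements RecordReader<Text, URLWritable> {
    private final KeyValueLineRecordReader inner;
    private final Text innerValue;

    TimeUrlLineRecordReader(JobConf job, FileSplit split) throws IOException {
        inner = new KeyValueLineRecordReader(job, split);
        innerValue = inner.createValue();
    }
    public boolean next(Text key, URLWritable value) throws IOException {
        if (!inner.next(key, innerValue)) return false;
        value.set(innerValue.toString());
        return true;
    }
    public Text createKey() { return inner.createKey(); }
    public URLWritable createValue() { return new URLWritable(); }
    public long getPos() throws IOException { return inner.getPos(); }
    public float getProgress() throws IOException { return inner.getProgress(); }
    public void close() throws IOException { inner.close(); }
}
```

The six delegating methods (next, createKey, createValue, getPos, getProgress, close) are exactly the RecordReader interface; only next() does anything beyond forwarding.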
When writing data out to files, OutputFormat is used. Because each reducer only needs to write its output to its own file, the output does not need to be split.
The output files are placed in a common directory and usually named part-nnnnn, where nnnnn is the partition ID of the reducer. RecordWriter formats the output, just as RecordReader parses the input
NullOutputFormat is a trivial implementation of OutputFormat: it produces no output and does not need to extend FileOutputFormat. More significant are the OutputFormat (and InputFormat) implementations that deal with a database rather than files
Personalized output can be achieved by subclassing FileOutputFormat and providing a wrapped RecordWriter whose write() method is overridden; the same approach applies if you want to send output somewhere other than a file
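A minimal sketch of that pattern under the old mapred API (the class name and the comma-separated record layout are illustrative choices, not from the original program):

```java
import java.io.IOException;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.RecordWriter;
import org.apache.hadoop.mapred.Reporter;
import org.apache.hadoop.util.Progressable;

// Writes each record as "key,value" instead of the default tab separation.
public class CommaTextOutputFormat extends FileOutputFormat<Text, Text> {
    @Override
    public RecordWriter<Text, Text> getRecordWriter(FileSystem ignored,
            JobConf job, String name, Progressable progress) throws IOException {
        Path file = FileOutputFormat.getTaskOutputPath(job, name);
        FileSystem fs = file.getFileSystem(job);
        final FSDataOutputStream out = fs.create(file, progress);
        return new RecordWriter<Text, Text>() {
            public void write(Text key, Text value) throws IOException {
                out.writeBytes(key + "," + value + "\n");
            }
            public void close(Reporter reporter) throws IOException {
                out.close();
            }
        };
    }
}
```

All the personalization lives in write(); getRecordWriter() is the factory that OutputFormat requires.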
jar -xvf ./example.jar unpacks the jar package
When migrating local files to HDFS, make sure the address in the program is written correctly; do not point it at some other, unrelated machine
Complete the program in Eclipse, build it into a jar, put it in the Hadoop folder, and run it with the hadoop command to see the results
If you use the third-party plug-in FatJar, you can merge the MapReduce jar and the Jedis jar into a single jar for Hadoop, so there is no need to modify the manifest configuration information
Hadoop can be set up in three modes. The default is standalone (local) mode: it does not use HDFS and does not launch any daemons; it is mainly for development and debugging
Pseudo-distributed mode runs Hadoop on a "cluster" of a single node, with all daemons on one machine. It adds debugging capability, allowing you to examine memory usage, HDFS input and output, and the interaction between daemons
Fully distributed mode reflects the real deployment scenario, emphasizing distributed storage and distributed computation. The host names of the NameNode and JobTracker daemons are declared explicitly, and the HDFS replication parameter is increased to gain the benefits of distributed storage
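As an illustration of the configuration differences, a minimal pseudo-distributed setup might look like the following (property names are from the pre-2.x Hadoop configuration that matches the JobTracker terminology above; the host and port values are placeholders):

```xml
<!-- core-site.xml: point the default filesystem at a local HDFS -->
<configuration>
  <property>
    <name>fs.default.name</name>
    <value>hdfs://localhost:9000</value>
  </property>
</configuration>

<!-- mapred-site.xml: run the JobTracker locally -->
<configuration>
  <property>
    <name>mapred.job.tracker</name>
    <value>localhost:9001</value>
  </property>
</configuration>

<!-- hdfs-site.xml: a single node can only hold one replica -->
<configuration>
  <property>
    <name>dfs.replication</name>
    <value>1</value>
  </property>
</configuration>
```

In fully distributed mode, localhost would be replaced by the declared NameNode and JobTracker host names, and dfs.replication would be raised (commonly to 3).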