Hadoop learning: custom InputFormat/OutputFormat; the mapreduce.Mapper class; the three run modes


The way Hadoop splits and reads input files is defined in an implementation of the InputFormat interface. TextInputFormat is the default implementation; it is used when you want to fetch one line of content at a time as input data and there is no natural key. The key returned by TextInputFormat is the byte offset of each line within the file, which is rarely useful by itself.

Previously the Mapper used LongWritable (key) and Text (value): with TextInputFormat the key is a byte offset, so LongWritable is the right type. With KeyValueTextInputFormat, the text before the first separator becomes the key and the text after it becomes the value, both of type Text, so you have to modify the Mapper implementation and its map() method to accommodate the new key type.
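
A minimal sketch of such an adapted Mapper, using the old org.apache.hadoop.mapred API (the class name KeyValueMapper is chosen here only for illustration):

    import java.io.IOException;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapred.MapReduceBase;
    import org.apache.hadoop.mapred.Mapper;
    import org.apache.hadoop.mapred.OutputCollector;
    import org.apache.hadoop.mapred.Reporter;

    // With KeyValueTextInputFormat both key and value arrive as Text, so the
    // type parameters change from <LongWritable, Text, ...> to <Text, Text, ...>.
    public class KeyValueMapper extends MapReduceBase
            implements Mapper<Text, Text, Text, Text> {
        @Override
        public void map(Text key, Text value,
                        OutputCollector<Text, Text> output, Reporter reporter)
                throws IOException {
            // key   = text before the first tab on the line
            // value = text after it; here we simply pass both through
            output.collect(key, value);
        }
    }

In the driver, conf.setInputFormat(KeyValueTextInputFormat.class) selects this input format.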

A MapReduce job's input is not necessarily external data; it is often the output of another MapReduce job, and you can also customize the output format. The default output format is consistent with the data format that KeyValueTextInputFormat can read (each line of the record is a tab-delimited key and value), but Hadoop provides a more efficient, compressible binary file format called a sequence file. Sequence files are optimized for Hadoop processing and are the preferred format when chaining multiple MapReduce jobs. The class that reads sequence files is SequenceFileInputFormat; the key and value objects of a sequence file can be user-defined, and the output and input types must match.
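
A hedged sketch of chaining two jobs through a sequence file (old mapred API; the class name, paths, and key/value types are placeholders, and the mapper/reducer setup is omitted):

    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapred.FileInputFormat;
    import org.apache.hadoop.mapred.FileOutputFormat;
    import org.apache.hadoop.mapred.JobClient;
    import org.apache.hadoop.mapred.JobConf;
    import org.apache.hadoop.mapred.SequenceFileInputFormat;
    import org.apache.hadoop.mapred.SequenceFileOutputFormat;

    public class ChainedJobs {
        public static void main(String[] args) throws Exception {
            // Job 1 writes its output as a binary sequence file.
            // (Mapper/reducer setup omitted; job 1 must emit Text/IntWritable.)
            JobConf job1 = new JobConf(ChainedJobs.class);
            job1.setOutputFormat(SequenceFileOutputFormat.class);
            job1.setOutputKeyClass(Text.class);
            job1.setOutputValueClass(IntWritable.class);
            FileInputFormat.setInputPaths(job1, new Path(args[0]));
            FileOutputFormat.setOutputPath(job1, new Path("intermediate"));
            JobClient.runJob(job1);

            // Job 2 reads the sequence file back; its input key/value types
            // must match what job 1 wrote.
            JobConf job2 = new JobConf(ChainedJobs.class);
            job2.setInputFormat(SequenceFileInputFormat.class);
            FileInputFormat.setInputPaths(job2, new Path("intermediate"));
            FileOutputFormat.setOutputPath(job2, new Path(args[1]));
            JobClient.runJob(job2);
        }
    }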

To customize the InputFormat, implement two methods (their old-API signatures are sketched after this list):

getSplits() identifies all the files used as input data and splits them into input splits; each map task processes one split

getRecordReader() supplies a reader that loops over the records in a given split and parses each record into key and value of the predefined types
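
For reference, the sketch below mirrors the two methods of the old org.apache.hadoop.mapred.InputFormat interface that a custom implementation supplies:

    import java.io.IOException;
    import org.apache.hadoop.mapred.InputSplit;
    import org.apache.hadoop.mapred.JobConf;
    import org.apache.hadoop.mapred.RecordReader;
    import org.apache.hadoop.mapred.Reporter;

    public interface InputFormat<K, V> {
        // Carve the input files into splits; each map task gets one split.
        InputSplit[] getSplits(JobConf job, int numSplits) throws IOException;

        // Produce a RecordReader that parses one split into key/value records.
        RecordReader<K, V> getRecordReader(InputSplit split, JobConf job,
                                           Reporter reporter) throws IOException;
    }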

In practice, a split is usually sized as a block, and the default block size in HDFS is 64 MB

FileInputFormat has an isSplitable() method that checks whether you may split a given file. It returns true by default; sometimes you want a file to remain whole as its own split, in which case you override it to return false, as in the sketch below
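
A minimal sketch of that override (old mapred API; the class name NonSplittableTextInputFormat is chosen here):

    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.mapred.TextInputFormat;

    public class NonSplittableTextInputFormat extends TextInputFormat {
        @Override
        protected boolean isSplitable(FileSystem fs, Path file) {
            return false;   // each file becomes exactly one split / one map task
        }
    }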

LineRecordReader implements RecordReader as an encapsulation-based implementation; most of the work happens in next()

We build our InputFormat class by extending FileInputFormat and implementing a factory method that returns the RecordReader
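
A sketch of such a factory method, consistent with the TimeUrl example this article describes (old mapred API; URLWritable and TimeUrlRecordReader are sketched after the next paragraph):

    import java.io.IOException;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapred.FileInputFormat;
    import org.apache.hadoop.mapred.FileSplit;
    import org.apache.hadoop.mapred.InputSplit;
    import org.apache.hadoop.mapred.JobConf;
    import org.apache.hadoop.mapred.RecordReader;
    import org.apache.hadoop.mapred.Reporter;

    public class TimeUrlTextInputFormat extends FileInputFormat<Text, URLWritable> {
        @Override
        public RecordReader<Text, URLWritable> getRecordReader(
                InputSplit input, JobConf job, Reporter reporter) throws IOException {
            // Factory method: hand each split to our custom RecordReader.
            return new TimeUrlRecordReader(job, (FileSplit) input);
        }
    }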

In addition to building the class, TimeUrlRecordReader implements the six methods of RecordReader. It is essentially a wrapper around the line reader behind KeyValueTextInputFormat, but it converts the record's Text value into a URLWritable
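
A hedged sketch of those pieces; the exact bodies of URLWritable and TimeUrlRecordReader below are assumptions consistent with the description above (old mapred API):

    import java.io.DataInput;
    import java.io.DataOutput;
    import java.io.IOException;
    import java.net.URL;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.io.Writable;
    import org.apache.hadoop.mapred.FileSplit;
    import org.apache.hadoop.mapred.JobConf;
    import org.apache.hadoop.mapred.KeyValueLineRecordReader;
    import org.apache.hadoop.mapred.RecordReader;

    // A user-defined Writable holding a URL.
    class URLWritable implements Writable {
        protected URL url;
        public URLWritable() { }
        public void set(String s) throws IOException { url = new URL(s); }
        public void write(DataOutput out) throws IOException { out.writeUTF(url.toString()); }
        public void readFields(DataInput in) throws IOException { url = new URL(in.readUTF()); }
    }

    // Wraps KeyValueLineRecordReader; only next() does real conversion work.
    class TimeUrlRecordReader implements RecordReader<Text, URLWritable> {
        private final KeyValueLineRecordReader lineReader;
        private final Text lineKey, lineValue;

        public TimeUrlRecordReader(JobConf job, FileSplit split) throws IOException {
            lineReader = new KeyValueLineRecordReader(job, split);
            lineKey = lineReader.createKey();
            lineValue = lineReader.createValue();
        }

        public boolean next(Text key, URLWritable value) throws IOException {
            if (!lineReader.next(lineKey, lineValue)) return false; // end of split
            key.set(lineKey);
            value.set(lineValue.toString()); // convert the Text value to URLWritable
            return true;
        }
        public Text createKey() { return new Text(""); }
        public URLWritable createValue() { return new URLWritable(); }
        public long getPos() throws IOException { return lineReader.getPos(); }
        public void close() throws IOException { lineReader.close(); }
        public float getProgress() throws IOException { return lineReader.getProgress(); }
    }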

When outputting data to files, OutputFormat is used. Because each reducer only needs to write its output to its own file, the output does not require splitting.

The output files are placed in a common directory and are usually named part-nnnnn, where nnnnn is the partition ID of the reducer. RecordWriter formats the output, just as RecordReader parses the input

NullOutputFormat is a simple implementation of OutputFormat that produces no output and does not need to inherit from FileOutputFormat. More importantly, an OutputFormat (like an InputFormat) can deal with a database rather than a file
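
For example, selecting NullOutputFormat in a driver might look like this (old mapred API; the class name DiscardOutputExample is illustrative):

    import org.apache.hadoop.mapred.JobConf;
    import org.apache.hadoop.mapred.lib.NullOutputFormat;

    public class DiscardOutputExample {
        public static void main(String[] args) {
            JobConf job = new JobConf(DiscardOutputExample.class);
            // NullOutputFormat swallows all records; no output files are created.
            job.setOutputFormat(NullOutputFormat.class);
        }
    }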

For personalized output, you can extend FileOutputFormat and implement the write() method of an encapsulated RecordWriter class; this is also the route to take if you want to write output somewhere other than a plain file
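
A hedged sketch of such a personalized OutputFormat (old mapred API; CommaTextOutputFormat and its comma-separated record layout are invented here purely for illustration):

    import java.io.DataOutputStream;
    import java.io.IOException;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapred.FileOutputFormat;
    import org.apache.hadoop.mapred.JobConf;
    import org.apache.hadoop.mapred.RecordWriter;
    import org.apache.hadoop.mapred.Reporter;
    import org.apache.hadoop.util.Progressable;

    public class CommaTextOutputFormat extends FileOutputFormat<Text, Text> {
        @Override
        public RecordWriter<Text, Text> getRecordWriter(FileSystem ignored,
                JobConf job, String name, Progressable progress) throws IOException {
            Path file = FileOutputFormat.getTaskOutputPath(job, name);
            FileSystem fs = file.getFileSystem(job);
            final DataOutputStream out = fs.create(file, progress);
            return new RecordWriter<Text, Text>() {
                public void write(Text key, Text value) throws IOException {
                    // Emit "key,value\n" instead of the default tab separator.
                    out.writeBytes(key.toString() + "," + value.toString() + "\n");
                }
                public void close(Reporter reporter) throws IOException {
                    out.close();
                }
            };
        }
    }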


jar -xvf ./example.jar unpacks the JAR package

When migrating local files to HDFS, make sure the address in the program is written correctly; do not point it at some other, unrelated machine
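
A minimal sketch of copying a local file into HDFS from Java; the namenode address and both paths are placeholders that must match your own setup:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class CopyToHdfs {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            conf.set("fs.default.name", "hdfs://localhost:9000"); // your namenode
            FileSystem fs = FileSystem.get(conf);
            fs.copyFromLocalFile(new Path("/tmp/input.txt"),          // local source
                                 new Path("/user/hadoop/input.txt")); // HDFS target
        }
    }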


Complete the program in Eclipse, build it into a jar package, put it in the Hadoop folder, and run it with the hadoop command to see the results

If you use the third-party Fat Jar plug-in, you can combine the MapReduce jar and the Jedis jar into one jar for Hadoop, so you do not need to modify the manifest configuration information

Hadoop can be set up in three modes. The default is standalone mode: it uses neither HDFS nor any daemons and is mainly for development and debugging

Pseudo-distributed mode runs Hadoop on a "single-node cluster" where all daemons run on one machine. It adds code-debugging capability and allows checking memory usage, HDFS input and output, and interactions with the other daemons

Fully distributed mode is the real production situation. This mode emphasizes distributed storage and distributed computing, explicitly declares the host names of the NameNode and JobTracker daemons, and increases the HDFS replication parameter to gain the benefits of distributed storage. The sketch below contrasts the three configurations
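
A hedged sketch contrasting the three modes through their classic Hadoop 1.x configuration properties; in practice these values live in core-site.xml, hdfs-site.xml, and mapred-site.xml rather than in code, and the host names are placeholders:

    import org.apache.hadoop.conf.Configuration;

    public class ModeConfigs {
        public static void main(String[] args) {
            // Standalone (default): local filesystem, no daemons, jobs run in-process.
            Configuration standalone = new Configuration();
            standalone.set("fs.default.name", "file:///");
            standalone.set("mapred.job.tracker", "local");

            // Pseudo-distributed: all daemons on one machine.
            Configuration pseudo = new Configuration();
            pseudo.set("fs.default.name", "hdfs://localhost:9000");
            pseudo.set("mapred.job.tracker", "localhost:9001");
            pseudo.set("dfs.replication", "1");   // only one node to replicate to

            // Fully distributed: name the NameNode and JobTracker hosts explicitly
            // and raise the replication factor for real redundancy.
            Configuration full = new Configuration();
            full.set("fs.default.name", "hdfs://namenode-host:9000");
            full.set("mapred.job.tracker", "jobtracker-host:9001");
            full.set("dfs.replication", "3");
        }
    }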
