Interpretation: standard input/output format

Last Update: 2015-08-27 Source: Internet

Author: User

Input format class: InputFormat

InputFormat describes the input specification of an MR job. Its main functions are to check the input specification (for example, to validate the input file directory), to split the input data files into input splits, and to read the records out of each split and convert them into key/value pairs for map input.

The getSplits() method returns a List<InputSplit> collection that logically divides the input files into multiple input splits.
The createRecordReader() method returns a RecordReader object that parses an InputSplit into key/value pairs. During map task execution, the MR framework repeatedly calls the methods of the RecordReader object, iterating over the key/value pairs and handing each one to the map() function. This method is implemented in more specific subclasses, such as TextInputFormat.
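
The calling pattern described above can be sketched in plain Java. This is a minimal, self-contained sketch and not Hadoop code: SimpleRecordReader, ListRecordReader, and runMap are hypothetical stand-ins, with method names modelled loosely on Hadoop's RecordReader (nextKeyValue(), getCurrentKey(), getCurrentValue()).

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.function.BiConsumer;

// Hypothetical stand-in for a record reader; not the Hadoop class itself.
interface SimpleRecordReader<K, V> {
    boolean nextKeyValue();   // advance to the next record; false when the split is exhausted
    K getCurrentKey();
    V getCurrentValue();
}

// Toy reader over an in-memory "split": key = record index, value = record text.
class ListRecordReader implements SimpleRecordReader<Integer, String> {
    private final List<String> records;
    private int pos = -1;

    ListRecordReader(List<String> records) { this.records = records; }

    public boolean nextKeyValue() { return ++pos < records.size(); }
    public Integer getCurrentKey() { return pos; }
    public String getCurrentValue() { return records.get(pos); }
}

class MapTaskSketch {
    // The framework-style loop: pull pairs from the reader and hand each one to map().
    static <K, V> int runMap(SimpleRecordReader<K, V> reader, BiConsumer<K, V> map) {
        int count = 0;
        while (reader.nextKeyValue()) {
            map.accept(reader.getCurrentKey(), reader.getCurrentValue());
            count++;
        }
        return count;
    }

    public static void main(String[] args) {
        List<String> seen = new ArrayList<>();
        int n = runMap(new ListRecordReader(Arrays.asList("a", "b", "c")),
                       (k, v) -> seen.add(k + "=" + v));
        System.out.println(n + " records: " + seen); // prints "3 records: [0=a, 1=b, 2=c]"
    }
}
```

The map function never sees the split itself; it only receives the pairs that the reader produces, which is why a different reader can change the record format without any change to the map code.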

FileInputFormat class

Its main function is to provide a uniform getSplits() implementation for subclasses. The two most important algorithms in that implementation are:

1). File splitting algorithm. It determines the number of InputSplits and the data segment that corresponds to each InputSplit. FileInputFormat generates InputSplits for each file separately, and three attribute values determine how many InputSplits a file yields. See "Job flow: factors that determine the number of maps".
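
As a rough illustration of how three values can fix the number of splits, the sketch below assumes they are the block size and a configured minimum and maximum split size, which is how the newer mapreduce API's FileInputFormat computes its split size. The class name SplitSizeSketch and the countSplits helper are hypothetical, and the 10% slack on the last split is likewise an assumption taken from that implementation.

```java
class SplitSizeSketch {
    // Slack allowed on the final split: assumed to be 10%, as in Hadoop's FileInputFormat.
    static final double SPLIT_SLOP = 1.1;

    // Clamp the block size into the [minSize, maxSize] range.
    static long computeSplitSize(long blockSize, long minSize, long maxSize) {
        return Math.max(minSize, Math.min(maxSize, blockSize));
    }

    // Count the splits produced for one file of the given length.
    // Simplification: an empty file yields 0 splits in this sketch.
    static int countSplits(long fileLength, long splitSize) {
        int splits = 0;
        long bytesRemaining = fileLength;
        // Cut full-size splits while more than 1.1 split sizes remain.
        while ((double) bytesRemaining / splitSize > SPLIT_SLOP) {
            splits++;
            bytesRemaining -= splitSize;
        }
        // Whatever is left (up to 1.1 split sizes) becomes the last split.
        if (bytesRemaining != 0) {
            splits++;
        }
        return splits;
    }

    public static void main(String[] args) {
        long mb = 1024L * 1024L;
        long splitSize = computeSplitSize(128 * mb, 1L, Long.MAX_VALUE);
        System.out.println(countSplits(300 * mb, splitSize)); // prints 3
        System.out.println(countSplits(140 * mb, splitSize)); // prints 1 (within the 10% slack)
    }
}
```

With the default-like bounds used in main, the split size equals the block size, so the number of splits (and therefore maps) for a file is driven mainly by its length.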

2). Host selection algorithm. After the InputSplit splitting scheme is determined, the next step is to determine the metadata for each InputSplit. It usually consists of four parts, <file, start, length, hosts>, which represent the file the InputSplit belongs to, its starting position, its length in bytes, and the list of hosts (nodes) where it resides. The first three are easy to determine; the difficulty lies in choosing the host list.

Although the blocks that make up an InputSplit may be located on many nodes, for the sake of task scheduling efficiency not all of those nodes are added to the InputSplit's host list. Instead, only the few nodes that hold the largest amount of the InputSplit's data are selected (the default limit is 10, and any extra nodes are filtered out). This list is the main basis for deciding whether a task is data-local when it is scheduled. To this end, FileInputFormat uses a simple and effective heuristic:

Sort the racks by the amount of data each rack contains
Within each rack, sort the nodes by the amount of data each node contains
Take the hosts of the first N nodes as the InputSplit's host list, where N is the number of block replicas

Example: consider the network topology of a Hadoop cluster in which the number of HDFS block replicas is 3. An InputSplit contains 3 blocks of sizes 100, 150, and 75. It is easy to calculate that the 4 racks hold 175, 250, 150, and 75 of the InputSplit's data, respectively. Node3 and Node4 in Rack2 and Node1 in Rack1 will be added to the InputSplit's host list.

Host selection process: sorting the 4 racks by the amount of the InputSplit's data they hold gives Rack2 > Rack1 > Rack3 > Rack4. Node3 and Node4 in Rack2 are therefore selected first, followed by Node1 in Rack1 (which holds File1 and File3), and these three are placed in the InputSplit's host list.
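
The three-step heuristic and the example can be sketched as follows. This is a simplified illustration, not FileInputFormat's actual code. The rack totals (175, 250, 150, 75) come from the example, but the per-node amounts and the names node5 and node7 are assumptions chosen only to be consistent with those totals; HostSelectionSketch, selectHosts, and exampleTopology are hypothetical names.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

class HostSelectionSketch {
    // Total amount of the split's data held by one rack.
    static long rackTotal(Map<String, Long> nodes) {
        long sum = 0;
        for (long bytes : nodes.values()) {
            sum += bytes;
        }
        return sum;
    }

    // dataByRack: rack name -> (node name -> amount of the split's data on that node).
    static List<String> selectHosts(Map<String, Map<String, Long>> dataByRack, int replicas) {
        // Step 1: sort racks by the amount of data they contain, largest first.
        List<Map.Entry<String, Map<String, Long>>> racks = new ArrayList<>(dataByRack.entrySet());
        racks.sort((a, b) -> Long.compare(rackTotal(b.getValue()), rackTotal(a.getValue())));

        List<String> hosts = new ArrayList<>();
        for (Map.Entry<String, Map<String, Long>> rack : racks) {
            // Step 2: within the rack, sort nodes by the amount of data, largest first.
            List<Map.Entry<String, Long>> nodes = new ArrayList<>(rack.getValue().entrySet());
            nodes.sort((a, b) -> Long.compare(b.getValue(), a.getValue()));
            // Step 3: keep the first N nodes overall, where N is the replica count.
            for (Map.Entry<String, Long> node : nodes) {
                if (hosts.size() == replicas) {
                    return hosts;
                }
                hosts.add(node.getKey());
            }
        }
        return hosts;
    }

    // Rack totals match the example; the per-node breakdown is assumed.
    static Map<String, Map<String, Long>> exampleTopology() {
        Map<String, Long> rack1 = new LinkedHashMap<>();
        rack1.put("node1", 175L);
        Map<String, Long> rack2 = new LinkedHashMap<>();
        rack2.put("node3", 150L);
        rack2.put("node4", 100L);
        Map<String, Long> rack3 = new LinkedHashMap<>();
        rack3.put("node5", 150L);
        Map<String, Long> rack4 = new LinkedHashMap<>();
        rack4.put("node7", 75L);

        Map<String, Map<String, Long>> racks = new LinkedHashMap<>();
        racks.put("rack1", rack1);
        racks.put("rack2", rack2);
        racks.put("rack3", rack3);
        racks.put("rack4", rack4);
        return racks;
    }

    public static void main(String[] args) {
        System.out.println(selectHosts(exampleTopology(), 3)); // prints [node3, node4, node1]
    }
}
```

Note that node5 holds as much data as node3 in this assumed layout but is not selected, because its rack ranks below Rack1; the heuristic prefers racks first and nodes second.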

The host selection algorithm above shows that when the InputSplit size is larger than the block size, a map task cannot achieve full data locality; part of its data always has to be read from a remote node. The following conclusion can therefore be drawn:

When implementing an InputFormat based on FileInputFormat, try to make the InputSplit size the same as the block size in order to improve the data locality of map tasks.

TextInputFormat class

TextInputFormat inherits from the FileInputFormat class and so indirectly from the InputFormat class. The TextInputFormat class contains two methods:

The isSplitable() method determines whether a file can be cut into input splits.
The createRecordReader() method returns an object of the LineRecordReader class, which reads the contents of the file.

The LineRecordReader class has an initialize() method that initializes the reader for an InputSplit. The other methods in the class parse the split's contents into key/value pairs.
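
To make the parsing step concrete, here is a minimal sketch of turning raw bytes into key/value pairs. It assumes, as Hadoop's LineRecordReader does, that the key is the byte offset at which a line starts and the value is the line's text. LineParsingSketch and its Record type are hypothetical, and details such as "\r\n" handling and lines that straddle split boundaries are deliberately left out.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

class LineParsingSketch {
    // Hypothetical pair type: byte offset of the line start, plus the line text.
    static final class Record {
        final long offset;
        final String line;

        Record(long offset, String line) {
            this.offset = offset;
            this.line = line;
        }
    }

    // Split UTF-8 bytes on '\n' and emit one (offset, line) record per line.
    static List<Record> parse(byte[] data) {
        List<Record> records = new ArrayList<>();
        int start = 0;
        for (int i = 0; i < data.length; i++) {
            if (data[i] == '\n') {
                records.add(new Record(start,
                        new String(data, start, i - start, StandardCharsets.UTF_8)));
                start = i + 1;
            }
        }
        // A final line without a trailing newline is still a record.
        if (start < data.length) {
            records.add(new Record(start,
                    new String(data, start, data.length - start, StandardCharsets.UTF_8)));
        }
        return records;
    }

    public static void main(String[] args) {
        byte[] data = "hello\nworld\n".getBytes(StandardCharsets.UTF_8);
        for (Record r : parse(data)) {
            System.out.println(r.offset + "\t" + r.line); // prints "0	hello" then "6	world"
        }
    }
}
```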

Output format class: OutputFormat

OutputFormat is an abstract class that mainly describes the format of the output data and writes the user-supplied key/value pairs to files in a particular format.

The getRecordWriter() method returns a RecordWriter object. The write() method of that class receives a key/value pair and writes it to a file. During task execution, the MapReduce framework passes the results of the map() or reduce() function to the write() method.
The checkOutputSpecs() method is called automatically by JobClient to check whether the output directory is valid before the user's job is submitted to the RM.
The getOutputCommitter() method returns an OutputCommitter object. This class is responsible for committing the output of the MR job.
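
As an illustration of the kind of check checkOutputSpecs() performs, the sketch below rejects an output directory that is unset or that already exists. That specific rule is an assumption modelled on what FileOutputFormat does; the code uses the local file system through java.nio rather than HDFS, and OutputSpecCheckSketch is a hypothetical name.

```java
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

class OutputSpecCheckSketch {
    // Fail fast, before any task runs, if the output location is unusable.
    static void checkOutputSpecs(Path outputDir) {
        if (outputDir == null) {
            throw new IllegalStateException("Output directory not set");
        }
        if (Files.exists(outputDir)) {
            // Refusing to reuse a directory avoids overwriting earlier results.
            throw new IllegalStateException("Output directory " + outputDir + " already exists");
        }
    }

    public static void main(String[] args) {
        Path tmp = Paths.get(System.getProperty("java.io.tmpdir"));
        // Passes: this path does not exist yet.
        checkOutputSpecs(tmp.resolve("job-output-" + System.nanoTime()));
        try {
            // Fails: the temp directory itself already exists.
            checkOutputSpecs(tmp);
        } catch (IllegalStateException e) {
            System.out.println("rejected: " + e.getMessage());
        }
    }
}
```

Running the check at submission time means a misconfigured job is rejected immediately instead of failing after the map and reduce work has been done.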

TextOutputFormat

The FileOutputFormat class is also an abstract class. Its implementation class is TextOutputFormat, which contains a static inner class, LineRecordWriter, that writes each MR result as a single line.

Line 48 shows that text output uses UTF-8 encoding.
Line 52 shows that the line separator is "\n".
The constructor at line 71 shows that the default delimiter between output keys and values is the tab character "\t". The delimiter can be customized with conf.set("mapreduce.output.textoutputformat.separator", "******");
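
The three points above (UTF-8 encoding, "\n" as the line terminator, tab as the default delimiter) can be seen together in a small sketch. This is a simplified stand-in for a line writer, not the LineRecordWriter source: LineWriterSketch is a hypothetical class that writes to an in-memory buffer, and its separator argument plays the role of the configured mapreduce.output.textoutputformat.separator value. Null keys and values are not handled.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;

class LineWriterSketch {
    private static final byte[] NEWLINE = "\n".getBytes(StandardCharsets.UTF_8);

    private final ByteArrayOutputStream out;
    private final byte[] separator;

    // Custom delimiter between key and value.
    LineWriterSketch(ByteArrayOutputStream out, String separator) {
        this.out = out;
        this.separator = separator.getBytes(StandardCharsets.UTF_8);
    }

    // Default delimiter is a tab.
    LineWriterSketch(ByteArrayOutputStream out) {
        this(out, "\t");
    }

    // Write one result as a single line: key, delimiter, value, newline.
    void write(Object key, Object value) {
        writeBytes(key.toString().getBytes(StandardCharsets.UTF_8));
        writeBytes(separator);
        writeBytes(value.toString().getBytes(StandardCharsets.UTF_8));
        writeBytes(NEWLINE);
    }

    private void writeBytes(byte[] bytes) {
        out.write(bytes, 0, bytes.length);
    }

    public static void main(String[] args) {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        LineWriterSketch writer = new LineWriterSketch(buffer, "******");
        writer.write("hadoop", 3);
        // prints "hadoop******3" followed by a newline
        System.out.print(new String(buffer.toByteArray(), StandardCharsets.UTF_8));
    }
}
```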

This article is an English version of an article which is originally in the Chinese language on aliyun.com and is provided for information purposes only.
