Interpretation: standard input/output format

Source: Internet
Author: User

The InputFormat class

InputFormat describes the input specification of an MR job. Its main functions are: checking the input specification (for example, checking the input file directory), splitting the input data files into input splits, and reading the records out of each input split and converting them into the key/value pairs that are fed to map.

    • The getSplits() method returns a List<InputSplit> that logically divides the input files into multiple input splits.
    • The createRecordReader() method returns a RecordReader object that parses an InputSplit into key/value pairs. During map task execution, the MR framework repeatedly calls the methods of the RecordReader object, iterating over the key/value pairs and handing each one to the map() function. (This method is implemented in more specific subclasses, such as TextInputFormat.)

FileInputFormat class

Its main function is to provide a uniform getSplits() implementation for subclasses. The two most important algorithms in that implementation are:

1). File splitting algorithm. It determines the number of InputSplits and the data range that each InputSplit covers. FileInputFormat generates InputSplits file by file, and three attribute values determine how many InputSplits a file yields. See "Job Flow: Factors that determine the number of maps".
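The splitting step can be sketched in plain Java. This is a minimal sketch, not Hadoop's own code: it assumes the three values are the block size and the configured minimum and maximum split sizes (the text above does not name them), and it assumes the rule used by the new-API FileInputFormat, where the split size is max(minSize, min(maxSize, blockSize)) and the last split may be up to 10% larger than the rest. The class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

final class SplitSizeSketch {
    // Assumed slack factor: a tail of up to 1.1 x splitSize stays in one split.
    static final double SPLIT_SLOP = 1.1;

    /** Assumed rule: clamp the block size between minSize and maxSize. */
    static long computeSplitSize(long blockSize, long minSize, long maxSize) {
        return Math.max(minSize, Math.min(maxSize, blockSize));
    }

    /** Returns {start, length} pairs that cover a file of the given length. */
    static List<long[]> splitFile(long fileLength, long splitSize) {
        List<long[]> splits = new ArrayList<>();
        long remaining = fileLength;
        // Cut full-size splits while the remainder is clearly larger than one split.
        while ((double) remaining / splitSize > SPLIT_SLOP) {
            splits.add(new long[] {fileLength - remaining, splitSize});
            remaining -= splitSize;
        }
        // Whatever is left (possibly slightly more than splitSize) is the last split.
        if (remaining != 0) {
            splits.add(new long[] {fileLength - remaining, remaining});
        }
        return splits;
    }

    public static void main(String[] args) {
        long splitSize = computeSplitSize(128, 1, Long.MAX_VALUE);
        for (long[] s : splitFile(300, splitSize)) {
            System.out.println("start=" + s[0] + " length=" + s[1]);
        }
    }
}
```

With a block size of 128 and default-like minimum and maximum values, a file of length 300 yields three splits, while a file of length 140 stays in a single split because the remainder is within the 10% slack.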

2). Host selection algorithm. Once the InputSplit splitting scheme is determined, the next step is to determine the metadata for each InputSplit. This usually has four parts, <file, start, length, hosts>, which represent the file the InputSplit belongs to, its starting position, its length in bytes, and the list of hosts (nodes) where it resides. The first three are easy to determine; the difficulty lies in choosing the host list.

Although the blocks that make up an InputSplit may be located on many nodes, for the sake of task-scheduling efficiency not all of those nodes are usually added to the InputSplit's host list. Instead, the few nodes that hold the most of the InputSplit's data are selected (10 by default; any excess is filtered out), and this list is the main basis for deciding whether a task is data-local when it is scheduled. To this end, FileInputFormat uses a simple and effective heuristic:

    1. Sort the racks by the amount of data each rack contains
    2. Within each rack, sort the nodes by the amount of data each node contains
    3. Take the hosts of the first N nodes as the InputSplit's host list, where N is the number of block replicas

Example: consider the network topology of a Hadoop cluster in which the HDFS block replication factor is 3 and an InputSplit contains 3 blocks of sizes 100, 150, and 75. It is easy to calculate that the 4 racks hold 175, 250, 150, and 75 units of the InputSplit's data, respectively. Node3 and Node4 in Rack2 and Node1 in Rack1 are then added to the InputSplit's host list.

Host selection process: sort the 4 racks by the amount of the InputSplit's data they contain, which gives Rack2 > Rack1 > Rack3 > Rack4. Node3 and Node4 in Rack2 are therefore selected first, followed by Node1 in Rack1 (which holds File1 and File3), and these three nodes form the InputSplit's host list.
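The three-step heuristic can be sketched in plain Java as follows. This is an illustration under stated assumptions, not Hadoop's implementation: the class name, method names, and the block-to-node placement in sampleHosts() are hypothetical, and the placement was chosen only so that the per-rack totals match the 175, 250, 150, and 75 of the example. The sketch also assumes that a block counts once per rack even when two of its replicas sit in the same rack.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.HashSet;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;

final class HostSelectionSketch {
    static List<String> selectHosts(Map<String, Long> blockSizes,
                                    Map<String, List<String>> blockToNodes,
                                    Map<String, String> nodeToRack,
                                    int replicas) {
        Map<String, Long> rackBytes = new LinkedHashMap<>();
        Map<String, Long> nodeBytes = new LinkedHashMap<>();
        for (Map.Entry<String, List<String>> e : blockToNodes.entrySet()) {
            long size = blockSizes.get(e.getKey());
            Set<String> racksSeen = new HashSet<>();
            for (String node : e.getValue()) {
                nodeBytes.merge(node, size, Long::sum);
                String rack = nodeToRack.get(node);
                // Assumption: each block is counted once per rack.
                if (racksSeen.add(rack)) {
                    rackBytes.merge(rack, size, Long::sum);
                }
            }
        }
        // Step 1: racks in descending order of data held.
        List<String> racks = new ArrayList<>(rackBytes.keySet());
        racks.sort(Comparator.comparingLong((String r) -> rackBytes.get(r)).reversed());
        List<String> hosts = new ArrayList<>();
        for (String rack : racks) {
            // Step 2: nodes of this rack in descending order of data held.
            List<String> nodes = new ArrayList<>();
            for (String n : nodeBytes.keySet()) {
                if (rack.equals(nodeToRack.get(n))) {
                    nodes.add(n);
                }
            }
            nodes.sort(Comparator.comparingLong((String n) -> nodeBytes.get(n)).reversed());
            // Step 3: keep only the first N nodes overall.
            for (String n : nodes) {
                if (hosts.size() == replicas) {
                    return hosts;
                }
                hosts.add(n);
            }
        }
        return hosts;
    }

    /** Hypothetical placement that reproduces the rack totals of the example. */
    static List<String> sampleHosts() {
        Map<String, Long> sizes = new LinkedHashMap<>();
        sizes.put("b1", 100L);
        sizes.put("b2", 150L);
        sizes.put("b3", 75L);
        Map<String, List<String>> placement = new LinkedHashMap<>();
        placement.put("b1", Arrays.asList("node1", "node2", "node3"));
        placement.put("b2", Arrays.asList("node3", "node4", "node5"));
        placement.put("b3", Arrays.asList("node7", "node8", "node1"));
        Map<String, String> rackOf = new LinkedHashMap<>();
        rackOf.put("node1", "rack1");
        rackOf.put("node2", "rack1");
        rackOf.put("node3", "rack2");
        rackOf.put("node4", "rack2");
        rackOf.put("node5", "rack3");
        rackOf.put("node7", "rack4");
        rackOf.put("node8", "rack4");
        return selectHosts(sizes, placement, rackOf, 3);
    }

    public static void main(String[] args) {
        System.out.println(sampleHosts()); // prints [node3, node4, node1]
    }
}
```

Under this assumed placement the result is node3 and node4 from rack2, then node1 from rack1, which agrees with the host list given in the example.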

The host selection algorithm above shows that when the InputSplit size is larger than the block size, the map task cannot achieve full data locality: some of the data always has to be read from a remote node. This leads to the following conclusion:

When implementing an InputFormat based on FileInputFormat, to improve the data locality of map tasks, you should try to make the InputSplit size the same as the block size.

TextInputFormat class

TextInputFormat inherits from the FileInputFormat class and therefore indirectly from the InputFormat class. The TextInputFormat class contains two methods:

    • The isSplitable() method determines whether a file can be cut into input splits
    • The createRecordReader() method is used to read the contents of a file; it returns an object of the LineRecordReader class

The LineRecordReader class has an initialize() method that sets the reader up for a given InputSplit. The other methods in the class parse the split's contents and output them as key/value pairs.
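A minimal plain-Java sketch of this kind of line parsing is shown below. It is not Hadoop's LineRecordReader: the class and method names are hypothetical, the key is assumed to be the byte offset of the line and the value the line text, and the boundary rule (skip the first partial line unless the split starts at offset 0, and finish the line that straddles the end of the split) reflects my understanding of the usual behavior rather than anything stated in the text above.

```java
import java.nio.charset.StandardCharsets;
import java.util.AbstractMap.SimpleEntry;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

final class LineReaderSketch {
    /** Returns the index just past the '\n' that ends the line at pos (or data.length). */
    static int endOfLine(byte[] data, int pos) {
        while (pos < data.length && data[pos] != '\n') {
            pos++;
        }
        return Math.min(pos + 1, data.length);
    }

    /** Parses the split [start, start + length) into (byte offset, line) pairs. */
    static List<Map.Entry<Long, String>> readSplit(byte[] data, int start, int length) {
        int end = start + length;
        int pos = start;
        if (start != 0) {
            // The line covering 'start' belongs to the previous split: skip it.
            pos = endOfLine(data, start);
        }
        List<Map.Entry<Long, String>> records = new ArrayList<>();
        // A line is read by this split if it begins at or before 'end'.
        while (pos <= end && pos < data.length) {
            int next = endOfLine(data, pos);
            int textEnd = (data[next - 1] == '\n') ? next - 1 : next;
            String line = new String(data, pos, textEnd - pos, StandardCharsets.UTF_8);
            records.add(new SimpleEntry<>((long) pos, line));
            pos = next;
        }
        return records;
    }

    public static void main(String[] args) {
        byte[] data = "aa\nbbb\ncc\n".getBytes(StandardCharsets.UTF_8);
        System.out.println(readSplit(data, 0, 5));
        System.out.println(readSplit(data, 5, 5));
    }
}
```

For the 10-byte sample, the first split (offset 0, length 5) produces the lines at offsets 0 and 3, and the second split (offset 5, length 5) produces only the line at offset 7, so every line is read exactly once even though the split boundary falls in the middle of a line.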

The OutputFormat class

OutputFormat is an abstract class that is used primarily to describe the format of the output data and to write the user-supplied key/value pairs to a file in a particular format.

    • The getRecordWriter() method returns a RecordWriter object. The write() method of that class receives a key/value pair and writes it to the file. During task execution, the MapReduce framework passes the results of the map() or reduce() function to the write() method.
    • The checkOutputSpecs() method is called automatically by JobClient, before the user's job is submitted to the RM, to check whether the output directory is valid.
    • The getOutputCommitter() method returns an object of the OutputCommitter class. This class is responsible for committing the output of the MR job.

TextOutputFormat

The FileOutputFormat class is also an abstract class. Its implementation class is TextOutputFormat, which contains a static inner class, LineRecordWriter, that is responsible for writing the MR results one line at a time.

    • Line 48 of the source shows that text output uses UTF-8 encoding
    • Line 52 shows that the line separator is "\n"
    • The constructor at line 71 shows that the default delimiter between the output key and value is the tab character "\t". The output delimiter can be customized in code with conf.set("mapreduce.output.textoutputformat.separator", "******");
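The three points above can be illustrated with a small plain-Java writer. This is a simplified sketch, not the actual LineRecordWriter source: the class name and the render() helper are hypothetical, and only the behavior described above is modeled (UTF-8 encoding, "\n" as the line separator, and a tab as the default key/value delimiter that can be replaced).

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.OutputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;

final class LineWriterSketch {
    private static final byte[] NEWLINE = "\n".getBytes(StandardCharsets.UTF_8);

    private final OutputStream out;
    private final byte[] separator;

    /** Uses the tab character as the default key/value delimiter. */
    LineWriterSketch(OutputStream out) {
        this(out, "\t");
    }

    LineWriterSketch(OutputStream out, String separator) {
        this.out = out;
        this.separator = separator.getBytes(StandardCharsets.UTF_8);
    }

    /** Writes one record as: key, delimiter, value, newline (all UTF-8). */
    void write(Object key, Object value) throws IOException {
        out.write(key.toString().getBytes(StandardCharsets.UTF_8));
        out.write(separator);
        out.write(value.toString().getBytes(StandardCharsets.UTF_8));
        out.write(NEWLINE);
    }

    /** Hypothetical helper: renders key/value pairs to a String for inspection. */
    static String render(String separator, Object[][] pairs) {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        LineWriterSketch writer = new LineWriterSketch(buffer, separator);
        try {
            for (Object[] pair : pairs) {
                writer.write(pair[0], pair[1]);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return new String(buffer.toByteArray(), StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        Object[][] pairs = {{"hello", 1}, {"world", 2}};
        System.out.print(render("\t", pairs));
        System.out.print(render("******", pairs));
    }
}
```

In a real job the delimiter would come from the configuration property mentioned above rather than from a constructor argument; passing it in directly keeps the sketch free of Hadoop dependencies.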
