Design and implementation of OutputFormat interface in Hadoop


OutputFormat describes the format of a job's output data: it writes the key/value pairs produced by the user's program to files in a particular format. This article describes how Hadoop designs the OutputFormat interface and surveys some common OutputFormat implementations.

1. OutputFormat in the old API

In the old API, OutputFormat is an interface that contains two methods:

RecordWriter<K, V> getRecordWriter(FileSystem ignored, JobConf job,
                                   String name, Progressable progress)
    throws IOException;

void checkOutputSpecs(FileSystem ignored, JobConf job) throws IOException;

The checkOutputSpecs method is typically called automatically by JobClient before a job is submitted to JobTracker, to check that the output directory is valid.

The getRecordWriter method returns a RecordWriter object, whose write method receives a key/value pair and writes it to a file. During task execution, the MapReduce framework passes the results of the map() or reduce() function to the write method; the main (simplified) code path is as follows. Suppose the user writes the map() function like this:

public void map(Text key, Text value,
                OutputCollector<Text, Text> output,
                Reporter reporter) throws IOException {
    // generate new output <newKey, newValue> from the current key/value, and emit it
    ...
    output.collect(newKey, newValue);
}

Inside output.collect(newKey, newValue), the framework essentially does the following:

RecordWriter<K, V> out = job.getOutputFormat().getRecordWriter(...);
out.write(newKey, newValue);
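To illustrate the getRecordWriter/write contract, here is a self-contained sketch using plain java.io rather than Hadoop's actual classes (the class name TabSeparatedWriter is hypothetical): a writer that emits key<TAB>value lines, the same line format TextOutputFormat produces.

```java
import java.io.IOException;
import java.io.StringWriter;
import java.io.Writer;

// Self-contained sketch of the RecordWriter contract: write() receives
// one key/value pair and appends it to the output in "key<TAB>value\n"
// form, the line format used by TextOutputFormat.
class TabSeparatedWriter<K, V> {
    private final Writer out;

    TabSeparatedWriter(Writer out) {
        this.out = out;
    }

    // Called once per map()/reduce() output pair.
    void write(K key, V value) throws IOException {
        out.write(key.toString());
        out.write('\t');
        out.write(value.toString());
        out.write('\n');
    }

    void close() throws IOException {
        out.close();
    }
}
```

In real Hadoop code the framework, not the user, drives these calls: it obtains the writer from the job's OutputFormat and invokes write for every pair collected by map() or reduce().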

Hadoop ships with many OutputFormat implementations, which correspond to the InputFormat implementations. The base class of all file-based OutputFormat implementations is FileOutputFormat, from which implementations for text files, binary files, and multiple outputs are derived.

To analyze how OutputFormat is implemented, we pick the representative FileOutputFormat class. As with the discussion of InputFormat, we first introduce the base class FileOutputFormat and then its derived class TextOutputFormat. The base class FileOutputFormat provides functionality common to all file-based OutputFormat implementations, which boils down to two main responsibilities:
(1) Implementing the checkOutputSpecs interface
This method is called before the job runs. By default it checks whether the user-configured output directory already exists and, if it does, throws an exception so that earlier data is not overwritten.
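The core of this check can be sketched with plain java.nio.file (Hadoop's real implementation checks the configured output path through its FileSystem abstraction; the class and method names below are illustrative, not Hadoop's):

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;

// Sketch of the checkOutputSpecs logic: refuse to run if the output
// directory already exists, so an earlier job's results are never
// silently overwritten. FileOutputFormat performs the equivalent
// check against the configured ${mapred.out.dir}.
class OutputSpecCheck {
    static void checkOutputDir(Path outputDir) throws IOException {
        if (outputDir == null) {
            throw new IOException("Output directory not set");
        }
        if (Files.exists(outputDir)) {
            throw new IOException(
                "Output directory " + outputDir + " already exists");
        }
    }
}
```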
(2) Handling side-effect files
A side-effect file is not a task's final output file but a special task-specific file. Its typical use is speculative execution: in Hadoop, because of hardware aging, network faults, and so on, some tasks of a job may run significantly slower than the others, slowing down the whole job. To work around such "slow tasks", Hadoop starts an identical task on another node, called a speculative task, and the result of whichever task finishes first becomes the result for that slice of data. To prevent write conflicts when the two tasks write to the same output file at the same time, FileOutputFormat creates a side-effect file for each task and writes the task's data into it temporarily; only when the task finishes is the file moved to the final output directory. The operations on these files, such as create, delete, and move, are performed by OutputCommitter. OutputCommitter is an interface; Hadoop provides the default implementation FileOutputCommitter, and users can also write their own OutputCommitter implementation and specify it via the parameter ${mapred.output.committer.class}. The OutputCommitter interface methods and the corresponding FileOutputCommitter implementation are shown in the table below.

Table: OutputCommitter interface methods and the corresponding FileOutputCommitter implementation

Method          | When called                                      | FileOutputCommitter implementation
setupJob        | job initialization                               | create the temp directory ${mapred.out.dir}/_temporary
commitJob       | job completed successfully                       | delete the temp directory and create an empty file _SUCCESS under ${mapred.out.dir}
abortJob        | job failed                                       | delete the temp directory
setupTask       | task initialization                              | no action (the side-effect file used to be created here, but it is now created on demand)
needsTaskCommit | decide whether a task's results need committing  | return true if the side-effect file exists
commitTask      | task completed successfully                      | commit the result: move the side-effect file to the ${mapred.out.dir} directory
abortTask       | task failed                                      | delete the task's side-effect file

Note: by default, when a job completes successfully, an empty file named _SUCCESS is generated under the final output directory ${mapred.out.dir}. This file serves as a job-completion marker for higher-level applications; for example, Oozie determines whether a job has finished by checking for this file in the output directory.
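The temp-directory and side-effect-file mechanism in the table can be mimicked with plain file operations. The sketch below is not FileOutputCommitter itself (the class name CommitterSketch and its method bodies are illustrative); it only mirrors the lifecycle from the table, with directory names following the ${mapred.out.dir}/_temporary convention:

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;

// Sketch of the FileOutputCommitter lifecycle: tasks write into a
// _temporary side-effect area, and only a successful commit moves the
// file into the final output directory and drops the _SUCCESS marker.
class CommitterSketch {
    private final Path outDir;   // plays the role of ${mapred.out.dir}
    private final Path tempDir;  // ${mapred.out.dir}/_temporary

    CommitterSketch(Path outDir) {
        this.outDir = outDir;
        this.tempDir = outDir.resolve("_temporary");
    }

    void setupJob() throws IOException {           // create the temp directory
        Files.createDirectories(tempDir);
    }

    Path taskFile(String taskId) {                 // the task's side-effect file
        return tempDir.resolve(taskId);
    }

    void commitTask(String taskId) throws IOException {  // move to the final dir
        Files.move(taskFile(taskId), outDir.resolve(taskId),
                   StandardCopyOption.REPLACE_EXISTING);
    }

    void abortTask(String taskId) throws IOException {   // drop the side-effect file
        Files.deleteIfExists(taskFile(taskId));
    }

    void commitJob() throws IOException {          // remove temp dir, mark success
        Files.deleteIfExists(tempDir);
        Files.createFile(outDir.resolve("_SUCCESS"));
    }
}
```

Because an aborted task only deletes its own side-effect file, a losing speculative task can be discarded without disturbing the output already committed by the winning task.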

2. OutputFormat in the new API

Besides changing from an interface to an abstract class, OutputFormat in the new API adds a new method, getOutputCommitter, which allows users to supply their own OutputCommitter implementation.
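The shape of the new-API class can be sketched as follows. This is a simplified stand-in, not org.apache.hadoop.mapreduce.OutputFormat itself: Hadoop's JobContext and TaskAttemptContext are replaced by a plain Context marker interface, and the nested Writer and Committer types stand in for RecordWriter and OutputCommitter, so the skeleton compiles without Hadoop on the classpath.

```java
import java.io.IOException;

// Simplified stand-in for the new-API OutputFormat shape: an abstract
// class with three methods, the third (getOutputCommitter) being the
// addition relative to the old API.
abstract class NewApiOutputFormat<K, V> {
    interface Context {}  // stand-in for JobContext / TaskAttemptContext

    // Returns the writer that receives each output key/value pair.
    public abstract Writer<K, V> getRecordWriter(Context taskContext)
            throws IOException;

    // Validates the output specification before the job runs.
    public abstract void checkOutputSpecs(Context jobContext)
            throws IOException;

    // New in the new API: supplies the OutputCommitter for this output,
    // letting the user plug in a custom commit strategy.
    public abstract Committer getOutputCommitter(Context taskContext)
            throws IOException;

    interface Writer<K, V> {
        void write(K key, V value) throws IOException;
        void close() throws IOException;
    }

    interface Committer {
        void commitJob() throws IOException;
    }
}
```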

Resources

Hadoop Technology Insider: In-Depth Analysis of the Design and Implementation of the MapReduce Architecture

