OutputFormat describes the format of a MapReduce job's output data and is responsible for writing user-supplied key/value pairs to files in that format. This article describes how Hadoop designs the OutputFormat interface, along with some common OutputFormat implementations.
1. OutputFormat in the Old API
In the old API, OutputFormat is an interface that contains two methods:
RecordWriter<K, V> getRecordWriter(FileSystem ignored, JobConf job,
                                   String name, Progressable progress) throws IOException;

void checkOutputSpecs(FileSystem ignored, JobConf job) throws IOException;
The checkOutputSpecs method is typically called automatically by JobClient, before a user job is submitted to JobTracker, to check that the output directory is valid.
The getRecordWriter method returns a RecordWriter object, whose write method receives a key/value pair and writes it to a file. During task execution, the MapReduce framework passes the results of the map() or reduce() function to this write method; the main (simplified) code path is shown below. Suppose the user writes a map() function like this:
public void map(Text key, Text value,
                OutputCollector<Text, Text> output, Reporter reporter)
    throws IOException {
  // generate a new pair <newKey, newValue> from the current key/value, and emit it
  ...
  output.collect(newKey, newValue);
}
Internally, output.collect(newKey, newValue) executes code equivalent to the following:
RecordWriter<K, V> out = job.getOutputFormat().getRecordWriter(...);
out.write(newKey, newValue);
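To make the write path concrete, here is a hedged sketch in plain Java (no Hadoop dependency) of what a text-format RecordWriter does: each key/value pair handed to write() is serialized as "key<TAB>value<NEWLINE>", which is the behavior TextOutputFormat provides. The class name is illustrative, not Hadoop's actual code.

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;

// Illustrative sketch of a line-oriented RecordWriter: one record per line,
// key and value separated by a tab, as in Hadoop's TextOutputFormat.
class LineRecordWriterSketch<K, V> {
    private final DataOutputStream out;

    LineRecordWriterSketch(DataOutputStream out) {
        this.out = out;
    }

    // Called by the framework for every key/value pair emitted by map()/reduce()
    void write(K key, V value) throws IOException {
        out.writeBytes(key.toString());
        out.writeBytes("\t");
        out.writeBytes(value.toString());
        out.writeBytes("\n");
    }

    void close() throws IOException {
        out.close();
    }
}
```

A single write("hello", "world") call would thus append the line "hello\tworld" to the output stream.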
Hadoop ships with many OutputFormat implementations, which correspond to the InputFormat implementations. All file-based OutputFormat implementations derive from the base class FileOutputFormat, which in turn has derived implementations for text file formats, binary file formats, and multiple outputs.
To analyze how OutputFormat is implemented, we pick the representative FileOutputFormat class. Following the same approach used for InputFormat, we first introduce the base class FileOutputFormat and then its derived class TextOutputFormat. The base class FileOutputFormat provides common functionality for all file-based OutputFormat implementations, which boils down to two main responsibilities:
(1) Implementing the checkOutputSpecs interface
This method is called before the job runs. Its default behavior is to check whether the user-configured output directory already exists; if it does, an exception is thrown to prevent the data from a previous run from being overwritten.
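The check described above can be sketched in plain Java. This is an illustration of the logic only, using java.nio.file instead of Hadoop's FileSystem API; the class and method names are hypothetical, not Hadoop's actual code.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;

// Sketch of checkOutputSpecs-style validation: fail fast if the output
// directory is unset or already exists, so earlier results are never
// silently overwritten.
class OutputSpecChecker {
    static void checkOutputSpecs(Path outputDir) throws IOException {
        if (outputDir == null) {
            throw new IOException("Output directory not set in job configuration");
        }
        if (Files.exists(outputDir)) {
            // Refuse to run rather than clobber a previous job's output
            throw new IOException("Output directory " + outputDir + " already exists");
        }
    }
}
```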
(2) Handling side-effect files
A side-effect file is not a task's final output file, but a temporary, task-specific file. Its typical application is speculative execution. In Hadoop, because of aging hardware, network failures, and so on, some tasks in a job may run significantly slower than others, slowing down the whole job. To work around such "slow tasks", Hadoop launches an identical task on another node; this duplicate is called a speculative task, and whichever copy finishes first provides the result for that slice of data. To prevent write conflicts when the two tasks write to the same output file simultaneously, FileOutputFormat gives each task a side-effect file and writes the task's data there temporarily; only when the task finishes is the file moved to the final output directory. The operations on these files, such as creation, deletion, and moving, are handled by OutputCommitter. OutputCommitter is an interface; Hadoop provides the default implementation FileOutputCommitter, and users can write their own OutputCommitter implementation and specify it via the parameter mapred.output.committer.class. The OutputCommitter interface definition and the corresponding FileOutputCommitter implementation are shown in the table below.
Table: OutputCommitter interface definition and the corresponding FileOutputCommitter implementation

| Method | When called | FileOutputCommitter implementation |
|---|---|---|
| setupJob | Job initialization | Creates the temp directory ${mapred.out.dir}/_temporary |
| commitJob | Job completes successfully | Deletes the temp directory and creates an empty file _SUCCESS under ${mapred.out.dir} |
| abortJob | Job fails | Deletes the temp directory |
| setupTask | Task initialization | No action. The side-effect file would originally be created in the temp directory here, but it is instead created on demand |
| needsTaskCommit | To decide whether the task's result needs to be committed | Returns true if the side-effect file exists |
| commitTask | Task completes successfully | Commits the result by moving the side-effect file to the ${mapred.out.dir} directory |
| abortTask | Task fails | Deletes the task's side-effect file |
Note: By default, when a job completes successfully, an empty file named _SUCCESS is generated under the final output directory ${mapred.out.dir}. This file mainly lets higher-level applications detect that the job has finished; for example, Oozie determines whether a job is complete by checking for this file in the output directory.
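The task-level commit flow from the table can be sketched end to end in plain Java, with java.nio.file standing in for HDFS. Everything here is illustrative (the class name and directory layout mirror FileOutputCommitter's described behavior, not its actual code): the task writes to a side-effect file under a _temporary directory, needsTaskCommit checks for that file, commitTask moves it into the final output directory, and abortTask discards it.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;

// Sketch of the FileOutputCommitter task-commit protocol described above.
class TaskCommitSketch {
    final Path outputDir;  // stands in for ${mapred.out.dir}
    final Path tempDir;    // stands in for ${mapred.out.dir}/_temporary

    TaskCommitSketch(Path outputDir) {
        this.outputDir = outputDir;
        this.tempDir = outputDir.resolve("_temporary");
    }

    // setupTask: nothing to do -- the side-effect file is created on demand
    Path sideEffectFile(String taskId) {
        return tempDir.resolve(taskId);
    }

    // needsTaskCommit: commit only if the task actually produced output
    boolean needsTaskCommit(String taskId) {
        return Files.exists(sideEffectFile(taskId));
    }

    // commitTask: move the side-effect file into the final output directory
    void commitTask(String taskId) throws IOException {
        Files.createDirectories(outputDir);
        Files.move(sideEffectFile(taskId), outputDir.resolve(taskId),
                   StandardCopyOption.REPLACE_EXISTING);
    }

    // abortTask: discard the side-effect file of a failed or killed attempt
    void abortTask(String taskId) throws IOException {
        Files.deleteIfExists(sideEffectFile(taskId));
    }
}
```

Because a speculative duplicate writes to its own side-effect file, the losing attempt is simply aborted and its temp file deleted; only one attempt's file is ever moved into the output directory.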
2. OutputFormat in the New API
In the new API, besides OutputFormat changing from an interface to an abstract class, it adds a new method, getOutputCommitter, which allows users to plug in their own OutputCommitter implementation.
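The overall shape of the new-API abstract class can be sketched as follows. The stub types at the top (RecordWriter, OutputCommitter, JobContext, TaskAttemptContext) stand in for the real org.apache.hadoop.mapreduce classes so the sketch compiles on its own; the three method signatures mirror the new API, with getOutputCommitter being the addition discussed above.

```java
import java.io.IOException;

// Minimal stand-ins for the org.apache.hadoop.mapreduce types (illustrative only).
abstract class RecordWriter<K, V> {
    abstract void write(K key, V value) throws IOException;
}
abstract class OutputCommitter { }
class JobContext { }
class TaskAttemptContext extends JobContext { }

// Sketch of the new-API OutputFormat: an abstract class with three methods,
// the third of which (getOutputCommitter) did not exist in the old interface.
abstract class OutputFormatSketch<K, V> {
    public abstract RecordWriter<K, V> getRecordWriter(TaskAttemptContext context)
        throws IOException, InterruptedException;
    public abstract void checkOutputSpecs(JobContext context)
        throws IOException, InterruptedException;
    public abstract OutputCommitter getOutputCommitter(TaskAttemptContext context)
        throws IOException, InterruptedException;
}
```

A custom output format in the new API therefore overrides getOutputCommitter to return its own OutputCommitter, instead of configuring one through a job parameter.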
Resources
- Hadoop Technology Insider: In-Depth Analysis of MapReduce Architecture Design and Implementation
- Design and implementation of the OutputFormat interface in Hadoop