OutputFormat describes the format of a MapReduce job's output data and is responsible for writing user-supplied key/value pairs to files in that format. This article describes how Hadoop designs the OutputFormat interface, along with some common OutputFormat implementations.
1. OutputFormat in the Old API
In the old API, OutputFormat is an interface that contains two methods:
RecordWriter<K, V> getRecordWriter(FileSystem ignored, JobConf job,
                                   String name, Progressable progress) throws IOException;

void checkOutputSpecs(FileSystem ignored, JobConf job) throws IOException;
The checkOutputSpecs method is typically called automatically by JobClient, before a user job is submitted to JobTracker, to check that the output directory is valid.
The getRecordWriter method returns a RecordWriter object, whose write method receives a key/value pair and writes it to a file. During task execution, the MapReduce framework passes the results of the map() or reduce() function to this write method; the main (simplified) code path is shown below. Suppose the user writes a map() function like this:
public void map(Text key, Text value,
                OutputCollector<Text, Text> output, Reporter reporter)
    throws IOException {
  // generate a new pair <newKey, newValue> from the current key/value, and emit it
  ...
  output.collect(newKey, newValue);
}
Internally, output.collect(newKey, newValue) executes code equivalent to the following:
RecordWriter<K, V> out = job.getOutputFormat().getRecordWriter(...);
out.write(newKey, newValue);
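To make the write path concrete, here is a hedged sketch in plain Java (no Hadoop dependency) of what a text-format RecordWriter does: each key/value pair handed to write() is serialized as "key<TAB>value<NEWLINE>", which is the behavior TextOutputFormat provides. The class name is illustrative, not Hadoop's actual code.

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;

// Illustrative sketch of a line-oriented RecordWriter: one record per line,
// key and value separated by a tab, as in Hadoop's TextOutputFormat.
class LineRecordWriterSketch<K, V> {
    private final DataOutputStream out;

    LineRecordWriterSketch(DataOutputStream out) {
        this.out = out;
    }

    // Called by the framework for every key/value pair emitted by map()/reduce()
    void write(K key, V value) throws IOException {
        out.writeBytes(key.toString());
        out.writeBytes("\t");
        out.writeBytes(value.toString());
        out.writeBytes("\n");
    }

    void close() throws IOException {
        out.close();
    }
}
```

A single write("hello", "world") call would thus append the line "hello\tworld" to the output stream.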
Hadoop ships with many OutputFormat implementations, which correspond to the InputFormat implementations. All file-based OutputFormat implementations derive from the base class FileOutputFormat, which in turn has derived implementations for text file formats, binary file formats, and multiple outputs.
To analyze how OutputFormat is implemented, we pick the representative FileOutputFormat class. Following the same approach used for InputFormat, we first introduce the base class FileOutputFormat and then its derived class TextOutputFormat. The base class FileOutputFormat provides common functionality for all file-based OutputFormat implementations, which boils down to two main responsibilities:
(1) Implementing the checkOutputSpecs interface
This method is called before the job runs. Its default behavior is to check whether the user-configured output directory already exists; if it does, an exception is thrown to prevent the data from a previous run from being overwritten.
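The check described above can be sketched in plain Java. This is an illustration of the logic only, using java.nio.file instead of Hadoop's FileSystem API; the class and method names are hypothetical, not Hadoop's actual code.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;

// Sketch of checkOutputSpecs-style validation: fail fast if the output
// directory is unset or already exists, so earlier results are never
// silently overwritten.
class OutputSpecChecker {
    static void checkOutputSpecs(Path outputDir) throws IOException {
        if (outputDir == null) {
            throw new IOException("Output directory not set in job configuration");
        }
        if (Files.exists(outputDir)) {
            // Refuse to run rather than clobber a previous job's output
            throw new IOException("Output directory " + outputDir + " already exists");
        }
    }
}
```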
(2) Handling side-effect files
A side-effect file is not a task's final output file, but a temporary, task-specific file. Its typical application is speculative execution. In Hadoop, because of aging hardware, network failures, and so on, some tasks in a job may run significantly slower than others, slowing down the whole job. To work around such "slow tasks", Hadoop launches an identical task on another node; this duplicate is called a speculative task, and whichever copy finishes first provides the result for that slice of data. To prevent write conflicts when the two tasks write to the same output file simultaneously, FileOutputFormat gives each task a side-effect file and writes the task's data there temporarily; only when the task finishes is the file moved to the final output directory. The operations on these files, such as creation, deletion, and moving, are handled by OutputCommitter. OutputCommitter is an interface; Hadoop provides the default implementation FileOutputCommitter, and users can write their own OutputCommitter implementation and specify it via the parameter mapred.output.committer.class. The OutputCommitter interface definition and the corresponding FileOutputCommitter implementation are shown in the table below.
Table: OutputCommitter interface definition and the corresponding FileOutputCommitter implementation

| Method | When called | FileOutputCommitter implementation |
|---|---|---|
| setupJob | Job initialization | Creates the temp directory ${mapred.out.dir}/_temporary |
| commitJob | Job completes successfully | Deletes the temp directory and creates an empty file _SUCCESS under ${mapred.out.dir} |
| abortJob | Job fails | Deletes the temp directory |
| setupTask | Task initialization | No action. The side-effect file would originally be created in the temp directory here, but it is instead created on demand |
| needsTaskCommit | To decide whether the task's result needs to be committed | Returns true if the side-effect file exists |
| commitTask | Task completes successfully | Commits the result by moving the side-effect file to the ${mapred.out.dir} directory |
| abortTask | Task fails | Deletes the task's side-effect file |
Note: By default, when a job completes successfully, an empty file named _SUCCESS is generated under the final output directory ${mapred.out.dir}. This file mainly lets higher-level applications detect that the job has finished; for example, Oozie determines whether a job is complete by checking for this file in the output directory.
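The task-level commit flow from the table can be sketched end to end in plain Java, with java.nio.file standing in for HDFS. Everything here is illustrative (the class name and directory layout mirror FileOutputCommitter's described behavior, not its actual code): the task writes to a side-effect file under a _temporary directory, needsTaskCommit checks for that file, commitTask moves it into the final output directory, and abortTask discards it.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;

// Sketch of the FileOutputCommitter task-commit protocol described above.
class TaskCommitSketch {
    final Path outputDir;  // stands in for ${mapred.out.dir}
    final Path tempDir;    // stands in for ${mapred.out.dir}/_temporary

    TaskCommitSketch(Path outputDir) {
        this.outputDir = outputDir;
        this.tempDir = outputDir.resolve("_temporary");
    }

    // setupTask: nothing to do -- the side-effect file is created on demand
    Path sideEffectFile(String taskId) {
        return tempDir.resolve(taskId);
    }

    // needsTaskCommit: commit only if the task actually produced output
    boolean needsTaskCommit(String taskId) {
        return Files.exists(sideEffectFile(taskId));
    }

    // commitTask: move the side-effect file into the final output directory
    void commitTask(String taskId) throws IOException {
        Files.createDirectories(outputDir);
        Files.move(sideEffectFile(taskId), outputDir.resolve(taskId),
                   StandardCopyOption.REPLACE_EXISTING);
    }

    // abortTask: discard the side-effect file of a failed or killed attempt
    void abortTask(String taskId) throws IOException {
        Files.deleteIfExists(sideEffectFile(taskId));
    }
}
```

Because a speculative duplicate writes to its own side-effect file, the losing attempt is simply aborted and its temp file deleted; only one attempt's file is ever moved into the output directory.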
2. OutputFormat in the New API
In the new API, besides OutputFormat changing from an interface to an abstract class, it adds a new method, getOutputCommitter, which allows users to plug in their own OutputCommitter implementation.
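The overall shape of the new-API abstract class can be sketched as follows. The stub types at the top (RecordWriter, OutputCommitter, JobContext, TaskAttemptContext) stand in for the real org.apache.hadoop.mapreduce classes so the sketch compiles on its own; the three method signatures mirror the new API, with getOutputCommitter being the addition discussed above.

```java
import java.io.IOException;

// Minimal stand-ins for the org.apache.hadoop.mapreduce types (illustrative only).
abstract class RecordWriter<K, V> {
    abstract void write(K key, V value) throws IOException;
}
abstract class OutputCommitter { }
class JobContext { }
class TaskAttemptContext extends JobContext { }

// Sketch of the new-API OutputFormat: an abstract class with three methods,
// the third of which (getOutputCommitter) did not exist in the old interface.
abstract class OutputFormatSketch<K, V> {
    public abstract RecordWriter<K, V> getRecordWriter(TaskAttemptContext context)
        throws IOException, InterruptedException;
    public abstract void checkOutputSpecs(JobContext context)
        throws IOException, InterruptedException;
    public abstract OutputCommitter getOutputCommitter(TaskAttemptContext context)
        throws IOException, InterruptedException;
}
```

A custom output format in the new API therefore overrides getOutputCommitter to return its own OutputCommitter, instead of configuring one through a job parameter.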
Resources
- Hadoop Technology Insider: In-Depth Analysis of MapReduce Architecture Design and Implementation
- Design and implementation of the OutputFormat interface in Hadoop