When a map task or reduce task runs, its output may need to be formatted to meet our needs. Hadoop provides OutputFormat for this conversion: org.apache.hadoop.mapreduce.lib.output.OutputFormat<K, V>.
// In the job, use the setOutputFormatClass method to set the output format. SortedOutputFormat.class is the formatting class we are going to write (sketched later in this article).
job.setOutputFormatClass(SortedOutputFormat.class);
The formatting class we define must inherit from the OutputFormat class; see the class diagram for details.
OutputFormat is an abstract class. Its main abstract methods are as follows:
1. checkOutputSpecs: validates the output specification of the job, for example checking that the output path is set and does not already exist.
2. getOutputCommitter: returns an OutputCommitter object, which is mainly responsible for:
1) setting up the job during initialization (configuration information and the temporary output directory);
2) cleaning up after the job completes;
3) setting up the temporary output for a task;
4) checking whether a task needs a commit;
5) committing the task output files;
6) aborting (discarding) a task's file commit.
3. getRecordWriter: the RecordWriter returned by this method determines how data is written to the output file. This is the method we implement when writing an output formatting extension class; a sketch of the abstract class is shown below.
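For reference, here is a minimal sketch of how org.apache.hadoop.mapreduce.OutputFormat declares these three methods (signatures written from memory of the Hadoop API; check them against your Hadoop version):

public abstract class OutputFormat<K, V> {

    // Returns the RecordWriter that actually writes each key/value pair.
    public abstract RecordWriter<K, V> getRecordWriter(TaskAttemptContext context)
            throws IOException, InterruptedException;

    // Validates the output specification, e.g. that the output path is usable.
    public abstract void checkOutputSpecs(JobContext context)
            throws IOException, InterruptedException;

    // Returns the OutputCommitter responsible for job/task setup and commit.
    public abstract OutputCommitter getOutputCommitter(TaskAttemptContext context)
            throws IOException, InterruptedException;
}

In practice we usually extend FileOutputFormat, which already implements checkOutputSpecs and getOutputCommitter, so only getRecordWriter needs to be overridden.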
Next, let's take a look at the org.apache.hadoop.mapred.RecordWriter<K, V> interface:
close(Reporter): closes the writer.
write(K key, V value): defines how a key/value pair is written.
In fact, the focus is on building a class that implements this interface. In the new API there is a class with the same name, org.apache.hadoop.mapreduce.RecordWriter<K, V>, which can be used directly. When writing an extension class, we only need to extend this class and override the relevant methods.
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UnsupportedEncodingException;

import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.RecordWriter;
import org.apache.hadoop.mapreduce.TaskAttemptContext;

/** Excerpt from LineRecordWriter in TextOutputFormat. */
public class LineRecordWriter<K, V> extends RecordWriter<K, V> {

    private static final String utf8 = "UTF-8";
    private static final byte[] newline;
    static {
        try {
            newline = "\n".getBytes(utf8);
        } catch (UnsupportedEncodingException uee) {
            throw new IllegalArgumentException("can't find " + utf8 + " encoding");
        }
    }

    protected DataOutputStream out;
    private final byte[] keyValueSeparator;

    public LineRecordWriter(DataOutputStream out, String keyValueSeparator) {
        this.out = out;
        try {
            this.keyValueSeparator = keyValueSeparator.getBytes(utf8);
        } catch (UnsupportedEncodingException uee) {
            throw new IllegalArgumentException("can't find " + utf8 + " encoding");
        }
    }

    public LineRecordWriter(DataOutputStream out) {
        this(out, "\t");
    }

    // Text objects are written from their backing byte array; anything else is
    // written via toString() encoded as UTF-8.
    private void writeObject(Object o) throws IOException {
        if (o instanceof Text) {
            Text to = (Text) o;
            out.write(to.getBytes(), 0, to.getLength());
        } else {
            out.write(o.toString().getBytes(utf8));
        }
    }

    @Override
    public synchronized void write(K key, V value) throws IOException {
        boolean nullKey = key == null || key instanceof NullWritable;
        boolean nullValue = value == null || value instanceof NullWritable;
        if (nullKey && nullValue) {
            return;
        }
        if (!nullKey) {
            writeObject(key);
        }
        if (!(nullKey || nullValue)) {
            out.write(keyValueSeparator);
        }
        if (!nullValue) {
            writeObject(value);
        }
        out.write(newline);
    }

    // Extra overload for writing a bare Integer (null values are skipped).
    public synchronized void write(Integer num) throws IOException {
        if (num != null) {
            writeObject(num);
        }
    }

    @Override
    public synchronized void close(TaskAttemptContext context) throws IOException {
        out.close();
    }
}
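With the LineRecordWriter in place, the SortedOutputFormat mentioned at the beginning can stay quite small. The following is a minimal sketch, assuming it extends FileOutputFormat and simply reuses the LineRecordWriter above; the separator property name and the omission of compression handling are assumptions for illustration, not the original implementation:

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.RecordWriter;
import org.apache.hadoop.mapreduce.TaskAttemptContext;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class SortedOutputFormat<K, V> extends FileOutputFormat<K, V> {

    @Override
    public RecordWriter<K, V> getRecordWriter(TaskAttemptContext job)
            throws IOException, InterruptedException {
        Configuration conf = job.getConfiguration();
        // Key/value separator; this property name mirrors the one TextOutputFormat
        // reads in older Hadoop releases (an assumption here).
        String separator = conf.get("mapred.textoutputformat.separator", "\t");
        // Let FileOutputFormat pick the task output file name (e.g. part-r-00000).
        Path file = getDefaultWorkFile(job, "");
        FileSystem fs = file.getFileSystem(conf);
        FSDataOutputStream out = fs.create(file, false);
        return new LineRecordWriter<K, V>(out, separator);
    }
}

The driver then registers it exactly as shown earlier: job.setOutputFormatClass(SortedOutputFormat.class). Because FileOutputFormat already supplies checkOutputSpecs and a FileOutputCommitter, overriding getRecordWriter is all that is required.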
Several input formats provided by Hadoop (a configuration example follows the list):
• KeyValueTextInputFormat: each line is split at the first tab character; the key is everything before the tab and the value is the remainder. If there is nothing after the tab, the value is empty (this one is used frequently).
• TextInputFormat: the key is the byte offset of the line within the file and the value is the line's content.
• NLineInputFormat: similar to KeyValueTextInputFormat, but the splits are based on N lines of input rather than Y bytes of input.
• MultiFileInputFormat: an abstract class that lets the user implement an input format that aggregates multiple files into one split.
• SequenceFileInputFormat: the input file is a Hadoop sequence file containing serialized key/value pairs.
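To tie the pieces together, here is a sketch of a driver that selects one of these input formats together with the custom output format. It assumes the Hadoop 2.x new-API classes (Job.getInstance, org.apache.hadoop.mapreduce.lib.input.KeyValueTextInputFormat); FormatDemo, the command-line paths, and the omitted mapper/reducer settings are placeholders:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.KeyValueTextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class FormatDemo {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "format-demo");
        job.setJarByClass(FormatDemo.class);

        // Read each line as key<TAB>value and write results through our format.
        job.setInputFormatClass(KeyValueTextInputFormat.class);
        job.setOutputFormatClass(SortedOutputFormat.class);

        // Placeholder paths; set the mapper/reducer and key/value types as needed.
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));

        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}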