MapReduce output formatting


When a map task or reduce task runs, its output may need to be formatted to meet our needs.

Hadoop provides the OutputFormat class for this conversion: org.apache.hadoop.mapreduce.OutputFormat<K, V>.

// In the job, use the setOutputFormatClass method to set the output format. SortedOutputFormat.class is the format class we will write ourselves.

job.setOutputFormatClass(SortedOutputFormat.class);

The format class we define must extend the OutputFormat class. For details, see the class diagram.

OutputFormat is an abstract class. Its main abstract methods are as follows:

1. checkOutputSpecs: checks the output specification, for example whether the output path already exists.

2. getOutputCommitter: obtains an OutputCommitter object, which is mainly responsible for:

1. Setting up configuration information and the temporary output directory during job initialization.
2. Cleaning up after the job completes.
3. Setting up the temporary task output files.
4. Checking whether a task needs to commit its output.
5. Committing the task output files.
6. Aborting (discarding) a task's file commit.

3. getRecordWriter: the RecordWriter returned by this method determines how data is written to the output file.

When writing an output format extension class, this is the method we implement.

Next, let's look at the org.apache.hadoop.mapred.RecordWriter<K, V> interface:

close(Reporter reporter): closes the writer.
write(K key, V value): writes a key/value pair.

In fact, the key task is to build a class that implements this interface.

The new API has an abstract class of the same name, org.apache.hadoop.mapreduce.RecordWriter<K, V>, which can be used directly.

When writing an extension class, we only need to extend this class and override the relevant methods.

/** Excerpt from LineRecordWriter in TextOutputFormat. */
public class LineRecordWriter<K, V> extends RecordWriter<K, V> {

  private static final String utf8 = "UTF-8";

  private static final byte[] newline;
  static {
    try {
      newline = "\n".getBytes(utf8);
    } catch (UnsupportedEncodingException uee) {
      throw new IllegalArgumentException("can't find " + utf8 + " encoding");
    }
  }

  protected DataOutputStream out;

  private final byte[] keyValueSeparator;

  public LineRecordWriter(DataOutputStream out, String keyValueSeparator) {
    this.out = out;
    try {
      this.keyValueSeparator = keyValueSeparator.getBytes(utf8);
    } catch (UnsupportedEncodingException uee) {
      throw new IllegalArgumentException("can't find " + utf8 + " encoding");
    }
  }

  public LineRecordWriter(DataOutputStream out) {
    this(out, "\t");
  }

  // Text objects are written from their backing byte array; anything else
  // is converted with toString() and encoded as UTF-8.
  private void writeObject(Object o) throws IOException {
    if (o instanceof Text) {
      Text to = (Text) o;
      out.write(to.getBytes(), 0, to.getLength());
    } else {
      out.write(o.toString().getBytes(utf8));
    }
  }

  @Override
  public synchronized void write(K key, V value) throws IOException {
    boolean nullKey = key == null || key instanceof NullWritable;
    boolean nullValue = value == null || value instanceof NullWritable;
    if (nullKey && nullValue) {
      return;
    }
    if (!nullKey) {
      writeObject(key);
    }
    if (!(nullKey || nullValue)) {
      out.write(keyValueSeparator);
    }
    if (!nullValue) {
      writeObject(value);
    }
    out.write(newline);
  }

  public synchronized void write(Integer num) throws IOException {
    if (num != null) {
      writeObject(num);
    }
  }

  @Override
  public synchronized void close(TaskAttemptContext context) throws IOException {
    out.close();
  }
}
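To see concretely what write() emits, here is a stand-alone sketch using only java.io, with no Hadoop dependency. The class name LineWriterSketch and the helper writeLine are my own illustrative names; the method mimics LineRecordWriter's layout of key, tab separator, value, and newline, skipping the key or value when it is null:

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;

public class LineWriterSketch {

    // Mimics LineRecordWriter.write(): key, separator, value, newline.
    // A null key or value is skipped, as in the Hadoop class; the
    // separator is written only when both key and value are present.
    static void writeLine(DataOutputStream out, String key, String value)
            throws IOException {
        boolean nullKey = key == null;
        boolean nullValue = value == null;
        if (nullKey && nullValue) {
            return; // nothing to emit for this record
        }
        if (!nullKey) {
            out.write(key.getBytes(StandardCharsets.UTF_8));
        }
        if (!(nullKey || nullValue)) {
            out.write("\t".getBytes(StandardCharsets.UTF_8));
        }
        if (!nullValue) {
            out.write(value.getBytes(StandardCharsets.UTF_8));
        }
        out.write("\n".getBytes(StandardCharsets.UTF_8));
    }

    public static void main(String[] args) throws IOException {
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        DataOutputStream out = new DataOutputStream(buf);
        writeLine(out, "apple", "3");      // key TAB value NEWLINE
        writeLine(out, "banana", null);    // value skipped, no separator
        out.close();
        System.out.print(buf.toString("UTF-8"));
    }
}
```

Running this writes `apple\t3\n` followed by `banana\n` into the buffer, which matches the lines TextOutputFormat produces for normal and value-less records.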

Several input formats provided by Hadoop

• KeyValueTextInputFormat: the key is everything before the first tab character, and the value is the rest of the line; if the line has no tab, the value is empty (this format is often used).
• TextInputFormat: the key is the byte offset of the line within the file, and the value is the line's content.
• NLineInputFormat: similar to TextInputFormat, but the splits are based on N lines of input rather than Y bytes of input.
• MultiFileInputFormat: an abstract class that lets the user implement an input format that aggregates multiple files into one split.
• SequenceFileInputFormat: the input is a Hadoop sequence file containing serialized key/value pairs.
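As a rough illustration of the first bullet, here is a plain-Java sketch (not the Hadoop implementation; the class name KeyValueSplitSketch is my own) of how KeyValueTextInputFormat splits a line at the first tab:

```java
public class KeyValueSplitSketch {

    // Splits a line the way KeyValueTextInputFormat does by default:
    // key = text before the first tab, value = text after it.
    // A line without a tab yields the whole line as key and an empty value.
    static String[] split(String line) {
        int pos = line.indexOf('\t');
        if (pos < 0) {
            return new String[] { line, "" };
        }
        return new String[] { line.substring(0, pos), line.substring(pos + 1) };
    }

    public static void main(String[] args) {
        String[] kv = split("apple\t3 red");
        System.out.println(kv[0] + " -> " + kv[1]);
        kv = split("no-tab-here");
        System.out.println(kv[0] + " -> [" + kv[1] + "]");
    }
}
```

Note that only the first tab matters: any further tabs stay inside the value, which is why this format suits simple "key TAB rest-of-line" records.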
