MapReduce output formatting


When a map task or reduce task runs, its output may need to be formatted to meet our needs.

Hadoop provides the OutputFormat class for this conversion: org.apache.hadoop.mapreduce.OutputFormat<K, V>.

// In the job, use the setOutputFormatClass method to set the output format. SortedOutputFormat.class is the format class we will write ourselves.

job.setOutputFormatClass(SortedOutputFormat.class);

The format class we define must extend the OutputFormat class. For details, see the class diagram.

OutputFormat is an abstract class. Its main abstract methods are as follows:

1. checkOutputSpecs: checks the output specification, for example whether the output path already exists.

2. getOutputCommitter: obtains an OutputCommitter object, which is mainly responsible for:

1. Setting up configuration information and the temporary output directory during job initialization.
2. Cleaning up after the job completes.
3. Setting up the temporary task output files.
4. Checking whether a task needs to commit its output.
5. Committing the task output files.
6. Aborting (discarding) a task's file commit.

3. getRecordWriter: the RecordWriter returned by this method determines how data is written to the output file.

When writing an output format extension class, this is the method we implement.

Next, let's look at the org.apache.hadoop.mapred.RecordWriter<K, V> interface:

close(Reporter reporter): closes the writer.
write(K key, V value): writes a key/value pair.

In fact, the key task is to build a class that implements this interface.

The new API has an abstract class of the same name, org.apache.hadoop.mapreduce.RecordWriter<K, V>, which can be used directly.

When writing an extension class, we only need to extend this class and override the relevant methods.

/** Excerpt from LineRecordWriter in TextOutputFormat. */
public class LineRecordWriter<K, V> extends RecordWriter<K, V> {

  private static final String utf8 = "UTF-8";

  private static final byte[] newline;
  static {
    try {
      newline = "\n".getBytes(utf8);
    } catch (UnsupportedEncodingException uee) {
      throw new IllegalArgumentException("can't find " + utf8 + " encoding");
    }
  }

  protected DataOutputStream out;

  private final byte[] keyValueSeparator;

  public LineRecordWriter(DataOutputStream out, String keyValueSeparator) {
    this.out = out;
    try {
      this.keyValueSeparator = keyValueSeparator.getBytes(utf8);
    } catch (UnsupportedEncodingException uee) {
      throw new IllegalArgumentException("can't find " + utf8 + " encoding");
    }
  }

  public LineRecordWriter(DataOutputStream out) {
    this(out, "\t");
  }

  // Text objects are written from their backing byte array; anything else
  // is converted with toString() and encoded as UTF-8.
  private void writeObject(Object o) throws IOException {
    if (o instanceof Text) {
      Text to = (Text) o;
      out.write(to.getBytes(), 0, to.getLength());
    } else {
      out.write(o.toString().getBytes(utf8));
    }
  }

  @Override
  public synchronized void write(K key, V value) throws IOException {
    boolean nullKey = key == null || key instanceof NullWritable;
    boolean nullValue = value == null || value instanceof NullWritable;
    if (nullKey && nullValue) {
      return;
    }
    if (!nullKey) {
      writeObject(key);
    }
    if (!(nullKey || nullValue)) {
      out.write(keyValueSeparator);
    }
    if (!nullValue) {
      writeObject(value);
    }
    out.write(newline);
  }

  public synchronized void write(Integer num) throws IOException {
    if (num != null) {
      writeObject(num);
    }
  }

  @Override
  public synchronized void close(TaskAttemptContext context) throws IOException {
    out.close();
  }
}
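To see concretely what write() emits, here is a stand-alone sketch using only java.io, with no Hadoop dependency. The class name LineWriterSketch and the helper writeLine are my own illustrative names; the method mimics LineRecordWriter's layout of key, tab separator, value, and newline, skipping the key or value when it is null:

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;

public class LineWriterSketch {

    // Mimics LineRecordWriter.write(): key, separator, value, newline.
    // A null key or value is skipped, as in the Hadoop class; the
    // separator is written only when both key and value are present.
    static void writeLine(DataOutputStream out, String key, String value)
            throws IOException {
        boolean nullKey = key == null;
        boolean nullValue = value == null;
        if (nullKey && nullValue) {
            return; // nothing to emit for this record
        }
        if (!nullKey) {
            out.write(key.getBytes(StandardCharsets.UTF_8));
        }
        if (!(nullKey || nullValue)) {
            out.write("\t".getBytes(StandardCharsets.UTF_8));
        }
        if (!nullValue) {
            out.write(value.getBytes(StandardCharsets.UTF_8));
        }
        out.write("\n".getBytes(StandardCharsets.UTF_8));
    }

    public static void main(String[] args) throws IOException {
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        DataOutputStream out = new DataOutputStream(buf);
        writeLine(out, "apple", "3");      // key TAB value NEWLINE
        writeLine(out, "banana", null);    // value skipped, no separator
        out.close();
        System.out.print(buf.toString("UTF-8"));
    }
}
```

Running this writes `apple\t3\n` followed by `banana\n` into the buffer, which matches the lines TextOutputFormat produces for normal and value-less records.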

Several input formats provided by Hadoop

• KeyValueTextInputFormat: the key is everything before the first tab character, and the value is the rest of the line; if the line has no tab, the value is empty (this format is often used).
• TextInputFormat: the key is the byte offset of the line within the file, and the value is the line's content.
• NLineInputFormat: similar to TextInputFormat, but the splits are based on N lines of input rather than Y bytes of input.
• MultiFileInputFormat: an abstract class that lets the user implement an input format that aggregates multiple files into one split.
• SequenceFileInputFormat: the input is a Hadoop sequence file containing serialized key/value pairs.
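As a rough illustration of the first bullet, here is a plain-Java sketch (not the Hadoop implementation; the class name KeyValueSplitSketch is my own) of how KeyValueTextInputFormat splits a line at the first tab:

```java
public class KeyValueSplitSketch {

    // Splits a line the way KeyValueTextInputFormat does by default:
    // key = text before the first tab, value = text after it.
    // A line without a tab yields the whole line as key and an empty value.
    static String[] split(String line) {
        int pos = line.indexOf('\t');
        if (pos < 0) {
            return new String[] { line, "" };
        }
        return new String[] { line.substring(0, pos), line.substring(pos + 1) };
    }

    public static void main(String[] args) {
        String[] kv = split("apple\t3 red");
        System.out.println(kv[0] + " -> " + kv[1]);
        kv = split("no-tab-here");
        System.out.println(kv[0] + " -> [" + kv[1] + "]");
    }
}
```

Note that only the first tab matters: any further tabs stay inside the value, which is why this format suits simple "key TAB rest-of-line" records.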
