The role of each class invoked by a MapReduce program

Transferred from: http://www.cnblogs.com/z1987/p/5052409.html

1. Map class

The Map class extends the library class Mapper, whose prototype is Mapper&lt;KEYIN, VALUEIN, KEYOUT, VALUEOUT&gt;. The map method is usually overridden in this class; map receives one key-value pair at a time, pre-processes it, and then emits the processed data. The default map method is:

        protected void map(KEYIN key, VALUEIN value, Context context)
                throws IOException, InterruptedException {
            context.write((KEYOUT) key, (VALUEOUT) value);
        }
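As a concrete sketch of what an overridden map typically does, here is a word-count-style map in plain Java. It uses a simple list of entries as a stand-in for Hadoop's Context and Writable types, so the class and method names are illustrative, not Hadoop's API:

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.ArrayList;
import java.util.List;
import java.util.Map.Entry;

public class WordCountMapSketch {
    // Stand-in for Mapper.map: takes (byte offset, line), emits (word, 1) pairs.
    static List<Entry<String, Integer>> map(long key, String value) {
        List<Entry<String, Integer>> out = new ArrayList<>();
        for (String word : value.trim().split("\\s+")) {
            if (!word.isEmpty()) {
                out.add(new SimpleEntry<>(word, 1)); // analogous to context.write(word, one)
            }
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(map(0L, "to be or not to be"));
    }
}
```

In real Hadoop code the pairs would be written through context.write rather than returned, but the per-record transformation is the same.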

2. Reducer class

The Reducer class extends the library class Reducer, whose prototype is Reducer&lt;KEYIN, VALUEIN, KEYOUT, VALUEOUT&gt;. Apart from the reduce method, it mirrors the Mapper in structure and purpose. The reduce method is:

        protected void reduce(Text key, Iterable&lt;IntWritable&gt; values, Context context)
                throws IOException, InterruptedException {
            for (IntWritable value : values) {
                context.write(key, value);
            }
        }
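Continuing the word-count sketch, a typical overridden reduce sums all the values that arrived for one key. Again this is plain Java with illustrative names, not Hadoop's Writable-based API:

```java
import java.util.List;

public class WordCountReduceSketch {
    // Stand-in for Reducer.reduce: takes (word, list of counts), returns the sum.
    static int reduce(String key, List<Integer> values) {
        int sum = 0;
        for (int v : values) {
            sum += v; // accumulate every count emitted for this key
        }
        return sum;   // analogous to context.write(key, sum)
    }

    public static void main(String[] args) {
        System.out.println("be -> " + reduce("be", java.util.Arrays.asList(1, 1)));
    }
}
```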

3. MapReduce driver

In the simplest case, the code in the main function typically includes:

        Configuration conf = new Configuration();
        // Get the input and output file paths
        String[] otherArgs = new GenericOptionsParser(conf, args).getRemainingArgs();
        if (otherArgs.length != 2) {
            System.err.println("Usage: WordCount <in> <out>");
            System.exit(2);
        }
        Job job = new Job(conf, "Dedup");
        job.setJarByClass(Dedup.class);              // main class
        job.setMapperClass(Map.class);               // map class
        job.setCombinerClass(Reduce.class);          // combiner class
        job.setReducerClass(Reduce.class);           // reduce class
        job.setOutputKeyClass(Text.class);           // key class of the job output
        job.setOutputValueClass(Text.class);         // value class of the job output
        FileInputFormat.addInputPath(job, new Path(otherArgs[0]));   // file input
        FileOutputFormat.setOutputPath(job, new Path(otherArgs[1])); // file output
        System.exit(job.waitForCompletion(true) ? 0 : 1);

In fact, MapReduce also has a minimal working driver, which contains only:

        Job job = new Job(conf, "Dedup");
        job.setJarByClass(Dedup.class);
        FileInputFormat.addInputPath(job, new Path(otherArgs[0]));
        FileOutputFormat.setOutputPath(job, new Path(otherArgs[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);

4. InputFormat interface

InputFormat has a hierarchy of implementing classes, of which TextInputFormat is the default. It is used when the input data has no explicit key-value structure: the key it returns is the byte offset of each line within the file, and the value is the content of the line.
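To make the offset semantics concrete, here is a small plain-Java sketch (no Hadoop dependency; the class and method names are illustrative) that computes the (byte offset, line) pairs TextInputFormat would hand to the mapper:

```java
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;

public class TextInputFormatSketch {
    // Mimics TextInputFormat's record semantics:
    // key = byte offset of the line in the file, value = the line's text.
    static Map<Long, String> toRecords(String data) {
        Map<Long, String> records = new LinkedHashMap<>();
        long offset = 0;
        for (String line : data.split("\n")) {
            records.put(offset, line);
            // advance past the line's bytes plus the '\n' terminator
            offset += line.getBytes(StandardCharsets.UTF_8).length + 1;
        }
        return records;
    }

    public static void main(String[] args) {
        System.out.println(toRecords("hello world\nfoo bar"));
    }
}
```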

5. InputSplit class

By default, FileInputFormat and its subclasses split files into chunks of 64 MB (the same as the default HDFS block size). Processing files in chunks lets multiple map tasks work on one file in parallel, which greatly improves performance for large files. The input to each map task is one such input split, an InputSplit.
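The split arithmetic is simple; a sketch of it in plain Java (the class and method names are illustrative, and 64 MB is taken as the split size per the text above):

```java
public class SplitCountSketch {
    static final long SPLIT_SIZE = 64L * 1024 * 1024; // 64 MB split size

    // Number of input splits for a file of the given byte length.
    // The last split may be smaller than SPLIT_SIZE.
    static long countSplits(long fileLength) {
        if (fileLength == 0) {
            return 0;
        }
        return (fileLength + SPLIT_SIZE - 1) / SPLIT_SIZE; // ceiling division
    }

    public static void main(String[] args) {
        // A 200 MB file yields 4 splits: 64 + 64 + 64 + 8 MB.
        System.out.println(countSplits(200L * 1024 * 1024));
    }
}
```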

InputSplit's subclasses include FileSplit and CombineFileSplit. Both record the file path, the split's start offset, the split's size, and the list of hosts storing the split's data. CombineFileSplit, however, is designed for small files: it packs many small files into a single InputSplit, so that large numbers of small files can be handled efficiently.

To prevent a file from being split, there are two approaches: the first is to set the minimum split size larger than the file size; the second is to subclass FileInputFormat and override the isSplitable method to return false.

6. RecordReader class

InputSplit defines how the work is sliced, while the RecordReader class defines how to load the data and convert it into key-value pairs suitable for the map method to read. The default input format, TextInputFormat, supplies its own RecordReader (LineRecordReader).

7. OutputFormat class

Similar to InputFormat, most OutputFormat classes inherit from FileOutputFormat; exceptions include NullOutputFormat and DBOutputFormat. The default format is TextOutputFormat. OutputFormat provides an implementation of RecordWriter, which specifies how the data is serialized. The RecordWriter class handles the job's key-value pairs and writes the results to the location prepared by the OutputFormat. RecordWriter is implemented mainly through two methods, write and close: write takes a key-value pair from the MapReduce job and writes its bytes to disk, while close closes Hadoop's data stream to the output file.


8. RecordWriter class

LineRecordWriter is the RecordWriter used by default (by TextOutputFormat). Each record it writes consists of: the bytes of the key, a tab character as the delimiter, the bytes of the value, and a newline.
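A minimal sketch of that record layout in plain Java (the class and method names are illustrative; the real LineRecordWriter writes raw bytes to an output stream rather than building a String):

```java
public class LineRecordWriterSketch {
    // Mimics LineRecordWriter's layout: key bytes, tab, value bytes, newline.
    static String formatRecord(String key, String value) {
        return key + "\t" + value + "\n";
    }

    public static void main(String[] args) {
        System.out.print(formatRecord("be", "2"));
    }
}
```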
