1. Mapper class
A custom map class extends the library class Mapper, namely Mapper&lt;KEYIN, VALUEIN, KEYOUT, VALUEOUT&gt;, and usually overrides the map method. map accepts one key-value pair at a time, pre-processes it, and then emits the processed data. The default map method is:
protected void map(KEYIN key, VALUEIN value, Context context)
        throws IOException, InterruptedException {
    context.write((KEYOUT) key, (VALUEOUT) value);
}
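The default map above just forwards its input unchanged. A more typical override tokenizes each line. The following plain-Java sketch (no Hadoop dependency; the class name and the list standing in for context.write are illustrative) mimics that pattern for a word-count map step:

```java
import java.util.AbstractMap;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

public class WordCountMapSketch {
    // Mimics Mapper.map: one (offset, line) record in, zero or more
    // (word, 1) pairs out. The returned list stands in for context.write().
    static List<Map.Entry<String, Integer>> map(long offset, String line) {
        List<Map.Entry<String, Integer>> out = new ArrayList<>();
        for (String word : line.trim().toLowerCase().split("\\s+")) {
            if (!word.isEmpty()) {
                out.add(new AbstractMap.SimpleEntry<>(word, 1));
            }
        }
        return out;
    }

    public static void main(String[] args) {
        // The framework would call map once per input record.
        System.out.println(map(0L, "Hello MapReduce hello"));
        // prints [hello=1, mapreduce=1, hello=1]
    }
}
```

Note that map emits one pair per word occurrence; counting happens later, in the reduce step.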
2. Reducer class
A custom reduce class extends the library class Reducer, whose prototype is Reducer&lt;KEYIN, VALUEIN, KEYOUT, VALUEOUT&gt;. Apart from the reduce method, it is structured the same way as the map class and plays the analogous role. A typical reduce method is:
protected void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
    for (IntWritable value : values) {
        context.write(key, value);
    }
}
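For word count, the reduce step sums all the values that the framework has grouped under one key. A plain-Java sketch of that logic (no Hadoop dependency; names are illustrative):

```java
import java.util.List;

public class WordCountReduceSketch {
    // Mimics Reducer.reduce: one key plus all of its grouped values in,
    // one (key, sum) result out.
    static int reduce(String key, List<Integer> values) {
        int sum = 0;
        for (int v : values) {
            sum += v;
        }
        return sum;
    }

    public static void main(String[] args) {
        // The framework would call reduce once per distinct key.
        System.out.println("hello=" + reduce("hello", List.of(1, 1, 1)));
        // prints hello=3
    }
}
```

The same summing logic is why this class can double as the combiner in the driver below: summing partial sums gives the same total.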
3. MapReduce driver
In the simplest case, the main function typically includes:
Configuration conf = new Configuration();
// Get the input/output file paths
String[] otherArgs = new GenericOptionsParser(conf, args).getRemainingArgs();
if (otherArgs.length != 2) {
    System.err.println("Usage: WordCount <in> <out>");
    System.exit(2);
}
Job job = new Job(conf, "Dedup");
job.setJarByClass(Dedup.class);        // main class
job.setMapperClass(Map.class);         // map class
job.setCombinerClass(Reduce.class);    // combiner class
job.setReducerClass(Reduce.class);     // reduce class
job.setOutputKeyClass(Text.class);     // key class of the job output
job.setOutputValueClass(Text.class);   // value class of the job output
FileInputFormat.addInputPath(job, new Path(otherArgs[0]));   // file input
FileOutputFormat.setOutputPath(job, new Path(otherArgs[1])); // file output
System.exit(job.waitForCompletion(true) ? 0 : 1);
In fact, a minimal MapReduce driver needs even less: relying entirely on the framework's defaults, it only sets the jar and the input and output paths:

Job job = new Job(conf, "Dedup");
job.setJarByClass(Dedup.class);
FileInputFormat.addInputPath(job, new Path(otherArgs[0]));
FileOutputFormat.setOutputPath(job, new Path(otherArgs[1]));
System.exit(job.waitForCompletion(true) ? 0 : 1);
4. InputFormat interface
The InputFormat class sits at the top of a hierarchy of input-format implementations. TextInputFormat is the default implementation of InputFormat, and it is effective when the input data has no explicit key-value structure: the key it returns is the byte offset of each line within the file, and the value is the contents of that line.
5. InputSplit class
By default, FileInputFormat and its subclasses split files into chunks of 64 MB (the same as the default HDFS block size). Processing files in chunks lets multiple map tasks work on a single file in parallel, which greatly improves performance for large files. The input to each map task is one such input split, an InputSplit.
InputSplit has two subclasses: FileSplit and CombineFileSplit. Both record the file path, the split's start offset, the split's length, and the list of hosts storing the split's data. CombineFileSplit, however, is designed for small files: it packs many small files into a single InputSplit, making it practical to process large numbers of small files.
To keep a file from being split, there are two approaches: the first is to set the minimum split size larger than the file size; the second is to subclass FileInputFormat and override the isSplitable method to return false.
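Both rules can be seen in a small sketch of split planning (plain Java; the real logic lives inside FileInputFormat, and the names here are illustrative): the effective split size is the block size unless the configured minimum is larger, and a non-splittable file always yields exactly one split.

```java
import java.util.ArrayList;
import java.util.List;

public class SplitPlannerSketch {
    static final long BLOCK_SIZE = 64L * 1024 * 1024; // classic 64 MB default

    // Returns the start offsets of the splits for a file of fileLen bytes.
    static List<Long> planSplits(long fileLen, long minSplitSize, boolean splittable) {
        List<Long> starts = new ArrayList<>();
        if (!splittable) {      // isSplitable() returned false:
            starts.add(0L);     // the whole file is a single split
            return starts;
        }
        long splitSize = Math.max(minSplitSize, BLOCK_SIZE);
        for (long off = 0; off < fileLen; off += splitSize) {
            starts.add(off);
        }
        return starts;
    }

    public static void main(String[] args) {
        long oneGiB = 1024L * 1024 * 1024;
        // Default settings: a 1 GiB file becomes 16 x 64 MB splits.
        System.out.println(planSplits(oneGiB, 1, true).size());          // prints 16
        // Minimum split size larger than the file: one split.
        System.out.println(planSplits(oneGiB, 2 * oneGiB, true).size()); // prints 1
    }
}
```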
6. RecordReader class
InputSplit defines how the work is sliced up, while the RecordReader class defines how to load the data and convert it into key-value pairs suitable for the map method to read. The default input format is TextInputFormat, whose RecordReader reads line-oriented records.
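The RecordReader contract is essentially an iterator: nextKeyValue() advances to the next record, then getCurrentKey()/getCurrentValue() expose the pair. A plain-Java sketch of a line-oriented reader over an in-memory "split" (no Hadoop dependency; names are illustrative):

```java
public class LineReaderSketch {
    private final String[] lines;
    private int index = -1;
    private long offset = 0;
    private long currentKey;

    LineReaderSketch(String split) {        // the "split" is just a string here
        this.lines = split.split("\n", -1);
    }

    // Mirrors RecordReader.nextKeyValue(): advance to the next record,
    // returning false when the split is exhausted.
    boolean nextKeyValue() {
        if (index >= 0) {
            offset += lines[index].length() + 1; // skip past line + '\n'
        }
        index++;
        if (index >= lines.length) {
            return false;
        }
        currentKey = offset;
        return true;
    }

    long getCurrentKey() { return currentKey; }       // byte offset of the line
    String getCurrentValue() { return lines[index]; } // line contents

    public static void main(String[] args) {
        LineReaderSketch r = new LineReaderSketch("alpha\nbeta");
        while (r.nextKeyValue()) {
            System.out.println(r.getCurrentKey() + " -> " + r.getCurrentValue());
        }
        // prints:
        // 0 -> alpha
        // 6 -> beta
    }
}
```

The framework drives exactly this loop for each split, feeding every (key, value) pair it yields into one call of the map method.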
How a MapReduce program invokes each of these classes