1. MapReduce Code Entry
job.setInputFormatClass(TextInputFormat.class); // sets the MapReduce input format
job.waitForCompletion(true);                    // submits the job: the entry point into the MapReduce run
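For context, a minimal driver sketch that leads up to this entry point; the class name WordCountDriver, the job name, and the argument handling are illustrative assumptions, not taken from the original:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCountDriver {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word-count"); // illustrative job name
        job.setJarByClass(WordCountDriver.class);
        job.setInputFormatClass(TextInputFormat.class); // the InputFormat analyzed below
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1); // entry point into the MapReduce run
    }
}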
2. InputFormat Analysis
public abstract class InputFormat<K, V> {

    // Gets the splits of the input file; these are only logical splits,
    // there is no physical splitting of the data
    public abstract List<InputSplit> getSplits(JobContext context)
            throws IOException, InterruptedException;

    // Creates a RecordReader to read data from an InputSplit
    public abstract RecordReader<K, V> createRecordReader(InputSplit split,
            TaskAttemptContext context) throws IOException, InterruptedException;
}
Different InputFormat implementations provide different ways of reading and splitting files; each input split (InputSplit) becomes the data source of a separate map task.
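For example, a minimal sketch of a custom InputFormat (the name SimpleTextInputFormat is invented here) that keeps FileInputFormat's logical splitting and reads each split with the standard LineRecordReader:

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.InputSplit;
import org.apache.hadoop.mapreduce.RecordReader;
import org.apache.hadoop.mapreduce.TaskAttemptContext;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.LineRecordReader;

public class SimpleTextInputFormat extends FileInputFormat<LongWritable, Text> {
    @Override
    public RecordReader<LongWritable, Text> createRecordReader(
            InputSplit split, TaskAttemptContext context) {
        // getSplits() is inherited from FileInputFormat; each split it
        // produces is handed to one map task, which reads it line by line
        return new LineRecordReader();
    }
}

This is essentially what the built-in TextInputFormat does.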
3. InputSplit
The input to a Mapper is a single input split (InputSplit).
public abstract class InputSplit {
    public abstract long getLength();
    public abstract String[] getLocations();
}

public class FileSplit extends InputSplit implements Writable {
    private Path file;       // file path
    private long start;      // split start position
    private long length;     // split length
    private String[] hosts;  // hosts that store the split

    public FileSplit(Path file, long start, long length, String[] hosts) {
        this.file = file;
        this.start = start;
        this.length = length;
        this.hosts = hosts;
    }
}
A FileSplit corresponds to an input file for a Mapper; no matter how small the file is, it is handled as a separate InputSplit;
When the input consists of a large number of small files, this produces a large number of InputSplits, which in turn require a large number of Mappers to process them;
The overhead of creating and destroying that many map tasks is huge; instead, multiple small files can be combined into a CombineFileSplit and processed by a single map task, as sketched below;
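A minimal sketch, assuming a Job already being configured in a driver, of combining small files with CombineTextInputFormat (which produces CombineFileSplits); the 64 MB cap is an arbitrary illustrative value:

import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.CombineTextInputFormat;

public class CombineSmallFilesSetup {
    public static void configure(Job job) {
        // pack many small files into each split so one map task handles several files
        job.setInputFormatClass(CombineTextInputFormat.class);
        // cap each combined split at 64 MB (illustrative value)
        CombineTextInputFormat.setMaxInputSplitSize(job, 64L * 1024 * 1024);
    }
}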
4. FileInputFormat
public List<InputSplit> getSplits(JobContext job) throws IOException {
    /*
     * getFormatMinSplitSize() = 1
     * getMinSplitSize(job) = job.getConfiguration().getLong(SPLIT_MINSIZE, 1L)
     * SPLIT_MINSIZE = "mapreduce.input.fileinputformat.split.minsize"
     * (the parameter is 0 in mapred-default.xml)
     */
    long minSize = Math.max(getFormatMinSplitSize(), getMinSplitSize(job)); // minimum split size: max(1, 0) = 1
    /*
     * SPLIT_MAXSIZE = "mapreduce.input.fileinputformat.split.maxsize"
     * (the parameter is empty in mapred-default.xml)
     */
    long maxSize = getMaxSplitSize(job); // maximum split size: Long.MAX_VALUE

    // holds the split results for the input files
    List<InputSplit> splits = new ArrayList<InputSplit>();
    List<FileStatus> files = listStatus(job);
    for (FileStatus file : files) {
        Path path = file.getPath();
        long length = file.getLen();
        if (length != 0) {
            ...
            if (isSplitable(job, path)) { // can be split
                long blockSize = file.getBlockSize();
                // computeSplitSize returns Math.max(minSize, Math.min(maxSize, blockSize)),
                // i.e. max(1, min(Long.MAX_VALUE, 64M)) = 64M: splitSize = blockSize by default
                long splitSize = computeSplitSize(blockSize, minSize, maxSize);

                // loop over the file: while the ratio of remaining data to split size
                // is greater than SPLIT_SLOP, keep splitting; at or below it, stop
                long bytesRemaining = length;
                while (((double) bytesRemaining) / splitSize > SPLIT_SLOP) { // SPLIT_SLOP = 1.1
                    int blkIndex = getBlockIndex(blkLocations, length - bytesRemaining);
                    splits.add(makeSplit(path, length - bytesRemaining, splitSize,
                            blkLocations[blkIndex].getHosts()));
                    bytesRemaining -= splitSize;
                }

                // handle the remaining data
                if (bytesRemaining != 0) {
                    int blkIndex = getBlockIndex(blkLocations, length - bytesRemaining);
                    splits.add(makeSplit(path, length - bytesRemaining, bytesRemaining,
                            blkLocations[blkIndex].getHosts()));
                }
            } else {
                // not splittable: return the whole file as one split (some compressed formats cannot be split)
                splits.add(makeSplit(path, 0, length, blkLocations[0].getHosts()));
            }
        } else {
            splits.add(makeSplit(path, 0, length, new String[0]));
        }
    }
    job.getConfiguration().setLong(NUM_INPUT_FILES, files.size()); // set the number of input files
    LOG.debug("Total # of splits: " + splits.size());
    return splits;
}
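Building on the minSize/maxSize computation above, a brief sketch of tuning the two configuration keys through FileInputFormat's static setters (the 32 MB and 128 MB values are illustrative):

import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;

public class SplitSizeTuning {
    public static void configure(Job job) {
        // sets mapreduce.input.fileinputformat.split.minsize
        FileInputFormat.setMinInputSplitSize(job, 32L * 1024 * 1024);
        // sets mapreduce.input.fileinputformat.split.maxsize
        FileInputFormat.setMaxInputSplitSize(job, 128L * 1024 * 1024);
        // the resulting split size is max(32M, min(128M, blockSize))
    }
}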
5. PathFilter
protected List<FileStatus> listStatus(JobContext job) throws IOException {
    ...
    List<PathFilter> filters = new ArrayList<PathFilter>();
    filters.add(hiddenFileFilter);          // always filter out hidden files
    PathFilter jobFilter = getInputPathFilter(job);
    if (jobFilter != null) {
        filters.add(jobFilter);             // add the user-supplied filter, if any
    }
    PathFilter inputFilter = new MultiPathFilter(filters);
    ......
}
PathFilter is a file filter interface through which we can control which files are used as input and which are not;
PathFilter has an accept(Path) method that returns true if the given path should be included and false otherwise;
public interface PathFilter {
    boolean accept(Path path);
}

// filters out files whose names start with _ or .
private static final PathFilter hiddenFileFilter = new PathFilter() {
    public boolean accept(Path p) {
        String name = p.getName();
        return !name.startsWith("_") && !name.startsWith(".");
    }
};
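A brief sketch of a user-supplied filter (the class name LogFileFilter is invented); listStatus() picks it up through getInputPathFilter(job) as shown above:

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.fs.PathFilter;

public class LogFileFilter implements PathFilter {
    @Override
    public boolean accept(Path path) {
        // admit only files whose names end in .log; note that a real filter
        // usually must also accept directories so they can be traversed
        return path.getName().endsWith(".log");
    }
}

It would be registered in the driver with FileInputFormat.setInputPathFilter(job, LogFileFilter.class).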
6. RecordReader
RecordReader splits the InputSplit into key-value pairs.
public abstract class RecordReader<KEYIN, VALUEIN> implements Closeable {

    // initialize with the InputSplit
    public abstract void initialize(InputSplit split, TaskAttemptContext context)
            throws IOException, InterruptedException;

    // read the next <key, value> pair from the split
    public abstract boolean nextKeyValue() throws IOException, InterruptedException;

    // get the key that was just read
    public abstract KEYIN getCurrentKey() throws IOException, InterruptedException;

    // get the value that was just read
    public abstract VALUEIN getCurrentValue() throws IOException, InterruptedException;

    // track the progress of reading the split
    public abstract float getProgress() throws IOException, InterruptedException;

    // close the RecordReader
    public abstract void close() throws IOException;
}
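To make the contract concrete, a hedged sketch (LineKeyRecordReader is an invented name) that delegates to the built-in LineRecordReader and exposes each line as the key with a NullWritable value:

import java.io.IOException;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.InputSplit;
import org.apache.hadoop.mapreduce.RecordReader;
import org.apache.hadoop.mapreduce.TaskAttemptContext;
import org.apache.hadoop.mapreduce.lib.input.LineRecordReader;

public class LineKeyRecordReader extends RecordReader<Text, NullWritable> {
    private final LineRecordReader delegate = new LineRecordReader();

    @Override
    public void initialize(InputSplit split, TaskAttemptContext context) throws IOException {
        delegate.initialize(split, context); // position the reader at the start of the split
    }

    @Override
    public boolean nextKeyValue() throws IOException {
        return delegate.nextKeyValue(); // advance one line; false at the end of the split
    }

    @Override
    public Text getCurrentKey() throws IOException {
        return delegate.getCurrentValue(); // the line text becomes the key
    }

    @Override
    public NullWritable getCurrentValue() {
        return NullWritable.get();
    }

    @Override
    public float getProgress() throws IOException {
        return delegate.getProgress(); // fraction of the split consumed so far
    }

    @Override
    public void close() throws IOException {
        delegate.close();
    }
}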
7. Mapper
public class Mapper<KEYIN, VALUEIN, KEYOUT, VALUEOUT> {

    public abstract class Context
            implements MapContext<KEYIN, VALUEIN, KEYOUT, VALUEOUT> {
    }

    // preprocessing; runs only once, when the map task starts
    protected void setup(Context context) throws IOException, InterruptedException {
    }

    // runs once for each <key, value> pair in the InputSplit
    protected void map(KEYIN key, VALUEIN value, Context context)
            throws IOException, InterruptedException {
        context.write((KEYOUT) key, (VALUEOUT) value);
    }

    // cleanup work, such as closing streams
    protected void cleanup(Context context) throws IOException, InterruptedException {
    }

    public void run(Context context) throws IOException, InterruptedException {
        setup(context);
        try {
            while (context.nextKeyValue()) {
                map(context.getCurrentKey(), context.getCurrentValue(), context);
            }
        } finally {
            cleanup(context);
        }
    }
}
Application of the template method pattern in the run method (a concrete Mapper sketch follows this list):
1) setup
2) loop over the <key, value> pairs read from the InputSplit, calling the map function on each
3) cleanup
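A minimal word-count style sketch (TokenCountMapper is an invented name) of a concrete Mapper that plugs into the run() template by overriding map():

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class TokenCountMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        // run() calls this once per <offset, line> pair produced by the RecordReader
        for (String token : value.toString().split("\\s+")) {
            if (!token.isEmpty()) {
                word.set(token);
                context.write(word, ONE);
            }
        }
    }
}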
At this point the MapReduce input files have been filtered, split, and read into <key, value> pairs, which are then handed over to the Mapper class for processing.
Figure: the MapReduce flow from input files to Mapper processing.