The MapReduce process from input files to Mapper processing


1. MapReduce Code Entry

    Job job = Job.getInstance(conf);
    job.setInputFormatClass(TextInputFormat.class);  // sets the MapReduce input format
    job.waitForCompletion(true);
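For context, a complete entry point might look like the following minimal driver. This is a sketch: the class name MinimalDriver, the use of the built-in identity Mapper, and the argument handling are illustrative assumptions, not from the original.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class MinimalDriver {
        public static void main(String[] args) throws Exception {
            Job job = Job.getInstance(new Configuration(), "minimal-example");
            job.setJarByClass(MinimalDriver.class);
            job.setInputFormatClass(TextInputFormat.class); // the default; shown explicitly
            job.setMapperClass(Mapper.class);               // identity mapper
            job.setOutputKeyClass(LongWritable.class);      // TextInputFormat keys
            job.setOutputValueClass(Text.class);            // TextInputFormat values
            FileInputFormat.addInputPath(job, new Path(args[0]));
            FileOutputFormat.setOutputPath(job, new Path(args[1]));
            System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
    }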

2. InputFormat Analysis

    public abstract class InputFormat<K, V> {
        // gets the splits of the input files; this is only a logical split,
        // not a physical one
        public abstract List<InputSplit> getSplits(JobContext context)
                throws IOException, InterruptedException;

        // creates a RecordReader that reads data from an InputSplit
        public abstract RecordReader<K, V> createRecordReader(InputSplit split,
                TaskAttemptContext context) throws IOException, InterruptedException;
    }

Different InputFormat implementations read and split files in different ways; each input split (InputSplit) becomes the data source for a separate map task.

3. InputSplit

The input to a Mapper is a single input split (InputSplit).

    public abstract class InputSplit {
        public abstract long getLength();
        public abstract String[] getLocations();
    }

    public class FileSplit extends InputSplit implements Writable {
        private Path file;       // file path
        private long start;      // start position of the split within the file
        private long length;     // length of the split
        private String[] hosts;  // hosts that store the split

        public FileSplit(Path file, long start, long length, String[] hosts) {
            this.file = file;
            this.start = start;
            this.length = length;
            this.hosts = hosts;
        }
    }
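Inside a running map task, the current split can be inspected through the context. A small sketch, assuming the job uses a FileInputFormat subclass so the split can be cast to FileSplit (the class name SplitAwareMapper is illustrative):

    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.lib.input.FileSplit;

    // logs which file and byte range this map task is reading
    public class SplitAwareMapper extends Mapper<LongWritable, Text, Text, Text> {
        @Override
        protected void setup(Context context) {
            FileSplit split = (FileSplit) context.getInputSplit();
            System.out.println("file=" + split.getPath()
                    + " start=" + split.getStart()
                    + " length=" + split.getLength());
        }
    }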

A FileSplit corresponds to an input file for a Mapper; no matter how small the file is, it is handled as a separate InputSplit.
When the input consists of a large number of small files, this produces a large number of InputSplits, which in turn require a large number of Mapper tasks.
Creating and destroying that many Mapper tasks carries a huge overhead; multiple small files can instead be combined into a CombineFileSplit and processed by a single Mapper task, as sketched below.
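A minimal sketch of that approach using the built-in CombineTextInputFormat, whose getSplits produces CombineFileSplits. The 4 MB cap and the class name SmallFileJobConfig are illustrative assumptions, not from the original.

    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.input.CombineTextInputFormat;

    public class SmallFileJobConfig {
        // packs many small files into fewer, larger splits so that one
        // Mapper task handles several files instead of one file each
        static void useCombinedSplits(Job job) {
            job.setInputFormatClass(CombineTextInputFormat.class);
            // upper bound on the combined split size (4 MB here, an example
            // value); small files are grouped until a split reaches this size
            CombineTextInputFormat.setMaxInputSplitSize(job, 4L * 1024 * 1024);
        }
    }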

4. FileInputFormat

    public List<InputSplit> getSplits(JobContext job) throws IOException {
        /*
         * getFormatMinSplitSize() = 1
         * getMinSplitSize(job) = job.getConfiguration().getLong(SPLIT_MINSIZE, 1L)
         * SPLIT_MINSIZE = "mapreduce.input.fileinputformat.split.minsize"
         * (the parameter is 0 in mapred-default.xml)
         */
        long minSize = Math.max(getFormatMinSplitSize(), getMinSplitSize(job)); // minimum split size: max(1, 0) = 1

        /*
         * SPLIT_MAXSIZE = "mapreduce.input.fileinputformat.split.maxsize"
         * (the parameter is empty in mapred-default.xml)
         */
        long maxSize = getMaxSplitSize(job); // maximum split size: Long.MAX_VALUE

        // stores the split results for the input files
        List<InputSplit> splits = new ArrayList<InputSplit>();
        List<FileStatus> files = listStatus(job);
        for (FileStatus file : files) {
            Path path = file.getPath();
            long length = file.getLen();
            if (length != 0) {
                ... // (elided) obtain the file's block locations, blkLocations
                if (isSplitable(job, path)) { // the file can be split
                    long blockSize = file.getBlockSize();
                    // computeSplitSize(blockSize, minSize, maxSize) is
                    //     Math.max(minSize, Math.min(maxSize, blockSize));
                    // i.e. max(1, min(Long.MAX_VALUE, 64M)) = 64M,
                    // so splitSize = blockSize by default
                    long splitSize = computeSplitSize(blockSize, minSize, maxSize);

                    // split in a loop: while the ratio of the remaining bytes to the
                    // split size is greater than SPLIT_SLOP, keep splitting;
                    // when it is less than or equal, stop splitting
                    long bytesRemaining = length;
                    while (((double) bytesRemaining) / splitSize > SPLIT_SLOP) { // SPLIT_SLOP = 1.1
                        int blkIndex = getBlockIndex(blkLocations, length - bytesRemaining);
                        splits.add(makeSplit(path, length - bytesRemaining, splitSize,
                                blkLocations[blkIndex].getHosts()));
                        bytesRemaining -= splitSize;
                    }
                    // handle the remaining data
                    if (bytesRemaining != 0) {
                        int blkIndex = getBlockIndex(blkLocations, length - bytesRemaining);
                        splits.add(makeSplit(path, length - bytesRemaining, bytesRemaining,
                                blkLocations[blkIndex].getHosts()));
                    }
                } else {
                    // not splittable: return the whole file as a single split
                    // (some compressed formats cannot be split)
                    splits.add(makeSplit(path, 0, length, blkLocations[0].getHosts()));
                }
            } else {
                splits.add(makeSplit(path, 0, length, new String[0]));
            }
        }
        job.getConfiguration().setLong(NUM_INPUT_FILES, files.size()); // set the number of input files
        LOG.debug("Total # of splits: " + splits.size());
        return splits;
    }
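To make the SPLIT_SLOP rule concrete, here is a small standalone simulation of the splitting loop above. The 64 MB split size and the 150 MB and 70 MB file lengths are assumed example values.

    public class SplitSlopDemo {
        static final double SPLIT_SLOP = 1.1; // same constant as FileInputFormat

        // mimics the splitting loop of getSplits for a single file
        static void printSplits(long length, long splitSize) {
            long bytesRemaining = length;
            int n = 0;
            while (((double) bytesRemaining) / splitSize > SPLIT_SLOP) {
                System.out.printf("split %d: offset=%d, length=%d%n",
                        n++, length - bytesRemaining, splitSize);
                bytesRemaining -= splitSize;
            }
            if (bytesRemaining != 0) { // the tail becomes the last split
                System.out.printf("split %d: offset=%d, length=%d%n",
                        n, length - bytesRemaining, bytesRemaining);
            }
        }

        public static void main(String[] args) {
            long mb = 1024 * 1024L;
            // 150 MB file, 64 MB splits -> 64 MB + 64 MB + 22 MB
            printSplits(150 * mb, 64 * mb);
            // 70 MB file: 70/64 = 1.09 <= 1.1, so the whole file stays one
            // 70 MB split instead of a 64 MB split plus a tiny 6 MB one
            printSplits(70 * mb, 64 * mb);
        }
    }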

5. PathFilter

    protected List<FileStatus> listStatus(JobContext job) throws IOException {
        ...
        List<PathFilter> filters = new ArrayList<PathFilter>();
        filters.add(hiddenFileFilter);                  // always filter out hidden files
        PathFilter jobFilter = getInputPathFilter(job); // user-supplied filter, if any
        if (jobFilter != null) {
            filters.add(jobFilter);
        }
        PathFilter inputFilter = new MultiPathFilter(filters); // combines all filters
        ......
    }

PathFilter is a file-filtering interface through which we can control which files are used as input and which are not.
PathFilter has an accept(Path) method that returns true if the given path should be included, and false otherwise.

    public interface PathFilter {
        boolean accept(Path path);
    }

    // filters out files whose names start with _ or .
    private static final PathFilter hiddenFileFilter = new PathFilter() {
        public boolean accept(Path p) {
            String name = p.getName();
            return !name.startsWith("_") && !name.startsWith(".");
        }
    };
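A job-specific filter is registered with FileInputFormat.setInputPathFilter and is what getInputPathFilter above returns. A minimal sketch, assuming we only want files ending in .log (the class name LogFileFilter and the extension are illustrative):

    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.fs.PathFilter;

    // accepts only paths whose names end in ".log"
    public class LogFileFilter implements PathFilter {
        @Override
        public boolean accept(Path path) {
            return path.getName().endsWith(".log");
        }
    }

It would be registered with FileInputFormat.setInputPathFilter(job, LogFileFilter.class); listStatus then applies it after hiddenFileFilter through the MultiPathFilter shown above.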

6. RecordReader

A RecordReader splits an InputSplit into key-value pairs.

    public abstract class RecordReader<KEYIN, VALUEIN> implements Closeable {
        // initialization with the InputSplit to read
        public abstract void initialize(InputSplit split, TaskAttemptContext context)
                throws IOException, InterruptedException;

        // reads the next <key, value> pair from the split
        public abstract boolean nextKeyValue() throws IOException, InterruptedException;

        // gets the key that was just read
        public abstract KEYIN getCurrentKey() throws IOException, InterruptedException;

        // gets the value that was just read
        public abstract VALUEIN getCurrentValue() throws IOException, InterruptedException;

        // tracks the progress of reading the split
        public abstract float getProgress() throws IOException, InterruptedException;

        // closes the RecordReader
        public abstract void close() throws IOException;
    }
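As an illustration of implementing this contract, below is a sketch of the well-known "whole file as one record" reader; the class name is illustrative, and it assumes the matching InputFormat marks files as non-splittable so each split covers one whole file.

    import java.io.IOException;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataInputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.BytesWritable;
    import org.apache.hadoop.io.IOUtils;
    import org.apache.hadoop.io.NullWritable;
    import org.apache.hadoop.mapreduce.InputSplit;
    import org.apache.hadoop.mapreduce.RecordReader;
    import org.apache.hadoop.mapreduce.TaskAttemptContext;
    import org.apache.hadoop.mapreduce.lib.input.FileSplit;

    // emits exactly one record per split: the whole file as the value
    public class WholeFileRecordReader extends RecordReader<NullWritable, BytesWritable> {
        private FileSplit split;
        private Configuration conf;
        private final BytesWritable value = new BytesWritable();
        private boolean processed = false;

        @Override
        public void initialize(InputSplit split, TaskAttemptContext context) {
            this.split = (FileSplit) split;
            this.conf = context.getConfiguration();
        }

        @Override
        public boolean nextKeyValue() throws IOException {
            if (processed) {
                return false; // only one record per split
            }
            byte[] contents = new byte[(int) split.getLength()];
            Path file = split.getPath();
            FileSystem fs = file.getFileSystem(conf);
            FSDataInputStream in = null;
            try {
                in = fs.open(file);
                IOUtils.readFully(in, contents, 0, contents.length);
                value.set(contents, 0, contents.length);
            } finally {
                IOUtils.closeStream(in);
            }
            processed = true;
            return true;
        }

        @Override
        public NullWritable getCurrentKey() { return NullWritable.get(); }

        @Override
        public BytesWritable getCurrentValue() { return value; }

        @Override
        public float getProgress() { return processed ? 1.0f : 0.0f; }

        @Override
        public void close() { }
    }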

7. Mapper

    public class Mapper<KEYIN, VALUEIN, KEYOUT, VALUEOUT> {

        public abstract class Context
                implements MapContext<KEYIN, VALUEIN, KEYOUT, VALUEOUT> {
        }

        // preprocessing; runs only once, when the map task starts
        protected void setup(Context context) throws IOException, InterruptedException {
        }

        // runs once for each <key, value> pair in the InputSplit
        @SuppressWarnings("unchecked")
        protected void map(KEYIN key, VALUEIN value, Context context)
                throws IOException, InterruptedException {
            context.write((KEYOUT) key, (VALUEOUT) value);
        }

        // cleanup work, such as closing streams
        protected void cleanup(Context context) throws IOException, InterruptedException {
        }

        public void run(Context context) throws IOException, InterruptedException {
            setup(context);
            try {
                while (context.nextKeyValue()) {
                    map(context.getCurrentKey(), context.getCurrentValue(), context);
                }
            } finally {
                cleanup(context);
            }
        }
    }

The run method is an application of the template method pattern:
1) setup
2) loop over the <key, value> pairs read from the InputSplit, calling the map function on each
3) cleanup
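As a concrete example of overriding map, here is the usual word-count Mapper; the class is a sketch, not from the original article.

    import java.io.IOException;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;

    // for each input line, emits <word, 1> for every token on the line
    public class WordCountMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            // key is the byte offset of the line; value is the line itself
            for (String token : value.toString().split("\\s+")) {
                if (!token.isEmpty()) {
                    word.set(token);
                    context.write(word, ONE);
                }
            }
        }
    }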

This completes the path of MapReduce input: the input files are filtered, split, and read into <key, value> pairs, which are then handed to the Mapper class for processing.
