The MapReduce process from input files to Mapper processing


1. MapReduce Code Entry

    Job job = Job.getInstance(conf);
    job.setInputFormatClass(TextInputFormat.class);  // sets the MapReduce input format
    job.waitForCompletion(true);
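For context, a complete entry point might look like the following minimal driver. This is a sketch: the class name MinimalDriver, the use of the built-in identity Mapper, and the argument handling are illustrative assumptions, not from the original.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class MinimalDriver {
        public static void main(String[] args) throws Exception {
            Job job = Job.getInstance(new Configuration(), "minimal-example");
            job.setJarByClass(MinimalDriver.class);
            job.setInputFormatClass(TextInputFormat.class); // the default; shown explicitly
            job.setMapperClass(Mapper.class);               // identity mapper
            job.setOutputKeyClass(LongWritable.class);      // TextInputFormat keys
            job.setOutputValueClass(Text.class);            // TextInputFormat values
            FileInputFormat.addInputPath(job, new Path(args[0]));
            FileOutputFormat.setOutputPath(job, new Path(args[1]));
            System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
    }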

2. InputFormat Analysis

    public abstract class InputFormat<K, V> {
        // gets the splits of the input files; this is only a logical split,
        // not a physical one
        public abstract List<InputSplit> getSplits(JobContext context)
                throws IOException, InterruptedException;

        // creates a RecordReader that reads data from an InputSplit
        public abstract RecordReader<K, V> createRecordReader(InputSplit split,
                TaskAttemptContext context) throws IOException, InterruptedException;
    }

Different InputFormat implementations read and split files in different ways; each input split (InputSplit) becomes the data source for a separate map task.

3. InputSplit

The input to a Mapper is a single input split (InputSplit).

    public abstract class InputSplit {
        public abstract long getLength();
        public abstract String[] getLocations();
    }

    public class FileSplit extends InputSplit implements Writable {
        private Path file;       // file path
        private long start;      // start position of the split within the file
        private long length;     // length of the split
        private String[] hosts;  // hosts that store the split

        public FileSplit(Path file, long start, long length, String[] hosts) {
            this.file = file;
            this.start = start;
            this.length = length;
            this.hosts = hosts;
        }
    }
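Inside a running map task, the current split can be inspected through the context. A small sketch, assuming the job uses a FileInputFormat subclass so the split can be cast to FileSplit (the class name SplitAwareMapper is illustrative):

    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.lib.input.FileSplit;

    // logs which file and byte range this map task is reading
    public class SplitAwareMapper extends Mapper<LongWritable, Text, Text, Text> {
        @Override
        protected void setup(Context context) {
            FileSplit split = (FileSplit) context.getInputSplit();
            System.out.println("file=" + split.getPath()
                    + " start=" + split.getStart()
                    + " length=" + split.getLength());
        }
    }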

A FileSplit corresponds to an input file for a Mapper; no matter how small the file is, it is handled as a separate InputSplit.
When the input consists of a large number of small files, this produces a large number of InputSplits, which in turn require a large number of Mapper tasks.
Creating and destroying that many Mapper tasks carries a huge overhead; multiple small files can instead be combined into a CombineFileSplit and processed by a single Mapper task, as sketched below.
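A minimal sketch of that approach using the built-in CombineTextInputFormat, whose getSplits produces CombineFileSplits. The 4 MB cap and the class name SmallFileJobConfig are illustrative assumptions, not from the original.

    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.input.CombineTextInputFormat;

    public class SmallFileJobConfig {
        // packs many small files into fewer, larger splits so that one
        // Mapper task handles several files instead of one file each
        static void useCombinedSplits(Job job) {
            job.setInputFormatClass(CombineTextInputFormat.class);
            // upper bound on the combined split size (4 MB here, an example
            // value); small files are grouped until a split reaches this size
            CombineTextInputFormat.setMaxInputSplitSize(job, 4L * 1024 * 1024);
        }
    }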

4. FileInputFormat

    public List<InputSplit> getSplits(JobContext job) throws IOException {
        /*
         * getFormatMinSplitSize() = 1
         * getMinSplitSize(job) = job.getConfiguration().getLong(SPLIT_MINSIZE, 1L)
         * SPLIT_MINSIZE = "mapreduce.input.fileinputformat.split.minsize"
         * (the parameter is 0 in mapred-default.xml)
         */
        long minSize = Math.max(getFormatMinSplitSize(), getMinSplitSize(job)); // minimum split size: max(1, 0) = 1

        /*
         * SPLIT_MAXSIZE = "mapreduce.input.fileinputformat.split.maxsize"
         * (the parameter is empty in mapred-default.xml)
         */
        long maxSize = getMaxSplitSize(job); // maximum split size: Long.MAX_VALUE

        // stores the split results for the input files
        List<InputSplit> splits = new ArrayList<InputSplit>();
        List<FileStatus> files = listStatus(job);
        for (FileStatus file : files) {
            Path path = file.getPath();
            long length = file.getLen();
            if (length != 0) {
                ... // (elided) obtain the file's block locations, blkLocations
                if (isSplitable(job, path)) { // the file can be split
                    long blockSize = file.getBlockSize();
                    // computeSplitSize(blockSize, minSize, maxSize) is
                    //     Math.max(minSize, Math.min(maxSize, blockSize));
                    // i.e. max(1, min(Long.MAX_VALUE, 64M)) = 64M,
                    // so splitSize = blockSize by default
                    long splitSize = computeSplitSize(blockSize, minSize, maxSize);

                    // split in a loop: while the ratio of the remaining bytes to the
                    // split size is greater than SPLIT_SLOP, keep splitting;
                    // when it is less than or equal, stop splitting
                    long bytesRemaining = length;
                    while (((double) bytesRemaining) / splitSize > SPLIT_SLOP) { // SPLIT_SLOP = 1.1
                        int blkIndex = getBlockIndex(blkLocations, length - bytesRemaining);
                        splits.add(makeSplit(path, length - bytesRemaining, splitSize,
                                blkLocations[blkIndex].getHosts()));
                        bytesRemaining -= splitSize;
                    }
                    // handle the remaining data
                    if (bytesRemaining != 0) {
                        int blkIndex = getBlockIndex(blkLocations, length - bytesRemaining);
                        splits.add(makeSplit(path, length - bytesRemaining, bytesRemaining,
                                blkLocations[blkIndex].getHosts()));
                    }
                } else {
                    // not splittable: return the whole file as a single split
                    // (some compressed formats cannot be split)
                    splits.add(makeSplit(path, 0, length, blkLocations[0].getHosts()));
                }
            } else {
                splits.add(makeSplit(path, 0, length, new String[0]));
            }
        }
        job.getConfiguration().setLong(NUM_INPUT_FILES, files.size()); // set the number of input files
        LOG.debug("Total # of splits: " + splits.size());
        return splits;
    }
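To make the SPLIT_SLOP rule concrete, here is a small standalone simulation of the splitting loop above. The 64 MB split size and the 150 MB and 70 MB file lengths are assumed example values.

    public class SplitSlopDemo {
        static final double SPLIT_SLOP = 1.1; // same constant as FileInputFormat

        // mimics the splitting loop of getSplits for a single file
        static void printSplits(long length, long splitSize) {
            long bytesRemaining = length;
            int n = 0;
            while (((double) bytesRemaining) / splitSize > SPLIT_SLOP) {
                System.out.printf("split %d: offset=%d, length=%d%n",
                        n++, length - bytesRemaining, splitSize);
                bytesRemaining -= splitSize;
            }
            if (bytesRemaining != 0) { // the tail becomes the last split
                System.out.printf("split %d: offset=%d, length=%d%n",
                        n, length - bytesRemaining, bytesRemaining);
            }
        }

        public static void main(String[] args) {
            long mb = 1024 * 1024L;
            // 150 MB file, 64 MB splits -> 64 MB + 64 MB + 22 MB
            printSplits(150 * mb, 64 * mb);
            // 70 MB file: 70/64 = 1.09 <= 1.1, so the whole file stays one
            // 70 MB split instead of a 64 MB split plus a tiny 6 MB one
            printSplits(70 * mb, 64 * mb);
        }
    }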

5. PathFilter

    protected List<FileStatus> listStatus(JobContext job) throws IOException {
        ...
        List<PathFilter> filters = new ArrayList<PathFilter>();
        filters.add(hiddenFileFilter);                  // always filter out hidden files
        PathFilter jobFilter = getInputPathFilter(job); // user-supplied filter, if any
        if (jobFilter != null) {
            filters.add(jobFilter);
        }
        PathFilter inputFilter = new MultiPathFilter(filters); // combines all filters
        ......
    }

PathFilter is a file-filtering interface through which we can control which files are used as input and which are not.
PathFilter has an accept(Path) method that returns true if the given path should be included, and false otherwise.

    public interface PathFilter {
        boolean accept(Path path);
    }

    // filters out files whose names start with _ or .
    private static final PathFilter hiddenFileFilter = new PathFilter() {
        public boolean accept(Path p) {
            String name = p.getName();
            return !name.startsWith("_") && !name.startsWith(".");
        }
    };
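A job-specific filter is registered with FileInputFormat.setInputPathFilter and is what getInputPathFilter above returns. A minimal sketch, assuming we only want files ending in .log (the class name LogFileFilter and the extension are illustrative):

    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.fs.PathFilter;

    // accepts only paths whose names end in ".log"
    public class LogFileFilter implements PathFilter {
        @Override
        public boolean accept(Path path) {
            return path.getName().endsWith(".log");
        }
    }

It would be registered with FileInputFormat.setInputPathFilter(job, LogFileFilter.class); listStatus then applies it after hiddenFileFilter through the MultiPathFilter shown above.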

6. RecordReader

A RecordReader splits an InputSplit into key-value pairs.

    public abstract class RecordReader<KEYIN, VALUEIN> implements Closeable {
        // initialization with the InputSplit to read
        public abstract void initialize(InputSplit split, TaskAttemptContext context)
                throws IOException, InterruptedException;

        // reads the next <key, value> pair from the split
        public abstract boolean nextKeyValue() throws IOException, InterruptedException;

        // gets the key that was just read
        public abstract KEYIN getCurrentKey() throws IOException, InterruptedException;

        // gets the value that was just read
        public abstract VALUEIN getCurrentValue() throws IOException, InterruptedException;

        // tracks the progress of reading the split
        public abstract float getProgress() throws IOException, InterruptedException;

        // closes the RecordReader
        public abstract void close() throws IOException;
    }
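As an illustration of implementing this contract, below is a sketch of the well-known "whole file as one record" reader; the class name is illustrative, and it assumes the matching InputFormat marks files as non-splittable so each split covers one whole file.

    import java.io.IOException;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataInputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.BytesWritable;
    import org.apache.hadoop.io.IOUtils;
    import org.apache.hadoop.io.NullWritable;
    import org.apache.hadoop.mapreduce.InputSplit;
    import org.apache.hadoop.mapreduce.RecordReader;
    import org.apache.hadoop.mapreduce.TaskAttemptContext;
    import org.apache.hadoop.mapreduce.lib.input.FileSplit;

    // emits exactly one record per split: the whole file as the value
    public class WholeFileRecordReader extends RecordReader<NullWritable, BytesWritable> {
        private FileSplit split;
        private Configuration conf;
        private final BytesWritable value = new BytesWritable();
        private boolean processed = false;

        @Override
        public void initialize(InputSplit split, TaskAttemptContext context) {
            this.split = (FileSplit) split;
            this.conf = context.getConfiguration();
        }

        @Override
        public boolean nextKeyValue() throws IOException {
            if (processed) {
                return false; // only one record per split
            }
            byte[] contents = new byte[(int) split.getLength()];
            Path file = split.getPath();
            FileSystem fs = file.getFileSystem(conf);
            FSDataInputStream in = null;
            try {
                in = fs.open(file);
                IOUtils.readFully(in, contents, 0, contents.length);
                value.set(contents, 0, contents.length);
            } finally {
                IOUtils.closeStream(in);
            }
            processed = true;
            return true;
        }

        @Override
        public NullWritable getCurrentKey() { return NullWritable.get(); }

        @Override
        public BytesWritable getCurrentValue() { return value; }

        @Override
        public float getProgress() { return processed ? 1.0f : 0.0f; }

        @Override
        public void close() { }
    }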

7. Mapper

    public class Mapper<KEYIN, VALUEIN, KEYOUT, VALUEOUT> {

        public abstract class Context
                implements MapContext<KEYIN, VALUEIN, KEYOUT, VALUEOUT> {
        }

        // preprocessing; runs only once, when the map task starts
        protected void setup(Context context) throws IOException, InterruptedException {
        }

        // runs once for each <key, value> pair in the InputSplit
        @SuppressWarnings("unchecked")
        protected void map(KEYIN key, VALUEIN value, Context context)
                throws IOException, InterruptedException {
            context.write((KEYOUT) key, (VALUEOUT) value);
        }

        // cleanup work, such as closing streams
        protected void cleanup(Context context) throws IOException, InterruptedException {
        }

        public void run(Context context) throws IOException, InterruptedException {
            setup(context);
            try {
                while (context.nextKeyValue()) {
                    map(context.getCurrentKey(), context.getCurrentValue(), context);
                }
            } finally {
                cleanup(context);
            }
        }
    }

The run method is an application of the template method pattern:
1) setup
2) loop over the <key, value> pairs read from the InputSplit, calling the map function on each
3) cleanup
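As a concrete example of overriding map, here is the usual word-count Mapper; the class is a sketch, not from the original article.

    import java.io.IOException;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;

    // for each input line, emits <word, 1> for every token on the line
    public class WordCountMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            // key is the byte offset of the line; value is the line itself
            for (String token : value.toString().split("\\s+")) {
                if (!token.isEmpty()) {
                    word.set(token);
                    context.write(word, ONE);
                }
            }
        }
    }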

This completes the path of MapReduce input: the input files are filtered, split, and read into <key, value> pairs, which are then handed to the Mapper class for processing.
