Several factors determine the number of map tasks in Hadoop, and they differ between the old and new APIs. Mastering these factors helps you understand how Hadoop partitions input into splits, and it is also of great benefit when tuning Hadoop performance.
The getSplits method in the old API (org.apache.hadoop.mapred.FileInputFormat):
public InputSplit[] getSplits(JobConf job, int numSplits) throws IOException {
  FileStatus[] files = listStatus(job);

  // Save the number of input files in the job-conf
  job.setLong(NUM_INPUT_FILES, files.length);
  long totalSize = 0;                            // compute total size
  for (FileStatus file : files) {                // check we have valid files
    if (file.isDir()) {
      throw new IOException("Not a file: " + file.getPath());
    }
    totalSize += file.getLen();
  }

  long goalSize = totalSize / (numSplits == 0 ? 1 : numSplits);
  long minSize = Math.max(job.getLong("mapred.min.split.size", 1),
                          minSplitSize);

  // generate splits
  ArrayList<FileSplit> splits = new ArrayList<FileSplit>(numSplits);
  NetworkTopology clusterMap = new NetworkTopology();
  for (FileStatus file : files) {
    Path path = file.getPath();
    FileSystem fs = path.getFileSystem(job);
    long length = file.getLen();
    BlockLocation[] blkLocations = fs.getFileBlockLocations(file, 0, length);
    if ((length != 0) && isSplitable(fs, path)) {
      long blockSize = file.getBlockSize();
      long splitSize = computeSplitSize(goalSize, minSize, blockSize);

      long bytesRemaining = length;
      while (((double) bytesRemaining) / splitSize > SPLIT_SLOP) {
        String[] splitHosts = getSplitHosts(blkLocations,
            length - bytesRemaining, splitSize, clusterMap);
        splits.add(new FileSplit(path, length - bytesRemaining, splitSize,
                                 splitHosts));
        bytesRemaining -= splitSize;
      }

      if (bytesRemaining != 0) {
        splits.add(new FileSplit(path, length - bytesRemaining, bytesRemaining,
                   blkLocations[blkLocations.length - 1].getHosts()));
      }
    } else if (length != 0) {
      String[] splitHosts = getSplitHosts(blkLocations, 0, length, clusterMap);
      splits.add(new FileSplit(path, 0, length, splitHosts));
    } else {
      // Create empty hosts array for zero length files
      splits.add(new FileSplit(path, 0, length, new String[0]));
    }
  }
  LOG.debug("Total # of splits: " + splits.size());
  return splits.toArray(new FileSplit[splits.size()]);
}

protected long computeSplitSize(long goalSize, long minSize,
                                long blockSize) {
  return Math.max(minSize, Math.min(goalSize, blockSize));
}
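To make the old API's formula concrete, here is a minimal standalone sketch (not Hadoop code; the class name and variables are illustrative) that reproduces computeSplitSize with the values from the test further below:

public class OldApiSplitSizeDemo {
    // Mirrors the old API's computeSplitSize: the goal size
    // (totalSize / numSplits) is what caps the split size.
    static long computeSplitSize(long goalSize, long minSize, long blockSize) {
        return Math.max(minSize, Math.min(goalSize, blockSize));
    }

    public static void main(String[] args) {
        long totalSize = 529;       // total input bytes (the 0.52 KB test file)
        int numSplits = 2;          // comes from JobConf.getNumMapTasks()
        long blockSize = 67108864;  // 64 MB HDFS block
        long minSize = 1;           // mapred.min.split.size default

        long goalSize = totalSize / (numSplits == 0 ? 1 : numSplits);
        System.out.println("goalSize = " + goalSize);     // 264
        System.out.println("splitSize = "
            + computeSplitSize(goalSize, minSize, blockSize)); // 264
    }
}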
The getSplits method in the new API (org.apache.hadoop.mapreduce.lib.input.FileInputFormat):
public List<InputSplit> getSplits(JobContext job) throws IOException {
  long minSize = Math.max(getFormatMinSplitSize(), getMinSplitSize(job));
  long maxSize = getMaxSplitSize(job);

  // generate splits
  List<InputSplit> splits = new ArrayList<InputSplit>();
  List<FileStatus> files = listStatus(job);
  for (FileStatus file : files) {
    Path path = file.getPath();
    FileSystem fs = path.getFileSystem(job.getConfiguration());
    long length = file.getLen();
    BlockLocation[] blkLocations = fs.getFileBlockLocations(file, 0, length);
    if ((length != 0) && isSplitable(job, path)) {
      long blockSize = file.getBlockSize();
      long splitSize = computeSplitSize(blockSize, minSize, maxSize);

      long bytesRemaining = length;
      while (((double) bytesRemaining) / splitSize > SPLIT_SLOP) {
        int blkIndex = getBlockIndex(blkLocations, length - bytesRemaining);
        splits.add(new FileSplit(path, length - bytesRemaining, splitSize,
                                 blkLocations[blkIndex].getHosts()));
        bytesRemaining -= splitSize;
      }

      if (bytesRemaining != 0) {
        splits.add(new FileSplit(path, length - bytesRemaining, bytesRemaining,
                   blkLocations[blkLocations.length - 1].getHosts()));
      }
    } else if (length != 0) {
      splits.add(new FileSplit(path, 0, length, blkLocations[0].getHosts()));
    } else {
      // Create empty hosts array for zero length files
      splits.add(new FileSplit(path, 0, length, new String[0]));
    }
  }

  // Save the number of input files in the job-conf
  job.getConfiguration().setLong(NUM_INPUT_FILES, files.size());

  LOG.debug("Total # of splits: " + splits.size());
  return splits;
}

protected long computeSplitSize(long blockSize, long minSize,
                                long maxSize) {
  return Math.max(minSize, Math.min(maxSize, blockSize));
}
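For comparison, a minimal standalone sketch (again not Hadoop code; names are illustrative) of the new API's calculation, where the block size replaces the goal size as the middle term and numSplits plays no role:

public class NewApiSplitSizeDemo {
    // Mirrors the new API's computeSplitSize: blockSize sits between
    // the configured minimum and maximum split sizes.
    static long computeSplitSize(long blockSize, long minSize, long maxSize) {
        return Math.max(minSize, Math.min(maxSize, blockSize));
    }

    public static void main(String[] args) {
        long blockSize = 67108864;      // 64 MB HDFS block
        long minSize = 1;               // default minimum
        long maxSize = Long.MAX_VALUE;  // 9223372036854775807, default maximum

        System.out.println(computeSplitSize(blockSize, minSize, maxSize)); // 67108864
    }
}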
Testing with an input file of 0.52 KB, the log output is as follows:
New API:
blocksize:67108864 minsize:1 maxsize:9223372036854775807
splitsize:67108864
Here splitSize = max(minSize, min(maxSize, blockSize)) = max(1, min(9223372036854775807, 67108864)) = 67108864, so with the default min and max, the block size is the determining factor. Since the 0.52 KB file is far smaller than the split size, it becomes a single split.
Old API:
blocksize:67108864 totalsize:529 numsplits:2 goalsize:264 minsplitsize:1 minsize:1
splitsize:264
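Why exactly 2 maps? A small sketch (illustrative, assuming SPLIT_SLOP = 1.1 as in FileInputFormat) walking the old API's splitting loop over the 529-byte file:

public class OldApiSplitCountDemo {
    private static final double SPLIT_SLOP = 1.1; // 10% slop, as in FileInputFormat

    public static void main(String[] args) {
        long length = 529, splitSize = 264;
        long bytesRemaining = length;
        int numberOfSplits = 0;
        // 529 / 264 ≈ 2.00 > 1.1, so one full 264-byte split is cut off;
        // then 265 / 264 ≈ 1.004 < 1.1, so the loop ends ...
        while (((double) bytesRemaining) / splitSize > SPLIT_SLOP) {
            numberOfSplits++;
            bytesRemaining -= splitSize;
        }
        // ... and the 265-byte remainder becomes the second split.
        if (bytesRemaining != 0) {
            numberOfSplits++;
        }
        System.out.println(numberOfSplits); // 2 splits -> 2 map tasks
    }
}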
numSplits is 2; it is passed in when getSplits is called. Note that this parameter takes its value from JobConf.getNumMapTasks(), which reads as follows:
In JobConf:

    public int getNumMapTasks() { return getInt("mapred.map.tasks", 1); }
In Mapred-default.xml:
<property>
  <name>mapred.map.tasks</name>
  <value>2</value>
  <description>The default number of map tasks per job.
  Ignored when mapred.job.tracker is "local".
  </description>
</property>
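Because numSplits comes from this property, an old API job can steer the split count through JobConf. A minimal sketch (assuming the Hadoop 1.x mapred API is on the classpath); note it is only a hint, since getSplits still applies the min-size and block-size limits:

import org.apache.hadoop.mapred.JobConf;

public class MapTasksHint {
    public static void main(String[] args) {
        JobConf conf = new JobConf();
        // Stores "mapred.map.tasks"; passed to getSplits as numSplits,
        // so goalSize becomes totalSize / 4.
        conf.setNumMapTasks(4);
        System.out.println(conf.getNumMapTasks()); // 4
    }
}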
So for this input, a MapReduce program written with the old API produces 2 map tasks, while the same program written with the new API produces 1 map task.