The main advantage of the MapReduce framework is that mapper and reducer tasks run in parallel across the cluster. So how are the numbers of mappers and reducers determined, and how can a job control them programmatically? The earlier Hadoop-2.4.1 study note on mappers and reducers mentioned that the recommended number of reducers is (0.95~1.75) * number of nodes * maximum number of containers per node, and that it can be set with Job.setNumReduceTasks(int). The number of mappers, by contrast, is determined by the size of the input, and there is no setNumMapTasks counterpart; a hint can be supplied via Configuration.set(JobContext.NUM_MAPS, int), where JobContext.NUM_MAPS is the key mapreduce.job.maps. The official Hadoop documentation only says that this parameter interacts subtly with the MapReduce framework and job configuration and is complex to set. Such vague wording does not tell us exactly how the number of mappers is determined, so clearly we have to resort to the source code.
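As a concrete illustration of the two knobs just mentioned, here is a minimal driver sketch, assuming a job written against the new API; the class name, paths and numbers are placeholders, not taken from the Hadoop source:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.JobContext;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class MapperCountDemo {
      public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Only a hint: as analyzed below, the real mapper count is the
        // number of input splits, not this value.
        conf.setInt(JobContext.NUM_MAPS, 10);   // key: mapreduce.job.maps

        Job job = Job.getInstance(conf, "mapper count demo");
        job.setJarByClass(MapperCountDemo.class);
        job.setNumReduceTasks(4);               // reducers can be set directly
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
      }
    }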
In Hadoop, a MapReduce job is submitted to the system through the submitJobInternal(Job job, Cluster cluster) method of the JobSubmitter class. Besides setting the number of mappers, this method also performs other work such as checking the output specification; interested readers can consult the source. The code in this method related to setting the mapper count is as follows:
    int maps = writeSplits(job, submitJobDir);
    conf.setInt(MRJobConfig.NUM_MAPS, maps);
    LOG.info("number of splits:" + maps);
The writeSplits method returns the number of mappers. Its source is as follows:
    private int writeSplits(org.apache.hadoop.mapreduce.JobContext job,
        Path jobSubmitDir) throws IOException,
        InterruptedException, ClassNotFoundException {
      JobConf jConf = (JobConf)job.getConfiguration();
      int maps;
      if (jConf.getUseNewMapper()) {
        maps = writeNewSplits(job, jobSubmitDir);
      } else {
        maps = writeOldSplits(jConf, jobSubmitDir);
      }
      return maps;
    }
This method computes the mapper count differently depending on whether the job uses the new-API (org.apache.hadoop.mapreduce) mapper: jConf.getUseNewMapper() reads the property mapred.mapper.new-api, which is true for jobs written against the new API, so in practice the writeNewSplits(job, jobSubmitDir) branch is executed. The source of that method is as follows:
    Configuration conf = job.getConfiguration();
    InputFormat<?, ?> input =
        ReflectionUtils.newInstance(job.getInputFormatClass(), conf);

    List<InputSplit> splits = input.getSplits(job);
    T[] array = (T[]) splits.toArray(new InputSplit[splits.size()]);

    // sort the splits into order based on size, so that the biggest
    // go first
    Arrays.sort(array, new SplitComparator());
    JobSplitWriter.createSplitFiles(jobSubmitDir, conf,
        jobSubmitDir.getFileSystem(conf), array);
    return array.length;
From this code we learn that the actual number of mappers equals the number of input splits, and the number of splits is determined by the input format in use, which by default is TextInputFormat, a subclass of FileInputFormat. The task of computing the splits is done by FileInputFormat's getSplits(job). It should be added here that FileInputFormat inherits from the abstract class InputFormat, which defines the input specification for a MapReduce job; its abstract method List<InputSplit> getSplits(JobContext context) defines how the input is separated into InputSplits, and different input formats have different separation logic. Each resulting InputSplit is processed by a different mapper, so the return value of this method determines the number of mappers. Below we study how the method is implemented in FileInputFormat, in two parts; to focus on the most important points, log output and similar details are not covered, and the complete implementation can be found in the source code.
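For reference, the contract that every input format must satisfy looks like this (abridged from org.apache.hadoop.mapreduce.InputFormat, with javadoc and annotations omitted):

    public abstract class InputFormat<K, V> {

      // One mapper task is started per element of the returned list.
      public abstract List<InputSplit> getSplits(JobContext context)
          throws IOException, InterruptedException;

      // Creates the reader that turns a split into key/value pairs for the mapper.
      public abstract RecordReader<K, V> createRecordReader(InputSplit split,
          TaskAttemptContext context) throws IOException, InterruptedException;
    }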
First, the opening part of the code computes the minimum and maximum InputSplit sizes, as follows:
    long minSize = Math.max(getFormatMinSplitSize(), getMinSplitSize(job));
    long maxSize = getMaxSplitSize(job);
The getMinSplitSize and getMaxSplitSize methods read the minimum and maximum InputSplit sizes from the configuration: the corresponding parameters are mapreduce.input.fileinputformat.split.minsize, with default value 1L, and mapreduce.input.fileinputformat.split.maxsize, with default value Long.MAX_VALUE (0x7FFFFFFFFFFFFFFFL in hexadecimal, 9223372036854775807 in decimal). The getFormatMinSplitSize method returns the lower bound on the InputSplit size imposed by the input format, which is 1 for FileInputFormat. All of these values are in bytes. We can therefore conclude that, by default, minSize is 1L and maxSize is Long.MAX_VALUE.
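If these defaults need to be overridden, FileInputFormat also offers static helpers that write the same two parameters; a minimal sketch (the class name and the 64MB figure are only examples):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;

    public class SplitSizeDemo {
      public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "split size demo");

        // Via the FileInputFormat helpers (values in bytes):
        FileInputFormat.setMinInputSplitSize(job, 1L);
        FileInputFormat.setMaxInputSplitSize(job, 64L * 1024 * 1024);

        // Equivalently, by setting the configuration keys directly:
        job.getConfiguration().setLong(
            "mapreduce.input.fileinputformat.split.minsize", 1L);
        job.getConfiguration().setLong(
            "mapreduce.input.fileinputformat.split.maxsize", 64L * 1024 * 1024);
      }
    }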
Next comes the second part, which generates the InputSplits. This part builds a list of InputSplits; the size of that list is the number of InputSplits, which in turn determines the number of mappers. The important code is:
    if (isSplitable(job, path)) {
      long blockSize = file.getBlockSize();
      long splitSize = computeSplitSize(blockSize, minSize, maxSize);

      long bytesRemaining = length;
      while (((double) bytesRemaining) / splitSize > SPLIT_SLOP) {
        int blkIndex = getBlockIndex(blkLocations, length - bytesRemaining);
        splits.add(makeSplit(path, length - bytesRemaining, splitSize,
            blkLocations[blkIndex].getHosts()));
        bytesRemaining -= splitSize;
      }

      if (bytesRemaining != 0) {
        int blkIndex = getBlockIndex(blkLocations, length - bytesRemaining);
        splits.add(makeSplit(path, length - bytesRemaining, bytesRemaining,
            blkLocations[blkIndex].getHosts()));
      }
    }
The value of blockSize comes from the parameter dfs.blocksize, which defaults to 128MB. The method computeSplitSize(blockSize, minSize, maxSize) determines the InputSplit size from blockSize, minSize and maxSize; its source is:
    return Math.max(minSize, Math.min(maxSize, blockSize));
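To make the interplay of the three values concrete, here is a standalone re-implementation of this one-line formula with a few example inputs; the figures are illustrative only, and all sizes are in bytes:

    public class ComputeSplitSizeDemo {
      static long computeSplitSize(long blockSize, long minSize, long maxSize) {
        return Math.max(minSize, Math.min(maxSize, blockSize));
      }

      public static void main(String[] args) {
        long blockSize = 128L * 1024 * 1024;  // dfs.blocksize = 128MB

        // Defaults (minSize = 1, maxSize = Long.MAX_VALUE): split size = block size
        System.out.println(computeSplitSize(blockSize, 1L, Long.MAX_VALUE));
        // prints 134217728 (128MB)

        // Lowering maxSize below the block size shrinks splits (more mappers)
        System.out.println(computeSplitSize(blockSize, 1L, 64L * 1024 * 1024));
        // prints 67108864 (64MB)

        // Raising minSize above the block size grows splits (fewer mappers)
        System.out.println(computeSplitSize(blockSize, 256L * 1024 * 1024, Long.MAX_VALUE));
        // prints 268435456 (256MB)
      }
    }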
From this code, combined with the analysis of the first part, the size of an InputSplit depends on dfs.blocksize, mapreduce.input.fileinputformat.split.minsize, mapreduce.input.fileinputformat.split.maxsize and the input format in use. When the input format is TextInputFormat and the minimum and maximum InputSplit sizes have not been modified, the InputSplit size ends up equal to the value of dfs.blocksize.
The variable SPLIT_SLOP has the value 1.1; it determines when the remaining bytes of a file are no longer worth dividing by splitSize. According to the code, once the remainder is at most 1.1 times splitSize, it is emitted as a single InputSplit, which means the last InputSplit can be at most 1.1 times splitSize in size.
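For example, with splitSize = 128MB and a 260MB file: 260/128 ≈ 2.03 > 1.1, so a 128MB split is cut off, leaving 132MB; then 132/128 ≈ 1.03 ≤ 1.1, so the loop stops and the remaining 132MB becomes the final split. The file thus yields two InputSplits rather than three, with the last one slightly larger than the block size.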
Summary: this article analyzed how the number of mappers is determined when the input format is the default TextInputFormat. Without modifying the source code (that is, without changing the InputSplit lower bound of the input format), programmers can set the parameters dfs.blocksize, mapreduce.input.fileinputformat.split.minsize and mapreduce.input.fileinputformat.split.maxsize to control the InputSplit size and hence the number of InputSplits, which in turn determines the number of mappers. When the input is in another format, the processing logic differs; for example, with DBInputFormat the number of mappers is determined by the number of rows (records) in the input table. More details can be found in the source code.
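In practice this means the mapper count of an existing job can often be tuned from the command line without recompiling, assuming the driver parses generic options through ToolRunner; the jar, class and path names below are placeholders:

    hadoop jar wordcount.jar WordCount \
        -D mapreduce.input.fileinputformat.split.maxsize=67108864 \
        /input /output

With the maximum split size capped at 67108864 bytes (64MB) against a 128MB block size, a file that previously produced one 128MB split now produces two 64MB splits, doubling the number of mappers for that file.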