Hadoop 2.4.1: How the Number of Mappers Is Determined

Source: Internet
Author: User

The strength of the MapReduce framework is that mapper and reducer tasks run in parallel across the cluster. How, then, are the numbers of mappers and reducers determined, and how can a job control them programmatically? In the earlier study of mappers and reducers in Hadoop 2.4.1, it was noted that the recommended number of reducers is (0.95~1.75) * number of nodes * maximum number of containers per node, and that it can be set with Job.setNumReduceTasks(int). The number of mappers, by contrast, is determined by the size of the input, and there is no corresponding setNumMapTasks method; it can only be hinted at via Configuration.set(JobContext.NUM_MAPS, int), where JobContext.NUM_MAPS corresponds to the parameter mapreduce.job.maps. The official Hadoop documentation describes this parameter as interacting in subtle ways with the MapReduce framework and the job configuration, and as complex to set correctly. Such vague wording does not reveal exactly how the number of mappers is determined, so the only recourse is the source code.
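As a quick illustration of the two knobs mentioned above, here is a minimal driver sketch (the class name and paths are hypothetical, and mapper/reducer classes are omitted) that sets the reducer count directly and records the mapper hint; as analyzed below, the actual mapper count still comes from the input splits:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class MapperCountDriver {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    // Only a hint: mapreduce.job.maps is the value of JobContext.NUM_MAPS.
    conf.setInt("mapreduce.job.maps", 4);

    Job job = Job.getInstance(conf, "mapper-count-demo");
    job.setJarByClass(MapperCountDriver.class);
    // The reducer count, by contrast, can be set directly.
    job.setNumReduceTasks(2);

    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}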

In Hadoop, a MapReduce job is submitted to the system through the submitJobInternal(Job job, Cluster cluster) method of the JobSubmitter class. (This method does more than set the mapper count; it also performs other operations such as checking the output specification; interested readers can consult the source code.) The code in this method related to setting the number of mappers is as follows:

int maps = writeSplits(job, submitJobDir);
conf.setInt(MRJobConfig.NUM_MAPS, maps);
LOG.info("number of splits:" + maps);

The method writeSplits returns the number of mappers. Its source code is as follows:

private int writeSplits(org.apache.hadoop.mapreduce.JobContext job,
    Path jobSubmitDir) throws IOException,
    InterruptedException, ClassNotFoundException {
  JobConf jConf = (JobConf) job.getConfiguration();
  int maps;
  if (jConf.getUseNewMapper()) {
    maps = writeNewSplits(job, jobSubmitDir);
  } else {
    maps = writeOldSplits(jConf, jobSubmitDir);
  }
  return maps;
}

This method computes the number of mappers differently depending on whether the new-API mapper is used. In the scenario considered here, jConf.getUseNewMapper() returns true, so the writeNewSplits(job, jobSubmitDir) branch is executed. The source code of that method is as follows:

Configuration conf = job.getConfiguration();
InputFormat<?, ?> input =
    ReflectionUtils.newInstance(job.getInputFormatClass(), conf);

List<InputSplit> splits = input.getSplits(job);
T[] array = (T[]) splits.toArray(new InputSplit[splits.size()]);

// sort the splits into order based on size, so that the biggest
// go first
Arrays.sort(array, new SplitComparator());
JobSplitWriter.createSplitFiles(jobSubmitDir, conf,
    jobSubmitDir.getFileSystem(conf), array);
return array.length;

From the code above, the actual number of mappers equals the number of input splits, and the number of splits is determined by the InputFormat in use; the default is TextInputFormat, a subclass of FileInputFormat. The work of determining the number of splits is done by FileInputFormat's getSplits(job). FileInputFormat inherits from the abstract class InputFormat, which defines the input specification for a MapReduce job. Its abstract method List<InputSplit> getSplits(JobContext context) defines how the input is divided into InputSplits, and different inputs have different splitting logic. Each InputSplit is processed by a separate mapper, so the return value of this method determines the number of mappers. The following examines, in two parts, how this method is implemented in FileInputFormat; to stay focused on the essentials, logging and other details are omitted, and the complete implementation can be found in the source code.

First, the initial part of the code computes the minimum and maximum InputSplit sizes:

long minSize = Math.max(getFormatMinSplitSize(), getMinSplitSize(job));
long maxSize = getMaxSplitSize(job);

The getMinSplitSize and getMaxSplitSize methods return the minimum and maximum InputSplit sizes, respectively. The corresponding configuration parameters are mapreduce.input.fileinputformat.split.minsize, with a default value of 1L, and mapreduce.input.fileinputformat.split.maxsize, with a default value of Long.MAX_VALUE (0x7fffffffffffffffL in hexadecimal, 9223372036854775807 in decimal). The getFormatMinSplitSize method returns the lower bound on the InputSplit size for this input format. All of these values are in bytes. It follows that, by default, minSize is 1 and maxSize is Long.MAX_VALUE.
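For per-job tuning, FileInputFormat also exposes static helpers that set these two parameters on the job configuration; a small sketch, with the 64 MB and 256 MB bounds chosen purely for illustration:

import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;

public class SplitSizeConfig {
  public static Job jobWithSplitBounds() throws Exception {
    Job job = Job.getInstance();
    // Equivalent to setting mapreduce.input.fileinputformat.split.minsize
    // and mapreduce.input.fileinputformat.split.maxsize.
    FileInputFormat.setMinInputSplitSize(job, 64L * 1024 * 1024);   // 64 MB
    FileInputFormat.setMaxInputSplitSize(job, 256L * 1024 * 1024);  // 256 MB
    return job;
  }
}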

The second part generates the InputSplits. It builds a list of InputSplits; the size of that list is the number of InputSplits, which in turn determines the number of mappers. The important code is:

if (isSplitable(job, path)) {
  long blockSize = file.getBlockSize();
  long splitSize = computeSplitSize(blockSize, minSize, maxSize);

  long bytesRemaining = length;
  while (((double) bytesRemaining) / splitSize > SPLIT_SLOP) {
    int blkIndex = getBlockIndex(blkLocations, length - bytesRemaining);
    splits.add(makeSplit(path, length - bytesRemaining, splitSize,
        blkLocations[blkIndex].getHosts()));
    bytesRemaining -= splitSize;
  }

  if (bytesRemaining != 0) {
    int blkIndex = getBlockIndex(blkLocations, length - bytesRemaining);
    splits.add(makeSplit(path, length - bytesRemaining, bytesRemaining,
        blkLocations[blkIndex].getHosts()));
  }
}

The value of blockSize is the value of the parameter dfs.blocksize, which defaults to 128 MB. The method computeSplitSize(blockSize, minSize, maxSize) determines the InputSplit size from blockSize, minSize, and maxSize; its source code is as follows:

Math.max(minSize, Math.min(maxSize, blockSize))

From this code, combined with the analysis of the first part, the InputSplit size depends on dfs.blocksize, mapreduce.input.fileinputformat.split.minsize, mapreduce.input.fileinputformat.split.maxsize, and the input format in use. When the input format is TextInputFormat and the minimum and maximum split sizes have not been modified, the InputSplit size equals the value of dfs.blocksize.
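A short, self-contained illustration of how the formula behaves for a 128 MB block size (the dfs.blocksize default) under three typical settings:

public class ComputeSplitSizeDemo {
  // Same formula as FileInputFormat.computeSplitSize.
  static long computeSplitSize(long blockSize, long minSize, long maxSize) {
    return Math.max(minSize, Math.min(maxSize, blockSize));
  }

  public static void main(String[] args) {
    long blockSize = 128L * 1024 * 1024;  // dfs.blocksize default
    // Defaults: split size equals the block size (134217728 bytes).
    System.out.println(computeSplitSize(blockSize, 1L, Long.MAX_VALUE));
    // Lowering maxSize to 64 MB shrinks the split to 67108864 bytes.
    System.out.println(computeSplitSize(blockSize, 1L, 64L * 1024 * 1024));
    // Raising minSize to 256 MB enlarges the split to 268435456 bytes.
    System.out.println(computeSplitSize(blockSize, 256L * 1024 * 1024, Long.MAX_VALUE));
  }
}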

The constant SPLIT_SLOP has the value 1.1 and determines when to stop dividing the remaining file into splits of size splitSize. According to the code, once the remaining bytes are less than or equal to 1.1 times splitSize, the remainder becomes a single InputSplit; that is, the last InputSplit can be at most 1.1 times splitSize.
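The effect of SPLIT_SLOP can be checked with a small simulation of the loop above; the 140 MB file length and 128 MB split size below are chosen only for illustration:

public class SplitSlopDemo {
  private static final double SPLIT_SLOP = 1.1;  // same constant as FileInputFormat

  public static void main(String[] args) {
    long splitSize = 128L * 1024 * 1024;  // 128 MB
    long length = 140L * 1024 * 1024;     // hypothetical 140 MB input file

    int splits = 0;
    long bytesRemaining = length;
    while (((double) bytesRemaining) / splitSize > SPLIT_SLOP) {
      splits++;
      bytesRemaining -= splitSize;
    }
    if (bytesRemaining != 0) {
      splits++;  // the tail becomes one, possibly oversized, split
    }
    // 140 MB / 128 MB is about 1.09 <= 1.1, so the whole file is a single split.
    System.out.println("splits = " + splits + ", last split = " + bytesRemaining + " bytes");
  }
}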

Summary: this article analyzed how the number of mappers is determined when the input format is the default TextInputFormat. Without modifying the source code (that is, without changing the input format's lower bound on the InputSplit size), a programmer can set dfs.blocksize, mapreduce.input.fileinputformat.split.minsize, and mapreduce.input.fileinputformat.split.maxsize to control the InputSplit size, thereby influencing the number of InputSplits and, in turn, the number of mappers. Other input formats follow different logic; for example, with DBInputFormat the number of mappers is determined by the number of rows (records) in the input table. More details can be found in the source code.
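As a way to verify the effect of these parameters without submitting a job, the number of splits (and therefore mappers) can be computed directly by calling getSplits; a sketch, with the input path purely illustrative:

import java.util.List;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.InputSplit;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;

public class SplitCounter {
  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration());
    FileInputFormat.addInputPath(job, new Path("/user/hadoop/input"));  // illustrative path
    // Job implements JobContext, so it can be passed to getSplits directly.
    List<InputSplit> splits = new TextInputFormat().getSplits(job);
    System.out.println("number of splits (mappers): " + splits.size());
  }
}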
