Hadoop 2.4.1 Study: How the Number of Mappers Is Determined

Source: Internet
Author: User

The strength of the MapReduce framework is its ability to run mapper and reducer tasks in parallel across the cluster. So how are the numbers of mappers and reducers determined, and how can a job programmatically control how many of each it launches? As noted in the earlier Hadoop 2.4.1 study of mapper and reducer, the recommended number of reducers is (0.95~1.75) * number of nodes * maximum number of containers per node, and it can be set with Job.setNumReduceTasks(int). The number of mappers, by contrast, is determined by the size of the input, and there is no corresponding setNumMapTasks method; it can only be hinted at with Configuration.set(JobContext.NUM_MAPS, int), where the value of JobContext.NUM_MAPS is mapreduce.job.maps. The official Hadoop documentation describes this setting as interacting subtly with the framework and the job configuration, and as more complicated to set.
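To make the two knobs concrete, here is a minimal driver sketch (the class name and input/output paths are hypothetical, and mapper/reducer setup is omitted). The reducer count is honored directly, while the mapper count set through mapreduce.job.maps is only a hint that the split calculation discussed below normally overrides:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.MRJobConfig;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class MapperCountDriver {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    // Hint only: MRJobConfig.NUM_MAPS is the constant for mapreduce.job.maps.
    conf.setInt(MRJobConfig.NUM_MAPS, 10);

    Job job = Job.getInstance(conf, "mapper-count demo");
    job.setJarByClass(MapperCountDriver.class);
    // Honored directly by the framework.
    job.setNumReduceTasks(4);

    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}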

Such a vague remark makes it impossible to know exactly how the number of mappers is determined, so the only recourse is the source code.

In Hadoop, a MapReduce job is submitted to the system through the submitJobInternal(Job job, Cluster cluster) method of the JobSubmitter class (this method does more than set the mapper count; it also performs other work such as checking the output specification, which interested readers can find in the source code). The code in this method related to setting the number of mappers is as follows:

int maps = writeSplits(job, submitJobDir);
conf.setInt(MRJobConfig.NUM_MAPS, maps);
LOG.info("number of splits:" + maps);

The writeSplits method returns the number of mappers; its source code is as follows:

private int writeSplits(org.apache.hadoop.mapreduce.JobContext job,
    Path jobSubmitDir) throws IOException,
    InterruptedException, ClassNotFoundException {
  JobConf jConf = (JobConf) job.getConfiguration();
  int maps;
  if (jConf.getUseNewMapper()) {
    maps = writeNewSplits(job, jobSubmitDir);
  } else {
    maps = writeOldSplits(jConf, jobSubmitDir);
  }
  return maps;
}

This method computes the number of mappers in different ways depending on whether the new-API JobContext is in use. In practice, jConf.getUseNewMapper() returns true, so the writeNewSplits(job, jobSubmitDir) branch is executed. The source code of that method is as follows:

Configuration conf = job.getConfiguration();
InputFormat<?, ?> input =
    ReflectionUtils.newInstance(job.getInputFormatClass(), conf);

List<InputSplit> splits = input.getSplits(job);
T[] array = (T[]) splits.toArray(new InputSplit[splits.size()]);

// sort the splits into order based on size, so that the biggest
// go first
Arrays.sort(array, new SplitComparator());
JobSplitWriter.createSplitFiles(jobSubmitDir, conf,
    jobSubmitDir.getFileSystem(conf), array);
return array.length;

From this code we can see that the actual number of mappers is the number of input splits, and the number of splits is determined by the input format in use, the default being TextInputFormat, a subclass of FileInputFormat. The task of determining the number of splits is therefore handed to FileInputFormat's getSplits(job). Recall that FileInputFormat inherits from the abstract class InputFormat, which defines the input specification for a MapReduce job, and whose abstract method List<InputSplit> getSplits(JobContext context) defines how the input is cut into InputSplits. Different inputs have different splitting logic, and each resulting InputSplit is processed by its own mapper, so the return value of this method determines the number of mappers. The discussion below looks at how the method is implemented in FileInputFormat in two parts; to focus on the most important points, log output and similar details are omitted, and the complete implementation can be found in the source code.

The first part of the code computes the maximum and minimum InputSplit sizes, as follows:

long minSize = Math.max(getFormatMinSplitSize(), getMinSplitSize(job));
long maxSize = getMaxSplitSize(job);

The getMinSplitSize and getMaxSplitSize methods obtain the minimum and maximum InputSplit sizes respectively. The corresponding configuration parameters are mapreduce.input.fileinputformat.split.minsize, whose default value is 1L, and mapreduce.input.fileinputformat.split.maxsize, whose default value is Long.MAX_VALUE (hexadecimal 0x7fffffffffffffffL, decimal 9223372036854775807). The getFormatMinSplitSize method returns the lower bound on the InputSplit size imposed by the input format.

All of these values are in bytes, so with the defaults minSize works out to 1 and maxSize to Long.MAX_VALUE.
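For reference, the three accessors in FileInputFormat are short; paraphrased from the 2.4.1 source, they look roughly like this (SPLIT_MINSIZE and SPLIT_MAXSIZE are the constants for the two configuration parameters above):

protected long getFormatMinSplitSize() {
  // lower bound an input format can impose on split sizes; 1 byte by default
  return 1;
}

public static long getMinSplitSize(JobContext job) {
  // mapreduce.input.fileinputformat.split.minsize, default 1
  return job.getConfiguration().getLong(SPLIT_MINSIZE, 1L);
}

public static long getMaxSplitSize(JobContext context) {
  // mapreduce.input.fileinputformat.split.maxsize, default Long.MAX_VALUE
  return context.getConfiguration().getLong(SPLIT_MAXSIZE, Long.MAX_VALUE);
}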

Next comes the second part, which generates the InputSplits. This part builds a list of InputSplits; the size of that list is the number of InputSplits, which in turn determines the number of mappers. The key code is:

if (isSplitable(job, path)) {
  long blockSize = file.getBlockSize();
  long splitSize = computeSplitSize(blockSize, minSize, maxSize);

  long bytesRemaining = length;
  while (((double) bytesRemaining) / splitSize > SPLIT_SLOP) {
    int blkIndex = getBlockIndex(blkLocations, length - bytesRemaining);
    splits.add(makeSplit(path, length - bytesRemaining, splitSize,
        blkLocations[blkIndex].getHosts()));
    bytesRemaining -= splitSize;
  }

  if (bytesRemaining != 0) {
    int blkIndex = getBlockIndex(blkLocations, length - bytesRemaining);
    splits.add(makeSplit(path, length - bytesRemaining, bytesRemaining,
        blkLocations[blkIndex].getHosts()));
  }
}

The value of blockSize comes from dfs.blocksize, which defaults to 128 MB. The method computeSplitSize(blockSize, minSize, maxSize) determines the InputSplit size from blockSize, minSize, and maxSize; its source code is the following:

return Math.max(minSize, Math.min(maxSize, blockSize));

From this code and the analysis of the first part, we can see that the InputSplit size depends on dfs.blocksize, mapreduce.input.fileinputformat.split.minsize, mapreduce.input.fileinputformat.split.maxsize, and the input format in use.

When the input format is TextInputFormat and the maximum and minimum InputSplit sizes are left unchanged, the InputSplit size is dfs.blocksize.
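As a quick sanity check with the defaults: splitSize = max(1, min(Long.MAX_VALUE, 128 MB)) = 128 MB, so a 1 GB (1024 MB) text file produces 1024 / 128 = 8 InputSplits and therefore 8 mappers. Raising mapreduce.input.fileinputformat.split.minsize to 256 MB would change this to max(256 MB, min(Long.MAX_VALUE, 128 MB)) = 256 MB, cutting the same file into 4 splits and 4 mappers.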

The value of the constant SPLIT_SLOP is 1.1; it determines when the bytes remaining in a file are small enough that the file stops being cut into pieces of splitSize.

According to the code, once the remaining bytes are less than or equal to 1.1 times splitSize, they are packaged as a single InputSplit; in other words, the last InputSplit can be at most 1.1 times splitSize.
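For example, with a 260 MB file and a 128 MB splitSize: 260 / 128 ≈ 2.03 > 1.1, so the first 128 MB becomes a split, leaving 132 MB; then 132 / 128 ≈ 1.03 ≤ 1.1, so the loop stops and the remaining 132 MB becomes the last split. The file is therefore cut into 2 splits (128 MB and 132 MB) rather than 3, avoiding a third mapper that would process only 4 MB.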

This article has analyzed how the number of mappers is determined when the input format is the default TextInputFormat. Without modifying the source code (that is, without changing the input format's lower bound on the InputSplit size), programmers can set dfs.blocksize, mapreduce.input.fileinputformat.split.minsize, and mapreduce.input.fileinputformat.split.maxsize to control the InputSplit size, thereby influencing the number of InputSplits and hence the number of mappers.
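Below is a minimal sketch of such tuning using the new-API FileInputFormat helpers (the surrounding job setup is hypothetical and omitted); since dfs.blocksize is fixed once a file has been written, the split minimum and maximum are the practical knobs:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;

public class SplitSizeTuning {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Job job = Job.getInstance(conf, "split-size tuning demo");

    // Fewer, larger splits (hence fewer mappers): raise the minimum above the block size.
    FileInputFormat.setMinInputSplitSize(job, 256L * 1024 * 1024);   // 256 MB

    // More, smaller splits (hence more mappers): cap the maximum below the block size.
    // FileInputFormat.setMaxInputSplitSize(job, 64L * 1024 * 1024); // 64 MB

    // ... the rest of the job setup (mapper, reducer, paths) proceeds as usual.
  }
}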

When the input format differs, so does the processing logic. For example, with DBInputFormat, the number of mappers is determined from the number of rows (records) in the input table. Many more details can be found in the source code.

