Hadoop 2.4.1 Learning: How the Number of Mappers Is Determined
One advantage of the MapReduce framework is that it runs mapper and reducer tasks in parallel across the cluster. How is the number of mapper and reducer tasks determined, or rather, how can a job control how many mappers and reducers Hadoop starts? The earlier note on Mapper and Reducer in Hadoop 2.4.1 mentioned that the recommended number of reducers is (0.95 ~ 1.75) * number of nodes * maximum number of containers per node, and that it can be set with Job.setNumReduceTasks(int). The number of mappers, however, is determined by the size of the input, and there is no corresponding setNumMapTasks method. You can call Configuration.set(JobContext.NUM_MAPS, int), where JobContext.NUM_MAPS corresponds to the parameter mapreduce.job.maps, but the official Hadoop documentation only describes this parameter as a hint that interacts subtly with the MapReduce framework and the job configuration, and says that setting it is complicated. Such vague wording does not tell us how the number of mappers is actually determined, so we have to turn to the source code.
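To make the two knobs mentioned above concrete, here is a minimal driver sketch (the class name and values are hypothetical) showing where Job.setNumReduceTasks(int) and the mapreduce.job.maps parameter would be set; as the rest of this article shows, the latter is only a hint, and the real mapper count comes from the input splits.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;

public class MapperCountDriver {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    // mapreduce.job.maps (JobContext.NUM_MAPS) is only a hint to the framework;
    // the number of mappers actually started equals the number of input splits.
    conf.setInt("mapreduce.job.maps", 10);

    Job job = Job.getInstance(conf, "mapper-count-demo");
    // The number of reducers, by contrast, can be set directly.
    job.setNumReduceTasks(4);
    // ... set mapper/reducer classes, input/output paths, then submit the job.
  }
}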
In Hadoop, a MapReduce job is submitted to the system through the submitJobInternal(Job job, Cluster cluster) method of the JobSubmitter class (besides setting the number of mappers, this method also performs other work such as checking the output specification; refer to the source code if you are interested). The code in this method that sets the number of mappers is as follows:
int maps = writeSplits(job, submitJobDir);
conf.setInt(MRJobConfig.NUM_MAPS, maps);
LOG.info("number of splits:" + maps);
The writeSplits method returns the number of mappers. Its source code is as follows:
private int writeSplits(org.apache.hadoop.mapreduce.JobContext job, Path jobSubmitDir)
    throws IOException, InterruptedException, ClassNotFoundException {
  JobConf jConf = (JobConf) job.getConfiguration();
  int maps;
  if (jConf.getUseNewMapper()) {
    maps = writeNewSplits(job, jobSubmitDir);
  } else {
    maps = writeOldSplits(jConf, jobSubmitDir);
  }
  return maps;
}
In this method, the number of mappers is computed in different ways depending on whether the new MapReduce API is used. In practice jConf.getUseNewMapper() returns true, so writeNewSplits(job, jobSubmitDir) is executed. The relevant source code of that method is as follows:
Configuration conf = job.getConfiguration();
InputFormat<?, ?> input = ReflectionUtils.newInstance(job.getInputFormatClass(), conf);
List<InputSplit> splits = input.getSplits(job);
T[] array = (T[]) splits.toArray(new InputSplit[splits.size()]);
// sort the splits into order based on size, so that the biggest
// go first
Arrays.sort(array, new SplitComparator());
JobSplitWriter.createSplitFiles(jobSubmitDir, conf, jobSubmitDir.getFileSystem(conf), array);
return array.length;
The code above shows that the number of mappers is the number of input splits, and the number of splits is determined by the input format in use. The default is TextInputFormat, a subclass of FileInputFormat, so the task of computing the number of splits falls to FileInputFormat's getSplits(job). As background, FileInputFormat inherits from the abstract class InputFormat, which defines the input contract of a MapReduce job. Its abstract method List<InputSplit> getSplits(JobContext context) defines how the input is divided into InputSplits; different inputs have different splitting logic, and each resulting InputSplit is processed by a separate mapper. The return value of this method therefore determines the number of mappers. The next two parts describe how this method is implemented in FileInputFormat. To stay focused on the essentials, log output and similar details are omitted; see the source code for the complete implementation.
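Since the mapper count is just the size of the list returned by getSplits, it can be inspected before submitting a job. The following sketch calls getSplits on a TextInputFormat directly; the input path is a placeholder and must point at existing HDFS data for the call to succeed.

import java.util.List;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.InputSplit;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;

public class SplitCountProbe {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Job job = Job.getInstance(conf);
    // /data/input is a placeholder; replace it with a real input directory.
    FileInputFormat.addInputPath(job, new Path("/data/input"));
    // Ask the input format how it would split the input; the list size
    // is the number of mappers the job would start.
    List<InputSplit> splits = new TextInputFormat().getSplits(job);
    System.out.println("number of splits (mappers): " + splits.size());
  }
}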
The first part computes the maximum and minimum InputSplit sizes, as follows:
long minSize = Math.max(getFormatMinSplitSize(), getMinSplitSize(job));
long maxSize = getMaxSplitSize(job);
The getMinSplitSize and getMaxSplitSize methods obtain the minimum and maximum InputSplit sizes. The corresponding configuration parameters are mapreduce.input.fileinputformat.split.minsize, whose default value is 1L, and mapreduce.input.fileinputformat.split.maxsize, whose default value is Long.MAX_VALUE (0x7fffffffffffffffL in hexadecimal, 9223372036854775807 in decimal). The getFormatMinSplitSize method returns the lower limit the input format imposes on an InputSplit. All of these values are in bytes. With the defaults, minSize is 1L and maxSize is Long.MAX_VALUE.
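A minimal sketch of how these two parameters could be changed for a job, assuming a 64 MB target split size; the values can be set either directly on the Configuration or through the helper methods on the new-API FileInputFormat.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;

public class SplitSizeConfigDemo {
  public static void main(String[] args) throws Exception {
    // Option 1: set the parameters directly on the Configuration.
    Configuration conf = new Configuration();
    conf.setLong("mapreduce.input.fileinputformat.split.minsize", 1L);
    conf.setLong("mapreduce.input.fileinputformat.split.maxsize", 64L * 1024 * 1024); // 64 MB

    // Option 2: use the helper methods after the Job is created.
    Job job = Job.getInstance(conf);
    FileInputFormat.setMinInputSplitSize(job, 1L);
    FileInputFormat.setMaxInputSplitSize(job, 64L * 1024 * 1024);
  }
}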
The second part generates the InputSplits. It produces a List of InputSplit objects; the size of that list is the number of InputSplits, which determines the number of mappers. The important code is:
if (isSplitable(job, path)) {
  long blockSize = file.getBlockSize();
  long splitSize = computeSplitSize(blockSize, minSize, maxSize);
  long bytesRemaining = length;
  while (((double) bytesRemaining) / splitSize > SPLIT_SLOP) {
    int blkIndex = getBlockIndex(blkLocations, length - bytesRemaining);
    splits.add(makeSplit(path, length - bytesRemaining, splitSize,
        blkLocations[blkIndex].getHosts()));
    bytesRemaining -= splitSize;
  }
  if (bytesRemaining != 0) {
    int blkIndex = getBlockIndex(blkLocations, length - bytesRemaining);
    splits.add(makeSplit(path, length - bytesRemaining, bytesRemaining,
        blkLocations[blkIndex].getHosts()));
  }
}
The value of blockSize is taken from dfs.blocksize, whose default is 128 MB. The computeSplitSize(blockSize, minSize, maxSize) method determines the InputSplit size from blockSize, minSize and maxSize. Its source code is:
Math.max(minSize, Math.min(maxSize, blockSize))
From this code and the analysis in the first part, the InputSplit size depends on dfs.blocksize, mapreduce.input.fileinputformat.split.minsize, mapreduce.input.fileinputformat.split.maxsize and the input format in use. When the input format is TextInputFormat and the maximum and minimum split sizes are left at their defaults, the InputSplit size ends up equal to dfs.blocksize.
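To illustrate the formula, the following sketch reproduces it with a few hypothetical parameter combinations; the values are examples, not recommendations.

public class SplitSizeDemo {
  // Same formula as computeSplitSize(blockSize, minSize, maxSize) above.
  static long computeSplitSize(long blockSize, long minSize, long maxSize) {
    return Math.max(minSize, Math.min(maxSize, blockSize));
  }

  public static void main(String[] args) {
    long blockSize = 128L * 1024 * 1024; // dfs.blocksize default, 128 MB

    // Defaults (minSize = 1, maxSize = Long.MAX_VALUE): split size equals the block size.
    System.out.println(computeSplitSize(blockSize, 1L, Long.MAX_VALUE));                 // 134217728 (128 MB)

    // Lowering maxsize below the block size shrinks the split, so more mappers run.
    System.out.println(computeSplitSize(blockSize, 1L, 64L * 1024 * 1024));              // 67108864 (64 MB)

    // Raising minsize above the block size enlarges the split, so fewer mappers run.
    System.out.println(computeSplitSize(blockSize, 256L * 1024 * 1024, Long.MAX_VALUE)); // 268435456 (256 MB)
  }
}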
The SPLIT_SLOP constant is 1.1. As long as the remaining bytes exceed 1.1 times splitSize, the file keeps being cut into splits of size splitSize; once the remainder is at most 1.1 times splitSize, it becomes the final InputSplit. In other words, the last InputSplit can be up to 1.1 times splitSize.
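The effect of SPLIT_SLOP on the mapper count can be shown with a small sketch that mirrors the splitting loop above for a single splittable file; the file sizes are hypothetical examples.

public class SplitCountDemo {
  static final double SPLIT_SLOP = 1.1; // same constant as in FileInputFormat

  // Mirrors the loop in getSplits: counts how many InputSplits
  // a splittable file of the given length would produce.
  static int countSplits(long length, long splitSize) {
    int splits = 0;
    long bytesRemaining = length;
    while (((double) bytesRemaining) / splitSize > SPLIT_SLOP) {
      splits++;
      bytesRemaining -= splitSize;
    }
    if (bytesRemaining != 0) {
      splits++; // the tail, at most 1.1 * splitSize
    }
    return splits;
  }

  public static void main(String[] args) {
    long splitSize = 128L * 1024 * 1024; // 128 MB
    // 260 MB file: 260/128 > 1.1, so one full 128 MB split is cut off;
    // the remaining 132 MB is <= 1.1 * 128 MB and becomes the last split -> 2 mappers.
    System.out.println(countSplits(260L * 1024 * 1024, splitSize)); // 2
    // 140 MB file: 140/128 is about 1.09 <= 1.1, so the whole file is one split -> 1 mapper.
    System.out.println(countSplits(140L * 1024 * 1024, splitSize)); // 1
  }
}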
Summary
This article has analyzed how the number of mappers is determined when the input format is the default TextInputFormat. Without modifying the source code (that is, without changing the input format's lower limit on InputSplit size), programmers can set dfs.blocksize, mapreduce.input.fileinputformat.split.minsize and mapreduce.input.fileinputformat.split.maxsize to control the InputSplit size, which in turn determines the number of InputSplits and therefore the number of mappers. When a different input format is used, the logic differs; for example, with DBInputFormat the number of mappers is determined by the number of rows (records) in the input table. Refer to the source code for more details.