Reprinted from: How to control the number of maps in Hadoop
Hadoop provides a parameter, mapred.map.tasks, that sets the number of maps, and we can use it to control how many maps a job gets. However, setting the number of maps this way does not always take effect. The reason is that mapred.map.tasks is only a hint to Hadoop; the final number of maps also depends on other factors.
Before going further, define a few terms:
block_size: the HDFS block size, 64 MB by default, configurable via dfs.block.size
total_size: the total size of all input files
input_file_num: the number of input files
(1) Default number of maps
If nothing is set explicitly, the default number of maps is determined by block_size:
default_num = total_size / block_size
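For example, with the default 64 MB block size and a hypothetical 10 GB (10240 MB) input, default_num = 10240 / 64 = 160 maps.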
(2) Expected number of maps
The number of maps the programmer would like can be set through mapred.map.tasks, but this value only takes effect when it is greater than default_num:
goal_num = mapred.map.tasks
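As an illustration, here is a minimal sketch of setting this hint with the old "mapred" API, which matches the property names used in this article; the class name and the value 100 are placeholders:

    import org.apache.hadoop.mapred.JobConf;

    public class MapCountExample {
        public static JobConf configure() {
            // Placeholder job class; use your own job's class here.
            JobConf conf = new JobConf(MapCountExample.class);
            // Hint at the desired number of maps; per the rule above, this only
            // takes effect when it exceeds default_num.
            conf.setNumMapTasks(100);   // equivalent to setting mapred.map.tasks=100
            return conf;
        }
    }

If the job is launched through ToolRunner, the same hint can also be passed on the command line with -D mapred.map.tasks=100.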
(3) Setting the amount of data each map processes
The amount of data each map task processes can be set through mapred.min.split.size, but this value only takes effect when it is greater than block_size:
split_size = max(mapred.min.split.size, block_size)
split_num = total_size / split_size
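A similar sketch for the minimum split size, again using the old property name from this article (the 128 MB value is just an example):

    import org.apache.hadoop.mapred.JobConf;

    public class SplitSizeExample {
        public static JobConf configure() {
            JobConf conf = new JobConf(SplitSizeExample.class);
            // Ask for splits of at least 128 MB; this takes effect because it is
            // larger than the 64 MB block size, so split_size becomes 128 MB.
            conf.setLong("mapred.min.split.size", 128L * 1024 * 1024);
            return conf;
        }
    }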
(4) Computed number of maps
compute_map_num = min(split_num, max(default_num, goal_num))
Besides these settings, MapReduce also follows one more rule: the data processed by a single map cannot span files, so each input file contributes at least one map, that is, min_map_num >= input_file_num. The final number of maps is therefore:
final_map_num = max(compute_map_num, input_file_num)
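Putting the formulas together, the whole calculation can be sketched as plain arithmetic (no Hadoop dependency; the values in main are made-up examples):

    public class MapNumberCalc {
        static long finalMapNum(long totalSize, long blockSize, long mapredMapTasks,
                                long minSplitSize, long inputFileNum) {
            long defaultNum = totalSize / blockSize;                                // (1)
            long goalNum = mapredMapTasks;                                          // (2)
            long splitSize = Math.max(minSplitSize, blockSize);                     // (3)
            long splitNum = totalSize / splitSize;
            long computeMapNum = Math.min(splitNum, Math.max(defaultNum, goalNum)); // (4)
            // A split cannot cross file boundaries, so each file adds at least one map.
            return Math.max(computeMapNum, inputFileNum);
        }

        public static void main(String[] args) {
            long MB = 1024L * 1024;
            // Example: 10 GB of input spread over 20 files, 64 MB blocks, no overrides.
            System.out.println(finalMapNum(10240 * MB, 64 * MB, 2, 1, 20)); // prints 160
        }
    }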
From the analysis above, setting the number of maps can be summarized in the following points:
(1) If you want to increase the number of maps, set mapred.map.tasks to a larger value.
(2) If you want to reduce the number of maps, set mapred.min.split.size to a larger value.
(3) If the input contains many small files and you still want to reduce the number of maps, merge the small files into larger files first and then apply guideline (2), as in the sketch below.
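One possible way to do the merge is a small driver using FileUtil.copyMerge (available in Hadoop 1.x/2.x; it was removed in 3.x); the paths here are placeholders:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.FileUtil;
    import org.apache.hadoop.fs.Path;

    public class MergeSmallFiles {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            FileSystem fs = FileSystem.get(conf);
            // Concatenate every file under /input/small into a single file,
            // keeping the originals (deleteSource = false).
            FileUtil.copyMerge(fs, new Path("/input/small"),
                               fs, new Path("/input/merged/all.txt"),
                               false, conf, null);
        }
    }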