How to control the number of maps in Hadoop

Source: Internet
Author: User


Hadoop provides the parameter mapred.map.tasks for setting the number of map tasks. However, setting it does not always take effect: mapred.map.tasks is only a hint to Hadoop, and the final number of maps also depends on other factors.
To simplify the explanation, first define a few terms:
block_size: the HDFS file block size. The default is 64 MB, and it can be set with the parameter dfs.block.size.
total_size: the total size of the input files.
input_file_num: the number of input files.

(1) Default number of maps
If you do not set anything, the default number of maps is determined by block_size:
default_num = total_size / block_size;

(2) Expected number of maps
The number of maps the programmer expects can be set with the parameter mapred.map.tasks, but it only takes effect when it is greater than default_num.
goal_num = mapred.map.tasks;

(3) Size of the data processed by each map
You can set the amount of data each map task processes with mapred.min.split.size, but this value only takes effect when it is greater than block_size.
split_size = max(mapred.min.split.size, block_size);
split_num = total_size / split_size;

(4) Calculated number of maps
compute_map_num = min(split_num, max(default_num, goal_num))

In addition to these settings, MapReduce follows one more rule: the data handled by a single map cannot span files, so every input file needs at least one map, that is, min_map_num >= input_file_num. The final number of maps is therefore:
final_map_num = max(compute_map_num, input_file_num)
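
The calculation above can be sketched in Python. This is a rough model, not Hadoop's own code: the function name estimate_map_count and its arguments are hypothetical. The split size follows the rule used by the old-API FileInputFormat, max(minSize, min(goalSize, blockSize)), where goalSize is total_size divided by mapred.map.tasks. That rule is swapped in for the simplified formulas above, in which split_num can never exceed default_num. The sketch counts splits file by file, assumes non-empty files, and ignores the small slack Hadoop allows on the last split of a file.

```python
MB = 1024 * 1024


def estimate_map_count(file_sizes, block_size=64 * MB, goal_num=1, min_split_size=1):
    """Roughly estimate the number of map tasks for input files given in bytes.

    goal_num stands in for mapred.map.tasks and min_split_size for
    mapred.min.split.size (both names are assumptions of this sketch).
    """
    total_size = sum(file_sizes)
    # Bytes per map if the mapred.map.tasks hint were followed exactly.
    goal_size = total_size // max(goal_num, 1)
    # Split-size rule: max(minSize, min(goalSize, blockSize)), at least 1 byte.
    split_size = max(min_split_size, min(goal_size, block_size), 1)
    # A split never spans files, so count splits per file with ceiling division.
    return sum(-(-size // split_size) for size in file_sizes)


one_big_file = [640 * MB]
print(estimate_map_count(one_big_file))                             # default
print(estimate_map_count(one_big_file, goal_num=20))                # larger hint
print(estimate_map_count(one_big_file, min_split_size=128 * MB))    # larger splits
print(estimate_map_count([1 * MB] * 100, min_split_size=128 * MB))  # many small files
```

In this model a hint below the default has no effect, a larger minimum split size reduces the count, and many small files keep the count at one map per file no matter how large the split size is.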

Based on the analysis above, setting the number of maps comes down to the following guidelines:
(1) To increase the number of maps, set mapred.map.tasks to a larger value.
(2) To reduce the number of maps, set mapred.min.split.size to a larger value.
(3) If the input contains many small files and you still want to reduce the number of maps, first merge the small files into large files, then apply guideline 2.
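
As a rough illustration of guidelines 1 and 2, the hypothetical helper below picks which parameter to change for a desired number of maps. It assumes a single input file much larger than one block and uses ceiling division. The function name suggest_setting and its return format are inventions for this sketch, not part of Hadoop.

```python
MB = 1024 * 1024


def suggest_setting(total_size, desired_maps, block_size=64 * MB):
    """Return (parameter_name, value) to move toward desired_maps, or None."""
    # Default number of maps: one per block, rounding up.
    default_num = -(-total_size // block_size)
    if desired_maps > default_num:
        # Guideline 1: more maps, so raise the mapred.map.tasks hint.
        return ("mapred.map.tasks", desired_maps)
    if desired_maps < default_num:
        # Guideline 2: fewer maps, so raise the minimum split size (bytes).
        return ("mapred.min.split.size", -(-total_size // desired_maps))
    # The default already matches; nothing needs to change.
    return None


print(suggest_setting(640 * MB, 20))
print(suggest_setting(640 * MB, 5))
```

For a 640 MB file the default is 10 maps, so asking for 20 suggests raising mapred.map.tasks, while asking for 5 suggests a minimum split size of 128 MB.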
