"Hadoop" map and reduce number problems


In Hadoop, when the number is not set explicitly, the number of map tasks a job executes is determined by the amount of input data to the job (the calculation is described below), while the number of reduces defaults to 1. Why 1? Because the number of output files a job produces is determined by the number of reduces, and in general a job writes its result to a single file by default, so the number of reduces is set to 1. If we want to improve the execution speed of a job, we therefore need to adjust the number of maps and reduces.

Before you start, take a look at how the official Hadoop documentation describes this.

Number of Maps
The number of maps is usually driven by the number of DFS blocks in the input files. Although that causes people to adjust their DFS block size to adjust the number of maps, the right level of parallelism for maps seems to be around 10-100 maps/node, although it has been taken up to 300 or so for very cpu-light map tasks. Task setup takes a while, so it is best if the maps take at least a minute to execute.
Actually controlling the number of maps is subtle. The mapred.map.tasks parameter is just a hint to the InputFormat for the number of maps. The default InputFormat behavior is to split the total number of bytes into the right number of fragments. However, in the default case the DFS block size of the input files is treated as an upper bound for input splits. A lower bound on the split size can be set via mapred.min.split.size. Thus, if you expect 10TB of input data and have 128MB DFS blocks, you'll end up with 82k maps, unless your mapred.map.tasks is even larger. Ultimately the InputFormat determines the number of maps.
The number of map tasks can also be increased manually using the JobConf's conf.setNumMapTasks(int num). This can be used to increase the number of map tasks, but will not set the number below that which Hadoop determines via splitting the input data.
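As a minimal sketch of the two knobs described above, assuming the classic org.apache.hadoop.mapred API that matches the property names in this quote (the driver class name and the 256MB figure are illustrative, not from the original):

    import org.apache.hadoop.mapred.JobConf;

    public class MapCountHint {
        public static void main(String[] args) {
            // JobConf from the classic "mapred" API; MapCountHint is a
            // placeholder driver class.
            JobConf conf = new JobConf(MapCountHint.class);

            // A hint only: the InputFormat may still create more maps
            // if the input splits into more blocks than this.
            conf.setNumMapTasks(100);

            // Lower bound on the split size in bytes (256MB here), so the
            // input is cut into fewer, larger splits and thus fewer maps.
            conf.set("mapred.min.split.size", String.valueOf(256L * 1024 * 1024));
        }
    }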

Number of Reduces
The right number of reduces seems to be 0.95 or 1.75 * (nodes * mapred.tasktracker.tasks.maximum). At 0.95 all of the reduces can launch immediately and start transferring map outputs as the maps finish. At 1.75 the faster nodes will finish their first round of reduces and launch a second round of reduces, doing a much better job of load balancing.
Currently the number of reduces is limited to roughly 1000 by the buffer size for the output files (io.buffer.size * 2 * numReduces << heapSize). This will be fixed at some point, but until it is it provides a pretty firm upper bound.
The number of reduces also controls the number of output files in the output directory, but usually that is not important because the next map/reduce step will split them into even smaller splits for the maps.
The number of reduce tasks can also be increased in the same way as the map tasks, via the JobConf's conf.setNumReduceTasks(int num).


The above description is how the number of maps and reduces is determined. The number of maps is determined by the amount of data read when the job executes, divided by the size of each block (64MB by default). The number of reduces defaults to 1 and has a suggested range determined by the number of your nodes: number of nodes x maximum number of reduce tasks per TaskTracker (2 by default) x a factor between 0.95 and 1.75. Note that the number set this way is only an upper bound; the actual number at runtime also depends on your specific job settings.
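For example (the cluster size here is hypothetical, just to illustrate the arithmetic): on a 10-node cluster where each TaskTracker keeps the default maximum of 2 reduce tasks, the suggested range is 10 x 2 x 0.95 = 19 reduces at the low end and 10 x 2 x 1.75 = 35 reduces at the high end.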


If you want to set the number of map and reduce tasks that a job executes, you can use the following methods.

Map: When you want to change the number of maps, you can increase or decrease it by changing the block size in the configuration file (see the sketch below this paragraph), or through the JobConf's conf.setNumMapTasks(int num). But even if you set the number here, the actual number of maps that run will never be less than the number of splits the input actually produces. That is, if you set the number of maps to 2 in the program, but the data is split into 3 pieces when it is read, the job will actually run with 3 maps instead of the 2 you set.
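For the configuration-file route, here is a hedged example for a Hadoop 1.x-era hdfs-site.xml (assumption: your version still uses the old property name dfs.block.size, which was later renamed dfs.blocksize). Halving the block size roughly doubles the number of input splits, and therefore the number of maps:

    <property>
      <name>dfs.block.size</name>
      <!-- 32MB instead of the 64MB default: smaller blocks mean more
           input splits and therefore more map tasks -->
      <value>33554432</value>
    </property>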

Reduce: When you want to modify the number of reduces, you can change it in the following ways:

When debugging the program, you can declare a Job object and call job.setNumReduceTasks(tasks), or call conf.setStrings("mapred.reduce.tasks", values) in the Conf settings;
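A minimal sketch of both calls, assuming the org.apache.hadoop.mapreduce.Job API alongside the classic property name (the job name and the count of 10 are illustrative):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.mapreduce.Job;

    public class ReduceCountExample {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();

            // Option 1: set the classic property directly on the Conf.
            conf.setStrings("mapred.reduce.tasks", "10");

            // Option 2: declare a Job object and set the count on it.
            Job job = new Job(conf, "reduce-count-example");
            job.setNumReduceTasks(10);
        }
    }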

You can also add run-time parameters on the command line when you submit the job:

bin/hadoop jar examples.jar job_name -Dmapred.map.tasks=nums -Dmapred.reduce.tasks=nums input output
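For instance, with the wordcount program from the bundled examples jar (the jar name, paths, and counts here are illustrative; note that the -D generic options are only parsed by jobs that use ToolRunner/GenericOptionsParser, as the bundled examples do):

    bin/hadoop jar examples.jar wordcount -Dmapred.map.tasks=5 -Dmapred.reduce.tasks=10 /user/test/input /user/test/output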

"Hadoop" map and reduce number problems

Contact Us

The content source of this page is from Internet, which doesn't represent Alibaba Cloud's opinion; products and services mentioned on that page don't have any relationship with Alibaba Cloud. If the content of the page makes you feel confusing, please write us an email, we will handle the problem within 5 days after receiving your email.

If you find any instances of plagiarism from the community, please send an email to: info-contact@alibabacloud.com and provide relevant evidence. A staff member will contact you within 5 working days.

A Free Trial That Lets You Build Big!

Start building with 50+ products and up to 12 months usage for Elastic Compute Service

  • Sales Support

    1 on 1 presale consultation

  • After-Sales Support

    24/7 Technical Support 6 Free Tickets per Quarter Faster Response

  • Alibaba Cloud offers highly flexible support services tailored to meet your exact needs.