The mechanism of MapReduce parallelism

Source: Internet
Author: User

1. MapTask parallelism mechanism
MapTask parallelism is the number of map tasks that run in parallel during the map phase. Because it directly affects the processing speed of the whole job, two questions arise: is a higher number of parallel MapTask instances always better, and what determines the degree of parallelism?
The map-phase parallelism of a MapReduce job is determined by the client when the job is submitted: before submitting, the client logically slices the input data. The result is a slice plan file (job.split), and each logical slice eventually starts one MapTask.
The logical slicing is done by the getSplits() method of the FileInputFormat implementation class.
FileInputFormat slicing mechanism
The default slicing mechanism in FileInputFormat:
A. Slice simply by the byte length of the file's content
B. The slice size equals the block size by default
C. Slicing does not consider the data set as a whole; each file is sliced individually
For example, suppose the input data consists of two files:
file1.txt 320M
file2.txt 10M
After FileInputFormat slicing, the slice information is as follows:
file1.txt.split1: 0M~128M
file1.txt.split2: 128M~256M
file1.txt.split3: 256M~320M
file2.txt.split1: 0M~10M
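
The per-file slicing in this example can be reproduced with a small stand-alone sketch. It is a simplification, not the Hadoop source: the class and method names are hypothetical, the slice size is fixed at 128M, and the 1.1 tolerance rule described later in this article is left out.

```java
import java.util.ArrayList;
import java.util.List;

// Simplified sketch of per-file slicing; names are hypothetical, not Hadoop's.
class PerFileSlicing {
    static final long MB = 1024L * 1024L;

    // Cut one file into consecutive ranges of at most splitSize bytes.
    static List<String> slice(String fileName, long fileLength, long splitSize) {
        List<String> slices = new ArrayList<>();
        int index = 1;
        for (long offset = 0; offset < fileLength; offset += splitSize) {
            long end = Math.min(offset + splitSize, fileLength);
            slices.add(fileName + ".split" + index + ": " + offset / MB + "M~" + end / MB + "M");
            index++;
        }
        return slices;
    }

    public static void main(String[] args) {
        long splitSize = 128 * MB; // assumed default: equal to the block size
        List<String> plan = new ArrayList<>();
        // Each file is sliced on its own; the 10M file is never merged into another slice.
        plan.addAll(slice("file1.txt", 320 * MB, splitSize));
        plan.addAll(slice("file2.txt", 10 * MB, splitSize));
        plan.forEach(System.out::println);
    }
}
```

Running it prints three slices for file1.txt and one for file2.txt, so this input would start four MapTasks.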

Parameter configuration for the slice size in FileInputFormat
In FileInputFormat, the slice size is calculated as:
Math.max(minSize, Math.min(maxSize, blockSize));
The slice size is determined mainly by these values:
minSize: default value 1
Configuration parameter: mapreduce.input.fileinputformat.split.minsize
maxSize: default value Long.MAX_VALUE
Configuration parameter: mapreduce.input.fileinputformat.split.maxsize
blockSize
So, by default, the slice size equals blockSize, which is 128M in Hadoop 2.x.
maxSize (slice maximum): if this parameter is set smaller than blockSize, slices become smaller, equal to the configured value.
minSize (slice minimum): if this parameter is set larger than blockSize, slices become larger than blockSize.
However, no matter how these parameters are tuned, multiple small files cannot be merged into one slice.
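
To make the effect of the two parameters concrete, here is a minimal sketch that evaluates the formula above for three settings. The class name and the 64M and 256M values are illustrative assumptions; only the formula itself comes from the text.

```java
// Sketch of the slice-size formula quoted above; the tuning values are illustrative.
class SplitSizeDemo {
    static final long MB = 1024L * 1024L;

    static long computeSplitSize(long blockSize, long minSize, long maxSize) {
        return Math.max(minSize, Math.min(maxSize, blockSize));
    }

    public static void main(String[] args) {
        long blockSize = 128 * MB;
        // Defaults (minSize = 1, maxSize = Long.MAX_VALUE): slice size equals block size.
        System.out.println(computeSplitSize(blockSize, 1L, Long.MAX_VALUE) / MB + "M"); // 128M
        // maxSize below the block size: slices shrink to maxSize.
        System.out.println(computeSplitSize(blockSize, 1L, 64 * MB) / MB + "M"); // 64M
        // minSize above the block size: slices grow to minSize.
        System.out.println(computeSplitSize(blockSize, 256 * MB, Long.MAX_VALUE) / MB + "M"); // 256M
    }
}
```

Because minSize is applied last through Math.max, it wins if both parameters are set in conflicting ways.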
One more detail:
Full-size slices are cut only while bytesRemaining / splitSize > 1.1 holds; once the condition is no longer satisfied, all remaining bytes become the last slice. As a result, a 129M file is not planned into two slices.
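
The tolerance rule can be sketched as a loop that counts the slices planned for one file. This is a simplified illustration under the assumption that the slice size is 128M; the class, method, and constant names are hypothetical.

```java
// Sketch of the 1.1 tolerance rule; simplified from the behaviour described above.
class SplitSlopDemo {
    static final long MB = 1024L * 1024L;
    static final double SPLIT_SLOP = 1.1; // tolerance factor from the text

    // Count the slices planned for a single file.
    static int countSplits(long fileLength, long splitSize) {
        int count = 0;
        long bytesRemaining = fileLength;
        // Cut a full-size slice only while clearly more than one slice remains.
        while ((double) bytesRemaining / splitSize > SPLIT_SLOP) {
            count++;
            bytesRemaining -= splitSize;
        }
        // Whatever is left (up to 1.1 x splitSize) becomes the last slice.
        if (bytesRemaining != 0) {
            count++;
        }
        return count;
    }

    public static void main(String[] args) {
        long splitSize = 128 * MB;
        System.out.println(countSplits(129 * MB, splitSize)); // 1: 129/128 is below 1.1
        System.out.println(countSplits(141 * MB, splitSize)); // 2: 141/128 is above 1.1
        System.out.println(countSplits(320 * MB, splitSize)); // 3
    }
}
```

The rule avoids launching a separate MapTask for a tiny tail of a file that only slightly exceeds the slice size.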

2. ReduceTask parallelism mechanism
ReduceTask parallelism also affects the concurrency and efficiency of the whole job. Unlike the number of MapTasks, which depends on the number of slices, the number of ReduceTasks can be set directly by hand:
job.setNumReduceTasks(4);
If the data is unevenly distributed, data skew can occur in the reduce phase.
Note: the number of ReduceTasks cannot be set arbitrarily; business logic requirements must also be considered. In some cases, such as computing a global summary result, there can be only one ReduceTask.
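
The following sketch illustrates how skew can arise with four ReduceTasks. It assumes keys are assigned by the hash-modulo rule used by Hadoop's default HashPartitioner; the class name and the sample keys are made up for illustration, and the code does not depend on Hadoop.

```java
import java.util.Arrays;

// Sketch of reduce-side skew under hash partitioning; the sample keys are hypothetical.
class ReducerSkewDemo {
    // Non-negative hash of the key, modulo the number of reduce tasks.
    static int partition(String key, int numReduceTasks) {
        return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
    }

    // Count how many records each reduce task would receive.
    static int[] recordsPerReducer(String[] keys, int numReduceTasks) {
        int[] counts = new int[numReduceTasks];
        for (String key : keys) {
            counts[partition(key, numReduceTasks)]++;
        }
        return counts;
    }

    public static void main(String[] args) {
        // One hot key ("a") dominates the input.
        String[] keys = {"a", "a", "a", "a", "a", "a", "b", "c", "d"};
        System.out.println(Arrays.toString(recordsPerReducer(keys, 4))); // [1, 6, 1, 1]
        // With a single reduce task everything lands in one place (global summary case).
        System.out.println(Arrays.toString(recordsPerReducer(keys, 1))); // [9]
    }
}
```

With four ReduceTasks, one task receives six of the nine records, so it finishes last and limits the benefit of the added parallelism.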
3. Practical experience with task parallelism
It is best for each task to run for at least one minute.
If each map or reduce task of a job runs for only 30-40 seconds, reduce the number of map or reduce tasks. Setting up each task and adding it to the scheduler takes a few seconds, so if every task finishes very quickly, too much time is wasted at the start and end of each task.
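
A rough way to see why short tasks are wasteful is to compute the share of each task's time that goes to fixed overhead. The 5-second overhead below is an assumed figure standing in for "a few seconds", and the class and method names are hypothetical.

```java
import java.util.Locale;

// Back-of-the-envelope sketch; the 5-second overhead is an assumption, not a measurement.
class TaskOverheadDemo {
    // Fraction of a task's total time spent on fixed setup and scheduling overhead.
    static double overheadFraction(double overheadSeconds, double workSeconds) {
        return overheadSeconds / (overheadSeconds + workSeconds);
    }

    public static void main(String[] args) {
        double overheadSeconds = 5.0; // assumed per-task overhead
        double[] workSeconds = {30.0, 60.0, 300.0};
        for (double work : workSeconds) {
            double percent = 100 * overheadFraction(overheadSeconds, work);
            System.out.println(String.format(Locale.ROOT, "work=%.0fs overhead=%.1f%%", work, percent));
        }
    }
}
```

Under this assumption a task doing 30 seconds of work spends roughly 14% of its time on overhead, while a 5-minute task spends under 2%.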
In addition, by default each task runs in a new JVM instance, which incurs startup and teardown overhead. In some cases the time to start and destroy the JVM may exceed the actual processing time; configuring JVM reuse for tasks can mitigate this problem:
(mapred.job.reuse.jvm.num.tasks, which defaults to 1, is the maximum number of tasks of the same job that can run sequentially in one JVM. The default of 1 means that every task starts its own JVM.)
If the input file is very large, such as 1TB, consider using a larger HDFS block size, for example 256MB or 512MB.
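
The effect of the block size on the number of MapTasks for a 1TB input can be estimated with simple arithmetic. This sketch assumes a single file, a slice size equal to the block size, and ignores the 1.1 tolerance rule; the names are hypothetical.

```java
// Estimate of MapTask count for a large input; assumes slice size == block size.
class BlockSizeDemo {
    static final long MB = 1024L * 1024L;
    static final long TB = 1024L * 1024L * MB;

    // Ceiling division: number of slices needed to cover the input.
    static long approxMapTasks(long inputBytes, long splitSize) {
        return (inputBytes + splitSize - 1) / splitSize;
    }

    public static void main(String[] args) {
        System.out.println(approxMapTasks(TB, 128 * MB)); // 8192
        System.out.println(approxMapTasks(TB, 256 * MB)); // 4096
        System.out.println(approxMapTasks(TB, 512 * MB)); // 2048
    }
}
```

Doubling the block size halves the number of MapTasks, so each task runs longer and the fixed per-task overhead matters less.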
