The mechanism of MapReduce parallelism

Last Update:2018-09-30 Source: Internet

Author: User

1. MapTask parallelism mechanism
MapTask parallelism is the number of map tasks that run in parallel during the map phase. Because it determines how much of the input is processed concurrently, it directly affects the processing speed of the whole job. So, is a higher number of parallel MapTask instances always better? And how is the degree of parallelism decided?
The map-phase parallelism of a MapReduce job is determined by the client when the job is submitted: before submitting, the client logically slices the input data. Slicing produces a slicing plan file (job.split), and each logical slice eventually corresponds to one MapTask.
The logical slicing is performed by the getSplits() method of the FileInputFormat implementation class.
FileInputFormat slicing mechanism
The default slicing mechanism in FileInputFormat:
A. Slices are cut simply according to the length of the file's content
B. The slice size defaults to the block size
C. Slicing does not consider the data set as a whole; each file is sliced individually
For example, suppose there are two input files:
file1.txt 320M
file2.txt 10M
After FileInputFormat slicing, the slice information is as follows:
file1.txt.split1: 0M~128M
file1.txt.split2: 128M~256M
file1.txt.split3: 256M~320M
file2.txt.split1: 0M~10M
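
The example above can be reproduced with a few lines of plain Java. This is only a simplified sketch, not Hadoop's actual code: the class `SliceSketch` and the method `planSlices` are hypothetical names, the slice size is assumed to be a 128M block, and the tolerance for the last slice is ignored.

```java
import java.util.ArrayList;
import java.util.List;

// Simplified sketch of per-file slicing; hypothetical names, not Hadoop's implementation.
class SliceSketch {
    static final long MB = 1024L * 1024L;

    // Cuts one file into consecutive slices of at most sliceSize bytes
    // and returns a readable description of each slice.
    static List<String> planSlices(String fileName, long fileLength, long sliceSize) {
        List<String> slices = new ArrayList<>();
        long offset = 0;
        int index = 1;
        while (offset < fileLength) {
            long end = Math.min(offset + sliceSize, fileLength);
            slices.add(fileName + ".split" + index + ": " + (offset / MB) + "M~" + (end / MB) + "M");
            offset = end;
            index++;
        }
        return slices;
    }

    public static void main(String[] args) {
        long sliceSize = 128 * MB; // assumption: slice size equals a 128M block
        List<String> plan = new ArrayList<>();
        // Each file is sliced on its own; the 10M file still gets its own slice.
        plan.addAll(planSlices("file1.txt", 320 * MB, sliceSize));
        plan.addAll(planSlices("file2.txt", 10 * MB, sliceSize));
        plan.forEach(System.out::println);
    }
}
```

Running it prints four slices, three for file1.txt and one for file2.txt, so the job in this example would start four MapTasks.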

Parameters that configure the slice size in FileInputFormat
In FileInputFormat, the slice size is calculated as:
Math.max(minSize, Math.min(maxSize, blockSize));
The slice size is determined mainly by these values:
minSize: default value 1
Configuration parameter: mapreduce.input.fileinputformat.split.minsize
maxSize: default value Long.MAX_VALUE
Configuration parameter: mapreduce.input.fileinputformat.split.maxsize
blockSize
So, by default, the slice size equals the block size, which is 128M in Hadoop 2.x.
maxSize (slice maximum): if this parameter is set smaller than the block size, the slice becomes smaller and equals the configured value.
minSize (slice minimum): if this parameter is set larger than the block size, the slice becomes larger than the block size.
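
To see how the two parameters interact, the formula can be evaluated on its own. The following is a minimal sketch in plain Java; the class name `SplitSizeSketch` is hypothetical, and the 128M block size and the 64M and 256M settings are assumed values chosen only for illustration.

```java
// Evaluates the slice-size formula quoted in the text for a few settings.
class SplitSizeSketch {
    static final long MB = 1024L * 1024L;

    // Same expression as in the text: max(minSize, min(maxSize, blockSize)).
    static long computeSplitSize(long minSize, long maxSize, long blockSize) {
        return Math.max(minSize, Math.min(maxSize, blockSize));
    }

    public static void main(String[] args) {
        long blockSize = 128 * MB; // assumption: 128M blocks

        // Defaults (minSize = 1, maxSize = Long.MAX_VALUE): slice equals the block size.
        System.out.println(computeSplitSize(1L, Long.MAX_VALUE, blockSize) / MB + "M"); // 128M

        // maxSize below the block size: the slice shrinks to maxSize.
        System.out.println(computeSplitSize(1L, 64 * MB, blockSize) / MB + "M"); // 64M

        // minSize above the block size: the slice grows to minSize.
        System.out.println(computeSplitSize(256 * MB, Long.MAX_VALUE, blockSize) / MB + "M"); // 256M
    }
}
```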
However, no matter how these parameters are tuned, multiple small files cannot be combined into a single slice.
One more detail:
A new slice is cut only while bytesRemaining/splitSize > 1.1 holds. Once this condition is no longer satisfied, all of the remaining data becomes the last slice. As a result, a 129M file is not planned into two slices.
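
The effect of the 1.1 tolerance can be checked with a small loop. This is a sketch under the assumption of a 128M slice size; the class `SplitSlopSketch` and the method `countSlices` are hypothetical names, and the loop only follows the condition described above rather than reproducing Hadoop's full getSplits() logic.

```java
// Counts slices for one file using the bytesRemaining/splitSize > 1.1 rule.
class SplitSlopSketch {
    static final long MB = 1024L * 1024L;
    static final double SPLIT_SLOP = 1.1; // tolerance described in the text

    static int countSlices(long fileLength, long splitSize) {
        int slices = 0;
        long bytesRemaining = fileLength;
        // Keep cutting full-size slices while clearly more than one slice remains.
        while (((double) bytesRemaining) / splitSize > SPLIT_SLOP) {
            slices++;
            bytesRemaining -= splitSize;
        }
        // Whatever is left (up to 1.1 x splitSize) becomes the last slice.
        if (bytesRemaining != 0) {
            slices++;
        }
        return slices;
    }

    public static void main(String[] args) {
        long splitSize = 128 * MB; // assumption: default slice size of 128M
        System.out.println(countSlices(129 * MB, splitSize)); // 1
        System.out.println(countSlices(141 * MB, splitSize)); // 2
        System.out.println(countSlices(320 * MB, splitSize)); // 3
    }
}
```

With a 128M slice size, the threshold is 140.8M: a 129M file stays in one slice, while a 141M file is cut into a 128M slice and a 13M slice.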

2. ReduceTask parallelism mechanism
ReduceTask parallelism also affects the concurrency and efficiency of the whole job. Unlike the number of concurrent MapTasks, which depends on the number of slices, the number of ReduceTasks can be set manually:
job.setNumReduceTasks(4);
If the data is unevenly distributed, data skew may occur in the reduce phase.
Note: The number of ReduceTasks cannot be set arbitrarily; the business logic must also be considered. In some cases, such as computing a global summary result, there can be only one ReduceTask.
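
Data skew is easier to see with a concrete key distribution. The sketch below is plain Java, not a Hadoop job: it assigns keys to reducers with the rule used by Hadoop's default HashPartitioner (non-negative hash code modulo the number of ReduceTasks), but it uses String keys as a stand-in for Hadoop's Writable key types, and the class name `PartitionSketch` and the sample keys are hypothetical.

```java
import java.util.Arrays;
import java.util.List;

// Illustrates how an uneven key distribution leads to skew across reducers.
class PartitionSketch {
    // Non-negative hash modulo the number of reduce tasks.
    static int partitionFor(String key, int numReduceTasks) {
        return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
    }

    // Counts how many records each reducer would receive.
    static int[] recordsPerReducer(List<String> keys, int numReduceTasks) {
        int[] counts = new int[numReduceTasks];
        for (String key : keys) {
            counts[partitionFor(key, numReduceTasks)]++;
        }
        return counts;
    }

    public static void main(String[] args) {
        // Assumed sample data: the key "a" dominates the input.
        List<String> keys = Arrays.asList(
                "a", "a", "a", "a", "a", "a", "a", "a", "b", "c", "d");

        // With 4 reducers, one reducer receives all eight "a" records.
        System.out.println(Arrays.toString(recordsPerReducer(keys, 4))); // [1, 8, 1, 1]

        // With 1 reducer (global summary case), everything goes to reducer 0.
        System.out.println(Arrays.toString(recordsPerReducer(keys, 1))); // [11]
    }
}
```

Adding more ReduceTasks does not help in this situation, because all records with the same key always go to the same reducer.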
3. Task parallelism tips from experience
Ideally, each task should run for at least one minute.
If each map or reduce task of a job runs for only 30-40 seconds, reduce the number of map or reduce tasks for the job. Setting up each task (map or reduce) and adding it to the scheduler takes a few seconds, so if every task finishes very quickly, too much time is wasted at the start and end of each task.
In addition, by default each task runs in a new JVM instance, which incurs the overhead of starting and destroying the JVM. In some cases, starting and destroying the JVM may take longer than the actual processing. Configuring JVM reuse for tasks can mitigate this problem:
(mapred.job.reuse.jvm.num.tasks, which is 1 by default, is the maximum number of tasks of the same job that can run sequentially in one JVM. The default of 1 means that each task starts its own JVM.)
If the input file is very large, such as 1TB, consider using a larger block size on HDFS, e.g. 256MB or 512MB.
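
A rough calculation shows why a larger block size helps for very large inputs. The sketch below assumes a single 1TB file, a slice size equal to the block size, and ignores the tolerance applied to the last slice; the class `TaskCountSketch` and the method `estimateMapTasks` are hypothetical names.

```java
// Estimates how many MapTasks a large file produces for different block sizes.
class TaskCountSketch {
    static final long MB = 1024L * 1024L;
    static final long TB = 1024L * 1024L * MB;

    // Rounded-up division: number of slices needed to cover the file.
    static long estimateMapTasks(long fileLength, long sliceSize) {
        return (fileLength + sliceSize - 1) / sliceSize;
    }

    public static void main(String[] args) {
        System.out.println(estimateMapTasks(TB, 128 * MB)); // 8192
        System.out.println(estimateMapTasks(TB, 256 * MB)); // 4096
        System.out.println(estimateMapTasks(TB, 512 * MB)); // 2048
    }
}
```

Doubling the block size halves the number of MapTasks, so each task processes more data and the per-task setup cost is spread over a longer run time.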

This article is an English version of an article which is originally in the Chinese language on aliyun.com and is provided for information purposes only. This website makes no representation or warranty of any kind, either expressed or implied, as to the accuracy, completeness ownership or reliability of the article or any translations thereof. If you have any concerns or complaints relating to the article, please send an email, providing a detailed description of the concern or complaint, to info-contact@alibabacloud.com. A staff member will contact you within 5 working days. Once verified, infringing content will be removed immediately.
