Configuration items of Hadoop 1

Source: Internet
Author: User

mapred.min.split.size

The meaning is exactly what the name says: the minimum split size. It took me a long time to discover that this has to be configured on the machine that submits the job, not on the Hadoop hosts.


mapred.map.tasks

I had assumed that the total number of map tasks in a job was simply the total input size divided by the actual split size, so I did not see what this setting was for. The following example shows that it does matter:

Several parameters of the Hive production environment at my company are configured as follows:
dfs.block.size = 268435456
hive.merge.mapredfiles = true
hive.merge.mapfiles = true
hive.merge.size.per.task = 256000000
mapred.map.tasks = 2

Because small-file merging is enabled (both merge options are true), the combination of dfs.block.size and hive.merge.size.per.task leaves most of the merged files at about 300 MB.

Case 1:

Suppose we have three 300 MB files (900 MB in total). Then goalSize = min(900 MB / 2, 256 MB) = 256 MB (see http://blog.sina.com.cn/s/blog_6ff05a2c010178qd.html for details on how the number of maps is calculated).
The job therefore gets 6 maps: three of them process 256 MB each, and the other three process 44 MB each.
This is where the bucket effect (the weakest-link principle) kicks in. The duration of the job's map stage is determined not by the fastest map but by the slowest one. The three maps that process only 44 MB finish quickly, but the stage still has to wait for the three maps that process 256 MB. Those three 256 MB maps slow down the entire job.
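The split arithmetic of Case 1 can be reproduced with a short Python sketch. It is only an approximation of the old-API FileInputFormat logic as I understand it (split size = max(min size, min(goal size, block size)), with a final chunk up to 10% over the split size kept whole). The function name `compute_splits`, its parameters, and the use of whole megabytes instead of bytes are assumptions made for illustration, not part of Hadoop.

```python
# Illustrative sketch only: names and MB units are assumptions, not Hadoop APIs.
SPLIT_SLOP = 1.1  # a trailing chunk up to 10% over the split size is not cut again


def compute_splits(file_sizes_mb, map_tasks_hint, block_size_mb, min_split_mb=1):
    """Return the list of split sizes (MB) that the files would be cut into.

    map_tasks_hint plays the role of mapred.map.tasks,
    block_size_mb the role of dfs.block.size,
    min_split_mb the role of mapred.min.split.size.
    """
    total = sum(file_sizes_mb)
    goal_size = total // max(map_tasks_hint, 1)          # total size / requested maps
    split_size = max(min_split_mb, min(goal_size, block_size_mb))

    splits = []
    for size in file_sizes_mb:                           # files are split one by one
        remaining = size
        while remaining / split_size > SPLIT_SLOP:
            splits.append(split_size)
            remaining -= split_size
        if remaining > 0:                                # leftover becomes the last split
            splits.append(remaining)
    return splits


# Case 1: three 300 MB files, mapred.map.tasks = 2, 256 MB blocks
print(compute_splits([300, 300, 300], 2, 256))  # [256, 44, 256, 44, 256, 44]
```

Because splits never span files, each 300 MB file is cut into one 256 MB split plus a 44 MB remainder, which gives the six uneven maps described above.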

Case 2:

If we set mapred.map.tasks to 6, let's see what changes:
goalSize = min(900 MB / 6, 256 MB) = 150 MB
The job is again allocated 6 maps, but now each map processes 150 MB of data, which is perfectly even. No map holds the others back, resources are used sensibly, and the execution time is about 59% (150/256) of that in Case 1.

Case Analysis:

Although mapred.map.tasks was raised from 2 to 6, Case 2 does not use more map resources than Case 1: both use six maps. Yet the execution time of Case 2 is only about 59% of that of Case 1.
This case shows that tuning mapred.map.tasks to fit the job's input can significantly improve job execution efficiency.
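The 59% figure can be checked with a few lines of Python. This is a rough model, not a measurement: it assumes all six maps run in parallel and that a map's run time is proportional to the amount of data it reads, so the map stage lasts as long as its largest split. The helper name `map_stage_cost` is hypothetical.

```python
# Rough model (assumption): all maps run in parallel and map time is
# proportional to input size, so the stage is as slow as its largest split.
def map_stage_cost(split_sizes_mb):
    return max(split_sizes_mb)


case1 = [256, 44, 256, 44, 256, 44]   # mapred.map.tasks = 2
case2 = [150] * 6                     # mapred.map.tasks = 6

# Both cases read the same 900 MB with the same number of maps.
assert sum(case1) == sum(case2) == 900
assert len(case1) == len(case2) == 6

ratio = map_stage_cost(case2) / map_stage_cost(case1)
print(f"Case 2 map stage takes about {ratio:.0%} of Case 1")  # about 59%
```

Under this model the gain comes entirely from balancing the splits, not from adding maps.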

Ref: http://blog.sina.com.cn/s/blog_6ff05a2c0101aqvv.html


PS: I often have trouble finding the official configuration reference for Hadoop 1, so I am bookmarking it here. I hope this address does not change again:

https://hadoop.apache.org/docs/r1.0.4/mapred-default.html

This article is from the "Dream footprints" blog. Please keep this source link: http://daisy8867.blog.51cto.com/1043582/1554424
