Hive optimization----Controlling the number of maps in Hive

Source: Internet
Author: User

1. Typically, a job spawns one or more map tasks based on the files in the input directory.

The main determining factors are: the total number of input files, the total size of the input, and the file block size configured for the cluster (currently 128M; you can view it in Hive with the command set dfs.block.size; but this parameter cannot be modified from within Hive).
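For example, you can check the current block size from the Hive CLI. The output shown here is illustrative; the actual value depends on your cluster:

set dfs.block.size;
-- typical output: dfs.block.size=134217728  (134217728 bytes = 128M)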

2. For example:
a) Suppose the input directory contains one file a of size 780M. Hadoop will split file a into 7 blocks (6 blocks of 128M and one block of 12M), producing 7 map tasks.
b) Suppose the input directory contains 3 files a, b, c of sizes 10M, 20M, and 130M. Hadoop will split them into 4 blocks (10M, 20M, 128M, and 2M), producing 4 map tasks.
That is, if a file is larger than the block size (128M), it is split; if it is smaller than the block size, the whole file is treated as one block.
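As a quick arithmetic check of example a), the block math can be sketched in HiveQL itself (assuming a Hive version that supports SELECT without a FROM clause):

select floor(780 / 128)  as full_blocks,  -- 6 full 128M blocks
       780 % 128         as tail_mb,      -- one 12M tail block
       ceil(780 / 128.0) as map_tasks;    -- 7 map tasks in total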

3. Is a higher map count always better?
Not necessarily. If a task has many small files (much smaller than the 128M block size), each small file is treated as a block and completed by its own map task. When starting and initializing a map task takes far longer than its actual processing, a great deal of resources is wasted.
Moreover, the number of map tasks that can run simultaneously is limited.
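To check whether a task's input suffers from this small-file problem, you can list the input directory directly from the Hive CLI. The path below is a hypothetical placeholder:

dfs -ls /user/hive/warehouse/my_db.db/my_table/pt=2012-07-04;
-- many entries far below 134217728 bytes (128M) indicate the small-file problem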

4. Is it enough to ensure that each map handles a file block of close to 128M?
Not necessarily either. Consider a 127M file that would normally be completed by a single map: if it has only one or two small fields but tens of millions of records, and the map's processing logic is complex, doing it all with one map task will certainly be time-consuming.

To address questions 3 and 4 above, we need two opposite approaches: reducing the number of maps, and increasing the number of maps.

How do I merge small files to reduce the number of maps?
Suppose a SQL task:
select count(1) from popt_tbaccountcopy_mes where pt = '2012-07-04';
The task's input directory /group/p_sdo_data/p_sdo_data_etl/pt/popt_tbaccountcopy_mes/pt=2012-07-04
contains 194 files in total, many of them much smaller than 128M, with a combined size of 9G. A normal execution uses 194 map tasks.
Total compute resources consumed by the maps: slots_millis_maps = 623,020

We can reduce the number of maps by merging small files before map execution, using the following settings:
set mapred.max.split.size=100000000;
set mapred.min.split.size.per.node=100000000;
set mapred.min.split.size.per.rack=100000000;
set hive.input.format=org.apache.hadoop.hive.ql.io.CombineHiveInputFormat;
Executing the statement above again then uses 74 map tasks, which consume: slots_millis_maps = 333,500
For this simple SQL task the execution time may be about the same, but half the compute resources are saved.
Roughly speaking: 100000000 means 100M, and set hive.input.format=org.apache.hadoop.hive.ql.io.CombineHiveInputFormat; tells Hive to merge small files before execution.
The three preceding parameters determine the size of the merged splits. Data larger than the 128M block size is split at 128M boundaries; a remainder between 100M and 128M becomes its own split; and pieces smaller than 100M (both small files and the leftover fragments of split large files) are merged together, yielding 74 splits in total.
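Note that the mapred.* names above are the old MapReduce 1 property names. On a Hadoop 2.x / YARN cluster, the equivalents should be the following (an assumption worth verifying against your Hadoop and Hive versions; the old names are usually still honored for backward compatibility):

set mapreduce.input.fileinputformat.split.maxsize=100000000;
set mapreduce.input.fileinputformat.split.minsize.per.node=100000000;
set mapreduce.input.fileinputformat.split.minsize.per.rack=100000000;
set hive.input.format=org.apache.hadoop.hive.ql.io.CombineHiveInputFormat;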

How do I increase the number of maps appropriately?

When the input files are large, the task logic is complex, and map execution is slow, you can consider increasing the number of maps so that each map processes less data, improving the efficiency of task execution.
Suppose you have a task like this:
select data_desc,
       count(1),
       count(distinct id),
       sum(case when ...),
       sum(case when ...),
       sum(...)
from a group by data_desc;
If table a has only one file of size 120M but containing tens of millions of records, completing this task with 1 map is certainly time-consuming. In that case we should consider splitting this one file reasonably into multiple files, so that multiple map tasks can be used:
set mapred.reduce.tasks=10;
create table a_1 as
select * from a
distribute by rand(123);

This randomly scatters the records of table a into table a_1, which will contain 10 files. Replacing table a with a_1 in the SQL above then uses 10 map tasks.
Each map task processes more than 12M of data (millions of records), which is certainly much more efficient.
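Concretely, the query is then rewritten against a_1; the ellipses stand for the same case expressions as in the original statement:

select data_desc,
       count(1),
       count(distinct id),
       sum(case when ...),
       sum(case when ...),
       sum(...)
from a_1 group by data_desc;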

These two approaches may look contradictory: one merges small files, the other splits a large file into smaller ones. But this is exactly the point to pay attention to. Depending on the actual situation, controlling the number of maps should follow two principles: use an appropriate number of maps for a large data volume, and have each individual map task process an appropriate amount of data.


This article is from the "Superman College" blog; to reproduce it, please contact the author!
