Hive optimization----Controlling the number of maps in Hive

Source: Internet
Author: User

1. Typically, a job spawns one or more map tasks based on the files in the input directory.

The main determining factors are: the total number of input files, the total size of the input, and the file block size configured for the cluster (currently 128M; you can view it in Hive with the command set dfs.block.size; but this parameter cannot be modified from within Hive).
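For example, you can check the current block size from the Hive CLI. The output shown here is illustrative; the actual value depends on your cluster:

set dfs.block.size;
-- typical output: dfs.block.size=134217728  (134217728 bytes = 128M)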

2. For example:
a) Suppose the input directory contains one file a of size 780M. Hadoop will split file a into 7 blocks (6 blocks of 128M and one block of 12M), producing 7 map tasks.
b) Suppose the input directory contains 3 files a, b, c of sizes 10M, 20M, and 130M. Hadoop will split them into 4 blocks (10M, 20M, 128M, and 2M), producing 4 map tasks.
That is, if a file is larger than the block size (128M), it is split; if it is smaller than the block size, the whole file is treated as one block.
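As a quick arithmetic check of example a), the block math can be sketched in HiveQL itself (assuming a Hive version that supports SELECT without a FROM clause):

select floor(780 / 128)  as full_blocks,  -- 6 full 128M blocks
       780 % 128         as tail_mb,      -- one 12M tail block
       ceil(780 / 128.0) as map_tasks;    -- 7 map tasks in total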

3. Is a higher map count always better?
Not necessarily. If a task has many small files (much smaller than the 128M block size), each small file is treated as a block and completed by its own map task. When starting and initializing a map task takes far longer than its actual processing, a great deal of resources is wasted.
Moreover, the number of map tasks that can run simultaneously is limited.
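To check whether a task's input suffers from this small-file problem, you can list the input directory directly from the Hive CLI. The path below is a hypothetical placeholder:

dfs -ls /user/hive/warehouse/my_db.db/my_table/pt=2012-07-04;
-- many entries far below 134217728 bytes (128M) indicate the small-file problem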

4. Is it enough to ensure that each map handles a file block of close to 128M?
Not necessarily either. Consider a 127M file that would normally be completed by a single map: if it has only one or two small fields but tens of millions of records, and the map's processing logic is complex, doing it all with one map task will certainly be time-consuming.

To address questions 3 and 4 above, we need two opposite approaches: reducing the number of maps, and increasing the number of maps.

How do I merge small files to reduce the number of maps?
Suppose a SQL task:
select count(1) from popt_tbaccountcopy_mes where pt = '2012-07-04';
The task's input directory /group/p_sdo_data/p_sdo_data_etl/pt/popt_tbaccountcopy_mes/pt=2012-07-04
contains 194 files in total, many of them much smaller than 128M, with a combined size of 9G. A normal execution uses 194 map tasks.
Total compute resources consumed by the maps: slots_millis_maps = 623,020

We can reduce the number of maps by merging small files before map execution, using the following settings:
set mapred.max.split.size=100000000;
set mapred.min.split.size.per.node=100000000;
set mapred.min.split.size.per.rack=100000000;
set hive.input.format=org.apache.hadoop.hive.ql.io.CombineHiveInputFormat;
Executing the statement above again then uses 74 map tasks, which consume: slots_millis_maps = 333,500
For this simple SQL task the execution time may be about the same, but half the compute resources are saved.
Roughly speaking: 100000000 means 100M, and set hive.input.format=org.apache.hadoop.hive.ql.io.CombineHiveInputFormat; tells Hive to merge small files before execution.
The three preceding parameters determine the size of the merged splits. Data larger than the 128M block size is split at 128M boundaries; a remainder between 100M and 128M becomes its own split; and pieces smaller than 100M (both small files and the leftover fragments of split large files) are merged together, yielding 74 splits in total.
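Note that the mapred.* names above are the old MapReduce 1 property names. On a Hadoop 2.x / YARN cluster, the equivalents should be the following (an assumption worth verifying against your Hadoop and Hive versions; the old names are usually still honored for backward compatibility):

set mapreduce.input.fileinputformat.split.maxsize=100000000;
set mapreduce.input.fileinputformat.split.minsize.per.node=100000000;
set mapreduce.input.fileinputformat.split.minsize.per.rack=100000000;
set hive.input.format=org.apache.hadoop.hive.ql.io.CombineHiveInputFormat;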

How do I increase the number of maps appropriately?

When the input files are large, the task logic is complex, and map execution is slow, you can consider increasing the number of maps so that each map processes less data, improving the efficiency of task execution.
Suppose you have a task like this:
select data_desc,
       count(1),
       count(distinct id),
       sum(case when ...),
       sum(case when ...),
       sum(...)
from a group by data_desc;
If table a has only one file of size 120M but containing tens of millions of records, completing this task with 1 map is certainly time-consuming. In that case we should consider splitting this one file reasonably into multiple files, so that multiple map tasks can be used:
set mapred.reduce.tasks=10;
create table a_1 as
select * from a
distribute by rand(123);

This randomly scatters the records of table a into table a_1, which will contain 10 files. Replacing table a with a_1 in the SQL above then uses 10 map tasks.
Each map task processes more than 12M of data (millions of records), which is certainly much more efficient.
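Concretely, the query is then rewritten against a_1; the ellipses stand for the same case expressions as in the original statement:

select data_desc,
       count(1),
       count(distinct id),
       sum(case when ...),
       sum(case when ...),
       sum(...)
from a_1 group by data_desc;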

These two approaches may look contradictory: one merges small files, the other splits a large file into smaller ones. But this is exactly the point to pay attention to. Depending on the actual situation, controlling the number of maps should follow two principles: use an appropriate number of maps for a large data volume, and have each individual map task process an appropriate amount of data.


This article is from the "Superman College" blog; to reproduce it, please contact the author!
