Hive optimization: controlling the number of map and reduce tasks in Hive jobs


First, controlling the number of map tasks in a Hive job:

1. Typically, a job produces one or more map tasks from its input directory.
The main determining factors are: the total number of input files, the size of the input files, and the file block size configured for the cluster (currently 128M; it can be viewed in Hive with the set dfs.block.size; command, but this parameter cannot be modified from Hive);
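For example, a minimal sketch of checking the block size from the Hive CLI (the printed value is illustrative; the set <property>; form only displays a value, it does not change it):
set dfs.block.size;
-- typically prints something like dfs.block.size=134217728, i.e. 128M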

2. For example:
a) Assume the input directory contains one file a of size 780M. Hadoop will split file a into 7 blocks (6 blocks of 128M and one of 12M), producing 7 map tasks.
b) Assume the input directory contains 3 files a, b, c of sizes 10M, 20M, and 130M. Hadoop will split them into 4 blocks (10M, 20M, 128M, 2M), producing 4 map tasks.
That is, a file larger than the block size (128M) is split, while a file smaller than the block size is treated as a single block.

3. Is a higher number of maps always better?
No. If a task has many small files (each much smaller than the 128M block size), every small file will be treated as a block and handled by its own map task. When starting and initializing a map task takes far longer than its actual processing logic, a great deal of resources is wasted.
Moreover, the number of map tasks that can run at the same time is limited.

4. Is it then safe to assume that having each map handle a block close to 128M is always fine?
Not necessarily. For example, a 127M file would normally be handled by a single map, but if this file has only one or two small fields yet contains tens of millions of records,
and the map's processing logic is complex, completing it with a single map task will certainly be time-consuming.

The above questions 3 and 4 call for two opposite remedies: reducing the number of maps and increasing the number of maps.

How to combine small files to reduce the number of maps.
Suppose an SQL task:
select count(1) from popt_tbaccountcopy_mes where pt = '2012-07-04';
The task's input directory is /group/p_sdo_data/p_sdo_data_etl/pt/popt_tbaccountcopy_mes/pt=2012-07-04,
which contains a total of 194 files, many of them much smaller than 128M, with a total size of 9G. Normal execution would use 194 map tasks.
Total compute resources consumed by the maps: slots_millis_maps = 623,020

The number of maps can be reduced by merging small files before map execution, using the following settings:
set mapred.max.split.size=100000000;
set mapred.min.split.size.per.node=100000000;
set mapred.min.split.size.per.rack=100000000;
set hive.input.format=org.apache.hadoop.hive.ql.io.CombineHiveInputFormat;
Executing the statement above then uses 74 map tasks, and the maps consume slots_millis_maps = 333,500 of compute resources.
For this simple SQL task the execution time may be about the same, but roughly half of the computing resources are saved.
A rough explanation: 100000000 stands for 100M, and set hive.input.format=org.apache.hadoop.hive.ql.io.CombineHiveInputFormat; tells Hive to merge small files before execution.
The three preceding parameters determine the size of the merged splits: pieces larger than the 128M block size are split at 128M; pieces smaller than 128M but larger than 100M are split at 100M; and the pieces smaller than 100M (including both the small files and the leftovers from splitting large files)
are merged together, resulting in 74 splits in total.

How to properly increase the number of maps.

When the input files are large, the task logic is complex, and map execution is very slow, you can consider increasing the number of maps so that each map processes less data, thereby improving the task's execution efficiency.
Suppose there is a task like this:
select data_desc,
       count(1),
       count(distinct id),
       sum(case when ...),
       sum(case when ...),
       sum(...)
from a group by data_desc;
If table a has only one file of 120M but contains tens of millions of records, completing this task with 1 map is certainly time-consuming. In this case, we should consider splitting this one file reasonably into multiple files,
so that multiple map tasks can be used to complete it:
set mapred.reduce.tasks=10;
create table a_1 as
select * from a
distribute by rand(123);

This scatters the records of table a randomly into table a_1, which will contain 10 files; replacing a with a_1 in the SQL above will then use 10 map tasks to complete it.
Each map task handles a bit more than 12M of data (several million records), which is certainly much more efficient.
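For clarity, a sketch of the rewritten query running against the scattered table (only the columns named in the original query are kept; the elided case-when expressions are omitted here):
select data_desc,
       count(1),
       count(distinct id)
from a_1
group by data_desc;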

These two approaches may seem contradictory: one merges small files, the other splits a large file into smaller ones, and this is exactly the point that needs attention.
Depending on the actual situation, controlling the number of maps needs to follow two principles: have large data volumes use an appropriate number of maps, and have each individual map task handle an appropriate amount of data.

Second, controlling the number of reduce tasks in a Hive job:

1. How Hive determines the number of reducers on its own:
The reduce count greatly affects task execution efficiency. Without an explicit setting, Hive estimates a reduce count based on the following two parameters:
hive.exec.reducers.bytes.per.reducer (the amount of data processed per reduce task; default is 1000^3 = 1G)
hive.exec.reducers.max (the maximum number of reducers per task; default is 999)
The formula for the number of reducers is simply N = min(parameter 2, total input data volume / parameter 1).
That is, if the total input to the reduce phase (i.e. the map output) is no more than 1G, there will be only one reduce task.
For example: select pt, count(1) from popt_tbaccountcopy_mes where pt = '2012-07-04' group by pt;
/group/p_sdo_data/p_sdo_data_etl/pt/popt_tbaccountcopy_mes/pt=2012-07-04 has a total size of just over 9G, so by the formula above this statement uses 10 reduce tasks.
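As a quick check, a minimal sketch of printing the current values of these two parameters from the Hive CLI (the set <property>; form only displays a value, it does not change it):
set hive.exec.reducers.bytes.per.reducer;
set hive.exec.reducers.max;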

2. Method one for adjusting the reduce count:
Adjust the value of the hive.exec.reducers.bytes.per.reducer parameter:
set hive.exec.reducers.bytes.per.reducer=500000000; (500M)
select pt, count(1) from popt_tbaccountcopy_mes where pt = '2012-07-04' group by pt; this time there are 20 reduce tasks.

3. Method two for adjusting the reduce count:
set mapred.reduce.tasks=15;
select pt, count(1) from popt_tbaccountcopy_mes where pt = '2012-07-04' group by pt; this time there are 15 reduce tasks.

4. More reducers is not always better.
As with maps, starting and initializing reducers also consumes time and resources.
In addition, there will be as many output files as there are reducers; if many small files are generated and these small files are then used as the input of the next task, the problem of too many small files appears again.

5. Under what circumstances is there only one reduce task?
Quite often you will find that, no matter how large the data volume of the task and no matter whether you have set parameters to adjust the number of reducers, the task keeps running with a single reduce task.
In fact, apart from the case where the data volume is smaller than the hive.exec.reducers.bytes.per.reducer value, a single reduce task also has the following causes:
a) There is no group by aggregation, for example writing select pt, count(1) from popt_tbaccountcopy_mes where pt = '2012-07-04' group by pt; as select count(1) from popt_tbaccountcopy_mes where pt = '2012-07-04';
This is very common, and I hope you will try to rewrite such queries.
b) order by is used
c) there is a Cartesian product
(b and c are illustrated in the sketches below.)
Usually in these cases, apart from finding a way to work around and avoid them, I have no good solution, because these operations are all global, so Hadoop has to complete them with a single reduce task.
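For illustration, minimal sketches of queries that end up with a single reduce task (the cross join syntax is assumed to be available in the Hive version in use):
-- b) a global sort: order by produces a total order, so it runs with a single reduce task
select * from popt_tbaccountcopy_mes where pt = '2012-07-04' order by pt;
-- c) a Cartesian product: a join without a join condition is also handled by a single reduce task
select a.pt, b.pt
from popt_tbaccountcopy_mes a
cross join popt_tbaccountcopy_mes b;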

Similarly, these two principles also need to be considered when setting the number of reducers: have large data volumes use an appropriate number of reducers, and have each individual reduce task handle an appropriate amount of data.

Reprinted from http://superlxw1234.iteye.com/blog/1582880
