The number of reducers determines the number of intermediate or final (landed) files; the size of those files is independent of the block size.
1. What determines the reduce count
The reduce count greatly affects task execution efficiency. If it is not specified, Hive estimates the number of reducers from the following two settings:
Parameter 1: hive.exec.reducers.bytes.per.reducer (the amount of data processed by each reduce task; default 1000^3 = 1 GB)
Parameter 2: hive.exec.reducers.max (the maximum number of reducers per job; default 999)
Formula for the number of reducers:
N = min(Parameter 2, total input data size / Parameter 1)
That is, if the total reduce input (the map output) does not exceed 1 GB, only one reduce task is run;
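The estimate can be sketched in Python. This is a hedged sketch of the formula above, not Hive's actual source code; rounding the quotient up is an assumption, but it is consistent with the "over 9 GB gives 10 reducers" case that follows.

```python
import math

# Sketch of Hive's reducer estimate from the two parameters above.
# Defaults mirror hive.exec.reducers.bytes.per.reducer (1000^3 bytes)
# and hive.exec.reducers.max (999). Ceiling division is an assumption.
def estimate_reducers(total_input_bytes,
                      bytes_per_reducer=1_000_000_000,
                      max_reducers=999):
    n = math.ceil(total_input_bytes / bytes_per_reducer)
    return max(1, min(max_reducers, n))

print(estimate_reducers(500_000_000))   # input under 1 GB -> 1 reducer
print(estimate_reducers(9 * 1024**3))   # ~9.66 GB -> 10 reducers
```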
Case:
select pt,count(1) from popt_tbaccountcopy_mes where pt = '2012-07-04' group by pt;
File storage location: /group/p_sdo_data/p_sdo_data_etl/pt/popt_tbaccountcopy_mes/pt=2012-07-04
The total size is over 9 GB, so this statement uses 10 reducers.
Best practice: it is usually worth specifying the reducer count manually. Since the output of the map stage is usually much smaller than its input, resetting hive.exec.reducers.max is worthwhile even when the reducer count itself is not set. Based on Hadoop experience, you can set the reducer count to 0.95 * the number of TaskTrackers in the cluster.
2. Problems caused by too many or too few reducers
Too many reducers:
1) Many small files are generated (the final output files are written by the reducers, and the size of a reduce output file is unrelated to the configured block size); if these small files become the input of the next job, too many small files have to be merged;
2) Starting and initializing the reducers also costs time and resources;
3) There are as many output files as there are reducers;
Too few reducers:
1) Longer execution time;
2) Possible data skew (for example, 500 maps feeding only 1 reduce);
3. Adjusting the reduce count. Method 1: adjust the value of hive.exec.reducers.bytes.per.reducer;
Case:
set hive.exec.reducers.bytes.per.reducer = 500000000; (500 MB)
select pt,count(1) from popt_tbaccountcopy_mes where pt = '2012-07-04' group by pt;
This time there are 20 reducers.
Method 2: set mapred.reduce.tasks directly.
In Hive, its default value is -1 (Hive decides automatically);
Case:
When the result set of one job is shared by multiple downstream jobs, setting only one reducer gives that job a single output file; every downstream job that consumes it can then use only one map, which is very slow. By raising the shared job's reduce count you get multiple output files, so the downstream jobs can be processed in parallel by multiple maps. In a real environment, the reduce count needs to be set according to the actual data flow;
How do I check the data volume of a file?
hadoop fs -du <filename>
Set an appropriate reduce count based on the actual data volume;
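As a small sketch, the sizes reported by `hadoop fs -du` can be summed with a script. The sample output below is a hypothetical illustration (older Hadoop versions print one "<bytes>  <path>" line per file); the file paths are taken from the earlier case in this article.

```python
# Hypothetical output of `hadoop fs -du <dir>`: one "<bytes>  <path>" line per file.
sample_du_output = """\
3221225472  /group/p_sdo_data/p_sdo_data_etl/pt/popt_tbaccountcopy_mes/pt=2012-07-04/part-00000
3221225472  /group/p_sdo_data/p_sdo_data_etl/pt/popt_tbaccountcopy_mes/pt=2012-07-04/part-00001
3221225472  /group/p_sdo_data/p_sdo_data_etl/pt/popt_tbaccountcopy_mes/pt=2012-07-04/part-00002
"""

def total_bytes(du_output):
    # The first column of each line is the file size in bytes.
    return sum(int(line.split()[0]) for line in du_output.strip().splitlines())

print(total_bytes(sample_du_output))  # total input size in bytes (~9 GB here)
```

With the total in hand, you can judge whether the default 1 GB per reducer gives a reasonable reduce count or whether it should be overridden.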
When the result set of one job is shared by multiple jobs, or one job's output is referenced by several subsequent jobs, you can increase this parameter (the reduce count) to increase the number of maps in the downstream jobs.
If a job's output is not used by any subsequent job after execution, it matters little whether the reduce count is specified manually.
Case:
set mapred.reduce.tasks = 15;
select pt,count(1) from popt_tbaccountcopy_mes where pt = '2012-07-04' group by pt;
This time there are 15 reducers.
4. Common scenarios with only one reduce
Quite often you will find that no matter how large the data volume is, and no matter whether you have set parameters to adjust the reduce count, the task always has only one reduce task (one symptom of data skew). The common causes:
1) The input data volume is smaller than hive.exec.reducers.bytes.per.reducer;
2) The aggregation has no group by (with group by, rows are distributed to different reducers by key);
For example, select pt, count(1) from popt_tbaccountcopy_mes where pt = '2012-07-04' group by pt;
is written as select count(1) from popt_tbaccountcopy_mes where pt = '2012-07-04';
3) order by is used (it forces a global sort of the whole job through a single reducer, which performs poorly);
5. Summary of the reduce count
When setting the reduce count, also keep these two principles in mind:
1) Use an appropriate number of reducers for large data volumes;
2) Have each reduce task process an appropriate amount of data;