Hive tuning notes: JVM reuse, parallel execution, and tuning the number of reducers

Source: Internet
Author: User

Explanation:

1. JVM reuse is a Hadoop tuning parameter that has a large impact on Hive performance, especially in scenarios where many small files are hard to avoid or where a job has many tasks, most of which run only briefly. By default, Hadoop forks a new JVM for each map or reduce task, so JVM startup can add considerable overhead, especially when a job contains thousands of tasks.

JVM reuse allows a JVM instance to be reused up to n times within the same job. The value of n can be set in Hadoop's mapred-site.xml file through the property:

mapred.job.reuse.jvm.num.tasks

It can also be set in a Hive session:

set mapred.job.reuse.jvm.num.tasks=10;

One drawback of JVM reuse is that the task slots it uses stay reserved for reuse until the job completes. If a few reduce tasks in an "unbalanced" job take much longer than the others, the reserved slots sit idle and cannot be used by other jobs; they are released only after all tasks have finished.
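To make the startup-overhead argument concrete, here is a rough Python sketch that counts how many JVMs a job must launch with and without reuse. It is not Hadoop's scheduler: the helper name `jvm_launches` and the assumption that tasks pack evenly into reused JVMs are illustrative only, and the 5000-task job is a made-up figure.

```python
import math


def jvm_launches(num_tasks: int, reuse_n: int) -> int:
    """Lower bound on JVM launches when each JVM may run up to reuse_n tasks.

    reuse_n plays the role of mapred.job.reuse.jvm.num.tasks (1 = no reuse).
    Hypothetical helper: a real job may launch more JVMs, because tasks are
    spread over several slots and nodes. The special value -1 (unlimited
    reuse) is not modelled here.
    """
    if num_tasks < 0 or reuse_n < 1:
        raise ValueError("num_tasks must be >= 0 and reuse_n >= 1")
    return math.ceil(num_tasks / reuse_n)


if __name__ == "__main__":
    # Made-up job size, used only to show the trend.
    for n in (1, 10, 20):
        print(f"reuse={n:>2}: at least {jvm_launches(5000, n)} JVM launches")
```

Under these assumptions, raising the reuse count from 1 to 10 cuts the minimum number of JVM startups by a factor of ten. That is where the saving for short-lived tasks comes from.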


2. Parallel execution means running multiple stages of a Hive query at the same time. During execution, Hive converts a query into one or more stages. A job may contain many stages, and they do not all depend on one another, so independent stages can run in parallel, which may shorten the execution time of the whole job.

Enable it in Hive: set hive.exec.parallel=true;
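The possible gain from parallel stages can be sketched in Python by comparing the sum of all stage durations (serial execution) with the longest dependency chain (parallel execution). This is a simplified model that assumes unlimited cluster capacity and an acyclic stage graph. The function name, stage names, and durations are hypothetical and do not come from a real Hive plan.

```python
from functools import lru_cache


def serial_and_parallel_time(durations, depends_on):
    """Return (serial, parallel) run time for a DAG of stages.

    durations:  {stage: seconds}
    depends_on: {stage: [stages that must finish first]}
    With unlimited capacity the parallel time equals the critical path.
    """
    @lru_cache(maxsize=None)
    def finish(stage):
        # A stage starts once its slowest prerequisite has finished.
        start = max((finish(d) for d in depends_on.get(stage, [])), default=0)
        return start + durations[stage]

    serial = sum(durations.values())
    parallel = max((finish(s) for s in durations), default=0)
    return serial, parallel


if __name__ == "__main__":
    # Hypothetical plan: stage-1 and stage-2 are independent, stage-3 needs both.
    durations = {"stage-1": 60, "stage-2": 90, "stage-3": 30}
    depends_on = {"stage-3": ["stage-1", "stage-2"]}
    print(serial_and_parallel_time(durations, depends_on))
```

In this toy plan only the two independent stages overlap. If every stage depended on the previous one, the two numbers would be equal and hive.exec.parallel would bring no benefit.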


3. Adjust the number of reducers:

Set hive.exec.reducers.bytes.per.reducer (default 1 GB); the result is capped by hive.exec.reducers.max (default 999). The effect is:

mapred.reduce.tasks = min(hive.exec.reducers.max, total input data volume / hive.exec.reducers.bytes.per.reducer)
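The formula above can be tried out with a short Python sketch. The function name `estimate_reducers` and the 30 GB input size are hypothetical. Rounding the division up and never returning fewer than one reducer are assumptions of this sketch, since the formula as written does not specify either.

```python
import math


def estimate_reducers(total_input_bytes: int,
                      bytes_per_reducer: int = 1_000_000_000,
                      max_reducers: int = 999) -> int:
    """Estimate the reducer count from input size.

    bytes_per_reducer ~ hive.exec.reducers.bytes.per.reducer (1 GB, as stated above)
    max_reducers      ~ hive.exec.reducers.max (999, as stated above)
    Assumption: the quotient is rounded up and at least one reducer is used.
    """
    if bytes_per_reducer <= 0 or max_reducers <= 0:
        raise ValueError("bytes_per_reducer and max_reducers must be positive")
    wanted = math.ceil(total_input_bytes / bytes_per_reducer)
    return max(1, min(max_reducers, wanted))


if __name__ == "__main__":
    thirty_gb = 30_000_000_000  # hypothetical input size
    print(estimate_reducers(thirty_gb))               # with the default 1 GB
    print(estimate_reducers(thirty_gb, 150_000_000))  # with 150 MB per reducer
```

Lowering the bytes-per-reducer value raises the reducer count until the cap set by hive.exec.reducers.max is reached. Lowering it further has no effect beyond that point.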


A scenario that uses all three optimizations:

A single SQL statement that reads one data source once and processes it several times:

FROM TABLE1
INSERT OVERWRITE LOCAL DIRECTORY '/data/data_table/data_table1.txt' SELECT 20140303, col1, col2, 2160701, COUNT(DISTINCT col) WHERE col3 <= 20140303 AND col3 >= 20140201 GROUP BY col1, col2
INSERT OVERWRITE LOCAL DIRECTORY '/data/data_table/data_table2.txt' SELECT 20140302, col1, col2, 2160701, COUNT(DISTINCT col) WHERE col3 <= 20140302 AND col3 >= 20140131 GROUP BY col1, col2
INSERT OVERWRITE LOCAL DIRECTORY '/data/data_table/data_table3.txt' SELECT 20140301, col1, col2, 2160701, COUNT(DISTINCT col) WHERE col3 <= 20140301 AND col3 >= 20140130 GROUP BY col1, col2
INSERT OVERWRITE LOCAL DIRECTORY '/data/data_table/data_table4.txt' SELECT 20140228, col1, col2, 2160701, COUNT(DISTINCT col) WHERE col3 <= 20140228 AND col3 >= 20140129 GROUP BY col1, col2
INSERT OVERWRITE LOCAL DIRECTORY '/data/data_table/data_table5.txt' SELECT 20140227, col1, col2, 2160701, COUNT(DISTINCT col) WHERE col3 <= 20140227 AND col3 >= 20140128 GROUP BY col1, col2
INSERT OVERWRITE LOCAL DIRECTORY '/data/data_table/data_table6.txt' SELECT 20140226, col1, col2, 2160701, COUNT(DISTINCT col) WHERE col3 <= 20140226 AND col3 >= 20140127 GROUP BY col1, col2

.................. (remaining statements omitted)

Before any parameters were set, the execution time was 450 s.

Parameters set:

set mapred.job.reuse.jvm.num.tasks=20;
set hive.exec.reducers.bytes.per.reducer=150000000;
set hive.exec.parallel=true;


The execution time dropped to 273 s, which shows that sensible parameter adjustment alone can deliver part of the tuning gain.

Copyright notice: This is an original article by the blogger and may not be reproduced without the blogger's permission.
