ETL Hivesql Tuning (the location of the left join where)

Source: Internet
Author: User

First, preface

The company practical Hadoop constructs the Data warehouse, during the inevitable practical hivesql, in the ETL process, the speed has become the question which avoids can avoid. I have a few data tables associated with running 1 hours of experience, you may feel indifferent, but many times ETL will be multiple hours, very waste of time, so hivesql optimization is unavoidable.

Note: This article only from the SQL level to introduce the daily need to pay attention to the points, not involving Hadoop, MapReduce and other aspects of the compilation process of Hive, please refer to the article: http://tech.meituan.com/hive-sql-to-mapreduce.html

II. Preparation of data

Let's say we have two data sheets.

Scenic Area table: sight,12w records, data table structure:

hive> desc sight;okarea                    string                  none                City                    string                  none                country                 string string          

Scenic Order Schedule: order_sight,1040w records, data table structure:

hive> desc order_sight;    Okcreate_time             string                  none                ID                      string                  none                order_id                string 

Third, analysis

3.1 Where Condition

Then we would like to see the area ID is 9718, the date is the 2015-10-10 all order ID, then SQL needs to write as follows:

where s.id='2015-10-10'
The time required is 52 seconds, if we change the way SQL is written:
From sight S left join (where s.id=9718;

Practical 43 seconds, a little faster. Of course, we are not just analysis said 20% faster (I also test, the gap is the smallest), but to analyze the reason!

From the two-SQL notation, especially the red part of the second, I wrote the left condition inside. Then the result of execution is not the same, and the reduce time of the second is obviously less than the reduce time of the first article.

The reason is that these two SQL are decomposed into 8 map tasks and a reduce task, if the left condition is written behind, then these related operations will be placed in the reduce phase, 1 reduce operation time must be greater than 8 map execution time, resulting in an extra long execution time.

Conclusion: When an outer association is used, if the filter condition of the sub-table is written in the where, then the whole table is associated before filtering

ETL Hivesql Tuning (the location of the left join where)

Contact Us

The content source of this page is from Internet, which doesn't represent Alibaba Cloud's opinion; products and services mentioned on that page don't have any relationship with Alibaba Cloud. If the content of the page makes you feel confusing, please write us an email, we will handle the problem within 5 days after receiving your email.

If you find any instances of plagiarism from the community, please send an email to: info-contact@alibabacloud.com and provide relevant evidence. A staff member will contact you within 5 working days.

A Free Trial That Lets You Build Big!

Start building with 50+ products and up to 12 months usage for Elastic Compute Service

  • Sales Support

    1 on 1 presale consultation

  • After-Sales Support

    24/7 Technical Support 6 Free Tickets per Quarter Faster Response

  • Alibaba Cloud offers highly flexible support services tailored to meet your exact needs.