First, preface
The company practical Hadoop constructs the Data warehouse, during the inevitable practical hivesql, in the ETL process, the speed has become the question which avoids can avoid. I have a few data tables associated with running 1 hours of experience, you may feel indifferent, but many times ETL will be multiple hours, very waste of time, so hivesql optimization is unavoidable.
Note: This article only from the SQL level to introduce the daily need to pay attention to the points, not involving Hadoop, MapReduce and other aspects of the compilation process of Hive, please refer to the article: http://tech.meituan.com/hive-sql-to-mapreduce.html
II. Preparation of data
Let's say we have two data sheets.
Scenic Area table: sight,12w records, data table structure:
hive> desc sight;okarea string none City string none country string string
Scenic Order Schedule: order_sight,1040w records, data table structure:
hive> desc order_sight; Okcreate_time string none ID string none order_id string
Third, analysis
3.1 Where Condition
Then we would like to see the area ID is 9718, the date is the 2015-10-10 all order ID, then SQL needs to write as follows:
where s.id='2015-10-10'
The time required is 52 seconds, if we change the way SQL is written:
From sight S left join (where s.id=9718;
Practical 43 seconds, a little faster. Of course, we are not just analysis said 20% faster (I also test, the gap is the smallest), but to analyze the reason!
From the two-SQL notation, especially the red part of the second, I wrote the left condition inside. Then the result of execution is not the same, and the reduce time of the second is obviously less than the reduce time of the first article.
The reason is that these two SQL are decomposed into 8 map tasks and a reduce task, if the left condition is written behind, then these related operations will be placed in the reduce phase, 1 reduce operation time must be greater than 8 map execution time, resulting in an extra long execution time.
Conclusion: When an outer association is used, if the filter condition of the sub-table is written in the where, then the whole table is associated before filtering
ETL Hivesql Tuning (the location of the left join where)