This document records a tuning exercise from a data-processing job: one SQL statement ran very slowly, and some large Hive tables even triggered OOM errors. Below is the step-by-step process of resolving this through parameter settings and SQL optimization.
First, the SQL in question:
select T1.create_time
from
(
    select * from beatles_ods.route where year=... and month=... and day=01
) T1
left outer join
(
    select * from
    (
        select *, row_number() over (partition by id) num
        from beatles_test.route
        where year=... and month=... and day=01
    ) T
    where T.num=1
) T2
on T1.id = T2.id
where T2.id is null;
You can see that this SQL contains one join and one deduplication step (the row_number subquery), both of which are resource-intensive. The left outer join filtered on T2.id is null is an anti-join: it returns the rows of T1 that have no match in T2.
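One aside on the deduplication subquery: row_number() with partition by but no order by keeps an arbitrary row for each id. If a particular row should survive (say, the most recent one), adding an order by inside the window makes the result deterministic. A minimal sketch, assuming create_time (taken from the outer query above) is the column to order by:

select * from
(
    select *, row_number() over (partition by id order by create_time desc) num  -- newest row per id gets num=1
    from beatles_test.route
    where year=... and month=... and day=01
) T
where T.num=1;  -- keep exactly one row per id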
1. For join operations, put the small table on the left side of the join.
This is a cliché, so I won't go into detail here. In short, putting the small table first reduces the shuffle during the MapReduce stage.
In fact, the claim that "putting the small table first in a join improves efficiency" is imprecise. The more accurate statement is: "putting the table with fewer duplicate join keys on the left side of the join improves join efficiency."
The conclusion: for each duplicate join key in the table written on the left side of the join, the underlying execution performs one extra operation.
Suppose a table has 10 million ids and each id has 3 duplicate values on average. Putting that table on the left of the join then costs roughly 30 million extra operations, and at that scale which table is written first makes a visible difference in performance; see the sketch after the link below.
If you want to know more, see:
http://blog.sina.com.cn/s/blog_6ff05a2c01016j7n.html
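To make the rule concrete, here is a minimal sketch. The table names dim_city and fact_route are hypothetical (not from the original job); dim_city stands in for the table with few rows and few duplicate join keys:

select f.id, d.city_name
from dim_city d          -- small table / fewer duplicate join keys: written on the left
join fact_route f        -- large table: written on the right
on d.city_id = f.city_id;

On newer Hive versions this ordering matters less when hive.auto.convert.join=true, since small tables are then broadcast automatically as map joins.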
2. Adjust the number of reducers. This can be tuned down to 256 or below; bigger is not better, since too many reducers consume cluster resources and increase the pressure of aggregating the results.
set mapred.reduce.tasks = 256;
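Alternatively, rather than fixing the count, Hive can derive the reducer count from the input size. A minimal sketch using standard Hive parameters (the values here are illustrative, not from the original post):

set hive.exec.reducers.bytes.per.reducer=536870912;   -- aim for one reducer per ~512 MB of input
set hive.exec.reducers.max=256;                       -- hard cap on the reducer count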
3. Increase memory to prevent out-of-memory errors.
Set the container memory for map and reduce tasks:
set mapreduce.map.memory.mb=4096;
set mapreduce.reduce.memory.mb=4096;
Set the JVM heap size:
set mapreduce.map.java.opts=-Xmx2500m;
I set the map and reduce memory to 4 GB here. If resources are abundant, you can set this value higher, but bigger is not always better, and optimizing a program simply by adding memory is not recommended. Note that the JVM heap (-Xmx) must stay below the container memory, which is why the heap above is 2500 MB against a 4096 MB container; a fuller sketch follows.
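For completeness, a minimal sketch of one consistent set of memory settings (illustrative values; the reduce-side java.opts line mirrors the map-side one shown above, and heaps are kept at roughly 80% of the container size):

set mapreduce.map.memory.mb=4096;           -- map container size
set mapreduce.map.java.opts=-Xmx3276m;      -- map JVM heap, ~80% of the container
set mapreduce.reduce.memory.mb=4096;        -- reduce container size
set mapreduce.reduce.java.opts=-Xmx3276m;   -- reduce JVM heap, ~80% of the container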