Hive SQL tuning during Data migration

Source: Internet
Author: User

This document is documented in the process of data processing, encountered a SQL execution is very slow, for some large hive tables will also appear oom, step by step through parameter settings and SQL optimization, the process of tuning it.

First on SQL

SelectT1.create_time from 
(
Select * fromBeatles_ods.routewhere Year= . and Month= - and Day=01
) T1
Left outer Join
(
Select * from(Select *, Row_number () Over(Partition byID) num fromBeatles_test.routewhere Year= . and Month= - and Day= on) TwhereT.num=1
) T2
onT1.id=T2.idwhereT2.id= NULL;

You can see that this SQL consists of 1 joins, a redo statement, both of which are resource-intensive.

1, the link operation, the small table is placed on the left side of the link.

This is a cliché, not a detailed introduction here. Basically, a small table reduces the shuffle in the MapReduce process.

In fact, it is wrong to say that "putting a small table in front of an association can improve efficiency ." The correct argument should be that "associating a table with a few duplicate association keys in front of a join can improve the efficiency of the join." "

The final conclusion is: The table written on the left side of the association with each of the 1 Duplicate association keys, the bottom layer will be more than 1 operations processing.

Suppose that a table has 10 million IDs, the average ID has 3 duplicate values, then put a table in front of the association will do more than 30 million times the operation of processing, at this time who wrote in front of who wrote in the after see the difference in performance.

If you want to know more, please visit:

Http://blog.sina.com.cn/s/blog_6ff05a2c01016j7n.html

2, adjust the number of reduce, this number can be adjusted to less than 256, not the bigger the better, the Assembly consumes the resources of the cluster, and increase the aggregate pressure.

Set = ;

3. Increase the memory size to prevent memory overflow

Set the memory for map and reduce

set mapreduce.map.memory.mb=4096; set mapreduce.reduce.memory.mb=4096;

Set JVM Memory

set mapreduce.map.java.opts=-xmx2500m;

Map and reduce can be as large as possible, I set the 4G here. If the resource is abundant, you can set this value to a larger number. But the bigger the better, simply by upgrading the memory to optimize the program is not recommended.

Hive SQL tuning during Data migration

Contact Us

The content source of this page is from Internet, which doesn't represent Alibaba Cloud's opinion; products and services mentioned on that page don't have any relationship with Alibaba Cloud. If the content of the page makes you feel confusing, please write us an email, we will handle the problem within 5 days after receiving your email.

If you find any instances of plagiarism from the community, please send an email to: info-contact@alibabacloud.com and provide relevant evidence. A staff member will contact you within 5 working days.

A Free Trial That Lets You Build Big!

Start building with 50+ products and up to 12 months usage for Elastic Compute Service

  • Sales Support

    1 on 1 presale consultation

  • After-Sales Support

    24/7 Technical Support 6 Free Tickets per Quarter Faster Response

  • Alibaba Cloud offers highly flexible support services tailored to meet your exact needs.