A colleague's summary of Hive SQL optimizations
Hive is a tool that parses strings conforming to SQL syntax and generates MapReduce jobs that can be executed on Hadoop.
SQL written for Hive should be designed around the characteristics of distributed computing, which differ from those of a traditional relational database,
so we need to let go of some habits inherited from relational database development.
Basic principles:
1: Filter data as early as possible to reduce the amount of data processed at each stage: use table partitions and select only the columns that are actually needed.
SELECT ... FROM A
JOIN B
ON a.key = b.key
WHERE a.userid > 10
  AND b.userid < 10
  AND a.dt = '20120417'
  AND b.dt = '20120417';
Should be rewritten as:
SELECT ... FROM (SELECT ... FROM A
                 WHERE dt = '20120417'
                   AND userid > 10
                ) a
JOIN (SELECT ... FROM B
      WHERE dt = '20120417'
        AND userid < 10
     ) b
ON a.key = b.key;
2: Keep each SQL statement as atomic as possible and avoid packing complex logic into a single statement.
Intermediate tables can be used to stage the complex logic:
DROP TABLE IF EXISTS tmp_table_1;
CREATE TABLE IF NOT EXISTS tmp_table_1 AS
SELECT ...;

DROP TABLE IF EXISTS tmp_table_2;
CREATE TABLE IF NOT EXISTS tmp_table_2 AS
SELECT ...;

DROP TABLE IF EXISTS result_table;
CREATE TABLE IF NOT EXISTS result_table AS
SELECT ...;

DROP TABLE IF EXISTS tmp_table_1;
DROP TABLE IF EXISTS tmp_table_2;
3: Keep the number of MapReduce jobs generated by a single SQL statement to 5 or fewer where possible.
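A minimal sketch of how to check this (the column names below are hypothetical): Hive's EXPLAIN statement prints the stage plan for a query, and each MapReduce stage in that plan roughly corresponds to one job.
-- Print the stage plan; count the MapReduce stages to estimate the number of jobs
EXPLAIN
SELECT a.key, count(*)
FROM A a JOIN B b ON a.key = b.key
GROUP BY a.key;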
4: Use map joins prudently. They are generally only suitable when the small table has fewer than about 2,000 rows and is under about 1 MB after expansion (the limit can be raised somewhat), and the small table should be placed on the left side of the join (in TCL many small tables currently sit on the right side of the join).
Getting this wrong can cause heavy disk and memory consumption.
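A minimal sketch of the map join hint (the table and column names fact_large and dim_small are hypothetical): the /*+ MAPJOIN(...) */ hint asks Hive to load the named small table into memory and perform the join on the map side, avoiding a shuffle.
-- Load the small table d into memory and join it map-side against the large table
SELECT /*+ MAPJOIN(d) */ f.key, f.amount, d.name
FROM fact_large f
JOIN dim_small d
ON f.key = d.key;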
5: Understand the characteristics of the data itself before writing SQL. If the query contains join or group-by operations, consider whether data skew can occur.
If data skew occurs, apply the following settings:
set hive.exec.reducers.max=200;
set mapred.reduce.tasks= ...; -- increase the number of reducers
set hive.groupby.mapaggr.checkinterval=100000; -- if a group-by key has more records than this, the work is split; tune according to the actual data volume
set hive.groupby.skewindata=true; -- set to true if the group-by stage is skewed
set hive.skewjoin.key=100000; -- if a join key has more records than this, the work is split; tune according to the actual data volume
set hive.optimize.skewjoin=true; -- set to true if the join stage is skewed
6: If a UNION ALL has more than two branches, or the data volume in each branch is large, split it into multiple INSERT INTO statements; in actual tests this improved execution time by about 50%.
INSERT OVERWRITE TABLE tablename PARTITION (dt= ...)
SELECT ... FROM (
    SELECT ... FROM A
    UNION ALL
    SELECT ... FROM B
    UNION ALL
    SELECT ... FROM C
) r
WHERE ...;
Can be rewritten as:
INSERT INTO TABLE tablename PARTITION (dt= ...)
SELECT ... FROM A
WHERE ...;
INSERT INTO TABLE tablename PARTITION (dt= ...)
SELECT ... FROM B
WHERE ...;
INSERT INTO TABLE tablename PARTITION (dt= ...)
SELECT ... FROM C
WHERE ...;