A colleague's summary of Hivesql optimized hive is to generate a string resolution that conforms to the SQL syntax that can be executed on Hadoop by M

Source: Internet
Author: User

A colleague summarizes the hive SQL optimizations

Hive is a tool that generates a string resolution that conforms to SQL syntax to generate mapreduce that can be executed on Hadoop.

Using hive to design SQL as many of the features of distributed computing as possible differs from traditional relational databases,

So we need to get rid of the original relational database development of some of the inherent thinking.

Basic principles:

1: Try to filter the data as early as possible, reduce the amount of data at each stage, partition the table, and select only the fields that need to be used

Select ... from A

Join B

On a.key = B.key

where a.userid>10

and b.userid<10

and a.dt= ' 20120417 '

and b.dt= ' 20120417 ';

Should be rewritten as:

Select .... from (select .... from A

where dt= ' 201200417 '

and userid>10

) A

Join (SELECT .... from B

where dt= ' 201200417 '

and UserID < 10

) b

on a.key = B.key;

2: Try to atomically, try to avoid a SQL containing complex logic

You can use an intermediate table to complete complex logic

drop table if exists tmp_table_1;

CREATE table if not exists tmp_table_1 as

Select ...;

drop table if exists tmp_table_2;

CREATE table if not exists tmp_table_2 as

Select ...;

drop table if exists result_table;

CREATE table if not exists result_table as

Select ...;

drop table if exists tmp_table_1;

drop table if exists tmp_table_2;

3: The number of jobs for a single SQL is controlled as much as 5 below

4: Prudent use of mapjoin, generally less than 2000 lines, the size of less than 1M (after expansion can be properly enlarged) of the table to be used, the small table should be placed on the left side of the join (TCL now many of the small table on the right side of the join).

Failure to do so can cause significant disk and memory consumption

5: Write SQL to understand the characteristics of the data itself, if there is a join, group operation, you should pay attention to whether there will be data skew

If data skew occurs, the following should be done:

Set hive.exec.reducers.max=200;

Set mapred.reduce.tasks=;---Increase the number of reduce

Set hive.groupby.mapaggr.checkinterval=100000;--This is the group key corresponding to the number of record bars above this value will be split, the value is set according to the specific data volume

Set hive.groupby.skewindata=true; --If the Group by procedure appears italic should be set to True

Set hive.skewjoin.key=100000; --This is the join key corresponding to the number of record bar more than this value will be split, the value is set according to the specific data volume

Set hive.optimize.skewjoin=true;--if it is a join process, the skew should be setting to True

6: If the number of parts of union all is greater than 2, or if the amount of data in each union is large, it should be split into multiple insert into statements, and the execution time can be increased by 50% during the actual test.

Insert Overwite Table tablename partition (dt= ...)

Select ... from (

Select ... from A

UNION ALL

Select ... from B

UNION ALL

Select ... from C

) R

where ...;

Can be rewritten as:

INSERT INTO table tablename partition (dt= ...)

Select .... from A

WHERE ...;

INSERT INTO table tablename partition (dt= ...)

Select .... from B

WHERE ...;

INSERT INTO table tablename partition (dt= ...)

Select .... from C

WHERE ...;


A colleague's summary of Hivesql optimized hive is to generate a string resolution that conforms to the SQL syntax that can be executed on Hadoop by M

Contact Us

The content source of this page is from Internet, which doesn't represent Alibaba Cloud's opinion; products and services mentioned on that page don't have any relationship with Alibaba Cloud. If the content of the page makes you feel confusing, please write us an email, we will handle the problem within 5 days after receiving your email.

If you find any instances of plagiarism from the community, please send an email to: info-contact@alibabacloud.com and provide relevant evidence. A staff member will contact you within 5 working days.

A Free Trial That Lets You Build Big!

Start building with 50+ products and up to 12 months usage for Elastic Compute Service

  • Sales Support

    1 on 1 presale consultation

  • After-Sales Support

    24/7 Technical Support 6 Free Tickets per Quarter Faster Response

  • Alibaba Cloud offers highly flexible support services tailored to meet your exact needs.