A colleague's summary of Hive SQL optimizations
Hive is a tool that parses strings conforming to SQL syntax and generates MapReduce jobs that can be executed on Hadoop.
SQL written for Hive should be designed around the characteristics of distributed computing, which differ from those of a traditional relational database,
so we need to let go of some habits inherited from relational database development.
Basic principles:
1: Filter data as early as possible to reduce the amount of data processed at each stage: use table partitions and select only the columns that are actually needed.
SELECT ... FROM A
JOIN B
ON a.key = b.key
WHERE a.userid > 10
  AND b.userid < 10
  AND a.dt = '20120417'
  AND b.dt = '20120417';
Should be rewritten as:
SELECT ... FROM (SELECT ... FROM A
                 WHERE dt = '20120417'
                   AND userid > 10
                ) a
JOIN (SELECT ... FROM B
      WHERE dt = '20120417'
        AND userid < 10
     ) b
ON a.key = b.key;
2: Keep each SQL statement as atomic as possible and avoid packing complex logic into a single statement.
Intermediate tables can be used to stage the complex logic:
DROP TABLE IF EXISTS tmp_table_1;
CREATE TABLE IF NOT EXISTS tmp_table_1 AS
SELECT ...;

DROP TABLE IF EXISTS tmp_table_2;
CREATE TABLE IF NOT EXISTS tmp_table_2 AS
SELECT ...;

DROP TABLE IF EXISTS result_table;
CREATE TABLE IF NOT EXISTS result_table AS
SELECT ...;

DROP TABLE IF EXISTS tmp_table_1;
DROP TABLE IF EXISTS tmp_table_2;
3: Keep the number of MapReduce jobs generated by a single SQL statement to 5 or fewer where possible.
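A minimal sketch of how to check this (the column names below are hypothetical): Hive's EXPLAIN statement prints the stage plan for a query, and each MapReduce stage in that plan roughly corresponds to one job.
-- Print the stage plan; count the MapReduce stages to estimate the number of jobs
EXPLAIN
SELECT a.key, count(*)
FROM A a JOIN B b ON a.key = b.key
GROUP BY a.key;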
4: Use map joins prudently. They are generally only suitable when the small table has fewer than about 2,000 rows and is under about 1 MB after expansion (the limit can be raised somewhat), and the small table should be placed on the left side of the join (in TCL many small tables currently sit on the right side of the join).
Getting this wrong can cause heavy disk and memory consumption.
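A minimal sketch of the map join hint (the table and column names fact_large and dim_small are hypothetical): the /*+ MAPJOIN(...) */ hint asks Hive to load the named small table into memory and perform the join on the map side, avoiding a shuffle.
-- Load the small table d into memory and join it map-side against the large table
SELECT /*+ MAPJOIN(d) */ f.key, f.amount, d.name
FROM fact_large f
JOIN dim_small d
ON f.key = d.key;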
5: Understand the characteristics of the data itself before writing SQL. If the query contains join or group-by operations, consider whether data skew can occur.
If data skew occurs, apply the following settings:
set hive.exec.reducers.max=200;
set mapred.reduce.tasks= ...; -- increase the number of reducers
set hive.groupby.mapaggr.checkinterval=100000; -- if a group-by key has more records than this, the work is split; tune according to the actual data volume
set hive.groupby.skewindata=true; -- set to true if the group-by stage is skewed
set hive.skewjoin.key=100000; -- if a join key has more records than this, the work is split; tune according to the actual data volume
set hive.optimize.skewjoin=true; -- set to true if the join stage is skewed
6: If a UNION ALL has more than two branches, or the data volume in each branch is large, split it into multiple INSERT INTO statements; in actual tests this improved execution time by about 50%.
INSERT OVERWRITE TABLE tablename PARTITION (dt= ...)
SELECT ... FROM (
    SELECT ... FROM A
    UNION ALL
    SELECT ... FROM B
    UNION ALL
    SELECT ... FROM C
) r
WHERE ...;
Can be rewritten as:
INSERT INTO TABLE tablename PARTITION (dt= ...)
SELECT ... FROM A
WHERE ...;
INSERT INTO TABLE tablename PARTITION (dt= ...)
SELECT ... FROM B
WHERE ...;
INSERT INTO TABLE tablename PARTITION (dt= ...)
SELECT ... FROM C
WHERE ...;