1. How Hive determines the number of reducers on its own:
The reducer count greatly affects task execution efficiency. If the number of reducers is not specified, Hive estimates it from the following two settings:
hive.exec.reducers.bytes.per.reducer (the amount of data handled by each reduce task; default is 1000^3 = 1G)
hive.exec.reducers.max (the maximum number of reducers per job; default is 999)
The formula for the number of reducers is simple: N = min(parameter 2, total input data volume / parameter 1)
That is, if the total input to the reduce stage (the map output) does not exceed 1G, there will be only one reduce task.
For example: select pt, count(1) from popt_tbaccountcopy_mes where pt = '2012-07-04' group by pt;
The total size of /group/p_sdo_data/p_sdo_data_etl/pt/popt_tbaccountcopy_mes/pt=2012-07-04 is a bit more than 9G, so this query gets 10 reducers.
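The estimate above can be sketched as follows. This is an illustrative helper, not Hive's actual code, and the 9.8 GB input size is a hypothetical figure chosen to match the "a bit more than 9G" in the example:

```python
import math

def estimate_reducers(total_input_bytes, bytes_per_reducer=1000**3, max_reducers=999):
    """Sketch of Hive's estimate: N = min(max_reducers, ceil(total / bytes_per_reducer))."""
    n = math.ceil(total_input_bytes / bytes_per_reducer)
    return max(1, min(max_reducers, n))

# Hypothetical input size, "a bit more than 9G":
total = int(9.8 * 1000**3)

print(estimate_reducers(total))                               # default 1G per reducer -> 10
print(estimate_reducers(total, bytes_per_reducer=500000000))  # 500M per reducer -> 20
```

The second call previews the tuning knob from the next section: shrinking bytes-per-reducer raises the reducer count for the same input.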
2. Adjusting the number of reducers, method one:
Adjust the value of the hive.exec.reducers.bytes.per.reducer parameter:
set hive.exec.reducers.bytes.per.reducer=500000000; -- 500M
select pt, count(1) from popt_tbaccountcopy_mes where pt = '2012-07-04' group by pt; -- now there are 20 reducers
3. Adjusting the number of reducers, method two:
set mapred.reduce.tasks = 15;
select pt, count(1) from popt_tbaccountcopy_mes where pt = '2012-07-04' group by pt; -- now there are 15 reducers
4. More reducers is not always better.
As with mappers, starting and initializing reducers consumes time and resources.
In addition, there will be as many output files as there are reducers; if many small files are generated and then used as input to the next task, you run into the too-many-small-files problem.
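The small-files effect is simple arithmetic: each reducer writes one output file, so the average file size is roughly total output divided by reducer count. A quick sketch with hypothetical numbers:

```python
def avg_output_file_mb(total_output_bytes, num_reducers):
    """Each reducer writes its own output file; more reducers means smaller files."""
    return total_output_bytes / num_reducers / (1024 * 1024)

total = 2 * 1024**3  # hypothetical 2 GB of reduce output

print(avg_output_file_mb(total, 16))    # 128.0 MB per file: about one HDFS block, fine
print(avg_output_file_mb(total, 1000))  # ~2 MB per file: a small-files problem downstream
```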
5. Cases in which there is only one reducer.
You will often find that no matter how much data the task processes, and no matter whether you have set parameters to adjust the number of reducers, the task always has a single reduce task.
Besides the case where the data volume is below the hive.exec.reducers.bytes.per.reducer threshold, the reasons are:
a) Aggregation without a group by, such as rewriting select pt, count(1) from popt_tbaccountcopy_mes where pt = '2012-07-04' group by pt; as select count(1) from popt_tbaccountcopy_mes where pt = '2012-07-04';
This is very common, and I hope you try to rewrite such queries.
b) order by was used
c) a Cartesian product is present
In these cases, apart from finding a way to work around and avoid them, I have no good solution, because these operations are global, so Hadoop has to complete them with a single reducer.
Likewise, two principles must be weighed when setting the number of reducers: give a large data volume an appropriate number of reducers, and give each single reduce task an appropriate amount of data to handle.
Hive is a tool that parses strings conforming to SQL syntax and generates MapReduce jobs that can run on Hadoop.
SQL written for Hive should exploit the characteristics of distributed computing as much as possible, which differs from traditional relational databases,
so we need to shed some habits ingrained from relational database development.
Basic principles:
1 : Filter data as early as possible to reduce the amount of data at each stage: use partition pruning on partitioned tables, and select only the columns you need.
select ... from A
join B
on a.key = b.key
where a.userid > 10
and b.userid < 10
and a.dt = '20120417'
and b.dt = '20120417';
Should be rewritten as:
select ... from (select ... from A
where dt = '20120417'
and userid > 10
) a
join (select ... from B
where dt = '20120417'
and userid < 10
) b
on a.key = b.key;
2 : Keep each SQL statement as atomic as possible; avoid having one SQL statement contain complex logic.
You can use intermediate tables to carry the complex logic:
drop table if exists tmp_table_1;
create table if not exists tmp_table_1 as
select ...;
drop table if exists tmp_table_2;
create table if not exists tmp_table_2 as
select ...;
drop table if exists result_table;
create table if not exists result_table as
select ...;
drop table if exists tmp_table_1;
drop table if exists tmp_table_2;
3 : Keep the number of jobs generated by a single SQL statement under control, preferably 5 or fewer.
4 : Use mapjoin carefully. It is generally for tables of fewer than 2000 rows and under 1M in size (this limit can be raised somewhat after expansion), and note that the small table should be on the left side of the join (at TCL, many SQL statements currently put the small table on the right side of the join).
Failing to do so can cause heavy disk and memory consumption.
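The idea behind mapjoin can be sketched as a hash join: the small table is materialized in memory and the large table is streamed past it, with no shuffle or reduce phase. This is why the small side must fit comfortably in each mapper's memory. Illustrative Python, not Hive internals:

```python
def map_side_join(small_table, large_table):
    """Build a hash table from the small side, then stream the large side past it.
    No shuffle is needed, but the entire small side must fit in memory."""
    lookup = {}
    for key, value in small_table:          # small side fully materialized
        lookup.setdefault(key, []).append(value)
    for key, value in large_table:          # large side streamed row by row
        for small_value in lookup.get(key, []):
            yield (key, small_value, value)

small = [(1, "a"), (2, "b")]
large = [(1, "x"), (1, "y"), (3, "z")]
print(list(map_side_join(small, large)))  # [(1, 'a', 'x'), (1, 'a', 'y')]
```

If the "small" table is actually large, the in-memory lookup table blows up, which is the disk and memory cost the text warns about.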
5 : When writing SQL, understand the characteristics of the data itself; if there are join or group operations, watch out for data skew.
If data skew occurs, apply the following settings:
set hive.exec.reducers.max=200;
set mapred.reduce.tasks=...; -- increase the number of reducers (choose a value for your data)
set hive.groupby.mapaggr.checkinterval=100000; -- if the number of records for a group by key exceeds this value, the work is split; set it according to the actual data volume
set hive.groupby.skewindata=true; -- set to true if skew appears during the group by
set hive.skewjoin.key=100000; -- if the number of records for a join key exceeds this value, the work is split; set it according to the actual data volume
set hive.optimize.skewjoin=true; -- set to true if skew appears during the join
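Roughly, what hive.groupby.skewindata=true does is a two-stage aggregation: rows are first spread with a random component so a hot key does not all land on one reducer, partially aggregated, and then the partials are merged by the real key. A minimal sketch of that idea (illustrative, not Hive's implementation):

```python
import random
from collections import Counter

def skewed_group_count(rows, num_partitions=4):
    """Two-stage count: salt the key to spread a hot key across several
    partial aggregations, then merge the partials by the real key."""
    partials = [Counter() for _ in range(num_partitions)]
    for key in rows:                        # stage 1: partial aggregation per random partition
        salt = random.randrange(num_partitions)
        partials[salt][key] += 1
    final = Counter()                       # stage 2: merge partials by the real key
    for partial in partials:
        final.update(partial)
    return dict(final)

rows = ["hot"] * 1000 + ["cold"] * 3        # one heavily skewed key
print(skewed_group_count(rows))             # counts: hot=1000, cold=3
```

In stage 1 no single partition ever sees all 1000 "hot" rows, which is exactly the load-balancing effect the setting buys at the cost of a second aggregation round.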
6 : If a union all has more than 2 branches, or each branch has a large data volume, it should be split into multiple insert into statements; in actual tests, execution time improved by 50%.
insert overwrite table tablename partition (dt= ...)
select ... from (
select ... from A
union all
select ... from B
union all
select ... from C
) r
where ...;
Can be rewritten as:
insert into table tablename partition (dt= ...)
select ... from A
where ...;
insert into table tablename partition (dt= ...)
select ... from B
where ...;
insert into table tablename partition (dt= ...)
select ... from C
where ...;
For more highlights, please follow: http://bbs.superwu.cn
Hive Optimization: Controlling the number of reducers for Hive tasks