Hive中在做多表關聯時,由於Hive的SQL最佳化引擎還不夠強大,表的關聯順序不同往往導致產生不同數量的MapReduce作業數。這時就需要通過分析執行計畫對SQL進行調整,以獲得最少的MapReduce作業數。舉一個例子(案例基於Hive 0.6.0):
create table ljn1(
k1 bigint,
k2 String,
v1 int
);
create table ljn2(
k1 bigint,
v2 int
);
create table ljn3(
k1 bigint,
v3 int
);
create table ljn4(
k1 bigint,
v4 int
);
create table ljn5(
k1 bigint,
v5 int
);
create table ljn6(
k2 string,
v6 int
);
然後看一下下面這個SQL的執行計畫:
explain
select a.v1
from
ljn1 a
left outer join ljn2 b on (a.k1 = b.k1)
left outer join ljn3 c on (a.k1 = c.k1)
left outer join ljn4 d on (a.k1 = d.k1)
left outer join ljn6 e on (a.k2 = e.k2)
left outer join ljn5 f on (a.k1 = f.k1);
STAGE DEPENDENCIES:
Stage-5 is a root stage
Stage-1 depends on stages: Stage-5
Stage-2 depends on stages: Stage-1
Stage-0 is a root stage
STAGE PLANS:
Stage: Stage-5
Map Reduce
Alias -> Map Operator Tree:
a
TableScan
alias: a
Reduce Output Operator
key expressions:
expr: k1
type: bigint
sort order: +
Map-reduce partition columns:
expr: k1
type: bigint
tag: 0
value expressions:
expr: k1
type: bigint
expr: k2
type: string
expr: v1
type: int
b
TableScan
alias: b
Reduce Output Operator
key expressions:
expr: k1
type: bigint
sort order: +
Map-reduce partition columns:
expr: k1
type: bigint
tag: 1
Reduce Operator Tree:
Join Operator
condition map:
Left Outer Join0 to 1
condition expressions:
0 {VALUE._col0} {VALUE._col1} {VALUE._col2}
1
handleSkewJoin: false
outputColumnNames: _col0, _col1, _col2
File Output Operator
compressed: true
GlobalTableId: 0
table:
input format: org.apache.Hadoop.mapred.SequenceFileInputFormat
output format: org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat
Stage: Stage-1
Map Reduce
Alias -> Map Operator Tree:
$INTNAME
Reduce Output Operator
key expressions:
expr: _col0
type: bigint
sort order: +
Map-reduce partition columns:
expr: _col0
type: bigint
tag: 0
value expressions:
expr: _col1
type: string
expr: _col2
type: int
c
TableScan
alias: c
Reduce Output Operator
key expressions:
expr: k1
type: bigint
sort order: +
Map-reduce partition columns:
expr: k1
type: bigint
tag: 1
d
TableScan
alias: d
Reduce Output Operator
key expressions:
expr: k1
type: bigint
sort order: +
Map-reduce partition columns:
expr: k1
type: bigint
tag: 2
f
TableScan
alias: f
Reduce Output Operator
key expressions:
expr: k1
type: bigint
sort order: +
Map-reduce partition columns:
expr: k1
type: bigint
tag: 3
Reduce Operator Tree:
Join Operator
condition map:
Left Outer Join0 to 1
Left Outer Join0 to 2
Left Outer Join0 to 3
condition expressions:
0 {VALUE._col3} {VALUE._col4}
1
2
3
handleSkewJoin: false
outputColumnNames: _col3, _col4
File Output Operator
compressed: true
GlobalTableId: 0
table:
input format: org.apache.hadoop.mapred.SequenceFileInputFormat
output format: org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat
Stage: Stage-2
Map Reduce
Alias -> Map Operator Tree:
$INTNAME
Reduce Output Operator
key expressions:
expr: _col3
type: string
sort order: +
Map-reduce partition columns:
expr: _col3
type: string
tag: 0
value expressions:
expr: _col4
type: int
e
TableScan
alias: e
Reduce Output Operator
key expressions:
expr: k2
type: string
sort order: +
Map-reduce partition columns:
expr: k2
type: string
tag: 1
Reduce Operator Tree:
Join Operator
condition map:
Left Outer Join0 to 1
condition expressions:
0 {VALUE._col10}
1
handleSkewJoin: false
outputColumnNames: _col10
Select Operator
expressions:
expr: _col10
type: int
outputColumnNames: _col0
File Output Operator
compressed: true
GlobalTableId: 0
table:
input format: org.apache.hadoop.mapred.TextInputFormat
output format: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
Stage: Stage-0
Fetch Operator
limit: -1
常規來講,這個SQL非常簡單,a表是主表,與其他表左外關聯用到了k1和k2兩個關聯鍵,使用兩個MapReduce作業完全可以搞定。但是這個SQL的執行計畫卻給出了3個作業:(Stage-0用做資料的最終展示,該作業可以忽略不計)第1個作業(Stage-5)是a表與b表關聯;第2個作業(Stage-1)是第1個作業的中間結果再與c、d、f三表關聯;第3個作業(Stage-2)是第2個作業的中間結果再與e表關聯。
更多詳情見請繼續閱讀下一頁的精彩內容:
Hive 的詳細介紹:請點這裡
Hive 的:請點這裡
基於Hadoop叢集的Hive安裝
Hive內表和外表的區別
Hadoop + Hive + Map +reduce 叢集安裝部署
Hive本地獨立模式安裝
Hive學習之WordCount單詞統計