Hive中表的關聯順序對產生MapReduce作業數的影響案例

最後更新：2014-06-15 來源：互聯網

上載者：User

創建阿里雲帳戶，並獲得超過 40 款產品的免費試用版；而企業帳戶則可以享有總值 $1200 的免費試用版。立即註冊！

Hive中在做多表關聯時,由於Hive的SQL最佳化引擎還不夠強大,表的關聯順序不同往往導致產生不同數量的MapReduce作業數。這時就需要通過分析執行計畫對SQL進行調整，以獲得最少的MapReduce作業數。舉一個例子（案例基於Hive 0.6.0）：

create table ljn1(

k1 bigint,

k2 String,

v1 int

);

create table ljn2(

k1 bigint,

v2 int

);

create table ljn3(

k1 bigint,

v3 int

);

create table ljn4(

k1 bigint,

v4 int

);

create table ljn5(

k1 bigint,

v5 int

);

create table ljn6(

k2 string,

v6 int

);

然後看一下下面這個SQL的執行計畫:

explain

select a.v1

from

ljn1 a

left outer join ljn2 b on (a.k1 = b.k1)

left outer join ljn3 c on (a.k1 = c.k1)

left outer join ljn4 d on (a.k1 = d.k1)

left outer join ljn6 e on (a.k2 = e.k2)

left outer join ljn5 f on (a.k1 = f.k1);

STAGE DEPENDENCIES:

Stage-5 is a root stage

Stage-1 depends on stages: Stage-5

Stage-2 depends on stages: Stage-1

Stage-0 is a root stage

STAGE PLANS:

Stage: Stage-5

Map Reduce

Alias -> Map Operator Tree:

TableScan

alias: a

Reduce Output Operator

key expressions:

expr: k1

type: bigint

sort order: +

Map-reduce partition columns:

expr: k1

type: bigint

tag: 0

value expressions:

expr: k1

type: bigint

expr: k2

type: string

expr: v1

type: int

TableScan

alias: b

Reduce Output Operator

key expressions:

expr: k1

type: bigint

sort order: +

Map-reduce partition columns:

expr: k1

type: bigint

tag: 1

Reduce Operator Tree:

Join Operator

condition map:

Left Outer Join0 to 1

condition expressions:

0 {VALUE._col0} {VALUE._col1} {VALUE._col2}

handleSkewJoin: false

outputColumnNames: _col0, _col1, _col2

File Output Operator

compressed: true

GlobalTableId: 0

table:

input format: org.apache.Hadoop.mapred.SequenceFileInputFormat

output format: org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat

Stage: Stage-1

Map Reduce

Alias -> Map Operator Tree:

$INTNAME

Reduce Output Operator

key expressions:

expr: _col0

type: bigint

sort order: +

Map-reduce partition columns:

expr: _col0

type: bigint

tag: 0

value expressions:

expr: _col1

type: string

expr: _col2

type: int

TableScan

alias: c

Reduce Output Operator

key expressions:

expr: k1

type: bigint

sort order: +

Map-reduce partition columns:

expr: k1

type: bigint

tag: 1

TableScan

alias: d

Reduce Output Operator

key expressions:

expr: k1

type: bigint

sort order: +

Map-reduce partition columns:

expr: k1

type: bigint

tag: 2

TableScan

alias: f

Reduce Output Operator

key expressions:

expr: k1

type: bigint

sort order: +

Map-reduce partition columns:

expr: k1

type: bigint

tag: 3

Reduce Operator Tree:

Join Operator

condition map:

Left Outer Join0 to 1

Left Outer Join0 to 2

Left Outer Join0 to 3

condition expressions:

0 {VALUE._col3} {VALUE._col4}

handleSkewJoin: false

outputColumnNames: _col3, _col4

File Output Operator

compressed: true

GlobalTableId: 0

table:

input format: org.apache.hadoop.mapred.SequenceFileInputFormat

output format: org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat

Stage: Stage-2

Map Reduce

Alias -> Map Operator Tree:

$INTNAME

Reduce Output Operator

key expressions:

expr: _col3

type: string

sort order: +

Map-reduce partition columns:

expr: _col3

type: string

tag: 0

value expressions:

expr: _col4

type: int

TableScan

alias: e

Reduce Output Operator

key expressions:

expr: k2

type: string

sort order: +

Map-reduce partition columns:

expr: k2

type: string

tag: 1

Reduce Operator Tree:

Join Operator

condition map:

Left Outer Join0 to 1

condition expressions:

0 {VALUE._col10}

handleSkewJoin: false

outputColumnNames: _col10

Select Operator

expressions:

expr: _col10

type: int

outputColumnNames: _col0

File Output Operator

compressed: true

GlobalTableId: 0

table:

input format: org.apache.hadoop.mapred.TextInputFormat

output format: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat

Stage: Stage-0

Fetch Operator

limit: -1

常規來講,這個SQL非常簡單,a表是主表，與其他表左外關聯用到了k1和k2兩個關聯鍵，使用兩個MapReduce作業完全可以搞定。但是這個SQL的執行計畫卻給出了3個作業：（Stage-0用做資料的最終展示，該作業可以忽略不計）第1個作業（Stage-5）是a表與b表關聯；第2個作業（Stage-1）是第1個作業的中間結果再與c、d、f三表關聯；第3個作業（Stage-2）是第2個作業的中間結果再與e表關聯。

更多詳情見請繼續閱讀下一頁的精彩內容：

Hive 的詳細介紹：請點這裡
Hive 的：請點這裡

基於Hadoop叢集的Hive安裝

Hive內表和外表的區別

Hadoop + Hive + Map +reduce 叢集安裝部署

Hive本地獨立模式安裝

Hive學習之WordCount單詞統計

本文章原先以中文撰寫並發佈於 aliyun.com，亦設英文版本，僅作資訊用途。本網站不對文章的準確性，完整性或可靠性或其任何翻譯作出任何明示或暗示的陳述或保證。如對該文章有任何疑慮或投訴，請傳送電郵至 info-contact@alibabacloud.com 並提供相關疑慮或投訴的詳細說明。職員會於 5 個工作天內與您聯絡，一經驗證之後，即會刪除該侵權內容。

A Free Trial That Lets You Build Big!

Start building with 50+ products and up to 12 months usage for Elastic Compute Service

Get Started for Free

Sales Support

1 on 1 presale consultation

Chat Contact Sales
After-Sales Support

24/7 Technical Support 6 Free Tickets per Quarter Faster Response

Open a Ticket
Alibaba Cloud offers highly flexible support services tailored to meet your exact needs.

Learn More