Hive 令人頭痛的multi-distinct

最後更新：2014-06-15 來源：互聯網

上載者：User

創建阿里雲帳戶，並獲得超過 40 款產品的免費試用版；而企業帳戶則可以享有總值 $1200 的免費試用版。立即註冊！

線上一個查詢簡化如下：
Selectdt,count(distinct c1) , count(distinct case when c2>0 and c1=0 then c1 end),count(distinct case when c2>0 and c1>0 then c1 end) from t where dtbetween ‘20131108’ and ‘20131110’ group by dt;
一個讓人頭痛的multi-distinct問題，為什麼說很頭痛，看看執行計畫就清楚了：

ABSTRACTSYNTAX TREE:
(TOK_QUERY (TOK_FROM (TOK_TABREF (TOK_TABNAMEt))) (TOK_INSERT (TOK_DESTINATION (TOK_DIR TOK_TMP_FILE)) (TOK_SELECT(TOK_SELEXPR (TOK_TABLE_OR_COL dt)) (TOK_SELEXPR (TOK_FUNCTIONDI count(TOK_TABLE_OR_COL c1))) (TOK_SELEXPR (TOK_FUNCTIONDI count (TOK_FUNCTION when(and (> (TOK_TABLE_OR_COL c2) 0) (= (TOK_TABLE_OR_COL c1) 0))(TOK_TABLE_OR_COL c1)))) (TOK_SELEXPR (TOK_FUNCTIONDI count (TOK_FUNCTION when(and (> (TOK_TABLE_OR_COL c2) 0) (> (TOK_TABLE_OR_COL c1) 0))(TOK_TABLE_OR_COL c1))))) (TOK_WHERE (TOK_FUNCTION between KW_FALSE(TOK_TABLE_OR_COL dt) '20131108' '20131110')) (TOK_GROUPBY (TOK_TABLE_OR_COLdt))))
STAGEDEPENDENCIES:
Stage-1 is a root stage
Stage-0 is a root stage
STAGEPLANS:
Stage: Stage-1
Map Reduce
Alias -> Map Operator Tree:
t
TableScan
alias: t
Filter Operator
predicate:
expr: dt BETWEEN '20131108'AND '20131110'
type: Boolean
//通過select operator做投影
Select Operator
expressions:
expr: dt
type: string
expr: c1
type: int
expr: c2
type: int
outputColumnNames: dt, c1, c2
//在MAP端進行簡單的彙總，雷區1：假設有N個distinct，MAP處理資料有M條，那麼這部處理後的輸出是N*M條資料，因為MAP會對dt,keys[i]做彙總操作，所以盡量在MAP端過濾儘可能多的資料
Group By Operator
aggregations:
expr: count(DISTINCTc1)
expr: count(DISTINCTCASE WHEN (((c2 > 0) and (c1 = 0))) THEN (c1) END)
expr: count(DISTINCTCASE WHEN (((c2 > 0) and (c1 > 0))) THEN (c1) END)
bucketGroup: false
keys:
expr: dt
type: string
expr: c1
type: int
expr: CASE WHEN (((c2> 0) and (c1 = 0))) THEN (c1) END
type: int
expr: CASE WHEN (((c2> 0) and (c1 > 0))) THEN (c1) END
type: int
mode: hash
outputColumnNames: _col0,_col1, _col2, _col3, _col4, _col5, _col6
//雷區2：在做Reduce Sink時是根據partition cplumns進行HASH的方式，那麼對於按date分區的表來說一天的所有資料被放大N倍傳輸到Reducer進行運算，導致效能長尾或者OOME.
Reduce Output Operator
key expressions:
expr: _col0
type: string
expr: _col1
type: int
expr: _col2
type: int
expr: _col3
type: int
sort order: ++++
Map-reduce partitioncolumns:
expr: _col0
type: string
tag: -1
value expressions:
expr: _col4
type: bigint
expr: _col5
type: bigint
expr: _col6
type: bigint
Reduce Operator Tree:
Group By Operator
aggregations:
expr: count(DISTINCTKEY._col1:0._col0)
expr: count(DISTINCTKEY._col1:1._col0)
expr: count(DISTINCTKEY._col1:2._col0)
bucketGroup: false
keys:
expr: KEY._col0
type: string
mode: mergepartial
outputColumnNames: _col0, _col1,_col2, _col3
Select Operator
expressions:
expr: _col0
type: string
expr: _col1
type: bigint
expr: _col2
type: bigint
expr: _col3
type: bigint
outputColumnNames: _col0, _col1,_col2, _col3
File Output Operator
compressed: true
GlobalTableId: 0
table:
input format:org.apache.Hadoop.mapred.TextInputFormat
output format:org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
Stage: Stage-0
Fetch Operator
limit: -1

雖說Hive裡有參數：hive.groupby.skewindata（具體見：）不過當設定該參數時只支援single-distinct（https://issues.apache.org/jira/browse/HIVE-2416）因此在這種情境下是沒辦法設定的.但是這個參數還是有一定啟發的就是把SQL化歸到這種single-distinct:通過union all(注意不能直接Union All而是需要嵌套進子查詢，否則會報異常：Toplevel UNION is not supported currently; use a subquery for the UNION).

查看執行計畫（省去非關鍵區段）：
STAGE DEPENDENCIES:
Stage-1 is a root stage
Stage-2 depends on stages:Stage-1, Stage-3, Stage-4
Stage-3 is a root stage
Stage-4 is a root stage
Stage-0 is a root stage

可以看到每個single-distinct都是獨立的stage，因此可以設定上面的參數，這裡既然每個stage是獨立的那麼是不是可以設定hive.exec.parallel，hive.exec.parallel.thread.number這兩個參數來以資源換時間呢？故事總是殘忍的，這裡還有一個Bug(https://issues.apache.org/jira/browse/HIVE-4436), 因此在Hive0.12 release前是沒辦法的，這就叫有錢沒地方花.

另外也有通過unionall+sum的解決方案，感興趣的同學也可以嘗試一下.

Hive 的詳細介紹：請點這裡
Hive 的下載地��：請點這裡

基於Hadoop叢集的Hive安裝

Hive內表和外表的區別

Hadoop + Hive + Map +reduce 叢集安裝部署

Hive本地獨立模式安裝

Hive學習之WordCount單詞統計

本文章原先以中文撰寫並發佈於 aliyun.com，亦設英文版本，僅作資訊用途。本網站不對文章的準確性，完整性或可靠性或其任何翻譯作出任何明示或暗示的陳述或保證。如對該文章有任何疑慮或投訴，請傳送電郵至 info-contact@alibabacloud.com 並提供相關疑慮或投訴的詳細說明。職員會於 5 個工作天內與您聯絡，一經驗證之後，即會刪除該侵權內容。

A Free Trial That Lets You Build Big!

Start building with 50+ products and up to 12 months usage for Elastic Compute Service

Get Started for Free

Sales Support

1 on 1 presale consultation

Chat Contact Sales
After-Sales Support

24/7 Technical Support 6 Free Tickets per Quarter Faster Response

Open a Ticket
Alibaba Cloud offers highly flexible support services tailored to meet your exact needs.

Learn More