hive的multi-distinct可能帶來效能惡化

來源:互聯網
上載者:User

目前hive的版本支援multi-distinct的特性,這個在用起來比較方便,但是在此特性下面無法開啟防資料扭曲的開關(set hive.groupby.skewindata=true),防止資料扭曲的參數只在單distinct情況下會通過一個job來防止資料的傾斜。 multi-distinct使用起來方便的同時也可能會帶來效能的不最佳化,如日誌中常常統計pv,Uv,獨立ip數,獨立session數,這些都要去 重統計,如下面統計各個瀏覽器佔比的SQL,這個sql可能需要運行20到30分鐘(這個和叢集和日誌資料量相關),browser_core只有10個 數值,其reduce壓力很大,最佳化後會有50%-70%的提升

1.   原SQL

Select    

browser_core,

    count(1) as pv,

    count(distinct uniq_id) as uv,

    count(distinct client_ip) as ip_cnt,

    count(distinct session_id) as session_cnt,

    count(distinct apay_aid) as apay_aid_cnt,

    count(distinct apay_uid) as apay_uid_cnt

From dw_log

wheredt=20120101

  andpage_type='page'

  andagent is not null

  andagent <> '-'

group bybrowser_core

 

2.   改進SQL如下:

步驟(1):首先進行初步去重匯總

Create table   tmp_browser_core_ds_1 as

Select

Browser_core,

uniq_id,

client_ip,

session_id,

apay_aid,

apay_uid,

count(1) as pv

from dw_log

wheredt=20120101

  andpage_type='page'

  andagent is not null

  andagent <> '-'

group bybrowser_core,uniq_id,client_ip,session_id,apay_aid,apay_uid;

 

步驟(2):最關鍵的一步,相當於用空間來換時間。借用union all的把資料根據distinct的欄位擴充起來,假如有8個distinct,相當於資料擴充8倍,用rownumber=1來達到間接去重的目的, 如果這裡不計算整體pv的話,可以直接進行Group by效果一樣。這裡的unionall只走一個job,不會因為job多拖後腿(hadoop不怕資料量大【一定範圍內】,就怕job多和資料扭曲)。 

setmapred.reduce.tasks=300;

Create table tmp_browser_core_ds_2 as

select

     type,

     browser_core,

     type_value,

     pv,

     rownumber(type,type_value,browser_core) as rn

from (

    select

         type,

         browser_core,

         type_value,

         pv

    from (

         select

               'client_ip'as type,browser_core,client_ip as type_value,pv

         from  tmp_browser_core_ds_1

         union all

          select

               'uniq_id'as type,browser_core,uniq_id as type_value,pv

         from  tmp_browser_core_ds_1

         union all

          select

               'session_id'as type,browser_core,session_id as type_value,pv

         from tmp_st_log_browser_core_ds_1

         union all

          select

               'apay_aid'as type,browser_core,apay_aid as type_value,pv

         from  tmp_browser_core_ds_1

         union all

          select

               'apay_uid'as type,browser_core,apay_uid as type_value,pv

         from  tmp_browser_core_ds_1

    ) t

    distribute by type,type_value,browser_core

    sort by  type,type_value,browser_core

) t1;

 

 

步驟(3): 得到最終結果,沒有一個distinct,全部走的是普通sum,可以在mapper端提前彙總,會很快

select

   browser_core,

   sum(case when type='uniq_id' then pv else cast(0 as bigint) end) as pv,

   sum(case when type='client_ip' and rn=1 then 1else 0 end) ip_cnt,

   sum(case when type='uniq_id' and rn=1 then 1 else 0 end) as uv,

   sum(case when type='session_id' and rn=1 then 1 else 0 end) as session_cnt,

   sum(case when type='apay_aid' and rn=1 then 1 else 0 end) as apay_aid_cnt,

   sum(case when type='apay_uid' and rn=1 then 1 else 0 end) as apay_uid_cnt

fromtmp_st_log_browser_core_ds_2

group bybrowser_core

 

改進SQL雖然整體job數為3個,較原sql多2個job,但整體已耗用時間不超過10分鐘。基本思路,通過多job來進行multi- distinct,也可以看到rownumber的妙用,充分發揮hadoop的partition和sort的優勢。如果某個sql的multi- distinct本身很快,就不要這麼麻煩。

 

reference:

http://www.docin.com/p-150084947.html

聯繫我們

該頁面正文內容均來源於網絡整理,並不代表阿里雲官方的觀點,該頁面所提到的產品和服務也與阿里云無關,如果該頁面內容對您造成了困擾,歡迎寫郵件給我們,收到郵件我們將在5個工作日內處理。

如果您發現本社區中有涉嫌抄襲的內容,歡迎發送郵件至: info-contact@alibabacloud.com 進行舉報並提供相關證據,工作人員會在 5 個工作天內聯絡您,一經查實,本站將立刻刪除涉嫌侵權內容。

A Free Trial That Lets You Build Big!

Start building with 50+ products and up to 12 months usage for Elastic Compute Service

  • Sales Support

    1 on 1 presale consultation

  • After-Sales Support

    24/7 Technical Support 6 Free Tickets per Quarter Faster Response

  • Alibaba Cloud offers highly flexible support services tailored to meet your exact needs.