Hive 處理count distinct 產生資料扭曲處理

來源:互聯網
上載者:User

Hive 處理count distinct 產生資料扭曲處理
問題描述

問題資料扭曲範疇,但是又不可以在Map端join,剔除特殊Key等方法進行處理。

set hive.groupby.skewindata=true;insert overwrite table ad_overall_day partition(part_time='99', part_date='2015-11-99') select account_id, nvl(client_id,-1), nvl(track_id,'total'), sum(if(type=3,1,0)) as imp_cnt, sum(if(type=4,1,0)) as click_cnt, count(distinct if(type=3,zid,NULL)) as imp_uv,count(distinct if(type=4,zid,NULL)) as click_uv from derived_di_v3 where year='2015' and month='11' group by account_id, client_id, track_id grouping sets((account_id, client_id, track_id),(account_id, client_id),(account_id))   

But error find.
FAILED: SemanticException [Error 10022]: DISTINCT on different columns not supported with skew in data

separate group-by and join the results.
set hive.groupby.skewindata=true;set hive.exec.parallel=true;insert overwrite table ad_overall_day partition(part_time='99', part_date='2015-11-99') SELECT COALESCE(t1.account_id,t2.account_id),COALESCE(t1.client_id,t2.client_id),    COALESCE(t1.track_id,t2.track_id),t1.imp_cnt,t1.imp_uv,t2.click_cnt,t2.click_uvFROM(select account_id, nvl(client_id,-1) as client_id, nvl(track_id,'total') as track_id, sum(if(type=3,1,0)) as imp_cnt, count(distinct if(type=3,zid,NULL)) as imp_uvFROM derived_di_v3 where year='2015' and month='11' group by account_id, client_id, track_id grouping sets((account_id, client_id, track_id),(account_id, client_id),(account_id)) ) t1FULL OUTER JOIN(select account_id, nvl(client_id,-1) as client_id, nvl(track_id,'total') as track_id, sum(if(type=4,1,0)) as click_cnt, count(distinct if(type=4,zid,NULL)) as click_uvFROM derived_di_v3 where year='2015' and month='11' group by account_id, client_id, track_id grouping sets((account_id, client_id, track_id),(account_id, client_id),(account_id)) ) t2  ON t1.account_id = t2.account_id and t1.client_id = t2.client_id and t1.track_id = t2.track_id;
Cann’t run two MapReduce Job

Unfortunately, hive doesn’t explain the hql to two mapreduce job.
The parameter hive.groupby.skewindata seems has no affect.

Change the hql :

insert overwrite table ad_overall_day partition(part_time='99', part_date='2015-11-99')select account_id, nvl(client_id,-1) as client_id, nvl(track_id,'total') as track_id,     sum(imp1) as imp_cnt,count(imp2) as imp_uv,sum(click1) as click_cnt,count(click2) as click_uvFROM (select account_id, client_id, track_id,     if(type=3,1,0) as imp1,if(type=3,zid,NULL) as imp2,     if(type=4,1,0) as click1,if(type=4,zid,NULL) as click2    FROM dmp.derived_di_v3 where year='2015' and month='11'     group by account_id, client_id, track_id,type,zid) tgroup by account_id, client_id, track_id grouping sets((account_id, client_id, track_id),(account_id, client_id),(account_id))
Help Links

HIVE-474
HIVE-537

Hive編程指南 PDF 中文高清版 

基於Hadoop叢集的Hive安裝

Hive內表和外表的區別

Hadoop + Hive + Map +reduce 叢集安裝部署

Hive本地獨立模式安裝

Hive學習之WordCount單詞統計

Hive運行架構及配置部署

Hive 的詳細介紹:請點這裡
Hive 的:請點這裡

本文永久更新連結地址:

相關文章

A Free Trial That Lets You Build Big!

Start building with 50+ products and up to 12 months usage for Elastic Compute Service

  • Sales Support

    1 on 1 presale consultation

  • After-Sales Support

    24/7 Technical Support 6 Free Tickets per Quarter Faster Response

  • Alibaba Cloud offers highly flexible support services tailored to meet your exact needs.