Hive processes count distinct to produce data skew
Problem description
The problematic data is skewed to the category, but it cannot be joined on the Map side, and special keys are excluded for processing.
set hive.groupby.skewindata=true;insert overwrite table ad_overall_day partition(part_time='99', part_date='2015-11-99') select account_id, nvl(client_id,-1), nvl(track_id,'total'), sum(if(type=3,1,0)) as imp_cnt, sum(if(type=4,1,0)) as click_cnt, count(distinct if(type=3,zid,NULL)) as imp_uv,count(distinct if(type=4,zid,NULL)) as click_uv from derived_di_v3 where year='2015' and month='11' group by account_id, client_id, track_id grouping sets((account_id, client_id, track_id),(account_id, client_id),(account_id))
But error find.
FAILED: SemanticException [Error 10022]: DISTINCT on different columns not supported with skew in data
Separate group-by and join the results.
set hive.groupby.skewindata=true;set hive.exec.parallel=true;insert overwrite table ad_overall_day partition(part_time='99', part_date='2015-11-99') SELECT COALESCE(t1.account_id,t2.account_id),COALESCE(t1.client_id,t2.client_id), COALESCE(t1.track_id,t2.track_id),t1.imp_cnt,t1.imp_uv,t2.click_cnt,t2.click_uvFROM(select account_id, nvl(client_id,-1) as client_id, nvl(track_id,'total') as track_id, sum(if(type=3,1,0)) as imp_cnt, count(distinct if(type=3,zid,NULL)) as imp_uvFROM derived_di_v3 where year='2015' and month='11' group by account_id, client_id, track_id grouping sets((account_id, client_id, track_id),(account_id, client_id),(account_id)) ) t1FULL OUTER JOIN(select account_id, nvl(client_id,-1) as client_id, nvl(track_id,'total') as track_id, sum(if(type=4,1,0)) as click_cnt, count(distinct if(type=4,zid,NULL)) as click_uvFROM derived_di_v3 where year='2015' and month='11' group by account_id, client_id, track_id grouping sets((account_id, client_id, track_id),(account_id, client_id),(account_id)) ) t2 ON t1.account_id = t2.account_id and t1.client_id = t2.client_id and t1.track_id = t2.track_id;
Cann' t run two MapReduce Job
Unfortunately, hive doesn't explain the hql to two mapreduce job.
The parameterhive.groupby.skewindata
Seems has no affect.
Change the hql:
insert overwrite table ad_overall_day partition(part_time='99', part_date='2015-11-99')select account_id, nvl(client_id,-1) as client_id, nvl(track_id,'total') as track_id, sum(imp1) as imp_cnt,count(imp2) as imp_uv,sum(click1) as click_cnt,count(click2) as click_uvFROM (select account_id, client_id, track_id, if(type=3,1,0) as imp1,if(type=3,zid,NULL) as imp2, if(type=4,1,0) as click1,if(type=4,zid,NULL) as click2 FROM dmp.derived_di_v3 where year='2015' and month='11' group by account_id, client_id, track_id,type,zid) tgroup by account_id, client_id, track_id grouping sets((account_id, client_id, track_id),(account_id, client_id),(account_id))
Help Links
HIVE-1, 474
HIVE-1, 537
Hive programming guide PDF (Chinese Version)
Hadoop cluster-based Hive Installation
Differences between Hive internal tables and external tables
Hadoop + Hive + Map + reduce cluster installation and deployment
Install in Hive local standalone Mode
WordCount word statistics for Hive Learning
Hive operating architecture and configuration and deployment
Hive details: click here
Hive: click here
This article permanently updates the link address: