Count (distinct key) case
explain select count(distinct session_id) from trackinfo where ds=‘ 2013-07-21‘ ;STAGE DEPENDENCIES: Stage-1 is a root stage Stage-0 is a root stageSTAGE PLANS: Stage: Stage-1 Map Reduce Alias -> Map Operator Tree: trackinfo TableScan alias: trackinfo Filter Operator predicate: expr: (ds = ‘ 2013-07-21‘) type: boolean Filter Operator predicate: expr: (ds = ‘ 2013-07-21‘) type: boolean Select Operator expressions: expr: session_id type: string outputColumnNames: session_id Group By Operator aggregations: expr: count(DISTINCT session_id) bucketGroup: true keys: expr: session_id type: string mode: hash outputColumnNames: _col0, _col1 Reduce Output Operator key expressions: expr: _col0 type: string sort order: + tag: -1 value expressions: expr: _col1 type: bigint Reduce Operator Tree: Group By Operator aggregations: expr: count(DISTINCT KEY._col0:0._col0) bucketGroup: false mode: mergepartial outputColumnNames: _col0 Select Operator expressions: expr: _col0 type: bigint outputColumnNames: _col0 File Output Operator compressed: false GlobalTableId: 0 table: input format: org.apache.hadoop.mapred.TextInputFormat output format: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat Stage: Stage-0 Fetch Operator limit: -1
Group by case
explain select max(session_id) from trackinfo where ds=‘2013-07-21‘ group by city_id;STAGE DEPENDENCIES: Stage-1 is a root stage Stage-0 is a root stageSTAGE PLANS: Stage: Stage-1 Map Reduce Alias -> Map Operator Tree: trackinfo TableScan alias: trackinfo Filter Operator predicate: expr: (ds = ‘2013-07-21‘) type: boolean Select Operator expressions: expr: city_id type: string expr: session_id type: string outputColumnNames: city_id, session_id Group By Operator aggregations: expr: max(session_id) bucketGroup: false keys: expr: city_id type: string mode: hash outputColumnNames: _col0, _col1 Reduce Output Operator key expressions: expr: _col0 type: string sort order: + Map-reduce partition columns: expr: _col0 type: string tag: -1 value expressions: expr: _col1 type: string Reduce Operator Tree: Group By Operator aggregations: expr: max(VALUE._col0) bucketGroup: false keys: expr: KEY._col0 type: string mode: mergepartial outputColumnNames: _col0, _col1 Select Operator expressions: expr: _col1 type: string outputColumnNames: _col0 File Output Operator compressed: false GlobalTableId: 0 table: input format: org.apache.hadoop.mapred.TextInputFormat output format: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat Stage: Stage-0 Fetch Operator limit: -1
Case study of Count (distinct key) combined with group
explain select count(distinct session_id) from trackinfo where ds=‘2013-07-21‘ group by city_id;STAGE DEPENDENCIES: Stage-1 is a root stage Stage-0 is a root stageSTAGE PLANS: Stage: Stage-1 Map Reduce Alias -> Map Operator Tree: trackinfo TableScan alias: trackinfo Filter Operator predicate: expr: (ds = ‘2013-07-21‘) type: boolean Select Operator expressions: expr: city_id type: string expr: session_id type: string outputColumnNames: city_id, session_id Group By Operator aggregations: expr: count(DISTINCT session_id) bucketGroup: false keys: expr: city_id type: string expr: session_id type: string mode: hash outputColumnNames: _col0, _col1, _col2 Reduce Output Operator key expressions: expr: _col0 type: string expr: _col1 type: string sort order: ++ Map-reduce partition columns: expr: _col0 type: string tag: -1 value expressions: expr: _col2 type: bigint Reduce Operator Tree: Group By Operator aggregations: expr: count(DISTINCT KEY._col1:0._col0) bucketGroup: false keys: expr: KEY._col0 type: string mode: mergepartial outputColumnNames: _col0, _col1 Select Operator expressions: expr: _col1 type: bigint outputColumnNames: _col0 File Output Operator compressed: false GlobalTableId: 0 table: input format: org.apache.hadoop.mapred.TextInputFormat output format: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat Stage: Stage-0 Fetch Operator limit: -1
Case study of max (key) Joint Group
explain select count(session_id) from trackinfo where ds=‘2013-07-21‘ group by city_id;STAGE DEPENDENCIES: Stage-1 is a root stage Stage-0 is a root stageSTAGE PLANS: Stage: Stage-1 Map Reduce Alias -> Map Operator Tree: trackinfo TableScan alias: trackinfo Filter Operator predicate: expr: (ds = ‘2013-07-21‘) type: boolean Select Operator expressions: expr: city_id type: string expr: session_id type: string outputColumnNames: city_id, session_id Group By Operator aggregations: expr: count(session_id) bucketGroup: false keys: expr: city_id type: string mode: hash outputColumnNames: _col0, _col1 Reduce Output Operator key expressions: expr: _col0 type: string sort order: + Map-reduce partition columns: expr: _col0 type: string tag: -1 value expressions: expr: _col1 type: bigint Reduce Operator Tree: Group By Operator aggregations: expr: count(VALUE._col0) bucketGroup: false keys: expr: KEY._col0 type: string mode: mergepartial outputColumnNames: _col0, _col1 Select Operator expressions: expr: _col1 type: bigint outputColumnNames: _col0 File Output Operator compressed: false GlobalTableId: 0 table: input format: org.apache.hadoop.mapred.TextInputFormat output format: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat Stage: Stage-0 Fetch Operator limit: -1
Execution Plan Summary
Select count (distinct session_id) from trackinfo where DS = '2017-11-01 ';
Session_id is distributed.
Select max (session_id) from trackinfo where DS = '2017-11-01 'group by city_id;
The distribution is city_id
Select count (distinct session_id) from trackinfo where DS = '2017-11-01 'group by city_id;
Session_id and city_id are distributed.
Select count (session_id) from trackinfo where DS = '2017-11-01 'group by city_id;
The distribution is city_id
The data skew conclusion is as follows:
Join, Group, Count (distinct key)Prone to data skew;
Max, CountAnd other Aggregate functions will not cause data skew.
Trackinfo table creation statement in the case
create table trackinfo (id bigint ,url string ,referer string ,keyword string ,type int ,gu_id string ,page_id string ,module_id string ,link_id string ,attached_info string ,session_id string ,tracker_u string ,tracker_type int ,ip string ,tracker_src string ,cookie string ,order_code string ,track_time string ,end_user_id bigint ,first_link string ,session_view_no int ,product_id string ,merchant_id bigint ,province_id string ,city_id string ,fee string ,edm_activity string ,edm_email string ,edm_jobid string ,ie_version string ,platform string ,internal_keyword string ,result_sum string ,currentpage string ,link_position string ,button_position string ,ext_field1 string ,ext_field2 string ,ext_field3 string ,ext_field4 string ,ext_field5 string ,adgroupkeywordid string ,ext_field6 string ,ext_field7 string ,ext_field8 string ,ext_field9 string ,ext_field10 string ,url_page_id int ,url_page_value string ,refer_page_id int ,refer_page_value string ) partitioned by(ds string);