5. Hive syntax-level optimization: analyzing data skew by tracking the execution plan

Count(distinct key) case
explain select count(distinct session_id) from trackinfo where ds='2013-07-21';

STAGE DEPENDENCIES:
  Stage-1 is a root stage
  Stage-0 is a root stage

STAGE PLANS:
  Stage: Stage-1
    Map Reduce
      Alias -> Map Operator Tree:
        trackinfo
          TableScan
            alias: trackinfo
            Filter Operator
              predicate:
                  expr: (ds = '2013-07-21')
                  type: boolean
              Filter Operator
                predicate:
                    expr: (ds = '2013-07-21')
                    type: boolean
                Select Operator
                  expressions:
                        expr: session_id
                        type: string
                  outputColumnNames: session_id
                  Group By Operator
                    aggregations:
                          expr: count(DISTINCT session_id)
                    bucketGroup: true
                    keys:
                          expr: session_id
                          type: string
                    mode: hash
                    outputColumnNames: _col0, _col1
                    Reduce Output Operator
                      key expressions:
                            expr: _col0
                            type: string
                      sort order: +
                      tag: -1
                      value expressions:
                            expr: _col1
                            type: bigint
      Reduce Operator Tree:
        Group By Operator
          aggregations:
                expr: count(DISTINCT KEY._col0:0._col0)
          bucketGroup: false
          mode: mergepartial
          outputColumnNames: _col0
          Select Operator
            expressions:
                  expr: _col0
                  type: bigint
            outputColumnNames: _col0
            File Output Operator
              compressed: false
              GlobalTableId: 0
              table:
                  input format: org.apache.hadoop.mapred.TextInputFormat
                  output format: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat

  Stage: Stage-0
    Fetch Operator
      limit: -1
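
In this plan the map side pre-aggregates by session_id (mode: hash), but the reduce side runs count(DISTINCT KEY._col0:0._col0) in mergepartial mode with no group-by key, so all partial results are merged in a single final step. When a day has a very large number of distinct session_id values, that step becomes the bottleneck. The rewrite below is only a sketch of a commonly used workaround, not part of the original case: the inner query deduplicates session_id across many reducers, and the outer query only counts the already-distinct rows.

-- Sketch of a common rewrite (illustrative, not from the original case):
-- the inner query deduplicates session_id across many reducers,
-- the outer query merely counts the deduplicated rows.
select count(*)
from (
  select session_id
  from trackinfo
  where ds = '2013-07-21'
  group by session_id
) t;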

Group by case
explain select max(session_id) from trackinfo where ds='2013-07-21' group by city_id;

STAGE DEPENDENCIES:
  Stage-1 is a root stage
  Stage-0 is a root stage

STAGE PLANS:
  Stage: Stage-1
    Map Reduce
      Alias -> Map Operator Tree:
        trackinfo
          TableScan
            alias: trackinfo
            Filter Operator
              predicate:
                  expr: (ds = '2013-07-21')
                  type: boolean
              Select Operator
                expressions:
                      expr: city_id
                      type: string
                      expr: session_id
                      type: string
                outputColumnNames: city_id, session_id
                Group By Operator
                  aggregations:
                        expr: max(session_id)
                  bucketGroup: false
                  keys:
                        expr: city_id
                        type: string
                  mode: hash
                  outputColumnNames: _col0, _col1
                  Reduce Output Operator
                    key expressions:
                          expr: _col0
                          type: string
                    sort order: +
                    Map-reduce partition columns:
                          expr: _col0
                          type: string
                    tag: -1
                    value expressions:
                          expr: _col1
                          type: string
      Reduce Operator Tree:
        Group By Operator
          aggregations:
                expr: max(VALUE._col0)
          bucketGroup: false
          keys:
                expr: KEY._col0
                type: string
          mode: mergepartial
          outputColumnNames: _col0, _col1
          Select Operator
            expressions:
                  expr: _col1
                  type: string
            outputColumnNames: _col0
            File Output Operator
              compressed: false
              GlobalTableId: 0
              table:
                  input format: org.apache.hadoop.mapred.TextInputFormat
                  output format: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat

  Stage: Stage-0
    Fetch Operator
      limit: -1
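
Here the shuffle is partitioned by city_id (Map-reduce partition columns: _col0), so a single hot city sends all of its rows to one reducer; the map-side hash Group By Operator already reduces that volume by partial aggregation. As a hedged sketch on top of the same query (these Hive settings exist, but whether they help depends on the data), the standard group-by skew options can be enabled:

-- Sketch: settings commonly used when the group-by key is skewed
-- (not part of the original case).
set hive.map.aggr=true;            -- partial aggregation on the map side
set hive.groupby.skewindata=true;  -- adds an extra job that first spreads rows randomly
select max(session_id) from trackinfo where ds = '2013-07-21' group by city_id;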

Count(distinct key) combined with group by case
explain select count(distinct session_id) from trackinfo where ds='2013-07-21' group by city_id;

STAGE DEPENDENCIES:
  Stage-1 is a root stage
  Stage-0 is a root stage

STAGE PLANS:
  Stage: Stage-1
    Map Reduce
      Alias -> Map Operator Tree:
        trackinfo
          TableScan
            alias: trackinfo
            Filter Operator
              predicate:
                  expr: (ds = '2013-07-21')
                  type: boolean
              Select Operator
                expressions:
                      expr: city_id
                      type: string
                      expr: session_id
                      type: string
                outputColumnNames: city_id, session_id
                Group By Operator
                  aggregations:
                        expr: count(DISTINCT session_id)
                  bucketGroup: false
                  keys:
                        expr: city_id
                        type: string
                        expr: session_id
                        type: string
                  mode: hash
                  outputColumnNames: _col0, _col1, _col2
                  Reduce Output Operator
                    key expressions:
                          expr: _col0
                          type: string
                          expr: _col1
                          type: string
                    sort order: ++
                    Map-reduce partition columns:
                          expr: _col0
                          type: string
                    tag: -1
                    value expressions:
                          expr: _col2
                          type: bigint
      Reduce Operator Tree:
        Group By Operator
          aggregations:
                expr: count(DISTINCT KEY._col1:0._col0)
          bucketGroup: false
          keys:
                expr: KEY._col0
                type: string
          mode: mergepartial
          outputColumnNames: _col0, _col1
          Select Operator
            expressions:
                  expr: _col1
                  type: bigint
            outputColumnNames: _col0
            File Output Operator
              compressed: false
              GlobalTableId: 0
              table:
                  input format: org.apache.hadoop.mapred.TextInputFormat
                  output format: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat

  Stage: Stage-0
    Fetch Operator
      limit: -1
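
The reduce key here is (city_id, session_id) with sort order ++, but the partition column is only city_id, so every row of a hot city still lands on the same reducer, which must then deduplicate session_id on its own. A two-level rewrite, sketched below as an illustration rather than as part of the original article, first deduplicates on both columns (distributing by both) and then counts the already-distinct rows per city:

-- Sketch of a two-level rewrite (illustrative):
-- stage 1 deduplicates on (city_id, session_id) and distributes by both columns,
-- stage 2 counts the already-distinct rows per city.
select city_id, count(*) as distinct_sessions
from (
  select city_id, session_id
  from trackinfo
  where ds = '2013-07-21'
  group by city_id, session_id
) t
group by city_id;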

Count(key) combined with group by case
explain select count(session_id) from trackinfo where ds='2013-07-21' group by city_id;

STAGE DEPENDENCIES:
  Stage-1 is a root stage
  Stage-0 is a root stage

STAGE PLANS:
  Stage: Stage-1
    Map Reduce
      Alias -> Map Operator Tree:
        trackinfo
          TableScan
            alias: trackinfo
            Filter Operator
              predicate:
                  expr: (ds = '2013-07-21')
                  type: boolean
              Select Operator
                expressions:
                      expr: city_id
                      type: string
                      expr: session_id
                      type: string
                outputColumnNames: city_id, session_id
                Group By Operator
                  aggregations:
                        expr: count(session_id)
                  bucketGroup: false
                  keys:
                        expr: city_id
                        type: string
                  mode: hash
                  outputColumnNames: _col0, _col1
                  Reduce Output Operator
                    key expressions:
                          expr: _col0
                          type: string
                    sort order: +
                    Map-reduce partition columns:
                          expr: _col0
                          type: string
                    tag: -1
                    value expressions:
                          expr: _col1
                          type: bigint
      Reduce Operator Tree:
        Group By Operator
          aggregations:
                expr: count(VALUE._col0)
          bucketGroup: false
          keys:
                expr: KEY._col0
                type: string
          mode: mergepartial
          outputColumnNames: _col0, _col1
          Select Operator
            expressions:
                  expr: _col1
                  type: bigint
            outputColumnNames: _col0
            File Output Operator
              compressed: false
              GlobalTableId: 0
              table:
                  input format: org.apache.hadoop.mapred.TextInputFormat
                  output format: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat

  Stage: Stage-0
    Fetch Operator
      limit: -1
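
count(session_id) per city is also partitioned by city_id, but plain count() supports partial aggregation on the map side, so each reducer merges pre-aggregated partial counts rather than raw rows. Whether the remaining shuffle is balanced depends only on how city_id itself is distributed; a quick check such as the following (a hypothetical helper query, not part of the original case) makes that visible:

-- Hypothetical check of the city_id distribution for this day:
-- a handful of cities holding most of the rows means the group-by reducers will be uneven.
select city_id, count(*) as row_cnt
from trackinfo
where ds = '2013-07-21'
group by city_id
order by row_cnt desc
limit 20;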

Execution Plan Summary

select count(distinct session_id) from trackinfo where ds='2013-07-21';
Distribution key: session_id.

select max(session_id) from trackinfo where ds='2013-07-21' group by city_id;
Distribution key: city_id.

select count(distinct session_id) from trackinfo where ds='2013-07-21' group by city_id;
Distribution keys: city_id and session_id.

select count(session_id) from trackinfo where ds='2013-07-21' group by city_id;
Distribution key: city_id.

The data skew conclusions are as follows:

join, group by, and count(distinct key) are prone to data skew;

max, count, and other aggregate functions will not cause data skew.

Creation statement of the trackinfo table used in the cases

create table trackinfo (
  id                bigint,
  url               string,
  referer           string,
  keyword           string,
  type              int,
  gu_id             string,
  page_id           string,
  module_id         string,
  link_id           string,
  attached_info     string,
  session_id        string,
  tracker_u         string,
  tracker_type      int,
  ip                string,
  tracker_src       string,
  cookie            string,
  order_code        string,
  track_time        string,
  end_user_id       bigint,
  first_link        string,
  session_view_no   int,
  product_id        string,
  merchant_id       bigint,
  province_id       string,
  city_id           string,
  fee               string,
  edm_activity      string,
  edm_email         string,
  edm_jobid         string,
  ie_version        string,
  platform          string,
  internal_keyword  string,
  result_sum        string,
  currentpage       string,
  link_position     string,
  button_position   string,
  ext_field1        string,
  ext_field2        string,
  ext_field3        string,
  ext_field4        string,
  ext_field5        string,
  adgroupkeywordid  string,
  ext_field6        string,
  ext_field7        string,
  ext_field8        string,
  ext_field9        string,
  ext_field10       string,
  url_page_id       int,
  url_page_value    string,
  refer_page_id     int,
  refer_page_value  string
)
partitioned by (ds string);
