5. Analysis of data skew caused by execution plan tracking at hive syntax level optimization

Last Update:2014-07-20 Source: Internet

Author: User

Developer on Alibaba Coud: Build your first app with APIs, SDKs, and tutorials on the Alibaba Cloud. Read more ＞

Count (distinct key) case

explain select count(distinct session_id) from trackinfo where ds=‘ 2013-07-21‘ ;STAGE DEPENDENCIES:  Stage-1 is a root stage  Stage-0 is a root stageSTAGE PLANS:  Stage: Stage-1    Map Reduce      Alias -> Map Operator Tree:        trackinfo           TableScan            alias: trackinfo            Filter Operator              predicate:                  expr: (ds = ‘ 2013-07-21‘)                  type: boolean              Filter Operator                predicate:                    expr: (ds = ‘ 2013-07-21‘)                    type: boolean                Select Operator                  expressions:                        expr: session_id                        type: string                  outputColumnNames: session_id                  Group By Operator                    aggregations:                          expr: count(DISTINCT session_id)                    bucketGroup: true                    keys:                          expr: session_id                          type: string                    mode: hash                    outputColumnNames: _col0, _col1                    Reduce Output Operator                      key expressions:                            expr: _col0                            type: string                      sort order: +                      tag: -1                      value expressions:                            expr: _col1                            type: bigint      Reduce Operator Tree:        Group By Operator          aggregations:                expr: count(DISTINCT KEY._col0:0._col0)          bucketGroup: false          mode: mergepartial          outputColumnNames: _col0          Select Operator            expressions:                  expr: _col0                  type: bigint            outputColumnNames: _col0            File Output Operator              compressed: false              GlobalTableId: 0              table:                  input format: org.apache.hadoop.mapred.TextInputFormat                  output format: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat  Stage: Stage-0    Fetch Operator      limit: -1

Group by case

explain select max(session_id) from trackinfo where ds=‘2013-07-21‘ group by city_id;STAGE DEPENDENCIES:  Stage-1 is a root stage  Stage-0 is a root stageSTAGE PLANS:  Stage: Stage-1    Map Reduce      Alias -> Map Operator Tree:        trackinfo           TableScan            alias: trackinfo            Filter Operator              predicate:                  expr: (ds = ‘2013-07-21‘)                  type: boolean              Select Operator                expressions:                      expr: city_id                      type: string                      expr: session_id                      type: string                outputColumnNames: city_id, session_id                Group By Operator                  aggregations:                        expr: max(session_id)                  bucketGroup: false                  keys:                        expr: city_id                        type: string                  mode: hash                  outputColumnNames: _col0, _col1                  Reduce Output Operator                    key expressions:                          expr: _col0                          type: string                    sort order: +                    Map-reduce partition columns:                          expr: _col0                          type: string                    tag: -1                    value expressions:                          expr: _col1                          type: string      Reduce Operator Tree:        Group By Operator          aggregations:                expr: max(VALUE._col0)          bucketGroup: false          keys:                expr: KEY._col0                type: string          mode: mergepartial          outputColumnNames: _col0, _col1          Select Operator            expressions:                  expr: _col1                  type: string            outputColumnNames: _col0            File Output Operator              compressed: false              GlobalTableId: 0              table:                  input format: org.apache.hadoop.mapred.TextInputFormat                  output format: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat  Stage: Stage-0    Fetch Operator      limit: -1

Case study of Count (distinct key) combined with group

explain select count(distinct session_id) from trackinfo where ds=‘2013-07-21‘ group by city_id;STAGE DEPENDENCIES:  Stage-1 is a root stage  Stage-0 is a root stageSTAGE PLANS:  Stage: Stage-1    Map Reduce      Alias -> Map Operator Tree:        trackinfo           TableScan            alias: trackinfo            Filter Operator              predicate:                  expr: (ds = ‘2013-07-21‘)                  type: boolean              Select Operator                expressions:                      expr: city_id                      type: string                      expr: session_id                      type: string                outputColumnNames: city_id, session_id                Group By Operator                  aggregations:                        expr: count(DISTINCT session_id)                  bucketGroup: false                  keys:                        expr: city_id                        type: string                        expr: session_id                        type: string                  mode: hash                  outputColumnNames: _col0, _col1, _col2                  Reduce Output Operator                    key expressions:                          expr: _col0                          type: string                          expr: _col1                          type: string                    sort order: ++                    Map-reduce partition columns:                          expr: _col0                          type: string                    tag: -1                    value expressions:                          expr: _col2                          type: bigint      Reduce Operator Tree:        Group By Operator          aggregations:                expr: count(DISTINCT KEY._col1:0._col0)          bucketGroup: false          keys:                expr: KEY._col0                type: string          mode: mergepartial          outputColumnNames: _col0, _col1          Select Operator            expressions:                  expr: _col1                  type: bigint            outputColumnNames: _col0            File Output Operator              compressed: false              GlobalTableId: 0              table:                  input format: org.apache.hadoop.mapred.TextInputFormat                  output format: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat  Stage: Stage-0    Fetch Operator      limit: -1

Case study of max (key) Joint Group

explain select count(session_id) from trackinfo where ds=‘2013-07-21‘ group by city_id;STAGE DEPENDENCIES:  Stage-1 is a root stage  Stage-0 is a root stageSTAGE PLANS:  Stage: Stage-1    Map Reduce      Alias -> Map Operator Tree:        trackinfo           TableScan            alias: trackinfo            Filter Operator              predicate:                  expr: (ds = ‘2013-07-21‘)                  type: boolean              Select Operator                expressions:                      expr: city_id                      type: string                      expr: session_id                      type: string                outputColumnNames: city_id, session_id                Group By Operator                  aggregations:                        expr: count(session_id)                  bucketGroup: false                  keys:                        expr: city_id                        type: string                  mode: hash                  outputColumnNames: _col0, _col1                  Reduce Output Operator                    key expressions:                          expr: _col0                          type: string                    sort order: +                    Map-reduce partition columns:                          expr: _col0                          type: string                    tag: -1                    value expressions:                          expr: _col1                          type: bigint      Reduce Operator Tree:        Group By Operator          aggregations:                expr: count(VALUE._col0)          bucketGroup: false          keys:                expr: KEY._col0                type: string          mode: mergepartial          outputColumnNames: _col0, _col1          Select Operator            expressions:                  expr: _col1                  type: bigint            outputColumnNames: _col0            File Output Operator              compressed: false              GlobalTableId: 0              table:                  input format: org.apache.hadoop.mapred.TextInputFormat                  output format: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat  Stage: Stage-0    Fetch Operator      limit: -1

Execution Plan Summary

Select count (distinct session_id) from trackinfo where DS = '2017-11-01 ';

Session_id is distributed.

Select max (session_id) from trackinfo where DS = '2017-11-01 'group by city_id;

The distribution is city_id

Select count (distinct session_id) from trackinfo where DS = '2017-11-01 'group by city_id;

Session_id and city_id are distributed.

Select count (session_id) from trackinfo where DS = '2017-11-01 'group by city_id;

The distribution is city_id

The data skew conclusion is as follows:

Join, Group, Count (distinct key)Prone to data skew;

Max, CountAnd other Aggregate functions will not cause data skew.

Trackinfo table creation statement in the case

create table trackinfo (id      bigint                          ,url     string                          ,referer string                          ,keyword string                          ,type    int                             ,gu_id   string                          ,page_id string                          ,module_id       string                  ,link_id string                          ,attached_info   string                  ,session_id      string                  ,tracker_u       string                  ,tracker_type    int                     ,ip      string                          ,tracker_src     string                  ,cookie  string                          ,order_code      string                  ,track_time      string                  ,end_user_id     bigint                  ,first_link      string                  ,session_view_no int                     ,product_id      string                  ,merchant_id     bigint                  ,province_id     string                  ,city_id string                          ,fee     string                          ,edm_activity    string                  ,edm_email       string                  ,edm_jobid       string                  ,ie_version      string                  ,platform        string                  ,internal_keyword        string          ,result_sum      string                  ,currentpage     string                  ,link_position   string                  ,button_position string                  ,ext_field1      string                  ,ext_field2      string                  ,ext_field3      string                  ,ext_field4      string                  ,ext_field5      string                  ,adgroupkeywordid        string          ,ext_field6      string                  ,ext_field7      string                  ,ext_field8      string                  ,ext_field9      string                  ,ext_field10     string                  ,url_page_id     int                     ,url_page_value  string                  ,refer_page_id   int                     ,refer_page_value        string ) partitioned by(ds      string);

This article is an English version of an article which is originally in the Chinese language on aliyun.com and is provided for information purposes only. This website makes no representation or warranty of any kind, either expressed or implied, as to the accuracy, completeness ownership or reliability of the article or any translations thereof. If you have any concerns or complaints relating to the article, please send an email, providing a detailed description of the concern or complaint, to info-contact@alibabacloud.com. A staff member will contact you within 5 working days. Once verified, infringing content will be removed immediately.

A Free Trial That Lets You Build Big!

Start building with 50+ products and up to 12 months usage for Elastic Compute Service

Get Started for Free

Sales Support

1 on 1 presale consultation

Chat Contact Sales
After-Sales Support

24/7 Technical Support 6 Free Tickets per Quarter Faster Response

Open a Ticket
Alibaba Cloud offers highly flexible support services tailored to meet your exact needs.

Learn More

5. Analysis of data skew caused by execution plan tracking at hive syntax level optimization

Contact Us

What's Trending

Top 10 Tags

Top 10 Keywords

A Free Trial That Lets You Build Big!

Sales Support

After-Sales Support

5. Analysis of data skew caused by execution plan tracking at hive syntax level optimization

Contact Us

What's Trending

Top 10 Tags

Top 10 Keywords

Trending Topic

A Free Trial That Lets You Build Big!

Sales Support

After-Sales Support