Hive's headache: multi-distinct

Last Update:2014-06-15 Source: Internet

Author: User

Developer on Alibaba Coud: Build your first app with APIs, SDKs, and tutorials on the Alibaba Cloud. Read more ＞

An online query is simplified as follows:
Selectdt, count (distinct c1), count (distinct case when c2> 0 and c1 = 0 then c1 end ), count (distinct case when c2> 0 and c1> 0 then c1 end) from t where dtbetween '000000' and '000000' group by dt;
The multi-distinct problem is a headache. Why do you have a headache? Let's look at the execution plan:

Abstractsyntax tree:
(TOK_QUERY (TOK_FROM (TOK_TABREF (TOK_TABNAMEt) (TOK_INSERT (TOK_DESTINATION (TOK_DIR TOK_TMP_FILE) (TOK_SELECT (partition dt) (partition (TOK_FUNCTIONDI count (partition c1 ))) (TOK_SELEXPR (TOK_FUNCTIONDI count (TOK_FUNCTION when (and (> (TOK_TABLE_OR_COL c2) 0) (= (TOK_TABLE_OR_COL c1) 0) (TOK_TABLE_OR_COL c1 )))) (TOK_SELEXPR (TOK_FUNCTIONDI count (TOK_FUNCTION when (and (> (TOK_TABLE_OR_COL c2) 0) (> (TOK_TABLE_OR_COL c1) 0) (TOK_TABLE_OR_COL c1 ))))) (TOK_WHERE (TOK_FUNCTION between KW_FALSE (TOK_TABLE_OR_COL dt) '000000' 20131108 ') (TOK_GROUPBY (TOK_TABLE_OR_COLdt ))))
STAGEDEPENDENCIES:
Stage-1 is a root stage
Stage-0 is a root stage
STAGEPLANS:
Stage: Stage-1
Map Reduce
Alias-> Map Operator Tree:
T
TableScan
Alias: t
Filter Operator
Predicate:
Expr: dt BETWEEN '000000' AND '000000'
Type: Boolean
// Perform projection using select operator
Select Operator
Expressions:
Expr: dt
Type: string
Expr: c1
Type: int
Expr: c2
Type: int
OutputColumnNames: dt, c1, c2
// Perform simple aggregation on the MAP end. Minefield 1: Assume that there are N distinct entries and MAP processes M records of data. The output after this processing is N * M records, because MAP performs aggregation on dt and keys [I], it is recommended to filter as much data as possible on the MAP end.
Group By Operator
Aggregations:
Expr: count (DISTINCTc1)
Expr: count (distinctcase when (c2> 0) and (c1 = 0) THEN (c1) END)
Expr: count (distinctcase when (c2> 0) and (c1> 0) THEN (c1) END)
BucketGroup: false
Keys:
Expr: dt
Type: string
Expr: c1
Type: int
Expr: case when (c2> 0) and (c1 = 0) THEN (c1) END
Type: int
Expr: case when (c2> 0) and (c1> 0) THEN (c1) END
Type: int
Mode: hash
OutputColumnNames: _ col0, _ col1, _ col2, _ col3, _ col4, _ col5, _ col6
// Minefield 2: HASH Based on partition cplumns during Reduce Sink, for a table partitioned by date, all data in a day is amplified N times and transferred to Cer CER for calculation, resulting in a long tail or OOME performance.
Reduce Output Operator
Key expressions:
Expr: _ col0
Type: string
Expr: _ col1
Type: int
Expr: _ col2
Type: int
Expr: _ col3
Type: int
Sort order: ++
Map-reduce partitioncolumns:
Expr: _ col0
Type: string
Tag:-1
Value expressions:
Expr: _ col4
Type: bigint
Expr: _ col5
Type: bigint
Expr: _ col6
Type: bigint
Reduce Operator Tree:
Group By Operator
Aggregations:
Expr: count (DISTINCTKEY. _ col1: 0. _ col0)
Expr: count (DISTINCTKEY. _ col1: 1. _ col0)
Expr: count (DISTINCTKEY. _ col1: 2. _ col0)
BucketGroup: false
Keys:
Expr: KEY. _ col0
Type: string
Mode: mergepartial
OutputColumnNames: _ col0, _ col1, _ col2, _ col3
Select Operator
Expressions:
Expr: _ col0
Type: string
Expr: _ col1
Type: bigint
Expr: _ col2
Type: bigint
Expr: _ col3
Type: bigint
OutputColumnNames: _ col0, _ col1, _ col2, _ col3
File Output Operator
Compressed: true
GlobalTableId: 0
Table:
Input format: org. apache. Hadoop. mapred. TextInputFormat
Output format: org. apache. hadoop. hive. ql. io. HiveIgnoreKeyTextOutputFormat
Stage: Stage-0
Fetch Operator
Limit:-1

Although Hive has parameters: hive. groupby. skewindata (see for details) but only single-distinct (https://issues.apache.org/jira/browse/HIVE-2416) is supported when this parameter is set, so there is no way to set it in this scenario. however, this parameter is somewhat inspired by the single-distinct feature which converts SQL into this type: use union all (note that you must not directly Union All but nested subqueries. Otherwise, an exception is reported: toplevel UNION is not supported currently; use a subquery for the UNION ).

View the Execution Plan (excluding non-essential parts ):
Stage dependencies:
Stage-1 is a root stage
Stage-2 depends on stages: Stage-1, Stage-3, Stage-4
Stage-3 is a root stage
Stage-4 is a root stage
Stage-0 is a root stage

When c. parallel. thread. number are two parameters used to change the resource time? The story is always cruel, there is a Bug here (https://issues.apache.org/jira/browse/HIVE-4436), so there is no way to do it before Hive0.12 release, it is called there is no place to spend money.

In addition, there is a solution through unionall + sum. If you are interested, you can try it.

Hive details: click here
Hive download location ��: click here

Hadoop cluster-based Hive Installation

Differences between Hive internal tables and external tables

Hadoop + Hive + Map + reduce cluster installation and deployment

Install in Hive local standalone Mode

WordCount word statistics for Hive Learning

This article is an English version of an article which is originally in the Chinese language on aliyun.com and is provided for information purposes only. This website makes no representation or warranty of any kind, either expressed or implied, as to the accuracy, completeness ownership or reliability of the article or any translations thereof. If you have any concerns or complaints relating to the article, please send an email, providing a detailed description of the concern or complaint, to info-contact@alibabacloud.com. A staff member will contact you within 5 working days. Once verified, infringing content will be removed immediately.

A Free Trial That Lets You Build Big!

Start building with 50+ products and up to 12 months usage for Elastic Compute Service

Get Started for Free

Sales Support

1 on 1 presale consultation

Chat Contact Sales
After-Sales Support

24/7 Technical Support 6 Free Tickets per Quarter Faster Response

Open a Ticket
Alibaba Cloud offers highly flexible support services tailored to meet your exact needs.

Learn More

Hive's headache: multi-distinct

Contact Us

What's Trending

Top 10 Tags

Top 10 Keywords

A Free Trial That Lets You Build Big!

Sales Support

After-Sales Support

Hive's headache: multi-distinct

Contact Us

What's Trending

Top 10 Tags

Top 10 Keywords

Trending Topic

A Free Trial That Lets You Build Big!

Sales Support

After-Sales Support