Hive's headache: multi-distinct

Source: Internet
Author: User

An online query is simplified as follows:
Selectdt, count (distinct c1), count (distinct case when c2> 0 and c1 = 0 then c1 end ), count (distinct case when c2> 0 and c1> 0 then c1 end) from t where dtbetween '000000' and '000000' group by dt;
The multi-distinct problem is a headache. Why do you have a headache? Let's look at the execution plan:

Abstractsyntax tree:
(TOK_QUERY (TOK_FROM (TOK_TABREF (TOK_TABNAMEt) (TOK_INSERT (TOK_DESTINATION (TOK_DIR TOK_TMP_FILE) (TOK_SELECT (partition dt) (partition (TOK_FUNCTIONDI count (partition c1 ))) (TOK_SELEXPR (TOK_FUNCTIONDI count (TOK_FUNCTION when (and (> (TOK_TABLE_OR_COL c2) 0) (= (TOK_TABLE_OR_COL c1) 0) (TOK_TABLE_OR_COL c1 )))) (TOK_SELEXPR (TOK_FUNCTIONDI count (TOK_FUNCTION when (and (> (TOK_TABLE_OR_COL c2) 0) (> (TOK_TABLE_OR_COL c1) 0) (TOK_TABLE_OR_COL c1 ))))) (TOK_WHERE (TOK_FUNCTION between KW_FALSE (TOK_TABLE_OR_COL dt) '000000' 20131108 ') (TOK_GROUPBY (TOK_TABLE_OR_COLdt ))))
STAGEDEPENDENCIES:
Stage-1 is a root stage
Stage-0 is a root stage
STAGEPLANS:
Stage: Stage-1
Map Reduce
Alias-> Map Operator Tree:
T
TableScan
Alias: t
Filter Operator
Predicate:
Expr: dt BETWEEN '000000' AND '000000'
Type: Boolean
// Perform projection using select operator
Select Operator
Expressions:
Expr: dt
Type: string
Expr: c1
Type: int
Expr: c2
Type: int
OutputColumnNames: dt, c1, c2
// Perform simple aggregation on the MAP end. Minefield 1: Assume that there are N distinct entries and MAP processes M records of data. The output after this processing is N * M records, because MAP performs aggregation on dt and keys [I], it is recommended to filter as much data as possible on the MAP end.
Group By Operator
Aggregations:
Expr: count (DISTINCTc1)
Expr: count (distinctcase when (c2> 0) and (c1 = 0) THEN (c1) END)
Expr: count (distinctcase when (c2> 0) and (c1> 0) THEN (c1) END)
BucketGroup: false
Keys:
Expr: dt
Type: string
Expr: c1
Type: int
Expr: case when (c2> 0) and (c1 = 0) THEN (c1) END
Type: int
Expr: case when (c2> 0) and (c1> 0) THEN (c1) END
Type: int
Mode: hash
OutputColumnNames: _ col0, _ col1, _ col2, _ col3, _ col4, _ col5, _ col6
// Minefield 2: HASH Based on partition cplumns during Reduce Sink, for a table partitioned by date, all data in a day is amplified N times and transferred to Cer CER for calculation, resulting in a long tail or OOME performance.
Reduce Output Operator
Key expressions:
Expr: _ col0
Type: string
Expr: _ col1
Type: int
Expr: _ col2
Type: int
Expr: _ col3
Type: int
Sort order: ++
Map-reduce partitioncolumns:
Expr: _ col0
Type: string
Tag:-1
Value expressions:
Expr: _ col4
Type: bigint
Expr: _ col5
Type: bigint
Expr: _ col6
Type: bigint
Reduce Operator Tree:
Group By Operator
Aggregations:
Expr: count (DISTINCTKEY. _ col1: 0. _ col0)
Expr: count (DISTINCTKEY. _ col1: 1. _ col0)
Expr: count (DISTINCTKEY. _ col1: 2. _ col0)
BucketGroup: false
Keys:
Expr: KEY. _ col0
Type: string
Mode: mergepartial
OutputColumnNames: _ col0, _ col1, _ col2, _ col3
Select Operator
Expressions:
Expr: _ col0
Type: string
Expr: _ col1
Type: bigint
Expr: _ col2
Type: bigint
Expr: _ col3
Type: bigint
OutputColumnNames: _ col0, _ col1, _ col2, _ col3
File Output Operator
Compressed: true
GlobalTableId: 0
Table:
Input format: org. apache. Hadoop. mapred. TextInputFormat
Output format: org. apache. hadoop. hive. ql. io. HiveIgnoreKeyTextOutputFormat
Stage: Stage-0
Fetch Operator
Limit:-1

Although Hive has parameters: hive. groupby. skewindata (see for details) but only single-distinct (https://issues.apache.org/jira/browse/HIVE-2416) is supported when this parameter is set, so there is no way to set it in this scenario. however, this parameter is somewhat inspired by the single-distinct feature which converts SQL into this type: use union all (note that you must not directly Union All but nested subqueries. Otherwise, an exception is reported: toplevel UNION is not supported currently; use a subquery for the UNION ).

View the Execution Plan (excluding non-essential parts ):
Stage dependencies:
Stage-1 is a root stage
Stage-2 depends on stages: Stage-1, Stage-3, Stage-4
Stage-3 is a root stage
Stage-4 is a root stage
Stage-0 is a root stage

When c. parallel. thread. number are two parameters used to change the resource time? The story is always cruel, there is a Bug here (https://issues.apache.org/jira/browse/HIVE-4436), so there is no way to do it before Hive0.12 release, it is called there is no place to spend money.

In addition, there is a solution through unionall + sum. If you are interested, you can try it.

Hive details: click here
Hive download location ��: click here

Hadoop cluster-based Hive Installation

Differences between Hive internal tables and external tables

Hadoop + Hive + Map + reduce cluster installation and deployment

Install in Hive local standalone Mode

WordCount word statistics for Hive Learning

Related Article

Contact Us

The content source of this page is from Internet, which doesn't represent Alibaba Cloud's opinion; products and services mentioned on that page don't have any relationship with Alibaba Cloud. If the content of the page makes you feel confusing, please write us an email, we will handle the problem within 5 days after receiving your email.

If you find any instances of plagiarism from the community, please send an email to: info-contact@alibabacloud.com and provide relevant evidence. A staff member will contact you within 5 working days.

A Free Trial That Lets You Build Big!

Start building with 50+ products and up to 12 months usage for Elastic Compute Service

  • Sales Support

    1 on 1 presale consultation

  • After-Sales Support

    24/7 Technical Support 6 Free Tickets per Quarter Faster Response

  • Alibaba Cloud offers highly flexible support services tailored to meet your exact needs.