An online query is simplified as follows:
Selectdt, count (distinct c1), count (distinct case when c2> 0 and c1 = 0 then c1 end ), count (distinct case when c2> 0 and c1> 0 then c1 end) from t where dtbetween '000000' and '000000' group by dt;
The multi-distinct problem is a headache. Why do you have a headache? Let's look at the execution plan:
Abstractsyntax tree:
(TOK_QUERY (TOK_FROM (TOK_TABREF (TOK_TABNAMEt) (TOK_INSERT (TOK_DESTINATION (TOK_DIR TOK_TMP_FILE) (TOK_SELECT (partition dt) (partition (TOK_FUNCTIONDI count (partition c1 ))) (TOK_SELEXPR (TOK_FUNCTIONDI count (TOK_FUNCTION when (and (> (TOK_TABLE_OR_COL c2) 0) (= (TOK_TABLE_OR_COL c1) 0) (TOK_TABLE_OR_COL c1 )))) (TOK_SELEXPR (TOK_FUNCTIONDI count (TOK_FUNCTION when (and (> (TOK_TABLE_OR_COL c2) 0) (> (TOK_TABLE_OR_COL c1) 0) (TOK_TABLE_OR_COL c1 ))))) (TOK_WHERE (TOK_FUNCTION between KW_FALSE (TOK_TABLE_OR_COL dt) '000000' 20131108 ') (TOK_GROUPBY (TOK_TABLE_OR_COLdt ))))
STAGEDEPENDENCIES:
Stage-1 is a root stage
Stage-0 is a root stage
STAGEPLANS:
Stage: Stage-1
Map Reduce
Alias-> Map Operator Tree:
T
TableScan
Alias: t
Filter Operator
Predicate:
Expr: dt BETWEEN '000000' AND '000000'
Type: Boolean
// Perform projection using select operator
Select Operator
Expressions:
Expr: dt
Type: string
Expr: c1
Type: int
Expr: c2
Type: int
OutputColumnNames: dt, c1, c2
// Perform simple aggregation on the MAP end. Minefield 1: Assume that there are N distinct entries and MAP processes M records of data. The output after this processing is N * M records, because MAP performs aggregation on dt and keys [I], it is recommended to filter as much data as possible on the MAP end.
Group By Operator
Aggregations:
Expr: count (DISTINCTc1)
Expr: count (distinctcase when (c2> 0) and (c1 = 0) THEN (c1) END)
Expr: count (distinctcase when (c2> 0) and (c1> 0) THEN (c1) END)
BucketGroup: false
Keys:
Expr: dt
Type: string
Expr: c1
Type: int
Expr: case when (c2> 0) and (c1 = 0) THEN (c1) END
Type: int
Expr: case when (c2> 0) and (c1> 0) THEN (c1) END
Type: int
Mode: hash
OutputColumnNames: _ col0, _ col1, _ col2, _ col3, _ col4, _ col5, _ col6
// Minefield 2: HASH Based on partition cplumns during Reduce Sink, for a table partitioned by date, all data in a day is amplified N times and transferred to Cer CER for calculation, resulting in a long tail or OOME performance.
Reduce Output Operator
Key expressions:
Expr: _ col0
Type: string
Expr: _ col1
Type: int
Expr: _ col2
Type: int
Expr: _ col3
Type: int
Sort order: ++
Map-reduce partitioncolumns:
Expr: _ col0
Type: string
Tag:-1
Value expressions:
Expr: _ col4
Type: bigint
Expr: _ col5
Type: bigint
Expr: _ col6
Type: bigint
Reduce Operator Tree:
Group By Operator
Aggregations:
Expr: count (DISTINCTKEY. _ col1: 0. _ col0)
Expr: count (DISTINCTKEY. _ col1: 1. _ col0)
Expr: count (DISTINCTKEY. _ col1: 2. _ col0)
BucketGroup: false
Keys:
Expr: KEY. _ col0
Type: string
Mode: mergepartial
OutputColumnNames: _ col0, _ col1, _ col2, _ col3
Select Operator
Expressions:
Expr: _ col0
Type: string
Expr: _ col1
Type: bigint
Expr: _ col2
Type: bigint
Expr: _ col3
Type: bigint
OutputColumnNames: _ col0, _ col1, _ col2, _ col3
File Output Operator
Compressed: true
GlobalTableId: 0
Table:
Input format: org. apache. Hadoop. mapred. TextInputFormat
Output format: org. apache. hadoop. hive. ql. io. HiveIgnoreKeyTextOutputFormat
Stage: Stage-0
Fetch Operator
Limit:-1
Although Hive has parameters: hive. groupby. skewindata (see for details) but only single-distinct (https://issues.apache.org/jira/browse/HIVE-2416) is supported when this parameter is set, so there is no way to set it in this scenario. however, this parameter is somewhat inspired by the single-distinct feature which converts SQL into this type: use union all (note that you must not directly Union All but nested subqueries. Otherwise, an exception is reported: toplevel UNION is not supported currently; use a subquery for the UNION ).
View the Execution Plan (excluding non-essential parts ):
Stage dependencies:
Stage-1 is a root stage
Stage-2 depends on stages: Stage-1, Stage-3, Stage-4
Stage-3 is a root stage
Stage-4 is a root stage
Stage-0 is a root stage
When c. parallel. thread. number are two parameters used to change the resource time? The story is always cruel, there is a Bug here (https://issues.apache.org/jira/browse/HIVE-4436), so there is no way to do it before Hive0.12 release, it is called there is no place to spend money.
In addition, there is a solution through unionall + sum. If you are interested, you can try it.
Hive details: click here
Hive download location ��: click here
Hadoop cluster-based Hive Installation
Differences between Hive internal tables and external tables
Hadoop + Hive + Map + reduce cluster installation and deployment
Install in Hive local standalone Mode
WordCount word statistics for Hive Learning