Recent BA user feedback has two seemingly very similar statements return the number of results is not the same, more strange, doubt is not hive bug
Query 1 Returns the number of results 6071
Select COUNT (Distinct reviewid) as dis_reviewcnt
from
(select A.reviewid from
bi.dpods_dp_reviewreport a left
outer join Bi.dpods_dp_reviewlog B on
a.reviewid=b.reviewid and b.hp_statdate= ' 2013-07-24 '
where To_date (a.feedadddate) >= ' 2013-07-01 ' and a.hp_statdate= ' 2013-07-24 '
) a
Query 2 Returns the number of results 6443
Select COUNT (Distinct reviewid) as dis_reviewcnt
from
(select A.reviewid from
bi.dpods_dp_ Reviewreport a left
outer join Bi.dpods_dp_reviewlog B on
A.reviewid=b.reviewid and b.hp_ Statdate= ' 2013-07-24 ' and a.hp_statdate= ' 2013-07-24 '
where To_date (a.feedadddate) >= ' 2013-07-01 '
) A
The second query has 372 more data than the first one, and does not exist in the left table of the subquery
The only difference between the two statements is the Dpods_dp_reviewreport partition filter condition (Hp_statdate is partition column) one behind the where and the other behind on
Rough-looking data should be the same, but the mystery is actually the difference between where and on.
Where followed by the filter conditions, query 1 in the a.hp_statdate= ' 2013-07-24 ', before table scan will partition Pruner filter partition, so only ' 2013-07-24 ' The data below will join with the Dpods_dp_reviewlog.
In Query 2, the data under all partition is read, and then the Dpods_dp_reviewlog join, and the join is based on the associated condition of only a.hp_statdate= ' 2013-07-24 ' is the time to actually perform the join, and in other cases because it is left outer JOIN, the join is not on the right side will stay Null,query 2 is actually removed all the reviewid, so will and query 1 results are not the same
Can do an experiment, query2 remove on the back of the a.hp_statdate= ' 2013-07-24 ', the rest does not move, execute the statement, out of the distinct reviewcnt is also 6443
Select COUNT (Distinct reviewid) as dis_reviewcnt
from
(select A.reviewid from
bi.dpods_dp_reviewreport a left
outer join Bi.dpods_dp_reviewlog B on
a.reviewid=b.reviewid and b.hp_statdate= ' 2013-07-24 '
where To_date (a.feedadddate) >= ' 2013-07-01 '
) a
More Wonderful content: http://www.bianceng.cnhttp://www.bianceng.cn/database/extra/
Query plan in Query 1
ABSTRACT SYNTAX Tree: (Tok_query (Tok_from (tok_subquery tok_query tok_from tok_leftouterjoin (tok_tabref NAME bi Dpods_dp_reviewreport) a) (Tok_tabref (tok_tabname bi dpods_dp_reviewlog) b) (and (=. (Tok_table_or_col a) reviewid) (. (Tok_table_or_col b) reviewid)) (= (. (Tok_table_or_col b) hp_statdate))) (Tok_insert tok_destination (Tok_dir tok_tmp_file)) (TOK_SELECT ( Tok_selexpr (. (Tok_table_or_col a)) Reviewid)) (Tok_where (>= tok_function to_date (). (Tok_table_or_col a) feedadddate)) (' 2013-07-01 ') (= (. (Tok_table_or_col a) hp_statdate)) ())) (Tok_insert tok_destination (Tok_dir tok_tmp_file)) (TOK_
SELECT (tok_selexpr (Tok_functiondi count (Tok_table_or_col reviewid)) dis_reviewcnt))) STAGE dependencies: Stage-5 is a root Stage, consists of Stage-1 Stage-1 Stage-2 depends to Stages:stage-1 Stage-0 is a root sta GE STAGE plans:stage:stage-5 Conditional Operator StAge:stage-1 map Reduce Alias-> map Operator tree:a:a Tablescan al Ias:a Filter Operator predicate:expr: (To_date (feedadddate) >= ' 201
3-07-01 ') Type:boolean Reduce Output Operator key Expressions: Expr:reviewid Type:int sort order: + Map -reduce partition Columns:expr:reviewid type:int tag:
0 value Expressions:expr:feedadddate type:string
Expr:reviewid Type:int Expr:hp_statdate type:string a:b tablescan alias:b Reduce Output Operato R Key ExPressions:expr:reviewid Type:int sort order: + Map-reduce partition Columns:expr:reviewid Type:int tag : 1 Reduce Operator tree:join Operator condition Map:left Outer to
1 condition expressions:0 {value._col5} {Value._col8} {VALUE._COL17} 1 Handleskewjoin:false outputcolumnnames: _col5, _col8, _col17 Select Operator exp
ressions:expr: _col8 type:int outputcolumnnames: _col0
Select Operator expressions:expr: _col0 type:int
Outputcolumnnames: _col0 Group by Operator aggregations: Expr:count (DISTINCT _col0) Bucketgroup:false keys:expr: _col0 Type:int mode:hash outputcolumnnames: _col0, _col1 File Output
Operator compressed:true globaltableid:0 Table: Input Format:org.apache.hadoop.mapred.SequenceFileInputFormat Output Format:org.apa Che.hadoop.hive.ql.io.HiveSequenceFileOutputFormat stage:stage-2 map Reduce Alias-> Map Oper
Ator tree:hdfs://10.2.6.102/tmp/hive-hadoop/hive_2013-07-26_18-10-59_408_7272696604651905662/-mr-10002 Reduce Output Operator key expressions:expr: _col0 Typ E:int sort order: + tag:-1 Value expressions:expr
: _col1 Type:bigint Reduce Operator Tree:group by Operator aggregations:e Xpr:count (DISTINCT key._col0:0._col0) bucketgroup:false mode:mergepartial Outputcol
Umnnames: _col0 Select Operator expressions:expr: _col0
Type:bigint outputcolumnnames: _col0 File Output Operator compressed:false globaltableid:0 Table:input FORMAT:ORG.APACHE.HADOOP.MAPRED.TEXTINP Utformat Output Format:org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat stage:s tage-0 Fetch Operator Limit:-1
Query plan in Query 2