Problems with hive left OUTER JOIN

Source: Internet
Author: User
Tags filter count join sort

Recent BA user feedback has two seemingly very similar statements return the number of results is not the same, more strange, doubt is not hive bug

Query 1 Returns the number of results 6071

Select COUNT (Distinct reviewid) as dis_reviewcnt  
from
(select A.reviewid from  
bi.dpods_dp_reviewreport  a left  
outer join Bi.dpods_dp_reviewlog  B on  
a.reviewid=b.reviewid and  b.hp_statdate= ' 2013-07-24 '
where To_date (a.feedadddate) >= ' 2013-07-01 ' and   a.hp_statdate= ' 2013-07-24 '
) a

Query 2 Returns the number of results 6443

Select COUNT (Distinct reviewid) as dis_reviewcnt  
from
(select A.reviewid from  
bi.dpods_dp_ Reviewreport  a left  
outer join Bi.dpods_dp_reviewlog  B on  
A.reviewid=b.reviewid and  b.hp_ Statdate= ' 2013-07-24 ' and   a.hp_statdate= ' 2013-07-24 '
where To_date (a.feedadddate) >= ' 2013-07-01 '
) A

The second query has 372 more data than the first one, and does not exist in the left table of the subquery

The only difference between the two statements is the Dpods_dp_reviewreport partition filter condition (Hp_statdate is partition column) one behind the where and the other behind on

Rough-looking data should be the same, but the mystery is actually the difference between where and on.

Where followed by the filter conditions, query 1 in the a.hp_statdate= ' 2013-07-24 ', before table scan will partition Pruner filter partition, so only ' 2013-07-24 ' The data below will join with the Dpods_dp_reviewlog.

In Query 2, the data under all partition is read, and then the Dpods_dp_reviewlog join, and the join is based on the associated condition of only a.hp_statdate= ' 2013-07-24 ' is the time to actually perform the join, and in other cases because it is left outer JOIN, the join is not on the right side will stay Null,query 2 is actually removed all the reviewid, so will and query 1 results are not the same

Can do an experiment, query2 remove on the back of the a.hp_statdate= ' 2013-07-24 ', the rest does not move, execute the statement, out of the distinct reviewcnt is also 6443

Select COUNT (Distinct reviewid) as dis_reviewcnt  
from
(select A.reviewid from  
bi.dpods_dp_reviewreport  a left  
outer join Bi.dpods_dp_reviewlog  B on  
a.reviewid=b.reviewid and  b.hp_statdate= ' 2013-07-24 '
where To_date (a.feedadddate) >= ' 2013-07-01 '
) a

More Wonderful content: http://www.bianceng.cnhttp://www.bianceng.cn/database/extra/

Query plan in Query 1

ABSTRACT SYNTAX Tree: (Tok_query (Tok_from (tok_subquery tok_query tok_from tok_leftouterjoin (tok_tabref NAME bi Dpods_dp_reviewreport) a) (Tok_tabref (tok_tabname bi dpods_dp_reviewlog) b) (and (=. (Tok_table_or_col a) reviewid) (. (Tok_table_or_col b) reviewid)) (= (. (Tok_table_or_col b) hp_statdate))) (Tok_insert tok_destination (Tok_dir tok_tmp_file)) (TOK_SELECT ( Tok_selexpr (. (Tok_table_or_col a)) Reviewid)) (Tok_where (>= tok_function to_date (). (Tok_table_or_col a) feedadddate)) (' 2013-07-01 ') (= (. (Tok_table_or_col a) hp_statdate)) ())) (Tok_insert tok_destination (Tok_dir tok_tmp_file)) (TOK_  
  SELECT (tok_selexpr (Tok_functiondi count (Tok_table_or_col reviewid)) dis_reviewcnt))) STAGE dependencies: Stage-5 is a root Stage, consists of Stage-1 Stage-1 Stage-2 depends to Stages:stage-1 Stage-0 is a root sta GE STAGE plans:stage:stage-5 Conditional Operator StAge:stage-1 map Reduce Alias-> map Operator tree:a:a Tablescan al Ias:a Filter Operator predicate:expr: (To_date (feedadddate) >= ' 201  
                      3-07-01 ') Type:boolean Reduce Output Operator key Expressions: Expr:reviewid Type:int sort order: + Map  -reduce partition Columns:expr:reviewid type:int tag:  
                      0 value Expressions:expr:feedadddate type:string  
                      Expr:reviewid Type:int Expr:hp_statdate type:string a:b tablescan alias:b Reduce Output Operato R Key ExPressions:expr:reviewid Type:int sort order: + Map-reduce partition Columns:expr:reviewid Type:int tag : 1 Reduce Operator tree:join Operator condition Map:left Outer to   
          1 condition expressions:0 {value._col5} {Value._col8} {VALUE._COL17} 1 Handleskewjoin:false outputcolumnnames: _col5, _col8, _col17 Select Operator exp  
            ressions:expr: _col8 type:int outputcolumnnames: _col0  
              Select Operator expressions:expr: _col0 type:int  
                      Outputcolumnnames: _col0 Group by Operator aggregations: Expr:count (DISTINCT _col0) Bucketgroup:false keys:expr: _col0  Type:int mode:hash outputcolumnnames: _col0, _col1 File Output  
                      Operator compressed:true globaltableid:0 Table: Input Format:org.apache.hadoop.mapred.SequenceFileInputFormat Output Format:org.apa Che.hadoop.hive.ql.io.HiveSequenceFileOutputFormat stage:stage-2 map Reduce Alias-> Map Oper   
            Ator tree:hdfs://10.2.6.102/tmp/hive-hadoop/hive_2013-07-26_18-10-59_408_7272696604651905662/-mr-10002 Reduce Output Operator key expressions:expr: _col0 Typ E:int sort order: + tag:-1 Value expressions:expr  
              : _col1      Type:bigint Reduce Operator Tree:group by Operator aggregations:e Xpr:count (DISTINCT key._col0:0._col0) bucketgroup:false mode:mergepartial Outputcol  
                  Umnnames: _col0 Select Operator expressions:expr: _col0  
              Type:bigint outputcolumnnames: _col0 File Output Operator compressed:false globaltableid:0 Table:input FORMAT:ORG.APACHE.HADOOP.MAPRED.TEXTINP Utformat Output Format:org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat stage:s tage-0 Fetch Operator Limit:-1

Query plan in Query 2

Contact Us

The content source of this page is from Internet, which doesn't represent Alibaba Cloud's opinion; products and services mentioned on that page don't have any relationship with Alibaba Cloud. If the content of the page makes you feel confusing, please write us an email, we will handle the problem within 5 days after receiving your email.

If you find any instances of plagiarism from the community, please send an email to: info-contact@alibabacloud.com and provide relevant evidence. A staff member will contact you within 5 working days.

A Free Trial That Lets You Build Big!

Start building with 50+ products and up to 12 months usage for Elastic Compute Service

  • Sales Support

    1 on 1 presale consultation

  • After-Sales Support

    24/7 Technical Support 6 Free Tickets per Quarter Faster Response

  • Alibaba Cloud offers highly flexible support services tailored to meet your exact needs.