This script is slow to run, mainly caused by the reduce data skew, to understand that the Dw.fct_traffic_navpage_path_detl table is used to collect user click Data, then ultimately
The shopping cart and the click of the order must be very few, so this table ordr_code the word blank and the cart_prod_id field is a large amount of NULL data, as follows:
Select Ordr_code,count (*) as a from dw.fct_traffic_navpage_path_detl where ds = ' 2015-05-10 ' GROUP by Ordr_code have a >10000;
151722135
Select Cart_prod_id,count (*) as a fromdw.fct_traffic_navpage_path_detl where ds = ' 2015-05-10 ' groupby cart_prod_id hav ing a>10000;
NULL 127233335
for CREATE TABLE Tmp_lifan_trfc_tpa as this sentence SQL , BI Add the following configuration,
Set hive.mapjoin.smalltable.filesize = 120000000; Because the dw.univ_parnt_tranx_comb_detl table is not more than 120MB, if it is the hive on Tez to use Hive.auto.convert.join.noconditionaltask.size, so Tez generates broadcast
Sethive.auto.convert.join=true;
Also modify the following SQL statement:
From Dw.fct_traffic_navpage_path_detl t
Left outer join Dw.univ_parnt_tranx_comb_detl o //Use Mapjoin to resolve data skew
On t.ordr_code = O.parnt_ordr_code
and t.cart_prod_id = o.comb_prod_id
and O.ds = ' 2015-05-10 '
Left outer join Bic.cust_first_ordr_tranx F
on case o.end_user_id was null then cast (rand (9) *100as bigint) Else o.end_user_id end = F.E nd_user_id//join after number tilt with random number to avoid tilt, red for modified part
and f.first_ordr_date_id = ' 2015-05-10 '
where T.ds = ' 2015-05-10 ';
Post-run SQL can be completed within a manageable time.
Hive SQL Optimized data skew