本文出自 “熾天使” 部落格,請務必保留此出處http://3199782.blog.51cto.com/3189782/703873
原創作品,允許轉載,轉載時請務必以超連結形式標明文章
原始出處 、作者資訊和本聲明。否則將追究法律責任。http://3199782.blog.51cto.com/3189782/703873
最近在最佳化hiveSQL,
下面是一段排序,分組後取每組第一行記錄的SQL
- INSERT OVERWRITE TABLE t_wa_funnel_distinct_temp PARTITION (pt='${SRCTIME}')
- SELECT
- bussiness_id,
- cookie_id,
- session_id,
- funnel_id,
- group_first(funnel_name) funnel_name,
- step_id,
- group_first(step_name) step_name,
- group_first(log_type) log_type,
- group_first(url_pattern) url_pattern,
- group_first(url) url,
- group_first(refer) refer,
- group_first(log_time) log_time,
- group_first(is_new_visitor) is_new_visitor,
- group_first(is_mobile_traffic) is_mobile_traffic,
- group_first(is_bounce) is_bounce,
- group_first(campaign_name) campaign_name,
- group_first(group_name) group_name,
- group_first(slot_name) slot_name,
- group_first(source_type) source_type,
- group_first(next_page) next_page,
- group_first(continent) continent,
- group_first(sub_continent_region) sub_continent_region,
- group_first(country) country,
- group_first(region) region,
- group_first(city) city,
- group_first(language) language,
- group_first(browser) browser,
- group_first(os) os,
- group_first(screen_color) screen_color,
- group_first(screen_resolution) screen_resolution,
- group_first(flash_version) flash_version,
- group_first(java) java,
- group_first(host) host
- FROM
- ( SELECT *
- FROM r_wa_funnel
- WHERE pt='${SRCTIME}'
- ORDER BY bussiness_id, cookie_id, session_id, funnel_id, step_id, log_time ASC
- ) t1
- GROUP BY pt, bussiness_id, cookie_id, session_id, funnel_id, step_id;
group_first: 自訂函數, 使用者取每組第一個欄位
${SRCTIME}: 由外部oozie調度傳入, 作為時間分區,精確到小時. eg: 2011.11.01.21
下面在hive上以SRCTIME =
2011.11.01.21 執行以上SQL.
2011.11.01.21 小時分區記錄數有10435486
執行時間:
從上面可以看出,reduce階段只有一個reduce, 這是因為ORDER BY是全域排序,hive只能通過一個reduce進行排序
從業務需求來看, 只要按bussiness_id, cookie_id, session_id, funnel_id, step_id分組,組內按
log_time升序排序即可.
OK, 這樣可以採用hive提供的distribute by 和 sort by, 這樣可以充分利用hadoop資源, 在多個
reduce中局部按log_time 排序
最佳化有的hive代碼:
- INSERT OVERWRITE TABLE t_wa_funnel_distinct PARTITION (pt='2011.11.01.21')
- SELECT
- bussiness_id,
- cookie_id,
- session_id,
- funnel_id,
- group_first(funnel_name) funnel_name,
- step_id,
- group_first(step_name) step_name,
- group_first(log_type) log_type,
- group_first(url_pattern) url_pattern,
- group_first(url) url,
- group_first(refer) refer,
- group_first(log_time) log_time,
- group_first(is_new_visitor) is_new_visitor,
- group_first(is_mobile_traffic) is_mobile_traffic,
- group_first(is_bounce) is_bounce,
- group_first(campaign_name) campaign_name,
- group_first(group_name) group_name,
- group_first(slot_name) slot_name,
- group_first(source_type) source_type,
- group_first(next_page) next_page,
- group_first(continent) continent,
- group_first(sub_continent_region) sub_continent_region,
- group_first(country) country,
- group_first(region) region,
- group_first(city) city,
- group_first(language) language,
- group_first(browser) browser,
- group_first(os) os,
- group_first(screen_color) screen_color,
- group_first(screen_resolution) screen_resolution,
- group_first(flash_version) flash_version,
- group_first(java) java,
- group_first(host) host
- FROM
- ( SELECT *
- FROM r_wa_funnel
- WHERE pt='2011.11.01.21'
- distribute by bussiness_id, cookie_id, session_id, funnel_id, step_id sort by log_time ASC
- ) t1
- GROUP BY bussiness_id, cookie_id, session_id, funnel_id, step_id;
執行時間:
第一個需要執行6:43, 而最佳化有只要執行0:35秒, 效能得到大幅提升
本文出自 “熾天使” 部落格,請務必保留此出處http://3199782.blog.51cto.com/3189782/703873