Hive SQL optimization: distribute by and sort by
I've recently been optimizing Hive SQL. Below is a query that sorts and then, after grouping, takes the first row of each group:
INSERT OVERWRITE TABLE t_wa_funnel_distinct_temp PARTITION (pt='${SRCTIME}')
SELECT
    bussiness_id,
    cookie_id,
    session_id,
    funnel_id,
    group_first(funnel_name) funnel_name,
    step_id,
    group_first(step_name) step_name,
    group_first(log_type) log_type,
    group_first(url_pattern) url_pattern,
    group_first(url) url,
    group_first(refer) refer,
    group_first(log_time) log_time,
    group_first(is_new_visitor) is_new_visitor,
    group_first(is_mobile_traffic) is_mobile_traffic,
    group_first(is_bounce) is_bounce,
    group_first(campaign_name) campaign_name,
    group_first(group_name) group_name,
    group_first(slot_name) slot_name,
    group_first(source_type) source_type,
    group_first(next_page) next_page,
    group_first(continent) continent,
    group_first(sub_continent_region) sub_continent_region,
    group_first(country) country,
    group_first(region) region,
    group_first(city) city,
    group_first(language) language,
    group_first(browser) browser,
    group_first(os) os,
    group_first(screen_color) screen_color,
    group_first(screen_resolution) screen_resolution,
    group_first(flash_version) flash_version,
    group_first(java) java,
    group_first(host) host
FROM
    ( SELECT *
      FROM r_wa_funnel
      WHERE pt='${SRCTIME}'
      ORDER BY bussiness_id, cookie_id, session_id, funnel_id, step_id, log_time ASC
    ) t1
GROUP BY pt, bussiness_id, cookie_id, session_id, funnel_id, step_id;
group_first: a custom UDF that takes the first value of a column within each group.
${SRCTIME}: passed in by the external Oozie scheduler as the time partition, accurate to the hour, e.g. 2011.11.01.21.
Now run the SQL above in Hive with SRCTIME = 2011.11.01.21; that hourly partition holds 10,435,486 records.
Execution time: 6 minutes 43 seconds.
From the output above you can see that the reduce phase has only a single reducer. That's because ORDER BY is a global sort, and Hive can only perform it through one reducer.
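You can confirm this before running the full job (a minimal sketch using Hive's standard EXPLAIN statement on just the inner sorted subquery):

EXPLAIN
SELECT *
FROM r_wa_funnel
WHERE pt = '2011.11.01.21'
ORDER BY bussiness_id, cookie_id, session_id, funnel_id, step_id, log_time ASC;

When the job actually runs, the log will report something like "Number of reduce tasks determined at compile time: 1"; no setting can raise it, since a global sort needs all rows on one reducer.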
Looking at the business requirement, all we need is to group by bussiness_id, cookie_id, session_id, funnel_id and step_id, and sort within each group by log_time in ascending order.
So we can use Hive's DISTRIBUTE BY and SORT BY instead. This makes full use of the Hadoop cluster: each of several reducers sorts its own partition of the data locally by log_time.
The optimized Hive code:
INSERT OVERWRITE TABLE t_wa_funnel_distinct PARTITION (pt='2011.11.01.21')
SELECT
    bussiness_id,
    cookie_id,
    session_id,
    funnel_id,
    group_first(funnel_name) funnel_name,
    step_id,
    group_first(step_name) step_name,
    group_first(log_type) log_type,
    group_first(url_pattern) url_pattern,
    group_first(url) url,
    group_first(refer) refer,
    group_first(log_time) log_time,
    group_first(is_new_visitor) is_new_visitor,
    group_first(is_mobile_traffic) is_mobile_traffic,
    group_first(is_bounce) is_bounce,
    group_first(campaign_name) campaign_name,
    group_first(group_name) group_name,
    group_first(slot_name) slot_name,
    group_first(source_type) source_type,
    group_first(next_page) next_page,
    group_first(continent) continent,
    group_first(sub_continent_region) sub_continent_region,
    group_first(country) country,
    group_first(region) region,
    group_first(city) city,
    group_first(language) language,
    group_first(browser) browser,
    group_first(os) os,
    group_first(screen_color) screen_color,
    group_first(screen_resolution) screen_resolution,
    group_first(flash_version) flash_version,
    group_first(java) java,
    group_first(host) host
FROM
    ( SELECT *
      FROM r_wa_funnel
      WHERE pt='2011.11.01.21'
      DISTRIBUTE BY bussiness_id, cookie_id, session_id, funnel_id, step_id
      SORT BY log_time ASC
    ) t1
GROUP BY bussiness_id, cookie_id, session_id, funnel_id, step_id;
Execution time: 35 seconds.
The first query took 6:43 to run, while the optimized one finishes in 0:35, a dramatic performance improvement.
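Unlike the ORDER BY version, the DISTRIBUTE BY version's reducer count can also be tuned explicitly. A sketch using the standard Hive settings (the same ones the job log quotes later in this post; the values here are illustrative):

-- Pin an exact number of reducers
set mapred.reduce.tasks=32;
-- Or keep Hive's estimate, but bound it
set hive.exec.reducers.bytes.per.reducer=256000000;
set hive.exec.reducers.max=64;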
Hive SQL: selecting the row with the earliest time
select orderid, fenjian, timee
from
(
    -- row_number here is a custom UDF (like group_first above) that numbers
    -- rows within each (orderid, fenjian) group in their arrival order
    select orderid, fenjian, timee, row_number(orderid, fenjian) rn
    from (
        select orderid, fenjian, timee from tableName
        distribute by orderid, fenjian sort by orderid, fenjian, timee asc
    ) t1
) t2
where t2.rn = 1
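On Hive 0.11 and later, the built-in row_number() windowing function gives the same result without the custom UDF or the manual distribute by/sort by (a sketch against the same hypothetical tableName):

select orderid, fenjian, timee
from (
    select orderid, fenjian, timee,
           row_number() over (partition by orderid, fenjian order by timee asc) rn
    from tableName
) t
where t.rn = 1;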
In Hive SQL, can someone describe how a simple SQL statement works under the hood?
select a.id, a.info, b.num from a join b on a.id = b.id where b.num >= 10
The two tables are joined; first, the WHERE clause filters out the rows you don't need.
As for how a table gets mapped and reduced: tables in Hive are virtual; the real operations run against HDFS files. Under the hdfs:///user/hive/warehouse path you'll find directories named after each table holding the table's contents, which you can inspect with the -cat command. So the map step is simple: it reads the file line by line and splits each line on Hive's default field delimiter, \001. After splitting, rows are combined according to the logic your SQL specifies, and the result is written back out as HDFS files; Hive just presents that output as a table.
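For example, you can check this from the Hive CLI (a sketch: the warehouse path shown is the default location, the table is the one from earlier in this post, and the data file name 000000_0 is illustrative):

-- Show the table's HDFS location, storage format and field delimiter
DESCRIBE FORMATTED r_wa_funnel;

-- Dump a raw data file directly from the Hive CLI;
-- fields inside each line are separated by the default \001 delimiter
dfs -cat /user/hive/warehouse/r_wa_funnel/pt=2011.11.01.21/000000_0;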
The number of jobs is reported in the log right after you submit the SQL statement:
Total MapReduce jobs = 2
Launching Job 1 out of 2
Number of reduce tasks not specified. Estimated from input data size: 2
In order to change the average load for a reducer (in bytes):
set hive.exec.reducers.bytes.per.reducer=<number>
In order to limit the maximum number of reducers:
set hive.exec.reducers.max=<number>
In order to set a constant number of reducers:
set mapred.reduce.tasks=<number>
This shows there are two jobs, and the first one is currently running.
Hadoop job information for Stage-1: number of mappers: 5; number of reducers: 2
And this line tells you how many mappers and reducers the stage uses.
The SQL you wrote contains a join, and it's the plainest kind of join in Hive SQL, so reducers are bound to be involved. With a large data set, say tens of millions of records, the join becomes very slow and the job sits stuck at the reduce phase. You can switch to a map join or a sort merge bucket map join, as sketched below.
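A sketch of both alternatives, using the hypothetical tables a and b from the question (the MAPJOIN hint and these settings are standard Hive; the SMB variant additionally requires both tables to be bucketed and sorted on the join key):

-- Map join: replicate the small table b to every mapper, so no reduce phase is needed
select /*+ MAPJOIN(b) */ a.id, a.info, b.num
from a join b on a.id = b.id
where b.num >= 10;

-- Or let Hive convert eligible joins to map joins automatically
set hive.auto.convert.join=true;

-- Sort merge bucket map join (both tables bucketed and sorted on id)
set hive.optimize.bucketmapjoin=true;
set hive.optimize.bucketmapjoin.sortedmerge=true;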
To be honest, Hive is not very efficient and isn't suited to real-time queries. Even a query against an empty table takes noticeable time, because Hive has to translate the SQL statement into MapReduce jobs. It simplifies distributed programming, but you pay for that in efficiency.
Your particular statement should translate into a single job: just a simple map and reduce.
MapReduce, in short, reads files line by line, splits, merges, and writes the result back out as files.