Hive SQL optimization: distribute by and sort by
I've recently been optimizing Hive SQL. Below is a query that sorts and then, after grouping, takes the first row of each group:
INSERT OVERWRITE TABLE t_wa_funnel_distinct_temp PARTITION (pt='${SRCTIME}')
SELECT
    bussiness_id,
    cookie_id,
    session_id,
    funnel_id,
    group_first(funnel_name) funnel_name,
    step_id,
    group_first(step_name) step_name,
    group_first(log_type) log_type,
    group_first(url_pattern) url_pattern,
    group_first(url) url,
    group_first(refer) refer,
    group_first(log_time) log_time,
    group_first(is_new_visitor) is_new_visitor,
    group_first(is_mobile_traffic) is_mobile_traffic,
    group_first(is_bounce) is_bounce,
    group_first(campaign_name) campaign_name,
    group_first(group_name) group_name,
    group_first(slot_name) slot_name,
    group_first(source_type) source_type,
    group_first(next_page) next_page,
    group_first(continent) continent,
    group_first(sub_continent_region) sub_continent_region,
    group_first(country) country,
    group_first(region) region,
    group_first(city) city,
    group_first(language) language,
    group_first(browser) browser,
    group_first(os) os,
    group_first(screen_color) screen_color,
    group_first(screen_resolution) screen_resolution,
    group_first(flash_version) flash_version,
    group_first(java) java,
    group_first(host) host
FROM
    ( SELECT *
      FROM r_wa_funnel
      WHERE pt='${SRCTIME}'
      ORDER BY bussiness_id, cookie_id, session_id, funnel_id, step_id, log_time ASC
    ) t1
GROUP BY pt, bussiness_id, cookie_id, session_id, funnel_id, step_id;
group_first: a custom UDF that takes the first value of a column within each group.
${SRCTIME}: passed in by the external Oozie scheduler as the time partition, accurate to the hour, e.g. 2011.11.01.21.
Now run the SQL above in Hive with SRCTIME = 2011.11.01.21; that hourly partition holds 10,435,486 records.
Execution time: 6 minutes 43 seconds.
From the output above you can see that the reduce phase has only a single reducer. That's because ORDER BY is a global sort, and Hive can only perform it through one reducer.
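You can confirm this before running the full job (a minimal sketch using Hive's standard EXPLAIN statement on just the inner sorted subquery):

EXPLAIN
SELECT *
FROM r_wa_funnel
WHERE pt = '2011.11.01.21'
ORDER BY bussiness_id, cookie_id, session_id, funnel_id, step_id, log_time ASC;

When the job actually runs, the log will report something like "Number of reduce tasks determined at compile time: 1"; no setting can raise it, since a global sort needs all rows on one reducer.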
Looking at the business requirement, all we need is to group by bussiness_id, cookie_id, session_id, funnel_id and step_id, and sort within each group by log_time in ascending order.
So we can use Hive's DISTRIBUTE BY and SORT BY instead. This makes full use of the Hadoop cluster: each of several reducers sorts its own partition of the data locally by log_time.
The optimized Hive code:
INSERT OVERWRITE TABLE t_wa_funnel_distinct PARTITION (pt='2011.11.01.21')
SELECT
    bussiness_id,
    cookie_id,
    session_id,
    funnel_id,
    group_first(funnel_name) funnel_name,
    step_id,
    group_first(step_name) step_name,
    group_first(log_type) log_type,
    group_first(url_pattern) url_pattern,
    group_first(url) url,
    group_first(refer) refer,
    group_first(log_time) log_time,
    group_first(is_new_visitor) is_new_visitor,
    group_first(is_mobile_traffic) is_mobile_traffic,
    group_first(is_bounce) is_bounce,
    group_first(campaign_name) campaign_name,
    group_first(group_name) group_name,
    group_first(slot_name) slot_name,
    group_first(source_type) source_type,
    group_first(next_page) next_page,
    group_first(continent) continent,
    group_first(sub_continent_region) sub_continent_region,
    group_first(country) country,
    group_first(region) region,
    group_first(city) city,
    group_first(language) language,
    group_first(browser) browser,
    group_first(os) os,
    group_first(screen_color) screen_color,
    group_first(screen_resolution) screen_resolution,
    group_first(flash_version) flash_version,
    group_first(java) java,
    group_first(host) host
FROM
    ( SELECT *
      FROM r_wa_funnel
      WHERE pt='2011.11.01.21'
      DISTRIBUTE BY bussiness_id, cookie_id, session_id, funnel_id, step_id
      SORT BY log_time ASC
    ) t1
GROUP BY bussiness_id, cookie_id, session_id, funnel_id, step_id;
Execution time: 35 seconds.
The first query took 6:43 to run, while the optimized one finishes in 0:35, a dramatic performance improvement.
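Unlike the ORDER BY version, the DISTRIBUTE BY version's reducer count can also be tuned explicitly. A sketch using the standard Hive settings (the same ones the job log quotes later in this post; the values here are illustrative):

-- Pin an exact number of reducers
set mapred.reduce.tasks=32;
-- Or keep Hive's estimate, but bound it
set hive.exec.reducers.bytes.per.reducer=256000000;
set hive.exec.reducers.max=64;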
Hive SQL: selecting the row with the earliest time
select orderid, fenjian, timee
from
(
    -- row_number here is a custom UDF (like group_first above) that numbers
    -- rows within each (orderid, fenjian) group in their arrival order
    select orderid, fenjian, timee, row_number(orderid, fenjian) rn
    from (
        select orderid, fenjian, timee from tableName
        distribute by orderid, fenjian sort by orderid, fenjian, timee asc
    ) t1
) t2
where t2.rn = 1
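On Hive 0.11 and later, the built-in row_number() windowing function gives the same result without the custom UDF or the manual distribute by/sort by (a sketch against the same hypothetical tableName):

select orderid, fenjian, timee
from (
    select orderid, fenjian, timee,
           row_number() over (partition by orderid, fenjian order by timee asc) rn
    from tableName
) t
where t.rn = 1;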
In Hive SQL, can someone describe how a simple SQL statement works under the hood?
select a.id, a.info, b.num from a join b on a.id = b.id where b.num >= 10
The two tables are joined; first, the WHERE clause filters out the rows you don't need.
As for how a table gets mapped and reduced: tables in Hive are virtual; the real operations run against HDFS files. Under the hdfs:///user/hive/warehouse path you'll find directories named after each table holding the table's contents, which you can inspect with the -cat command. So the map step is simple: it reads the file line by line and splits each line on Hive's default field delimiter, \001. After splitting, rows are combined according to the logic your SQL specifies, and the result is written back out as HDFS files; Hive just presents that output as a table.
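For example, you can check this from the Hive CLI (a sketch: the warehouse path shown is the default location, the table is the one from earlier in this post, and the data file name 000000_0 is illustrative):

-- Show the table's HDFS location, storage format and field delimiter
DESCRIBE FORMATTED r_wa_funnel;

-- Dump a raw data file directly from the Hive CLI;
-- fields inside each line are separated by the default \001 delimiter
dfs -cat /user/hive/warehouse/r_wa_funnel/pt=2011.11.01.21/000000_0;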
The number of jobs is reported in the log right after you submit the SQL statement:
Total MapReduce jobs = 2
Launching Job 1 out of 2
Number of reduce tasks not specified. Estimated from input data size: 2
In order to change the average load for a reducer (in bytes):
set hive.exec.reducers.bytes.per.reducer=<number>
In order to limit the maximum number of reducers:
set hive.exec.reducers.max=<number>
In order to set a constant number of reducers:
set mapred.reduce.tasks=<number>
This shows there are two jobs, and the first one is currently running.
Hadoop job information for Stage-1: number of mappers: 5; number of reducers: 2
And this line tells you how many mappers and reducers the stage uses.
The SQL you wrote contains a join, and it's the plainest kind of join in Hive SQL, so reducers are bound to be involved. With a large data set, say tens of millions of records, the join becomes very slow and the job sits stuck at the reduce phase. You can switch to a map join or a sort merge bucket map join, as sketched below.
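A sketch of both alternatives, using the hypothetical tables a and b from the question (the MAPJOIN hint and these settings are standard Hive; the SMB variant additionally requires both tables to be bucketed and sorted on the join key):

-- Map join: replicate the small table b to every mapper, so no reduce phase is needed
select /*+ MAPJOIN(b) */ a.id, a.info, b.num
from a join b on a.id = b.id
where b.num >= 10;

-- Or let Hive convert eligible joins to map joins automatically
set hive.auto.convert.join=true;

-- Sort merge bucket map join (both tables bucketed and sorted on id)
set hive.optimize.bucketmapjoin=true;
set hive.optimize.bucketmapjoin.sortedmerge=true;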
To be honest, Hive is not very efficient and isn't suited to real-time queries. Even a query against an empty table takes noticeable time, because Hive has to translate the SQL statement into MapReduce jobs. It simplifies distributed programming, but you pay for that in efficiency.
Your particular statement should translate into a single job: just a simple map and reduce.
MapReduce, in short, reads files line by line, splits, merges, and writes the result back out as files.