Hive SequenceFile vs. RCFile: an efficiency comparison

Last updated: 2018-12-04. Source: Internet

Uploaded by: User


The source data is stored in table test1 and is 26413896039 bytes in size.

Create a compressed SequenceFile table, test2, and load the data from test1 into it with an insert overwrite table test2 select ... statement, using the following settings:

set hive.exec.compress.output=true;
set mapred.output.compress=true;
set mapred.output.compression.codec=com.hadoop.compression.lzo.LzoCodec;
SET io.seqfile.compression.type=BLOCK;
set io.compression.codecs=com.hadoop.compression.lzo.LzoCodec;

Load time: 98.528 s. With the default compression type, RECORD, the same load took 418.936 s.
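As a rough sketch, the SequenceFile load could look like the HiveQL below, run in the same session after the settings listed above. The column list is hypothetical: the article does not give the schema of test1, and domain and referer are the only column names it mentions.

```sql
-- Hypothetical schema: the article does not list the columns of test1.
-- domain and referer are the only column names it mentions.
CREATE TABLE test2 (
  domain  STRING,
  referer STRING
)
STORED AS SEQUENCEFILE;

-- Rewrite test1 into test2. With hive.exec.compress.output=true and
-- io.seqfile.compression.type=BLOCK already set in this session,
-- the output files are block-compressed SequenceFiles.
INSERT OVERWRITE TABLE test2
SELECT domain, referer
FROM test1;
```

To reproduce the slower RECORD measurement, leave io.seqfile.compression.type at its default before running the INSERT.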

Create an RCFile table, test3, and load it from test1 in the same way, using the following settings:

set hive.exec.compress.output=true;
set mapred.output.compress=true;
set mapred.output.compression.codec=com.hadoop.compression.lzo.LzoCodec;
set io.compression.codecs=com.hadoop.compression.lzo.LzoCodec;

Load time: 253.876 s.
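The RCFile load differs from the SequenceFile one only in the storage clause. A minimal sketch, again with a hypothetical column list because the article does not show the real schema:

```sql
-- Hypothetical schema, as above; only the storage format changes.
CREATE TABLE test3 (
  domain  STRING,
  referer STRING
)
STORED AS RCFILE;

-- Same load pattern as for test2, run after the compression settings.
INSERT OVERWRITE TABLE test3
SELECT domain, referer
FROM test1;
```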

Other statistics for comparison:

rows	type	load (merge) time	files	total size (bytes)	count(1)	top 100 clicks by domain and referer
238610458	raw data	-	1134	26413896039	66.297 s	-
238610458	seq	98.528 s (BLOCK), 418.936 s (RECORD)	1134	32252973826	41.578 s	394.949 s (data read: 32,253,519,280; rows read: 238610458)
238610458	rcfile	253.876 s	15	3765481781	29.318 s	286.588 s (data read: 1,358,993, rows read: 238610458)

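Because the raw data consists entirely of small files, merging them during the load should sharply reduce the file count, and for RCFile it does (1134 files down to 15). Surprisingly, Hive's SequenceFile output still has the original number of files (1134). RCFile with LZO compresses well, at a ratio of about 7x (26413896039 / 3765481781 ≈ 7.0). For the top-100 query, which touches only a small part of the data, the RCFile table reads only about 4% as much data as the SequenceFile table, while the number of rows read is the same.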
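The article does not show the two benchmark queries. They could plausibly be written as below, assuming each row is one click and that domain and referer are columns of the table. The alias clicks is made up for illustration.

```sql
-- Row count, corresponding to the count(1) column of the table.
SELECT COUNT(1)
FROM test3;

-- Top 100 (domain, referer) pairs by click count.
-- Assumption: one row equals one click.
SELECT domain, referer, COUNT(1) AS clicks
FROM test3
GROUP BY domain, referer
ORDER BY clicks DESC
LIMIT 100;
```

Running the same statements against test1 and test2 would give the other rows of the comparison.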
