Hive索引,hive
1、 Hive索引概述
Hive的索引目的是提高Hive表指定列的查詢速度。
沒有索引時,類似'WHERE tab1.col1 = 10' 的查詢,Hive會載入整張表或分區,然後處理所有的rows,但是如果在欄位col1上面存在索引時,那麼只會載入和處理檔案的一部分。
與其他傳統資料庫一樣,增加索引在提升查詢速度時,會消耗額外資源去建立索引和需要更多的磁碟空間儲存索引。
Hive 0.7.0版本中,加入了索引。Hive 0.8.0版本中增加了bitmap索引。
2、 索引相關的配置參數
hive.index.compact.file.ignore.hdfs
Default Value: false
Added In: Hive 0.7.0 withHIVE-1889
在索引檔案中儲存的hdfs地址將在運行時被忽略,如果開啟的話;如果資料被遷移,那麼索引檔案依然可用,預設是false
hive.optimize.index.filter
Default Value: false
Added In: Hive 0.8.0 withHIVE-1644
是否自動使用索引, 預設是false
hive.optimize.index.filter.compact.minsize
Default Value: 5368709120
Added In: Hive 0.8.0 withHIVE-1644
壓縮索引自動應用的最小輸入大小
hive.optimize.index.filter.compact.maxsize
Default Value: -1
Added In: Hive 0.8.0 withHIVE-1644
壓縮索引自動應用的最大輸入大小,負值代表正無窮
hive.index.compact.query.max.size
Default Value: 10737418240
Added In: Hive 0.8.0 withHIVE-2096
一個使用壓縮索引做的查詢能取到的最大資料量,預設是10737418240 個byte;負值代表無窮大;
hive.index.compact.query.max.entries
Default Value: 10000000
Added In: Hive 0.8.0 withHIVE-2096
使用壓縮索引查詢時能讀到的最大索引項目數,預設是10000000;負值代表無窮大;
hive.exec.concatenate.check.index
Default Value: true
Added In: Hive 0.8.0 withHIVE-2125
如果設定為true,那麼在做ALTER TABLE tbl_name CONCATENATE on a table/partition(有索引) 操作時,拋出錯誤;可以協助使用者避免index的刪除和重建;
hive.optimize.index.groupby
Default Value: false
Added In: Hive 0.8.1 withHIVE-1694
hive.index.compact.binary.search
Default Value: true
Added In: Hive 0.8.1with HIVE-2535
在索引表中是否開啟二分搜尋進行索引項目查詢,預設是true;
3、 索引樣本
注意:在Hive 0.12.0以及之前版本中,索引名稱在create index和drop index語句中是大小寫敏感的。然而,alter index 需要一個小寫索引名字。
此bug在Hive 0.13.0解決,此版本開始使索引名字大小寫不敏感。
對於Hive 0.13.0之前的版本,最好使用小寫索引名字。
下面介紹索引的常見用法:
A、 Create/build,show和drop index
create index table01_index ontable table01(column2) as 'COMPACT' with deferred rebuild;
show index on table01;
drop index table01_index ontable01;
B、 Create then build,show formatted和drop index
create index table02_index ontable table02(column3) as 'compact' with deferred rebuild;
alter index table02_index ontable02 rebuild;
show formatted index ontable02;
drop index table02_index ontable02;
C、 建立bitmap索引,build,show 和drop
createindex table03_index on table table03 (column4) as'bitmap' with deferred rebuild;
alter index table03_index ontable03 rebuild;
show formatted index ontable03;
drop index table03_index on table03;
D、 在一張新表上建立索引
createindex table04_index on table table04 (column5)as 'compact'with deferred rebuild in tabletable04_index_table;
E、 建立索引,儲存格式為RCFile
create index table05_index ontable table05 (column6) as 'compact'with deferred rebuildstored as rcfile;
F、 建立索引,儲存格式為TextFile
create index table06_index ontable table06 (column7) as 'compact'with deferredrebuild row format delimited fields terminated by '\t' stored as textfile;
G、 建立帶有索引屬性的索引
create index table07_index ontable table07 (column8) as 'compact'with deferred rebuild idxproperties("prop1"="value1", "prop2"="value2");
H、 建立帶有表屬性的索引
create index table08_index ontable table08 (column9) as 'compact'withdeferred rebuild tblproperties("prop3"="value3", "prop4"="value4");
I、 如果索引存在,則刪除
drop index if exists table09_indexon table09;
J、 在分區上重建索引
alter index table10_index on table10partition (columnx='valueq', columny='valuer') rebuild;
4、 索引測試
(1) 查詢表中行數
hive (hive)> select count(1)from userbook;
4409365
(2) 表中未建立索引前查詢
hive (hive)> select * fromuserbook where book_id = '15999998838';
Query ID =hadoop_20150627165551_595da79a-0e27-453b-9142-7734912934c4
Total jobs = 1
Launching Job 1 out of 1
Number of reduce tasks is setto 0 since there's no reduce operator
Starting Job =job_1435392961740_0012, Tracking URL =http://gpmaster:8088/proxy/application_1435392961740_0012/
Kill Command =/home/hadoop/hadoop-2.6.0/bin/hadoop job -kill job_1435392961740_0012
Hadoop job information forStage-1: number of mappers: 2; number of reducers: 0
2015-06-27 16:56:04,666 Stage-1map = 0%, reduce = 0%
2015-06-27 16:56:28,974 Stage-1map = 50%, reduce = 0%, Cumulative CPU4.36 sec
2015-06-27 16:56:31,123 Stage-1map = 78%, reduce = 0%, Cumulative CPU6.21 sec
2015-06-27 16:56:34,698 Stage-1map = 100%, reduce = 0%, Cumulative CPU7.37 sec
MapReduce Total cumulative CPUtime: 7 seconds 370 msec
Ended Job =job_1435392961740_0012
MapReduce Jobs Launched:
Stage-Stage-1: Map: 2 Cumulative CPU: 7.37 sec HDFS Read: 348355875 HDFS Write: 76 SUCCESS
Total MapReduce CPU Time Spent:7 seconds 370 msec
OK
userbook.book_id userbook.book_name userbook.author userbook.public_date userbook.address
15999998838 uviWfFJ KwCrDOA 2009-12-27 3b74416d-eb69-48e2-9d0d-09275064691b
Time taken: 45.678 seconds, Fetched: 1 row(s)
(3) 建立索引
hive (hive)> create indexuserbook_bookid_idx on table userbook(book_id) as 'COMPACT' WITH DEFERREDREBUILD;
(4) 建立索引後再執行查詢
hive (hive)> select * fromuserbook where book_id = '15999998838';
Query ID =hadoop_20150627170019_5bb5514a-4c8e-4c47-9347-ed0657e1f2ff
Total jobs = 1
Launching Job 1 out of 1
Number of reduce tasks is setto 0 since there's no reduce operator
Starting Job =job_1435392961740_0013, Tracking URL = http://gpmaster:8088/proxy/application_1435392961740_0013/
Kill Command =/home/hadoop/hadoop-2.6.0/bin/hadoop job -kill job_1435392961740_0013
Hadoop job information forStage-1: number of mappers: 2; number of reducers: 0
2015-06-27 17:00:30,429 Stage-1map = 0%, reduce = 0%
2015-06-27 17:00:54,003 Stage-1map = 50%, reduce = 0%, Cumulative CPU7.43 sec
2015-06-27 17:00:56,181 Stage-1map = 78%, reduce = 0%, Cumulative CPU9.66 sec
2015-06-27 17:00:58,417 Stage-1map = 100%, reduce = 0%, Cumulative CPU10.83 sec
MapReduce Total cumulative CPUtime: 10 seconds 830 msec
Ended Job =job_1435392961740_0013
MapReduce Jobs Launched:
Stage-Stage-1: Map: 2 Cumulative CPU: 10.83 sec HDFS Read: 348356271 HDFS Write: 76 SUCCESS
Total MapReduce CPU Time Spent:10 seconds 830 msec
OK
userbook.book_id userbook.book_name userbook.author userbook.public_date userbook.address
15999998838 uviWfFJ KwCrDOA 2009-12-27 3b74416d-eb69-48e2-9d0d-09275064691b
Time taken: 40.549 seconds, Fetched: 1 row(s)
可以看到建立索引後,速度還是稍快一點的。
其實對於這種簡單的查詢,通過我們的設定,可以不用啟動Map/Reduce的,而是啟動Fetch task,直接從HDFS檔案中filter過濾出需要的資料,需要設定如下參數:
set hive.fetch.task.conversion=more;
hive (hive)> select * fromuserbook where book_id = '15999998838';
OK
userbook.book_id userbook.book_name userbook.author userbook.public_date userbook.address
15999998838 uviWfFJ KwCrDOA 2009-12-27 3b74416d-eb69-48e2-9d0d-09275064691b
Time taken: 0.093 seconds,Fetched: 1 row(s)
可以看到速度更快了,畢竟省略掉了開啟MR任務,執行效率提高不少。
參考:https://cwiki.apache.org/confluence/display/Hive/LanguageManual+Indexing