Hive index, hive
1. Hive index Overview
Hive indexes aim to improve the query speed of specified columns in Hive tables.
If no index exists, a query similar to 'where tab1.col1 = 10', Hive loads the entire table or partition and processes all the rows. However, if an index exists on the field col1, only part of the file will be loaded and processed.
Like other traditional databases, increasing indexes will consume additional resources to create indexes and require more disk space to store indexes.
The index is added to Hive version 0.7.0. The bitmap index is added to Hive 0.8.0.
2. Index-related configuration parameters
Hive. index. compact. file. ignore. hdfs
Default Value: false
Added In: Hive 0.7.0 withHIVE-1889
The hdfs address stored in the index file will be ignored at runtime. If it is enabled, if the data is migrated, the index file will still be available. The default value is false.
Hive. optimize. index. filter
Default Value: false
Added In: Hive 0.8.0 withHIVE-1644
Whether to use indexes automatically. The default value is false.
Hive. optimize. index. filter. compact. minsize
Default Value: 5368709120
Added In: Hive 0.8.0 withHIVE-1644
Minimum Size of input for automatic index application Compression
Hive. optimize. index. filter. compact. maxsize
Default Value:-1
Added In: Hive 0.8.0 withHIVE-1644
The maximum input size of the automatically applied compressed index. A negative value indicates positive infinity.
Hive. index. compact. query. max. size
Default Value: 10737418240
Added In: Hive 0.8.0 withHIVE-2096
The maximum data size that can be obtained for a query using a compressed index. The default value is 10737418240 bytes. negative values indicate infinity;
Hive. index. compact. query. max. entries
Default Value: 10000000
Added In: Hive 0.8.0 withHIVE-2096
The maximum number of index items that can be read when you use the compressed index query. The default value is 10000000. negative values indicate infinity;
Hive.exe c. concatenate. check. index
Default Value: true
Added In: Hive 0.8.0 withHIVE-2125
If this parameter is set to true, an error is thrown when the alter table tbl_name CONCATENATE on a table/partition (with an index) operation is performed. This helps you avoid deleting and recreating indexes;
Hive. optimize. index. groupby
Default Value: false
Added In: Hive 0.8.1 withHIVE-1694
Hive. index. compact. binary. search
Default Value: true
Added In: Hive 0.8.1with HIVE-2535
Whether to enable binary search in the index table for index item query. The default value is true;
3. Index example
Note: In Hive 0.12.0 and earlier versions, index names are case sensitive in create index and drop index statements. However, alter index requires a lowercase index name.
This bug is fixed in Hive 0.6.2. This version starts to make the index name case insensitive.
For versions earlier than Hive 0.6.2, it is best to use lower-case index names.
The following describes the common usage of indexes:
A. Create/build, show, and drop index
Create index table01_index ontable table01 (column2)'Comput'With deferred rebuild;
Show index on table01;
Drop index table01_index ontable01;
B. Create then build, show formatted, and drop index
Create index table02_index ontable table02 (column3)'Comput'With deferred rebuild;
Alter index table02_index ontable02 rebuild;
Show formatted index ontable02;
Drop index table02_index ontable02;
C. Create bitmap indexes, build, show, and drop
Createindex table03_index on table table03 (column4)'Bitmap'With deferred rebuild;
Alter index table03_index ontable03 rebuild;
Show formatted index ontable03;
Drop index table03_index on table03;
D. Create an index on a new table
Createindex table04_index on table table04 (column5) as 'comput' with deferred rebuild in tabletable04_index_table;
E. Create an index. The storage format is RCFile.
Create index table05_index ontable table05 (column6)'Comput'With deferred rebuildstored as rcfile;
F. Create an index. The storage format is TextFile.
Create index table06_index ontable table06 (column7)As 'comput'With deferredrebuild row format delimited fields terminated by '\ t' stored as textfile;
G. Create an index with the index attribute
Create index table07_index ontable table07 (column8)As 'comput'With deferred rebuildIdxproperties ("prop1" = "value1", "prop2" = "value2 ");
H. Create an index with table attributes
Create index table08_index ontable table08 (columns)As 'comput'Withdeferred rebuildTblproperties ("prop3" = "value3", "prop4" = "value4 ");
I. If the index exists, delete it.
Drop index if exists table09_indexon table09;
J. Re-indexing on partitions
Alter index table10_index on table10partition (columnx = 'valueq ', columny = 'valuer') rebuild;
4. index test
(1) query the number of rows in the table
Hive (hive)> select count (1) from userbook;
4409365
(2) query before creating an index in the table
Hive (hive)> select * fromuserbook where book_id = '2013 ';
Query ID = hadoop_20150627165551_595da79a-0e27-453b-9142-7734912934c4
Total jobs = 1
Launching Job 1 out of 1
Number of reduce tasks is setto 0 since there's no reduce operator
Starting Job = job_1435392961740_0012, Tracking URL = http: // gpmaster: 8088/proxy/application_1435392961740_0012/
Kill Command =/home/hadoop/hadoop-2.6.0/bin/hadoop job-kill job_1435392961740_0012
Hadoop job information forStage-1: number of mappers: 2; number of Hadoop CERs: 0
16:56:04, 666 Stage-1map = 0%, reduce = 0%
16:56:28, 974 Stage-1map = 50%, reduce = 0%, Cumulative CPU4.36 sec
16:56:31, 123 Stage-1map = 78%, reduce = 0%, Cumulative CPU6.21 sec
16:56:34, 698 Stage-1map = 100%, reduce = 0%, Cumulative CPU7.37 sec
MapReduce Total cumulative CPUtime: 7 secondds 370 msec
Ended Job = job_1435392961740_0012
MapReduce Jobs Launched:
Stage-1: Map: 2 Cumulative CPU: 7.37 sec HDFS Read: 348355875 HDFS Write: 76 SUCCESS
Total MapReduce CPU Time Spent: 7 secondds 370 msec
OK
Userbook. book_id userbook. book_name userbook. author userbook. public_date userbook. address
15999998838 uviWfFJ KwCrDOA 2009-12-27 3b74416d-eb69-48e2-9d0d-09275064691b
Time taken: 45.678 seconds, Fetched: 1 row (s)
(3) create an index
Hive (hive)> create indexuserbook_bookid_idx on table userbook (book_id) as 'comput' with deferredrebuild;
(4) execute the query after creating the index
Hive (hive)> select * fromuserbook where book_id = '2013 ';
Query ID = hadoop_20150627170019_5bb5514a-4c8e-4c47-9347-ed0657e1f2ff
Total jobs = 1
Launching Job 1 out of 1
Number of reduce tasks is setto 0 since there's no reduce operator
Starting Job = job_1435392961740_0013, Tracking URL = http: // gpmaster: 8088/proxy/application_1435392961740_0013/
Kill Command =/home/hadoop/hadoop-2.6.0/bin/hadoop job-kill job_1435392961740_0013
Hadoop job information forStage-1: number of mappers: 2; number of Hadoop CERs: 0
17:00:30, 429 Stage-1map = 0%, reduce = 0%
17:00:54, 003 Stage-1map = 50%, reduce = 0%, Cumulative CPU7.43 sec
17:00:56, 181 Stage-1map = 78%, reduce = 0%, Cumulative CPU9.66 sec
17:00:58, 417 Stage-1map = 100%, reduce = 0%, Cumulative CPU10.83 sec
MapReduce Total cumulative CPUtime: 10 secondds 830 msec
Ended Job = job_1435392961740_0013
MapReduce Jobs Launched:
Stage-1: Map: 2 Cumulative CPU: 10.83 sec HDFS Read: 348356271 HDFS Write: 76 SUCCESS
Total MapReduce CPU Time Spent: 10 secondds 830 msec
OK
Userbook. book_id userbook. book_name userbook. author userbook. public_date userbook. address
15999998838 uviWfFJ KwCrDOA 2009-12-27 3b74416d-eb69-48e2-9d0d-09275064691b
Time taken: 40.549 seconds, Fetched: 1 row (s)
We can see that the index creation speed is a little faster.
In fact, for this simple query, we do not need to start Map/Reduce through our settings, but instead start Fetch task to filter the required data directly from the HDFS file, you need to set the following parameters:
Set hive. fetch. task. conversion = more;
Hive (hive)> select * fromuserbook where book_id = '2013 ';
OK
Userbook. book_id userbook. book_name userbook. author userbook. public_date userbook. address
15999998838 uviWfFJ KwCrDOA 2009-12-27 3b74416d-eb69-48e2-9d0d-09275064691b
Time taken:0.093 seconds, Fetched: 1 row (s)
It can be seen that the speed is faster. After all, the enabling MR task is omitted, and the execution efficiency is improved a lot.
Reference: https://cwiki.apache.org/confluence/display/Hive/LanguageManual+Indexing