Hive index, hive

Last Update:2015-06-28 Source: Internet

Author: User

Tags createindex

Developer on Alibaba Coud: Build your first app with APIs, SDKs, and tutorials on the Alibaba Cloud. Read more ＞

Hive index, hive

1. Hive index Overview

Hive indexes aim to improve the query speed of specified columns in Hive tables.

If no index exists, a query similar to 'where tab1.col1 = 10', Hive loads the entire table or partition and processes all the rows. However, if an index exists on the field col1, only part of the file will be loaded and processed.

Like other traditional databases, increasing indexes will consume additional resources to create indexes and require more disk space to store indexes.

The index is added to Hive version 0.7.0. The bitmap index is added to Hive 0.8.0.

2. Index-related configuration parameters

Hive. index. compact. file. ignore. hdfs

Default Value: false

Added In: Hive 0.7.0 withHIVE-1889

The hdfs address stored in the index file will be ignored at runtime. If it is enabled, if the data is migrated, the index file will still be available. The default value is false.

Hive. optimize. index. filter

Default Value: false

Added In: Hive 0.8.0 withHIVE-1644

Whether to use indexes automatically. The default value is false.

Hive. optimize. index. filter. compact. minsize

Default Value: 5368709120

Added In: Hive 0.8.0 withHIVE-1644

Minimum Size of input for automatic index application Compression

Hive. optimize. index. filter. compact. maxsize

Default Value:-1

Added In: Hive 0.8.0 withHIVE-1644

The maximum input size of the automatically applied compressed index. A negative value indicates positive infinity.

Hive. index. compact. query. max. size

Default Value: 10737418240

Added In: Hive 0.8.0 withHIVE-2096

The maximum data size that can be obtained for a query using a compressed index. The default value is 10737418240 bytes. negative values indicate infinity;

Hive. index. compact. query. max. entries

Default Value: 10000000

Added In: Hive 0.8.0 withHIVE-2096

The maximum number of index items that can be read when you use the compressed index query. The default value is 10000000. negative values indicate infinity;

Hive.exe c. concatenate. check. index

Default Value: true

Added In: Hive 0.8.0 withHIVE-2125

If this parameter is set to true, an error is thrown when the alter table tbl_name CONCATENATE on a table/partition (with an index) operation is performed. This helps you avoid deleting and recreating indexes;

Hive. optimize. index. groupby

Default Value: false

Added In: Hive 0.8.1 withHIVE-1694

Hive. index. compact. binary. search

Default Value: true

Added In: Hive 0.8.1with HIVE-2535

Whether to enable binary search in the index table for index item query. The default value is true;

3. Index example

Note: In Hive 0.12.0 and earlier versions, index names are case sensitive in create index and drop index statements. However, alter index requires a lowercase index name.

This bug is fixed in Hive 0.6.2. This version starts to make the index name case insensitive.

For versions earlier than Hive 0.6.2, it is best to use lower-case index names.

The following describes the common usage of indexes:

A. Create/build, show, and drop index

Create index table01_index ontable table01 (column2)'Comput'With deferred rebuild;

Show index on table01;

Drop index table01_index ontable01;

B. Create then build, show formatted, and drop index

Create index table02_index ontable table02 (column3)'Comput'With deferred rebuild;

Alter index table02_index ontable02 rebuild;

Show formatted index ontable02;

Drop index table02_index ontable02;

C. Create bitmap indexes, build, show, and drop

Createindex table03_index on table table03 (column4)'Bitmap'With deferred rebuild;

Alter index table03_index ontable03 rebuild;

Show formatted index ontable03;

Drop index table03_index on table03;

D. Create an index on a new table

Createindex table04_index on table table04 (column5) as 'comput' with deferred rebuild in tabletable04_index_table;

E. Create an index. The storage format is RCFile.

Create index table05_index ontable table05 (column6)'Comput'With deferred rebuildstored as rcfile;

F. Create an index. The storage format is TextFile.

Create index table06_index ontable table06 (column7)As 'comput'With deferredrebuild row format delimited fields terminated by '\ t' stored as textfile;

G. Create an index with the index attribute

Create index table07_index ontable table07 (column8)As 'comput'With deferred rebuildIdxproperties ("prop1" = "value1", "prop2" = "value2 ");

H. Create an index with table attributes

Create index table08_index ontable table08 (columns)As 'comput'Withdeferred rebuildTblproperties ("prop3" = "value3", "prop4" = "value4 ");

I. If the index exists, delete it.

Drop index if exists table09_indexon table09;

J. Re-indexing on partitions

Alter index table10_index on table10partition (columnx = 'valueq ', columny = 'valuer') rebuild;

4. index test

(1) query the number of rows in the table

Hive (hive)> select count (1) from userbook;

4409365

(2) query before creating an index in the table

Hive (hive)> select * fromuserbook where book_id = '2013 ';

Query ID = hadoop_20150627165551_595da79a-0e27-453b-9142-7734912934c4

Total jobs = 1

Launching Job 1 out of 1

Number of reduce tasks is setto 0 since there's no reduce operator

Starting Job = job_1435392961740_0012, Tracking URL = http: // gpmaster: 8088/proxy/application_1435392961740_0012/

Kill Command =/home/hadoop/hadoop-2.6.0/bin/hadoop job-kill job_1435392961740_0012

Hadoop job information forStage-1: number of mappers: 2; number of Hadoop CERs: 0

16:56:04, 666 Stage-1map = 0%, reduce = 0%

16:56:28, 974 Stage-1map = 50%, reduce = 0%, Cumulative CPU4.36 sec

16:56:31, 123 Stage-1map = 78%, reduce = 0%, Cumulative CPU6.21 sec

16:56:34, 698 Stage-1map = 100%, reduce = 0%, Cumulative CPU7.37 sec

MapReduce Total cumulative CPUtime: 7 secondds 370 msec

Ended Job = job_1435392961740_0012

MapReduce Jobs Launched:

Stage-1: Map: 2 Cumulative CPU: 7.37 sec HDFS Read: 348355875 HDFS Write: 76 SUCCESS

Total MapReduce CPU Time Spent: 7 secondds 370 msec

Userbook. book_id userbook. book_name userbook. author userbook. public_date userbook. address

15999998838 uviWfFJ KwCrDOA 2009-12-27 3b74416d-eb69-48e2-9d0d-09275064691b

Time taken: 45.678 seconds, Fetched: 1 row (s)

(3) create an index

Hive (hive)> create indexuserbook_bookid_idx on table userbook (book_id) as 'comput' with deferredrebuild;

(4) execute the query after creating the index

Hive (hive)> select * fromuserbook where book_id = '2013 ';

Query ID = hadoop_20150627170019_5bb5514a-4c8e-4c47-9347-ed0657e1f2ff

Total jobs = 1

Launching Job 1 out of 1

Number of reduce tasks is setto 0 since there's no reduce operator

Starting Job = job_1435392961740_0013, Tracking URL = http: // gpmaster: 8088/proxy/application_1435392961740_0013/

Kill Command =/home/hadoop/hadoop-2.6.0/bin/hadoop job-kill job_1435392961740_0013

Hadoop job information forStage-1: number of mappers: 2; number of Hadoop CERs: 0

17:00:30, 429 Stage-1map = 0%, reduce = 0%

17:00:54, 003 Stage-1map = 50%, reduce = 0%, Cumulative CPU7.43 sec

17:00:56, 181 Stage-1map = 78%, reduce = 0%, Cumulative CPU9.66 sec

17:00:58, 417 Stage-1map = 100%, reduce = 0%, Cumulative CPU10.83 sec

MapReduce Total cumulative CPUtime: 10 secondds 830 msec

Ended Job = job_1435392961740_0013

MapReduce Jobs Launched:

Stage-1: Map: 2 Cumulative CPU: 10.83 sec HDFS Read: 348356271 HDFS Write: 76 SUCCESS

Total MapReduce CPU Time Spent: 10 secondds 830 msec

Userbook. book_id userbook. book_name userbook. author userbook. public_date userbook. address

15999998838 uviWfFJ KwCrDOA 2009-12-27 3b74416d-eb69-48e2-9d0d-09275064691b

Time taken: 40.549 seconds, Fetched: 1 row (s)

We can see that the index creation speed is a little faster.

In fact, for this simple query, we do not need to start Map/Reduce through our settings, but instead start Fetch task to filter the required data directly from the HDFS file, you need to set the following parameters:

Set hive. fetch. task. conversion = more;

Hive (hive)> select * fromuserbook where book_id = '2013 ';

Userbook. book_id userbook. book_name userbook. author userbook. public_date userbook. address

15999998838 uviWfFJ KwCrDOA 2009-12-27 3b74416d-eb69-48e2-9d0d-09275064691b

Time taken:0.093 seconds, Fetched: 1 row (s)

It can be seen that the speed is faster. After all, the enabling MR task is omitted, and the execution efficiency is improved a lot.

Reference: https://cwiki.apache.org/confluence/display/Hive/LanguageManual+Indexing

This article is an English version of an article which is originally in the Chinese language on aliyun.com and is provided for information purposes only. This website makes no representation or warranty of any kind, either expressed or implied, as to the accuracy, completeness ownership or reliability of the article or any translations thereof. If you have any concerns or complaints relating to the article, please send an email, providing a detailed description of the concern or complaint, to info-contact@alibabacloud.com. A staff member will contact you within 5 working days. Once verified, infringing content will be removed immediately.

A Free Trial That Lets You Build Big!

Start building with 50+ products and up to 12 months usage for Elastic Compute Service

Get Started for Free

Sales Support

1 on 1 presale consultation

Chat Contact Sales
After-Sales Support

24/7 Technical Support 6 Free Tickets per Quarter Faster Response

Open a Ticket
Alibaba Cloud offers highly flexible support services tailored to meet your exact needs.

Learn More