Indexes are standard database technologies that support indexing after the hive0.7 version. Hive provides limited indexing functionality, unlike the traditional relational database with the "key" concept, where users can create indexes on some columns to speed up certain operations, and index data created for one table is saved in another table. Hive's indexing function is now relatively late and offers fewer options. However, the index is designed to be controlled using the built-in pluggable Java code that allows users to extend this functionality to suit their needs. Not that all queries will benefit from the Hive Index, of course. Users can use the explain syntax to analyze whether the HIVEQL statement can use indexes to improve the performance of user queries. Like an index in an RDBMS, it is necessary to evaluate whether index creation is reasonable, after all, the index requires more disk space and there is a cost to creating a maintenance index. The user has to weigh the benefits and costs of getting the index.
Let's talk about how to create an index:
1. Create a table first:
[Java]View PlainCopy
- Hive> CREATE TABLE User (id int, name string)
- > ROW FORMAT Delimited
- > Fields TERMINATED by ' \ t '
- > STORED as Textfile;
2. Import data:
[Java]View PlainCopy
- hive> Load Data local inpath '/export1/tmp/wyp/row.txt '
- > overwrite into table user;
3. Test before creating an index:
[Java]View PlainCopy
- Hive> SELECT * from user where id =500000;
- Total MapReduce jobs = 1
- Launching Job 1 out of 1
- Number of reduce tasks is set to 0 since there ' s no reduce operator
- Cannot run job locally:input Size (= 356888890) is larger than
- Hive.exec.mode.local.auto.inputbytes.max (= 134217728)
- Starting Job = job_1384246387966_0247, Tracking URL =
- http://l-datalogm1.data.cn1:9981/proxy/application_1384246387966_0247/
- Kill Command=/home/q/hadoop/bin/hadoop Job-kill job_1384246387966_0247
- Hadoop job information for stage-1:number of mappers:2; Number of reducers:0
- 2013-11- :336 stage-1 map = 0, reduce = 0% /c9>
- 2013-11- :stage-1 map=50%,reduce=0, Cumulative CPU 2.0 sec
- 2013-11- :xx,531 stage-1 map=100%,reduce=0, Cumulative CPU 5.63 sec
- 2013-11- :560 stage-1 map=100%,reduce=0, Cumulative CPU 5.63 sec /c10>
- MapReduce Total Cumulative CPU time: 5 seconds 630 msec
- Ended Job = job_1384246387966_0247
- MapReduce Jobs Launched:
- Job 0:map: 2 cumulative CPU: 5.63 sec
- HDFs Read: 361084006 hdfs Write: 357 SUCCESS
- Total MapReduce CPU time spent: 5 seconds 630 msec
- Ok
- 500000 Wyp.
- Time taken: 14.107 seconds, fetched: 1 row (s)
It took a total of 14.107s.
4. Create an index on user:
[Java]View PlainCopy
- Hive> CREATE index User_index on table user (ID)
- > as ' Org.apache.hadoop.hive.ql.index.compact.CompactIndexHandler '
- > with deferred rebuild
- > in TABLE user_index_table;
- Hive> ALTER index user_index on user rebuild;
- Hive> select * from user_index_table limit 5;
- 0 HDFs://mycluster/user/hive/warehouse/table02/000000_0 [0]
- 1 HDFs://mycluster/user/hive/warehouse/table02/000000_0 [352]
- 2 HDFs://mycluster/user/hive/warehouse/table02/000000_0 [704]
- 3 HDFs://mycluster/user/hive/warehouse/table02/000000_0 [1056]
- 4 HDFs://mycluster/user/hive/warehouse/table02/000000_0 [1408]
- Time taken: 0.244 seconds, fetched: 5 row (s)
This creates an index on the user table.
5. Test the user after the index is created:
[Java]View PlainCopy
- Hive> SELECT * from user where id =500000;
- Total MapReduce jobs = 1
- Launching Job 1 out of 1
- Number of reduce tasks is set to 0 since there ' s no reduce operator
- Cannot run job locally:input Size (= 356888890) is larger than
- Hive.exec.mode.local.auto.inputbytes.max (= 134217728)
- Starting Job = job_1384246387966_0247, Tracking URL =
- http://l-datalogm1.data.cn1:9981/proxy/application_1384246387966_0247/
- Kill Command=/home/q/hadoop/bin/hadoop Job-kill job_1384246387966_0247
- Hadoop job information for stage-1:number of mappers:2; Number of reducers:0
- 2013-11- :336 stage-1 map = 0, reduce = 0%
- 2013-11- :stage-1 map=50%,reduce=0, Cumulative CPU 2.0 sec
- 2013-11-: xx,253 stage-1 map=100%,reduce=0, Cumulative CPU 5.27 sec /c10>
- 2013-11-: 650 stage-1 map=100%,reduce=0, Cumulative CPU 5.27 sec
- MapReduce Total Cumulative CPU time: 5 seconds 630 msec
- Ended Job = job_1384246387966_0247
- MapReduce Jobs Launched:
- Job 0:map: 2 cumulative CPU: 5.63 sec
- HDFs Read: 361084006 hdfs Write: 357 SUCCESS
- Total MapReduce CPU time spent: 5 seconds 630 msec
- Ok
- 500000 Wyp.
- Time taken: 13.042 seconds, fetched: 1 row (s)
The time is 13.042s, which is not very different from the effect of no index.
There is still a bug in creating the index in hive: If the schema information from the table is from Serde,hive, the index cannot be created:
[Java]View PlainCopy
- hive> CREATE INDEX Employees_index
- > On TABLE Employees (country)
- > as ' Org.apache.hadoop.hive.ql.index.compact.CompactIndexHandler '
- > with DEFERRED REBUILD
- > idxproperties (' creator ' = ' me ',' created_at ' = ' some_time ')
- > in TABLE employees_index_table
- > COMMENT ' Employees indexed by country and name. ';
- Failed:error in metadata:java.lang.RuntimeException: \
- Check the index columns, they should appear in the table being indexed.
- Failed:execution Error, return code 1 from \
- Org.apache.hadoop.hive.ql.exec.DDLTask
This bug occurred in Hive0.10.0, 0.10.1, 0.11.0, in Hive0.12.0 has been repaired, see for details: https://issues.apache.org/jira/browse/HIVE-4251
Article from past memory: http://www.iteblog.com/archives/836
Hive CREATE Index