Hive CREATE Index

Source: Internet
Author: User

Indexes are standard database technologies that support indexing after the hive0.7 version. Hive provides limited indexing functionality, unlike the traditional relational database with the "key" concept, where users can create indexes on some columns to speed up certain operations, and index data created for one table is saved in another table. Hive's indexing function is now relatively late and offers fewer options. However, the index is designed to be controlled using the built-in pluggable Java code that allows users to extend this functionality to suit their needs. Not that all queries will benefit from the Hive Index, of course. Users can use the explain syntax to analyze whether the HIVEQL statement can use indexes to improve the performance of user queries. Like an index in an RDBMS, it is necessary to evaluate whether index creation is reasonable, after all, the index requires more disk space and there is a cost to creating a maintenance index. The user has to weigh the benefits and costs of getting the index.

Let's talk about how to create an index:

1. Create a table first:

[Java]View PlainCopy
    1. Hive> CREATE TABLE User (id int, name string)
    2. > ROW FORMAT Delimited
    3. > Fields TERMINATED by ' \ t '
    4. > STORED as Textfile;


2. Import data:

[Java]View PlainCopy
    1. hive> Load Data local inpath '/export1/tmp/wyp/row.txt '
    2. > overwrite into table user;


3. Test before creating an index:

[Java]View PlainCopy
  1. Hive> SELECT * from user where id =500000;
  2. Total MapReduce jobs = 1
  3. Launching Job 1 out of 1
  4. Number of reduce tasks is set to 0 since there ' s no reduce operator
  5. Cannot run job locally:input Size (= 356888890) is larger than
  6. Hive.exec.mode.local.auto.inputbytes.max (= 134217728)
  7. Starting Job = job_1384246387966_0247, Tracking URL =
  8. http://l-datalogm1.data.cn1:9981/proxy/application_1384246387966_0247/
  9. Kill Command=/home/q/hadoop/bin/hadoop Job-kill job_1384246387966_0247
  10. Hadoop job information for stage-1:number of mappers:2; Number of reducers:0
  11. 2013-11- :336 stage-1 map = 0, reduce = 0% /c9>
  12. 2013-11- :stage-1 map=50%,reduce=0, Cumulative CPU 2.0 sec
  13. 2013-11- :xx,531 stage-1 map=100%,reduce=0, Cumulative CPU 5.63 sec
  14. 2013-11- :560 stage-1 map=100%,reduce=0, Cumulative CPU 5.63 sec /c10>
  15. MapReduce Total Cumulative CPU time: 5 seconds 630 msec
  16. Ended Job = job_1384246387966_0247
  17. MapReduce Jobs Launched:
  18. Job 0:map: 2 cumulative CPU: 5.63 sec
  19. HDFs Read: 361084006 hdfs Write: 357 SUCCESS
  20. Total MapReduce CPU time spent: 5 seconds 630 msec
  21. Ok
  22. 500000 Wyp.
  23. Time taken: 14.107 seconds, fetched: 1 row (s)

It took a total of 14.107s.

4. Create an index on user:

[Java]View PlainCopy
  1. Hive> CREATE index User_index on table user (ID)
  2. > as ' Org.apache.hadoop.hive.ql.index.compact.CompactIndexHandler '
  3. > with deferred rebuild
  4. > in TABLE user_index_table;
  5. Hive> ALTER index user_index on user rebuild;
  6. Hive> select * from user_index_table limit 5;
  7. 0 HDFs://mycluster/user/hive/warehouse/table02/000000_0 [0]
  8. 1 HDFs://mycluster/user/hive/warehouse/table02/000000_0 [352]
  9. 2 HDFs://mycluster/user/hive/warehouse/table02/000000_0 [704]
  10. 3 HDFs://mycluster/user/hive/warehouse/table02/000000_0 [1056]
  11. 4 HDFs://mycluster/user/hive/warehouse/table02/000000_0 [1408]
  12. Time taken: 0.244 seconds, fetched: 5 row (s)

This creates an index on the user table.

5. Test the user after the index is created:

[Java]View PlainCopy
  1. Hive> SELECT * from user where id =500000;
  2. Total MapReduce jobs = 1
  3. Launching Job 1 out of 1
  4. Number of reduce tasks is set to 0 since there ' s no reduce operator
  5. Cannot run job locally:input Size (= 356888890) is larger than
  6. Hive.exec.mode.local.auto.inputbytes.max (= 134217728)
  7. Starting Job = job_1384246387966_0247, Tracking URL =
  8. http://l-datalogm1.data.cn1:9981/proxy/application_1384246387966_0247/
  9. Kill Command=/home/q/hadoop/bin/hadoop Job-kill job_1384246387966_0247
  10. Hadoop job information for stage-1:number of mappers:2; Number of reducers:0
  11. 2013-11- :336 stage-1 map = 0, reduce = 0%
  12. 2013-11- :stage-1 map=50%,reduce=0, Cumulative CPU 2.0 sec
  13. 2013-11-: xx,253 stage-1 map=100%,reduce=0, Cumulative CPU 5.27 sec /c10>
  14. 2013-11-: 650 stage-1 map=100%,reduce=0, Cumulative CPU 5.27 sec
  15. MapReduce Total Cumulative CPU time: 5 seconds 630 msec
  16. Ended Job = job_1384246387966_0247
  17. MapReduce Jobs Launched:
  18. Job 0:map: 2 cumulative CPU: 5.63 sec
  19. HDFs Read: 361084006 hdfs Write: 357 SUCCESS
  20. Total MapReduce CPU time spent: 5 seconds 630 msec
  21. Ok
  22. 500000 Wyp.
  23. Time taken: 13.042 seconds, fetched: 1 row (s)

The time is 13.042s, which is not very different from the effect of no index.

There is still a bug in creating the index in hive: If the schema information from the table is from Serde,hive, the index cannot be created:

[Java]View PlainCopy
  1. hive> CREATE INDEX Employees_index
  2. > On TABLE Employees (country)
  3. > as ' Org.apache.hadoop.hive.ql.index.compact.CompactIndexHandler '
  4. > with DEFERRED REBUILD
  5. > idxproperties (' creator ' = ' me ',' created_at ' = ' some_time ')
  6. > in TABLE employees_index_table
  7. > COMMENT ' Employees indexed by country and name. ';
  8. Failed:error in metadata:java.lang.RuntimeException: \
  9. Check the index columns, they should appear in the table being indexed.
  10. Failed:execution Error, return code 1 from \
  11. Org.apache.hadoop.hive.ql.exec.DDLTask

This bug occurred in Hive0.10.0, 0.10.1, 0.11.0, in Hive0.12.0 has been repaired, see for details: https://issues.apache.org/jira/browse/HIVE-4251


Article from past memory: http://www.iteblog.com/archives/836

Hive CREATE Index

Contact Us

The content source of this page is from Internet, which doesn't represent Alibaba Cloud's opinion; products and services mentioned on that page don't have any relationship with Alibaba Cloud. If the content of the page makes you feel confusing, please write us an email, we will handle the problem within 5 days after receiving your email.

If you find any instances of plagiarism from the community, please send an email to: info-contact@alibabacloud.com and provide relevant evidence. A staff member will contact you within 5 working days.

A Free Trial That Lets You Build Big!

Start building with 50+ products and up to 12 months usage for Elastic Compute Service

  • Sales Support

    1 on 1 presale consultation

  • After-Sales Support

    24/7 Technical Support 6 Free Tickets per Quarter Faster Response

  • Alibaba Cloud offers highly flexible support services tailored to meet your exact needs.