Using Hive to read and write data in Elasticsearch

Original link: http://lxw1234.com/archives/2015/12/585.htm

Keywords: hive, elasticsearch, integration, consolidation

Elasticsearch already integrates with big data frameworks such as YARN, Hadoop, Hive, Pig, Spark, and Flume. This is especially useful when loading data, because index data can be added with distributed tasks, and it matters most on data platforms. Much of a platform's data is stored in Hive, so being able to work with Elasticsearch data directly from Hive is a great convenience for developers. This article records how Hive integrates with Elasticsearch, covering the configuration and usage for querying and adding data. It is based on Hive 0.13.1, hadoop-cdh5.0, and Elasticsearch 2.1.0.

Reading and analyzing data in Elasticsearch through Hive

Data already in Elasticsearch

_index: lxw1234
_type: tags
_id: user ID (cookieid)
Fields: area, media_view_tags, interest

Creating the Hive table

Because I use Elasticsearch 2.1.0, elasticsearch-hadoop-2.2.0 is required to support it. If your ES version is lower than 2.1.0, you can use elasticsearch-hadoop-2.1.2.

Download address: https://www.elastic.co/downloads/hadoop

-- Register the elasticsearch-hadoop Hive storage handler for this session.
add jar file:///home/liuxiaowen/elasticsearch-hadoop-2.2.0-beta1/dist/elasticsearch-hadoop-hive-2.2.0-beta1.jar;

-- External table backed by the ES resource lxw1234/tags.
CREATE EXTERNAL TABLE lxw1234_es_tags (
  cookieid string,
  area string,
  media_view_tags string,
  interest string
)
STORED BY 'org.elasticsearch.hadoop.hive.EsStorageHandler'
TBLPROPERTIES (
  'es.nodes' = '172.16.212.17:9200,172.16.212.102:9200',
  'es.index.auto.create' = 'false',
  'es.resource' = 'lxw1234/tags',
  'es.read.metadata' = 'true',
  'es.mapping.names' = 'cookieid:_metadata._id, area:area, media_view_tags:media_view_tags, interest:interest'
);

Note: Because the _id of lxw1234/tags in ES is the cookieid, you must use the following settings to map _id to a Hive table field:
'es.read.metadata' = 'true',
'es.mapping.names' = 'cookieid:_metadata._id,...'

Querying data in Hive

The data can be queried normally.
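
As a minimal sketch of such a query, assuming the lxw1234_es_tags table defined earlier and the elasticsearch-hadoop jar already added in the current session, something like the following could be run from the Hive prompt. The LIMIT value is arbitrary and chosen only for illustration.

```sql
-- Peek at a few documents from the ES-backed table.
-- cookieid is filled from the document _id through the es.mapping.names setting.
SELECT cookieid, area, media_view_tags, interest
FROM lxw1234_es_tags
LIMIT 10;
```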

Running SELECT COUNT(1) FROM lxw1234_es_tags; in Hive also executes as a MapReduce job, with one map task per shard.

You can restrict a table to filtered data by specifying a search condition in the Hive external table definition. For example, the following CREATE TABLE statement retrieves only the record with _id=98e5d2de059f1d563d8565 from ES:

-- Same mapping as lxw1234_es_tags, plus an es.query filter on _id.
CREATE EXTERNAL TABLE lxw1234_es_tags_2 (
  cookieid string,
  area string,
  media_view_tags string,
  interest string
)
STORED BY 'org.elasticsearch.hadoop.hive.EsStorageHandler'
TBLPROPERTIES (
  'es.nodes' = '172.16.212.17:9200,172.16.212.102:9200',
  'es.index.auto.create' = 'false',
  'es.resource' = 'lxw1234/tags',
  'es.read.metadata' = 'true',
  'es.mapping.names' = 'cookieid:_metadata._id, area:area, media_view_tags:media_view_tags, interest:interest',
  'es.query' = '?q=_id:98e5d2de059f1d563d8565'
);

hive> select * from lxw1234_es_tags_2;
OK
98e5d2de059f1d563d8565 Sichuan | Chengdu Shopping | | shopping | |
Time taken: 0.096 seconds, Fetched: 1 row(s)
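
Besides the URI form (?q=...), the es.query setting of elasticsearch-hadoop also accepts a query DSL body. As a rough sketch, the same _id filter might be expressed as shown here. The table name lxw1234_es_tags_3 is hypothetical, and the choice of an ids query is an assumption about an equivalent filter, not something taken from the original article.

```sql
-- Hypothetical table name; nodes, resource, and mapping are copied from the tables above.
-- Assumption: an ids query selects the same single document as ?q=_id:...
CREATE EXTERNAL TABLE lxw1234_es_tags_3 (
  cookieid string,
  area string,
  media_view_tags string,
  interest string
)
STORED BY 'org.elasticsearch.hadoop.hive.EsStorageHandler'
TBLPROPERTIES (
  'es.nodes' = '172.16.212.17:9200,172.16.212.102:9200',
  'es.index.auto.create' = 'false',
  'es.resource' = 'lxw1234/tags',
  'es.read.metadata' = 'true',
  'es.mapping.names' = 'cookieid:_metadata._id, area:area, media_view_tags:media_view_tags, interest:interest',
  'es.query' = '{"query": {"ids": {"values": ["98e5d2de059f1d563d8565"]}}}'
);
```

The DSL form is mainly worth the extra quoting when the condition is more involved than a single field match. For a simple lookup the URI form is shorter.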

If the amount of data is small, you can run the query in Hive's local mode instead of submitting it to the Hadoop cluster.

Set the following in Hive:

-- Let Hive pick local mode automatically for small inputs.
set hive.exec.mode.local.auto.inputbytes.max=134217728;
set hive.exec.mode.local.auto.tasks.max=10;
set hive.exec.mode.local.auto=true;
set fs.defaultFS=file:///;

hive> select area, count(1) as cnt from lxw1234_es_tags group by area order by cnt desc limit 20;
Automatically selecting local only mode for query
Total jobs = 2
Launching Job 1 out of 2
.....
Execution log at: /tmp/liuxiaowen/liuxiaowen_20151211133030_97b50138-d55d-4a39-bc8e-cbdf09e33ee6.log
Job running in-process (local Hadoop)
Hadoop job information for null: number of mappers
