Original link: http://lxw1234.com/archives/2015/12/585.htm
Keywords: hive, elasticsearch, integration, consolidation
Elasticsearch already integrates with big data frameworks such as YARN, Hadoop, Hive, Pig, Spark, and Flume. This matters most on data platforms, where index data is typically loaded by distributed jobs. Much of that data lives in Hive, so being able to work with Elasticsearch data directly from Hive is a great convenience for developers. This post records how to integrate Hive with Elasticsearch, covering the configuration and usage for both querying and adding data. It is based on Hive 0.13.1, hadoop-cdh5.0, and Elasticsearch 2.1.0.
Reading and analyzing Elasticsearch data through Hive
Data already in Elasticsearch:
_index: lxw1234
_type: tags
_id: user ID (cookieid)
Fields: area, media_view_tags, interest
Creating the Hive table
Because I am running Elasticsearch 2.1.0, elasticsearch-hadoop-2.2.0 is required to support it. If your ES version is lower than 2.1.0, you can use elasticsearch-hadoop-2.1.2.
Download: https://www.elastic.co/downloads/hadoop
ADD JAR file:///home/liuxiaowen/elasticsearch-hadoop-2.2.0-beta1/dist/elasticsearch-hadoop-hive-2.2.0-beta1.jar;

CREATE EXTERNAL TABLE lxw1234_es_tags (
  cookieid string,
  area string,
  media_view_tags string,
  interest string
)
STORED BY 'org.elasticsearch.hadoop.hive.EsStorageHandler'
TBLPROPERTIES (
  'es.nodes' = '172.16.212.17:9200,172.16.212.102:9200',
  'es.index.auto.create' = 'false',
  'es.resource' = 'lxw1234/tags',
  'es.read.metadata' = 'true',
  'es.mapping.names' = 'cookieid:_metadata._id, area:area, media_view_tags:media_view_tags, interest:interest'
);
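The es.mapping.names value in the statement above is a comma-separated list of hive_column:es_field pairs, with the ID column pointed at _metadata._id. When a table has many columns, it can be easier to generate that string than to type it. The following is a minimal Python sketch under that assumption; the helper name build_mapping_names is hypothetical and is not part of Hive or elasticsearch-hadoop.

```python
def build_mapping_names(columns, id_column):
    """Build an es.mapping.names value of the form 'hive_col:es_field, ...'.

    Assumption: every Hive column maps to an ES field of the same name,
    except id_column, which maps to the document metadata field _metadata._id.
    """
    pairs = []
    for col in columns:
        target = "_metadata._id" if col == id_column else col
        pairs.append(f"{col}:{target}")
    return ", ".join(pairs)


columns = ["cookieid", "area", "media_view_tags", "interest"]
print(build_mapping_names(columns, "cookieid"))
# prints: cookieid:_metadata._id, area:area, media_view_tags:media_view_tags, interest:interest
```

The printed string can be pasted as the value of 'es.mapping.names' in TBLPROPERTIES.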
Note: because the _id of lxw1234/tags in ES is the cookieid, you must map _id to a Hive table column with these two properties:
'es.read.metadata' = 'true',
'es.mapping.names' = 'cookieid:_metadata._id,...'
Querying data in Hive
The data can be queried normally.
Running select count(1) from lxw1234_es_tags; in Hive also executes through MapReduce, with one map task per shard.
You can restrict a table to filtered data by specifying a search condition in the Hive external table definition. For example, the following CREATE TABLE statement fetches only the record with _id=98e5d2de059f1d563d8565 from ES:
CREATE EXTERNAL TABLE lxw1234_es_tags_2 (
  cookieid string,
  area string,
  media_view_tags string,
  interest string
)
STORED BY 'org.elasticsearch.hadoop.hive.EsStorageHandler'
TBLPROPERTIES (
  'es.nodes' = '172.16.212.17:9200,172.16.212.102:9200',
  'es.index.auto.create' = 'false',
  'es.resource' = 'lxw1234/tags',
  'es.read.metadata' = 'true',
  'es.mapping.names' = 'cookieid:_metadata._id, area:area, media_view_tags:media_view_tags, interest:interest',
  'es.query' = '?q=_id:98e5d2de059f1d563d8565'
);

hive> select * from lxw1234_es_tags_2;
OK
98e5d2de059f1d563d8565 Sichuan | Chengdu Shopping | | shopping | |
Time taken: 0.096 seconds, Fetched: 1 row(s)
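The 'es.query' value above uses the URI form, which starts with "?". As far as I know, elasticsearch-hadoop also accepts a query DSL document for es.query, which must start with "{" and end with "}". The Python sketch below only builds the two string forms for the same _id lookup so they can be compared. The helper names uri_query and ids_query are hypothetical, and the ids query is my choice of DSL equivalent, not something taken from the original post.

```python
import json


def uri_query(field, value):
    """Return an es.query value in URI form, e.g. '?q=_id:abc'.

    Assumption: value contains no characters that need URL or
    query-string escaping.
    """
    return f"?q={field}:{value}"


def ids_query(doc_ids):
    """Return an es.query value in query DSL form (a JSON string)
    that matches documents by their _id."""
    return json.dumps({"query": {"ids": {"values": list(doc_ids)}}})


cookie_id = "98e5d2de059f1d563d8565"
print(uri_query("_id", cookie_id))
# prints: ?q=_id:98e5d2de059f1d563d8565
print(ids_query([cookie_id]))
# prints: {"query": {"ids": {"values": ["98e5d2de059f1d563d8565"]}}}
```

Either string would go between the single quotes of 'es.query' in TBLPROPERTIES. The DSL form is the one to reach for when the filter is more complex than a single field match.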
If the amount of data is small, you can run the query in Hive's local mode instead of submitting it to the Hadoop cluster:
Set the following in Hive:
set hive.exec.mode.local.auto.inputbytes.max=134217728;
set hive.exec.mode.local.auto.tasks.max=10;
set hive.exec.mode.local.auto=true;
set fs.defaultFS=file:///;

hive> select area, count(1) as cnt from lxw1234_es_tags group by area order by cnt desc limit 20;
Automatically selecting local only mode for query
Total jobs = 2
Launching Job 1 out of 2
.....
Execution log at: /tmp/liuxiaowen/liuxiaowen_20151211133030_97b50138-d55d-4a39-bc8e-cbdf09e33ee6.log
Job running in-process (local Hadoop)
Hadoop job information for null: number of mappers