How do I integrate Hive with Solr?

Source: Internet
Author: User
Tags: solr, query, hadoop ecosystem

(a) Hive + Solr overview

As the offline data warehouse of the Hadoop ecosystem, Hive makes it easy to analyze huge volumes of historical data offline with SQL and to act on the results, for example for report statistics and queries.
Solr, as a high-performance search server, provides fast, powerful full-text retrieval.

(b) Why integrate Hive with Solr?

Sometimes we need to store the results of a Hive analysis in Solr as a full-text search service. For example, in one of our businesses, the search logs of our e-commerce website are analyzed with Hive and then stored in Solr for report queries. Because search keywords are involved, this field must support both tokenized (word-segmented) and untokenized queries; with tokenized queries we can chart the trend of related products over a given period. Conversely, we sometimes need to load the data in Solr into Hive and use SQL to do join analysis; the two complement each other to better fit our business needs. There are open-source projects on the web that integrate Hive with Solr, but they target older versions and do not run on current releases; after some adaptation, the project described here runs on the latest versions.

(c) How do I integrate Hive with Solr?

The so-called integration really comes down to writing components against Hadoop's MapReduce programming interface. As we all know, the MapReduce interfaces are flexible and highly abstracted: MapReduce can load data not only from HDFS but from any non-HDFS system, as long as we customize the following components:
InputFormat
OutputFormat
RecordReader
RecordWriter
InputSplit
It is a bit of a hassle, but it makes loading data from almost anywhere feasible, including MySQL, SQL Server, Oracle, MongoDB, Solr, Elasticsearch, Redis, and so on.
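As a rough, self-contained sketch of the idea (a toy stand-in, not the real Hadoop RecordReader interface), a reader that pulls records from a non-HDFS source might look like this; here an in-memory list simulates a page of documents fetched from a remote source such as Solr:

```java
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;

// Toy stand-in for a Hadoop RecordReader: it iterates records from an
// arbitrary (non-HDFS) source -- here just an in-memory list that
// simulates documents fetched from a remote search server.
public class ToyRecordReader {
    private final Iterator<String> source;
    private String currentValue;
    private long key = -1;

    public ToyRecordReader(List<String> docs) {
        this.source = docs.iterator();
    }

    // Mirrors the spirit of RecordReader.nextKeyValue(): advance to the next record.
    public boolean nextKeyValue() {
        if (!source.hasNext()) {
            return false;
        }
        currentValue = source.next();
        key++;
        return true;
    }

    public long getCurrentKey()     { return key; }
    public String getCurrentValue() { return currentValue; }

    public static void main(String[] args) {
        // Simulated "documents" pulled from a remote source.
        ToyRecordReader reader = new ToyRecordReader(
                Arrays.asList("doc1,foo", "doc2,bar"));
        while (reader.nextKeyValue()) {
            System.out.println(reader.getCurrentKey() + " " + reader.getCurrentValue());
        }
    }
}
```

The real components additionally handle splits, progress reporting, and Hadoop's Writable types, but the iteration pattern is the same.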

That covers the custom Hadoop MapReduce side. In addition to the components above, we also need to define a SerDe component and assemble a StorageHandler. In Hive, SerDe stands for Serializer and Deserializer, i.e. serialization and deserialization; Hive uses the SerDe together with the FileFormat to read and write the rows of a Hive table.
The read path:
HDFS files / any source --> InputFileFormat --> <key, value> --> Deserializer --> Row object
The write path:
Row object --> Serializer --> <key, value> --> OutputFileFormat --> HDFS files / any source
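As a toy illustration of those two paths (not the actual Hive SerDe API), assuming a comma-delimited row format:

```java
import java.util.Arrays;
import java.util.List;

// Toy illustration of the SerDe flow above (not the real Hive SerDe API):
// deserialize turns one line of a delimited file into a row object,
// serialize turns a row object back into a line.
public class ToySerDe {
    private static final String DELIMITER = ",";

    // Read path: file line -> row object
    public static List<String> deserialize(String line) {
        return Arrays.asList(line.split(DELIMITER, -1));
    }

    // Write path: row object -> file line
    public static String serialize(List<String> row) {
        return String.join(DELIMITER, row);
    }

    public static void main(String[] args) {
        String line = "rowkey1,solr";
        List<String> row = deserialize(line);
        System.out.println(row.get(0) + " | " + row.get(1));
        System.out.println(serialize(row));
    }
}
```

A real Hive SerDe does the same job but initializes from table properties and exposes an ObjectInspector so Hive can interpret the row's columns and types.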

(d) What can I do after integrating Hive with Solr?

(1) Read Solr data into Hive and use Hive's SQL syntax to perform aggregations, statistics, analyses, joins, and so on.
(2) Generate a Solr index: with a single SQL statement, large-scale data can be indexed as a MapReduce job.

(e) How to install, deploy, and use it?
The source code is not pasted here; it has been uploaded to GitHub. If you need it, run git clone https://github.com/qindongliang/hive-solr, tweak the pom file slightly, and then execute
mvn clean package
to build the jar, then copy the jar into Hive's lib directory.

Examples are as follows:
(1) Hive reads Solr data

Create the table:

SQL code

--drop the table if it already exists
drop table if exists solr;
--create an external table
create external table solr (
--define the fields; they must match Solr's fields
rowkey string,
sname string
)
--specify the StorageHandler
stored by "com.easy.hive.store.SolrStorageHandler"
--configure the Solr properties
tblproperties ('solr.url' = 'http://192.168.1.28:8983/solr/a',
'solr.query' = '*:*',
'solr.cursor.batch.size' = '10000',
'solr.primary_key' = 'rowkey'
);

Run bin/hive to enter the Hive command-line terminal:

SQL code

--query all data
select * from solr limit 5;
--query a specific field
select rowkey from solr;
--aggregate Solr data as a MapReduce job
select sname, count(*) as c from solr group by sname order by c desc;



(2) Building a Solr index from Hive

First create the data source table:

SQL code

--drop the table if it already exists
drop table if exists index_source;
--create the data table
create table index_source (id string, yname string, sname string) row format delimited fields terminated by ',' stored as textfile;
--load local data into the source table
load data local inpath '/root/server/hive/test_solr' into table index_source;

Next, create the Solr association table:

SQL code

--drop the table if it already exists
drop table if exists index_solr;
--create the associated Solr table
create external table index_solr (
id string,
yname string,
sname string
)
--specify the storage engine
stored by "com.easy.hive.store.SolrStorageHandler"
--set the Solr service properties
tblproperties ('solr.url' = 'http://192.168.1.28:8983/solr/b',
'solr.query' = '*:*',
'solr.cursor.batch.size' = '10000',
'solr.primary_key' = 'id'
);



Finally, execute the following SQL to build the Solr index from the data in the source table:

SQL code

--register the hive-solr jar, otherwise the MR job will not start properly
add jar /root/server/hive/lib/hive-solr.jar;
--execute the insert command
insert overwrite table index_solr select * from index_source;
--after it succeeds, the result can be viewed in the Solr admin UI, or queried from Hive as follows
select * from index_solr limit 10;



(f) Can they be integrated with other frameworks?

Of course. As open-source, standalone frameworks, they can be combined in many ways: Hive can be integrated with Elasticsearch or MongoDB, and Solr can be integrated with Spark or Pig. All of these require us to customize the relevant components; the idea is roughly the same as in this project.

(g) Base environment for this test

Apache Hadoop 2.7.1
Apache Hive 1.2.1
Apache Solr 5.1.0

(h) Thanks and references:

https://github.com/mongodb/mongo-hadoop/tree/master/hive/src/main/java/com/mongodb/hadoop/hive
https://github.com/lucidworks/hive-solr
https://github.com/chimpler/hive-solr
https://cwiki.apache.org/confluence/display/Hive/DeveloperGuide#DeveloperGuide-HowtoWriteYourOwnSerDe
