How do I integrate Hive with Solr?

Source: Internet
Author: User
Tags: solr, query, hadoop ecosystem

(a) Hive + Solr overview

As the offline data warehouse of the Hadoop ecosystem, Hive makes it easy to analyze huge volumes of historical data offline with SQL and to act on the results, for example for report statistics and queries.
Solr, as a high-performance search server, provides fast, powerful full-text retrieval.

(b) Why integrate Hive with Solr?

Sometimes we need to store the results of a Hive analysis in Solr as a full-text search service. For example, in one of our businesses, the search logs of our e-commerce website are analyzed with Hive and then stored in Solr for report queries. Because search keywords are involved, this field must support both tokenized (word-segmented) and untokenized queries; with tokenized queries we can chart the trend of related products over a given period. Conversely, we sometimes need to load the data in Solr into Hive and use SQL to do join analysis; the two complement each other to better fit our business needs. There are open-source projects on the web that integrate Hive with Solr, but they target older versions and do not run on current releases; after some adaptation, the project described here runs on the latest versions.

(c) How do I integrate Hive with Solr?

The so-called integration really comes down to writing components against Hadoop's MapReduce programming interface. As we all know, the MapReduce interfaces are flexible and highly abstracted: MapReduce can load data not only from HDFS but from any non-HDFS system, as long as we customize the following components:
InputFormat
OutputFormat
RecordReader
RecordWriter
InputSplit
It is a bit of a hassle, but it makes loading data from almost anywhere feasible, including MySQL, SQL Server, Oracle, MongoDB, Solr, Elasticsearch, Redis, and so on.
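As a rough, self-contained sketch of the idea (a toy stand-in, not the real Hadoop RecordReader interface), a reader that pulls records from a non-HDFS source might look like this; here an in-memory list simulates a page of documents fetched from a remote source such as Solr:

```java
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;

// Toy stand-in for a Hadoop RecordReader: it iterates records from an
// arbitrary (non-HDFS) source -- here just an in-memory list that
// simulates documents fetched from a remote search server.
public class ToyRecordReader {
    private final Iterator<String> source;
    private String currentValue;
    private long key = -1;

    public ToyRecordReader(List<String> docs) {
        this.source = docs.iterator();
    }

    // Mirrors the spirit of RecordReader.nextKeyValue(): advance to the next record.
    public boolean nextKeyValue() {
        if (!source.hasNext()) {
            return false;
        }
        currentValue = source.next();
        key++;
        return true;
    }

    public long getCurrentKey()     { return key; }
    public String getCurrentValue() { return currentValue; }

    public static void main(String[] args) {
        // Simulated "documents" pulled from a remote source.
        ToyRecordReader reader = new ToyRecordReader(
                Arrays.asList("doc1,foo", "doc2,bar"));
        while (reader.nextKeyValue()) {
            System.out.println(reader.getCurrentKey() + " " + reader.getCurrentValue());
        }
    }
}
```

The real components additionally handle splits, progress reporting, and Hadoop's Writable types, but the iteration pattern is the same.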

That covers the custom Hadoop MapReduce side. In addition to the components above, we also need to define a SerDe component and assemble a StorageHandler. In Hive, SerDe stands for Serializer and Deserializer, i.e. serialization and deserialization; Hive uses the SerDe together with the FileFormat to read and write the rows of a Hive table.
The read path:
HDFS files / any source --> InputFileFormat --> <key, value> --> Deserializer --> Row object
The write path:
Row object --> Serializer --> <key, value> --> OutputFileFormat --> HDFS files / any source
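As a toy illustration of those two paths (not the actual Hive SerDe API), assuming a comma-delimited row format:

```java
import java.util.Arrays;
import java.util.List;

// Toy illustration of the SerDe flow above (not the real Hive SerDe API):
// deserialize turns one line of a delimited file into a row object,
// serialize turns a row object back into a line.
public class ToySerDe {
    private static final String DELIMITER = ",";

    // Read path: file line -> row object
    public static List<String> deserialize(String line) {
        return Arrays.asList(line.split(DELIMITER, -1));
    }

    // Write path: row object -> file line
    public static String serialize(List<String> row) {
        return String.join(DELIMITER, row);
    }

    public static void main(String[] args) {
        String line = "rowkey1,solr";
        List<String> row = deserialize(line);
        System.out.println(row.get(0) + " | " + row.get(1));
        System.out.println(serialize(row));
    }
}
```

A real Hive SerDe does the same job but initializes from table properties and exposes an ObjectInspector so Hive can interpret the row's columns and types.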

(d) What can I do after integrating Hive with Solr?

(1) Read Solr data into Hive and use Hive's SQL syntax to perform aggregations, statistics, analyses, joins, and so on.
(2) Generate a Solr index: with a single SQL statement, large-scale data can be indexed as a MapReduce job.

(e) How to install, deploy, and use it?
The source code is not pasted here; it has been uploaded to GitHub. If you need it, run git clone https://github.com/qindongliang/hive-solr, tweak the pom file slightly, and then execute
mvn clean package
to build the jar, then copy the jar into Hive's lib directory.

Examples are as follows:
(1) Hive reads Solr data

Create the table:

SQL code

--drop the table if it already exists
drop table if exists solr;
--create an external table
create external table solr (
--define the fields; they must match Solr's fields
rowkey string,
sname string
)
--specify the StorageHandler
stored by "com.easy.hive.store.SolrStorageHandler"
--configure the Solr properties
tblproperties ('solr.url' = 'http://192.168.1.28:8983/solr/a',
'solr.query' = '*:*',
'solr.cursor.batch.size' = '10000',
'solr.primary_key' = 'rowkey'
);

Run bin/hive to enter the Hive command-line terminal:

SQL code

--query all data
select * from solr limit 5;
--query a specific field
select rowkey from solr;
--aggregate Solr data as a MapReduce job
select sname, count(*) as c from solr group by sname order by c desc;



(2) Building a Solr index from Hive

First create the data source table:

SQL code

--drop the table if it already exists
drop table if exists index_source;
--create the data table
create table index_source (id string, yname string, sname string) row format delimited fields terminated by ',' stored as textfile;
--load local data into the source table
load data local inpath '/root/server/hive/test_solr' into table index_source;

Next, create the Solr association table:

SQL code

--drop the table if it already exists
drop table if exists index_solr;
--create the associated Solr table
create external table index_solr (
id string,
yname string,
sname string
)
--specify the storage engine
stored by "com.easy.hive.store.SolrStorageHandler"
--set the Solr service properties
tblproperties ('solr.url' = 'http://192.168.1.28:8983/solr/b',
'solr.query' = '*:*',
'solr.cursor.batch.size' = '10000',
'solr.primary_key' = 'id'
);



Finally, execute the following SQL to build the Solr index from the data in the source table:

SQL code

--register the hive-solr jar, otherwise the MR job will not start properly
add jar /root/server/hive/lib/hive-solr.jar;
--execute the insert command
insert overwrite table index_solr select * from index_source;
--after it succeeds, the result can be viewed in the Solr admin UI, or queried from Hive as follows
select * from index_solr limit 10;



(f) Can they be integrated with other frameworks?

Of course. As open-source, standalone frameworks, they can be combined in many ways: Hive can be integrated with Elasticsearch or MongoDB, and Solr can be integrated with Spark or Pig. All of these require us to customize the relevant components; the idea is roughly the same as in this project.

(g) Base environment for this test

Apache Hadoop 2.7.1
Apache Hive 1.2.1
Apache Solr 5.1.0

(h) Thanks and references:

https://github.com/mongodb/mongo-hadoop/tree/master/hive/src/main/java/com/mongodb/hadoop/hive
https://github.com/lucidworks/hive-solr
https://github.com/chimpler/hive-solr
https://cwiki.apache.org/confluence/display/Hive/DeveloperGuide#DeveloperGuide-HowtoWriteYourOwnSerDe
