Building HBase Secondary Indexes Using SOLR
@ (hbase) [HBase, SOLR]
Building an HBase secondary index using SOLR
Contents:
I. Overview
  (i) Business scenario description
  (ii) Technical solutions: 1. Technical solution one; 2. Technical solution two; 3. Recommendations on indexing
II. Using hbase-indexer to build an HBase secondary index
  (i) Installation environment preparation
  (ii) Configuring SOLR
  (iii) Configuring hbase-solr
  (iv) Testing: 1. Create a test table in HBase; 2. Add a configuration file mapping SOLR fields to HBase qualifiers; 3. Add the indexer; 4. Search
  (v) Using SolrJ to build an HBase secondary index (see another article)
III. Using coprocessors to build HBase secondary indexes (see another article)
I. Overview
Under normal circumstances, HBase suits scenarios where a value is fetched by rowkey. The reverse requirement, finding which rowkeys satisfy a condition on the value (as in an RDBMS query), is not an area HBase specializes in, because it forces a full table scan.
(i) Business scenario description
Some of our data records user behaviors, such as whether a user has played a game, bought a book, or watched a video. The following queries are needed:
(1) Given a user plus a tag, check whether the user carries that tag. The tag might be, for example, a video name.
(2) Given a user, retrieve all of that user's tags.
(3) Given tag conditions, find the users who satisfy them.
The first two scenarios require fast, real-time queries; the last one can be answered offline.
This is a very typical application: the first two scenarios usually serve product recommendations, while the third searches for users matching certain criteria in order to target advertising.
In other words:
(1) For recommendations, the user ID is known, so the user's information is fetched and the recommendation made; since the rowkey is already determined, the HBase get operation is very efficient.
(2) For ad serving, HBase is searched by value under certain conditions to obtain the qualifying rowkeys. If not all of the information is indexed, the complete user information must then be retrieved by rowkey.
(ii) Technical solutions
1. Technical solution one
Design: Scenario 1 is the most common one (checking whether a user carries a tag), so use user ID + tag as the rowkey, which gives the fastest query. For scenario 2, all tags of a user can be fetched with Scan#setRowPrefixFilter(userId). For scenario 3, the HBase data is indexed into SOLR, which is then searched for the users matching the tag conditions.
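To make the layout concrete, here is a minimal standalone sketch of the rowkey design in solution one. It is plain Java with no HBase dependency: the separator character and all names are illustrative assumptions, and a sorted set stands in for HBase's sorted rowkey space.

```java
import java.util.*;

public class Scheme1RowkeyDemo {
    // Illustrative separator between user ID and tag; a real design must
    // pick a byte that cannot appear inside user IDs.
    static final char SEP = '\u0001';

    // Rowkey = userId + SEP + tag, so a (user, tag) lookup is a point get.
    static String rowkey(String userId, String tag) {
        return userId + SEP + tag;
    }

    // Scenario 2: all tags of a user are the rowkeys starting with
    // userId + SEP, which in HBase is Scan#setRowPrefixFilter(prefix).
    static List<String> tagsOf(String userId, NavigableSet<String> allRowkeys) {
        String prefix = userId + SEP;
        List<String> tags = new ArrayList<>();
        for (String rk : allRowkeys.tailSet(prefix)) {
            if (!rk.startsWith(prefix)) break;   // left the prefix range
            tags.add(rk.substring(prefix.length()));
        }
        return tags;
    }

    public static void main(String[] args) {
        NavigableSet<String> rows = new TreeSet<>();
        rows.add(rowkey("user1", "bought-book"));
        rows.add(rowkey("user1", "played-game"));
        rows.add(rowkey("user2", "watched-video"));

        // Scenario 1: existence check is a point lookup on one rowkey.
        System.out.println(rows.contains(rowkey("user1", "played-game"))); // true
        // Scenario 2: prefix scan yields all of user1's tags.
        System.out.println(tagsOf("user1", rows)); // [bought-book, played-game]
    }
}
```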
Issue: to get content from HBase into the index, the tag name must also be stored as the value of the qualifier, which makes the information redundant.
2. Technical solution two
Design: alternatively, use the user ID alone as the rowkey, with each tag as a qualifier. Scenarios 1 and 2 then both reduce to a direct get; scenario 3 again goes through the SOLR index.
Problems: a user may carry a great many tags, possibly millions or tens of millions, so a single row could exceed the maximum region size (10 GB by default), although this is unlikely. Also, even for scenario 1 the entire row must be read before selecting part of its content, which pulls a lot of useless data into memory and easily triggers GC. Finally, with tags as qualifiers the qualifier names are not known in advance, while the stock SOLR-HBase integration can only map a fixed cf:qualifier to a SOLR field (indexing an entire column family is possible, but requires developing your own module, which takes more manpower). Update: this last point is no longer a problem; the entire column family can now be indexed.
Recommendation: if the number of tag categories is small, say 1,000 or fewer, this design works and none of the three problems above arises. This is also the more typical user-profile application, and it is advisable to use classification, clustering, and similar algorithms to keep the number of tags under control. Many applications such as recommendation and ad serving use a similar architecture. The drawback is that detail is lost once information is consolidated, but the raw data can be kept in HDFS so that valid information can be re-extracted at any time.
3. Recommendations on indexing
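For contrast with solution one, here is the same kind of standalone sketch for solution two: one row per user, one qualifier per tag. Again this is plain Java with no HBase dependency; all names are illustrative, and the nested map stands in for an HBase row, where scenario 1 would narrow the read with Get#addColumn and scenario 2 would be a plain Get on the row.

```java
import java.util.*;

public class Scheme2RowDemo {
    // Solution two layout, modeled as userId -> (tag -> value).
    static final Map<String, NavigableMap<String, String>> table = new HashMap<>();

    static void addTag(String userId, String tag, String value) {
        table.computeIfAbsent(userId, k -> new TreeMap<>()).put(tag, value);
    }

    // Scenario 1: check one tag of one user; only one qualifier is needed.
    static boolean hasTag(String userId, String tag) {
        NavigableMap<String, String> row = table.get(userId);
        return row != null && row.containsKey(tag);
    }

    // Scenario 2: all tags of a user come from a single row.
    static Set<String> tagsOf(String userId) {
        NavigableMap<String, String> row = table.get(userId);
        return row == null ? Collections.<String>emptySet() : row.keySet();
    }
}
```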
(1) (Strongly not recommended) Querying HBase directly, with or without a filter, requires a full table scan. Even run as offline analysis, it forces HBase to load the whole table into memory, causing frequent GC and excessive cluster load.
(2) (Worth testing; no need to introduce SOLR, but performance to be determined) Another approach is to invert the HBase table and index it back into HBase itself, with the tag as the rowkey and the user IDs as values. That table can then serve scenario 3 exclusively.
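As a sketch of this inverted-index idea, the following plain-Java snippet builds the tag-to-users mapping that would be written back into the index table. All names are illustrative, and a TreeMap stands in for HBase's sorted rowkeys.

```java
import java.util.*;

public class TagIndexDemo {
    // Invert userId -> tags into tag -> userIds. In HBase, each tag would
    // become a rowkey in the index table and each user ID a value under it.
    static Map<String, Set<String>> invert(Map<String, Set<String>> userTags) {
        Map<String, Set<String>> tagUsers = new TreeMap<>();
        for (Map.Entry<String, Set<String>> e : userTags.entrySet()) {
            for (String tag : e.getValue()) {
                tagUsers.computeIfAbsent(tag, k -> new TreeSet<>()).add(e.getKey());
            }
        }
        return tagUsers;
    }
}
```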
(3) (Worth testing; highest degree of freedom and most flexible, but a large workload) Use a coprocessor to index the data into SOLR.
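The following plain-Java sketch models what such a coprocessor does conceptually: a postPut-style hook that mirrors every written cell into an external index, which scenario 3 then queries. It has no HBase or SOLR dependency; all names are illustrative stand-ins, not the actual coprocessor API.

```java
import java.util.*;

public class ObserverIndexDemo {
    // Stand-in for the external SOLR index: qualifier -> value -> rowkeys.
    static final Map<String, Map<String, Set<String>>> index = new HashMap<>();

    // Models a RegionObserver-style postPut hook: after the cell is written
    // to HBase, mirror (qualifier, value) -> rowkey into the index.
    static void postPut(String rowkey, String qualifier, String value) {
        index.computeIfAbsent(qualifier, k -> new HashMap<>())
             .computeIfAbsent(value, k -> new TreeSet<>())
             .add(rowkey);
    }

    // Scenario 3: which rowkeys have qualifier == value?
    static Set<String> search(String qualifier, String value) {
        return index.getOrDefault(qualifier, Collections.<String, Set<String>>emptyMap())
                    .getOrDefault(value, Collections.<String>emptySet());
    }
}
```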
(4) (Worth testing) Use an open-source tool such as hbase-indexer to index the data into SOLR.
CDH ships an hbase-solr module based on the open-source project hbase-indexer. One concern is that hbase-indexer itself targets HBase 0.94 and 0.98, and it is unclear whether CDH has adapted it or simply not documented doing so. Since it is released together with CDH 5.6, however, there should be no compatibility problems.
II. Using hbase-indexer to build an HBase secondary index
The following describes how to use hbase-indexer to synchronize data from HBase into SOLR.
The basic principle is that write operations are replayed into SOLR, driven by HBase's replication log.
(i) Installation environment preparation
This article deploys into an existing cluster with JDK 1.7; the HBase cluster version is hbase-1.0.0-cdh5.6.0.
The software required for the SOLR and HBase integration:
1. solr-4.10.3-cdh5.6.0
2. hbase-solr-1.5-cdh5.6.0
(ii) Configuring SOLR
Unpack SOLR, cd into the example directory, and start SOLR on each node in SolrCloud mode:
java -Dbootstrap_confdir=./solr/collection1/conf -Dcollection.configName=myconf -DzkHost={zk_host}/solr -jar start.jar
The command above also uploads the configuration files under solr/collection1/conf to ZooKeeper, for the other nodes in the cluster to read and use.
The conf directory holds the collection1 configuration; two files may need modification:
①schema.xml file
②solrconfig.xml file
In schema.xml you can define your own fields, fieldTypes, tokenizers/analyzers, and copyFields.
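For example, declarations for the field names used later in this article might look like the following in schema.xml (illustrative values; note that the stock Solr 4 schema already ships dynamicField rules such as *_s and *_i that cover these names):

```xml
<!-- Illustrative only; names and types must match your data -->
<field name="firstname_s" type="string" indexed="true" stored="true"/>
<field name="lastname_s" type="string" indexed="true" stored="true"/>
<field name="age_i" type="int" indexed="true" stored="true"/>
```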
solrconfig.xml holds common index settings, such as the soft and hard commit intervals.
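For instance, a typical commit configuration in solrconfig.xml looks like this (the maxTime values are illustrative and should be tuned to your latency and durability needs):

```xml
<!-- Hard commit: flush the index to disk, but do not open a new searcher -->
<autoCommit>
  <maxTime>60000</maxTime>
  <openSearcher>false</openSearcher>
</autoCommit>
<!-- Soft commit: make newly indexed documents visible to searches -->
<autoSoftCommit>
  <maxTime>1000</maxTime>
</autoSoftCommit>
```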
Once started, you can access http://{your_host}:8983/solr/ and reach the SOLR query page.
(iii) Configuring hbase-solr
After extracting hbase-solr, some configuration is needed before use, otherwise startup fails.
(1) Modify conf/hbase-indexer-site.xml and add the following:
<property>
  <name>hbaseindexer.zookeeper.connectstring</name>
  <value>zookeeperhost</value>
</property>
<property>
  <name>hbase.zookeeper.quorum</name>
  <value>zookeeperhost</value>
</property>
(2) Enable HBase's replication feature by adding the following to HBase's hbase-site.xml:
<!-- added by hwyn3816 on 2016/07/04 -->
<property>
  <name>hbase.replication</name>
  <value>true</value>
  <description>SEP is basically replication, so enable it</description>
</property>
<property>
  <name>replication.source.ratio</name>
  <value>1.0</value>
  <description>Source ratio of 100% makes sure that all SEP consumers are actually used (otherwise, some can sit idle, especially with small clusters)</description>
</property>
<property>
  <name>replication.source.nb.capacity</name>
  <value>1000</value>
  <description>Maximum number of HLog entries to replicate in one go. If this is too large and a consumer takes a while to process the events, the HBase RPC call will time out.</description>
</property>
<property>
  <name>replication.replicationsource.implementation</name>
  <value>com.ngdata.sep.impl.SepReplicationSource</value>
  <description>A custom replication source that fixes a few things and adds some functionality (doesn't interfere with normal replication usage)</description>
</property>
(3) Copy some additional jars into HBase's lib directory:
cp lib/hbase-sep-* $HBASE_HOME/lib
(4) Restart the HBase cluster so the configuration takes effect:
cd ${HBASE_HOME}
bin/rolling-restart.sh
(5) Start the hbase-indexer server:
bin/hbase-indexer server
(iv) Testing
1. Create a test table in HBase
$ hbase shell
hbase> create 'indexdemo-user', { NAME => 'info', REPLICATION_SCOPE => '1' }
# Add some data to the table
hbase> put 'indexdemo-user', 'row1', 'info:firstname', 'John'
hbase> put 'indexdemo-user', 'row1', 'info:lastname', 'Smith'
2. Add a configuration file mapping each SOLR field to a qualifier in HBase
<?xml version="1.0"?>
<indexer table="indexdemo-user">
  <field name="firstname_s" value="info:firstname"/>
  <field name="lastname_s" value="info:lastname"/>
  <field name="age_i" value="info:age" type="int"/>
</indexer>
Note that the field names above must also be defined in SOLR's schema.xml before they can be used.
3. Add the indexer
With the file above in place, create the indexer in hbase-indexer by executing:
./bin/hbase-indexer add-indexer -n myindexer -c indexdemo-indexer.xml \
  -cp solr.zk={zk_host}/solr -cp solr.collection=collection1 -z {zk_host}
4. Search
If the command runs without error, the SOLR query page should now return the field data indexed from HBase:
{
  "responseHeader": {
    "status": 0,
    "QTime": 1,
    "params": {
      "indent": "true",
      "q": "*:*",
      "_": "1467685845733",
      "wt": "json"
    }
  },
  "response": {
    "numFound": 3,
    "start": 0,
    "docs": [
      {
        "firstname_s": "hwyn3816",
        "lastname_s": "huanghe",
        "id": "row1",
        "_version_": 1538978729496150000
      },
      {
        "firstname_s": "John",
        "lastname_s": "Smith",
        "id": "row2",
        "_version_": 1538978814854430700
      },
      {
        "firstname_s": "John",
        "lastname_s": "Smith",
        "id": "row3",
        "_version_": 1538979512969068500
      }
    ]
  }
}
(v) Using SolrJ to build an HBase secondary index
See another article.
III. Using coprocessors to build HBase secondary indexes
See another article.