Application of HBase and Solr in Massive Data Query


For historical data queries, traditional relational databases such as Oracle and MySQL can serve the data efficiently as long as the data size is small.

However, once the data reaches a certain scale, a traditional relational database is no longer appropriate; instead, the data can be stored in the distributed database HBase.

Currently, HBase only supports the primary index on the rowkey and does not support secondary indexes. You can, of course, concatenate all the fields to be indexed into the rowkey and query them with HBase's scan and filter capabilities.

For example, to look up a user's orders over a period of time, the rowkey can be designed as rowkey = "100022333|2012-12-23:10:20|orderNum" (userNum | orderTime | orderNum). Because HBase stores data sorted by rowkey by default, one user's historical data is concentrated in a single region, which makes sequential scanning convenient. One drawback of this approach is that paging is poorly supported: the total count and the rowNum used for paging can be computed in a coprocessor, but locating which record a given rowNum starts at is still not well supported, and the total record count has to be computed separately by a coprocessor endpoint, which adds computation; doing the paging on the client side is not feasible for massive data volumes.
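Below is a minimal sketch of such a sequential scan under this rowkey design. The table name "order" and the start/stop rowkeys are assumptions for illustration; paging still has to be tracked by the caller or a coprocessor.

import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.ResultScanner;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.util.Bytes;

public class UserOrderScan {
    public static void main(String[] args) throws Exception {
        // Hypothetical "order" table whose rowkey is userNum|orderTime|orderNum
        HTable table = new HTable(HBaseConfiguration.create(), "order");
        Scan scan = new Scan();
        // All orders of user 100022333 in December 2012, relying on rowkey ordering
        scan.setStartRow(Bytes.toBytes("100022333|2012-12-01"));
        scan.setStopRow(Bytes.toBytes("100022333|2012-12-31~")); // '~' sorts after the time separator
        ResultScanner scanner = table.getScanner(scan);
        int rowNum = 0;
        for (Result r : scanner) {
            rowNum++; // rowNum / total count for paging must be tracked here or in a coprocessor
        }
        scanner.close();
        table.close();
    }
}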

Alternatively, corresponding index tables can be created for this table in HBase: one index table per secondary index. For example, if the rowkey of the primary table is designed as rowkey = "orderNum", then the index table uses rowkey = "userNum|orderDate" with a column cf:orderNum pointing back to the order. This approach still has the same paging problems.
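A minimal sketch of maintaining such a manual index table is shown below. The table names "order" and "order_index" and the column family "cf" are assumptions for illustration.

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.util.Bytes;

public class OrderIndexWriter {
    public static void writeOrder(String orderNum, String userNum, String orderDate,
            byte[] orderData) throws IOException {
        Configuration conf = HBaseConfiguration.create();

        // Primary table: rowkey = orderNum
        HTable orderTable = new HTable(conf, "order");
        Put orderPut = new Put(Bytes.toBytes(orderNum));
        orderPut.add(Bytes.toBytes("cf"), Bytes.toBytes("data"), orderData);
        orderTable.put(orderPut);
        orderTable.close();

        // Index table: rowkey = userNum|orderDate, value = orderNum
        HTable indexTable = new HTable(conf, "order_index");
        Put indexPut = new Put(Bytes.toBytes(userNum + "|" + orderDate));
        indexPut.add(Bytes.toBytes("cf"), Bytes.toBytes("orderNum"), Bytes.toBytes(orderNum));
        indexTable.put(indexPut);
        indexTable.close();
    }
}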

Next, we propose a Solr + HBase approach: Solr builds the index (and supports paging), while HBase stores the data.

When data is written to HBase, the Solr index can be created at the same time. This can be implemented in an observer of HBase's coprocessor framework.

import java.io.IOException;
import java.net.MalformedURLException;
import java.util.List;

import org.apache.hadoop.hbase.KeyValue;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.coprocessor.BaseRegionObserver;
import org.apache.hadoop.hbase.coprocessor.ObserverContext;
import org.apache.hadoop.hbase.coprocessor.RegionCoprocessorEnvironment;
import org.apache.hadoop.hbase.regionserver.wal.WALEdit;
import org.apache.hadoop.hbase.util.Bytes;
import org.apache.solr.client.solrj.SolrServer;
import org.apache.solr.client.solrj.SolrServerException;
import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;
import org.apache.solr.common.SolrInputDocument;

public class SolrIndexCoprocessorObserver extends BaseRegionObserver {

    /**
     * Create the Solr index after each Put on the order table.
     */
    @Override
    public void postPut(final ObserverContext<RegionCoprocessorEnvironment> e,
            final Put put, final WALEdit edit, final boolean writeToWAL)
            throws IOException {
        byte[] rowKey = put.getRow();
        String rowKeyStr = new String(rowKey, "UTF-8");

        // Read the fields to be indexed out of the Put
        List<KeyValue> kv = put.get(Bytes.toBytes("cf"), Bytes.toBytes("orderTime"));
        String orderTime = new String(kv.get(0).getValue(), "UTF-8");
        List<KeyValue> kv2 = put.get(Bytes.toBytes("cf"), Bytes.toBytes("userNum"));
        String userNum = new String(kv2.get(0).getValue(), "UTF-8");

        String solrUrl = "http://10.1.1.57:8082/solr";
        SolrServer solr = null;
        try {
            solr = new CommonsHttpSolrServer(solrUrl);
        } catch (MalformedURLException err) {
            err.printStackTrace();
        }

        // Build the Solr document: id/rowkey point back to the HBase row
        SolrInputDocument siDoc = new SolrInputDocument();
        siDoc.addField("id", rowKeyStr);
        siDoc.addField("rowkey", rowKeyStr);
        siDoc.addField("orderTime", orderTime);
        siDoc.addField("userNum", userNum);
        try {
            solr.add(siDoc);
        } catch (SolrServerException e1) {
            e1.printStackTrace();
        } catch (IOException e2) {
            e2.printStackTrace();
        }

        // Commit after every Put; see the note on commit cost below
        try {
            solr.commit();
        } catch (SolrServerException e3) {
            e3.printStackTrace();
        } catch (IOException e4) {
            e4.printStackTrace();
        }
    }
}

Configure the coprocessor:

alter 't1', METHOD => 'table_att', 'coprocessor' => 'hdfs://xxx.jar|com.newcosoft.hadoop.hbase.SolrIndexCoprocessorObserver|1001'

Multiple coprocessors can be configured for one table, and a sequence number is automatically incremented to identify each of them. A coprocessor must be loaded using the following format:

[Coprocessor jar file location] | class name | [priority] | [arguments]

Regarding Solr's commit operation: after a commit, the index is flushed to disk and a listener is triggered to create a new IndexSearcher (with a new IndexReader that loads the index from disk), so that subsequent queries use the new IndexSearcher. Committing on every write therefore has a significant impact on query performance; for batch imports it is better to call Solr's commit separately once the import is complete.
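As a minimal sketch (reusing the Solr master URL assumed above), each postPut would only call solr.add(siDoc) during the bulk import, and a single commit would be issued afterwards:

import org.apache.solr.client.solrj.SolrServer;
import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;

public class BatchImportCommit {
    public static void main(String[] args) throws Exception {
        SolrServer solr = new CommonsHttpSolrServer("http://10.1.1.57:8082/solr");
        // ... the bulk import runs here; each postPut only calls solr.add(siDoc) ...
        solr.commit(); // one flush and one new IndexSearcher for the whole batch
    }
}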

Here, Solr's master/slave mode is used: the master handles index building, while multiple slaves serve index queries. Each commit on the master generates a new snapshot.

The snappuller program on each slave is usually run from crontab; it checks the master for a new snapshot, and once a new version is found, the slave downloads it and then runs snapinstaller.

When the new searcher is opened, its caches are warmed up first; only after the warm-up is the new index put into service. The execution frequency of snappuller therefore has to be controlled. Solr tuning is not covered in depth here.

Based on the indexed fields plus the paging parameters (start and rows), query Solr for the list of rowkeys (orderNum) of the matching historical orders, then traverse the list and look each one up in HBase.

String solrUrlSlave = "http://10.1.1.59:8082/solr";
SolrServer solr2 = null;
try {
    solr2 = new CommonsHttpSolrServer(solrUrlSlave);
} catch (MalformedURLException e) {
    e.printStackTrace();
}

// Paging is expressed with Solr's start/rows parameters
String queryString = "q=userNum:1111002 AND orderTime:2012-10-12&start=0&rows=10&sort=orderTime+desc";
SolrParams solrParams = SolrRequestParsers.parseQueryString(queryString);
try {
    QueryResponse rsp = solr2.query(solrParams);
    SolrDocumentList docList = rsp.getResults();
    // Traverse docList and look up each rowkey in HBase
    // TODO
} catch (SolrServerException e) {
    e.printStackTrace();
}
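Completing the TODO above, a minimal sketch of the HBase lookup might look like the following; the table name "order" and the column family "cf" are assumptions carried over from the earlier examples.

import java.io.IOException;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.util.Bytes;
import org.apache.solr.common.SolrDocument;
import org.apache.solr.common.SolrDocumentList;

public class OrderFetcher {
    public static void fetchFromHBase(SolrDocumentList docList) throws IOException {
        HTable orderTable = new HTable(HBaseConfiguration.create(), "order");
        for (SolrDocument doc : docList) {
            // The Solr document stores the HBase rowkey of the order
            String rowKey = (String) doc.getFieldValue("rowkey");
            Get get = new Get(Bytes.toBytes(rowKey));
            Result result = orderTable.get(get);
            byte[] orderTime = result.getValue(Bytes.toBytes("cf"), Bytes.toBytes("orderTime"));
            // ... assemble the order record for the current page ...
        }
        orderTable.close();
    }
}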

 

Of course, there are many strategies to consider for making the Solr index highly available (HA). They are not covered here and will be explained in a separate chapter later.

 
