Application of HBase and Solr in Massive Data Query


For historical data queries, traditional relational databases such as Oracle and MySQL can serve the data efficiently as long as the data size is small.

However, once the data reaches a certain scale, a traditional relational database is no longer appropriate; instead, the data can be stored in the distributed database HBase.

Currently, HBase only supports the primary index on the rowkey and does not support secondary indexes. You can, of course, concatenate all the fields to be indexed into the rowkey and query them with HBase's scan and filter capabilities.

For example, to look up a user's orders over a period of time, the rowkey can be designed as rowkey = "100022333|2012-12-23:10:20|orderNum" (userNum | orderTime | orderNum). Because HBase stores data sorted by rowkey by default, one user's historical data is concentrated in a single region, which makes sequential scanning convenient. One drawback of this approach is that paging is poorly supported: the total count and the rowNum used for paging can be computed in a coprocessor, but locating which record a given rowNum starts at is still not well supported, and the total record count has to be computed separately by a coprocessor endpoint, which adds computation; doing the paging on the client side is not feasible for massive data volumes.
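Below is a minimal sketch of such a sequential scan under this rowkey design. The table name "order" and the start/stop rowkeys are assumptions for illustration; paging still has to be tracked by the caller or a coprocessor.

import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.ResultScanner;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.util.Bytes;

public class UserOrderScan {
    public static void main(String[] args) throws Exception {
        // Hypothetical "order" table whose rowkey is userNum|orderTime|orderNum
        HTable table = new HTable(HBaseConfiguration.create(), "order");
        Scan scan = new Scan();
        // All orders of user 100022333 in December 2012, relying on rowkey ordering
        scan.setStartRow(Bytes.toBytes("100022333|2012-12-01"));
        scan.setStopRow(Bytes.toBytes("100022333|2012-12-31~")); // '~' sorts after the time separator
        ResultScanner scanner = table.getScanner(scan);
        int rowNum = 0;
        for (Result r : scanner) {
            rowNum++; // rowNum / total count for paging must be tracked here or in a coprocessor
        }
        scanner.close();
        table.close();
    }
}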

Alternatively, corresponding index tables can be created for this table in HBase: one index table per secondary index. For example, if the rowkey of the primary table is designed as rowkey = "orderNum", then the index table uses rowkey = "userNum|orderDate" with a column cf:orderNum pointing back to the order. This approach still has the same paging problems.
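A minimal sketch of maintaining such a manual index table is shown below. The table names "order" and "order_index" and the column family "cf" are assumptions for illustration.

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.util.Bytes;

public class OrderIndexWriter {
    public static void writeOrder(String orderNum, String userNum, String orderDate,
            byte[] orderData) throws IOException {
        Configuration conf = HBaseConfiguration.create();

        // Primary table: rowkey = orderNum
        HTable orderTable = new HTable(conf, "order");
        Put orderPut = new Put(Bytes.toBytes(orderNum));
        orderPut.add(Bytes.toBytes("cf"), Bytes.toBytes("data"), orderData);
        orderTable.put(orderPut);
        orderTable.close();

        // Index table: rowkey = userNum|orderDate, value = orderNum
        HTable indexTable = new HTable(conf, "order_index");
        Put indexPut = new Put(Bytes.toBytes(userNum + "|" + orderDate));
        indexPut.add(Bytes.toBytes("cf"), Bytes.toBytes("orderNum"), Bytes.toBytes(orderNum));
        indexTable.put(indexPut);
        indexTable.close();
    }
}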

Next, we propose a Solr + HBase approach: Solr builds the index (and supports paging), while HBase stores the data.

When data is written to HBase, the Solr index can be created at the same time. This can be implemented in an observer of HBase's coprocessor framework.

import java.io.IOException;
import java.net.MalformedURLException;
import java.util.List;

import org.apache.hadoop.hbase.KeyValue;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.coprocessor.BaseRegionObserver;
import org.apache.hadoop.hbase.coprocessor.ObserverContext;
import org.apache.hadoop.hbase.coprocessor.RegionCoprocessorEnvironment;
import org.apache.hadoop.hbase.regionserver.wal.WALEdit;
import org.apache.hadoop.hbase.util.Bytes;
import org.apache.solr.client.solrj.SolrServer;
import org.apache.solr.client.solrj.SolrServerException;
import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;
import org.apache.solr.common.SolrInputDocument;

public class SolrIndexCoprocessorObserver extends BaseRegionObserver {

    /**
     * Create the Solr index after each Put on the order table.
     */
    @Override
    public void postPut(final ObserverContext<RegionCoprocessorEnvironment> e,
            final Put put, final WALEdit edit, final boolean writeToWAL)
            throws IOException {
        byte[] rowKey = put.getRow();
        String rowKeyStr = new String(rowKey, "UTF-8");

        // Read the fields to be indexed out of the Put
        List<KeyValue> kv = put.get(Bytes.toBytes("cf"), Bytes.toBytes("orderTime"));
        String orderTime = new String(kv.get(0).getValue(), "UTF-8");
        List<KeyValue> kv2 = put.get(Bytes.toBytes("cf"), Bytes.toBytes("userNum"));
        String userNum = new String(kv2.get(0).getValue(), "UTF-8");

        String solrUrl = "http://10.1.1.57:8082/solr";
        SolrServer solr = null;
        try {
            solr = new CommonsHttpSolrServer(solrUrl);
        } catch (MalformedURLException err) {
            err.printStackTrace();
        }

        // Build the Solr document: id/rowkey point back to the HBase row
        SolrInputDocument siDoc = new SolrInputDocument();
        siDoc.addField("id", rowKeyStr);
        siDoc.addField("rowkey", rowKeyStr);
        siDoc.addField("orderTime", orderTime);
        siDoc.addField("userNum", userNum);
        try {
            solr.add(siDoc);
        } catch (SolrServerException e1) {
            e1.printStackTrace();
        } catch (IOException e2) {
            e2.printStackTrace();
        }

        // Commit after every Put; see the note on commit cost below
        try {
            solr.commit();
        } catch (SolrServerException e3) {
            e3.printStackTrace();
        } catch (IOException e4) {
            e4.printStackTrace();
        }
    }
}

Configure the coprocessor:

alter 't1', METHOD => 'table_att', 'coprocessor' => 'hdfs://xxx.jar|com.newcosoft.hadoop.hbase.SolrIndexCoprocessorObserver|1001'

Multiple coprocessors can be configured for one table, and a sequence number is automatically incremented to identify each of them. A coprocessor must be loaded using the following format:

[Coprocessor jar file location] | class name | [priority] | [arguments]

Regarding Solr's commit operation: after a commit, the index is flushed to disk and a listener is triggered to create a new IndexSearcher (with a new IndexReader that loads the index from disk), so that subsequent queries use the new IndexSearcher. Committing on every write therefore has a significant impact on query performance; for batch imports it is better to call Solr's commit separately once the import is complete.
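As a minimal sketch (reusing the Solr master URL assumed above), each postPut would only call solr.add(siDoc) during the bulk import, and a single commit would be issued afterwards:

import org.apache.solr.client.solrj.SolrServer;
import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;

public class BatchImportCommit {
    public static void main(String[] args) throws Exception {
        SolrServer solr = new CommonsHttpSolrServer("http://10.1.1.57:8082/solr");
        // ... the bulk import runs here; each postPut only calls solr.add(siDoc) ...
        solr.commit(); // one flush and one new IndexSearcher for the whole batch
    }
}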

Here, Solr's master/slave mode is used: the master handles index building, while multiple slaves serve index queries. Each commit on the master generates a new snapshot.

The snappuller program on each slave is usually run from crontab; it checks the master for a new snapshot, and once a new version is found, the slave downloads it and then runs snapinstaller.

When the new searcher is opened, its caches are warmed up first; only after the warm-up is the new index put into service. The execution frequency of snappuller therefore has to be controlled. Solr tuning is not covered in depth here.

Based on the indexed fields plus the paging parameters (start and rows), query Solr for the list of rowkeys (orderNum) of the matching historical orders, then traverse the list and look each one up in HBase.

String solrUrlSlave = "http://10.1.1.59:8082/solr";
SolrServer solr2 = null;
try {
    solr2 = new CommonsHttpSolrServer(solrUrlSlave);
} catch (MalformedURLException e) {
    e.printStackTrace();
}

// Paging is expressed with Solr's start/rows parameters
String queryString = "q=userNum:1111002 AND orderTime:2012-10-12&start=0&rows=10&sort=orderTime+desc";
SolrParams solrParams = SolrRequestParsers.parseQueryString(queryString);
try {
    QueryResponse rsp = solr2.query(solrParams);
    SolrDocumentList docList = rsp.getResults();
    // Traverse docList and look up each rowkey in HBase
    // TODO
} catch (SolrServerException e) {
    e.printStackTrace();
}
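Completing the TODO above, a minimal sketch of the HBase lookup might look like the following; the table name "order" and the column family "cf" are assumptions carried over from the earlier examples.

import java.io.IOException;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.util.Bytes;
import org.apache.solr.common.SolrDocument;
import org.apache.solr.common.SolrDocumentList;

public class OrderFetcher {
    public static void fetchFromHBase(SolrDocumentList docList) throws IOException {
        HTable orderTable = new HTable(HBaseConfiguration.create(), "order");
        for (SolrDocument doc : docList) {
            // The Solr document stores the HBase rowkey of the order
            String rowKey = (String) doc.getFieldValue("rowkey");
            Get get = new Get(Bytes.toBytes(rowKey));
            Result result = orderTable.get(get);
            byte[] orderTime = result.getValue(Bytes.toBytes("cf"), Bytes.toBytes("orderTime"));
            // ... assemble the order record for the current page ...
        }
        orderTable.close();
    }
}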

 

Of course, there are many strategies to consider for making the Solr index highly available (HA). They are not covered here and will be explained in a separate chapter later.

 
