Synchronizing HBase data to Elasticsearch with an HBase coprocessor
A recent project needed to synchronize data from HBase to Elasticsearch: whenever data is put to or deleted from HBase, the corresponding index in the ES cluster has to be updated or deleted as well. I tried the hbase-river plugin and found it unsatisfactory, so I gathered some material online, organized it, and wrote a synchronization component of my own based on an HBase coprocessor. The result is quite good, so I am sharing it here. If you find anything that should be optimized or corrected, leave me a message on my CSDN blog; the code is hosted on Gitee (code cloud) as hbase-observer-elasticsearch. I also want to thank Gavin Zhang (2shou). I do not know him, but I read his blog on synchronizing data and his synchronization component code, and made some optimizations and adjustments on top of his work to fit my own needs. I hope that opening up my code can, in the same way, help others write more and more useful things. This post covers: the HBase coprocessor, writing the component, deploying the component, validating the component, and a summary.
HBase coprocessor
HBase introduced coprocessors in release 0.92. A coprocessor is a framework that runs inside the Master/RegionServer and can execute user code there, which makes it possible to carry out distributed data processing tasks flexibly. HBase supports two types of coprocessor: Endpoint and Observer.

An Endpoint coprocessor is similar to a stored procedure in a traditional database: a client invokes it to run a piece of server-side code and gets the result back for further processing. The most common use is aggregation. Without a coprocessor, finding the maximum value in a table (a max aggregation) requires a full table scan, with the client traversing the scan results and computing the maximum itself. Such an approach cannot exploit the concurrency of the underlying cluster; centralizing all the computation on the client is inefficient. With a coprocessor, the user deploys the max code to the HBase servers, and HBase runs it concurrently on the nodes of the cluster: the code executes within each Region, each RegionServer computes the maximum for its Regions, and only those per-Region maxima are returned to the client, which then takes the maximum of that small set. This greatly improves overall efficiency.

The other type, the Observer coprocessor, is similar to a trigger in a traditional database: it is invoked by the server side when certain events occur. An Observer is a set of hooks scattered through the HBase server-side code that are called when specific events happen. For example, the prePut hook is called by the RegionServer before a put operation is executed, and the postPut hook is called after it. In practice the Observer type is used more often because it is more flexible: it is bound per table. If HBase has 10 tables and I only want to bind 5 of them, the other 5 can simply be left unbound. The Observer approach is the one introduced below.
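To make the hook mechanism concrete, here is a minimal, generic Observer sketch that simply logs the row key of every put. This is an illustration only, not the synchronization component built in the rest of this post; it uses the same HBase APIs as that component.

import org.apache.commons.logging.Log;
import org.apache.commons.logging.LogFactory;
import org.apache.hadoop.hbase.client.Durability;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.coprocessor.BaseRegionObserver;
import org.apache.hadoop.hbase.coprocessor.ObserverContext;
import org.apache.hadoop.hbase.coprocessor.RegionCoprocessorEnvironment;
import org.apache.hadoop.hbase.regionserver.wal.WALEdit;
import org.apache.hadoop.hbase.util.Bytes;

import java.io.IOException;

// Minimal Observer sketch: log each row key after a Put is applied.
// Illustration of the hook mechanism only; the real ES synchronization observer follows below.
public class LoggingObserver extends BaseRegionObserver {
    private static final Log LOG = LogFactory.getLog(LoggingObserver.class);

    @Override
    public void postPut(ObserverContext<RegionCoprocessorEnvironment> e, Put put,
                        WALEdit edit, Durability durability) throws IOException {
        LOG.info("row put: " + Bytes.toString(put.getRow()));
    }
}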
Writing the component
First, write an ESClient class that is used to connect to the ES cluster:
package org.eminem.hbase.observer;

import org.elasticsearch.client.Client;
import org.elasticsearch.client.transport.TransportClient;
import org.elasticsearch.common.lang3.StringUtils;
import org.elasticsearch.common.settings.ImmutableSettings;
import org.elasticsearch.common.settings.Settings;
import org.elasticsearch.common.transport.InetSocketTransportAddress;

import java.lang.reflect.Field;
import java.util.ArrayList;
import java.util.List;

/**
 * ES client class
 */
public class ESClient {

    // elasticsearch cluster name
    public static String clusterName;
    // elasticsearch host
    public static String nodeHost;
    // elasticsearch port (the Java API uses the transport port, i.e. TCP, not the HTTP port)
    public static int nodePort;
    // elasticsearch index name
    public static String indexName;
    // elasticsearch type name
    public static String typeName;
    // elasticsearch client
    public static Client client;

    // get the ES config as a string (used for logging)
    public static String getInfo() {
        List<String> fields = new ArrayList<String>();
        try {
            for (Field f : ESClient.class.getDeclaredFields()) {
                fields.add(f.getName() + "=" + f.get(null));
            }
        } catch (IllegalAccessException ex) {
            ex.printStackTrace();
        }
        return StringUtils.join(fields, ",");
    }

    // init the ES client
    public static void initEsClient() {
        Settings settings = ImmutableSettings.settingsBuilder()
                .put("cluster.name", ESClient.clusterName).build();
        client = new TransportClient(settings)
                .addTransportAddress(new InetSocketTransportAddress(ESClient.nodeHost, ESClient.nodePort));
    }

    // close the ES client
    public static void closeEsClient() {
        client.close();
    }
}
Next, write a class that extends BaseRegionObserver and overrides the start(), stop(), postPut(), and postDelete() methods. These four methods are easy to understand: they run when the coprocessor starts, when it stops, after a put stores data in HBase, and after a delete removes data from HBase, and in each of them we can do whatever we need. All we have to do is initialize the ES client in start(), close the ES client and shut down the scheduled task in stop(), and in the two post-event hooks bulk the data from HBase into ES. That is all it takes.
package org.eminem.hbase.observer;

import org.apache.commons.logging.Log;
import org.apache.commons.logging.LogFactory;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.Cell;
import org.apache.hadoop.hbase.CellUtil;
import org.apache.hadoop.hbase.CoprocessorEnvironment;
import org.apache.hadoop.hbase.client.Delete;
import org.apache.hadoop.hbase.client.Durability;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.coprocessor.BaseRegionObserver;
import org.apache.hadoop.hbase.coprocessor.ObserverContext;
import org.apache.hadoop.hbase.coprocessor.RegionCoprocessorEnvironment;
import org.apache.hadoop.hbase.regionserver.wal.WALEdit;
import org.apache.hadoop.hbase.util.Bytes;

import java.io.IOException;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.NavigableMap;

/**
 * Sync HBase data to ES
 */
public class HbaseDataSyncEsObserver extends BaseRegionObserver {

    private static final Log LOG = LogFactory.getLog(HbaseDataSyncEsObserver.class);

    // read the ES config from the coprocessor parameters
    private static void readConfiguration(CoprocessorEnvironment env) {
        Configuration conf = env.getConfiguration();
        ESClient.clusterName = conf.get("es_cluster");
        ESClient.nodeHost = conf.get("es_host");
        ESClient.nodePort = conf.getInt("es_port", -1);
        ESClient.indexName = conf.get("es_index");
        ESClient.typeName = conf.get("es_type");
    }

    @Override
    public void start(CoprocessorEnvironment e) throws IOException {
        // read the config and init the ES client
        readConfiguration(e);
        ESClient.initEsClient();
        LOG.error("------observer init ESClient------" + ESClient.getInfo());
    }

    @Override
    public void stop(CoprocessorEnvironment e) throws IOException {
        // close the ES client and shut down the scheduled task
        ESClient.closeEsClient();
        ElasticSearchBulkOperator.shutdownScheduEx();
    }

    // called after the client stores a value: prepare an update builder and add it to the bulk
    @Override
    public void postPut(ObserverContext<RegionCoprocessorEnvironment> e, Put put, WALEdit edit, Durability durability) throws IOException {
        String indexId = new String(put.getRow());
        try {
            NavigableMap<byte[], List<Cell>> familyMap = put.getFamilyCellMap();
            Map<String, Object> infoJson = new HashMap<String, Object>();
            Map<String, Object> json = new HashMap<String, Object>();
            for (Map.Entry<byte[], List<Cell>> entry : familyMap.entrySet()) {
                for (Cell cell : entry.getValue()) {
                    String key = Bytes.toString(CellUtil.cloneQualifier(cell));
                    String value = Bytes.toString(CellUtil.cloneValue(cell));
                    json.put(key, value);
                }
            }
            // wrap the HBase family as an "info" node in ES
            infoJson.put("info", json);
            ElasticSearchBulkOperator.addUpdateBuilderToBulk(ESClient.client.prepareUpdate(ESClient.indexName, ESClient.typeName, indexId).setDocAsUpsert(true).setDoc(infoJson));
        } catch (Exception ex) {
            LOG.error("observer put a doc, index [" + ESClient.indexName + "], indexId [" + indexId + "] error: " + ex.getMessage());
        }
    }

    // called after the client deletes a value: prepare a delete builder and add it to the bulk
    @Override
    public void postDelete(ObserverContext<RegionCoprocessorEnvironment> e, Delete delete, WALEdit edit, Durability durability) throws IOException {
        String indexId = new String(delete.getRow());
        try {
            ElasticSearchBulkOperator.addDeleteBuilderToBulk(ESClient.client.prepareDelete(ESClient.indexName, ESClient.typeName, indexId));
        } catch (Exception ex) {
            LOG.error(ex);
            LOG.error("observer delete a doc, index [" + ESClient.indexName + "], indexId [" + indexId + "] error: " + ex.getMessage());
        }
    }
}
The "info" node in this code was added for my own requirement, which was to index the HBase column family into ES as a nested object. You can adapt this to your own needs and drop the "info" node, writing the HBase fields directly into the ES document.
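If you do not need the family wrapper, postPut() could instead look roughly like the following sketch (same imports and helper classes as the component above; only the document structure sent to ES changes):

// Alternative postPut sketch: index qualifiers as top-level fields
// instead of nesting them under an "info" node.
@Override
public void postPut(ObserverContext<RegionCoprocessorEnvironment> e, Put put,
                    WALEdit edit, Durability durability) throws IOException {
    String indexId = new String(put.getRow());
    try {
        Map<String, Object> json = new HashMap<String, Object>();
        for (Map.Entry<byte[], List<Cell>> entry : put.getFamilyCellMap().entrySet()) {
            for (Cell cell : entry.getValue()) {
                json.put(Bytes.toString(CellUtil.cloneQualifier(cell)),
                        Bytes.toString(CellUtil.cloneValue(cell)));
            }
        }
        // no infoJson wrapper: the flat map itself becomes the ES document
        ElasticSearchBulkOperator.addUpdateBuilderToBulk(
                ESClient.client.prepareUpdate(ESClient.indexName, ESClient.typeName, indexId)
                        .setDocAsUpsert(true).setDoc(json));
    } catch (Exception ex) {
        LOG.error("observer put a doc, index [" + ESClient.indexName + "], indexId ["
                + indexId + "] error: " + ex.getMessage());
    }
}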
The last piece is the key one: the code that bulks data into ES. Building on 2shou's code, I wrote this part with a ScheduledExecutorService instead of a Timer; as for why not a Timer, you can look up the differences between the two yourself, I will not go into them here. In the ElasticSearchBulkOperator class, the ScheduledExecutorService runs a periodic task every 30 seconds that flushes whatever is in the buffer pool to ES and empties it, then waits for the next run. In addition, the buffer has a threshold of 10000: whenever adding a request pushes it past that threshold, the buffered data is bulked into ES immediately. Initializing the scheduled task requires a beeper worker thread, with an initial delay of 10 seconds. Another important point is that the bulk process must be protected by a lock.
package org.eminem.hbase.observer;

import org.apache.commons.logging.Log;
import org.apache.commons.logging.LogFactory;
import org.elasticsearch.action.bulk.BulkRequestBuilder;
import org.elasticsearch.action.bulk.BulkResponse;
import org.elasticsearch.action.delete.DeleteRequestBuilder;
import org.elasticsearch.action.update.UpdateRequestBuilder;

import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.locks.Lock;
import java.util.concurrent.locks.ReentrantLock;

/**
 * Bulk HBase data to Elasticsearch
 */
public class ElasticSearchBulkOperator {

    private static final Log LOG = LogFactory.getLog(ElasticSearchBulkOperator.class);
    private static final int MAX_BULK_COUNT = 10000;
    private static BulkRequestBuilder bulkRequestBuilder = null;
    private static final Lock commitLock = new ReentrantLock();
    private static ScheduledExecutorService scheduledExecutorService = null;

    static {
        // init the ES bulk request builder
        bulkRequestBuilder = ESClient.client.prepareBulk();
        bulkRequestBuilder.setRefresh(true);
        // init a thread pool of size 1
        scheduledExecutorService = Executors.newScheduledThreadPool(1);
        // create the beeper thread that syncs data to the ES cluster;
        // commitLock keeps the bulk thread-safe
        final Runnable beeper = new Runnable() {
            public void run() {
                commitLock.lock();
                try {
                    bulkRequest(0);
                } catch (Exception ex) {
                    System.out.println(ex.getMessage());
                    LOG.error("time bulk " + ESClient.indexName + " index error: " + ex.getMessage());
                } finally {
                    commitLock.unlock();
                }
            }
        };
        // schedule the beeper thread: 10 seconds initial delay, 30 seconds between successive executions
        scheduledExecutorService.scheduleAtFixedRate(beeper, 10, 30, TimeUnit.SECONDS);
    }

    // shut down the scheduled task immediately
    public static void shutdownScheduEx() {
        if (null != scheduledExecutorService && !scheduledExecutorService.isShutdown()) {
            scheduledExecutorService.shutdown();
        }
    }

    // send the bulk request when the number of builders is greater than the threshold
    private static void bulkRequest(int threshold) {
        if (bulkRequestBuilder.numberOfActions() > threshold) {
            BulkResponse bulkItemResponse = bulkRequestBuilder.execute().actionGet();
            if (!bulkItemResponse.hasFailures()) {
                bulkRequestBuilder = ESClient.client.prepareBulk();
            }
        }
    }

    // add an update builder to the bulk; commitLock keeps the bulk thread-safe
    public static void addUpdateBuilderToBulk(UpdateRequestBuilder builder) {
        commitLock.lock();
        try {
            bulkRequestBuilder.add(builder);
            bulkRequest(MAX_BULK_COUNT);
        } catch (Exception ex) {
            LOG.error("update bulk " + ESClient.indexName + " index error: " + ex.getMessage());
        } finally {
            commitLock.unlock();
        }
    }

    // add a delete builder to the bulk; commitLock keeps the bulk thread-safe
    public static void addDeleteBuilderToBulk(DeleteRequestBuilder builder) {
        commitLock.lock();
        try {
            bulkRequestBuilder.add(builder);
            bulkRequest(MAX_BULK_COUNT);
        } catch (Exception ex) {
            LOG.error("delete bulk " + ESClient.indexName + " index error: " + ex.getMessage());
        } finally {
            commitLock.unlock();
        }
    }
}
At this point the code is complete, and all that remains is to package and deploy it.
Deploying the component
Package with Maven:
mvn clean package
Upload the jar to HDFS using the shell:
hadoop fs -put hbase-observer-elasticsearch-1.0-SNAPSHOT-zcestestrecord.jar /hbase_es
hadoop fs -chmod -R 777 /hbase_es
Validating the component
hbase shell
create 'test_record', 'info'
disable 'test_record'
alter 'test_record', METHOD => 'table_att', 'coprocessor' => 'hdfs:///hbase_es/hbase-observer-elasticsearch-1.0-SNAPSHOT-zcestestrecord.jar|org.eminem.hbase.observer.HbaseDataSyncEsObserver|1001|es_cluster=zcits,es_type=zcestestrecord,es_index=zcestestrecord,es_port=9100,es_host=master'
enable 'test_record'
put 'test_record', 'test1', 'info:c1', 'value1'
deleteall 'test_record', 'test1'
Before binding, you also need to create the corresponding index in the ES cluster (a sketch of creating it follows the list below). The binding command is explained as follows:
Package the Java project as a jar and upload it to the chosen path on HDFS.
Enter the HBase shell and disable the table you want to bind.
Activate the observer with the alter command.
The coprocessor value is separated by |, and consists of, in order:
- the HDFS path of the jar package
- the main (fully qualified) class of the observer
- the priority (generally left unchanged)
- the parameters (here the ES connection settings passed to the coprocessor)
The newly installed coprocessor is automatically named coprocessor$ + ordinal, which you can view with describe 'table_name'.
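As noted above, the target index has to exist in ES before the table is bound. Here is a minimal sketch of creating it with the same transport client, using the index name from the alter command above; the mappings are omitted and would depend on your own fields.

// Sketch: create the target index with the ES Java transport client before binding.
// Assumes ESClient.initEsClient() has already been called with the same cluster settings;
// define mappings to match your own fields, they are omitted here.
ESClient.client.admin().indices().prepareCreate("zcestestrecord").execute().actionGet();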
If you later change the contents of the jar, you need to repackage it, unbind the old coprocessor from the target table, and only then bind the new jar to the target table again. The unbinding commands are as follows:
hbase shell
disable 'test_record'
alter 'test_record', METHOD => 'table_att_unset', NAME => 'coprocessor$1'
enable 'test_record'
desc 'test_record'
Summary
If an error occurs or data fails to sync after binding, check the hbase-root-regionserver-slave*.log files in the logs directory on the HBase nodes. Because the coprocessor is deployed on the RegionServers, the logs have to be viewed on those slave nodes, not on the master node.
I had looked at the source code of the hbase-river plugin before: it periodically scans the entire table and bulks it into ES, whereas the component written here is driven by HBase events. The difference in both effect and performance is self-evident: one is a full synchronization, the other is incremental. In real development we certainly want to synchronize only the data that was actually inserted, updated, or deleted, and simply ignore data that has not changed.
Between Timer and ScheduledExecutorService I chose ScheduledExecutorService. 2shou had mentioned a pitfall when deploying the plugin: after modifying the Java code, the jar uploaded to HDFS must not have the same file name as before, otherwise even unloading the original coprocessor and reinstalling it will not take effect. I hit this pit as well, because I had not overridden the stop() method to shut down the scheduled task: the thread keeps hanging there, and once an error occurs HBase may fail to start, forcing you to kill the corresponding thread. This pit cost me quite a while. Remember to override the stop() method and close any threads or clients you opened; that is the safest approach.