Use HBase EndPoint (coprocessor) for computation

Suppose you want to compute statistics over data in HBase: the maximum value of a field, the number of records matching certain conditions, or a grouping of records by some feature (similar to SQL's GROUP BY).

The common practice is to scan the entire HBase table, perhaps adding a filter so that some preliminary work happens server-side (the RowCounter job, for example, adds a FirstKeyOnlyFilter). Even so, the side effects are significant: a large amount of network bandwidth is consumed (noticeable once the row count reaches the tens of millions), and the number of RPC calls is not to be underestimated either.

What is the ideal method?

Take the simple example of counting rows. If each region could tell me how many rows it contains, I could just sum their answers on the client. Isn't that enough?
The problem is that HBase does not provide an interface for counting the rows in each region. Can we implement one ourselves?
Yes. As the title of this article suggests, we can implement an Endpoint ourselves, load it into HBase, and then invoke it remotely.

What is Endpoint?

First, let's find out what an HBase coprocessor is.

HBase has two types of coprocessor: the Observer, similar to a trigger in a relational database, and the EndPoint, similar to a stored procedure.

Observers are covered elsewhere; this article focuses on the EndPoint.

An EndPoint is a dynamic RPC plug-in interface. Its implementation code is deployed on the server side (the region server) and is invoked through HBase RPC. The client library provides convenient methods to call these dynamic interfaces: the client can invoke an EndPoint at any time, its implementation is executed remotely at the target region(s), and the results are returned to the client. With this powerful plug-in interface you can add new features to HBase.

How to implement an EndPoint

1. Define a new protocol interface, which must extend CoprocessorProtocol.
2. Implement the endpoint interface by extending the abstract class BaseEndpointCoprocessor, and deploy the implementation code to the region servers.
3. On the client side, the endpoint can be invoked through two new HBase client APIs. Single region: HTableInterface.coprocessorProxy(Class<T> protocol, byte[] row). Region range: HTableInterface.coprocessorExec(Class<T> protocol, byte[] startKey, byte[] endKey, Batch.Call<T, R> callable). Here a region is identified by a row key: the RPC is sent to the region into which that row falls. For a start/end pair, every region within the range [start, end) is called.

public interface CounterProtocol extends CoprocessorProtocol {
    public long count(byte[] start, byte[] end) throws IOException;
}
public class CounterEndPoint extends BaseEndpointCoprocessor implements CounterProtocol {
    @Override
    public long count(byte[] start, byte[] end) throws IOException {
        // aggregate at each region
        Scan scan = new Scan();
        long numRow = 0;
        InternalScanner scanner = ((RegionCoprocessorEnvironment) getEnvironment())
                .getRegion().getScanner(scan);
        try {
            List<KeyValue> curVals = new ArrayList<KeyValue>();
            boolean hasMore = false;
            do {
                curVals.clear();
                hasMore = scanner.next(curVals);
                if (curVals.isEmpty()) {
                    continue;
                }
                // stop once we scan past the requested end key
                if (Bytes.compareTo(curVals.get(0).getRow(), end) >= 0) {
                    break;
                }
                numRow++;
            } while (hasMore);
        } finally {
            scanner.close();
        }
        return numRow;
    }
}
public class CounterEndPointDemo {
    public static void main(String[] args) throws IOException, Throwable {
        final String startRow = args[0];
        final String endRow = args[1];

        @SuppressWarnings("resource")
        HTableInterface table = new HTable(HBaseConfiguration.create(), "tc");

        // exec: for all regions covering [startRow, endRow)
        Map<byte[], Long> results = table.coprocessorExec(
                CounterProtocol.class, startRow.getBytes(), endRow.getBytes(),
                new Batch.Call<CounterProtocol, Long>() {
                    public Long call(CounterProtocol instance) throws IOException {
                        return instance.count(startRow.getBytes(), endRow.getBytes());
                    }
                });

        long total = 0;
        for (Map.Entry<byte[], Long> e : results.entrySet()) {
            System.out.println(e.getValue());
            total += e.getValue();
        }
        System.out.println("total:" + total);
    }
}

The structure of the whole program is really just another MapReduce: the "map" runs on the region servers and the "reduce" runs on the client, where the map does the heavy computation and the reduce does very little.
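To make the map/reduce split concrete, here is a stdlib-only sketch that simulates it: each "region" counts its own rows in a key range locally (the map side), and the client merely sums the partial results (the reduce side). The class and method names are illustrative, not HBase APIs.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical stand-in for the coprocessor pattern: each "region" answers a
// count over [start, end) locally; the client sums the partial results,
// mirroring coprocessorExec plus the client-side merge.
public class RegionCountSketch {
    // the "map" side: runs where the data lives
    static long countInRegion(List<String> regionRows, String start, String end) {
        long n = 0;
        for (String row : regionRows) {
            if (row.compareTo(start) >= 0 && row.compareTo(end) < 0) {
                n++;
            }
        }
        return n;
    }

    public static void main(String[] args) {
        // three "regions", each holding its own sorted slice of row keys
        Map<String, List<String>> regions = new LinkedHashMap<>();
        regions.put("region-1", Arrays.asList("row01", "row02", "row03"));
        regions.put("region-2", Arrays.asList("row04", "row05"));
        regions.put("region-3", Arrays.asList("row06", "row07", "row08"));

        // the "reduce" side: the client only sums small partial results
        long total = 0;
        for (Map.Entry<String, List<String>> e : regions.entrySet()) {
            total += countInRegion(e.getValue(), "row02", "row07");
        }
        System.out.println("total:" + total); // rows row02..row06 -> total:5
    }
}
```

Only the per-region long travels over the "network" here, which is exactly why the real coprocessor version saves bandwidth compared with shipping every row to the client.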

Note the following:
The return type of the protocol methods can be a primitive type.
A custom return type must implement the org.apache.hadoop.io.Writable interface.
For the full list of supported types, see the HBase source code: org.apache.hadoop.hbase.io.HbaseObjectWritable.
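The Writable contract boils down to two methods, write(DataOutput) and readFields(DataInput). The sketch below shows a hypothetical custom result type (CountResult is my name, not from the article) doing the serialization round trip with the JDK alone; real coprocessor code would additionally declare implements org.apache.hadoop.io.Writable.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInput;
import java.io.DataInputStream;
import java.io.DataOutput;
import java.io.DataOutputStream;
import java.io.IOException;

// Hypothetical custom return type. The two methods below are exactly the
// Writable contract (write/readFields), so the sketch runs without Hadoop.
public class CountResult {
    long rowCount;
    long keyValueCount;

    public void write(DataOutput out) throws IOException {
        out.writeLong(rowCount);
        out.writeLong(keyValueCount);
    }

    public void readFields(DataInput in) throws IOException {
        rowCount = in.readLong();
        keyValueCount = in.readLong();
    }

    public static void main(String[] args) throws IOException {
        CountResult sent = new CountResult();
        sent.rowCount = 42;
        sent.keyValueCount = 1000;

        // serialize, as the RPC layer would on the region server
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        sent.write(new DataOutputStream(buf));

        // deserialize on the client side
        CountResult received = new CountResult();
        received.readFields(new DataInputStream(new ByteArrayInputStream(buf.toByteArray())));
        System.out.println(received.rowCount + " " + received.keyValueCount); // 42 1000
    }
}
```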

How to deploy it?

1. Add it through hbase-site.xml

  <property>
    <name>hbase.coprocessor.region.classes</name>
    <value>xxxx.CounterEndPoint</value>
  </property>

  1. To configure multiple coprocessors, separate the class names with commas (,).
  2. The jar containing the class must be on the HBase classpath.
  3. A coprocessor configured this way applies to all tables. If you only want it on some tables, use the following method.

2. Using the shell
To add:

hbase(main):005:0> alter 't1', METHOD => 'table_att',
  'coprocessor' => 'hdfs:///foo.jar|com.foo.FooRegionObserver|1001|arg1=1,arg2=2'
Updating all regions with the new schema...
1/1 regions updated.
Done.
0 row(s) in 1.0730 seconds

Coprocessor attribute format:
[FilePath]|ClassName|Priority|arguments
arguments: k=v[,k=v]+

  1. FilePath is an HDFS path, such as /tmp/zhenhe/cp/zhenhe-1.0.jar.
  2. ClassName is the fully qualified name of the EndPoint implementation class.
  3. Priority is an integer; the framework uses it to determine the execution order when multiple coprocessors are loaded.
  4. arguments are the parameters passed to the coprocessor.
  5. If the HBase classpath already contains the class, FilePath can be left blank.
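To see what the "k=v[,k=v]+" argument syntax means in practice, here is an illustrative parser (not an HBase API; the class and method names are mine) turning such a string into a map, the shape in which a coprocessor would consume its arguments.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Illustrative helper (not an HBase API): parses the "k=v[,k=v]+" argument
// string from the coprocessor table attribute into an ordered map.
public class CoprocessorArgs {
    static Map<String, String> parse(String args) {
        Map<String, String> out = new LinkedHashMap<>();
        if (args == null || args.isEmpty()) {
            return out;
        }
        for (String pair : args.split(",")) {
            int eq = pair.indexOf('=');
            out.put(pair.substring(0, eq).trim(), pair.substring(eq + 1).trim());
        }
        return out;
    }

    public static void main(String[] args) {
        // the arguments segment from the alter command above
        Map<String, String> parsed = parse("arg1=1,arg2=2");
        System.out.println(parsed); // {arg1=1, arg2=2}
    }
}
```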

To unload:

  1. First run describe 'tableName' to find the number of the coprocessor attribute you want to unload, e.g. coprocessor$3.
  2. Then run alter 't1', METHOD => 'table_att_unset', NAME => 'coprocessor$3', using the attribute name found in the previous step.
Application scenarios

This was the simplest possible example. Many other statistical scenarios can be implemented the same way, with the following benefits:

  1. Saves network bandwidth.
  2. Reduces the number of RPC calls (scan RPCs grow linearly as the caching size shrinks), reducing pressure on HBase.
  3. Improves statistical efficiency: in an earlier GROUP BY-style job of mine, the computation was more than 50% faster.

Other application scenarios:

  1. A table storing user information, where you want per-user statistics (a counter job).
  2. Computing the maximum, minimum, and average of a field; see https://issues.apache.org/jira/browse/HBASE-1512.
  3. Batch-deleting records, e.g. deleting all records older than a given timestamp.

References:

1. http://blogs.apache.org/hbase/entry/coprocessor_introduction
2. https://issues.apache.org/jira/browse/HBASE-1512

Original article: Use HBase EndPoint (coprocessor) for computation. Thanks to the original author for sharing.
