Investigating and Resolving NotServingRegionException in an HBase Cluster

Last Update:2017-02-27 Source: Internet

Author: User



While an HBase cluster is serving reads and writes, a region may briefly go offline because of a region split, region balancing, or similar causes. RPC calls from the client to the cluster then throw NotServingRegionException, and the read or write operation fails. Drawing on experience from a real project, this article describes how the problem was discovered and how it was investigated.

1. Finding problems

During stress testing of an HBase cluster, we found that when the write and query load was several times the normal level (cluster size of 10 to 20 machines, with hundreds of thousands of records read and written per second), reads and writes on the cluster fluctuated to some degree. Specifically:

1) The write side throws the following exception:

org.apache.hadoop.hbase.client.RetriesExhaustedWithDetailsException: Failed actions: NotServingRegionException: servers with issues: my161208.cm6:60020,

at org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation.processBatchCallback(HConnectionManager.java:1600)

at org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation.processBatch(HConnectionManager.java:1376)

at org.apache.hadoop.hbase.client.HTable.flushCommits(HTable.java:916)

2) The read side throws a similar exception:

org.apache.hadoop.hbase.client.RetriesExhaustedException: Failed after attempts=10, exceptions:

Mon Oct 14:03:09 CST, org.apache.hadoop.hbase.client.ScannerCallable@3740fb20, org.apache.hadoop.hbase.NotServingRegionException: org.apache.hadoop.hbase.NotServingRegionException: xxxxxx,\x0fp\x8d\xc3\xdb1053223266:\x00\x00v6,1351490475989.bd68113129f07163dc25e78fba17ad6c. is closing

These exceptions recur periodically during the stress test, and each time the HBase cluster is briefly unable to serve requests.

2. Troubleshooting Issues

Comparing the HBase Master log with the times at which the client threw exceptions showed that, at those moments, the cluster was performing region splits and balancing regions across machines. Why, then, were these operations triggered so frequently and periodically, and only during the stress test (when the data volume was several times larger than usual)? Analyzing the table design gives the answer:

1) The table's rowkey contains a time field, so new regions have to be created every day. Because of the large write volume, this further triggers HBase region splits, which are fairly time-consuming (about 10 seconds on average according to the online logs, with a region size of 4 GB) and happen frequently;

2) The region splits also leave regions unevenly distributed, which triggers HBase's automatic region balancing. Regions go offline while they are being migrated, and this takes even longer (about 20 seconds on average according to the online logs).

3. Problem solving

First, consider the client side. The goal is that read and write requests issued while a region is offline can still go through once the cluster recovers. The following measures can be taken:

1) On the write side, records that failed to be written can be added to a client-side cache and handed to a background thread that resubmits them after a delay. Alternatively, call setAutoFlush(false, false) so that failed records are not discarded but stay in the client write buffer; they are committed again the next time the write buffer fills, until the commit succeeds.

2) On the read side, after catching the exception, sleep for a while and then retry.


3) The hbase.client.retries.number and hbase.client.pause configuration options can of course also be tuned to the actual situation.
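The read-side advice (catch the exception, sleep, retry) can be sketched at the application level as follows. This is a minimal illustration under stated assumptions, not HBase client code: the class name RetryWithPause, the method call, and the simulated failing read are all hypothetical, and the maxAttempts and pauseMillis parameters only loosely mirror what hbase.client.retries.number and hbase.client.pause control inside the HBase client.

```java
import java.util.function.Supplier;

/** Application-level retry with a fixed pause between attempts (illustrative sketch). */
final class RetryWithPause {

    /**
     * Calls op until it succeeds or maxAttempts is reached, sleeping pauseMillis
     * between attempts. Rethrows the last failure once attempts are exhausted.
     */
    static <T> T call(Supplier<T> op, int maxAttempts, long pauseMillis) {
        if (maxAttempts < 1) {
            throw new IllegalArgumentException("maxAttempts must be >= 1");
        }
        RuntimeException last = null;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                return op.get();
            } catch (RuntimeException e) {
                // Assumption: the caller wraps failures such as a region being
                // offline in an unchecked exception.
                last = e;
            }
            if (attempt < maxAttempts) {
                try {
                    Thread.sleep(pauseMillis);
                } catch (InterruptedException ie) {
                    Thread.currentThread().interrupt();
                    throw new IllegalStateException("interrupted while waiting to retry", ie);
                }
            }
        }
        throw last;
    }

    public static void main(String[] args) {
        int[] calls = {0};
        // Hypothetical read that fails twice (region offline) and then succeeds.
        String row = call(() -> {
            if (++calls[0] < 3) {
                throw new IllegalStateException("region is not online");
            }
            return "row-data";
        }, 5, 100L);
        System.out.println(row + " after " + calls[0] + " attempts");
    }
}
```

With region splits and migrations taking roughly 10 to 20 seconds in the logs described above, the total wait (attempts multiplied by pause) would need to exceed that window for a retry to have a chance of succeeding.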

Next, on the server side, region split and region balance need to be addressed separately:

1) When the table was created, we already took care to distribute data evenly across region servers and pre-created the same number of regions on each of them. So, to keep the cluster stable in the production environment, HBase's automatic region balancing can be turned off. If needed, a balance operation can still be triggered when daily read and write pressure is low (for example, in the early hours of the morning).

2) Next, how do we solve the problem that new regions keep being created and can never be reused? The root cause is that the rowkey contains a timestamp field, and the timestamp keeps increasing. However, the consumer does need to run sequential scans based on the timestamp field, so the field has to stay. There are two possible approaches:

A common approach is to partition tables by time, for example one table per day, and pre-split each table into regions when it is created, which avoids frequent region splits during actual reads and writes. The drawback is that tables must be created in advance every day; this DDL step can itself go wrong and cause read and write failures, and both the read side and the write side must be adapted to switch to the newly created table.

Alternatively, we can change the table's rowkey structure so that the timestamp field becomes a cyclic timestamp, for example the value of timestamp % ts_mode, where ts_mode must be greater than or equal to the table's TTL so that data is not overwritten. After this change, regions can be reused, which avoids the infinite rise of the ... The change on the read and write sides is also small: both simply use the timestamp field after the modulo in the rowkey. In addition, the read side must handle two cases when scanning: [StartTsMode, EndTsMode] and [EndTsMode, StartTsMode].
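The cyclic-timestamp idea can be illustrated with a small sketch. The class and method names (CyclicTimestamp, toCyclic, scanRanges) are hypothetical, and the handling of the wrapped case is one interpretation of the two scan cases mentioned above: when the start of the range lands after the end on the cycle, the scan is split into two sub-ranges. The sketch also assumes the requested range is shorter than ts_mode and that the cyclic value is encoded in the rowkey so that byte order matches numeric order.

```java
import java.util.ArrayList;
import java.util.List;

/** Maps timestamps onto a fixed cycle and derives scan ranges (illustrative sketch). */
final class CyclicTimestamp {

    /** Maps an absolute timestamp onto the cycle [0, tsMode). */
    static long toCyclic(long timestamp, long tsMode) {
        return Math.floorMod(timestamp, tsMode);
    }

    /**
     * Translates the absolute range [startTs, endTs] into one or two inclusive
     * ranges on the cycle. Requires startTs <= endTs and endTs - startTs < tsMode.
     */
    static List<long[]> scanRanges(long startTs, long endTs, long tsMode) {
        if (startTs > endTs || endTs - startTs >= tsMode) {
            throw new IllegalArgumentException("range must be ordered and shorter than tsMode");
        }
        long start = toCyclic(startTs, tsMode);
        long end = toCyclic(endTs, tsMode);
        List<long[]> ranges = new ArrayList<>();
        if (start <= end) {
            // No wrap-around: a single scan covers the range.
            ranges.add(new long[] {start, end});
        } else {
            // Wrap-around: scan the tail of the cycle, then the head.
            ranges.add(new long[] {start, tsMode - 1});
            ranges.add(new long[] {0L, end});
        }
        return ranges;
    }

    public static void main(String[] args) {
        long tsMode = 7L * 24 * 3600; // assumed cycle of 7 days, in seconds
        for (long[] r : scanRanges(604700L, 604900L, tsMode)) {
            System.out.println("[" + r[0] + ", " + r[1] + "]");
        }
    }
}
```

Because the cyclic value never exceeds ts_mode, the set of rowkey prefixes is bounded, which is what allows pre-created regions to be reused rather than new ones being added as time advances.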

4. Summary

The above is only a general summary of problems I ran into in a real project, offered for reference. Discussion is welcome.

Author: great Circle those things

URL: http://www.cnblogs.com/panfeng412/archive/2012/11/04/hbase-how-to-resolve-not-serving-region-exception.html

This article is an English version of an article which is originally in the Chinese language on aliyun.com and is provided for information purposes only.
