NetEase Video Cloud: An HBase Problem Diagnosis Case: Client Read/Write Blocking Exception

NetEase Video Cloud is a PaaS service built by NetEase on a cloud-based distributed multimedia processing cluster and professional audio/video technology. It provides stable, smooth, low-latency, high-concurrency live video streaming, recording, storage, transcoding, and VOD, so that enterprise users in online education, telemedicine, entertainment shows, online finance, and other industries can build an online audio and video platform with only simple development work. Below, a NetEase Video Cloud technical expert shares an HBase problem diagnosis case.

In the era of big data, HBase, as a highly scalable distributed storage system offering efficient random reads and writes on top of massive data storage, is increasingly favored by a wide variety of businesses. For the business side, it is important not only to pay attention to the read and write performance of the HBase service itself, but also to understand what the HBase client parameters actually mean. This article starts from a concrete HBase client exception, locates its cause, and derives the corresponding client parameter optimizations.

Crime scene

Recently, a business saw a large number of threads block while reading data through the HBase client. The business side preserved the thread stack information from the time of the incident, as shown in Figure 1:

On seeing this problem, we first checked the logs and monitoring for the business table and the RegionServers, and confirmed that no requests had come in for a long period. Beyond that there was no other useful information, and no abnormality had been reported by other users of the cluster. Judging from the symptoms, this exception was triggered only under specific conditions.

Case Analysis Process

1. As shown in Figure 1, all requests are blocked on the global lock <0x0000000782a936f0>. There are two questions to focus on:

Which thread holds this global lock <0x0000000782a936f0>?

What kind of global lock is this (not important for the problem itself; interested readers can refer to step 3)?

2. Which thread holds this lock?

2.1 Searching the jstack log, it is easy to find that the global lock <0x0000000782a936f0> is held by the following thread:

Looking closely, the thread holding the global lock is in the TIMED_WAITING state, so the lock may not be released for a long time, causing all threads that need this global lock to block and wait. The question then becomes: why is this thread in the TIMED_WAITING state?

2.2 Following the hint in the stack, looking at line 115 of RpcRetryingCaller.java in the source code, we can determine that the thread is in the TIMED_WAITING state because it put itself to sleep, as shown below:

RpcRetryingCaller implements the RPC request retry mechanism, so two points can be inferred:

The HBase client's RPC requests failed during that time period because of a network exception, so the client entered the retry logic.

According to HBase's retry (backoff) mechanism, the client sleeps for a period of time between every two retries; that is exactly line 115 of the code, and it is what keeps the thread in the TIMED_WAITING state (see the simplified sketch below).
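The retry loop containing that sleep follows roughly the pattern sketched below. This is a simplified, illustrative sketch, not the actual RpcRetryingCaller source; the RpcCall interface and its method names are invented for the example. The point it shows is that the thread sleeps between attempts while still holding whatever lock it acquired further up the call stack.

    import java.io.IOException;

    public class RetrySketch {

      // One RPC attempt plus the backoff policy between attempts (illustrative interface).
      interface RpcCall<T> {
        T call() throws IOException;           // issue one RPC attempt
        long sleep(long pause, int tries);     // how long to pause before the next attempt
      }

      static <T> T callWithRetries(RpcCall<T> callable, int maxRetries, long pause)
          throws IOException, InterruptedException {
        for (int tries = 0; tries < maxRetries; tries++) {
          try {
            return callable.call();
          } catch (IOException e) {
            if (tries == maxRetries - 1) {
              throw e;                                             // retries exhausted, give up
            }
            long expectedSleep = callable.sleep(pause, tries + 1); // backoff (see part III)
            Thread.sleep(expectedSleep);                           // the TIMED_WAITING seen in the jstack output
          }
        }
        throw new IOException("no retry attempts were made");
      }
    }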

The sleep time is determined by expectedSleep = callable.sleep(pause, tries + 1). According to HBase's algorithm (see part III), the default maximum expectedSleep is 20 s, and the whole retry process lasts about 8 min. This means the global lock would be held for roughly 8 min, which still does not explain blocking that lasted several hours with no requests getting through, unless one of two situations applies:

Configuration problem: the client's hbase.client.pause and hbase.client.retries.number settings need to be checked for abnormal values. For example, if hbase.client.pause were mistakenly set to 10000, blocking for several hours would be possible.

The network problem persisted: if thread 1, holding the global lock, exits after its retries fail, thread 2 wins the lock; if the network is still broken at that point, thread 2 again enters the retry path and exits after another 8 min of failed retries, and so on in a loop, which could also produce several hours of blocking.

We confirmed the configuration with the business side: all parameters were essentially at their defaults, so the first guess does not hold, and the second scenario is the most likely. It was confirmed that many services experienced abnormal network jitter during the incident window (0:00 am to 6:00 am) because of a cloud network upgrade. However, since there is no concrete log evidence, the guess cannot be fully confirmed. Still, analyzing the problem gives a deeper understanding of the HBase retry mechanism and of some client parameter optimization strategies, which is one of the original purposes of this article.

3. Now let's see what this global lock actually is. Looking at the source, the lock is the regionLockObject object shown in the red box:

The source comment explains that this lock prevents multiple threads from concurrently loading the meta region information. The code block guarded by the global lock first looks up the meta region in the cache; if it is not there, it calls the prefetchRegionCache method to look it up remotely and write it into the cache, so once the first thread has successfully loaded the meta region data into the cache, later threads can use it directly.

Normally, the prefetchRegionCache method executes only when the cache misses, and if the network is healthy at that moment, the remote lookup of the meta region finishes quickly and the lock is held only briefly. But once the network jitters for a long time, the lock can end up being held for the whole duration, blocking all other threads.
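The locking pattern described above can be sketched as follows. This is a minimal illustration, not the actual HConnectionManager source; the class name, the getCachedLocation helper, and the stub bodies are assumptions made for the example. It shows why one stalled remote lookup inside the synchronized block is enough to block every other cache-missing thread.

    import java.io.IOException;

    // Minimal sketch of the cache-then-lock pattern (illustrative names and stubs).
    public class MetaCacheSketch {

      private final Object regionLockObject = new Object();

      public String locateRegion(String tableName, String row) throws IOException {
        String cached = getCachedLocation(tableName, row);     // fast path, no lock
        if (cached != null) {
          return cached;
        }
        synchronized (regionLockObject) {                      // the global lock from Figure 1
          cached = getCachedLocation(tableName, row);          // re-check: another thread may have
          if (cached != null) {                                // already filled the cache
            return cached;
          }
          prefetchRegionCache(tableName, row);                 // remote meta lookup; can stall for the
          return getCachedLocation(tableName, row);            // whole retry window on a jittery network
        }
      }

      // Stubs standing in for the real cache lookup and remote prefetch.
      private String getCachedLocation(String tableName, String row) { return null; }
      private void prefetchRegionCache(String tableName, String row) throws IOException { }
    }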

HBase RPC retry mechanism

The analysis above shows that the HBase retry mechanism is the key to this anomaly, so it is worth walking through it once. HBase retries after an RPC failure; the maximum number of retries is configured by the hbase.client.retries.number parameter, whose default in version 0.98 is 31. The client also sleeps for a period of time between every two retries, which is the expectedSleep variable mentioned above. The algorithm that computes it is as follows:

    public static final int[] RETRY_BACKOFF = {1, 2, 3, 5, 10, 20, 40, 100, 100, 100, 100, 200, 200};

    long normalPause = pause * HConstants.RETRY_BACKOFF[ntries];
    long jitter = (long) (normalPause * RANDOM.nextFloat() * 0.01f); // 1% possible jitter
    return normalPause + jitter;

Here RETRY_BACKOFF is a table of retry multipliers; it increases from small to large, meaning the retry interval grows with the number of retries. The pause variable is configured by the hbase.client.pause parameter, whose default in version 0.98 is 100 (ms).

Temporarily ignoring jitter, the small random component, the maximum retry sleep interval by default is expectedSleep = 100 ms * 200 = 20 s. With the default retry count of 31, the pause times between successive retries against the cluster are:

[100,200,300,500,1000,2000,4000,10000,10000,10000,10000,20000,20000,...,20000]

This means the client retries 30 times over roughly 448 s and only then gives up on the connection to the cluster.
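To make the arithmetic concrete, the following small, self-contained program (jitter ignored) recomputes the sleep sequence from the 0.98 backoff table quoted above. With pause = 100 and 31 retries it reproduces the sequence and the roughly 448 s total; changing the values to 20 and 21 gives the numbers used in the next section. The class name and output format are only illustrative.

    public class BackoffCalc {

      // HConstants.RETRY_BACKOFF in HBase 0.98
      static final int[] RETRY_BACKOFF = {1, 2, 3, 5, 10, 20, 40, 100, 100, 100, 100, 200, 200};

      static long pauseTime(long pause, int tries) {
        int index = Math.min(tries, RETRY_BACKOFF.length - 1);  // cap at the last factor
        return pause * RETRY_BACKOFF[index];
      }

      public static void main(String[] args) {
        long pause = 100;   // hbase.client.pause, ms
        int retries = 31;   // hbase.client.retries.number
        long total = 0;
        for (int tries = 0; tries < retries; tries++) {
          long sleep = pauseTime(pause, tries);
          total += sleep;
          System.out.printf("retry %d: sleep %d ms%n", tries + 1, sleep);
        }
        System.out.printf("total sleep across %d retries: %.1f s%n", retries, total / 1000.0);
      }
    }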

Client Parameter Optimization Practice

Obviously, based on parts II and III above, once the network jitters abnormally, in the default worst case a single thread will spend about 8 min in retries, which blocks all other threads on the regionLockObject global lock. To build a more stable, low-latency HBase system, the client parameters need to be tuned in addition to the various server-side parameter adjustments:

1. hbase.client.pause: the default is 100 (ms); it can be reduced to 20.

2. hbase.client.retries.number: the default is 31; it can be reduced to 21 (a sketch of applying both settings on the client side follows).
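The two settings can be changed in the client's hbase-site.xml or programmatically on the client Configuration. Below is a minimal sketch of the programmatic form, assuming the HBase client library is on the classpath; the class name is illustrative, and the concrete values are the suggestions from this section rather than universal recommendations.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;

    public class ClientTuning {
      // Build a client Configuration with the tuned pause and retry count.
      public static Configuration tunedClientConf() {
        Configuration conf = HBaseConfiguration.create();
        conf.setLong("hbase.client.pause", 20);          // base pause between retries, ms
        conf.setInt("hbase.client.retries.number", 21);  // maximum retry attempts
        return conf;
      }
    }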

After this modification, the algorithm above gives the following pause times between successive retries:

[20,40,60,100,200,400,800,2000,2000,2000,2000,4000,4000,...,4000]

The client now retries 20 times within about 1 min and then gives up on the connection to the cluster, which in turn lets other threads acquire the global lock and carry out their own requests.

Summary

Starting from a client exception, this article used stack analysis, source code tracing, and reasonable inference to share, on the one hand, a process for locating such exceptions and, on the other hand, an explanation of the HBase RPC retry mechanism and client parameter optimization. Finally, we hope to join you in embracing the arrival of the big data era and, through our joint efforts, get to know HBase better! For more, please follow the NetEase Video Cloud official website (http://vcloud.163.com/) or the official NetEase Video Cloud account (vcloud163).

