NetEase Video Cloud: HBase RegionServer Downtime Case Investigation


Today, a NetEase Video Cloud technical expert shares the investigation of an HBase RegionServer downtime case. You are welcome to join the discussion.

It was a quiet evening, and I was happily munching raisins and watching a game when an alert message shattered the calm like a bolt of lightning: a RegionServer (RS) in one of our online clusters had gone down! I shot up out of bed, looked at the monitoring, and was instantly stunned: a single machine was handling a read/write throughput of 50,000 ops/sec! Was the RS crash caused by such a heavy write load? If so, how did it happen? If not, what was the real cause? Questions flashed through my mind one after another. Never mind all that for now: first back up the logs, then bring the RS back up. Then it was time for the usual bug-hunting routine, attacking from three sides at once (logs, monitoring, and source code) to see what had happened!

The Crime Scene

The Ganglia monitoring curve of the affected RegionServer's read/write throughput at the time of the incident shows that, roughly between 19:00 and 21:30, the RS sustained around 30,000 ops/sec, with a peak of about 60,000 ops/sec. We had previously measured the maximum read/write throughput a single RS on this cluster could withstand, and it was basically around 20,000 ops/sec, limited mainly by network bandwidth. With the RS handling that much traffic right before the crash, intuition told me the downtime had to be caused by it!

I then quickly pulled up the logs and found the screen full of responseTooSlow warnings, as shown below:

Obviously, the most likely cause of such warnings is a full GC. Sure enough, after some patient digging, I found many full GC log fragments like the following:

2016-04-14 21:27:13,174 WARN  [JvmPauseMonitor] util.JvmPauseMonitor: Detected pause in JVM or host machine (eg GC): pause of approximately 20542ms

GC pool 'ParNew' had collection(s): count=1 time=0ms

GC pool 'ConcurrentMarkSweep' had collection(s): count=2 time=20898ms

2016-04-14 21:27:13,174 WARN  [regionserver60020.periodicFlusher] util.Sleeper: We slept 20936ms instead of 100ms, this is likely due to a long garbage collecting pause and it's usually bad, see http://hbase.apache.org/book.html#trouble.rs.runtime.zkexpired

As you can see, HBase executed a CMS GC that suspended every thread in the process for about 20 seconds. The Memstore monitoring for the same period also shows how intense the GC activity was, as shown below:
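Incidentally, JvmPauseMonitor-style warnings like the ones above come from a simple sleep-and-compare technique: a thread sleeps for a fixed interval and checks how long it actually slept; any large overshoot means the whole JVM (or host) was paused, typically by GC. The following is a minimal Java sketch of that idea, not Hadoop's actual JvmPauseMonitor implementation; the 500 ms interval and 1 s threshold are illustrative values only.

import java.util.concurrent.TimeUnit;

public class SimplePauseMonitor implements Runnable {
    private static final long SLEEP_MS = 500;            // how long we intend to sleep
    private static final long WARN_THRESHOLD_MS = 1000;  // report pauses longer than this

    @Override
    public void run() {
        while (!Thread.currentThread().isInterrupted()) {
            long start = System.nanoTime();
            try {
                Thread.sleep(SLEEP_MS);
            } catch (InterruptedException e) {
                return;
            }
            long elapsedMs = TimeUnit.NANOSECONDS.toMillis(System.nanoTime() - start);
            long extraMs = elapsedMs - SLEEP_MS;
            // If we overslept far beyond the requested interval, the JVM (or host) was paused.
            if (extraMs > WARN_THRESHOLD_MS) {
                System.err.println("Detected pause in JVM or host machine (eg GC): "
                        + "pause of approximately " + extraMs + "ms");
            }
        }
    }

    public static void main(String[] args) {
        new Thread(new SimplePauseMonitor(), "pause-monitor").start();
    }
}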

The most obvious harm of long GC pauses is that they block the upper-layer business, and the logs offer a few clues about this as well:

java.io.IOException: Connection reset by peer

        at sun.nio.ch.FileDispatcherImpl.read0(Native Method)

        at sun.nio.ch.SocketDispatcher.read(SocketDispatcher.java:39)

        at sun.nio.ch.IOUtil.readIntoNativeBuffer(IOUtil.java:223)

        at sun.nio.ch.IOUtil.read(IOUtil.java:197)

        at sun.nio.ch.SocketChannelImpl.read(SocketChannelImpl.java:384)

        at org.apache.hadoop.hbase.ipc.RpcServer.channelRead(RpcServer.java:2246)

        at org.apache.hadoop.hbase.ipc.RpcServer$Connection.readAndProcess(RpcServer.java:1496)

        ....

2016-04-14 21:32:40,173 WARN  [B.DefaultRpcServer.handler=125,queue=5,port=60020] ipc.RpcServer: RpcServer.respondercallId: 7540 service: ClientService methodName: Multi size: 100.2 K connection: 10.160.247.139:56031: output error

2016-04-14 21:32:40,173 WARN  [B.DefaultRpcServer.handler=125,queue=5,port=60020] ipc.RpcServer: B.DefaultRpcServer.handler=125,queue=5,port=60020: caught a ClosedChannelException, this means that the server was processing a request but the client went away. The error message was: null

These logs tell us that, because of the full GC, the HBase server could not respond to user requests in time; after a certain period the client hit its socket timeout and dropped the connection, and the server then printed the lines above. However, this still does not seem to have much to do with our ultimate goal. Don't forget: what we want to know is why the RS went down!
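As a side note, the client-side timeout that drops these connections comes from the HBase client RPC timeout. The following is a minimal sketch of setting it explicitly; the 60-second value is only illustrative, and the exact configuration key and client API may differ across HBase versions (this sketch assumes a 1.x-style client with ConnectionFactory):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;

public class ClientTimeoutExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        // How long a single RPC may take before the client gives up on the socket.
        // If the RegionServer is stuck in a long full GC, this is the timeout that fires.
        conf.setInt("hbase.rpc.timeout", 60000);
        try (Connection connection = ConnectionFactory.createConnection(conf)) {
            // ... normal Table/Get/Put operations go here ...
        }
    }
}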

Laying the Groundwork

After surveying the crime scene, the only useful clue is that HBase went through very severe and frequent full GCs before the crash. The logs below further show that these full GCs happened as concurrent mode failures: the JVM had not yet finished its concurrent collection when a burst of new data exhausted the heap, at which point the JVM suspends all user threads and performs a long, stop-the-world full GC!

(concurrent mode failure): 45876255K->21800674K(46137344K), 10.0625300 secs] 48792749K->21800674K(49283072K), [CMS Perm : 43274K->43274K(262144K)], 10.2083040 secs] [Times: user=12.02 sys=0.00, real=10.20 secs]

2016-04-14 21:22:43,990 WARN  [JvmPauseMonitor] util.JvmPauseMonitor: Detected pause in JVM or host machine (eg GC): pause of approximately 10055ms

GC pool 'ParNew' had collection(s): count=2 time=244ms

GC pool 'ConcurrentMarkSweep' had collection(s): count=1 time=10062ms
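Reading the concurrent mode failure line above as a quick worked example: the old generation dropped from 45876255K (about 43.7 GB) to 21800674K (about 20.8 GB) of its 46137344K (44 GB) capacity, the whole heap went from 48792749K (about 46.5 GB) to 21800674K of its 49283072K (47 GB) capacity, and the stop-the-world pause lasted roughly 10.2 seconds of real time. In other words, the old generation was close to 100% full when CMS lost the race against new allocations, which is exactly the concurrent mode failure scenario.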

As mentioned above, full GCs severely affect the upper-layer business. Could they also have a major impact on the lower-layer dependency, HDFS? The answer is yes, and that is where most of the blame for the RS crash lies!

Digging further into the logs, we find the following exception in the HBase log:

2016-04-14 21:22:44,006 WARN  [ResponseProcessor for block BP-632656502-10.160.173.93-1448595094942:blk_1073941840_201226] hdfs.DFSClient: DFSOutputStream ResponseProcessor exception for block BP-632656502-10.160.173.93-1448595094942:blk_1073941840_201226 java.io.IOException: Bad response ERROR for block BP-632656502-10.160.173.93-1448595094942:blk_1073941840_201226 from datanode 10.160.173.93:50010

        at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer$ResponseProcessor.run(DFSOutputStream.java:732)

From the log content, HBase (through DFSClient) was writing block "BP-632656502-10.160.173.93-1448595094942:blk_1073941840_201226", and the DataNode returned a failure response for it. To find the specific reason for the failure, we need to look at the DataNode log:

2016-04-14 21:22:43,789 INFO  org.apache.hadoop.hdfs.server.datanode.DataNode: opWriteBlock BP-632656502-10.160.173.93-1448595094942:blk_1073941840_201226 received exception java.net.SocketTimeoutException: 10000 millis timeout while waiting for channel to be ready for read. ch : java.nio.channels.SocketChannel[connected local=/10.160.173.94:50010 remote=/10.160.173.94:30110]

2016-04-14 21:22:43,779 ERROR org.apache.hadoop.hdfs.server.datanode.DataNode: hz-hbase4.photo.163.org:50010:DataXceiver error processing WRITE_BLOCK operation src: /10.160.173.94:30123 dest: /10.160.173.94:50010

java.net.SocketTimeoutException: 10000 millis timeout while waiting for channel to be ready for read. ch : java.nio.channels.SocketChannel[connected local=/10.160.173.94:50010 remote=/10.160.173.94:30123]

Clearly, the DataNode had been waiting for data from the client, but nothing arrived before the 10-second socket timeout fired, so the DataNode closed the connection, which is why the client side saw the "Bad response ERROR ***" exception.

So what does this have to do with the full GC? Very simple: the full GC suspended all of HBase's internal threads, so the write request being sent to the DataNode was suspended along with them. The DataNode waited and waited, and in the end gave up and closed the connection.
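To make the mechanism concrete, here is a small self-contained Java sketch (not HDFS code; all names are invented for the example) showing how a peer that goes silent for 20 seconds trips a 10-second socket read timeout on the other side, which then closes the connection, just as the DataNode did:

import java.io.IOException;
import java.net.ServerSocket;
import java.net.Socket;
import java.net.SocketTimeoutException;

public class ReadTimeoutDemo {
    public static void main(String[] args) throws IOException {
        try (ServerSocket server = new ServerSocket(0)) {
            int port = server.getLocalPort();
            // "Client" that connects but then stays silent, like a GC-paused DFSClient.
            Thread pausedClient = new Thread(() -> {
                try (Socket s = new Socket("127.0.0.1", port)) {
                    Thread.sleep(20_000);           // simulate a 20s stop-the-world pause
                } catch (Exception ignored) {
                }
            });
            pausedClient.start();

            try (Socket conn = server.accept()) {
                conn.setSoTimeout(10_000);          // like the DataNode's 10s read timeout
                conn.getInputStream().read();       // blocks waiting for data that never comes
            } catch (SocketTimeoutException e) {
                // The "server" gives up and closes the connection, just as the DataNode did.
                System.out.println("Timed out waiting for the paused client: " + e.getMessage());
            }
            pausedClient.interrupt();
        }
    }
}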

Looking at the Hadoop client source, we can see that when DFSClient hits this kind of exception, it sets a flag, errorIndex, to mark which DataNode in the write pipeline went bad. The relevant logic lives in the DFSOutputStream class:
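The original post showed the source snippet as an image, which is not reproduced here. As a stand-in, the following self-contained Java sketch models the behavior described above (invented class and field names, not the actual DFSOutputStream code): when an ack from the write pipeline reports a non-success status, the client records which DataNode failed in errorIndex and raises the "Bad response" exception seen earlier:

import java.io.IOException;

public class AckChecker {
    /** Index of the bad datanode in the pipeline; -1 means no error so far. */
    int errorIndex = -1;

    /** Scan the per-datanode ack statuses and remember the first failure. */
    void checkAck(boolean[] replySuccess, String[] datanodes, String block) throws IOException {
        for (int i = replySuccess.length - 1; i >= 0; i--) {
            if (!replySuccess[i]) {
                errorIndex = i;   // mark the bad datanode for later pipeline recovery
                throw new IOException("Bad response ERROR for block " + block
                        + " from datanode " + datanodes[i]);
            }
        }
    }

    public static void main(String[] args) throws IOException {
        new AckChecker().checkAck(new boolean[] {true, false},
                new String[] {"10.160.173.94:50010", "10.160.173.93:50010"},
                "blk_1073941840_201226");
    }
}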

The Ending of the Case

All of that foreshadowing leads to a single result: the Hadoop client set its errorIndex flag to mark a bad DataNode. So what does that have to do with the eventual RS crash? Keep reading. The HBase log recorded fairly detailed exception information around the moment the RS went down; using that exception as the starting point of the analysis, we can pick out at least three useful clues, as shown below:

Clue one: the most direct cause of the RS crash was a failure while closing the LogWriter (in order to roll a new HLog).

Clue two: the LogWriter close failed because an IOException was thrown while "writing trailer".

Clue three: the cause of that IOException was "All datanodes *** are bad".

So far, the most reliable information we have is that the RS went down because of "All datanodes *** are bad". Taken literally, it means the DataNode had somehow gone bad, so we hurried over to the DataNode log to see whether there were any related exceptions or errors in that time period.

Unfortunately, the DataNode log showed no exceptions or errors at that point in time; every service appeared to be running normally, as the following entries show:

2016-04-14 21:32:38,893 INFO  org.apache.hadoop.hdfs.server.datanode.DataNode.clienttrace: src: 127.0.0.1, dest: 127.0.0.1, op: REQUEST_SHORT_CIRCUIT_FDS, blockid: 1073941669, srvID: DS-22834907-10.160.173.94-50010-1448595406972, success: true

2016-04-14 21:32:38,894 INFO  org.apache.hadoop.hdfs.server.datanode.DataNode.clienttrace: src: 127.0.0.1, dest: 127.0.0.1, op: REQUEST_SHORT_CIRCUIT_FDS, blockid: 1073941669, srvID: DS-22834907-10.160.173.94-50010-1448595406972, success: true

...

At this point you may be a little confused: the HBase log clearly says this DataNode has gone bad, yet the DataNode's own log says all services are normal. Time to turn to the source code and find where HBase (really, the HDFS client it embeds) prints "All datanodes *** are bad". Reading the source shows that the ultimate culprit is the errorIndex flag mentioned above, as shown below:
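The source shown in the original screenshot is likewise modeled below as a self-contained toy (invented names, not the actual DFSOutputStream code): once errorIndex marks a DataNode as bad, pipeline recovery tries to drop that node and continue; if no usable node would remain, the client gives up with the "All datanodes ... are bad" IOException, and that is exactly the exception the RegionServer hit while closing its HLog writer:

import java.io.IOException;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class PipelineRecovery {
    private final List<String> nodes;   // datanodes currently in the write pipeline
    private int errorIndex = -1;        // set by the ack checker when a datanode misbehaves

    PipelineRecovery(List<String> nodes) {
        this.nodes = new ArrayList<>(nodes);
    }

    void markBadNode(int index) {
        this.errorIndex = index;
    }

    /** Models the recovery step: drop the bad node, or abort if none would remain. */
    void recover() throws IOException {
        if (errorIndex < 0) {
            return;                     // nothing to recover from
        }
        if (nodes.size() <= 1) {
            throw new IOException("All datanodes " + nodes + " are bad. Aborting...");
        }
        nodes.remove(errorIndex);       // continue writing with the remaining replicas
        errorIndex = -1;
    }

    public static void main(String[] args) throws IOException {
        PipelineRecovery r = new PipelineRecovery(Arrays.asList("10.160.173.93:50010"));
        r.markBadNode(0);
        r.recover();                    // throws "All datanodes ... are bad. Aborting..."
    }
}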

At last, light at the end of the tunnel! All of the groundwork above boils down to this: the full GC ultimately caused DFSClient to set errorIndex, marking the DataNode as bad, and here the two threads of the investigation finally meet. I was nearly in tears!

Piecing the Case Together

Even after working through it, the whole chain left me a little dizzy; too many pieces are involved, so it is worth tying them together. In short: a long concurrent-mode-failure full GC suspended all RegionServer threads; the stalled HDFS write pipeline made the DataNode time out and close the connection; DFSClient then marked that DataNode as bad via errorIndex; when the RegionServer later tried to close its HLog writer (writing the trailer), pipeline recovery found no usable DataNode left and threw "All datanodes *** are bad", and the RegionServer aborted.

Analyzing this case was, on the one hand, another good exercise in locating problems through monitoring, logs, and source code together. On the other hand, it highlights a few points that deserve special attention when operating HBase:

1. Full GC not only severely affects the upper-layer business by blocking read and write requests; it can also cause all kinds of exceptions in the data exchange with HDFS and can even bring the RegionServer down.

2. The full GCs above were largely caused by concurrent mode failure. This kind of failure should be rare and can usually be avoided by lowering the JVM parameter -XX:CMSInitiatingOccupancyFraction. This parameter controls when CMS garbage collection starts: if it is set to 60, the JVM begins a CMS collection once the old generation is 60% occupied, so lowering the value makes collection start earlier (see the example settings after this list).

3. Traffic must be strictly limited at the business level. On the one hand this requires communicating with the business side so that they throttle on their end; on the other hand it is worth exploring new resource-isolation schemes for HBase workloads.
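As an illustration of point 2, the line below sketches the kind of GC options that could be added to HBASE_OPTS in hbase-env.sh; the 60% occupancy value is an example to be tuned per cluster, not a setting taken from the incident cluster:

export HBASE_OPTS="$HBASE_OPTS -XX:+UseParNewGC -XX:+UseConcMarkSweepGC \
    -XX:CMSInitiatingOccupancyFraction=60 -XX:+UseCMSInitiatingOccupancyOnly"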

For more technical content, follow the NetEase Video Cloud official website (http://vcloud.163.com/) or our official account (vcloud163) for technical exchanges!
