HBase Cluster Full Outage Report (2016-07-13)

Source: Internet
Author: User
Tags: create directory, flush, safe mode

Scenario and Operation Record

At around 10:50 we were notified by the operations staff that all nodes of HBase cluster B were down. The following is a record of all the operations performed to restore the cluster.

Tried to open the HBase web UI at http://192.168.3.146:60010/, but it could not be reached.
Logged in to the HBase shell to check:

> status 'simple'
5 dead servers
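
For reference, the shell's status command accepts several detail levels; 'simple' already names the dead servers, while 'detailed' additionally lists each server's regions (the output format varies between HBase versions):

> status 'summary'
> status 'simple'
> status 'detailed'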

All the RegionServers had indeed gone down, so we quickly brought them all back up on each node:

service hbase-regionserver start
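
If the cluster is not managed by a tool such as Ambari or Cloudera Manager, the restart can be scripted over SSH. A minimal sketch, assuming passwordless SSH from the operations host and hypothetical hostnames slave1 through slave5 (substitute the real RegionServer hosts):

# restart the RegionServer service on every node (hostnames are placeholders)
for host in slave1 slave2 slave3 slave4 slave5; do
  ssh "$host" "service hbase-regionserver start"
done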

We kept checking the cluster with status 'simple' and found that the RegionServers still had not come back up, so we looked at a RegionServer log:

2016-07-13 10:37:13,480 ERROR [regionserver60020] regionserver.HRegionServer: Failed init
org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.hdfs.server.namenode.SafeModeException): Cannot create directory /hbase/WALs/slave4.hadoop,60020,1468377431049. Name node is in safe mode.
Resources are low on NN. Please add or free up more resources then turn off safe mode manually. NOTE: If you turn off safe mode before adding resources, the NN will immediately return to safe mode. Use "hdfs dfsadmin -safemode leave" to turn safe mode off.
    at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.checkNameNodeSafeMode(FSNamesystem.java:1197)
    at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.mkdirsInt(FSNamesystem.java:3568)
    at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.mkdirs(FSNamesystem.java:3544)
    at org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.mkdirs(NameNodeRpcServer.java:739)
    at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.mkdirs(ClientNamenodeProtocolServerSideTranslatorPB.java:558)
    at org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java)

The log above shows that the RegionServers failed to start because HDFS had entered safe mode. Turning to HDFS, we checked its state with hdfs dfsadmin -safemode get, which confirmed it was indeed in safe mode. We then ran hdfs dfsadmin -safemode leave to take it out of safe mode and brought all the RegionServers up again.
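
For reference, safe mode is checked and cleared with dfsadmin; the exact wording of the output varies slightly between Hadoop versions:

hdfs dfsadmin -safemode get     # prints "Safe mode is ON" or "Safe mode is OFF"
hdfs dfsadmin -safemode leave   # turn safe mode off manually

Note that, as the exception text itself warns, leaving safe mode manually only helps once the underlying resource problem has been fixed; otherwise the NameNode will immediately re-enter safe mode.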
Back in the HBase shell, the RegionServers had now come up, but the web UI showed that the regions on them had not been brought back online. Running hbase hbck reported more than 1000 inconsistencies, so we ran the repair command hbase hbck -repair and watched the web UI again: the RegionServers were rapidly bringing their regions online. Once all the regions were up, the inconsistencies disappeared and the HBase cluster was back in service.
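
For reference, on clusters of this vintage (hbck1) it is usually worth running the tool in read-only mode first and only then attempting a repair; HBase 2.x replaces this tool with HBCK2, so the options below do not apply there:

hbase hbck              # report inconsistencies only
hbase hbck -details     # more verbose, per-region report
hbase hbck -repair      # attempt to repair assignments and metadata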

Fault Analysis

With the HBase cluster back in service, the next task was to find out why the whole cluster had failed so badly. The RegionServer logs from the time of the crash read as follows:

2016-07-13 10:29:40,383 WARN [Thread-16] regionserver.HStore: Failed flushing store file, retrying num=0
java.io.IOException: org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.hdfs.server.namenode.SafeModeException): Cannot create file /hbase/data/default/platform_common_user_flow_consumer/e887812d9c1be58014f0733cf6e7b058/.tmp/72ace704aa374894afbabf2118225ebb. Name node is in safe mode.
Resources are low on NN. Please add or free up more resources then turn off safe mode manually. NOTE: If you turn off safe mode before adding resources, the NN will immediately return to safe mode. Use "hdfs dfsadmin -safemode leave" to turn safe mode off.
    at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.checkNameNodeSafeMode(FSNamesystem.java:1197)
    at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.startFileInt(FSNamesystem.java:2225)
    at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.startFile(FSNamesystem.java:2180)

Followed shortly afterwards by:

2016-07-13 10:29:51,603 FATAL [Thread-16] regionserver.HRegionServer: RegionServer abort: loaded coprocessors are: []
2016-07-13 10:29:51,660 INFO [Thread-16] regionserver.HRegionServer: STOPPED: Replay of HLog required. Forcing server shutdown
2016-07-13 10:29:51,663 INFO [RpcServer.handler=45,port=60020] ipc.RpcServer: RpcServer.handler=45,port=60020: exiting
2016-07-13 10:29:51,661 INFO [priority.RpcServer.handler=0,port=60020] ipc.RpcServer: priority.RpcServer.handler=0,port=60020: exiting
2016-07-13 10:29:51,661 INFO [RpcServer.handler=3,port=60020] ipc.RpcServer: RpcServer.handler=3,port=60020: exiting
2016-07-13 10:29:51,663 INFO [RpcServer.handler=49,port=60020] ipc.RpcServer: RpcServer.handler=49,port=60020: exiting
2016-07-13 10:29:51,663 INFO [RpcServer.handler=52,port=60020] ipc.RpcServer: RpcServer.handler=52,port=60020: exiting
2016-07-13 10:29:51,663 INFO [RpcServer.handler=53,port=60020] ipc.RpcServer: RpcServer.handler=53,port=60020: exiting
2016-07-13 10:29:51,663 INFO [RpcServer.handler=55,port=60020] ipc.RpcServer: RpcServer.handler=55,port=60020: exiting
2016-07-13 10:29:51,663 INFO [RpcServer.handler=56,port=60020] ipc.RpcServer: RpcServer.handler=56,port=60020: exiting
2016-07-13 10:29:51,661 INFO [RpcServer.handler=0,port=60020] ipc.RpcServer: RpcServer.handler=0,port=60020: exiting
2016-07-13 10:29:51,663 INFO [RpcServer.handler=59,port=60020] ipc.RpcServer: RpcServer.handler=59,port=60020: exiting
2016-07-13 10:29:51,661 INFO [RpcServer.handler=11,port=60020] ipc.RpcServer: RpcServer.handler=11,port=60020: exiting
2016-07-13 10:29:51,661 INFO [RpcServer.handler=10,port=60020] ipc.RpcServer: RpcServer.handler=10,port=60020: exiting
2016-07-13 10:29:51,663 INFO [RpcServer.handler=64,port=60020] ipc.RpcServer: RpcServer.handler=64,port=60020: exiting
2016-07-13 10:29:51,661 INFO [RpcServer.handler=2,port=60020] ipc.RpcServer: RpcServer.handler=2,port=60020: exiting
2016-07-13 10:29:51,663 INFO [RpcServer.handler=67,port=60020] ipc.RpcServer: RpcServer.handler=67,port=60020: exiting
2016-07-13 10:29:51,664 INFO [RpcServer.handler=69,port=60020] ipc.RpcServer: RpcServer.handler=69,port=60020: exiting
2016-07-13 10:29:51,663 INFO [RpcServer.handler=66,port=60020] ipc.RpcServer: RpcServer.handler=66,port=60020: exiting
2016-07-13 10:29:51,663 INFO [RpcServer.handler=65,port=60020] ipc.RpcServer: RpcServer.handler=65,port=60020: exiting
2016-07-13 10:29:51,663 INFO [RpcServer.handler=63,port=60020] ipc.RpcServer: RpcServer.handler=63,port=60020: exiting
2016-07-13 10:29:51,663 INFO [RpcServer.handler=62,port=60020] ipc.RpcServer: RpcServer.handler=62,port=60020: exiting
2016-07-13 10:29:51,663 INFO [RpcServer.handler=61,port=60020] ipc.RpcServer: RpcServer.handler=61,port=60020: exiting
2016-07-13 10:29:51,664 INFO [RpcServer.handler=79,port=60020] ipc.RpcServer: RpcServer.handler=79,port=60020: exiting

These logs show that the RegionServer failed while flushing a store file to HDFS; because of the exception the HLog (WAL) would have to be replayed, so the server forced itself to shut down. The real culprit is HDFS being in safe mode, and everything points at the NameNode, whose log reads as follows:

2016-07-13 10:29:25,239 WARN org.apache.hadoop.hdfs.server.namenode.NameNodeResourceChecker: Space available on volume 'null' is 0, which is below the configured reserved amount 104857600
2016-07-13 10:29:25,239 WARN org.apache.hadoop.hdfs.server.namenode.FSNamesystem: NameNode low on available disk space. Entering safe mode.
2016-07-13 10:29:25,239 ERROR org.apache.hadoop.hdfs.server.namenode.FSEditLog: Error: flush failed for (journal JournalAndStream(mgr=FileJournalManager(root=/data/namenode/nfsmount/nn), stream=EditLogFileOutputStream(/data/namenode/nfsmount/nn/current/edits_inprogress_0000000000339257361)))
java.io.IOException: Input/output error
    at sun.nio.ch.FileDispatcherImpl.size0(Native Method)
    at sun.nio.ch.FileDispatcherImpl.size(FileDispatcherImpl.java:83)
    at sun.nio.ch.FileChannelImpl.size(FileChannelImpl.java:294)
    at org.apache.hadoop.hdfs.server.namenode.EditLogFileOutputStream.preallocate(EditLogFileOutputStream.java:219)
    at org.apache.hadoop.hdfs.server.namenode.EditLogFileOutputStream.flushAndSync(EditLogFileOutputStream.java:202)
    at org.apache.hadoop.hdfs.server.namenode.EditLogOutputStream.flush(EditLogOutputStream.java:112)
    at org.apache.hadoop.hdfs.server.namenode.EditLogOutputStream.flush(EditLogOutputStream.java:106)
    at org.apache.hadoop.hdfs.server.namenode.JournalSet$JournalSetOutputStream$8.apply(JournalSet.java:498)
    at org.apache.hadoop.hdfs.server.namenode.JournalSet.mapJournalsAndReportErrors(JournalSet.java:358)
    at org.apache.hadoop.hdfs.server.namenode.JournalSet.access(JournalSet.java:57)
    at org.apache.hadoop.hdfs.server.namenode.JournalSet$JournalSetOutputStream.flush(JournalSet.java:494)
    at org.apache.hadoop.hdfs.server.namenode.FSEditLog.logSync(FSEditLog.java:624)
    at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.startFileInt(FSNamesystem.java:2238)
    at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.startFile(FSNamesystem.java:2180)
    at org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.create(NameNodeRpcServer.java:505)
    at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.create(ClientNamenodeProtocolServerSideTranslatorPB.java:354)
    at org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java)
    at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:585)
    at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:1026)
    at org.apache...
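
In other words, the NameNodeResourceChecker found that one of the NameNode's storage volumes had 0 bytes free, below the configured reserve of 104857600 bytes (100 MB), and the edit-log flush to the NFS-mounted directory /data/namenode/nfsmount/nn failed with an I/O error. The NameNode therefore put HDFS into safe mode, which in turn caused every RegionServer flush and WAL write to fail and brought down the entire HBase cluster. As a rough sketch of the follow-up checks (the mount path comes from the log above; the rest are standard HDFS commands, and the reserve threshold is controlled by dfs.namenode.resource.du.reserved):

df -h /data/namenode/nfsmount
hdfs getconf -confKey dfs.namenode.name.dir
hdfs getconf -confKey dfs.namenode.resource.du.reserved    # defaults to 104857600 (100 MB)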
