HBase Cluster Full Outage Report (2016-07-13)

Source: Internet
Author: User
Tags: create directory, flush, safe mode

Scenario and Operation Record

At around 10:50 we were notified by the operations staff that all nodes of HBase cluster B were down. The following is a record of all the operations performed to restore the cluster.

Tried to open the HBase web UI at http://192.168.3.146:60010/, but it could not be reached.
Logged in to the HBase shell to check:

> status 'simple'
5 dead servers
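
For reference, the shell's status command accepts several detail levels; 'simple' already names the dead servers, while 'detailed' additionally lists each server's regions (the output format varies between HBase versions):

> status 'summary'
> status 'simple'
> status 'detailed'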

All the RegionServers had indeed gone down, so we quickly brought them all back up on each node:

service hbase-regionserver start
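
If the cluster is not managed by a tool such as Ambari or Cloudera Manager, the restart can be scripted over SSH. A minimal sketch, assuming passwordless SSH from the operations host and hypothetical hostnames slave1 through slave5 (substitute the real RegionServer hosts):

# restart the RegionServer service on every node (hostnames are placeholders)
for host in slave1 slave2 slave3 slave4 slave5; do
  ssh "$host" "service hbase-regionserver start"
done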

We kept checking the cluster with status 'simple' and found that the RegionServers still had not come back up, so we looked at a RegionServer log:

2016-07-13 10:37:13,480 ERROR [regionserver60020] regionserver.HRegionServer: Failed init
org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.hdfs.server.namenode.SafeModeException): Cannot create directory /hbase/WALs/slave4.hadoop,60020,1468377431049. Name node is in safe mode.
Resources are low on NN. Please add or free up more resources then turn off safe mode manually. NOTE: If you turn off safe mode before adding resources, the NN will immediately return to safe mode. Use "hdfs dfsadmin -safemode leave" to turn safe mode off.
    at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.checkNameNodeSafeMode(FSNamesystem.java:1197)
    at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.mkdirsInt(FSNamesystem.java:3568)
    at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.mkdirs(FSNamesystem.java:3544)
    at org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.mkdirs(NameNodeRpcServer.java:739)
    at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.mkdirs(ClientNamenodeProtocolServerSideTranslatorPB.java:558)
    at org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java)

The log above shows that the RegionServers failed to start because HDFS had entered safe mode. Turning to HDFS, we checked its state with hdfs dfsadmin -safemode get, which confirmed it was indeed in safe mode. We then ran hdfs dfsadmin -safemode leave to take it out of safe mode and brought all the RegionServers up again.
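
For reference, safe mode is checked and cleared with dfsadmin; the exact wording of the output varies slightly between Hadoop versions:

hdfs dfsadmin -safemode get     # prints "Safe mode is ON" or "Safe mode is OFF"
hdfs dfsadmin -safemode leave   # turn safe mode off manually

Note that, as the exception text itself warns, leaving safe mode manually only helps once the underlying resource problem has been fixed; otherwise the NameNode will immediately re-enter safe mode.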
Back in the HBase shell, the RegionServers had now come up, but the web UI showed that the regions on them had not been brought back online. Running hbase hbck reported more than 1000 inconsistencies, so we ran the repair command hbase hbck -repair and watched the web UI again: the RegionServers were rapidly bringing their regions online. Once all the regions were up, the inconsistencies disappeared and the HBase cluster was back in service.
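
For reference, on clusters of this vintage (hbck1) it is usually worth running the tool in read-only mode first and only then attempting a repair; HBase 2.x replaces this tool with HBCK2, so the options below do not apply there:

hbase hbck              # report inconsistencies only
hbase hbck -details     # more verbose, per-region report
hbase hbck -repair      # attempt to repair assignments and metadata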

Fault Analysis

With the HBase cluster back in service, the next task was to find out why the whole cluster had failed so badly. The RegionServer logs from the time of the crash read as follows:

2016-07-13 10:29:40,383 WARN [Thread-16] regionserver.HStore: Failed flushing store file, retrying num=0
java.io.IOException: org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.hdfs.server.namenode.SafeModeException): Cannot create file /hbase/data/default/platform_common_user_flow_consumer/e887812d9c1be58014f0733cf6e7b058/.tmp/72ace704aa374894afbabf2118225ebb. Name node is in safe mode.
Resources are low on NN. Please add or free up more resources then turn off safe mode manually. NOTE: If you turn off safe mode before adding resources, the NN will immediately return to safe mode. Use "hdfs dfsadmin -safemode leave" to turn safe mode off.
    at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.checkNameNodeSafeMode(FSNamesystem.java:1197)
    at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.startFileInt(FSNamesystem.java:2225)
    at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.startFile(FSNamesystem.java:2180)

Followed shortly afterwards by:

2016-07-13 10:29:51,603 FATAL [Thread-16] regionserver.HRegionServer: RegionServer abort: loaded coprocessors are: []
2016-07-13 10:29:51,660 INFO [Thread-16] regionserver.HRegionServer: STOPPED: Replay of HLog required. Forcing server shutdown
2016-07-13 10:29:51,663 INFO [RpcServer.handler=45,port=60020] ipc.RpcServer: RpcServer.handler=45,port=60020: exiting
2016-07-13 10:29:51,661 INFO [priority.RpcServer.handler=0,port=60020] ipc.RpcServer: priority.RpcServer.handler=0,port=60020: exiting
2016-07-13 10:29:51,661 INFO [RpcServer.handler=3,port=60020] ipc.RpcServer: RpcServer.handler=3,port=60020: exiting
2016-07-13 10:29:51,663 INFO [RpcServer.handler=49,port=60020] ipc.RpcServer: RpcServer.handler=49,port=60020: exiting
2016-07-13 10:29:51,663 INFO [RpcServer.handler=52,port=60020] ipc.RpcServer: RpcServer.handler=52,port=60020: exiting
2016-07-13 10:29:51,663 INFO [RpcServer.handler=53,port=60020] ipc.RpcServer: RpcServer.handler=53,port=60020: exiting
2016-07-13 10:29:51,663 INFO [RpcServer.handler=55,port=60020] ipc.RpcServer: RpcServer.handler=55,port=60020: exiting
2016-07-13 10:29:51,663 INFO [RpcServer.handler=56,port=60020] ipc.RpcServer: RpcServer.handler=56,port=60020: exiting
2016-07-13 10:29:51,661 INFO [RpcServer.handler=0,port=60020] ipc.RpcServer: RpcServer.handler=0,port=60020: exiting
2016-07-13 10:29:51,663 INFO [RpcServer.handler=59,port=60020] ipc.RpcServer: RpcServer.handler=59,port=60020: exiting
2016-07-13 10:29:51,661 INFO [RpcServer.handler=11,port=60020] ipc.RpcServer: RpcServer.handler=11,port=60020: exiting
2016-07-13 10:29:51,661 INFO [RpcServer.handler=10,port=60020] ipc.RpcServer: RpcServer.handler=10,port=60020: exiting
2016-07-13 10:29:51,663 INFO [RpcServer.handler=64,port=60020] ipc.RpcServer: RpcServer.handler=64,port=60020: exiting
2016-07-13 10:29:51,661 INFO [RpcServer.handler=2,port=60020] ipc.RpcServer: RpcServer.handler=2,port=60020: exiting
2016-07-13 10:29:51,663 INFO [RpcServer.handler=67,port=60020] ipc.RpcServer: RpcServer.handler=67,port=60020: exiting
2016-07-13 10:29:51,664 INFO [RpcServer.handler=69,port=60020] ipc.RpcServer: RpcServer.handler=69,port=60020: exiting
2016-07-13 10:29:51,663 INFO [RpcServer.handler=66,port=60020] ipc.RpcServer: RpcServer.handler=66,port=60020: exiting
2016-07-13 10:29:51,663 INFO [RpcServer.handler=65,port=60020] ipc.RpcServer: RpcServer.handler=65,port=60020: exiting
2016-07-13 10:29:51,663 INFO [RpcServer.handler=63,port=60020] ipc.RpcServer: RpcServer.handler=63,port=60020: exiting
2016-07-13 10:29:51,663 INFO [RpcServer.handler=62,port=60020] ipc.RpcServer: RpcServer.handler=62,port=60020: exiting
2016-07-13 10:29:51,663 INFO [RpcServer.handler=61,port=60020] ipc.RpcServer: RpcServer.handler=61,port=60020: exiting
2016-07-13 10:29:51,664 INFO [RpcServer.handler=79,port=60020] ipc.RpcServer: RpcServer.handler=79,port=60020: exiting

These logs show that the RegionServer failed while flushing a store file to HDFS; because of the exception the HLog (WAL) would have to be replayed, so the server forced itself to shut down. The real culprit is HDFS being in safe mode, and everything points at the NameNode, whose log reads as follows:

2016-07-13 10:29:25,239 WARN org.apache.hadoop.hdfs.server.namenode.NameNodeResourceChecker: Space available on volume 'null' is 0, which is below the configured reserved amount 104857600
2016-07-13 10:29:25,239 WARN org.apache.hadoop.hdfs.server.namenode.FSNamesystem: NameNode low on available disk space. Entering safe mode.
2016-07-13 10:29:25,239 ERROR org.apache.hadoop.hdfs.server.namenode.FSEditLog: Error: flush failed for (journal JournalAndStream(mgr=FileJournalManager(root=/data/namenode/nfsmount/nn), stream=EditLogFileOutputStream(/data/namenode/nfsmount/nn/current/edits_inprogress_0000000000339257361)))
java.io.IOException: Input/output error
    at sun.nio.ch.FileDispatcherImpl.size0(Native Method)
    at sun.nio.ch.FileDispatcherImpl.size(FileDispatcherImpl.java:83)
    at sun.nio.ch.FileChannelImpl.size(FileChannelImpl.java:294)
    at org.apache.hadoop.hdfs.server.namenode.EditLogFileOutputStream.preallocate(EditLogFileOutputStream.java:219)
    at org.apache.hadoop.hdfs.server.namenode.EditLogFileOutputStream.flushAndSync(EditLogFileOutputStream.java:202)
    at org.apache.hadoop.hdfs.server.namenode.EditLogOutputStream.flush(EditLogOutputStream.java:112)
    at org.apache.hadoop.hdfs.server.namenode.EditLogOutputStream.flush(EditLogOutputStream.java:106)
    at org.apache.hadoop.hdfs.server.namenode.JournalSet$JournalSetOutputStream$8.apply(JournalSet.java:498)
    at org.apache.hadoop.hdfs.server.namenode.JournalSet.mapJournalsAndReportErrors(JournalSet.java:358)
    at org.apache.hadoop.hdfs.server.namenode.JournalSet.access(JournalSet.java:57)
    at org.apache.hadoop.hdfs.server.namenode.JournalSet$JournalSetOutputStream.flush(JournalSet.java:494)
    at org.apache.hadoop.hdfs.server.namenode.FSEditLog.logSync(FSEditLog.java:624)
    at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.startFileInt(FSNamesystem.java:2238)
    at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.startFile(FSNamesystem.java:2180)
    at org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.create(NameNodeRpcServer.java:505)
    at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.create(ClientNamenodeProtocolServerSideTranslatorPB.java:354)
    at org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java)
    at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:585)
    at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:1026)
    at org.apache...
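
In other words, the NameNodeResourceChecker found that one of the NameNode's storage volumes had 0 bytes free, below the configured reserve of 104857600 bytes (100 MB), and the edit-log flush to the NFS-mounted directory /data/namenode/nfsmount/nn failed with an I/O error. The NameNode therefore put HDFS into safe mode, which in turn caused every RegionServer flush and WAL write to fail and brought down the entire HBase cluster. As a rough sketch of the follow-up checks (the mount path comes from the log above; the rest are standard HDFS commands, and the reserve threshold is controlled by dfs.namenode.resource.du.reserved):

df -h /data/namenode/nfsmount
hdfs getconf -confKey dfs.namenode.name.dir
hdfs getconf -confKey dfs.namenode.resource.du.reserved    # defaults to 104857600 (100 MB)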
