[Translated from MOS] Why does GI Rebootless Fencing fail?

Source: Internet
Author: User

Why does GI Rebootless Fencing fail?

Reference:
Why Grid Infrastructure Rebootless Node Fencing Fails (Doc ID 1502282.1)

Applicable:
Oracle Server - Enterprise Edition - Version 11.2.0.2 and later
Information in this document applies to any platform.

Purpose:
Rebootless Fencing is a feature introduced in 11.2.0.2. Before 11.2.0.2, an eviction rebooted the evicted node. With Rebootless Fencing, Grid Infrastructure (GI) instead tries to stop the GI stack gracefully on the evicted node, so that a node reboot is avoided.

If Rebootless Fencing fails, the evicted node is rebooted. This article lists the common causes of Rebootless Fencing failure.

Details:
1. Resources fail to stop.

If one or more resources fail to stop, rebootless fencing fails and the node is rebooted.

In this example, rebootless fencing fails on node 2 (racnode2) after a network split brain, and node 2 is rebooted as expected:
• <GI_HOME>/log/<node>/alert<node>.log from the evicted node:

..
2012-09-11 12:04:34.363[cssd(18834)]CRS-1610:Network communication with node racnode1 (1) missing for 90% of timeout interval.  Removal of this node from cluster in 2.020 seconds
2012-09-11 12:04:36.379[cssd(18834)]CRS-1609:This node is unable to communicate with other nodes in the cluster and is going down to preserve cluster integrity; details at (:CSSNM00008:) in /ocw/grid/log/racnode2/cssd/ocssd.log.
2012-09-11 12:04:36.379[cssd(18834)]CRS-1656:The CSS daemon is terminating due to a fatal error; Details at (:CSSSC00012:) in /ocw/grid/log/racnode2/cssd/ocssd.log
2012-09-11 12:04:36.399[cssd(18834)]CRS-1652:Starting clean up of CRSD resources.
2012-09-11 12:04:36.586[crsd(26115)]CRS-5833:Cleaning resource 'zDRMON.sh.racnode2 1 1' failed as part of reboot-less node fencing
2012-09-11 12:04:36.588[cssd(18834)]CRS-1653:The clean up of the CRSD resources failed.    ##>> user resource fails to be cleaned
2012-09-11 12:04:37.042[ohasd(16821)]CRS-2765:Resource 'ora.evmd' has failed on server 'racnode2'.
2012-09-11 12:04:37.052[/ocw/grid/bin/scriptagent.bin(27696)]CRS-5822:Agent '/ocw/grid/bin/scriptagent_oracle' disconnected from server. Details at (:CRSAGF00117:) {0:4:10} in /ocw/grid/log/racnode2/agent/crsd/scriptagent_oracle/scriptagent_oracle.log.
2012-09-11 12:04:37.062[ohasd(16821)]CRS-2765:Resource 'ora.crsd' has failed on server 'racnode2'.    ##>> node rebooted after this message; in some cases, this message won't be there
2012-09-11 12:10:47.356[ohasd(16677)]CRS-2112:The OLR service started on node racnode2.
2012-09-11 12:10:47.521[ohasd(16677)]CRS-1301:Oracle High Availability Service started on node racnode2.
2012-09-11 12:10:47.539[ohasd(16677)]CRS-8011:reboot advisory message from host: racnode2, component: cssagent, with time stamp: L-2012-09-11-12:04:37.140    ##>> reboot advisory shows both cssdagent and cssdmonitor took the action to reboot
[ohasd(16677)]CRS-8013:reboot advisory message text: clsnomon_status: need to reboot, unexpected failure 8 received from CSS
2012-09-11 12:10:47.594[ohasd(16677)]CRS-8011:reboot advisory message from host: racnode2, component: cssmonit, with time stamp: L-2012-09-11-12:04:37.139
[ohasd(16677)]CRS-8013:reboot advisory message text: clsnomon_status: need to reboot, unexpected failure 8 received from CSS
2012-09-11 12:10:47.605[ohasd(16677)]CRS-8017:location: /etc/oracle/lastgasp has 2 reboot advisory log files, 2 were announced and 0 errors occurred
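The message sequence in this alert log (CRS-1652, then CRS-5833, then CRS-1653, followed after the restart by CRS-8011 reboot advisories) can be checked for mechanically. The following is only a minimal sketch in Python, assuming the alert log entries have the `timestamp[process(pid)]CRS-nnnn:` shape shown in the excerpt; the function name `summarize_alert_log` and the returned keys are hypothetical and not part of any Oracle tool.

```python
import re

# One alert-log entry: timestamp, [process(pid)], then the CRS message code.
# The shape is taken from the excerpt above; other versions may differ.
ENTRY = re.compile(
    r"(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}\.\d{3})\s*"
    r"\[([^\]()]+)\((\d+)\)\]"
    r"CRS-(\d+):"
)
FAILED_RESOURCE = re.compile(r"CRS-5833:Cleaning resource '([^']+)' failed")
ADVISORY_COMPONENT = re.compile(
    r"CRS-8011:reboot advisory message from host: [^,]+, component: (\w+)"
)

# A few entries copied from the excerpt above.
SAMPLE_ALERT_LOG = (
    "2012-09-11 12:04:36.399[cssd(18834)]CRS-1652:Starting clean up of CRSD resources.\n"
    "2012-09-11 12:04:36.586[crsd(26115)]CRS-5833:Cleaning resource "
    "'zDRMON.sh.racnode2 1 1' failed as part of reboot-less node fencing\n"
    "2012-09-11 12:04:36.588[cssd(18834)]CRS-1653:The clean up of the CRSD resources failed.\n"
    "2012-09-11 12:10:47.539[ohasd(16677)]CRS-8011:reboot advisory message from host: "
    "racnode2, component: cssagent, with time stamp: L-2012-09-11-12:04:37.140\n"
    "2012-09-11 12:10:47.594[ohasd(16677)]CRS-8011:reboot advisory message from host: "
    "racnode2, component: cssmonit, with time stamp: L-2012-09-11-12:04:37.139\n"
)


def summarize_alert_log(text):
    """Report whether the CRSD resource cleanup started, whether it failed,
    which resources could not be cleaned, and which components left a
    reboot advisory."""
    codes = [m.group(4) for m in ENTRY.finditer(text)]
    return {
        "cleanup_started": "1652" in codes,
        "cleanup_failed": "1653" in codes,
        "failed_resources": FAILED_RESOURCE.findall(text),
        "advisory_components": ADVISORY_COMPONENT.findall(text),
    }


if __name__ == "__main__":
    print(summarize_alert_log(SAMPLE_ALERT_LOG))
```

The patterns search the whole text rather than individual lines, so the sketch does not depend on how the entries are wrapped.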


When a resource fails to stop, cssdagent, cssdmonitor, or both try to reboot the node. Sample logs follow.

• <GI_HOME>/log/<node>/agent/ohasd/oracssdmonitor_root/oracssdmonitor_root.log

12:04:36.400: [USRTHRD] [1095805248] clsnpollmsg_main: got posted
12:04:36.400: [USRTHRD] [1095805248] clsnpollmsg_main: shutdown initiated by CSS, requested to sync
12:04:36.400: [USRTHRD] [1095805248] clsnwork_queue: posting worker thread
12:04:36.400: [USRTHRD] [1095805248] clsnpollmsg_main: exiting check loop
12:04:36.400: [USRTHRD] [1095805248] clsnpollmsg_main: got HB signal
12:04:36.400: [USRTHRD] [1097382208] clsnwork_process_work: calling sync
12:04:36.413: [USRTHRD] [1097382208] clsnwork_process_work: sync completed
12:04:37.035: [CSSCLNT] [1095805248] clsssRecvMsg: got a disconnect from the server while waiting for message type 22
12:04:37.035: [CSSCLNT] [1098959168] clsssRecvMsg: got a disconnect from the server while waiting for message type 27
12:04:37.035: [USRTHRD] [1095805248] clsnwork_queue: posting worker thread
12:04:37.035: [USRTHRD] [1095805248] clsnpollmsg_main: exiting check loop
12:04:37.035: [GIPCXCPT] [1098959168] gipcInternalSend: connection not valid for send operation endp 0x8e3e60 [000000000000000001b7] {gipcEndpoint: localAddr 'clsc://(ADDRESS=(PROTOCOL=ipc)(KEY=)(GIPCID=4255a05b-7e7139a5-18801))', remoteAddr 'clsc://(ADDRESS=(PROTOCOL=ipc)(KEY=OCSSD_LL_racnode2_)(GIPCID=Beijing))', numPend 0, numReady 0, numDone 0, numDead 0, numTransfer 0, objFlags 0x0, pidPeer 18834, flags 0x00001e, usrFlags 0x20010}, ret gipcretConnectionLost (12)
12:04:37.035: [USRTHRD] [1097382208] clsnwork_process_work: calling sync
12:04:37.035: [CSSCLNT] [1077418304] clsssRecvMsg: got a disconnect from the server while waiting for message type 1
12:04:37.036: [CSSCLNT] [1077418304] clssgsGroupGetStatus: communications failed (0/3/-1)
12:04:37.036: [CSSCLNT] [1077418304] clssgsGroupGetStatus: returning 8
12:04:37.036: [USRTHRD] [1077418304] clsnomon_status: Communications failure with CSS detected. Waiting for sync to complete...
12:04:37.036: [GIPCXCPT] [0, 1098959168] gipcSendSyncF [clsssServerRPC: clsss.c: 6272]: EXCEPTION [ret gipcretConnectionLost (12)] failed to send on endp 0x8e3e60 [00000000000001b7] {gipcEndpoint: localAddr 'clsc://(ADDRESS=(PROTOCOL=ipc)(KEY=)(GIPCID=4255a05b-7e7139a5-18801))', remoteAddr 'clsc://(ADDRESS=(PROTOCOL=ipc)(KEY=OCSSD_LL_racnode2_)(GIPCID=Beijing))', numPend 0, numReady 0, numDone 0, numDead 0, numTransfer 0, objFlags 0x0, pidPeer 18834, flags 0x00001e, usrFlags 0x20010}, addr 0000000000000000, buf 0x4180bd80, len 80, flags 0x8000000
12:04:37.036: [CSSCLNT] [1098959168] clsssServerRPC: send failed with err 12, msg type 7
12:04:37.036: [CSSCLNT] [1098959168] clsssCommonClientExit: RPC failure, rc 3
12:04:37.139: [USRTHRD] [1097382208] clsnwork_process_work: sync completed
12:04:37.139: [USRTHRD] [1097382208] clsnSyncComplete: posting omon

• <GI_HOME>/log/<node>/agent/ohasd/oracssdagent_root/oracssdagent_root.log

2012-09-11 12:04:36.400: [ USRTHRD][1095805248] clsnpollmsg_main: got posted
2012-09-11 12:04:36.400: [ USRTHRD][1095805248] clsnpollmsg_main: shutdown initiated by CSS, requested to sync
2012-09-11 12:04:36.400: [ USRTHRD][1095805248] clsnwork_queue: posting worker thread
2012-09-11 12:04:36.400: [ USRTHRD][1095805248] clsnpollmsg_main: exiting check loop
2012-09-11 12:04:36.400: [ USRTHRD][1095805248] clsnpollmsg_main: got HB signal
2012-09-11 12:04:36.400: [ USRTHRD][1097382208] clsnwork_process_work: calling sync
2012-09-11 12:04:36.413: [ USRTHRD][1097382208] clsnwork_process_work: sync completed
2012-09-11 12:04:37.035: [ CSSCLNT][1098959168]clsssRecvMsg: got a disconnect from the server while waiting for message type 27
2012-09-11 12:04:37.035: [ CSSCLNT][1095805248]clsssRecvMsg: got a disconnect from the server while waiting for message type 22
2012-09-11 12:04:37.035: [GIPCXCPT][1098959168]gipcInternalSend: connection not valid for send operation endp 0x2aaab4014900 [00000000000001c0] { gipcEndpoint : localAddr 'clsc://(ADDRESS=(PROTOCOL=ipc)(KEY=)(GIPCID=561e3f6b-a0a3602e-18817))', remoteAddr 'clsc://(ADDRESS=(PROTOCOL=ipc)(KEY=OCSSD_LL_racnode2_)(GIPCID=a0a3602e-561e3f6b-18834))', numPend 0, numReady 0, numDone 0, numDead 0, numTransfer 0, objFlags 0x0, pidPeer 18834, flags 0x3861e, usrFlags 0x20010 }, ret gipcretConnectionLost (12)
2012-09-11 12:04:37.035: [ USRTHRD][1095805248] clsnwork_queue: posting worker thread
2012-09-11 12:04:37.035: [ USRTHRD][1095805248] clsnpollmsg_main: exiting check loop
2012-09-11 12:04:37.035: [GIPCXCPT][1098959168]gipcSendSyncF [clsssServerRPC : clsss.c : 6272]: EXCEPTION[ ret gipcretConnectionLost (12) ]  failed to send on endp 0x2aaab4014900 [00000000000001c0] { gipcEndpoint : localAddr 'clsc://(ADDRESS=(PROTOCOL=ipc)(KEY=)(GIPCID=561e3f6b-a0a3602e-18817))', remoteAddr 'clsc://(ADDRESS=(PROTOCOL=ipc)(KEY=OCSSD_LL_racnode2_)(GIPCID=a0a3602e-561e3f6b-18834))', numPend 0, numReady 0, numDone 0, numDead 0, numTransfer 0, objFlags 0x0, pidPeer 18834, flags 0x3861e, usrFlags 0x20010 }, addr 0000000000000000, buf 0x4180bd80, len 80, flags 0x8000000
2012-09-11 12:04:37.035: [ CSSCLNT][1098959168]clsssServerRPC: send failed with err 12, msg type 7
2012-09-11 12:04:37.035: [ CSSCLNT][1098959168]clsssCommonClientExit: RPC failure, rc 3
2012-09-11 12:04:37.036: [ CSSCLNT][1077418304]clsssRecvMsg: got a disconnect from the server while waiting for message type 1
2012-09-11 12:04:37.036: [ CSSCLNT][1077418304]clssgsGroupGetStatus:  communications failed (0/3/-1)
2012-09-11 12:04:37.036: [ CSSCLNT][1077418304]clssgsGroupGetStatus: returning 8
2012-09-11 12:04:37.036: [ USRTHRD][1077418304] clsnomon_status: Communications failure with CSS detected. Waiting for sync to complete...
2012-09-11 12:04:37.036: [ USRTHRD][1097382208] clsnwork_process_work: calling sync
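In the oracssdagent_root.log excerpt, `clssgsGroupGetStatus` returns 8 shortly after "shutdown initiated by CSS", and the reboot advisory in the alert log reports "unexpected failure 8 received from CSS". Pulling these values out of the agent log makes them easier to compare. The sketch below is an illustration only: it assumes entries start with `YYYY-MM-DD HH:MM:SS.mmm: [COMPONENT][thread]` as in the excerpt, and the names `parse_agent_log`, `css_status_code` and `shutdown_requested_by_css` are hypothetical helpers, not Oracle utilities.

```python
import re
from typing import List, NamedTuple, Optional


class Entry(NamedTuple):
    timestamp: str
    component: str
    thread: str
    message: str


STAMP = r"\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}\.\d{3}"
# Split in front of every "timestamp: [" so that entries are recovered even
# when several of them were pasted onto a single line.
SPLIT = re.compile(rf"(?={STAMP}: \[)")
HEAD = re.compile(rf"({STAMP}): \[\s*(\w+)\]\[(\d+)\]\s*(.*)", re.S)

# Three entries copied from the excerpt above, deliberately run together.
SAMPLE_AGENT_LOG = (
    "2012-09-11 12:04:36.400: [ USRTHRD][1095805248] clsnpollmsg_main: "
    "shutdown initiated by CSS, requested to sync"
    "2012-09-11 12:04:37.036: [ CSSCLNT][1077418304]clssgsGroupGetStatus: returning 8"
    "2012-09-11 12:04:37.036: [ USRTHRD][1077418304] clsnomon_status: "
    "Communications failure with CSS detected. Waiting for sync to complete..."
)


def parse_agent_log(text: str) -> List[Entry]:
    """Split raw agent log text into entries with normalized whitespace."""
    entries = []
    for chunk in SPLIT.split(text):
        match = HEAD.match(chunk)
        if match:
            stamp, component, thread, message = match.groups()
            entries.append(Entry(stamp, component, thread, " ".join(message.split())))
    return entries


def css_status_code(entries: List[Entry]) -> Optional[int]:
    """Return the first value logged by 'clssgsGroupGetStatus: returning N'."""
    for entry in entries:
        match = re.fullmatch(r"clssgsGroupGetStatus: returning (\d+)", entry.message)
        if match:
            return int(match.group(1))
    return None


def shutdown_requested_by_css(entries: List[Entry]) -> bool:
    """True if any entry records that CSS initiated the shutdown."""
    return any("shutdown initiated by CSS" in entry.message for entry in entries)


if __name__ == "__main__":
    parsed = parse_agent_log(SAMPLE_AGENT_LOG)
    print(shutdown_requested_by_css(parsed), css_status_code(parsed))
```

Splitting on the timestamp before matching keeps a trailing number such as the status code from being merged with the date of the next entry.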


Because CRSD resources (user resources) failed to stop, crsd.log is a good starting point for further debugging.
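One way to start with crsd.log is to narrow it to the lines that mention the failed resource within a few seconds of the CRS-5833 message. The following sketch assumes that crsd.log lines begin with a `YYYY-MM-DD HH:MM:SS.mmm` timestamp, like the agent log entries in this article; that format, the function name `lines_near` and the default window are assumptions made for illustration.

```python
from datetime import datetime, timedelta

STAMP_FORMAT = "%Y-%m-%d %H:%M:%S.%f"
STAMP_LENGTH = 23  # length of "YYYY-MM-DD HH:MM:SS.mmm"


def lines_near(lines, resource, around, window_seconds=5.0):
    """Return lines that mention `resource` and whose leading timestamp
    lies within `window_seconds` of `around`.

    `lines` is any iterable of strings, for example an open file.
    Lines without a parseable leading timestamp are skipped.
    """
    center = datetime.strptime(around, STAMP_FORMAT)
    window = timedelta(seconds=window_seconds)
    hits = []
    for line in lines:
        try:
            stamp = datetime.strptime(line[:STAMP_LENGTH], STAMP_FORMAT)
        except ValueError:
            continue  # continuation line or unexpected format
        if abs(stamp - center) <= window and resource in line:
            hits.append(line.rstrip("\n"))
    return hits
```

For the example in this article, the call could look like `lines_near(open(path_to_crsd_log), "zDRMON.sh.racnode2", "2012-09-11 12:04:36.586")`, where `path_to_crsd_log` is a placeholder for the crsd.log location on the evicted node.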
