Why does GI rebootless fencing fail?
Reference from:
Why Grid Infrastructure rebootless Node Fencing fails (Doc ID 1502282.1)
Applies to:
Oracle Server - Enterprise Edition - Version 11.2.0.2 and later
Information in this document applies to any platform.
Objective:
Rebootless fencing is a feature introduced in 11.2.0.2. When an eviction occurs, it replaces the node reboot used before 11.2.0.2: rebootless fencing attempts to gracefully stop GI on the evicted node so that the node does not have to be rebooted.
If rebootless fencing fails, the evicted node is rebooted. This article lists the common causes of rebootless fencing failures.
Details:
1. Resources fail to stop.
If one or more resources fail to stop, rebootless fencing fails and the node is rebooted.
In the following example, rebootless fencing fails on node 2 after a network split brain, and node 2 is rebooted as expected:
- <gi_home>/log/<node>/alert<node>.log from the evicted node:
2012-09-11 12:04:34.363
[cssd(18834)]CRS-1610:Network communication with node racnode1 (1) missing for 90% of timeout interval. Removal of this node from cluster in 2.020 seconds
2012-09-11 12:04:36.379
[cssd(18834)]CRS-1609:This node is unable to communicate with other nodes in the cluster and is going down to preserve cluster integrity; details at (:CSSNM00008:) in /ocw/grid/log/racnode2/cssd/ocssd.log.
2012-09-11 12:04:36.379
[cssd(18834)]CRS-1656:The CSS daemon is terminating due to a fatal error; Details at (:CSSSC00012:) in /ocw/grid/log/racnode2/cssd/ocssd.log
2012-09-11 12:04:36.399
[cssd(18834)]CRS-1652:Starting clean up of CRSD resources.
2012-09-11 12:04:36.586
[crsd(26115)]CRS-5833:Cleaning resource 'ZDRMON.sh.racnode2 1 1' failed as part of reboot-less node fencing
2012-09-11 12:04:36.588
[cssd(18834)]CRS-1653:The clean up of the CRSD resources failed.     ##>> user resource fails to be cleaned
2012-09-11 12:04:37.042
[ohasd(16821)]CRS-2765:Resource 'ora.evmd' has failed on server 'racnode2'.
2012-09-11 12:04:37.052
[/ocw/grid/bin/scriptagent.bin(27696)]CRS-5822:Agent '/ocw/grid/bin/scriptagent_oracle' disconnected from server. Details at (:CRSAGF00117:) {0:4:10} in /ocw/grid/log/racnode2/agent/crsd/scriptagent_oracle/scriptagent_oracle.log.
2012-09-11 12:04:37.062
[ohasd(16821)]CRS-2765:Resource 'ora.crsd' has failed on server 'racnode2'.
##>> node rebooted after this message; in some cases, this message won't be there
2012-09-11 12:10:47.356
[ohasd(16677)]CRS-2112:The OLR service started on node racnode2.
2012-09-11 12:10:47.521
[ohasd(16677)]CRS-1301:Oracle High Availability Service started on node racnode2.
2012-09-11 12:10:47.539
[ohasd(16677)]CRS-8011:reboot advisory message from host: racnode2, component: cssagent, with time stamp: L-2012-09-11-12:04:37.140     ##>> reboot advisory shows both cssdagent and cssdmonitor took the action to reboot
[ohasd(16677)]CRS-8013:reboot advisory message text: clsnomon_status: need to reboot, unexpected failure 8 received from CSS
2012-09-11 12:10:47.594
[ohasd(16677)]CRS-8011:reboot advisory message from host: racnode2, component: cssmonit, with time stamp: L-2012-09-11-12:04:37.139
[ohasd(16677)]CRS-8013:reboot advisory message text: clsnomon_status: need to reboot, unexpected failure 8 received from CSS
2012-09-11 12:10:47.605
[ohasd(16677)]CRS-8017:location: /etc/oracle/lastgasp has 2 reboot advisory log files, 2 were announced and 0 errors occurred
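The message codes in the alert log excerpt above (CRS-1652, CRS-5833, CRS-1653, CRS-8011) can be picked out with a short script. The following is only a minimal sketch, not an Oracle tool: the function name `summarize_fencing`, the code descriptions, and the embedded sample text are assumptions made for illustration, and the pattern for the failed resource name assumes the CRS-5833 wording shown in this article.

```python
import re

# CRS message codes taken from the alert log excerpt in this article.
# The short descriptions are paraphrases for illustration only.
FENCING_CODES = {
    "CRS-1652": "clean up of CRSD resources started",
    "CRS-5833": "a resource failed to clean during reboot-less fencing",
    "CRS-1653": "clean up of CRSD resources failed",
    "CRS-8011": "reboot advisory recorded",
}

CODE_RE = re.compile(r"CRS-\d{4}", re.IGNORECASE)
# Captures the quoted resource name from a CRS-5833 message.
RESOURCE_RE = re.compile(
    r"CRS-5833:\s*Cleaning resource\s*'\s*([^']*?)\s*'", re.IGNORECASE
)


def summarize_fencing(alert_text):
    """Return (codes_seen, failed_resources) for an alert log excerpt."""
    codes = []
    for match in CODE_RE.finditer(alert_text):
        code = match.group(0).upper()
        # Keep only the codes of interest, in order of first appearance.
        if code in FENCING_CODES and code not in codes:
            codes.append(code)
    return codes, RESOURCE_RE.findall(alert_text)


# Hypothetical sample: a few lines copied from the excerpt above.
sample = """\
[cssd(18834)]CRS-1652:Starting clean up of CRSD resources.
[crsd(26115)]CRS-5833:Cleaning resource 'ZDRMON.sh.racnode2 1 1' failed as part of reboot-less node fencing
[cssd(18834)]CRS-1653:The clean up of the CRSD resources failed.
[ohasd(16677)]CRS-8011:reboot advisory message from host: racnode2, component: cssagent, with time stamp: L-2012-09-11-12:04:37.140
[ohasd(16677)]CRS-8013:reboot advisory message text: clsnomon_status: need to reboot, unexpected failure 8 received from CSS
"""

codes, resources = summarize_fencing(sample)
for code in codes:
    print(code, "-", FENCING_CODES[code])
print("resources that failed to clean:", resources)
```

If CRS-1653 and CRS-8011 both appear, the excerpt matches the pattern described here: the resource cleanup failed and the node was rebooted.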
When a resource fails to stop, cssdagent, cssdmonitor, or both try to reboot the node. Sample logs are shown below.
- <gi_home>/agent/ohasd/oracssdmonitor_root/oracssdmonitor_root.log
2012-09-11 12:04:36.400: [USRTHRD][1095805248] clsnpollmsg_main: got posted
2012-09-11 12:04:36.400: [USRTHRD][1095805248] clsnpollmsg_main: shutdown initiated by CSS, requested to sync
2012-09-11 12:04:36.400: [USRTHRD][1095805248] clsnwork_queue: posting worker thread
2012-09-11 12:04:36.400: [USRTHRD][1095805248] clsnpollmsg_main: exiting check loop
2012-09-11 12:04:36.400: [USRTHRD][1095805248] clsnpollmsg_main: got HB signal
2012-09-11 12:04:36.400: [USRTHRD][1097382208] clsnwork_process_work: calling sync
2012-09-11 12:04:36.413: [USRTHRD][1097382208] clsnwork_process_work: sync completed
2012-09-11 12:04:37.035: [CSSCLNT][1095805248]clsssRecvMsg: got a disconnect from the server while waiting for message type 22
2012-09-11 12:04:37.035: [CSSCLNT][1098959168]clsssRecvMsg: got a disconnect from the server while waiting for message type 27
2012-09-11 12:04:37.035: [USRTHRD][1095805248] clsnwork_queue: posting worker thread
2012-09-11 12:04:37.035: [USRTHRD][1095805248] clsnpollmsg_main: exiting check loop
2012-09-11 12:04:37.035: [GIPCXCPT][1098959168]gipcInternalSend: connection not valid for send operation endp 0x8e3e60 [00000000000001b7] { gipcEndpoint : localAddr 'clsc://(ADDRESS=(PROTOCOL=ipc)(KEY=)(GIPCID=3165a05b-7e7139a5-18801))', remoteAddr 'clsc://(ADDRESS=(PROTOCOL=ipc)(KEY=OCSSD_LL_racnode2_)(GIPCID=7e7139a5-3165a05b-18834))', numPend 0, numReady 0, numDone 0, numDead 0, numTransfer 0, objFlags 0x0, pidPeer 18834, flags 0x3861e, usrFlags 0x20010 }, ret gipcretConnectionLost (12)
2012-09-11 12:04:37.035: [USRTHRD][1097382208] clsnwork_process_work: calling sync
2012-09-11 12:04:37.035: [CSSCLNT][1077418304]clsssRecvMsg: got a disconnect from the server while waiting for message type 1
2012-09-11 12:04:37.036: [CSSCLNT][1077418304]clssgsGroupGetStatus: communications failed (0/3/-1)
2012-09-11 12:04:37.036: [CSSCLNT][1077418304]clssgsGroupGetStatus: returning 8
2012-09-11 12:04:37.036: [USRTHRD][1077418304] clsnomon_status: Communications failure with CSS detected. Waiting for sync to complete...
2012-09-11 12:04:37.036: [GIPCXCPT][1098959168]gipcSendSyncF [clsssServerRpc : clsss.c : 6272]: EXCEPTION[ ret gipcretConnectionLost () ] failed to send on endp 0x8e3e60 [00000000000001b7] { gipcEndpoint : localAddr 'clsc://(ADDRESS=(PROTOCOL=ipc)(KEY=)(GIPCID=3165a05b-7e7139a5-18801))', remoteAddr 'clsc://(ADDRESS=(PROTOCOL=ipc)(KEY=OCSSD_LL_racnode2_)(GIPCID=7e7139a5-3165a05b-18834))', numPend 0, numReady 0, numDone 0, numDead 0, numTransfer 0, objFlags 0x0, pidPeer 18834, flags 0x3861e, usrFlags 0x20010 }, addr 0000000000000000, buf 0x4180bd80, len, flags 0x8000000
2012-09-11 12:04:37.036: [CSSCLNT][1098959168]clsssServerRpc: send failed with err, msg type 7
2012-09-11 12:04:37.036: [CSSCLNT][1098959168]clsssCommonClientExit: RPC failure, rc 3
2012-09-11 12:04:37.139: [USRTHRD][1097382208] clsnwork_process_work: sync completed
2012-09-11 12:04:37.139: [USRTHRD][1097382208] clsnsynccomplete: posting omon
- <gi_home>/agent/ohasd/oracssdagent_root/oracssdagent_root.log
2012-09-11 12:04:36.400: [USRTHRD][1095805248] clsnpollmsg_main: got posted
2012-09-11 12:04:36.400: [USRTHRD][1095805248] clsnpollmsg_main: shutdown initiated by CSS, requested to sync
2012-09-11 12:04:36.400: [USRTHRD][1095805248] clsnwork_queue: posting worker thread
2012-09-11 12:04:36.400: [USRTHRD][1095805248] clsnpollmsg_main: exiting check loop
2012-09-11 12:04:36.400: [USRTHRD][1095805248] clsnpollmsg_main: got HB signal
2012-09-11 12:04:36.400: [USRTHRD][1097382208] clsnwork_process_work: calling sync
2012-09-11 12:04:36.413: [USRTHRD][1097382208] clsnwork_process_work: sync completed
2012-09-11 12:04:37.035: [CSSCLNT][1098959168]clsssRecvMsg: got a disconnect from the server while waiting for message type 27
2012-09-11 12:04:37.035: [CSSCLNT][1095805248]clsssRecvMsg: got a disconnect from the server while waiting for message type 22
2012-09-11 12:04:37.035: [GIPCXCPT][1098959168]gipcInternalSend: connection not valid for send operation endp 0x2aaab4014900 [00000000000001c0] { gipcEndpoint : localAddr 'clsc://(ADDRESS=(PROTOCOL=ipc)(KEY=)(GIPCID=561e3f6b-a0a3602e-18817))', remoteAddr 'clsc://(ADDRESS=(PROTOCOL=ipc)(KEY=OCSSD_LL_racnode2_)(GIPCID=a0a3602e-561e3f6b-18834))', numPend 0, numReady 0, numDone 0, numDead 0, numTransfer 0, objFlags 0x0, pidPeer 18834, flags 0x3861e, usrFlags 0x20010 }, ret gipcretConnectionLost (12)
2012-09-11 12:04:37.035: [USRTHRD][1095805248] clsnwork_queue: posting worker thread
2012-09-11 12:04:37.035: [USRTHRD][1095805248] clsnpollmsg_main: exiting check loop
2012-09-11 12:04:37.035: [GIPCXCPT][1098959168]gipcSendSyncF [clsssServerRpc : clsss.c : 6272]: EXCEPTION[ ret gipcretConnectionLost () ] failed to send on endp 0x2aaab4014900 [00000000000001c0] { gipcEndpoint : localAddr 'clsc://(ADDRESS=(PROTOCOL=ipc)(KEY=)(GIPCID=561e3f6b-a0a3602e-18817))', remoteAddr 'clsc://(ADDRESS=(PROTOCOL=ipc)(KEY=OCSSD_LL_racnode2_)(GIPCID=a0a3602e-561e3f6b-18834))', numPend 0, numReady 0, numDone 0, numDead 0, numTransfer 0, objFlags 0x0, pidPeer 18834, flags 0x3861e, usrFlags 0x20010 }, addr 0000000000000000, buf 0x4180bd80, len, flags 0x8000000
2012-09-11 12:04:37.035: [CSSCLNT][1098959168]clsssServerRpc: send failed with err, msg type 7
2012-09-11 12:04:37.035: [CSSCLNT][1098959168]clsssCommonClientExit: RPC failure, rc 3
2012-09-11 12:04:37.036: [CSSCLNT][1077418304]clsssRecvMsg: got a disconnect from the server while waiting for message type 1
2012-09-11 12:04:37.036: [CSSCLNT][1077418304]clssgsGroupGetStatus: communications failed (0/3/-1)
2012-09-11 12:04:37.036: [CSSCLNT][1077418304]clssgsGroupGetStatus: returning 8
2012-09-11 12:04:37.036: [USRTHRD][1077418304] clsnomon_status: Communications failure with CSS detected. Waiting for sync to complete...
2012-09-11 12:04:37.036: [USRTHRD][1097382208] clsnwork_process_work: calling sync
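To see how quickly the agents lost contact with CSS after the shutdown request, the timestamps of the key messages in the cssdagent or cssdmonitor log can be compared. The sketch below is an illustration under stated assumptions rather than a supported procedure: it assumes one log entry per line, each starting with a `YYYY-MM-DD HH:MM:SS.mmm` timestamp. The marker strings are taken from the sample logs above, while the function name `first_seen` and the embedded sample are hypothetical.

```python
from datetime import datetime

# Message fragments taken from the sample agent logs, matched case-insensitively.
MARKERS = {
    "shutdown_requested": "shutdown initiated by css",
    "css_disconnect": "got a disconnect from the server",
    "comm_failure": "communications failure with css detected",
}


def first_seen(log_text):
    """Map each marker name to the timestamp of the first line containing it."""
    seen = {}
    for line in log_text.splitlines():
        lowered = line.lower()
        for name, needle in MARKERS.items():
            if name not in seen and needle in lowered:
                # Assumption: the first 23 characters hold the timestamp,
                # for example '2012-09-11 12:04:36.400'.
                seen[name] = datetime.strptime(line[:23], "%Y-%m-%d %H:%M:%S.%f")
    return seen


# Hypothetical sample: three entries copied from the logs above.
sample = """\
2012-09-11 12:04:36.400: [USRTHRD][1095805248] clsnpollmsg_main: shutdown initiated by CSS, requested to sync
2012-09-11 12:04:37.035: [CSSCLNT][1095805248]clsssRecvMsg: got a disconnect from the server while waiting for message type 22
2012-09-11 12:04:37.036: [USRTHRD][1077418304] clsnomon_status: Communications failure with CSS detected. Waiting for sync to complete...
"""

seen = first_seen(sample)
elapsed = (seen["comm_failure"] - seen["shutdown_requested"]).total_seconds()
print(f"seconds from shutdown request to CSS communication failure: {elapsed:.3f}")
```

In the sample logs, both agents record the shutdown request at 12:04:36.400 and the communication failure at 12:04:37.036, which is consistent with the reboot advisories stamped 12:04:37.139 and 12:04:37.140 in the alert log.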
Because CRSD resources (user resources) failed to stop, crsd.log is a good starting point for further debugging.
Translated from the MOS article referenced above.