一個客戶的11.2的RAC環境,出現了一個節點DOWN掉的問題。
這個客戶手工調整了系統的時間,使得兩個節點的系統時間只相差幾秒,而且還設定了11.2的自動時間同步功能。但是到了下午發現一個節點上的執行個體DOWN掉了,而且這時兩個節點的時間相差了一個小時左右。
檢查掛掉的節點上的資料庫alert檔案:
Mon May 10 15:22:03 2010
NOTE: ASMB terminating
Errors in file /oracle/app/oracle/diag/rdbms/posp/posp2/trace/posp2_asmb_8257792.trc:
ORA-15064: communication failure with ASM instance
ORA-03113: end-of-file on communication channel
Process ID:
Session ID: 82 Serial number: 53
Errors in file /oracle/app/oracle/diag/rdbms/posp/posp2/trace/posp2_asmb_8257792.trc:
ORA-15064: communication failure with ASM instance
ORA-03113: end-of-file on communication channel
Process ID:
Session ID: 82 Serial number: 53
ASMB (ospid: 8257792): terminating the instance due to error 15064
Instance terminated by ASMB, pid = 8257792
很顯然導致問題的原因是串連不到ASM執行個體,這比如導致資料庫執行個體的崩潰,再檢查ASM執行個體的錯誤記錄檔:
Mon May 10 15:00:13 2010
WARNING: Elapsed time did not advance (delta = 0),
WARNING: skipping disk repair timer updates for
WARNING: offline disks
Mon May 10 15:03:35 2010
WARNING: Elapsed time did not advance (delta = 0),
WARNING: skipping disk repair timer updates for
WARNING: offline disks
Mon May 10 15:06:56 2010
WARNING: Elapsed time did not advance (delta = 0),
WARNING: skipping disk repair timer updates for
WARNING: offline disks
Mon May 10 15:10:17 2010
WARNING: Elapsed time did not advance (delta = 0),
WARNING: skipping disk repair timer updates for
WARNING: offline disks
Mon May 10 15:13:39 2010
WARNING: Elapsed time did not advance (delta = 0),
WARNING: skipping disk repair timer updates for
WARNING: offline disks
Mon May 10 15:17:00 2010
WARNING: Elapsed time did not advance (delta = 0),
WARNING: skipping disk repair timer updates for
WARNING: offline disks
Mon May 10 15:20:21 2010
WARNING: Elapsed time did not advance (delta = 0),
WARNING: skipping disk repair timer updates for
WARNING: offline disks
Mon May 10 15:22:03 2010
NOTE: ASMB process exiting, either shutdown is in progress
NOTE: or foreground connected to ASMB was killed.
Mon May 10 15:22:03 2010
NOTE: client exited [4718636]
NOTE: force a map free for map id 2
Mon May 10 15:22:04 2010
PMON (ospid: 6816236): terminating the instance due to error 481
Instance terminated by PMON, pid = 6816236
ASM執行個體的被中止的原因是錯誤481,而Oracle對於這個錯誤的描述為:
ORA-00481: LMON process terminated with error
Cause: The global enqueue service monitor process died
Action: Warm start instance
看來問題和CLUSTER環境有關,而且ASM的警示日誌中包含了大量的警告資訊。
進一步檢查CLUSTER上的相關資訊:
2011-05-10 14:33:46.819
[cssd(4391224)]CRS-1601:CSSD Reconfiguration complete. Active nodes are jzdbnc jzdbiufo .
2011-05-10 14:35:01.116
[cssd(4391224)]CRS-1632:Node jzdbiufo is being removed from the cluster in cluster incarnation 200459465
2011-05-10 14:35:01.144
[cssd(4391224)]CRS-1601:CSSD Reconfiguration complete. Active nodes are jzdbnc .
2011-05-10 14:35:03.192
[cssd(4391224)]CRS-1601:CSSD Reconfiguration complete. Active nodes are jzdbnc jzdbiufo .
2011-05-10 14:36:18.384
[cssd(4391224)]CRS-1612:Network communication with node jzdbiufo (2) missing for 50% of timeout interval. Removal of this node from cluster in 14.896 seconds
2011-05-10 14:36:26.422
[cssd(4391224)]CRS-1611:Network communication with node jzdbiufo (2) missing for 75% of timeout interval. Removal of this node from cluster in 6.859 seconds
2011-05-10 14:36:30.452
[cssd(4391224)]CRS-1610:Network communication with node jzdbiufo (2) missing for 90% of timeout interval. Removal of this node from cluster in 2.829 seconds
2011-05-10 14:36:33.285
[cssd(4391224)]CRS-1632:Node jzdbiufo is being removed from the cluster in cluster incarnation 200459467
2011-05-10 14:36:33.312
[cssd(4391224)]CRS-1601:CSSD Reconfiguration complete. Active nodes are jzdbnc .
2011-05-10 14:54:16.181
[crsd(6422586)]CRS-2765:Resource 'ora.net1.network' has failed on server 'jzdbnc'.
2011-05-10 14:54:18.539
[/oracle/app/11.2.0/grid/bin/oraagent.bin(3014938)]CRS-5016:Process "/oracle/app/11.2.0/grid/bin/lsnrctl" spawned by agent "/oracle/app/11.2.0/grid/bin/oraagent.bin" for action "check" failed: details at "(:CLSN00010:)" in "/oracle/app/11.2.0/grid/log/jzdbnc/agent/crsd/oraagent_grid/oraagent_grid.log"
2011-05-10 14:54:18.574
[/oracle/app/11.2.0/grid/bin/oraagent.bin(3014938)]CRS-5016:Process "/oracle/app/11.2.0/grid/opmn/bin/onsctli" spawned by agent "/oracle/app/11.2.0/grid/bin/oraagent.bin" for action "check" failed: details at "(:CLSN00010:)" in "/oracle/app/11.2.0/grid/log/jzdbnc/agent/crsd/oraagent_grid/oraagent_grid.log"