HP-UX System Crash

During the routine database check this morning, I found that node 2 was down. I first checked node 2's alert log and found nothing, and all of the CRS resources were offline as well, so the likely explanation is that the operating system itself crashed and therefore left no records.

The system login records show that the node had rebooted:

    # last | grep Dec
    root pts/1 Mon Dec 17 10:08 still logged in
    root pts/0 Mon Dec 17 09:33 still logged in
    reboot system boot Sun Dec 16 08:16 still logged in
    reboot system boot Sat Dec 15 23:59 - 08:16 (08:16)

However, I do not know whether every system restart is recorded here, and the third record even shows "still logged in". This part can only be left to the HP engineers.

/etc/shutdownlog contains the following new entries:

    00:03 Sun Dec 16 2012. Reboot after panic: MCA, IIP:0xe0000000008a1a40 IFA:0xc000000006dae000
    08:18 Sun Dec 16 2012. Reboot after panic: MCA, IIP:0xe000000000d650a0 IFA:0x20000000777db0cc

The alert log on node 1 shows the following:

    Sat Dec 15 23:55:30 2012
    Errors in file /opt/oracle/product/admin/xxx/udump/xxx1_ora_4074.trc:
    Sat Dec 15 23:55:31 2012
    Errors in file /opt/oracle/product/admin/xxx/udump/xxx1_ora_4074.trc:
    Sat Dec 15 23:55:34 2012
    Reconfiguration started (old inc 100, new inc 102)
    List of nodes: 0

The CRS log shows:

    2012-12-15 23:55:16.183
    [cssd(4229)]CRS-1612:node xxx2 (0) at 50% heartbeat fatal, eviction in 0.000 seconds
    2012-12-15 23:55:23.183
    [cssd(4229)]CRS-1611:node xxx2 (0) at 75% heartbeat fatal, eviction in 0.000 seconds
    2012-12-15 23:55:24.181
    [cssd(4229)]CRS-1611:node xxx2 (0) at 75% heartbeat fatal, eviction in 0.000 seconds
    2012-12-15 23:55:28.183
    [cssd(4229)]CRS-1610:node xxx2 (0) at 90% heartbeat fatal, eviction in 0.000 seconds
    2012-12-15 23:55:29.180
    [cssd(4229)]CRS-1610:node xxx2 (0) at 90% heartbeat fatal, eviction in 0.000 seconds
    2012-12-15 23:55:30.183
    [cssd(4229)]CRS-1610:node xxx2 (0) at 90% heartbeat fatal, eviction in 0.000 seconds
    2012-12-15 23:55:30.682
    [cssd(4229)]CRS-1607:CSSD evicting node xxx2. Details in /opt/oracle/product/crs/log/xxx1/cssd/ocssd.log.
    [cssd(4229)]CRS-1601:CSSD Reconfiguration complete. Active nodes are xxx1 .
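To line up the two "Reboot after panic" records in /etc/shutdownlog with the "system boot" records from last, a small script can parse both and compute how long after each boot the panic record was written. This is only a sketch: the function names are made up for illustration, the year 2012 for the last records is assumed from the shutdownlog entries (last does not print a year), and reading the result as "each boot was followed within minutes by a panic record for the crash that preceded it" is an inference, not something the logs state outright.

```python
import re
from datetime import datetime

# Entries copied from /etc/shutdownlog as quoted in this post.
SHUTDOWNLOG = """\
00:03 Sun Dec 16 2012. Reboot after panic: MCA, IIP:0xe0000000008a1a40 IFA:0xc000000006dae000
08:18 Sun Dec 16 2012. Reboot after panic: MCA, IIP:0xe000000000d650a0 IFA:0x20000000777db0cc
"""

# "system boot" times from the `last` output in this post.
# ASSUMPTION: the year is 2012, taken from the shutdownlog entries.
BOOTS = [datetime(2012, 12, 15, 23, 59), datetime(2012, 12, 16, 8, 16)]

PANIC_RE = re.compile(
    r"^(?P<time>\d{2}:\d{2}) (?P<date>\w{3} \w{3} +\d{1,2} \d{4})\. "
    r"Reboot after panic: (?P<reason>[^,]+), "
    r"IIP:(?P<iip>0x[0-9a-fA-F]+) IFA:(?P<ifa>0x[0-9a-fA-F]+)$"
)


def parse_panics(text):
    """Return one dict per 'Reboot after panic' line found in text."""
    panics = []
    for line in text.splitlines():
        m = PANIC_RE.match(line.strip())
        if m is None:
            continue  # skip anything that is not a panic record
        logged = datetime.strptime(
            f"{m.group('date')} {m.group('time')}", "%a %b %d %Y %H:%M"
        )
        panics.append(
            {
                "logged": logged,
                "reason": m.group("reason"),
                "iip": m.group("iip"),
                "ifa": m.group("ifa"),
            }
        )
    return panics


def minutes_after_boot(logged, boots):
    """Minutes between the latest boot at or before `logged` and `logged`."""
    earlier = [b for b in boots if b <= logged]
    if not earlier:
        return None
    return (logged - max(earlier)).total_seconds() / 60


for panic in parse_panics(SHUTDOWNLOG):
    print(
        panic["logged"],
        panic["reason"],
        panic["iip"],
        panic["ifa"],
        "minutes after boot:",
        minutes_after_boot(panic["logged"], BOOTS),
    )
```

With the data above, the first panic record follows the 23:59 boot by 4 minutes and the second follows the 08:16 boot by 2 minutes, which suggests the node went down twice with an MCA panic rather than once.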
The cssd log shows:

    [ CSSD]2012-12-15 23:55:16.183 [18] >WARNING: clssnmPollingThread: node xxx2 (2) at 50 2.000000e+00artbeat fatal, eviction in 14.489 seconds
    [ CSSD]2012-12-15 23:55:16.183 [18] >TRACE: clssnmPollingThread: node xxx2 (2) is impending reconfig, flag 1037, misstime 15511
    [ CSSD]2012-12-15 23:55:16.183 [18] >TRACE: clssnmPollingThread: diskTimeout set to (27000)ms impending reconfig status(1)
    [ CSSD]2012-12-15 23:55:23.183 [18] >WARNING: clssnmPollingThread: node xxx2 (2) at 75 2.000000e+00artbeat fatal, eviction in 7.489 seconds
    [ CSSD]2012-12-15 23:55:24.181 [18] >WARNING: clssnmPollingThread: node xxx2 (2) at 75 2.000000e+00artbeat fatal, eviction in 6.490 seconds
    [ CSSD]2012-12-15 23:55:28.183 [18] >WARNING: clssnmPollingThread: node xxx2 (2) at 90 2.000000e+00artbeat fatal, eviction in 2.489 seconds
    [ CSSD]2012-12-15 23:55:29.180 [18] >WARNING: clssnmPollingThread: node xxx2 (2) at 90 2.000000e+00artbeat fatal, eviction in 1.491 seconds
    [ CSSD]2012-12-15 23:55:30.183 [18] >WARNING: clssnmPollingThread: node xxx2 (2) at 90 2.000000e+00artbeat fatal, eviction in 0.489 seconds

(The "2.000000e+00artbeat" text is reproduced as logged; it presumably stands for "% heartbeat", as in the CRS log.)

This shows that the cluster was already being reconfigured at this point and that node 2 was evicted from it. After the storage was set to active, the cluster started automatically on node 2 and normal production resumed.

/var/adm/syslog/syslog.log and the old log confirm that the node's operating system rebooted. Strangely, there are no log entries at all from before the reboot, so all I could do was package the system crash data under /var/adm/crash (the crash files can be given a rough look with q4) and send it to HP technical support.
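The countdown in the cssd warnings can be checked with a little arithmetic: adding each "eviction in N seconds" value to the timestamp of its own line gives the moment at which CSSD expected to evict node 2. The sketch below does that for the lines quoted in this post, and also adds the misstime value to the remaining time from the warning logged just before it. The helper names are hypothetical, and treating that sum as the heartbeat timeout in effect is an inference from these log lines, not a documented rule.

```python
import re
from datetime import datetime, timedelta

# WARNING/TRACE lines copied from the cssd log quoted in this post.
CSSD_LOG = """\
[ CSSD]2012-12-15 23:55:16.183 [18] >WARNING: clssnmPollingThread: node xxx2 (2) at 50 2.000000e+00artbeat fatal, eviction in 14.489 seconds
[ CSSD]2012-12-15 23:55:16.183 [18] >TRACE: clssnmPollingThread: node xxx2 (2) is impending reconfig, flag 1037, misstime 15511
[ CSSD]2012-12-15 23:55:23.183 [18] >WARNING: clssnmPollingThread: node xxx2 (2) at 75 2.000000e+00artbeat fatal, eviction in 7.489 seconds
[ CSSD]2012-12-15 23:55:24.181 [18] >WARNING: clssnmPollingThread: node xxx2 (2) at 75 2.000000e+00artbeat fatal, eviction in 6.490 seconds
[ CSSD]2012-12-15 23:55:28.183 [18] >WARNING: clssnmPollingThread: node xxx2 (2) at 90 2.000000e+00artbeat fatal, eviction in 2.489 seconds
[ CSSD]2012-12-15 23:55:29.180 [18] >WARNING: clssnmPollingThread: node xxx2 (2) at 90 2.000000e+00artbeat fatal, eviction in 1.491 seconds
[ CSSD]2012-12-15 23:55:30.183 [18] >WARNING: clssnmPollingThread: node xxx2 (2) at 90 2.000000e+00artbeat fatal, eviction in 0.489 seconds
"""

WARN_RE = re.compile(
    r"^\[\s*CSSD\](\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}\.\d{3}) "
    r".*eviction in (\d+\.\d+) seconds$"
)
MISS_RE = re.compile(r"misstime (\d+)$")


def projected_evictions(text):
    """Return (logged_at, projected_eviction) for each countdown warning."""
    result = []
    for line in text.splitlines():
        m = WARN_RE.match(line)
        if m is None:
            continue  # TRACE lines carry no countdown
        logged_at = datetime.strptime(m.group(1), "%Y-%m-%d %H:%M:%S.%f")
        remaining = timedelta(milliseconds=round(float(m.group(2)) * 1000))
        result.append((logged_at, logged_at + remaining))
    return result


def implied_timeout_ms(text):
    """Add the first misstime (ms) to the countdown of the warning before it."""
    remaining_ms = None
    for line in text.splitlines():
        warn = WARN_RE.match(line)
        if warn:
            remaining_ms = round(float(warn.group(2)) * 1000)
        miss = MISS_RE.search(line)
        if miss and remaining_ms is not None:
            return int(miss.group(1)) + remaining_ms
    return None


for logged_at, evict_at in projected_evictions(CSSD_LOG):
    print(logged_at.time(), "->", evict_at.time())
print("misstime + remaining (ms):", implied_timeout_ms(CSSD_LOG))
```

For these lines every projection lands at 23:55:30.671 or 23:55:30.672, about 10 ms before the CRS-1607 eviction message logged at 23:55:30.682, and misstime plus the remaining time comes to 30000 ms (15511 + 14489). That is consistent with a 30-second heartbeat timeout, although the configured value is not shown anywhere in the post.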