Oracle 11gR2 RAC Node Crash Fault Analysis
Environment: AIX 7100
Oracle 11gR2 RAC
Detailed version: 11.2.0.4
Symptom:
CRS on node 2 hung and went down. The crsctl command did not respond at all. After the host running the CRS processes was restarted, the VIP did not fail over to node 1.
Analysis approach:
1. Check the DB alert logs and the related trace files.
2. Review the output of "errpt -a" on all nodes.
3. Review the GI logs of all nodes from the time of the problem:
the logs under <GRID_HOME>/log/, plus /etc/oracle/lastgasp/* or /var/opt/oracle/lastgasp/* (if present)
Note: If the host was restarted by CRS, a record is added to a file in the /etc/oracle/lastgasp/ directory.
4. Check the LMON, LMS*, and LMD0 trace files of all nodes from the time of the problem.
5. Review all OSW (OSWatcher) output of all nodes from the time of the problem.
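As a quick way to act on the lastgasp note in step 3, the following is a minimal sketch (not an Oracle tool) that lists the files in the two lastgasp directories named above, newest first, so a recent CRS-initiated reboot stands out. The function name find_lastgasp_files is hypothetical, and using the file modification time as a proxy for "a record was added" is an assumption.

```python
import glob
import os
import time

# Directories taken from the checklist above; pass another list if yours differ.
LASTGASP_DIRS = ["/etc/oracle/lastgasp", "/var/opt/oracle/lastgasp"]


def find_lastgasp_files(dirs=LASTGASP_DIRS):
    """Return (path, mtime) pairs for regular files in the given directories, newest first.

    Directories that do not exist are silently skipped, because glob
    simply returns no matches for them.
    """
    found = []
    for d in dirs:
        for path in glob.glob(os.path.join(d, "*")):
            if os.path.isfile(path):
                found.append((path, os.path.getmtime(path)))
    # Assumption: a recently modified file means CRS recently wrote a record.
    return sorted(found, key=lambda item: item[1], reverse=True)


if __name__ == "__main__":
    for path, mtime in find_lastgasp_files():
        stamp = time.strftime("%Y-%m-%d %H:%M:%S", time.localtime(mtime))
        print(stamp, path)
```

Run it on every node and compare the newest timestamps with the time of the incident.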
The detailed analysis process is as follows:
Alert log of the node 1 DB:
Tue Mar 25 12:59:07 2014
Thread 1 advanced to log sequence 245 (LGWR switch)
Current log #2 seq #245 mem #0: +SYSDG/dbracdb/onlinelog/group_2.264.840562709
Current log #2 seq #245 mem #1: +SYSDG/dbracdb/onlinelog/group_2.265.840562727
Tue Mar 25 12:59:20 2014
Archived Log entry 315 added for thread 1 sequence 244 ID 0xffffffff82080958 dest 1:
Tue Mar 25 13:14:54 2014
IPC Send timeout detected. Sender: ospid 6160700 [oracle@dbrac1 (LMS0)]
Receiver: inst 2 binc 291585594 ospid 11010320
IPC Send timeout to 2.1 inc 50 for msg type 65518 from opid 12
Tue Mar 25 13:14:59 2014
Communications reconfiguration: instance_number 2
Tue Mar 25 13:15:01 2014
IPC Send timeout detected. Sender: ospid 12452050 [oracle@dbrac1 (LMS1)]
Receiver: inst 2 binc 291585600 ospid 11534636
IPC Send timeout to 2.2 inc 50 for msg type 65518 from opid 13
Tue Mar 25 13:15:22 2014
IPC Send timeout detected. Sender: ospid 10682630 [oracle@dbrac1 (TNS V1-V3)]
Receiver: inst 2 binc 50 ospid 6095056
Tue Mar 25 13:15:25 2014
Detected an inconsistent instance membership by instance 1
Evicting instance 2 from cluster
Waiting for instances to leave: 2
Tue Mar 25 13:15:26 2014
Dumping diagnostic data in directory = [cdmp_20140325131526], requested by (instance = 2, osid = 8192018 (LMD0)), summary = [abnormal instance termination].
Tue Mar 25 13:15:42 2014
Reconfiguration started (old inc 50, new inc 54)
List of instances:
1 (myinst: 1)
...
Tue Mar 25 13:15:52 2014
Archived Log entry 316 added for thread 2 sequence 114 ID 0xffffffff82080958 dest 1:
Tue Mar 25 13:15:53 2014
ARC3: Archiving disabled thread 2 sequence 115
Archived Log entry 317 added for thread 2 sequence 115 ID 0xffffffff82080958 dest 1:
Tue Mar 25 13:16:37 2014
Thread 1 advanced to log sequence 246 (LGWR switch)
Current log #3 seq #246 mem #0: +SYSDG/dbracdb/onlinelog/group_3.266.840562735
Current log #3 seq #246 mem #1: +SYSDG/dbracdb/onlinelog/group_3.267.840562747
Tue Mar 25 13:16:46 2014
Decreasing number of real time LMS from 2 to 0
Tue Mar 25 13:16:51 2014
Archived Log entry 318 added for thread 1 sequence 245 ID 0xffffffff82080958 dest 1:
Tue Mar 25 13:20:50 2014
IPC Send timeout detected. Sender: ospid 9306248 [oracle@dbrac1 (PING)]
Receiver: inst 2 binc 291585377 ospid 2687058
Tue Mar 25 13:30:08 2014
Thread 1 advanced to log sequence 247 (LGWR switch)
Current log #1 seq #247 mem #0: +SYSDG/dbracdb/onlinelog/group_1.262.840562653
Current log #1 seq #247 mem #1: +SYSDG/dbracdb/onlinelog/group_1.263.840562689
Tue Mar 25 13:30:20 2014
Archived Log entry 319 added for thread 1 sequence 246 ID 0xffffffff82080958 dest 1:
Tue Mar 25 13:45:23 2014
Thread 1 advanced to log sequence 248 (LGWR switch)
Current log #2 seq #248 mem #0: +SYSDG/dbracdb/onlinelog/group_2.264.840562709
Current log #2 seq #248 mem #1: +SYSDG/dbracdb/onlinelog/group_2.265.840562727
Alert log of the node 2 DB:
Tue Mar 25 12:07:15 2014
Archived Log entry 309 added for thread 2 sequence 112 ID 0xffffffff82080958 dest 1:
Tue Mar 25 12:22:22 2014
Dumping diagnostic data in directory = [cdmp_20140325122222], requested by (instance = 1, osid = 7012828), summary = [incident = 384673].
Tue Mar 25 12:45:21 2014
Thread 2 advanced to log sequence 114 (LGWR switch)
Current log #6 seq #114 mem #0: +SYSDG/dbracdb/onlinelog/group_6.274.840563009
Current log #6 seq #114 mem #1: +SYSDG/dbracdb/onlinelog/group_6.275.840563017
Tue Mar 25 12:45:22 2014
Archived Log entry 313 added for thread 2 sequence 113 ID 0xffffffff82080958 dest 1:
Tue Mar 25 13:14:57 2014
IPC Send timeout detected. Receiver ospid 11010320
Tue Mar 25 13:14:57 2014
Errors in file /oraclelog/diag/rdbms/dbracdb/dbracdb2/trace/dbracdb2_lms0_11010320.trc:
IPC Send timeout detected. Receiver ospid 11534636 [
Tue Mar 25 13:15:01 2014
Errors in file /oraclelog/diag/rdbms/dbracdb/dbracdb2/trace/dbracdb2_lms1_1151_36.trc:
Tue Mar 25 13:15:25 2014
LMS0 (ospid: 11010320) has detected no messaging activity from instance 1
LMS0 (ospid: 11010320) issues an IMR to resolve the situation
Please check LMS0 trace file for more detail.
Tue Mar 25 13:15:25 2014
Suppressed nested communications reconfiguration: instance_number 1
Detected an inconsistent instance membership by instance 1
Tue Mar 25 13:15:25 2014
Received an instance abort message from instance 1
Please check instance 1 alert and LMON trace files for detail.
LMD0 (ospid: 8192018): terminating the instance due to error 481
Tue Mar 25 13:15:26 2014
ORA-1092: opitsk aborting process
Tue Mar 25 13:15:29 2014
System state dump requested by (instance = 2, osid = 8192018 (LMD0)), summary = [abnormal instance termination].
System State dumped to trace file /oraclelog/diag/rdbms/dbracdb/dbracdb2/trace/dbracdb2_diag_9699724_20140325131529.trc
Instance terminated by LMD0, pid = 8192018
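Reading the two alert logs side by side is easier with a merged timeline. The following is a minimal sketch, assuming the timestamp layout shown in the excerpts above ("Tue Mar 25 13:14:54 2014" on its own line), an English/C locale for the day and month names, and synchronized clocks on both nodes. The function names and the keyword list are illustrative choices, not part of any Oracle tooling.

```python
import re
from datetime import datetime

TS_FORMAT = "%a %b %d %H:%M:%S %Y"  # e.g. "Tue Mar 25 13:14:54 2014"
TS_RE = re.compile(r"^[A-Z][a-z]{2} [A-Z][a-z]{2} [ \d]\d \d{2}:\d{2}:\d{2} \d{4}$")

# Illustrative keyword list: messages that matter for an eviction analysis.
KEYWORDS = (
    "IPC Send timeout",
    "Evicting instance",
    "terminating the instance",
    "Reconfiguration started",
)


def parse_alert_log(text, node):
    """Return (timestamp, node, message) tuples for lines containing a keyword.

    Each message inherits the most recent timestamp line above it.
    """
    events = []
    current = None
    for line in text.splitlines():
        line = line.strip()
        if TS_RE.match(line):
            current = datetime.strptime(line, TS_FORMAT)
        elif current is not None and any(k in line for k in KEYWORDS):
            events.append((current, node, line))
    return events


def merged_timeline(*event_lists):
    """Merge per-node event lists into one list ordered by time, then node."""
    merged = [e for events in event_lists for e in events]
    return sorted(merged, key=lambda e: (e[0], e[1]))
```

For the excerpts above, calling merged_timeline(parse_alert_log(node1_text, 1), parse_alert_log(node2_text, 2)) puts the first IPC Send timeout on node 1 (13:14:54) ahead of the matching receiver-side message on node 2 (13:14:57), followed by the eviction at 13:15:25.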
OSW prvtnet log of node 1:
zzz ***Tue Mar 25 13:12:19 BEIST 2014
trying to get source for 192.168.100.1
source should be 192.168.100.1
traceroute to 192.168.100.1 (192.168.100.1) from 192.168.100.1 (192.168.100.1), 30 hops max
outgoing MTU = 1500
1 dbrac1-priv (192.168.100.1) 1 ms 0 ms 0 ms
trying to get source for 192.168.100.2
source should be 192.168.100.1
traceroute to 192.168.100.2 (192.168.100.2) from 192.168.100.1 (192.168.100.1), 30 hops max
outgoing MTU = 1500
1 dbrac2-priv (192.168.100.2) 1 ms 0 ms *
zzz ***Warning. Traceroute response is spanning snapshot intervals.
zzz ***Tue Mar 25 13:12:31 BEIST 2014
trying to get source for 192.168.100.1
source should be 192.168.100.1
traceroute to 192.168.100.1 (192.168.100.1) from 192.168.100.1 (192.168.100.1), 30 hops max
outgoing MTU = 1500
1 dbrac1-priv (192.168.100.1) 1 ms 0 ms 0 ms
trying to get source for 192.168.100.2
source should be 192.168.100.1
traceroute to 192.168.100.2 (192.168.100.2) from 192.168.100.1 (192.168.100.1), 30 hops max
outgoing MTU = 1500
1 * * *
2 * * *
3 * dbrac2-priv (192.168.100.2) 0 ms *
zzz ***Warning. Traceroute response is spanning snapshot intervals.
zzz ***Tue Mar 25 13:13:17 BEIST 2014
trying to get source for 192.168.100.1
source should be 192.168.100.1
traceroute to 192.168.100.1 (192.168.100.1) from 192.168.100.1 (192.168.100.1), 30 hops max
outgoing MTU = 1500
1 dbrac1-priv (192.168.100.1) 1 ms 0 ms 0 ms
trying to get source for 192.168.100.2
source should be 192.168.100.1
traceroute to 192.168.100.2 (192.168.100.2) from 192.168.100.1 (192.168.100.1), 30 hops max
outgoing MTU = 1500
1 * * *
2 * * *
3 dbrac2-priv (192.168.100.2) 0 ms * *
zzz ***Warning. Traceroute response is spanning snapshot intervals.
zzz ***Tue Mar 25 13:14:04 BEIST 2014
trying to get source for 192.168.100.1
source should be 192.168.100.1
traceroute to 192.168.100.1 (192.168.100.1) from 192.168.100.1 (192.168.100.1), 30 hops max
outgoing MTU = 1500
1 dbrac1-priv (192.168.100.1) 1 ms 0 ms 0 ms
trying to get source for 192.168.100.2
source should be 192.168.100.1
traceroute to 192.168.100.2 (192.168.100.2) from 192.168.100.1 (192.168.100.1), 30 hops max
outgoing MTU = 1500
1 * * * <==================================== Note: each * marks a traceroute probe that got no reply; three * on one line means all three probes sent for that hop failed.
2 * * *
3 * * *
4 * * *
5 * * *
6 * * *
7 * * *
8 dbrac2-priv (192.168.100.2) 0 ms 0 ms *
zzz ***Warning. Traceroute response is spanning snapshot intervals.
zzz ***Tue Mar 25 13:16:01 BEIST 2014 <==================================== This snapshot was taken about 2 minutes after the previous one, so there is a gap in the OSW data.
trying to get source for 192.168.100.1
source should be 192.168.100.1
traceroute to 192.168.100.1 (192.168.100.1) from 192.168.100.1 (192.168.100.1), 30 hops max
outgoing MTU = 1500
1 dbrac1-priv (192.168.100.1) 1 ms 0 ms 0 ms
trying to get source for 192.168.100.2
source should be 192.168.100.1
traceroute to 192.168.100.2 (192.168.100.2) from 192.168.100.1 (192.168.100.1), 30 hops max
outgoing MTU = 1500
1 * dbrac2-priv (192.168.100.2) 0 ms 0 ms
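The growing number of lost probes toward 192.168.100.2 is easier to see as a per-snapshot count. The following is a minimal sketch, assuming the prvtnet layout shown above: a "zzz ***" header per snapshot, a "traceroute to <ip>" line, then numbered hop lines where "*" is a lost probe and "ms" an answered one. The function name summarize_prvtnet is hypothetical, and the "<==" handling only exists to skip the annotations added in this write-up.

```python
import re

HOP_RE = re.compile(r"^(\d+)\s+(.*)$")  # numbered hop line: "<n> <probe results>"


def summarize_prvtnet(text, target_ip):
    """Per snapshot, count lost ('*') and answered ('ms') probes toward target_ip."""
    summaries = []
    current = None     # summary dict of the snapshot being read
    in_target = False  # True while reading hops of the traceroute to target_ip
    for raw in text.splitlines():
        # Drop "<==== ..." annotations, which are not part of real OSW output.
        line = raw.split("<==")[0].strip()
        lower = line.lower()
        if lower.startswith("zzz ***"):
            label = line[7:].strip()
            if label.lower().startswith("warning"):
                continue  # OSW warning line, not a new snapshot
            current = {"snapshot": label, "hops": 0, "lost": 0, "answered": 0}
            summaries.append(current)
            in_target = False
        elif lower.startswith("traceroute to "):
            in_target = lower.split()[2] == target_ip
        elif in_target and current is not None:
            m = HOP_RE.match(line)
            if m:
                rest = m.group(2)
                current["hops"] += 1
                current["lost"] += rest.count("*")
                current["answered"] += len(re.findall(r"\bms\b", rest, re.IGNORECASE))
    return summaries
```

Calling summarize_prvtnet(osw_text, "192.168.100.2") on the node 1 file gives one dictionary per snapshot; a rising "lost" count in the minutes before 13:14:54 would line up with the IPC Send timeouts in the alert logs.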