A colleague from the analysis and optimization team received a notification that a RAC database in Beijing had "crashed" on the morning of July 1; it was unknown whether all of the machines were down or only one of the nodes. I needed to troubleshoot the problem. After analyzing the relevant RAC logs and the nmon data, it turned out to be a "pseudo-downtime": the real cause was that the public network of node 2 (the network used for external services) had no traffic, and the server was then manually restarted. The analysis process is as follows.

1. Fault record

First, the fault record is posted here to help readers quickly understand the cause and effect of the whole process:

No.  Time      Phenomenon
1    8:00:00   Public network traffic on node 2 suddenly drops to 0, but the link is not down; this lasts until the node 2 server is shut down.
2    9:39:59   The node 2 server is shut down; investigation shows the command used was "init 0".
3    9:40:18   Node 1 detects a problem with the heartbeat of node 2.
4    9:40:32   Node 2 is removed from the cluster (CRS).
5    9:40:32   CRS is reconfigured; only node 1 remains in the cluster.
6    9:40:36   The VIP addresses that originally belonged to node 2 are migrated and registered on node 1.
7    9:45:13   The node 2 server starts up.
8    9:49:42   CRS starts on node 2.
9    9:50:24   CRS is reconfigured again; node 2 is added back to the cluster.
10   9:51:10   CRS on node 2 is shut down again.
11   10:47:45  CRS on node 2 is restarted and node 2 rejoins the cluster.

2. Detailed analysis process
1. RAC basic information

Node  Host Name  Public IP / VIP (eth2)                Private IP (eth3)
1     XXXXdb7    10.X.XX.80 / 10.X.XX.82               192.XX.XX.80
2     XXXXdb8    10.X.XX.81 / 10.X.XX.83 / 10.X.XX.84  192.XX.XX.81
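As a side note, this network layout can be cross-checked directly from the clusterware. A minimal sketch, assuming an 11.2 Grid Infrastructure environment and that the commands are run from the grid home:

# Interfaces the clusterware uses for public and private (interconnect) traffic
oifcfg getif
# VIP configuration of the node applications
srvctl config nodeapps
# Current placement of every clusterware resource, including the VIPs
crsctl status resource -t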
2. Log information of node 1 (XXXXdb7)

2.1 OS messages of node 1 (XXXXdb7):
Jun 30 04:03:06 XXXXdb7 syslogd 1.4.1: restart.
Jul 1 09:40:36 XXXXdb7 avahi-daemon[9246]: Registering new address record for 10.X.XX.83 on eth2.
Jul 1 09:40:36 XXXXdb7 avahi-daemon[9246]: Withdrawing address record for 10.X.XX.83 on eth2.
Jul 1 09:40:36 XXXXdb7 avahi-daemon[9246]: Registering new address record for 10.X.XX.83 on eth2.
Jul 1 09:40:36 XXXXdb7 avahi-daemon[9246]: Withdrawing address record for 10.X.XX.83 on eth2.
Jul 1 09:40:36 XXXXdb7 avahi-daemon[9246]: Registering new address record for 10.X.XX.83 on eth2.
Jul 1 09:40:36 XXXXdb7 avahi-daemon[9246]: Registering new address record for 10.X.XX.84 on eth2.
Jul 1 09:40:36 XXXXdb7 avahi-daemon[9246]: Withdrawing address record for 10.X.XX.84 on eth2.
Analysis:
(1) At 09:40:36, the two VIP addresses 10.X.XX.83 and 10.X.XX.84, which belong to node 2, were moved to the eth2 NIC of node 1. This at least indicates that the CRS of node 2 was down.
(2) There is no software or hardware error in the log.
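The VIP relocation described in point (1) can also be confirmed from the OS and the clusterware on node 1. A hedged sketch (interface name eth2 taken from the table above):

# All addresses bound to eth2 on node 1; after the failover, node 2's VIPs
# 10.X.XX.83 and 10.X.XX.84 should show up here as additional addresses
ip addr show eth2
# Rough filter showing where each VIP resource is currently hosted
crsctl status resource -t | grep -A 2 vip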
2.2 CRS alert log of node 1 (XXXXdb7):

2013-07-01 09:40:18.629
[cssd(9742)]CRS-1612:Network communication with node XXXXdb8 (2) missing for 50% of timeout interval. Removal of this node from cluster in 14.290 seconds
2013-07-01 09:40:25.663
[cssd(9742)]CRS-1611:Network communication with node XXXXdb8 (2) missing for 75% of timeout interval. Removal of this node from cluster in 7.250 seconds
2013-07-01 09:40:30.673
[cssd(9742)]CRS-1610:Network communication with node XXXXdb8 (2) missing for 90% of timeout interval. Removal of this node from cluster in 2.240 seconds
2013-07-01 09:40:32.920
[cssd(9742)]CRS-1632:Node XXXXdb8 is being removed from the cluster in cluster incarnation 249439701
2013-07-01 09:40:32.940
[cssd(9742)]CRS-1601:CSSD Reconfiguration complete. Active nodes are XXXXdb7 .
2013-07-01 09:40:33.902
[crsd(10300)]CRS-5504:Node down event reported for node 'XXXXdb8'.
2013-07-01 09:40:44.683
[crsd(10300)]CRS-2773:Server 'XXXXdb8' has been removed from pool 'Generic'.
2013-07-01 09:40:44.684
[crsd(10300)]CRS-2773:Server 'XXXXdb8' has been removed from pool 'ora.XXXXc'.
2013-07-01 09:50:24.410
[cssd(9742)]CRS-1601:CSSD Reconfiguration complete. Active nodes are XXXXdb7 XXXXdb8.
2013-07-01 09:51:10.792
[cssd(9742)]CRS-1625:Node XXXXdb8, number 2, was manually shut down
2013-07-01 09:51:10.831
[cssd(9742)]CRS-1601:CSSD Reconfiguration complete. Active nodes are XXXXdb7 .
2013-07-01 10:47:45.660
[cssd(9742)]CRS-1601:CSSD Reconfiguration complete. Active nodes are XXXXdb7 XXXXdb8
Analysis:
(1) At 09:40:18.629, node 1 found that the heartbeat network to node 2 (XXXXdb8) was no longer reachable.
(2) After the heartbeat remained missing through the 50%, 75% and 90% timeout warnings, node 1 concluded that node 2 had failed and removed node 2 from the cluster.
(3) At 09:40:32.940 the cluster was reconfigured, and only one node (node 1) remained in the cluster.
(4) It is now clear that node 2 failed first.
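The "timeout interval" in the CRS-1610/1611/1612 messages above is the CSS misscount, which defaults to 30 seconds on Linux; it can be read from the running clusterware if you want to confirm the eviction window. A small sketch:

# Network heartbeat timeout (seconds) behind the CRS-161x countdown
crsctl get css misscount
# Disk heartbeat timeout, listed only for completeness
crsctl get css disktimeout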
3. Log information of node 2 (XXXXdb8)

3.1 OS messages of node 2 (XXXXdb8):

Jul 1 09:39:59 XXXXdb8 shutdown[8014]: shutting down for system halt
Jul 1 09:40:00 XXXXdb8 pcscd: winscard.c:304:SCardConnect() Reader E-Gate 0 0 Not Found
Jul 1 09:40:02 XXXXdb8 smartd[10533]: smartd received signal 15: Terminated
Jul 1 09:40:02 XXXXdb8 smartd[10533]: smartd is exiting (exit status 0)
Jul 1 09:45:13 XXXXdb8 syslogd 1.4.1: restart.
Analysis:
(1) At 09:39:59 the node 2 server was shut down, so it is now known that the server itself was shut down.
(2) The most important point is that before the server was shut down, it did not report any software- or hardware-related alarm, which indicates that the software and hardware were working normally.
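Whether the 09:39:59 halt was an orderly shutdown can also be cross-checked from the OS side on node 2, independently of Oracle. A minimal sketch using standard Linux tooling:

# Shutdown/reboot and runlevel history from wtmp; an orderly shutdown entry
# here matches the messages log above
last -x shutdown reboot | head
# Time of the most recent boot
who -b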
3.2 CRS alert log of node 2 (XXXXdb8):

2013-06-30 23:00:19.744
[ctssd(11038)]CRS-2409:The clock on host XXXXdb8 is not synchronous with the mean cluster time. No action has been taken as the Cluster Time Synchronization Service is running in observer mode.
2013-07-01 09:49:42.394
[ohasd(10803)]CRS-2112:The OLR service started on node XXXXdb8.
Analysis: At 09:49:42, CRS was started on node 2. Before this start there is no log of CRS being shut down and no CRS-related error. This is an important clue: it indicates that CRS was not shut down normally. Most likely the server was restarted (automatically or manually), so CRS never had the chance to write a shutdown entry into its own log.

4. At this point it can be determined that there are only three possible reasons for the server restart (reasons 2 and 3 are excluded after analysis):

1. The server was manually restarted.
2. The server load was too high, causing node 2 to go down or to be evicted. Careful analysis shows that node 2 shut down at 09:39:59, while node 1 did not lose the network heartbeat to node 2 until 09:40:18; this means node 2 was not evicted by node 1 and then restarted. Could the load on node 2 have been too high? The CPU, I/O and memory load on node 2 had been consistently very low and stable, so this should not have caused the server to restart.
3. A network fault caused node 2 to be evicted and restarted. This is ruled out by the same reasoning as above: node 2 shut down at 09:39:59, and node 1 only detected the loss of the network heartbeat at 09:40:18, so this factor can also be excluded.

An unexpected finding: the nmon data revealed an interesting problem. On node 2, eth1 (the heartbeat network) was normal, but on eth2 (the network providing external services) all traffic dropped to 0 at around 8:00 a.m., and memory usage started to be released. This indicates a problem on the public network, although the link was never down. A sketch of how this could be cross-checked from the sysstat archives follows below.

Summary: the server restart was most likely a manual restart.
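As mentioned above, besides nmon the public-interface traffic history can also be pulled from the sysstat archives, assuming the sysstat collector was running that morning (the file name sa01 for July 1 is an assumption based on the default naming):

# Per-interface throughput for July 1 from the sysstat archive;
# eth2 rx/tx dropping to zero around 08:00 would match the nmon observation
sar -n DEV -f /var/log/sa/sa01 | grep -E "IFACE|eth2"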
5. Verify whether the server was manually restarted

5.1 Check the history commands on the node 2 server to see whether a restart command exists:

957  init 0
958  su - grid
959  su - oracle
960  su - grid
961  su - grid
962  cd /u01/app/11.2.0/grid/bin
963  ./crsctl start crs
964  su - oracle
Analysis: There is indeed an "init 0" shutdown command in the recent history of the root user on node 2, and after the shutdown command there is also a command that starts CRS (./crsctl start crs). It is therefore confirmed that the server was shut down by a command.

3. Reasons for the manual shutdown

1. The public network ran into a problem at around 8:00 a.m., and users could not connect to the database.
2. At around 9:40 a.m., to solve this problem, the personnel involved shut down the server.
3. It took about 5 minutes to restart the server (shut down at 09:39:59, started again at 09:45:13).
4. To verify the network problem reported by the administrator, the recent history commands on node 1 were checked, and it was found that someone had been testing the network by pinging node 2 from node 1:
929  cat /etc/hosts
930  ping 10.X.XX.80
931  ping 10.X.XX.81
932  ping 10.X.XX.82
933  ping 10.X.XX.83
934  su - oracle
935  cd /home/
936  ls
937  su - grid
938  ssh 10.1.146.33
939  top
940  free -m
941  su - grid
942  vmstat 1 01
943  vmstat 1 10
944  su - grid
945  su - grid
946  su - oracle
947  su - oracle
948  su - grid
949  su - oracle
950  su - grid
951  cat /etc/hosts
952  ping 10.X.XX.81
953  ping 10.X.XX.83
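If a similar situation comes up again, the public-network check on node 1 could go a bit further than plain ping. A hedged sketch of commands that help distinguish "no traffic" from "link down" (interface name and addresses taken from the table at the beginning):

# Link state and negotiated speed of the public interface
ethtool eth2
# Interface statistics, including error and drop counters
ip -s link show eth2
# Reachability of node 2's public IP and VIPs
ping -c 3 10.X.XX.81
ping -c 3 10.X.XX.83
ping -c 3 10.X.XX.84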