Why a node in a RAC database is "down"

Source: Internet
Author: User

A colleague from the analysis and optimization team received a notification that a RAC database in Beijing had "crashed" on the morning of July 1. It was not clear whether all of the machines were down or only one of the nodes, so the problem had to be troubleshot. After analyzing the RAC logs and the nmon data, it turned out to be a "pseudo-downtime": the real cause was that the external service (public) network of node 2 lost all traffic, after which the server was manually restarted. The analysis process is as follows (a sketch of how the nmon network data can be inspected follows the timeline below).

I. Fault record

First, the fault timeline, to help readers quickly understand the cause and effect of the incident:

Seq  Time      Phenomenon
1    8:00:00   Public network traffic of node 2 suddenly drops to 0. There is no disconnection, and the situation persists until the node 2 server is shut down.
2    9:39:59   The node 2 server is shut down. Investigation later shows the command used was "init 0".
3    9:40:18   Node 1 detects a problem with the network heartbeat to node 2.
4    9:40:32   Node 2 is removed from CRS.
5    9:40:32   CRS is reconfigured; only node 1 remains in the cluster.
6    9:40:36   The VIP addresses that originally belonged to node 2 are migrated to and registered on node 1.
7    9:45:13   The node 2 server starts up.
8    9:49:42   CRS starts on node 2.
9    9:50:24   CRS reconfiguration completes and node 2 rejoins the cluster.
10   9:51:10   CRS on node 2 is shut down again (manually, per the CRS alert log).
11   10:47:45  CRS on node 2 restarts and rejoins the cluster.
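Since the conclusion leans heavily on nmon, here is a minimal sketch of how the eth2 traffic drop could be confirmed from a raw nmon capture file. The file name node2_0701.nmon and the column numbers are assumptions for this environment; nmon writes one CSV file per collection run, with NET lines carrying per-interface read/write KB/s and ZZZZ lines mapping the snapshot labels to wall-clock times.

  # Header row of the NET section: identify which columns hold the eth2 read/write KB/s
  grep "^NET," node2_0701.nmon | head -1
  # Per-interval values (the column numbers 2, 4 and 5 are hypothetical; adjust to the header above)
  grep "^NET," node2_0701.nmon | cut -d, -f2,4,5
  # Map snapshot labels (T0001, T0002, ...) to timestamps to locate when traffic fell to 0
  grep "^ZZZZ," node2_0701.nmon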

II. Detailed analysis process

1. RAC basic information

Node  Host Name  Public IP / VIP (eth2)                Priv IP (eth3)
1     XXXXdb7    10.X.XX.80 / 10.X.XX.82               192.XX.XX.80
2     XXXXdb8    10.X.XX.81 / 10.X.XX.83 / 10.X.XX.84  192.XX.XX.81
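For reference, the node list, VIPs and interconnect layout shown above can be confirmed with the usual 11.2 Grid Infrastructure commands; a minimal sketch, assuming they are run as the grid owner with the Grid home binaries in the PATH:

  olsnodes -n -i            # node names, node numbers and the VIPs registered for them
  oifcfg getif              # which interface is public and which is the cluster interconnect
  srvctl config nodeapps    # VIP / network configuration registered in CRS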

 

2. Log information of node 1 (XXXXdb7)

2.1 OS messages of node 1 (XXXXdb7):
Jun 30 04:03:06 XXXXdb7 syslogd 1.4.1: restart.
Jul  1 09:40:36 XXXXdb7 avahi-daemon[9246]: Registering new address record for 10.X.XX.83 on eth2.
Jul  1 09:40:36 XXXXdb7 avahi-daemon[9246]: Withdrawing address record for 10.X.XX.83 on eth2.
Jul  1 09:40:36 XXXXdb7 avahi-daemon[9246]: Registering new address record for 10.X.XX.83 on eth2.
Jul  1 09:40:36 XXXXdb7 avahi-daemon[9246]: Withdrawing address record for 10.X.XX.83 on eth2.
Jul  1 09:40:36 XXXXdb7 avahi-daemon[9246]: Registering new address record for 10.X.XX.83 on eth2.
Jul  1 09:40:36 XXXXdb7 avahi-daemon[9246]: Registering new address record for 10.X.XX.84 on eth2.
Jul  1 09:40:36 XXXXdb7 avahi-daemon[9246]: Withdrawing address record for 10.X.XX.84 on eth2.
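(For reference, a minimal sketch of the filter used to pull these lines from node 1, assuming the standard /var/log/messages location:)

  grep "avahi-daemon" /var/log/messages | grep "eth2" | grep "Jul  1 09:40"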

 

Analysis: (1) At 09:40:36, the two VIP addresses belonging to node 2 (10.X.XX.83 and 10.X.XX.84) were moved to the eth2 NIC of node 1, which at least indicates that CRS on node 2 was down by that time. (2) There is no software or hardware error in the log.

2.2 CRS alert log of node 1 (XXXXdb7):

2013-07-01 09:40:18.629
[cssd(9742)]CRS-1612:Network communication with node XXXXdb8 (2) missing for 50% of timeout interval. Removal of this node from cluster in 14.290 seconds
2013-07-01 09:40:25.663
[cssd(9742)]CRS-1611:Network communication with node XXXXdb8 (2) missing for 75% of timeout interval. Removal of this node from cluster in 7.250 seconds
2013-07-01 09:40:30.673
[cssd(9742)]CRS-1610:Network communication with node XXXXdb8 (2) missing for 90% of timeout interval. Removal of this node from cluster in 2.240 seconds
2013-07-01 09:40:32.920
[cssd(9742)]CRS-1632:Node XXXXdb8 is being removed from the cluster in cluster incarnation 249439701
2013-07-01 09:40:32.940
[cssd(9742)]CRS-1601:CSSD Reconfiguration complete. Active nodes are XXXXdb7 .
2013-07-01 09:40:33.902
[crsd(10300)]CRS-5504:Node down event reported for node 'XXXXdb8'.
2013-07-01 09:40:44.683
[crsd(10300)]CRS-2773:Server 'XXXXdb8' has been removed from pool 'Generic'.
2013-07-01 09:40:44.684
[crsd(10300)]CRS-2773:Server 'XXXXdb8' has been removed from pool 'ora.XXXXc'.
2013-07-01 09:50:24.410
[cssd(9742)]CRS-1601:CSSD Reconfiguration complete. Active nodes are XXXXdb7 XXXXdb8.
2013-07-01 09:51:10.792
[cssd(9742)]CRS-1625:Node XXXXdb8, number 2, was manually shut down
2013-07-01 09:51:10.831
[cssd(9742)]CRS-1601:CSSD Reconfiguration complete. Active nodes are XXXXdb7 .
2013-07-01 10:47:45.660
[cssd(9742)]CRS-1601:CSSD Reconfiguration complete. Active nodes are XXXXdb7 XXXXdb8
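A note on the countdown above: it is driven by the CSS misscount, the network heartbeat timeout (30 seconds by default on Linux). 09:40:18 plus the remaining 14.29 seconds lands almost exactly on the node removal at 09:40:32. The configured value can be checked with crsctl, run from the Grid home as root or the grid owner:

  crsctl get css misscount    # prints the network heartbeat timeout in seconds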

 

Analysis: (1) At 09:40:18.629, node 1 found that the network heartbeat to node 2 (XXXXdb8) was no longer reachable. (2) After three consecutive heartbeat warnings (50%, 75% and 90% of the timeout), node 1 concluded that node 2 had failed and removed it from the cluster. (3) At 09:40:32.940 the cluster was reconfigured, with only one node (node 1) remaining. (4) So at this point it is already clear that node 2 failed first.

3. Log information of node 2 (XXXXdb8)

3.1 OS messages of node 2 (XXXXdb8):
Jul  1 09:39:59 XXXXdb8 shutdown[8014]: shutting down for system halt
Jul  1 09:40:00 XXXXdb8 pcscd: winscard.c:304:SCardConnect() Reader E-Gate 0 0 Not Found
Jul  1 09:40:02 XXXXdb8 smartd[10533]: smartd received signal 15: Terminated
Jul  1 09:40:02 XXXXdb8 smartd[10533]: smartd is exiting (exit status 0)
Jul  1 09:45:13 XXXXdb8 syslogd 1.4.1: restart.
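The shutdown and boot times above can be cross-checked against the wtmp records on node 2; a minimal sketch, assuming a standard Linux login accounting setup:

  last -x shutdown reboot | head    # shutdown and reboot records from wtmp
  who -b                            # time of the most recent system boot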

 

Analysis: (1) At 09:39:59 the node 2 server was shut down, so at this point it is known that the server itself went down. (2) The most important thing to note is that before the shutdown, the server did not report any software- or hardware-related alarm, indicating that the software and hardware were healthy.

3.2 CRS alert log of node 2 (XXXXdb8):
2013-06-30 23:00:19.744
[ctssd(11038)]CRS-2409:The clock on host XXXXdb8 is not synchronous with the mean cluster time. No action has been taken as the Cluster Time Synchronization Service is running in observer mode.
2013-07-01 09:49:42.394
[ohasd(10803)]CRS-2112:The OLR service started on node XXXXdb8.
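For readers reproducing this check: on 11.2 the clusterware alert log normally sits under the Grid home at log/<hostname>/alert<hostname>.log. Assuming the Grid home /u01/app/11.2.0/grid that appears later in the shell history, the node 2 log would be read with something like:

  tail -100 /u01/app/11.2.0/grid/log/XXXXdb8/alertXXXXdb8.log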

 

Analysis: At 09:49:42, CRS was started on node 2. Before this startup there is no CRS shutdown message and no CRS-related error at all. This is significant: it indicates that CRS was not shut down cleanly. Most likely the server was restarted, automatically or manually, so abruptly that CRS never had a chance to write its shutdown messages to the log.

4. At this point, there are basically only three possible reasons for the server restart (points (2) and (3) are excluded after analysis):

(1) The server was restarted manually.

(2) The server load was too high, so node 2 hung or was evicted by node 1. Careful analysis shows that node 2 shut down at 09:39:59, while node 1 only lost the network heartbeat to node 2 at 09:40:18; this means node 2 was not evicted by node 1 and then restarted. Could the load on node 2 have been too high? The monitoring data shows that CPU, I/O and memory load on node 2 had been consistently very low and stable, so load could not have caused the restart (a sketch of a sysstat-based load check follows this section).

(3) A network fault caused node 2 to be evicted and restarted. The same reasoning as above applies: node 2 shut down at 09:39:59 and node 1 only detected the heartbeat loss at 09:40:18, so this factor is ruled out as well.

An unexpected finding: the nmon data reveals an interesting detail. Eth1 (the heartbeat network) was normal, but on eth2 (the network providing external services) of node 2 all traffic dropped to 0 in the morning, and memory started to be released at the same time. This indicates a public network problem on node 2, even though the interface was never reported as disconnected.

Summary: the server restart was most likely a manual restart.
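The "load was very low and stable" conclusion above comes from the nmon data; where sysstat happens to be collecting, the same question can also be answered from its daily files. A minimal sketch, assuming sysstat is installed and /var/log/sa/sa01 is the July 1 file:

  sar -q -f /var/log/sa/sa01      # run-queue length and load averages
  sar -r -f /var/log/sa/sa01      # memory utilisation
  sar -n DEV -f /var/log/sa/sa01  # per-interface traffic (the eth2 drop to 0 would show up here)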
5. Verify whether the server was manually restarted

5.1 Check the history commands on the node 2 server for any restart or shutdown command:

  957  init 0
  958  su - grid
  959  su - oracle
  960  su - grid
  961  su - grid
  962  cd /u01/app/11.2.0/grid/bin
  963  ./crsctl start crs
  964  su - oracle

 

Analysis: there is indeed an "init 0" shutdown command in the recent commands of the root user on node 2, followed later by a CRS startup command. This confirms that the server was shut down by a command.

III. Reasons for the manual shutdown

1. The public network of node 2 ran into a problem in the morning, and users could not connect to the database.
2. To deal with this problem, the personnel involved shut down the server in the morning.
3. It took about 5 minutes to restart the server.

IV. To verify the administrator's report of network problems, the recent history commands on node 1 were examined; they show that someone had been testing the network by pinging node 2 from node 1:
  929  cat /etc/hosts
  930  ping 10.X.XX.80
  931  ping 10.X.XX.81
  932  ping 10.X.XX.82
  933  ping 10.X.XX.83
  934  su - oracle
  935  cd /home/
  936  ls
  937  su - grid
  938  ssh 10.1.146.33
  939  top
  940  free -m
  941  su - grid
  942  vmstat 1 01
  943  vmstat 1 10
  944  su - grid
  945  su - grid
  946  su - oracle
  947  su - oracle
  948  su - grid
  949  su - oracle
  950  su - grid
  951  cat /etc/hosts
  952  ping 10.X.XX.81
  953  ping 10.X.XX.83
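One limitation of plain bash history, visible in both listings above, is that it carries no timestamps, so the ping tests cannot be placed on the timeline precisely. A minimal sketch of how timestamped history could be enabled for future troubleshooting (bash-specific; the format string is only an example):

  export HISTTIMEFORMAT="%F %T "   # prepend date and time to entries shown by 'history'
  history | tail -5                # commands run after this point will show when they were executed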

 

 
