This article introduces how to diagnose node restart problems in a RAC environment. It applies to 10gR2 and 11gR1.
First, we introduce the CRS processes that can cause a node restart.
1. ocssd: its main functions are node monitoring and group management, and it is one of the core processes of CRS. Node monitoring watches the health of the nodes in the cluster through a network heartbeat and a disk heartbeat; if a node continuously loses either heartbeat, it is evicted from the cluster, that is, the node restarts. A node restart triggered by group management is called node kill escalation (applicable only in 11gR1 and later versions); we will introduce it in detail in a later article. The restart must complete within the specified time (reboottime, usually 3 seconds).
Network heartbeat: the ocssd.bin process sends a network heartbeat to every node in the cluster over the private interconnect once per second to confirm that each node is alive. If a node keeps missing network heartbeats until the threshold misscount is reached (30 seconds by default, 600 seconds when third-party cluster management software is present), the cluster votes via the voting disk and the master node evicts the node that lost the network heartbeat, that is, the node restarts. If the cluster contains only two nodes and a split-brain occurs, the node with the smaller node number survives, even if the network problem is on that node.
Disk heartbeat: the ocssd.bin process writes the status of the current node to all voting disks (voting files) once per second; this is the disk heartbeat. If a node keeps missing the disk heartbeat until the threshold disktimeout is reached (generally 200 seconds), the node restarts itself to guarantee cluster consistency. In addition, CRS only requires that [N/2] + 1 voting disks be available, where N is the number of voting disks, generally an odd number. (A sketch for querying these thresholds with crsctl follows this list.)
2. oclsomon: this process monitors whether ocssd is hung. If ocssd.bin has performance problems, it restarts the node.
3. oprocd: this process appears only on Linux and UNIX systems when no third-party cluster management software is installed. If the operating system hangs, it restarts the node.
Note: all of the preceding processes are spawned by the script init.cssd.
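The thresholds mentioned above are CSS parameters that can be queried with crsctl. Below is a minimal sketch, assuming crsctl from the CRS home's bin directory is on the PATH; the exact output format varies by version:

import subprocess

# Query the CSS thresholds discussed above: misscount (network heartbeat),
# disktimeout (disk heartbeat), and reboottime.
for param in ("misscount", "disktimeout", "reboottime"):
    try:
        out = subprocess.run(
            ["crsctl", "get", "css", param],
            capture_output=True, text=True, check=True,
        ).stdout.strip()
        print(f"{param}: {out}")
    except (OSError, subprocess.CalledProcessError) as exc:
        print(f"could not query {param}: {exc}")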
Next, the information we usually collect to diagnose node restart problems:
1. Operating System Logs
2. <CRS home>/log/<node name>/cssd/ocssd.log
3. oprocd.log (/etc/oracle/oprocd/*.log.* or /var/opt/oracle/oprocd/*.log.*)
4. <CRS home>/log/<node name>/cssd/oclsomon.log
5. OSWatcher report
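It helps to gather the files above into one place before they are rotated or overwritten. A minimal sketch, assuming hypothetical values for CRS_HOME and the destination directory (adjust the paths for your platform):

import glob, os, shutil

CRS_HOME = "/u01/app/crs"                    # assumption: adjust to your CRS home
NODE = os.uname().nodename.split(".")[0]     # short host name
DEST = "/tmp/node_restart_diag"
os.makedirs(DEST, exist_ok=True)

candidates = [
    f"{CRS_HOME}/log/{NODE}/cssd/ocssd.log",
    f"{CRS_HOME}/log/{NODE}/cssd/oclsomon.log",
    "/var/log/messages",                     # OS log; location varies by platform
]
candidates += glob.glob("/etc/oracle/oprocd/*.log*")
candidates += glob.glob("/var/opt/oracle/oprocd/*.log*")

for path in candidates:
    if os.path.exists(path):
        shutil.copy2(path, DEST)
        print("collected:", path)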
Next, we discuss how to diagnose each kind of node restart.
1. Node restart caused by ocssd.
If the following messages appear in ocssd.log, the node was restarted because the network heartbeat was lost. Next, review network-related information, such as the operating system logs and the OSWatcher report (traceroute output), to determine whether there is a problem at the network layer (cluster interconnect) and find the root cause.
[CSSD] 23:56:18.749 [3086] >WARNING: clssnmPollingThread: node <node_name> at 50% heartbeat fatal, eviction in 14.494 seconds
[CSSD] 23:56:25.749 [3086] >WARNING: clssnmPollingThread: node <node_name> at 75% heartbeat fatal, eviction in 7.494 seconds
[CSSD] 23:56:32.749 [3086] >WARNING: clssnmPollingThread: node <node_name> at 90% heartbeat fatal, eviction in 0.494 seconds
[CSSD] 23:56:33.243 [3086] >TRACE: clssnmPollingThread: Eviction started for node <node_name>, flags 0x040d, state 3, wt4c 0
[CSSD] 23:56:33.243 [3086] >TRACE: clssnmDiscHelper: <node_name>, node(4) connection failed, con (1128a5530), probe(0)
[CSSD] 23:56:33.243 [3086] >TRACE: clssnmDiscHelper: node 4 clean up, con (1128a5530), init state 5, cur state 5
[CSSD] 23:56:33.243 [3600] >TRACE: clssnmDoSyncUpdate: Initiating sync 196446491
[CSSD] 23:56:33.243 [3600] >TRACE: clssnmDoSyncUpdate: diskTimeout set to (27000)ms
Note: if the timestamp of these messages in the master node's ocssd.log is later than the time the node actually restarted, then the restart was not caused by loss of the network heartbeat.
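To locate these countdown warnings quickly in a large ocssd.log, a simple scan is enough. A minimal sketch that matches the message format shown above (the file path is a placeholder):

import re

# Matches the clssnmPollingThread countdown warnings shown in the excerpt.
PATTERN = re.compile(r"node (\S+) at (\d+)% heartbeat fatal, "
                     r"eviction in ([\d.]+) seconds")

def scan_network_heartbeat(path):
    with open(path, errors="replace") as f:
        for line in f:
            m = PATTERN.search(line)
            if m:
                node, pct, secs = m.groups()
                print(f"node {node}: {pct}% missed, eviction in {secs}s")

scan_network_heartbeat("ocssd.log")   # path to the master node's ocssd.log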
If the following errors appear in ocssd.log, the node was restarted because the disk heartbeat was lost. Next, review the operating system logs and the OSWatcher report (iostat output) to determine whether there is a problem at the I/O level and find the root cause.
18:34:37.423: [CSSD][150477728] clssnmvDiskOpen: Opening /dev/sdb8
18:34:37.423: [CLSF][150477728] Opened hdl:0xf4336530 for dev:/dev/sdb8:
18:34:37.429: [SKGFD][150477728] ERROR: -9(Error 27072, OS Error (Linux Error: 5: Input/output error
Additional information: 4
Additional information: 720913
Additional information: -1)
)
18:34:37.429: [CSSD][150477728] (:CSSNM00060:) clssnmvReadBlocks: read failed at offset 17 of /dev/sdb8
18:34:38.205: [CSSD][4110736288] (:CSSNM00058:) clssnmvDiskCheck: No I/O completions for 200880 ms for voting file /dev/sdb8)
18:34:38.206: [CSSD][4110736288] (:CSSNM00018:) clssnmvDiskCheck: Aborting, 0 of 1 configured voting disks available, need 1
18:34:38.206: [CSSD][4110736288] ###################################
18:34:38.206: [CSSD][4110736288] clssscExit: CSSD aborting from thread clssnmvDiskPingMonitorThread
18:34:38.206: [CSSD][4110736288] ###################################
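The same kind of scan works for disk heartbeat problems. A minimal sketch that flags the voting disk symptoms shown in the excerpt above (the path is a placeholder):

# Substrings taken from the disk heartbeat failure messages above.
SYMPTOMS = (
    "clssnmvReadBlocks",                    # read failure against a voting disk
    "No I/O completions",                   # disk heartbeat I/O not completing
    "configured voting disks available",    # quorum check failing
)

def scan_disk_heartbeat(path):
    with open(path, errors="replace") as f:
        for lineno, line in enumerate(f, 1):
            if any(s in line for s in SYMPTOMS):
                print(f"{lineno}: {line.strip()}")

scan_disk_heartbeat("ocssd.log")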
2. Node restart caused by oclsomon.
If an error appears in oclsomon.log, the node was restarted because the ocssd process hung. Since the ocssd process runs at real-time (RT) priority, such a hang usually points to resource contention (for example, CPU) at the operating system level. Next, review the operating system logs and the OSWatcher report (vmstat and top output) to determine the root cause.
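When reading the OSWatcher vmstat output, you are looking for CPU starvation around the restart time. A minimal sketch under the assumption that the archive uses the usual OSWatcher layout ("zzz" timestamp markers followed by Linux vmstat lines); the file name and thresholds are hypothetical:

def scan_vmstat(path, runq_limit=8, idle_limit=5):
    stamp = "?"
    with open(path, errors="replace") as f:
        for line in f:
            if line.startswith("zzz"):       # OSWatcher snapshot timestamp marker
                stamp = line.strip()
            else:
                cols = line.split()
                # Linux vmstat layout: column 0 = run queue (r), column 14 = idle (id)
                if len(cols) >= 16 and cols[0].isdigit() and cols[14].isdigit():
                    runq, idle = int(cols[0]), int(cols[14])
                    if runq > runq_limit or idle < idle_limit:
                        print(f"{stamp}: run queue={runq}, idle={idle}%")

scan_vmstat("oswvmstat/node1_vmstat_12.03.02.1800.dat")   # hypothetical file name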
3. Node restart caused by oprocd.
If the following message appears in the oprocd log, the node restart was triggered by the oprocd process.
Dec 21 16:15:30.369857 | LASTGASP | AlarmHandler: timeout(2312 msec) exceeds interval(1000 msec)+margin(500 msec). Rebooting NOW.
Because the oprocd process checks the system time to decide whether the operating system has hung, correctly configuring NTP (or other time synchronization software) and setting diagwait=13 can avoid this kind of node restart. In addition, if you need to make a large change to the system time, we recommend stopping CRS first, making the change, and then restarting CRS. Of course, we cannot rule out an oprocd-initiated restart caused by a genuine operating system hang, so you should still check the OSWatcher report (vmstat and top output) to determine the final cause.
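The log line above also shows the arithmetic oprocd performs: it sleeps for the interval and reboots the node if it wakes up more than the margin too late. A minimal model of that check; the 10000 ms margin is our reading of the effect of diagwait=13 per Note 265769.1 (referenced below):

def oprocd_would_reboot(elapsed_ms, interval_ms=1000, margin_ms=500):
    # oprocd sleeps for interval_ms and, on waking, reboots the node
    # if it overslept by more than margin_ms.
    return elapsed_ms > interval_ms + margin_ms

# The values from the log excerpt above:
print(oprocd_would_reboot(2312))                    # True: node reboots
print(oprocd_would_reboot(2312, margin_ms=10000))   # False: survives with diagwait=13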
This article only introduces the general approach to diagnosing node restart problems; it must be applied flexibly to real cases.
For more information, read the following MOS articles.
Note 265769.1: Troubleshooting 10g and 11.1 Clusterware Reboots
Note 1050693.1: Troubleshooting 11.2 Clusterware Node Evictions (Reboots)