An article written for an earlier troubleshooting case turned out to be relevant again while analyzing node cascade restart failures in a RAC cluster, so it is recorded here for future reference.

RAC cluster node cascade restart: fault analysis

Environment: OS: Linux; DB: RAC 10g + OCFS2

A RAC database environment actually contains two clusters: the Clusterware cluster and the instance cluster. They work roughly as follows:
1. If Clusterware discovers a cluster fault first, it reorganizes the cluster directly; the surviving nodes lock the journal of the dead node and recover it. After the Clusterware reorganization completes, it notifies the upper-layer instance cluster, which then reorganizes itself into a new stable state.
2. If the instance cluster discovers the cluster fault first, RAC stops providing external services and notifies the Clusterware layer to complete its cluster reconfiguration and reach a new stable state. Once the Clusterware layer has been reconfigured, the instance layer is notified and RAC starts its own reconfiguration. If Clusterware cannot complete the reconfiguration, RAC falls back on the IMR (Instance Membership Recovery) mechanism to rebuild the cluster and reach a new stable state.

RAC cluster cascade restart: in general the cause is that the restart of one node leaves the voting disk hung, so the other nodes cannot access it; the ocssd process then fails, Clusterware detects a new cluster fault, and the cluster is reorganized toward a new stable state. Because the voting disk stays unresponsive for a long time, the remaining nodes keep restarting one after another.

Parameters that may cause a Clusterware node to restart because of a disk hang (a command-level sketch for checking the current values follows at the end of this section):
1. O2CB_HEARTBEAT_THRESHOLD of o2cb: OCFS2 updates its heartbeat system file (a file on disk) every two seconds to show that the node is alive; if the threshold is exceeded, the node is restarted.
2. The disktimeout parameter of the voting disk: the default is 200 s; if this threshold is exceeded, the node is restarted.
3. The multipath software used on Linux, device-mapper-multipath, whose path failover time determines how long disk I/O may hang.

To avoid node cascade restarts, you can increase the disk heartbeat dead threshold of the cluster so that a temporary disk hang does not trigger a restart. For 10.2.0.2 or later, the values should satisfy the following formulas:

O2CB_HEARTBEAT_THRESHOLD >= (max(HW_STORAGE_TIMEOUT, SW_STORAGE_TIMEOUT) / 2) + 1
disktimeout >= max((O2CB_HEARTBEAT_THRESHOLD - 1) * 2, HW_STORAGE_TIMEOUT, SW_STORAGE_TIMEOUT)

So O2CB_HEARTBEAT_THRESHOLD was adjusted from 31 to 61, which raises the OCFS2 disk heartbeat timeout from 60 seconds to 120 seconds ((31 - 1) * 2 = 60 s, (61 - 1) * 2 = 120 s, still below the 200 s disktimeout). The goal of this adjustment is to give the voting disk enough time to recover so that nodes are not evicted and restarted by mistake through the CSS misscount/disktimeout checks; the restart logs did not directly point to a network cause. In an offline test environment, simulating a sudden problem on the ocfs2 file system reproduced log information very similar to that seen when the production environment restarted. Check your own environment to decide whether this parameter needs to be adjusted.

Steps to adjust O2CB_HEARTBEAT_THRESHOLD (a command-level sketch follows below):
0. Stop all services connected to the database.
1. Stop CRS.
2. Stop the ocfs2 service on all nodes.
3. Modify the O2CB_HEARTBEAT_THRESHOLD parameter on all nodes.
4. Restart the o2cb service on all nodes, start ocfs2, and start the CRS service.
5. Test whether the application works normally.

Impact of the change:
1. It affects the time during which the database can provide external services (the change requires an outage).
2. It does not affect the stability of the RAC cluster and does not cause data loss.
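Before making any change, it helps to record the current timeout settings on every node. The lines below are only a minimal sketch, assuming that O2CB_HEARTBEAT_THRESHOLD is kept in /etc/sysconfig/o2cb and that CRS_HOME (a placeholder used here) points to the Oracle Clusterware home; the crsctl css queries only work on sufficiently recent 10.2 patch levels, so treat them as optional.

# current OCFS2 heartbeat dead threshold (number of 2-second intervals)
grep O2CB_HEARTBEAT_THRESHOLD /etc/sysconfig/o2cb
# current CSS timeouts of the Clusterware cluster (run as root)
$CRS_HOME/bin/crsctl get css misscount
$CRS_HOME/bin/crsctl get css disktimeout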
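The adjustment steps above can be sketched as follows. This is an illustrative outline rather than a verified script: it assumes the stock OCFS2 init scripts (/etc/init.d/o2cb and /etc/init.d/ocfs2), assumes the parameter lives in /etc/sysconfig/o2cb, and uses <dbname> and CRS_HOME as placeholders; run the commands on all nodes unless noted otherwise and confirm the exact sequence against the MOS notes listed below.

# 0. stop applications and database services that depend on the cluster
srvctl stop database -d <dbname>
# 1. stop CRS (as root, on every node)
$CRS_HOME/bin/crsctl stop crs
# 2. unmount OCFS2 volumes and take the o2cb cluster stack offline
/etc/init.d/ocfs2 stop
/etc/init.d/o2cb offline
/etc/init.d/o2cb unload
# 3. raise the dead threshold from 31 to 61 in the o2cb configuration
sed -i 's/^O2CB_HEARTBEAT_THRESHOLD=.*/O2CB_HEARTBEAT_THRESHOLD=61/' /etc/sysconfig/o2cb
# 4. bring the o2cb stack, the OCFS2 mounts and CRS back up
/etc/init.d/o2cb load
/etc/init.d/o2cb online
/etc/init.d/ocfs2 start
$CRS_HOME/bin/crsctl start crs
# 5. verify the new value and test the application
grep O2CB_HEARTBEAT_THRESHOLD /etc/sysconfig/o2cb

Note that the new threshold only takes effect when the o2cb stack is restarted; editing the file while the cluster stack stays online is not enough.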
The fix itself is just a parameter adjustment; if any abnormality is found, you can refer to [ID 395878.1], [ID 457423.1], [ID 391771.1], [ID 294430.1], [ID], [ID]. Author: skate