Fault Analysis of Cascading Restarts of RAC Cluster Nodes


Author: skate
Time: 2012/07/16

 

I came across an article I wrote while troubleshooting, and I am recording it here so that it can be found by search later.

 


 

Environment:
OS: Linux
DB: Oracle RAC 10g + ocfs2

 

A RAC database environment actually contains two clusters: the clusterware cluster and the instance cluster. They generally interact in one of the following ways:

 

1. If clusterware detects a cluster fault first, it reorganizes the cluster directly: the surviving nodes lock the journal of the dead node and recover it. Once clusterware has finished reorganizing, it notifies the upper-layer instance cluster, which then reorganizes itself into a new stable state.

2. If the instance cluster detects a cluster fault first, RAC stops providing external services and notifies the clusterware layer to reconstruct the cluster and reach a new stable state. Once that reconstruction is complete, clusterware notifies the instance layer, and RAC then begins its own reorganization. However, if clusterware cannot complete the reconstruction, RAC reconstructs the cluster itself through the IMR (Instance Membership Recovery) mechanism to reach a new stable state.


General cause of cascading restarts in RAC clusters
Restarting one node of the primary database causes the voting disk to hang. The other nodes then cannot access the voting disk, which causes their ocssd processes to fail. Clusterware detects this as a new cluster failure and reorganizes the cluster again to reach a new stable state.

Basis for the adjustment
The other nodes keep restarting because the voting disk hangs and does not respond for a long time.

Which parameters can trigger a restart when a disk hangs?
Clusterware cluster: o2cb_heartbeat_threshold of o2cb. Each node updates the heartbeat system file on disk every two seconds to show that it is alive. If a node fails to do so for longer than the threshold allows, it restarts.
RAC cluster: the disktimeout parameter of the voting disk, 200 s by default. If voting disk I/O does not complete within this time, the node restarts.

The multipath software used on Linux in this environment is device-mapper-multipath.

To avoid cascading node restarts, you can raise the dead threshold of clusterware. The formulas are as follows (10.2.0.2 or later):

o2cb_heartbeat_threshold >= (max(hw_storage_timeout, sw_storage_timeout) / 2) + 1

disktimeout > max((o2cb_heartbeat_threshold - 1) * 2, hw_storage_timeout, sw_storage_timeout)

Therefore, o2cb_heartbeat_threshold is raised from 31 to 61 (the timeout increases from 60 seconds to 120 seconds), which gives the voting disk enough time to recover and avoids restarting nodes by mistake.
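
The arithmetic behind these numbers can be checked with a small sketch. It only restates the two formulas above. The function names are made up for illustration, the storage timeouts of 120 s and 60 s are hypothetical example values (the article does not give the real ones), and rounding up with ceil for odd timeouts is an assumption.

```python
import math


def min_heartbeat_threshold(hw_storage_timeout, sw_storage_timeout):
    """Smallest o2cb_heartbeat_threshold allowed by the first formula.

    Rounding up with ceil is an assumption for timeouts that are not even.
    """
    worst = max(hw_storage_timeout, sw_storage_timeout)
    return math.ceil(worst / 2) + 1


def heartbeat_timeout_seconds(threshold):
    """Disk hang tolerated in seconds: one heartbeat every 2 s."""
    return (threshold - 1) * 2


def disktimeout_lower_bound(threshold, hw_storage_timeout, sw_storage_timeout):
    """disktimeout must be strictly greater than this value (second formula)."""
    return max(heartbeat_timeout_seconds(threshold),
               hw_storage_timeout, sw_storage_timeout)


if __name__ == "__main__":
    # Hypothetical storage timeouts, in seconds.
    hw, sw = 120, 60
    threshold = min_heartbeat_threshold(hw, sw)
    print("threshold:", threshold)
    print("tolerated hang (s):", heartbeat_timeout_seconds(threshold))
    print("disktimeout must exceed (s):", disktimeout_lower_bound(threshold, hw, sw))
```

Under these example values, the default disktimeout of 200 s already satisfies the second formula, so only o2cb_heartbeat_threshold would need to change.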

The misscount parameter is not adjusted for now, because the restart logs did not directly point to a network cause. In offline testing, we found that a sudden problem in the ocfs2 file system reproduces log messages similar to those from the production restarts. Whether misscount also needs to be adjusted will be decided by observing the system after this change.

Steps to adjust o2cb_heartbeat_threshold
0. Stop all services connected to the database.
1. Stop CRS on all nodes.
2. Stop the ocfs2 service.
3. Change the o2cb_heartbeat_threshold parameter on all nodes.
4. Restart the o2cb service on all nodes, then start ocfs2 and the CRS service.
5. Test whether the application works normally.
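
Step 3 can be sketched as a small text edit. The article does not say where the parameter is stored. This sketch assumes it is the O2CB_HEARTBEAT_THRESHOLD entry in the o2cb configuration file (commonly /etc/sysconfig/o2cb). The function name is hypothetical, and the function only transforms the file's text. It does not stop or start any service, so the other steps still have to be done by hand on every node.

```python
import re

# Assumed variable name as stored in the o2cb configuration file.
_KEY = "O2CB_HEARTBEAT_THRESHOLD"
_PATTERN = re.compile(r"^[ \t]*" + _KEY + r"[ \t]*=.*$", re.MULTILINE)


def set_heartbeat_threshold(config_text, value):
    """Return config_text with the heartbeat threshold set to value.

    An existing uncommented entry is replaced; otherwise a new entry
    is appended at the end of the text.
    """
    new_line = f"{_KEY}={value}"
    if _PATTERN.search(config_text):
        return _PATTERN.sub(new_line, config_text)
    # No existing entry: append one, adding a newline separator if needed.
    separator = "" if (not config_text or config_text.endswith("\n")) else "\n"
    return config_text + separator + new_line + "\n"


if __name__ == "__main__":
    # Hypothetical file content, for illustration only.
    sample = "O2CB_ENABLED=true\nO2CB_HEARTBEAT_THRESHOLD=31\n"
    print(set_heartbeat_threshold(sample, 61), end="")
```

Keeping the old file text makes the rollback trivial: call the same function again with the previous value of 31.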

Impact
1. The database is unavailable to external services while the change is made.
2. The stability of the RAC cluster is not affected, and no data is lost.

If a problem is found, you only need to roll the parameter back to its previous value.

 

References
[ID 395878.1] [ID 457423.1] [ID 391771.1] [ID 294430.1]

--- End ---
