Fault Analysis of Cascading Restarts of RAC Cluster Nodes

Last Update:2018-12-04 Source: Internet

Author: User

Author: skate
Time: 2012/07/16

I came across some notes I wrote while troubleshooting an earlier fault and am recording them here so they are easy to find later.

Environment:
OS: Linux
DB: rac10g + ocfs2

A RAC database environment actually contains two clusters: the clusterware cluster and the instance cluster. They generally cooperate in one of the following ways:

1. If clusterware discovers a cluster fault first, it reorganizes the cluster directly; the surviving nodes lock the dead node's journal and recover it. Once clusterware has finished reorganizing, it notifies the upper-layer instance cluster, which then reorganizes itself into a new stable state.

2. If the instance cluster discovers the fault first, RAC stops providing external services and notifies the clusterware layer to rebuild the cluster into a new stable state. Once clusterware has been rebuilt, it notifies the instance layer, and RAC then starts its own reconfiguration. However, if clusterware cannot complete the rebuild, RAC reconfigures the cluster itself through the IMR (Instance Membership Recovery) mechanism to reach a new stable state.

General cause of cascading restarts of RAC clusters
Restarting one node of the primary database causes the voting disk to hang, so the other nodes cannot access it and their ocssd processes fail. Clusterware then detects a new cluster failure and reorganizes the cluster into a new stable state.

Adjustment basis
The other nodes keep restarting because the hung voting disk does not respond for a long time.

Which parameters can trigger a restart when the disk hangs?
Clusterware cluster: o2cb_heartbeat_threshold of o2cb. Each node updates a system file on disk every two seconds to show that it is alive. If the threshold is exceeded, the node restarts.
RAC cluster: the disktimeout parameter of the voting disk, 200 s by default. If the voting disk does not respond within this time, the node restarts.
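
To make the two timers concrete, here is a minimal Python sketch. It assumes, as the "31 means 60 seconds" figure later in this article implies, that the o2cb timeout in seconds is (o2cb_heartbeat_threshold - 1) * 2. The function names and the 90-second hang are illustrative only and are not part of any Oracle or OCFS2 tool.

```python
from typing import List


def o2cb_timeout_seconds(heartbeat_threshold: int) -> int:
    """Seconds a node may miss disk heartbeats before o2cb restarts it.

    Assumes one heartbeat every 2 seconds, so a threshold of 31 gives 60 s.
    """
    return (heartbeat_threshold - 1) * 2


def timers_expired(hang_seconds: float, heartbeat_threshold: int,
                   disktimeout: int = 200) -> List[str]:
    """Return the layers whose timeout is exceeded by a disk hang of this length."""
    expired = []
    if hang_seconds > o2cb_timeout_seconds(heartbeat_threshold):
        expired.append("o2cb")
    if hang_seconds > disktimeout:
        expired.append("disktimeout")
    return expired


if __name__ == "__main__":
    # Hypothetical 90-second hang, compared under two threshold settings.
    for threshold in (31, 61):
        print(threshold, timers_expired(90, threshold))
```

Under these assumptions, a 90-second hang exceeds the o2cb timeout when the threshold is 31 but not when it is 61, and it stays below the default disktimeout in both cases.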

The multipath software used on Linux here is device-mapper-multipath.

To avoid cascading node restarts, you can raise the dead threshold of the clusterware layer. The formulas are as follows (10.2.0.2 or later):

o2cb_heartbeat_threshold >= (max(hw_storage_timeout, sw_storage_timeout) / 2) + 1

disktimeout > max((o2cb_heartbeat_threshold - 1) * 2, hw_storage_timeout, sw_storage_timeout)

Therefore, o2cb_heartbeat_threshold is raised from 31 to 61 (that is, from 60 seconds to 120 seconds), which gives the voting disk enough time to recover and avoids restarting nodes by mistake.
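
The two inequalities can be checked with a short Python sketch. The function names are hypothetical, and the 120-second storage timeouts in the example are assumed values chosen only because they reproduce the threshold of 61 used here; the article does not state the actual hardware or multipath timeouts.

```python
import math


def min_heartbeat_threshold(hw_storage_timeout: int, sw_storage_timeout: int) -> int:
    """Smallest integer threshold with threshold >= max(hw, sw) / 2 + 1."""
    return math.ceil(max(hw_storage_timeout, sw_storage_timeout) / 2) + 1


def disktimeout_is_valid(disktimeout: int, heartbeat_threshold: int,
                         hw_storage_timeout: int, sw_storage_timeout: int) -> bool:
    """Check disktimeout > max((threshold - 1) * 2, hw, sw)."""
    o2cb_seconds = (heartbeat_threshold - 1) * 2
    return disktimeout > max(o2cb_seconds, hw_storage_timeout, sw_storage_timeout)


if __name__ == "__main__":
    hw, sw = 120, 120  # assumed storage timeouts in seconds, not from the article
    threshold = min_heartbeat_threshold(hw, sw)
    print("threshold:", threshold)
    print("default disktimeout of 200 s valid:",
          disktimeout_is_valid(200, threshold, hw, sw))
```

With these assumed inputs the minimum threshold works out to 61, and the default disktimeout of 200 s still satisfies the second inequality, so only the o2cb parameter would need to change.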

The misscount parameter is not adjusted for now, because the restart logs did not directly point to a network cause. In offline testing, we found that a sudden problem in the ocfs2 file system reproduces log messages similar to those seen when the production environment restarted. Whether misscount also needs adjusting will be decided after observing the effect of this change.

To adjust o2cb_heartbeat_threshold
0. Stop all services connected to the database
1. Stop CRS on all nodes
2. Stop the ocfs2 service
3. Modify the o2cb_heartbeat_threshold parameter on all nodes
4. Restart the o2cb service on all nodes, start ocfs2, and then start the CRS service
5. Test whether the application works normally
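
Step 3 amounts to changing one assignment on every node. The Python sketch below assumes the value is stored in a shell-style configuration file (commonly /etc/sysconfig/o2cb on Red Hat-style systems) under the variable name O2CB_HEARTBEAT_THRESHOLD; verify both the path and the variable name on your own nodes before relying on this. The function name is hypothetical, and it works only on the text of the file so it can be tried without touching a live system.

```python
import re


def set_heartbeat_threshold(config_text: str, value: int) -> str:
    """Return config_text with O2CB_HEARTBEAT_THRESHOLD set to value.

    Replaces an existing assignment, or appends one if none is present.
    """
    line = f"O2CB_HEARTBEAT_THRESHOLD={value}"
    pattern = re.compile(r"^O2CB_HEARTBEAT_THRESHOLD=.*$", re.MULTILINE)
    if pattern.search(config_text):
        return pattern.sub(line, config_text)
    # No existing assignment: make sure the text ends with a newline, then append.
    if config_text and not config_text.endswith("\n"):
        config_text += "\n"
    return config_text + line + "\n"


if __name__ == "__main__":
    # Sample file content invented for illustration.
    sample = "O2CB_ENABLED=true\nO2CB_HEARTBEAT_THRESHOLD=31\n"
    print(set_heartbeat_threshold(sample, 61), end="")
```

Calling the same function with the previous value (31 in this case) produces the text needed to revert the change.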

Impact
1. The database is unavailable to external services while the change is made.
2. The stability of the RAC cluster is not affected, and no data is lost.

If a problem is found, you only need to roll the parameter back to its previous value.

References
[ID 395878.1] [ID 457423.1] [ID 391771.1] [ID 294430.1]

--- End ---

This article is an English version of an article originally written in Chinese on aliyun.com.
