Case study: incomplete RAC split-brain logs caused by an unconfigured diagwait

Source: Internet
Author: User

1. Fault

A two-node RAC cluster running CRS 10.2.0.4. The server at node 2 went down, and then the server at node 1 went down as well.

2. CRS log analysis

2.1 Node 2 logs

CRS_LOG

[cssd(8796)] CRS-1611: node XXdb1 (1) at 75% heartbeat fatal, eviction in 14.118 seconds

2014-07-04 22:49:38.556

[cssd(8796)] CRS-1611: node XXdb1 (1) at 75% heartbeat fatal, eviction in 13.128 seconds

2014-07-04 22:49:46.561

[cssd(8796)] CRS-1610: node XXdb1 (1) at 90% heartbeat fatal, eviction in 5.128 seconds

2014-07-05 03:00:08.142

[cssd(8812)] CRS-1605: CSSD voting file is online: /dev/raw/raw18. Details in /home/oracle/product/10.2.0/crs/log/XXdb2/cssd/ocssd.log.

Between 22:49:46.561 and 03:00:08.142 the next day there are no records at all. The split-brain logs were not written out in full: entries such as the node eviction information and the cluster reconfiguration information are missing.
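A gap like this is easy to miss when reading a long alert log by eye. The following is a minimal Python sketch, not an Oracle tool, that scans timestamp lines and reports silent periods. The function names, the 60-second threshold, and the assumption that every timestamp line carries a full date are illustrative choices rather than anything taken from the incident itself.

```python
import re
from datetime import datetime

# Full timestamp lines as in the excerpt above, e.g. "2014-07-04 22:49:46.561".
# Stray whitespace after the dot (an extraction artifact) is tolerated.
TS_RE = re.compile(r"^(\d{4}-\d{2}-\d{2}) (\d{2}:\d{2}:\d{2})\.\s*(\d{3})$")


def parse_timestamps(lines):
    """Return the datetimes of all full timestamp lines, in file order."""
    stamps = []
    for line in lines:
        match = TS_RE.match(line.strip())
        if match:
            date, clock, millis = match.groups()
            stamps.append(
                datetime.strptime(f"{date} {clock}.{millis}", "%Y-%m-%d %H:%M:%S.%f")
            )
    return stamps


def find_gaps(stamps, threshold_seconds=60.0):
    """Return (start, end, seconds) for consecutive entries further apart than the threshold."""
    gaps = []
    for prev, cur in zip(stamps, stamps[1:]):
        seconds = (cur - prev).total_seconds()
        if seconds > threshold_seconds:
            gaps.append((prev, cur, seconds))
    return gaps


# Excerpt of the node 2 alert log shown above.
SAMPLE = """\
2014-07-04 22:49:38.556
[cssd(8796)] CRS-1611: node XXdb1 (1) at 75% heartbeat fatal, eviction in 13.128 seconds
2014-07-04 22:49:46.561
[cssd(8796)] CRS-1610: node XXdb1 (1) at 90% heartbeat fatal, eviction in 5.128 seconds
2014-07-05 03:00:08.142
[cssd(8812)] CRS-1605: CSSD voting file is online: /dev/raw/raw18. Details in /home/oracle/product/10.2.0/crs/log/XXdb2/cssd/ocssd.log.
"""

gaps = find_gaps(parse_timestamps(SAMPLE.splitlines()))
for start, end, seconds in gaps:
    print(f"no records for {seconds:.3f} s between {start} and {end}")
```

On this excerpt the sketch reports a single gap of a little over 4 hours 10 minutes, which is the silent window in which the eviction and reconfiguration entries should have appeared.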

2.2 Node 1 logs

23:00:00.018

[cssd(27561)] CRS-1612: node XXdb2 (2) at 50% heartbeat fatal, eviction in 29.144 seconds

23:00:15.017

[cssd(27561)] CRS-1611: node XXdb2 (2) at 75% heartbeat fatal, eviction in 14.144 seconds

23:00:24.014

[cssd(27561)] CRS-1610: node XXdb2 (2) at 90% heartbeat fatal, eviction in 5.144 seconds

23:00:25.016

[cssd(27561)] CRS-1610: node XXdb2 (2) at 90% heartbeat fatal, eviction in 4.144 seconds

2014-07-05 01:21:06.620

[cssd(31191)] CRS-1605: CSSD voting file is online: /dev/raw/raw18. Details in /home/oracle/product/10.2.0/crs/log/XXdb1/cssd/ocssd.log.

Between 23:00:25.016 and 01:21:06.620 the next day there are no records at all. Here too the split-brain logs were not written out in full: the node eviction information and the cluster reconfiguration information are missing.

2.3 Problem summary

Node 2's operating system restarted before its restart logs had been fully written. Node 2's eviction information was never sent to node 1, so node 1 did not know that node 2 had left the cluster. Node 1 then probed node 2 over the heartbeat link and found node 2's heartbeat abnormal. As for why node 1 itself restarted, there is no operating system performance monitoring data to draw on (for example, whether the server load was high at the time) and the logs are incomplete, so the real cause of the restart is difficult to determine.

3. Normal logs

2014-06-24 14:53:21.258

[crsd(8825)] CRS-5504: Node down event reported for node 'tsrrac02'.

2014-06-24 14:53:21.259

[crsd(8825)] CRS-2773: Server 'tsrrac02' has been removed from pool 'ora.crmout'.

2014-06-24 14:53:21.259

[crsd(8825)] CRS-2773: Server 'tsrrac02' has been removed from pool 'generic'.

4. Check CRS Configuration

$ crsctl get css diagwait

Configuration parameter diagwait is not defined.

Problem: both nodes have the same configuration, and diagwait is not set on either of them.
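When several clusters have to be audited, the same check can be scripted. The sketch below assumes the output of `crsctl get css diagwait` has already been captured per node as text. The helper names are hypothetical, and the assumption that the command prints the bare number when the parameter is set should be verified on your own version before relying on it.

```python
RECOMMENDED_DIAGWAIT = 13  # seconds; the value this article recommends in section 6


def parse_diagwait(output):
    """Return diagwait in seconds, or None if the parameter is not defined.

    `output` is the text printed by `crsctl get css diagwait`.
    """
    text = output.strip()
    if not text or "not defined" in text:
        return None
    # Assumption: when the parameter is set, the command prints only the number.
    return int(text)


def nodes_needing_change(outputs_by_node, recommended=RECOMMENDED_DIAGWAIT):
    """Return the sorted node names whose diagwait is unset or not the recommended value."""
    flagged = []
    for node, output in outputs_by_node.items():
        if parse_diagwait(output) != recommended:
            flagged.append(node)
    return sorted(flagged)


# Output captured on the two nodes of this case study.
captured = {
    "XXdb1": "Configuration parameter diagwait is not defined.",
    "XXdb2": "Configuration parameter diagwait is not defined.",
}
print(nodes_needing_change(captured))
```

Both nodes are flagged here, which matches the finding above.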

5. Official description of the diagwait defaults and the risks of leaving it unconfigured

Using Diagwait as a diagnostic to get more information for diagnosing Oracle Clusterware Node evictions (Doc ID 559365.1)

This setting will provide more time for diagnostic data to be collected safely and will NOT increase the probability of corruption.

OPROCD checks whether the node has hung. When it detects a hang, it reboots the node.
It has two important parameters:
oprocd -t 1000 -m 500

Timeout value (-t <to-millisec>): the interval between checks. The default value is 1000 ms (1 s).
Margin (-m <margin-millisec>): the allowed delay. The default value is 500 ms (0.5 s).

OPROCD performs a check every to-millisec (1 s by default). On each check it reads the OS time and subtracts the OS time obtained at the previous check. If the difference is greater than to-millisec + margin-millisec, OPROCD considers the OS hung and reboots the node. Put simply, with both parameters left at their defaults, OPROCD treats the OS as hung if it cannot obtain the OS time within 1.5 s.

 

After diagwait is set to 13 s, margin-millisec becomes 10 s, so the time allowed for obtaining the OS time rises to 11 s (1 s + 10 s).
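The effect of the larger margin can be illustrated with a small Python model of the check described above. This is a simplified sketch of the comparison only, not Oracle's implementation: the function names and the clock readings are hypothetical, and the only values taken from the text are the 1000 ms timeout and the 500 ms and 10 s margins.

```python
def hang_threshold_ms(timeout_ms=1000, margin_ms=500):
    """Largest elapsed time between two checks that is still tolerated."""
    return timeout_ms + margin_ms


def is_hang(prev_ms, now_ms, timeout_ms=1000, margin_ms=500):
    """True if the OS time advanced by more than timeout + margin since the last check."""
    return (now_ms - prev_ms) > hang_threshold_ms(timeout_ms, margin_ms)


def first_hang(readings_ms, timeout_ms=1000, margin_ms=500):
    """Return the index of the first check that would declare a hang, or None."""
    for i in range(1, len(readings_ms)):
        if is_hang(readings_ms[i - 1], readings_ms[i], timeout_ms, margin_ms):
            return i
    return None


# Hypothetical OS clock readings (ms) at successive checks; the third check
# arrives 4 s after the second because the system stalled briefly.
readings = [0, 1000, 5000, 6000]

default_result = first_hang(readings)                      # margin 500 ms, threshold 1.5 s
diagwait_result = first_hang(readings, margin_ms=10_000)   # margin 10 s, threshold 11 s

print("default margin:", default_result)
print("diagwait 13:", diagwait_result)
```

With the default margin the 4 s stall crosses the 1.5 s threshold at the third reading and the node would be rebooted. With the 10 s margin the same stall stays under the 11 s threshold and no reboot is triggered.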

6. Improvement Plan

This problem only occurs in versions earlier than Oracle 11.2. In 11g R2, the default value of diagwait is 13.

For versions earlier than 11.2, you need to set diagwait to 13 manually. This delays the reboot so that buffered log information can be written to the log files on disk, and it reduces the chance of a reboot triggered merely because the time allowed for interacting with the OS is too short.
