Case analysis: incomplete log records during a RAC split-brain because diagwait was not configured

Source: Internet
Author: User

1. Symptom

In a RAC cluster running CRS 10.2.0.4, node 2 went down, and node 1 went down shortly afterwards.

2. CRS Log Analysis

2.1 Node 2 log

Crs_log

[CSSD (8796)]CRS-1611:node XXdb1 (1) at 75% Heartbeat fatal, eviction in 14.118 seconds

2014-07-04 22:49:38.556

[CSSD (8796)]CRS-1611:node XXDB1 (1) at 75% Heartbeat fatal, eviction in 13.128 seconds

2014-07-04 22:49:46.561

[CSSD (8796)]CRS-1610:node XXdb1 (1) at 90% Heartbeat fatal, eviction in 5.128 seconds

2014-07-05 03:00:08.142

[CSSD (8812)]CRS-1605:CSSD voting file is online: /dev/raw/raw18. Details in /home/oracle/product/10.2.0/crs/log/xxdb2/cssd/ocssd.log.

The log jumps directly from 2014-07-04 22:49:46.561 to 2014-07-05 03:00:08.142 with no records in between. The split-brain log is therefore incomplete: entries such as the node eviction messages and the cluster reconfiguration messages are missing.

2.2 Node 1 log

2014-07-04 23:00:00.018

[CSSD (27561)] CRS-1612:node XXDB2 (2) at 50% Heartbeat fatal, eviction in 29.144 seconds

2014-07-04 23:00:15.017

[CSSD (27561)] CRS-1611:node XXDB2 (2) at 75% Heartbeat fatal, eviction in 14.144 seconds

2014-07-04 23:00:24.014

[CSSD (27561)] CRS-1610:node XXDB2 (2) at 90% Heartbeat fatal, eviction in 5.144 seconds

2014-07-04 23:00:25.016

[CSSD (27561)] CRS-1610:node XXDB2 (2) at 90% Heartbeat fatal, eviction in 4.144 seconds

2014-07-05 01:21:06.620

[CSSD (31191)] CRS-1605:CSSD voting file is online: /dev/raw/raw18. Details in /home/oracle/product/10.2.0/crs/log/xxdb1/cssd/ocssd.log.

The log jumps directly from 2014-07-04 23:00:25.016 to 2014-07-05 01:21:06.620 with no records in between. The split-brain log is therefore incomplete: entries such as the node eviction messages and the cluster reconfiguration messages are missing.
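The gaps noted in sections 2.1 and 2.2 can be measured rather than read off by eye. The following is a minimal sketch, assuming the timestamps have already been extracted from the alert log into a list of strings; the function name `find_gaps` and the 10-minute threshold are illustrative choices, not part of any Oracle tool.

```python
from datetime import datetime, timedelta

# Timestamp layout used by the log excerpts above.
TIMESTAMP_FORMAT = "%Y-%m-%d %H:%M:%S.%f"


def find_gaps(timestamps, threshold=timedelta(minutes=10)):
    """Return (start, end, gap) for consecutive entries further apart than threshold."""
    parsed = [datetime.strptime(t, TIMESTAMP_FORMAT) for t in timestamps]
    return [(a, b, b - a) for a, b in zip(parsed, parsed[1:]) if b - a > threshold]


# Timestamps copied from the node 2 and node 1 excerpts.
node2_log = [
    "2014-07-04 22:49:38.556",
    "2014-07-04 22:49:46.561",
    "2014-07-05 03:00:08.142",
]
node1_log = [
    "2014-07-04 23:00:00.018",
    "2014-07-04 23:00:15.017",
    "2014-07-04 23:00:24.014",
    "2014-07-04 23:00:25.016",
    "2014-07-05 01:21:06.620",
]

for name, log in (("node 2", node2_log), ("node 1", node1_log)):
    for start, end, gap in find_gaps(log):
        print(f"{name}: no entries between {start} and {end} ({gap})")
```

With the excerpts above, this reports one gap per node: roughly 4 hours 10 minutes on node 2 and 2 hours 21 minutes on node 1. In both cases the silence begins right after the last "90% Heartbeat fatal" message.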

2.3 Summary of issues

The restart log on node 2 is incomplete. When the operating system on node 2 restarted, node 2's eviction information did not have time to reach node 1, so node 1 did not know that node 2 had disappeared. Node 1 kept pinging node 2 over the heartbeat link and found that the heartbeat with node 2 was abnormal.

As for the restart of node 1, the true cause is difficult to determine because there is no operating system performance monitoring data to support the analysis (for example, whether the server load was very high) and because the logs are incomplete.

3. What a normal log should look like

2014-06-24 14:53:21.258

[CRSD (8825)] CRS-5504:Node down event reported for node 'tsrrac02'.

2014-06-24 14:53:21.259

[CRSD (8825)] CRS-2773:Server 'tsrrac02' has been removed from pool 'ora.crmout'.

2014-06-24 14:53:21.259

[CRSD (8825)] CRS-2773:Server 'tsrrac02' has been removed from pool 'Generic'.

4. CRS configuration check

$ crsctl get css diagwait

Configuration parameter diagwait is not defined.

Finding: both nodes have the same configuration, and diagwait is not set on either of them.

5. Default behavior when diagwait is not configured, and the official description of the risk

Using Diagwait as a diagnostic to get more information for diagnosing Oracle Clusterware Node evictions (Doc ID 559365.1)

"This setting will provide more time for diagnostic data to be collected safely and will not increase the probability of corruption."

oprocd is used to check whether the node is hung; when it detects a hang, it initiates a reboot.
It has two important parameters:
oprocd.debug -t 1000 -m

Timeout (-t <to-millisec>): the interval between checks; the default is 1000 ms (1 s).
Margin (-m <margin-millisec>): the allowed delay; the default is 500 ms (0.5 s).

The oprocd process performs a check every to-millisec (1 s): it reads the current OS time and subtracts the time it read on the previous check. If the difference is greater than to-millisec + margin-millisec, oprocd concludes that the OS is hung and initiates a reboot. Put simply, if the two parameters are left at their defaults, oprocd assumes an OS hang whenever it cannot obtain the OS time within 1.5 s.

After diagwait is changed to 13 s, margin-millisec is set to 10 s, so the time allowed for obtaining the OS time becomes 11 s (1 s + 10 s).
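The threshold arithmetic above can be sketched in a few lines of Python. This is only an illustration of the rule as this article describes it, not oprocd's actual implementation; the class `OprocdSettings` and the function `is_hang` are hypothetical names introduced for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class OprocdSettings:
    """Hypothetical container for the two oprocd parameters described above."""

    timeout_ms: int  # -t: interval between checks
    margin_ms: int   # -m: extra delay tolerated before a hang is declared

    @property
    def hang_threshold_ms(self) -> int:
        # A hang is assumed once the elapsed time exceeds timeout + margin.
        return self.timeout_ms + self.margin_ms


def is_hang(previous_ms: int, current_ms: int, settings: OprocdSettings) -> bool:
    """Return True when the time between two checks exceeds the threshold."""
    return (current_ms - previous_ms) > settings.hang_threshold_ms


# Defaults from the text: 1000 ms timeout, 500 ms margin.
default = OprocdSettings(timeout_ms=1000, margin_ms=500)
# With diagwait set to 13, the text states the margin becomes 10 s.
with_diagwait = OprocdSettings(timeout_ms=1000, margin_ms=10_000)

print(default.hang_threshold_ms)        # 1500
print(with_diagwait.hang_threshold_ms)  # 11000

# A 3-second stall trips the default settings but not the diagwait settings.
print(is_hang(0, 3000, default))        # True
print(is_hang(0, 3000, with_diagwait))  # False
```

The last two lines show the practical effect: a short stall that would trigger a reboot under the default 1.5 s threshold is tolerated once the threshold is 11 s, which is what leaves time for the logs to be flushed.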

6. Improvement Plan

This issue only occurs in versions earlier than Oracle 11.2; in 11g R2, the value of diagwait is 13 by default.

For versions earlier than 11.2, diagwait must be set manually to delay the reboot. This allows enough time for the log information held in cache to be written to the disk files, and it also reduces the chance of a reboot caused by the short time allowed for interacting with the OS.

This article was written by Li Junjie (network name: casing), who works on systematic performance optimization across six layers: system architecture, operating systems, storage devices, databases, middleware, and applications.

You are welcome to join the System Performance Optimization professional group to discuss performance optimization techniques. Group number: 258187244
