1. Fault symptom
On a RAC cluster running CRS 10.2.0.4, node 2 went down, and node 1 went down shortly afterwards.
2. CRS log analysis
2.1 Node 2 log
crs_log
[cssd(8796)]CRS-1611:node xxdb1 (1) at 75% heartbeat fatal, eviction in 14.118 seconds
2014-07-04 22:49:38.556 [cssd(8796)]CRS-1611:node xxdb1 (1) at 75% heartbeat fatal, eviction in 13.128 seconds
2014-07-04 22:49:46.561 [cssd(8796)]CRS-1610:node xxdb1 (1) at 90% heartbeat fatal, eviction in 5.128 seconds
2014-07-05 03:00:08.142 [cssd(8812)]CRS-1605:CSSD voting file is online: /dev/raw/raw18. Details in /home/oracle/product/10.2.0/crs/log/xxdb2/cssd/ocssd.log.
The log jumps directly from 2014-07-04 22:49:46.561 to 2014-07-05 03:00:08.142 with no records in between. The record of the cluster split is therefore incomplete: the node eviction messages and the cluster reconfiguration messages are missing.
2.2 Node 1 log
2014-07-04 23:00:00.018 [cssd(27561)]CRS-1612:node xxdb2 (2) at 50% heartbeat fatal, eviction in 29.144 seconds
2014-07-04 23:00:15.017 [cssd(27561)]CRS-1611:node xxdb2 (2) at 75% heartbeat fatal, eviction in 14.144 seconds
2014-07-04 23:00:24.014 [cssd(27561)]CRS-1610:node xxdb2 (2) at 90% heartbeat fatal, eviction in 5.144 seconds
2014-07-04 23:00:25.016 [cssd(27561)]CRS-1610:node xxdb2 (2) at 90% heartbeat fatal, eviction in 4.144 seconds
2014-07-05 01:21:06.620 [cssd(31191)]CRS-1605:CSSD voting file is online: /dev/raw/raw18. Details in /home/oracle/product/10.2.0/crs/log/xxdb1/cssd/ocssd.log.
The log jumps directly from 2014-07-04 23:00:25.016 to 2014-07-05 01:21:06.620 with no records in between. Here too the record of the cluster split is incomplete: the node eviction messages and the cluster reconfiguration messages are missing.
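Gaps like the ones in 2.1 and 2.2 can also be found mechanically rather than by eye. The following is a minimal Python sketch, not an Oracle tool: the function name `find_gaps` and the 5-minute threshold are assumptions made for illustration, and the sample text is taken from the node 1 excerpt. It only looks at timestamps, so it works even when several log entries have been run together on one line.

```python
import re
from datetime import datetime, timedelta

# Timestamp format used in the alert log excerpts, e.g. 2014-07-04 23:00:25.016
TS = re.compile(r"\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}\.\d{3}")


def find_gaps(text, threshold=timedelta(minutes=5)):
    """Return (start, end, gap) for consecutive timestamps further apart than threshold.

    The threshold is an assumed value for illustration; pick one that
    suits how chatty the log normally is.
    """
    stamps = [datetime.strptime(s, "%Y-%m-%d %H:%M:%S.%f") for s in TS.findall(text)]
    return [(a, b, b - a) for a, b in zip(stamps, stamps[1:]) if b - a > threshold]


sample = """\
2014-07-04 23:00:24.014 [cssd(27561)]CRS-1610:node xxdb2 (2) at 90% heartbeat fatal, eviction in 5.144 seconds
2014-07-04 23:00:25.016 [cssd(27561)]CRS-1610:node xxdb2 (2) at 90% heartbeat fatal, eviction in 4.144 seconds
2014-07-05 01:21:06.620 [cssd(31191)]CRS-1605:CSSD voting file is online: /dev/raw/raw18.
"""

for start, end, gap in find_gaps(sample):
    print(start, "->", end, "gap", gap)
```

On the sample above this reports a single gap of a little over 2 hours 20 minutes, which is the window in which the eviction and reconfiguration messages should have been written.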
2.3 Summary of issues
Node 2's restart log is incomplete. What restarted was the operating system itself, and node 2's eviction information did not have time to reach node 1, so node 1 did not know that node 2 had disappeared. Node 1 then kept pinging node 2 over the heartbeat interconnect and found that its heartbeat with node 2 was abnormal. As for node 1's own restart, the lack of operating system performance monitoring data (for example, evidence of whether the server load was very high) and the incomplete logs make it difficult to determine the real cause.
3. What a normal log should look like
2014-06-24 14:53:21.258 [crsd(8825)]CRS-5504:Node down event reported for node 'tsrrac02'.
2014-06-24 14:53:21.259 [crsd(8825)]CRS-2773:Server 'tsrrac02' has been removed from pool 'ora.crmout'.
2014-06-24 14:53:21.259 [crsd(8825)]CRS-2773:Server 'tsrrac02' has been removed from pool 'Generic'.
4. CRS configuration check
$ crsctl get css diagwait
Configuration parameter diagwait is not defined.
Issue: both nodes have the same configuration, and diagwait is not configured on either of them.
5. Default behavior when diagwait is not configured, and the official description of the risk
Using Diagwait as a diagnostic to get more information for diagnosing Oracle Clusterware Node evictions (Doc ID 559365.1)
According to the note, this setting provides more time for diagnostic data to be collected safely and does not increase the probability of corruption.
oprocd is used to check whether the node is hung; when it finds that the node is hung, it initiates a reboot. It has two important parameters: oprocd.debug -t 1000 -m
Timeout value (-t <to-millisec>): the interval between checks; the default is 1000 ms (1 s). Margin (-m <margin-millisec>): the allowed delay; the default is 500 ms (0.5 s).
The oprocd process performs a check every to-millisec (1 s): it reads the OS time and subtracts the time it read on the previous check. If the difference is greater than to-millisec + margin-millisec, oprocd concludes that the OS is hung and initiates a reboot. Simply put, with both parameters at their defaults, oprocd assumes the OS is hung if it cannot get the OS time within 1.5 s. After diagwait is changed to 13 s, margin-millisec is set to 10 s, so the time allowed for getting the OS time rises to 11 s (1 s + 10 s).
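The arithmetic above can be made concrete with a small Python sketch. This is only an illustration of the rule as described in this section, not oprocd's actual implementation; the function names `hang_threshold_ms` and `looks_hung` are hypothetical.

```python
def hang_threshold_ms(timeout_ms=1000, margin_ms=500):
    """Longest interval between two checks that is still tolerated.

    Defaults mirror the values described above: -t 1000 and a 500 ms margin.
    """
    return timeout_ms + margin_ms


def looks_hung(prev_check_ms, now_ms, timeout_ms=1000, margin_ms=500):
    """True if the time since the previous check exceeds timeout + margin."""
    return (now_ms - prev_check_ms) > hang_threshold_ms(timeout_ms, margin_ms)


# Defaults: a 1.6 s stall exceeds the 1.5 s limit and would trigger a reboot.
print(hang_threshold_ms(), looks_hung(0, 1600))

# With the margin raised to 10 s (the effect attributed to diagwait = 13),
# the same 1.6 s stall is tolerated because the limit is now 11 s.
print(hang_threshold_ms(margin_ms=10000), looks_hung(0, 1600, margin_ms=10000))
```

The comparison shows why the default is fragile on a heavily loaded server: any scheduling delay of more than half a second past the check interval is treated as a hang, whereas the larger margin leaves room for both short stalls and log flushing.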
6. Improvement Plan
This issue only occurs in versions earlier than Oracle 11.2; in 11g R2, diagwait is configured to 13 by default.
For versions older than 11.2, diagwait must be changed manually to lengthen the delay before the reboot. This leaves enough time for the log information held in the cache to be written to disk, and it also reduces the chance of a reboot caused by too short an allowed OS response time.
This article is by Li Junjie (online name: casing), who works on systematic performance optimization across six layers: system architecture, operating systems, storage devices, databases, middleware, and applications.
Case analysis: incomplete log records during a RAC split-brain because diagwait was not configured