What Is Split Brain in Oracle Clusterware and RAC
From:
What's Split Brain in Oracle Clusterware and Real Application Cluster (Doc ID 1425586.1)
Applies to:
Oracle Database - Enterprise Edition - Version 10.1.0.2 and later
Information in this document applies to any platform.
Objective:
This article explains what split brain means in Oracle Clusterware and RAC, along with the errors and consequences associated with it.
Details:
In general terms, split brain refers to data inconsistency that arises when two separate data sets with overlapping scope are maintained, either because of the way the servers are laid out in the network design or because of a failure condition in which the servers stop communicating and synchronizing their data with each other.
Split brain can occur in two components:
1. Clusterware Layer:
The cluster nodes maintain their heartbeats through the private network and the voting disk.
When the private network fails and the cluster nodes cannot communicate with each other over it for the period defined by the misscount setting, split brain occurs.
In this case, the voting disk is used to determine which node(s) survive and which node(s) are evicted from the cluster. The usual voting results are as follows:
a. The group with more cluster nodes survives.
b. If each group has the same number of nodes, the group containing the lower node number survives.
c. Some improvements have been made to ensure that the node(s) with lower load survive when the eviction is caused by high system load.
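Rules (a) and (b) can be illustrated with a small sketch. This is a simplified model for illustration only, not Oracle's actual CSS implementation: the function name surviving_subcluster and the representation of a sub-cluster as a list of node numbers are assumptions, and the load-based refinement in rule (c) is not modelled.

```python
from typing import List


def surviving_subcluster(subclusters: List[List[int]]) -> List[int]:
    """Return the group of nodes that survives under rules (a) and (b).

    Each sub-cluster is the list of node numbers that can still reach
    each other over the private network (hypothetical representation).
    """
    # Rule (a): the group with more nodes wins.
    # Rule (b): on a tie, the group containing the lowest node number wins.
    return max(subclusters, key=lambda group: (len(group), -min(group)))


if __name__ == "__main__":
    # Three-node cluster where node 2 is cut off from nodes 1 and 3.
    print(surviving_subcluster([[1, 3], [2]]))  # prints [1, 3]
    # Two-node cluster split evenly: the lower node number survives.
    print(surviving_subcluster([[2], [1]]))     # prints [1]
```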
Typically, when split brain occurs, messages similar to the following appear in ocssd.log:
[CSSD]2011-01-12 23:23:08.090 [1262557536] >TRACE: clssnmCheckDskInfo: Checking disk info...
[CSSD]2011-01-12 23:23:08.090 [1262557536] >ERROR: clssnmCheckDskInfo: Aborting local node to avoid splitbrain.
[CSSD]2011-01-12 23:23:08.090 [1262557536] >ERROR: : my node(2), Leader(2), Size(1) VS Node(1), Leader(1), Size(2)
[CSSD]2011-01-12 23:23:08.090 [1262557536] >ERROR: ###################################
[CSSD]2011-01-12 23:23:08.090 [1262557536] >ERROR: clssscExit: CSSD aborting
###################################
The messages above show that communication from node 2 to node 1 is not working, so node 2 can see only one node (itself). Node 1 is working properly and can see two nodes in the cluster. To avoid split brain, node 2 aborted itself.
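When reviewing many ocssd.log files, it can help to pull the node, leader, and size values out of the comparison line automatically. The following is a minimal sketch that assumes the line format shown in the sample above; the function name and the dictionary keys (local_*, remote_*) are labels chosen for this illustration, not Oracle terminology.

```python
import re
from typing import Dict, Optional

# Matches e.g. "my node(2), Leader(2), Size(1) VS Node(1), Leader(1), Size(2)".
# Case and spacing are matched loosely because log excerpts are often reformatted.
SPLITBRAIN_RE = re.compile(
    r"my\s+node\s*\((\d+)\),\s*Leader\s*\((\d+)\),\s*Size\s*\((\d+)\)\s+VS\s+"
    r"Node\s*\((\d+)\),\s*Leader\s*\((\d+)\),\s*Size\s*\((\d+)\)",
    re.IGNORECASE,
)

KEYS = ("local_node", "local_leader", "local_size",
        "remote_node", "remote_leader", "remote_size")


def parse_splitbrain_line(line: str) -> Optional[Dict[str, int]]:
    """Extract the six numbers from a split-brain comparison line, or None."""
    match = SPLITBRAIN_RE.search(line)
    if match is None:
        return None
    return dict(zip(KEYS, (int(value) for value in match.groups())))


if __name__ == "__main__":
    sample = ("[CSSD]2011-01-12 23:23:08.090 [1262557536] >ERROR: : "
              "my node(2), Leader(2), Size(1) VS Node(1), Leader(1), Size(2)")
    info = parse_splitbrain_line(sample)
    # In the sample, the local group (size 1) is smaller than the remote
    # group (size 2), which is consistent with the local node aborting.
    print(info["local_size"] < info["remote_size"])  # prints True
```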
Solution: Contact your network administrator to check the private network layer to eliminate any network problems.
2. RAC (database) layer
To ensure data consistency, each instance of a RAC database needs to maintain a heartbeat with the other instances. The heartbeat is maintained by the background processes LMON, LMD, LMS and LCK.
If any of these processes experiences an IPC Send timeout, a communication reconfiguration and an instance eviction follow in order to avoid split brain.
Similar to the voting disk at the clusterware layer, the control file is used to determine which instance(s) survive and which instance(s) are evicted.
The voting result is similar to the clusterware voting result. As a result, one or more instances are evicted.
Common messages in the instance alert log are similar to:
Alert log of instance 1:
---------
Mon Dec 19:43:05 2011
IPC Send timeout detected. Sender: ospid 26318
Receiver: inst 2 binc 554466600 ospid 29940
IPC Send timeout to 2.0 inc 8 for msg type 65521 from opid 20
Mon Dec 19:43:07 2011
Communications reconfiguration: instance_number 2
Mon Dec 19:43:07 2011
Trace dumping is performing id=[cdmp_20091207194307]
Waiting for clusterware split-brain resolution
Mon Dec 19:53:07 2011
Evicting instance 2 from cluster
Waiting for instances to leave:
2
...

Alert log of instance 2:
---------
Mon Dec 19:42:18 2011
IPC Send timeout detected. Receiver ospid 29940
Mon Dec 19:42:18 2011
Errors in file /u01/app/oracle/diag/rdbms/bd/bd2/trace/bd2_lmd0_29940.trc:
Trace dumping is performing id=[cdmp_20091207194307]
Mon Dec 19:42:20 2011
Waiting for clusterware split-brain resolution
Mon Dec 19:44:45 2011
ERROR: LMS0 (ospid: 29942) detects an idle connection to instance 1
Mon Dec 19:44:51 2011
ERROR: LMD0 (ospid: 29940) detects an idle connection to instance 1
Mon Dec 19:45:38 2011
ERROR: LMS1 (ospid: 29954) detects an idle connection to instance 1
Mon Dec 19:52:27 2011
Errors in file /u01/app/oracle/diag/rdbms/bd/bd2/trace/pvbd2_lmon_29938.trc (incident=90153):
ORA-29740: evicted by member 0, group incarnation 10
Incident details in: /u01/app/oracle/diag/rdbms/bd/bd2/incident/incdir_90153/bd2_lmon_29938_i90153.trc
In the example above, the LMD0 process of instance 2 (PID 29940) is the receiver in the IPC Send timeout. An IPC Send timeout can have various causes, for example:
A. Network problem
B. Process Hang
C. Bugs etc
Please see Top 5 Issues for Instance Eviction (Document 1374110.1) for more information.
In the case of an instance eviction, the alert log and all background process traces need to be checked to determine the root cause.
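As a starting point for that review, a short script can pull the eviction-related lines out of an alert log. This is a minimal sketch: the pattern list is taken only from the messages quoted in this article and is not exhaustive, and the function name find_eviction_messages is an assumption made for this example.

```python
import re
from typing import Iterable, List, Tuple

# Message fragments taken from the sample alert logs in this article.
EVICTION_PATTERNS = [
    re.compile(r"IPC Send timeout", re.IGNORECASE),
    re.compile(r"Communications reconfiguration", re.IGNORECASE),
    re.compile(r"split-brain resolution", re.IGNORECASE),
    re.compile(r"Evicting instance", re.IGNORECASE),
    re.compile(r"detects an idle connection", re.IGNORECASE),
    re.compile(r"ORA-29740", re.IGNORECASE),
]


def find_eviction_messages(lines: Iterable[str]) -> List[Tuple[int, str]]:
    """Return (line number, text) for every line matching a known pattern."""
    hits = []
    for lineno, line in enumerate(lines, start=1):
        if any(pattern.search(line) for pattern in EVICTION_PATTERNS):
            hits.append((lineno, line.rstrip("\n")))
    return hits


if __name__ == "__main__":
    sample = [
        "Mon Dec 19:43:07 2011\n",
        "Communications reconfiguration: instance_number 2\n",
        "Waiting for clusterware split-brain resolution\n",
        "Evicting instance 2 from cluster\n",
    ]
    for lineno, text in find_eviction_messages(sample):
        print(lineno, text)
```

To scan a real file, pass an open file handle instead of the sample list, for example `find_eviction_messages(open(path))`, where `path` is the location of the alert log on your system.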
Known Issues
1. Bug 7653579 - IPC send timeout in RAC after period Document 7653579.8
Refer: ORA-29740 Instance (ASM/DB) eviction on Solaris SPARC Document 761717.1
Fixed in: 11.2.0.1, 11.1.0.7.2 PSU and 11.1.0.7 Patch in Windows
2. Unpublished Bug 8267580: Wrong Instance Evicted Under High CPU Load
Refer: Wrong Instance Evicted Under High CPU Load in 11.1.0.7 Document 1373749.1
Fixed in: 11.2.0.1
3. Bug 8365141 - DRM quiesce step hang causes instance eviction Document 8365141.8
Fixed in: 10.2.0.5, 11.1.0.7.3, 11.1.0.7 Patch for Windows and 11.2.0.1
4. Bug 7587008 - Hung RAC instance not evicted from cluster Document 7587008.8
Fixed in: 10.2.0.4.4, 10.2.0.5 and 11.2.0.1, one-off patch available for various 11.1.0.7 release
5. Bug 11890804 - LMHB crashes instance with ORA-29770 after long "control file sequential read" waits Document 11890804.8
Fixed in: 11.2.0.2.5, 11.2.0.3 and 11.2.0.2 patches on Windows
6. Bug:13732226 - NODE GETS EVICTED WITH REASONCODE 0X2
Bug:13399435 - KJFCDRMRCFG WAITED 249 SECS FOR LMD TO RECEIVE ALL FTDONES, REQUESTING KILL
Bug:13503204 - INSTANCE EVICTION DUE TO REASON 0X200000
Refer: 11gR2: LMON received an instance eviction notification from instance n Document 1440892.1
Fixed in: 11.2.0.4 and some merge patch available for 11.2.0.2 and 11.2.0.3
Translated from the MOS article "What's Split Brain in Oracle Clusterware and Real Application Cluster"