Applies To:
Oracle Database - Enterprise Edition - Version 10.1.0.2 and later
Information in this document applies to any platform.
PURPOSE
This note explains what split brain is in Oracle Real Application Clusters and what errors and consequences are associated with it.
SCOPE
For DBAs and support engineers.
DETAILS
In generic terms, split brain indicates data inconsistencies originating from the maintenance of separate data sets with overlap in scope, either because of servers in a network design, or because of a failure condition in which servers are not communicating and synchronizing their data with each other.
There are two components in an Oracle Real Application Clusters implementation that could experience split brain.
1. Clusterware Layer
Cluster nodes maintain their heartbeats via the private network and the voting disk. When there is a private network disruption and cluster nodes cannot communicate with each other via the private network for the period of time set by misscount, split brain happens. In such a case, the voting disk is used to determine which node(s) survive and which node(s) are evicted. The common voting results are:
A. The group with more cluster nodes survives.
B. If each group has the same number of nodes, the group containing the lowest-numbered node survives.
C. Some improvements have been made to ensure that the node(s) with lower load survive in case the eviction was caused by high system load.
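Rules A and B above can be sketched as a small selection function. This is an illustrative model only, not Oracle's implementation: the function name `pick_surviving_group` and the representation of sub-clusters as lists of node numbers are assumptions made for the example, and rule C (load-aware survival) is not modeled.

```python
def pick_surviving_group(groups):
    """Return the sub-cluster expected to survive under rules A and B.

    groups: list of lists of node numbers, one list per sub-cluster
    (a hypothetical representation used only for illustration).
    """
    if not groups or any(not group for group in groups):
        raise ValueError("each sub-cluster must contain at least one node")
    # Rule A: the larger group wins.
    # Rule B: on a tie, the group holding the lowest node number wins.
    return min(groups, key=lambda group: (-len(group), min(group)))


# Three-node cluster split into {1, 2} and {3}: the larger group survives.
print(pick_surviving_group([[1, 2], [3]]))
# Two-node cluster split into {1} and {2}: the lower node number survives.
print(pick_surviving_group([[2], [1]]))
```

The sort key encodes both rules in one comparison, so a tie on size falls through to the node-number check.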
Commonly, one would see messages similar to the following in ocssd.log when split brain happens:
[CSSD]2011-01-12 23:23:08.090 [1262557536] >TRACE: clssnmCheckDskInfo: Checking disk info...
[CSSD]2011-01-12 23:23:08.090 [1262557536] >ERROR: clssnmCheckDskInfo: Aborting local node to avoid splitbrain.
[CSSD]2011-01-12 23:23:08.090 [1262557536] >ERROR: : my node(2), Leader(2), Size(1) VS Node(1), Leader(1), Size(2)
[CSSD]2011-01-12 23:23:08.090 [1262557536] >ERROR: ###################################
[CSSD]2011-01-12 23:23:08.090 [1262557536] >ERROR: clssscExit: CSSD aborting ###################################
The above messages indicate that communication from node 2 to node 1 is not working, hence node 2 only sees one node, while node 1 is working fine and can see both nodes in the cluster. To avoid split brain, node 2 aborted itself.
Solution: Please engage the network administrator to check the private network layer and eliminate any network fault.
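When triaging ocssd.log, the membership view reported in the "my node ... VS Node ..." message can be extracted with a regular expression. The following is a minimal sketch, assuming the message layout shown in the excerpt above; the function name `parse_split_brain_view` and the dictionary keys are hypothetical and not part of any Oracle tool.

```python
import re

# Matches text such as:
#   my node(2), Leader(2), Size(1) VS Node(1), Leader(1), Size(2)
# Case and spacing are matched loosely because log excerpts vary.
VIEW_RE = re.compile(
    r"my node\s*\((\d+)\),\s*leader\s*\((\d+)\),\s*size\s*\((\d+)\)\s*vs\s*"
    r"node\s*\((\d+)\),\s*leader\s*\((\d+)\),\s*size\s*\((\d+)\)",
    re.IGNORECASE,
)


def parse_split_brain_view(line):
    """Return the local and remote sub-cluster views, or None if absent."""
    match = VIEW_RE.search(line)
    if match is None:
        return None
    values = [int(value) for value in match.groups()]
    return {
        "local": {"node": values[0], "leader": values[1], "size": values[2]},
        "remote": {"node": values[3], "leader": values[4], "size": values[5]},
    }


sample = (
    "[CSSD]2011-01-12 23:23:08.090 [1262557536] >ERROR: : "
    "my node(2), Leader(2), Size(1) VS Node(1), Leader(1), Size(2)"
)
view = parse_split_brain_view(sample)
print(view["local"]["size"], view["remote"]["size"])
```

A local size smaller than the remote size is consistent with the local node being the one that aborts, as in the excerpt.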
2. Real Application Cluster (database) layer
To ensure data consistency, each instance of a RAC database needs to keep a heartbeat with the other instances. The heartbeat is maintained by background processes such as LMON, LMD, LMS and LCK. If any of these processes experiences an IPC Send timeout, it triggers communication reconfiguration and instance eviction to avoid split brain. The controlfile is used in a similar way to the voting disk in the Clusterware layer to determine which instance(s) survive and which instance(s) are evicted. The voting result is similar to the Clusterware voting result. As a result, one or more instance(s) will be evicted.
Common messages in the instance alert logs are similar to:
Alert log of Instance 1:
---------
Mon Dec 07 19:43:05 2011
IPC Send timeout detected. Sender: ospid 26318
Receiver: inst 2 binc 554466600 ospid 29940
IPC Send timeout to 2.0 inc 8 for msg type 65521 from opid 20
Mon Dec 07 19:43:07 2011
Communications reconfiguration: instance_number 2
Mon Dec 07 19:43:07 2011
Trace dumping is performing id=[cdmp_20091207194307]
Waiting for Clusterware split-brain resolution
Mon Dec 07 19:53:07 2011
Evicting instance 2 from cluster
Waiting for instances to leave:
2
...
Alert log of Instance 2:
---------
Mon Dec 07 19:42:18 2011
IPC Send timeout detected. Receiver ospid 29940
Mon Dec 07 19:42:18 2011
Errors in file
/u01/app/oracle/diag/rdbms/bd/bd2/trace/bd2_lmd0_29940.trc:
Trace dumping is performing id=[cdmp_20091207194307]
Mon Dec 07 19:42:20 2011
Waiting for Clusterware split-brain resolution
Mon Dec 07 19:44:45 2011
ERROR: LMS0 (ospid: 29942) detects an idle connection to instance 1
Mon Dec 07 19:44:51 2011
ERROR: LMD0 (ospid: 29940) detects an idle connection to instance 1
Mon Dec 07 19:45:38 2011
ERROR: LMS1 (ospid: 29954) detects an idle connection to instance 1
Mon Dec 07 19:52:27 2011
Errors in file
/u01/app/oracle/diag/rdbms/bd/bd2/trace/pvbd2_lmon_29938.trc
(incident=90153):
ORA-29740: evicted by member 0, group incarnation 10
Incident details in:
/u01/app/oracle/diag/rdbms/bd/bd2/incident/incdir_90153/bd2_lmon_29938_i90153.trc
In the above example, instance 2 LMD0 (pid 29940) is the receiver in the IPC Send timeout. Various conditions can cause an IPC Send timeout. For example:
A. Network problem
B. Process hang
C. Bugs, etc.
Please see Top 5 Issues for Instance Eviction (Document 1374110.1) for more information.
In case of an instance eviction, the alert logs and all background traces need to be checked to determine the root cause.
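As a starting point for that check, the eviction-related lines can be pulled out of an alert log with a simple scan. This is a minimal sketch, assuming the message texts shown in the alert log excerpts above; the marker list and the function name `find_eviction_markers` are illustrative choices, not an exhaustive or official set.

```python
# Message fragments taken from the alert log excerpts in this note,
# compared in lower case so that capitalization differences do not matter.
MARKERS = (
    "ipc send timeout",
    "waiting for clusterware split-brain resolution",
    "detects an idle connection",
    "evicting instance",
    "ora-29740",
)


def find_eviction_markers(lines):
    """Return (line_number, marker, text) for each line containing a marker."""
    hits = []
    for number, line in enumerate(lines, start=1):
        lowered = line.lower()
        for marker in MARKERS:
            if marker in lowered:
                hits.append((number, marker, line.strip()))
                break  # report each line once, under the first marker found
    return hits


sample_log = [
    "Mon Dec 07 19:43:05 2011",
    "IPC Send timeout detected. Sender: ospid 26318",
    "Waiting for clusterware split-brain resolution",
    "Evicting instance 2 from cluster",
    "ORA-29740: evicted by member 0, group incarnation 10",
]
for number, marker, text in find_eviction_markers(sample_log):
    print(number, marker, text)
```

To scan a real file, pass an open file object instead of the sample list; the alert log path depends on the installation and is not assumed here.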
Known Issues
1. Bug 7653579 - IPC send timeout in RAC after period (Document 7653579.8)
Refer: ORA-29740 Instance (ASM/DB) Eviction on Solaris SPARC (Document 761717.1)
Fixed in: 11.2.0.1, 11.1.0.7.2 PSU and 11.1.0.7 Patch on Windows
2. Unpublished Bug 8267580 - Wrong instance evicted under high CPU load
Refer: Wrong Instance Evicted Under High CPU Load in 11.1.0.7 (Document 1373749.1)
Fixed in: 11.2.0.1
3. Bug 8365141 - DRM quiesce step hang causes instance eviction (Document 8365141.8)
Fixed in: 10.2.0.5, 11.1.0.7.3, 11.1.0.7 patch for Windows and 11.2.0.1
4. Bug 7587008 - Hung RAC instance not evicted from cluster (Document 7587008.8)
Fixed in: 10.2.0.4.4, 10.2.0.5 and 11.2.0.1; one-off patches are available for various 11.1.0.7 releases
5. Bug 11890804 - LMHB crashes instance with ORA-29770 after long "control file sequential read" waits (Document 11890804.8)
Fixed in: 11.2.0.2.5, 11.2.0.3 and 11.2.0.2 patches on Windows
6. Bug 13732226 - Node gets evicted with reason code 0x2
Bug 13399435 - KJFCDRMRCFG waited 249 secs for LMD to receive all FTDONEs, requesting kill
Bug 13503204 - Instance eviction due to reason 0x200000
Refer: 11gR2: LMON received an instance eviction notification from instance n (Document 1440892.1)
Fixed in: 11.2.0.4; some merge patches are available for 11.2.0.2 and 11.2.0.3
What is Split Brain in Oracle Clusterware and Real Application Cluster (Doc ID 1425586.1)