[Translated from a MOS article] What is Split Brain in Oracle Clusterware and RAC
What is Split Brain in Oracle Clusterware and RAC
Source:
What is Split Brain in Oracle Clusterware and Real Application Cluster (Doc ID 1425586.1)
Applies to:
Oracle Database - Enterprise Edition - Version 10.1.0.2 and later
Information in this document applies to any platform.
Purpose:
This document explains split brain in Oracle Clusterware and RAC, as well as the related errors and their consequences.
Details:
In general terms, split brain refers to data inconsistency that arises when two distinct data sets overlap in scope, either because of the network design between the servers, or because of a failure in an environment that relies on the servers communicating with each other and keeping their data consistent.
Split brain can occur in two components:
1. Clusterware layer:
Cluster nodes maintain their heartbeats with each other through the private network (interconnect) and the voting disk.
When the private network fails and the cluster nodes have been unable to communicate with each other over it for longer than the time specified by the misscount setting, split brain occurs.
In this case, the voting disk is used to determine which node(s) survive and which node(s) are evicted from the cluster. The typical voting result is as follows (see the sketch after this list):
a. The group with more cluster nodes survives.
b. The group containing the lower node number survives, in case the same number of nodes is available in each group.
c. Some improvement has been made to ensure that node(s) with lower load survive in case the eviction is caused by high system load.
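To make the voting rules above concrete, the following is a minimal Python sketch of the decision. It is an illustration only, not actual CSSD code; the function name surviving_cohort is an assumption, and the load-based rule (c) is only noted in a comment because it is version dependent.

# A minimal sketch (not actual CSSD code) of the survival rules listed above.
# Each sub-cluster ("cohort") is modelled as a list of node numbers.

def surviving_cohort(cohort_a, cohort_b):
    """Return the cohort that survives a clusterware split brain."""
    # Rule a: the cohort with more cluster nodes survives.
    if len(cohort_a) != len(cohort_b):
        return cohort_a if len(cohort_a) > len(cohort_b) else cohort_b
    # Rule b: with equal sizes, the cohort containing the lower node number survives.
    # (Rule c, in newer versions, may additionally prefer the cohort with lower
    #  system load when the eviction is caused by high load; omitted here.)
    return cohort_a if min(cohort_a) < min(cohort_b) else cohort_b

print(surviving_cohort([1, 3], [2]))  # -> [1, 3]  (rule a: larger cohort)
print(surviving_cohort([1], [2]))     # -> [1]     (rule b: lower node number)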
Typically, when split brain occurs, messages similar to the following appear in ocssd.log:
[ CSSD]2011-01-12 23:23:08.090 [1262557536] >TRACE: clssnmCheckDskInfo: Checking disk info...
[ CSSD]2011-01-12 23:23:08.090 [1262557536] >ERROR: clssnmCheckDskInfo: Aborting local node to avoid splitbrain.
[ CSSD]2011-01-12 23:23:08.090 [1262557536] >ERROR: : my node(2), Leader(2), Size(1) VS Node(1), Leader(1), Size(2)
[ CSSD]2011-01-12 23:23:08.090 [1262557536] >ERROR: ###################################
[ CSSD]2011-01-12 23:23:08.090 [1262557536] >ERROR: clssscExit: CSSD aborting
###################################
The messages above show that communication from node 2 to node 1 is not working. Node 2 can therefore see only one node (itself), while node 1 is working fine and can see both nodes in the cluster. To avoid split brain, node 2 aborted itself.
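As an illustration of how to read that clssnmCheckDskInfo line, here is a small Python sketch. The regular expression and the tie-break on the leader's node number are assumptions made for this example, not documented CSSD behaviour.

import re

# Parse the quoted clssnmCheckDskInfo line and apply the "larger cohort wins" reasoning.
LINE = "my node(2), Leader(2), Size(1) VS Node(1), Leader(1), Size(2)"

m = re.search(r"my node\((\d+)\), Leader\((\d+)\), Size\((\d+)\) "
              r"VS Node\((\d+)\), Leader\((\d+)\), Size\((\d+)\)", LINE)
my_node, my_leader, my_size, other_node, other_leader, other_size = map(int, m.groups())

# The local node aborts when its cohort is smaller, or, on a tie, when its leader
# has the higher node number (assumed tie-break for illustration).
local_aborts = my_size < other_size or (my_size == other_size and my_leader > other_leader)
print(f"node {my_node} aborts itself: {local_aborts}")  # node 2 aborts itself: True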
Solution: please contact the network administrator to check the private network layer and eliminate any network problems.
2. RAC (database) layer:
To ensure data consistency, each instance of a RAC database needs to maintain a heartbeat with the other instances. The heartbeat is maintained by the background processes LMON, LMD, LMS and LCK.
If any of these processes experiences an IPC Send timeout, a communication reconfiguration and instance eviction will follow in order to avoid split brain.
Similar to the voting disk at the clusterware layer, the control file is used to determine which instance(s) survive and which instance(s) are evicted.
The voting result is similar to the clusterware voting result, and as a result one or more instances will be evicted.
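The following is a toy model (not Oracle code) of the behaviour just described: if an IPC send from one instance's LMON/LMD/LMS/LCK process to another instance is not acknowledged within a timeout, an "IPC Send timeout" is reported, a communication reconfiguration starts, and one of the instances may be evicted. The timeout value and function name are assumptions for illustration only.

IPC_SEND_TIMEOUT = 300  # seconds; illustrative, not an actual Oracle parameter

def ipc_send_timed_out(last_ack_time, now, sender, receiver_inst):
    """Report whether a send to receiver_inst should be treated as timed out."""
    if now - last_ack_time > IPC_SEND_TIMEOUT:
        print(f"IPC Send timeout detected. Sender: {sender}, Receiver: inst {receiver_inst}")
        print(f"Communications reconfiguration: instance_number {receiver_inst}")
        return True
    return False

# Example: instance 1's LMD0 has had no acknowledgement from instance 2 for 10 minutes,
# so a reconfiguration starts and instance 2 may be evicted after the control-file vote.
if ipc_send_timed_out(last_ack_time=0, now=600, sender="LMD0", receiver_inst=2):
    print("Waiting for clusterware split-brain resolution ...")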
Common messages in the instance alert log are similar to the following:
alert log of instance 1:
---------
Mon Dec 07 19:43:05 2011
IPC Send timeout detected. Sender: ospid 26318
Receiver: inst 2 binc 554466600 ospid 29940
IPC Send timeout to 2.0 inc 8 for msg type 65521 from opid 20
Mon Dec 07 19:43:07 2011
Communications reconfiguration: instance_number 2
Mon Dec 07 19:43:07 2011
Trace dumping is performing id=[cdmp_20091207194307]
Waiting for clusterware split-brain resolution
Mon Dec 07 19:53:07 2011
Evicting instance 2 from cluster
Waiting for instances to leave: 2 ...

alert log of instance 2:
---------
Mon Dec 07 19:42:18 2011
IPC Send timeout detected. Receiver ospid 29940
Mon Dec 07 19:42:18 2011
Errors in file /u01/app/oracle/diag/rdbms/bd/BD2/trace/BD2_lmd0_29940.trc:
Trace dumping is performing id=[cdmp_20091207194307]
Mon Dec 07 19:42:20 2011
Waiting for clusterware split-brain resolution
Mon Dec 07 19:44:45 2011
ERROR: LMS0 (ospid: 29942) detects an idle connection to instance 1
Mon Dec 07 19:44:51 2011
ERROR: LMD0 (ospid: 29940) detects an idle connection to instance 1
Mon Dec 07 19:45:38 2011
ERROR: LMS1 (ospid: 29954) detects an idle connection to instance 1
Mon Dec 07 19:52:27 2011
Errors in file /u01/app/oracle/diag/rdbms/bd/BD2/trace/PVBD2_lmon_29938.trc (incident=90153):
ORA-29740: evicted by member 0, group incarnation 10
Incident details in: /u01/app/oracle/diag/rdbms/bd/BD2/incident/incdir_90153/BD2_lmon_29938_i90153.trc
In the above example, instance 2's LMD0 (pid 29940) is the receiver in the IPC Send timeout. There can be various reasons for an IPC Send timeout, for example:
a. Network problem
b. Process hang
c. Bug, etc.
Please see Top 5 issues for Instance Eviction Document 1374110.1 for more information.
In the case of an instance eviction, the alert log and all background process trace files need to be examined to determine the root cause.
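A small helper script like the following (an assumption for illustration, not part of the MOS note) can scan an alert log for the eviction-related messages shown above, to narrow down the time window that should then be checked in the background process trace files.

import re
import sys

PATTERNS = [
    r"IPC Send timeout",
    r"Communications reconfiguration",
    r"Waiting for clusterware split-brain resolution",
    r"detects an idle connection",
    r"Evicting instance",
    r"ORA-29740",
]

def scan_alert_log(path):
    """Print every alert log line that matches one of the eviction-related patterns."""
    regex = re.compile("|".join(PATTERNS))
    with open(path, errors="replace") as f:
        for lineno, line in enumerate(f, 1):
            if regex.search(line):
                print(f"{lineno}: {line.rstrip()}")

if __name__ == "__main__":
    # Usage: python scan_alert_log.py <path to alert log>
    scan_alert_log(sys.argv[1])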
Known Issues
1. Bug 7653579 - IPC send timeout in RAC after only short period Document 7653579.8
Refer: ORA-29740 Instance (ASM/DB) eviction on Solaris SPARC Document 761717.1
Fixed in: 11.2.0.1, 11.1.0.7.2 PSU and 11.1.0.7 Patch 22 on Windows
2. Unpublished Bug 8267580: Wrong Instance Evicted Under High CPU Load
Refer: Wrong Instance Evicted Under High CPU Load in 11.1.0.7 Document 1373749.1
Fixed in: 11.2.0.1
3. Bug 8365141 - DRM quiesce step hang causes instance eviction Document 8365141.8
Fixed in: 10.2.0.5, 11.1.0.7.3, 11.1.0.7 patch 25 for Windows and 11.2.0.1
4. Bug 7587008 - Hung RAC instance not evicted from cluster Document 7587008.8
Fixed in: 10.2.0.4.4, 10.2.0.5 and 11.2.0.1, one-off patch available for various 11.1.0.7 releases
5. Bug 11890804 - LMHB crashes instance with ORA-29770 after long "control file sequential read" waits Document 11890804.8
Fixed in: 11.2.0.2.5, 11.2.0.3 and 11.2.0.2 Patch 10 on Windows
6. BUG:13732226 - NODE GETS EVICTED WITH REASON CODE 0X2
BUG:13399435 - KJFCDRMRCFG WAITED 249 SECS FOR LMD TO RECEIVE ALL FTDONES, REQUESTING KILL
BUG:13503204 - INSTANCE EVICTION DUE TO REASON 0X200000
Refer: 11gR2: LMON received an instance eviction notification from instance n Document 1440892.1
Fixed in: 11.2.0.4 and some merge patch available for 11.2.0.2 and 11.2.0.3