某客戶的CRM資料庫在2015年9月17號11:38分左出現其中一個節點宕機的情況。其中節點1報錯資訊如下:
Thu Sep 17 11:38:10 2015
Use ADRCI or Support Workbench to package the incident.
See Note 411.1 at My Oracle Support for error and packaging details.
NOTE: ASMB terminating
Errors in file /oracle/app/oracle/diag/rdbms/crm2db/crm2db1/trace/crm2db1_asmb_11141424.trc:
ORA-15064: communication failure with ASM instance
ORA-00600: internal error code, arguments: [ORA_NPI_ERROR], [600], [ORA-00600: internal error code, arguments: [kffilCreate01], [crm2db1:crm2db], [], [], [], [], [], [], [], [], [], []
], [], [], [], [], [], [], [], [], []
Errors in file /oracle/app/oracle/diag/rdbms/crm2db/crm2db1/trace/crm2db1_asmb_11141424.trc:
ORA-15064: communication failure with ASM instance
ORA-00600: internal error code, arguments: [ORA_NPI_ERROR], [600], [ORA-00600: internal error code, arguments: [kffilCreate01], [crm2db1:crm2db], [], [], [], [], [], [], [], [], [], []
], [], [], [], [], [], [], [], [], []
ASMB (ospid: 11141424): terminating the instance due to error 15064
Thu Sep 17 11:38:10 2015
System state dump requested by (instance=1, osid=11141424 (ASMB)), summary=[abnormal instance termination].
System State dumped to trace file /oracle/app/oracle/diag/rdbms/crm2db/crm2db1/trace/crm2db1_diag_12714592.trc
Thu Sep 17 11:38:10 2015
opiodr aborting process unknown ospid (22414298) as a result of ORA-1092
Termination issued to instance processes. Waiting for the processes to exit
Thu Sep 17 11:38:25 2015
ORA-1092 : opitsk aborting process
Thu Sep 17 11:38:26 2015
Instance termination failed to kill one or more processes
Instance terminated by ASMB, pid = 11141424
Thu Sep 17 11:39:43 2015
Starting ORACLE instance (normal)
從節點1資料庫db alert log日誌來看,在11:38:10出現ASM故障,最後在11:38:26時間,節點1的ASM執行個體被Oracle ASM執行個體的ASMB進程強行終止,因此導致資料庫執行個體也crash掉。如下是相關日誌資訊:
Thu Sep 17 11:38:25 2015 ORA-1092 : opitsk aborting process
Thu Sep 17 11:38:26 2015
Instance termination failed to kill one or more processes
Instance terminated by ASMB, pid = 11141424
而此時在節點2的資料庫alert log日誌中,沒有發現其他資訊,僅僅只有節點1執行個體被重啟的資訊。因此,不難看出,節點ASM執行個體被強行終止,是此次問題的關鍵所在。我們進一步分析節點1 ASM執行個體的日誌,發現如下資訊:
Tue Sep 15 21:51:52 2015
Time drift detected. Please check VKTM trace file for more details.
Thu Sep 17 11:38:06 2015
Errors in file /oracle/app/grid/diag/asm/+asm/+ASM1/trace/+ASM1_ora_8716976.trc (incident=179751):
ORA-00600: internal error code, arguments: [kffilCreate01], [crm2db1:crm2db], [], [], [], [], [], [], [], [], [], []
Incident details in: /oracle/app/grid/diag/asm/+asm/+ASM1/incident/incdir_179751/+ASM1_ora_8716976_i179751.trc
Use ADRCI or Support Workbench to package the incident.
See Note 411.1 at My Oracle Support for error and packaging details.
Thu Sep 17 11:38:10 2015
Dumping diagnostic data in directory=[cdmp_20150917113809], requested by (instance=1, osid=8716976), summary=[incident=179751].
Thu Sep 17 11:38:24 2015
Sweep [inc][179751]: completed
Sweep [inc2][179751]: completed
Thu Sep 17 11:39:29 2015
NOTE: ASM client crm2db1:crm2db disconnected unexpectedly.
NOTE: check client alert log.
NOTE: Trace records dumped in trace file /oracle/app/grid/diag/asm/+asm/+ASM1/trace/+ASM1_ora_8716976.trc
Thu Sep 17 11:39:29 2015
System State dumped to trace file /oracle/app/grid/diag/asm/+asm/+ASM1/trace/+ASM1_ora_42729726.trc
Thu Sep 17 11:40:10 2015
NOTE: client crm2db1:crm2db registered, osid 50266474, mbr 0x1
我們可以看到,在11:38:06時間點,ASM執行個體拋出ORA-00600 [kffilCreate01] 錯誤後,接著將執行個體終止,同時將日誌資訊寫入到+ASM1_ora_8716976_i179751.trc檔案中,該trace 檔案的內容如下:
Address: 0xfffffffffff3cc8
Incident ID: 179751
Problem Key: ORA 600 [kffilCreate01]
Error: ORA-600 [kffilCreate01] [crm2db1:crm2db] [] [] [] [] [] [] [] [] [] []
[00]: dbgexProcessError [diag_dde]
[01]: dbgeExecuteForError [diag_dde]
[02]: dbgePostErrorKGE [diag_dde]
[03]: dbkePostKGE_kgsf [rdbms_dde]
[04]: kgeadse []
[05]: kgerinv_internal []
[06]: kgerinv []
[07]: kgeasnmierr []
[08]: kffilCreate [ASM]<-- Signaling
[09]: kfnsUFG [KFNU]
[10]: kfnsBackground [KFNU]
[11]: kfnDispatch [KFN]
[12]: opiodr []
[13]: ttcpip []
[14]: opitsk []
[15]: opiino []
[16]: opiodr []
[17]: opidrv []
[18]: sou2o []
[19]: opimai_real []
[20]: ssthrdmain []
[21]: main []
我們可以看出,Oracle 在調用kffcicreate函數時出現了異常,進而導致asm執行個體被強行終止。經過分析我們發現該問題跟Oracle Bug 16357438 (ORA-600 [KFFILCREATE01] IN ASM TRIGGERS DB CRASH WHEN MAPID REACHES MAX VALUE )比較符合。如下是Oracle metalink文檔針對該bug的call stack函數描述:
[00]: dbgexProcessError [diag_dde]
[01]: dbgeExecuteForError [diag_dde]
[02]: dbgePostErrorKGE [diag_dde]
[03]: dbkePostKGE_kgsf [rdbms_dde]
[04]: kgeadse []
[05]: kgerinv_internal []
[06]: kgerinv []
[07]: kgeasnmierr []
[08]: kffilCreate [ASM]<-- Signaling
[09]: kfnsUFG [KFNU]
[10]: kfnsBackground [KFNU]
[11]: kfnDispatch [KFN]
[12]: opiodr []
[13]: ttcpip []
[14]: opitsk []
[15]: opiino []
[16]: opiodr []
[17]: opidrv []
[18]: sou2o []
[19]: opimai_real []
[20]: ssthrdmain []
[21]: main []
[22]: __start []
對於Oracle資料庫bug的判斷,通常的做法是比較報錯記錄檔中的call stack函數調用是否與Oracle bug的call stack函數描述一致,一般來講,相似性超多90%,則意味是符合bug的可能性就極大。基於這樣的分析,我對比發現call stack與Oracle bug 的call stack描述完全一致。
根據Oracle 文檔關於該bug的描述,大致意思是指asm執行個體的mapid溢出,匯出asm執行個體無法正常運作,進而被asmb進程強行終止;當然,強行終止的目的是為了保持叢集節點資料的完整性,如下是從trace檔案中提取的mapid資訊:
Roger$ cat ASM1_ora_8716976_i179751.trc |grep 'mapid:'
(kffil) netnm: crm2db1:crm2db, mapid: 4293682398
(kffil) netnm: crm2db1:crm2db, mapid: 4249534089
(kffil) netnm: crm2db1:crm2db, mapid: 4185128984
(kffil) netnm: crm2db1:crm2db, mapid: 4185125967
(kffil) netnm: crm2db1:crm2db, mapid: 4185121734
(kffil) netnm: crm2db1:crm2db, mapid: 4152108652
(kffil) netnm: crm2db1:crm2db, mapid: 4139887931
.......
(kffil) netnm: crm2db1:crm2db, mapid: 520724217
(kffil) netnm: crm2db1:crm2db, mapid: 482740313
(kffil) netnm: crm2db1:crm2db, mapid: 394435399
(kffil) netnm: crm2db1:crm2db, mapid: 298438600
(kffil) netnm: crm2db1:crm2db, mapid: 171948102
(kffil) netnm: crm2db1:crm2db, mapid: 291
(kffil) netnm: crm2db1:crm2db, mapid: 289
(kffil) netnm: crm2db1:crm2db, mapid: 278
(kffil) netnm: crm2db1:crm2db, mapid: 275
.......
(kffil) netnm: crm2db1:crm2db, mapid: 42
(kffil) netnm: crm2db1:crm2db, mapid: 38
(kffil) netnm: crm2db1:crm2db, mapid: 3
(kffil) netnm: <null>, mapid: <null>
(kffil) netnm: crm2db1:crm2db, mapid: 4294967295
(kffil) netnm: crm2db1:crm2db, mapid: 4294967294
(kffil) netnm: crm2db1:crm2db, mapid: 4294967293
我們可以看到,mapid已經達到最大值4294967296,這是Oracle資料庫中資料對象的上限,同時我們從trace檔案中的KST trace資訊中也能看mapid溢出的類似資訊:
2015-09-17 11:38:06.075000 :C23CEF42:KFNS:kfn.c@708:kfnDispatch(): calling server stub for KFNOP=5
2015-09-17 11:38:06.075011 :C23CEF45:KFNU:kfns.c@1725:kfnsBackground(): kfnsBackground consuming in-progress message typ=13 unread=1 status=0xfffffff1
2015-09-17 11:38:06.075012 :C23CEF46:KFNU:kfns.c@1963:kfnsConsume(): kfnsConsume message 0x700000124bd6670 status=0xfffffff1 flags=0x0
2015-09-17 11:38:06.075023 :C23CEF4C:KFNU:kfn.c@5313:kfnmsg_Dump(): kfnsBg (kfnmsg) opcode=14 (KFNMS_MAPFREE) fd 0x700000124bea1e0 mid 4294967291, gn [7.3971105324], fn [267.811087543] exts 1025, start 60, cnt 965
2015-09-17 11:38:06.075025 :C23CEF4E:KFNU:kfns.c@1784:kfnsBackground(): kfnsBackground got targeted message
2015-09-17 11:38:06.075029 :C23CEF54:db_trace:kfns.c@5342:kfnsFillMapMsg(): [10491:27:865] MAP_FREE: gn=7 fn=267, mapid=-5
2015-09-17 11:38:06.075047 :C23CEF5C:KFNU:kfns.c@1940:kfnsBackground(): kfnsBackground completed in 0 seconds (KFNPM=2)
2015-09-17 11:38:06.075048 :C23CEF5D:KFNS:kfn.c@739:kfnDispatch(): completed KFNOP=5 callcnt=-1386638497
2015-09-17 11:38:06.075107 :C23CEF7A:KFNS:kfn.c@708:kfnDispatch(): calling server stub for KFNOP=5
2015-09-17 11:38:06.075109 :C23CEF7C:KFNU:kfns.c@1725:kfnsBackground(): kfnsBackground consuming in-progress message typ=14 unread=1 status=0xfffffff1
2015-09-17 11:38:06.075110 :C23CEF7E:KFNU:kfns.c@1963:kfnsConsume(): kfnsConsume message 0x700000126f38a58 status=0xfffffff1 flags=0x0
2015-09-17 11:38:06.075113 :C23CEF83:KFNU:kfn.c@5313:kfnmsg_Dump(): kfnsBg (kfnmsg) opcode=14 (KFNMS_MAPFREE) fd 0x700000124beffe0 mid 4294967292, gn [7.3971105324], fn [267.811087543] exts 1025, start 60, cnt 965
2015-09-17 11:38:06.075114 :C23CEF84:KFNU:kfns.c@1784:kfnsBackground(): kfnsBackground got targeted message
2015-09-17 11:38:06.075118 :C23CEF85:db_trace:kfns.c@5342:kfnsFillMapMsg(): [10491:27:865] MAP_FREE: gn=7 fn=267, mapid=-4
我們可以看出,mapid已經出現了負數,這邊表明mapid出現了溢出。
基於前面的種種分析,我們認為此次故障完全符合Oracle bug 16357438的描述細節,因此建議客戶在下次維護時間視窗中安裝相關Patch。