Case of Oracle RAC node restart caused by mapid overflow

Last Update:2017-01-13 Source: Internet

Author: User

Tags error code stub oracle database

Developer on Alibaba Coud: Build your first app with APIs, SDKs, and tutorials on the Alibaba Cloud. Read more ＞

A customer's CRM database experienced a node downtime on the left at on January 1, September 17, 2015. The error message of node 1 is as follows:

Thu Sep 17 11:38:10 2015
Use ADRCI or Support Workbench to package the incident.
See Note 411.1 at My Oracle Support for error and packaging details.
NOTE: ASMB terminating
Errors in file/oracle/app/oracle/diag/rdbms/crm2db/crm2db1/trace/crm2db1_asmb_111424.trc:
ORA-15064: communication failure with ASM instance
ORA-00600: internal error code, arguments: [ORA_NPI_ERROR], [600], [ORA-00600: internal error code, arguments: [kffilCreate01], [crm2db1: crm2db], [], [], [], [], [], [], [], [], [], []
], [], [], [], [], [], [], [], [], []
Errors in file/oracle/app/oracle/diag/rdbms/crm2db/crm2db1/trace/crm2db1_asmb_111424.trc:
ORA-15064: communication failure with ASM instance
ORA-00600: internal error code, arguments: [ORA_NPI_ERROR], [600], [ORA-00600: internal error code, arguments: [kffilCreate01], [crm2db1: crm2db], [], [], [], [], [], [], [], [], [], []
], [], [], [], [], [], [], [], [], []
ASMB (ospid: 11141424): terminating the instance due to error 15064
Thu Sep 17 11:38:10 2015
System state dump requested by (instance = 1, osid = 11141424 (ASMB), summary = [abnormal instance termination].
System State dumped to trace file/oracle/app/oracle/diag/rdbms/crm2db/crm2db1/trace/crm2db1_diag_1_14592.trc
Thu Sep 17 11:38:10 2015
Opiodr aborting process unknown ospid (22414298) as a result of ORA-1092
Termination issued to instance processes. Waiting for the processes to exit
Thu Sep 17 11:38:25 2015
ORA-1092: opitsk aborting process
Thu Sep 17 11:38:26 2015
Instance termination failed to kill one or more processes
Instance terminated by ASMB, pid = 11141424
Thu Sep 17 11:39:43 2015
Starting ORACLE instance (normal)
According to the database alert log of node 1, the ASM fault occurred at 11:38:10, and the ASM instance of node 1 was forcibly terminated by the ASMB process of the Oracle ASM instance at 11:38:26, as a result, the database instance is also crash. The log information is as follows:

Thu Sep 17 11:38:25 2015 ORA-1092: opitsk aborting process
Thu Sep 17 11:38:26 2015
Instance termination failed to kill one or more processes
Instance terminated by ASMB, pid = 11141424

At this time, no other information is found in the database alert log of node 2. Only the information of node 1 instance is restarted. Therefore, it is not difficult to see that the node ASM instance is forcibly terminated, which is the key to this problem. We further analyze the log of node 1 ASM instance and find the following information:

Tue Sep 21:51:52 2015
Time drift detected. Please check VKTM trace file for more details.
Thu Sep 17 11:38:06 2015
Errors in file/oracle/app/grid/diag/asm/+ ASM1/trace/+ ASM1_ora_8716976.trc (incident = 179751 ):
ORA-00600: internal error code, arguments: [kffilCreate01], [crm2db1: crm2db], [], [], [], [], [], [], [], [], [], []
Incident details in:/oracle/app/grid/diag/asm/+ ASM1/incident/incdir_179751/+ ASM1_ora_8716976_i179751.trc
Use ADRCI or Support Workbench to package the incident.
See Note 411.1 at My Oracle Support for error and packaging details.
Thu Sep 17 11:38:10 2015
Dumping diagnostic data in directory = [cdmp_20150917113859], requested by (instance = 1, osid = 8716976), summary = [incident = 179751].
Thu Sep 17 11:38:24 2015
Sweep [inc] [179751]: completed
Sweep [inc2] [179751]: completed
Thu Sep 17 11:39:29 2015
NOTE: ASM client crm2db1: crm2db disconnected unexpectedly.
NOTE: check client alert log.
NOTE: Trace records dumped in trace file/oracle/app/grid/diag/asm/+ ASM1/trace/+ ASM1_ora_8716976.trc
Thu Sep 17 11:39:29 2015
System State dumped to trace file/oracle/app/grid/diag/asm/+ ASM1/trace/+ asm1_ora_427292.16.trc
Thu Sep 17 11:40:10 2015
NOTE: client crm2db1: crm2db registered, osid 50266474, mbr 0x1

We can see that at 11:38:06, after the ASM instance throws the ORA-00600 [kffilCreate01] error, it then terminates the instance and writes the log information to the + asmacroora_8716976_i179751.trc file, the content of the trace file is as follows:

Address: 0xfffffffffff3cc8
Incident ID: 179751
Problem Key: ORA 600 [kffilCreate01]
Error: ORA-600 [kffilCreate01] [crm2db1: crm2db] [] [] [] [] [] [] [] [] [] [] []
[00]: dbgexProcessError [diag_dde]
[01]: dbgeExecuteForError [diag_dde]
[02]: dbgePostErrorKGE [diag_dde]
[03]: dbkePostKGE_kgsf [rdbms_dde]
[04]: kgeadse []
[05]: kgerinv_internal []
[06]: kgerinv []
[07]: kgeasnmierr []
[08]: kffilCreate [ASM] <-- Signaling
[09]: kfnsUFG [KFNU]
[10]: kfnsBackground [KFNU]
[11]: kfnDispatch [KFN]
[12]: opiodr []
[13]: ttcpip []
[14]: opitsk []
[15]: opiino []
[16]: opiodr []
[17]: opidrv []
[18]: sou2o []
[19]: opimai_real []
[20]: ssthrdmain []
[21]: main []

We can see that Oracle encountered an exception when calling the kffcicreate function, resulting in the forced termination of the asm instance. After analysis, we found that the problem is consistent with Oracle Bug 16357438 (ORA-600 [KFFILCREATE01] in asm triggers db crash when mapid reaches max value. The following is the call stack function description for this bug in the Oracle metalink document:

[00]: dbgexProcessError [diag_dde]
[01]: dbgeExecuteForError [diag_dde]
[02]: dbgePostErrorKGE [diag_dde]
[03]: dbkePostKGE_kgsf [rdbms_dde]
[04]: kgeadse []
[05]: kgerinv_internal []
[06]: kgerinv []
[07]: kgeasnmierr []
[08]: kffilCreate [ASM] <-- Signaling
[09]: kfnsUFG [KFNU]
[10]: kfnsBackground [KFNU]
[11]: kfnDispatch [KFN]
[12]: opiodr []
[13]: ttcpip []
[14]: opitsk []
[15]: opiino []
[16]: opiodr []
[17]: opidrv []
[18]: sou2o []
[19]: opimai_real []
[20]: ssthrdmain []
[21]: main []
[22]: _ start []

For Oracle database bug determination, the common practice is to compare whether the call stack function call in the error log file is consistent with the call stack function description of Oracle bug. Generally, the similarity is over 90%, this means that it is highly likely to meet the bug. Based on this analysis, I found that the call stack and the call stack descriptions of Oracle bugs are exactly the same.
According to the description of this bug in the Oracle document, it generally means that the mapid of the asm instance overflows, the exported asm instance cannot operate normally, and is forcibly terminated by the asmb process. Of course, the purpose of force termination is to maintain the integrity of the cluster node data. The mapid information extracted from the trace file is as follows:

Roger $ cat ASM1_ora_8716976_i179751.trc | grep 'mapid :'
(Kffil) netnm: crm2db1: crm2db, mapid: 4293682398
(Kffil) netnm: crm2db1: crm2db, mapid: 4249534089
(Kffil) netnm: crm2db1: crm2db, mapid: 4185128984
(Kffil) netnm: crm2db1: crm2db, mapid: 4185125967
(Kffil) netnm: crm2db1: crm2db, mapid: 4185121734
(Kffil) netnm: crm2db1: crm2db, mapid: 4152108652
(Kffil) netnm: crm2db1: crm2db, mapid: 4139887931
.......
(Kffil) netnm: crm2db1: crm2db, mapid: 520724217
(Kffil) netnm: crm2db1: crm2db, mapid: 482740313
(Kffil) netnm: crm2db1: crm2db, mapid: 394435399
(Kffil) netnm: crm2db1: crm2db, mapid: 298438600
(Kffil) netnm: crm2db1: crm2db, mapid: 171948102
(Kffil) netnm: crm2db1: crm2db, mapid: 291
(Kffil) netnm: crm2db1: crm2db, mapid: 289
(Kffil) netnm: crm2db1: crm2db, mapid: 278
(Kffil) netnm: crm2db1: crm2db, mapid: 275
.......
(Kffil) netnm: crm2db1: crm2db, mapid: 42
(Kffil) netnm: crm2db1: crm2db, mapid: 38
(Kffil) netnm: crm2db1: crm2db, mapid: 3
(Kffil) netnm: <null>, mapid: <null>
(Kffil) netnm: crm2db1: crm2db, mapid: 4294967295
(Kffil) netnm: crm2db1: crm2db, mapid: 4294967294
(Kffil) netnm: crm2db1: crm2db, mapid: 4294967293

We can see that mapid has reached the maximum value of 4294967296, which is the upper limit of data objects in the Oracle Database. At the same time, we can also see similar information about mapid overflow from the KST trace information in the trace file:

11:38:06. 075000: C23CEF42: KFNS: kfn. c @ 708: kfnDispatch (): calling server stub for KFNOP = 5
11:38:06. 075011: C23CEF45: KFNU: kfns. c @ 1725: kfnsBackground (): kfnsBackground consuming in-progress message typ = 13 unread = 1 status = 0xfffffff1
11:38:06. 075012: C23CEF46: KFNU: kfns. c @ 1963: kfnsConsume (): kfnsConsume message 0x700000124bd6670 status = 0xfffffff1 flags = 0x0
11:38:06. 075023: C23CEF4C: KFNU: kfn. c @ 5313: kfnmsg_Dump (): kfnsBg (kfnmsg) opcode = 14 (KFNMS_MAPFREE) fd 0x700000124bea1e0 mid 4294967291, gn [7.3971105324], fn [267.811087543] exts 1025, start 60, cnt 965
11:38:06. 075025: C23CEF4E: KFNU: kfns. c @ 1784: kfnsBackground (): kfnsBackground got targeted message
11:38:06. 075029: C23CEF54: db_trace: kfns. c @ 5342: kfnsFillMapMsg (): [10491: 27: 865] MAP_FREE: gn = 7 fn = 267, mapid =-5
11:38:06. 075047: C23CEF5C: KFNU: kfns. c @ 1940: kfnsBackground (): kfnsBackground completed in 0 seconds (KFNPM = 2)
11:38:06. 075048: C23CEF5D: KFNS: kfn. c @ 739: kfnDispatch (): completed KFNOP = 5 callcnt =-1386638497
11:38:06. 075107: C23CEF7A: KFNS: kfn. c @ 708: kfnDispatch (): calling server stub for KFNOP = 5
11:38:06. 075109: C23CEF7C: KFNU: kfns. c @ 1725: kfnsBackground (): kfnsBackground consuming in-progress message typ = 14 unread = 1 status = 0xfffffff1
11:38:06. 075110: C23CEF7E: KFNU: kfns. c @ 1963: kfnsConsume (): kfnsConsume message 0x700000126f38a58 status = 0xfffffff1 flags = 0x0
11:38:06. 075113: C23CEF83: KFNU: kfn. c @ 5313: kfnmsg_Dump (): kfnsBg (kfnmsg) opcode = 14 (KFNMS_MAPFREE) fd 0x700000124beffe0 mid 4294967292, gn [7.3971105324], fn [267.811087543] exts 1025, start 60, cnt 965
11:38:06. 075114: C23CEF84: KFNU: kfns. c @ 1784: kfnsBackground (): kfnsBackground got targeted message
11:38:06. 075118: C23CEF85: db_trace: kfns. c @ 5342: kfnsFillMapMsg (): [10491: 27: 865] MAP_FREE: gn = 7 fn = 267, mapid =-4
We can see that mapid has a negative number, which indicates that mapid has exceeded.
Based on the previous analysis, we believe that this fault fully complies with the description details of Oracle bug 16357438. Therefore, we recommend that you install the Patch in the next maintenance time window.

This article is an English version of an article which is originally in the Chinese language on aliyun.com and is provided for information purposes only. This website makes no representation or warranty of any kind, either expressed or implied, as to the accuracy, completeness ownership or reliability of the article or any translations thereof. If you have any concerns or complaints relating to the article, please send an email, providing a detailed description of the concern or complaint, to info-contact@alibabacloud.com. A staff member will contact you within 5 working days. Once verified, infringing content will be removed immediately.

A Free Trial That Lets You Build Big!

Start building with 50+ products and up to 12 months usage for Elastic Compute Service

Get Started for Free

Sales Support

1 on 1 presale consultation

Chat Contact Sales
After-Sales Support

24/7 Technical Support 6 Free Tickets per Quarter Faster Response

Open a Ticket
Alibaba Cloud offers highly flexible support services tailored to meet your exact needs.

Learn More