ORA-00600 [kjctr_pbmsg:badbmsg2]
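I recently hit ORA-00600 [kjctr_pbmsg:badbmsg2], which caused a RAC node's instance to restart. The problem was eventually traced to an unstable private network (interconnect).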
ORA-00600: internal error code, arguments: [kjctr_pbmsg:badbmsg2], [0x9FFFFFFFFC996B58], [0x9FFFFFFFFC9976B8], [], [], [], [], [], [], [], [], []
LMS1 (ospid: 12379): terminating the instance due to error 484
1. The detailed analysis follows. First, check the logs:
alert log
Mon Aug 11 23:53:10 2014
Errors in file /oracle/app/oracle/diag/rdbms/cdrdb/orcl/trace/orcl_lms1_12379.trc (incident=1104178):
ORA-00600: internal error code, arguments: [kjctr_pbmsg:badbmsg2], [0x9FFFFFFFFC996B58], [0x9FFFFFFFFC9976B8], [], [], [], [], [], [], [], [], []
Incident details in: /oracle/app/oracle/diag/rdbms/cdrdb/orcl/incident/incdir_1104178/orcl_lms1_12379_i1104178.trc
Mon Aug 11 23:53:12 2014
Dumping diagnostic data in directory=[cdmp_20140811235312], requested by (instance=1, osid=12379 (LMS1)), summary=[incident=1104178].
Use ADRCI or Support Workbench to package the incident.
See Note 411.1 at My Oracle Support for error and packaging details.
Mon Aug 11 23:53:13 2014
Sweep [inc][1104178]: completed
Sweep [inc2][1104178]: completed
Errors in file /oracle/app/oracle/diag/rdbms/cdrdb/orcl/trace/orcl_lms1_12379.trc:
ORA-00600: internal error code, arguments: [kjctr_pbmsg:badbmsg2], [0x9FFFFFFFFC996B58], [0x9FFFFFFFFC9976B8], [], [], [], [], [], [], [], [], []
LMS1 (ospid: 12379): terminating the instance due to error 484
Mon Aug 11 23:53:22 2014
ORA-1092 : opitsk aborting process
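When the alert log is long, the two lines that matter here (the ORA-00600 first argument and the process that terminated the instance) can be pulled out with a small script. The following is a minimal sketch, assuming the line formats look exactly like the excerpt above; the function name `scan_alert_log` and both regular expressions are my own and not part of any Oracle tooling.

```python
import re

# Patterns modelled on the alert log lines quoted above (an assumption:
# other Oracle versions may format these lines differently).
ORA600_RE = re.compile(r"ORA-00600: internal error code, arguments: \[([^\]]*)\]")
TERMINATE_RE = re.compile(
    r"^(\w+) \(ospid: (\d+)\): terminating the instance due to error (\d+)"
)


def scan_alert_log(lines):
    """Collect ORA-00600 first arguments and instance-termination records."""
    ora600_args = []
    terminations = []
    for line in lines:
        match = ORA600_RE.search(line)
        if match:
            ora600_args.append(match.group(1))
        match = TERMINATE_RE.search(line)
        if match:
            # (process name, OS pid, error number)
            terminations.append(
                (match.group(1), int(match.group(2)), int(match.group(3)))
            )
    return ora600_args, terminations


# Two lines copied from the alert log excerpt above.
sample = [
    "ORA-00600: internal error code, arguments: [kjctr_pbmsg:badbmsg2], "
    "[0x9FFFFFFFFC996B58], [0x9FFFFFFFFC9976B8], [], [], [], [], [], [], [], [], []",
    "LMS1 (ospid: 12379): terminating the instance due to error 484",
]
args, terms = scan_alert_log(sample)
print(args)
print(terms)
```

To scan a real file, pass an open file handle instead of the sample list, for example `scan_alert_log(open(path))`, where `path` points at your own alert log.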
orcl_lms1_12379_i1104178.trc
Oracle Database 11g Enterprise Edition Release 11.2.0.2.0 - 64bit Production
With the Partitioning, Real Application Clusters, OLAP, Data Mining
and Real Application Testing options
ORACLE_HOME = /oracle/app/oracle/product/11.2.0/dbhome_1
System name: HP-UX
Node name: h7sd05da
Release: B.11.31
Version: U
Machine: ia64
Instance name: orcl
Redo thread mounted by this instance: 1
Oracle process number: 14
Unix process pid: 12379, image: oracleh7sd05da (LMS1)
Dump continued from file: /oracle/app/oracle/diag/rdbms/cdrdb/orcl/trace/orcl_lms1_12379.trc
ORA-00600: internal error code, arguments: [kjctr_pbmsg:badbmsg2], [0x9FFFFFFFFC996B58], [0x9FFFFFFFFC9976B8], [], [], [], [], [], [], [], [], []
========= Dump for incident 1104178 (ORA 600 [kjctr_pbmsg:badbmsg2]) ========
*** 2014-08-11 23:53:10.339
dbkedDefDump(): Starting incident default dumps (flags=0x2, level=3, mask=0x0)
----- SQL Statement (None) -----
Current SQL information unavailable - no cursor.
----- Call Stack Trace -----
skdstdst <- ksedst <- dbkedDefDump <- ksedmp <- ksfdmp
<- $cold_dbgexPhaseII <- dbgexProcessError <- dbgeExecuteForError <- dbgePostErrorKGE <- 2352
<- dbkePostKGE_kgsf <- 128 <- kgeadse <- kgerinv_internal <- kgerinv
<- kgeasnmierr <- kjctr_pbmsg <- kjctr_rksxp <- kjctrcv <- kjcsrmg
<- kjmsm <- ksbrdp <- opirip <- opidrv <- sou2o
<- opimai_real <- ssthrdmain <- main <- main_opd_entry
--------------------- Binary Stack Dump ---------------------
2. Check the patch information. The current version is 11.2.0.2.1:
$ opatch lsinventory
Installed Top-level Products (1):
Oracle Database 11g 11.2.0.2.0
Patch 10248523 : applied on Fri Mar 25 09:33:02 GMT+08:00 2011
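If you need to compare patch levels across both nodes, the patch lines can be extracted from saved `opatch lsinventory` output. This is only a sketch, assuming the patch lines follow the `Patch <number> : applied on <date>` shape shown above; `applied_patches` is a hypothetical helper name.

```python
import re

# Matches lines such as "Patch 10248523 : applied on Fri Mar 25 ...".
PATCH_RE = re.compile(r"^Patch\s+(\d+)\s*:\s*applied on (.+)$")


def applied_patches(lines):
    """Return (patch_number, applied_on) pairs found in lsinventory output."""
    found = []
    for line in lines:
        match = PATCH_RE.match(line.strip())
        if match:
            found.append((match.group(1), match.group(2)))
    return found


# Lines copied from the output above.
inventory = [
    "Installed Top-level Products (1):",
    "Oracle Database 11g 11.2.0.2.0",
    "Patch 10248523 : applied on Fri Mar 25 09:33:02 GMT+08:00 2011",
]
patches = applied_patches(inventory)
print(patches)
```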
3. Search for documents and bugs related to this error. The relevant bugs and their descriptions are listed below:
Bug 18015296 : ORA-600 [KJCTR_PBMSG:BADBMSG2] in 11.2.0.3
The assert is triggered because the batch message is invalid/corrupt. This looks like some form of underlying infrastructure/network issue. Please work with the customer to have this checked and tested.
Bug 18771858 : LMS0 TERMINATING THE INSTANCE DUE TO ERROR 484 (ORA-00600 [KJCTR_PBMSG:BADBMSG2]) in 11.2.0.3
Two earlier bugs, bug 16240464 and bug 18015296, were both closed by dev as not a product defect.
It was suggested that the problem was outside the Oracle stack, at the network level. So please check with the customer along the same lines to identify network problems (if any), with help from their OS/network support. Refer to Doc ID 563566.1, Troubleshooting gc block lost and Poor Network Performance in a RAC Environment.
Bug 16240464 : INSTANCE CRASH WITH ORA-00600 [KJCTR_PBMSG:BADBMSG2] in 11.2.0.3
This looks like some form of underlying infrastructure/network issue, please work with customer to have this checked and tested.
Bug 17452853 : LNX64-12.1-EF,DB INST CRASH WITH LMS4 HIT ORA-600 [KJCTR_PBMSG:BADBMSG2] in 12.1.0.2
Bug 17049773 : Diagnostic enhancement to give additional parameter in error ORA-600 [kjctr_pbmsg:badbmsg2] in 12.1.0.1
Note: This fix will not address the root cause of the error but the additional information may help with diagnosis of the cause.
Bug 13917456 : LNX64-12.1-UD: ASM LMD HIT ORA-00600 KJCTR_PBMSG:BADBMSG2 IN NON-UPGRADED NODES in 12.1.0.0.2
It may occur during the upgrade from 11.2.0.3 to 12.1. Not related to this SR.
4. At this point, I needed to check the AWR reports, the OSWatcher data, and all LMS, LMD, LMON, LMHB and DIAG trace files from the time of the problem, to see whether they recorded anything more.
I also used cluvfy and ORAchk to check the overall RAC environment.
--. AWR report 22:00~23:00 on Aug 11 from both nodes.
--. Deploy OSWatcher, then collect the current OS information while the database workload is high.
--. All LMS, LMD, LMON, LMHB and DIAG trace files from both nodes.
--. CVU output:
cluvfy stage -pre crsinst -n <node1,node2> -verbose
--. Please run ORAchk as root.
ORAchk - Health Checks for the Oracle Stack (Doc ID 1268927.2)
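5. While checking the AWR reports, I found "gc blocks lost". In theory this statistic should not show up when the private network is healthy, so its presence is a strong sign that the private network is unstable.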
awrrpt_2_29557_29558.html
             Snap Id  Snap Time           Sessions  Cursors/Session
Begin Snap:  29557    11-Aug-14 22:00:45  563       1.3
End Snap:    29558    11-Aug-14 23:01:00  551       1.3
Elapsed:     60.24 (mins)
DB Time:     4,835.90 (mins)
Top 5 Timed Foreground Events
Event                      Waits      Time(s)  Avg wait (ms)  % DB time  Wait Class
db file sequential read    6,269,185  185,621  30             63.97      User I/O
DB CPU                                42,433                  14.62
gc current grant 2-way     3,251,636  25,671   8              8.85       Cluster
db file scattered read     550,524    9,873    18             3.40       User I/O
gc cr multi block request  637,442    6,790    11             2.34       Cluster
Instance Activity Stats
Statistic       Total  per Second  per Trans
gc blocks lost  269    0.07        0.01       <<<<<<<<<<<<
awrrpt_1_29557_29558.html
             Snap Id  Snap Time           Sessions  Cursors/Session
Begin Snap:  29557    11-Aug-14 22:00:44  2470      1.0
End Snap:    29558    11-Aug-14 23:00:59  2500      1.0
Elapsed:     60.25 (mins)
DB Time:     4,549.47 (mins)
Top 5 Timed Foreground Events
Event                      Waits      Time(s)  Avg wait (ms)  % DB time  Wait Class
db file sequential read    8,180,795  154,504  19             56.60      User I/O
DB CPU                                44,994                  16.48
gc current grant 2-way     3,699,003  29,357   8              10.75      Cluster
db file scattered read     677,065    10,190   15             3.73       User I/O
gc cr multi block request  718,327    7,856    11             2.88       Cluster
Instance Activity Stats
Statistic       Total  per Second  per Trans
gc blocks lost  410    0.11        0.01       <<<<<<<<<<<<
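The "per Second" column can be cross-checked from the totals and the elapsed time of each report. This is a small arithmetic sketch using only the figures quoted above; the helper name `per_second` is my own.

```python
def per_second(total, elapsed_minutes):
    """Average rate per second over an AWR interval given in minutes."""
    return total / (elapsed_minutes * 60.0)


# (gc blocks lost total, elapsed minutes) taken from the two AWR excerpts.
reports = {
    "awrrpt_2_29557_29558": (269, 60.24),
    "awrrpt_1_29557_29558": (410, 60.25),
}

for name, (lost, minutes) in reports.items():
    print(f"{name}: {per_second(lost, minutes):.2f} gc blocks lost per second")
```

The results round to 0.07 and 0.11, which matches the reports. The rates look small, but the point of this statistic is that it is non-zero at all.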
6. This finding makes a private network problem even more likely. The final conclusion is as follows:
Bugs 16240464 and 18015296 were raised for similar issues, and both were closed as "Vendor OS Problem".
The bugs confirm that this issue is caused by logical block corruption during network transfer over the interconnect, or by an infrastructure issue.
The ORA-00600 [kjctr_pbmsg:badbmsg2] error is purely the result of an unstable network.
The AWR reports confirm that blocks were being lost ("gc blocks lost") during the problem time frame. This is evidence that the network is either saturated or corrupting packets.
Please involve the OS team and the network team to identify the root cause of the issue. The note below will be helpful for the network issue:
Troubleshooting gc block lost and Poor Network Performance in a RAC Environment (Doc ID 563566.1)
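7. The handling of this problem still lacks one stronger piece of evidence: OSWatcher logs. OSWatcher logs from the time of the problem would have exposed the private network issue much more clearly. Both the "gc blocks lost" statistic and the ORA-00600 [kjctr_pbmsg:badbmsg2] error found during the analysis were reported from the Oracle Database side, which is not enough to convince the OS engineers. If the OSWatcher logs had recorded TCP and UDP packet loss at that time, the problem would be clearer and the responsibility easier to assign.
For installing and using OSWatcher, see: OSWatcher (Doc ID 301137.1)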
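Once OS-level network counters are being captured at intervals, the useful question is whether the error counters grow between two snapshots. The sketch below shows one way to compare them. The counter names and all values are hypothetical placeholders, not taken from this incident or from any particular `netstat` format; map them to whatever your platform actually reports.

```python
# Hypothetical names for OS-level error counters worth watching.
ERROR_COUNTERS = ("udp_socket_overflows", "ip_reassembly_failures")


def growing_error_counters(before, after, names=ERROR_COUNTERS):
    """Return the error counters that increased between two snapshots."""
    deltas = {}
    for name in names:
        if name in before and name in after:
            delta = after[name] - before[name]
            if delta > 0:
                deltas[name] = delta
    return deltas


# Illustrative values only (made up for this example).
earlier = {"udp_socket_overflows": 120, "ip_reassembly_failures": 15}
later = {"udp_socket_overflows": 480, "ip_reassembly_failures": 15}

growth = growing_error_counters(earlier, later)
print(growth)
```

A counter that keeps growing while "gc blocks lost" also increases is the kind of OS-side evidence that was missing from this analysis.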