這裡簡單記錄一下,此次國慶加班恢複的某客戶的2套Oracle RAC資料庫,整個恢複過程中,2套rac差不多,因此這裡以其中一套資料庫的恢複過程為例進行簡單分析說明。資料庫由於為非歸檔,由於掉電導致重啟之後系統無法正常open。
在正常open的過程中,報錯如下錯誤:
SQL> alter database open;
alter database open
*
ERROR at line 1:
ORA-00600: internal error code, arguments: [kcratr1_lastbwr], [], [], [], [],
[], [], []
SQL> shutdown immediate
ORA-01109: database not open
Database dismounted.
ORACLE instance shut down.
SQL> startup
ORACLE instance started.
Total System Global Area 4294967296 bytes
Fixed Size 2089576 bytes
Variable Size 2751466904 bytes
Database Buffers 1526726656 bytes
Redo Buffers 14684160 bytes
Database mounted.
ORA-00600: internal error code, arguments: [kcratr1_lastbwr], [], [], [], [],
[], [], []
對於該錯誤,網上的解決方案也很多,可惜都不管用。這種情況之下,往往都是需要強制開啟資料庫的,首先需要做一個不完全恢複,如下:
SQL> recover database
ORA-00279: change 236912204 generated at 09/29/2015 12:49:13 needed for thread
1
ORA-00289: suggestion : /xxxx/1_5112_877094801.dbf
ORA-00280: change 236912204 for thread 1 is in sequence #5112
Specify log: {<RET>=suggested | filename | AUTO | CANCEL}
auto
ORA-00308: cannot open archived log '/xxxx/1_5112_877094801.dbf'
ORA-27037: unable to obtain file status
IBM AIX RISC System/6000 Error: 2: No such file or directory
Additional information: 3
在進行相關操作之後,我備份了一下當前的控制檔案資訊,便於後面如果有問題,方便處理。強制open的過程中,發現報如下錯誤:
Sat Oct 3 11:49:31 2015
Instance recovery: looking for dead threads
Instance recovery: lock domain invalid but no dead threads
Sat Oct 3 11:49:33 2015
Errors in file /oracle/oracle/admin/cwdb/udump/cwdb1_ora_6029586.trc:
ORA-00600: internal error code, arguments: [kclchkblk_4], [1], [18446744072394632417], [1], [18446744072392296306], [], [], []
Sat Oct 3 11:49:34 2015
Errors in file /oracle/oracle/admin/xxxx/udump/xxxx1_ora_6029586.trc:
ORA-00600: internal error code, arguments: [kclchkblk_4], [1], [18446744072394632417], [1], [18446744072392296306], [], [], []
Sat Oct 3 11:49:34 2015
Error 600 happened during db open, shutting down database
USER: terminating instance due to error 600
Instance terminated by USER, pid = 6029586
這個錯誤已經處理過多次了。同樣,百度一下,會發現很多人都寫過相關的文章,包括Oracle mos的文章解釋也是說這是臨時塊的scn過大導致,通過drop tempfile即可繞過該問題。實際上,這種情況之下,根本不會起作用。
但是不管如何,這個問題很明顯都是跟block的scn有關係。既然是跟scn有關係,那麼處理就不難了,通過推進scn即可。
通過推進scn 之後,再次open resetlogs成功開啟資料庫,可惜的是alert log報了一堆錯誤,如下所示:
Sat Oct 3 13:10:34 2015
Errors in file /oracle/oracle/admin/xxxx/bdump/xxxx1_smon_10420246.trc:
ORA-00600: internal error code, arguments: [4137], [], [], [], [], [], [], []
Opening with internal Resource Manager plan
where NUMA PG = 1, CPUs = 40
Sat Oct 3 13:10:35 2015
ORACLE Instance xxxx1 (pid = 25) - Error 600 encountered while recovering transaction (23, 85).
Sat Oct 3 13:10:35 2015
Errors in file /oracle/oracle/admin/xxxx/bdump/xxxx1_smon_10420246.trc:
ORA-00600: internal error code, arguments: [4137], [], [], [], [], [], [], []
Sat Oct 3 13:10:35 2015
Trace dumping is performing id=[cdmp_20151003131035]
Sat Oct 3 13:10:35 2015
replication_dependency_tracking turned off (no async multimaster replication found)
Sat Oct 3 13:10:36 2015
Instance recovery: looking for dead threads
Instance recovery: lock domain invalid but no dead threads
Sat Oct 3 13:10:36 2015
Errors in file /oracle/oracle/admin/xxxx/bdump/xxxx1_smon_10420246.trc:
ORA-00600: internal error code, arguments: [4137], [], [], [], [], [], [], []
Sat Oct 3 13:10:37 2015
Starting background process QMNC
Sat Oct 3 13:10:37 2015
ORACLE Instance xxxx1 (pid = 25) - Error 600 encountered while recovering transaction (23, 85).
Sat Oct 3 13:10:37 2015
Errors in file /oracle/oracle/admin/xxxx/bdump/xxxx1_smon_10420246.trc:
ORA-00600: internal error code, arguments: [4137], [], [], [], [], [], [], []
QMNC started with pid=53, OS id=7536816
Sat Oct 3 13:10:41 2015
LOGSTDBY: Validating controlfile with logical metadata
Sat Oct 3 13:10:41 2015
LOGSTDBY: Validation complete
Sat Oct 3 13:10:46 2015
Errors in file /oracle/oracle/admin/xxxx/bdump/xxxx1_mmon_9110004.trc:
ORA-00600: internal error code, arguments: [qertbFetchByRowID], [], [], [], [], [], [], []
Sat Oct 3 13:10:48 2015
Errors in file /oracle/oracle/admin/xxxx/udump/xxxx1_ora_6619434.trc:
ORA-00600: internal error code, arguments: [kdsgrp1], [], [], [], [], [], [], []
Completed: alter database open resetlogs
這部分錯誤處理其實都不難。對於第一個ora-00600 [4137] 錯誤,很明顯這是跟undo有關係的,其中(23,85)中的23表現第23號復原段;通過屏蔽第23號復原段可以很容易解決該錯誤,當然,這會兒導致事務的不一致性,這是沒辦法的,已經undo異常,Oracle 已經沒有辦法進行正常的事務恢複了。
其次,對於第2個ora-00600 [qertbFetchByRowID] 錯誤,處理也很簡單,其大致意思是通過rowid訪問擷取資料有異常,很明顯這是跟index有關係,通過重建index 可以解決該問題,其次最後一個[kdsgrp1] 錯誤就更常見了,通常也是Index的問題,重建即可。
看上去一切的恢複過程都很簡單,很順利,然而這裡真正的難題,真正的問題才開始。
也就是最後一個看似很簡單的錯誤ora-00600 [kdsgrp1]錯誤,對我們產生了極大的困難。首先我們來看下產生該錯誤時涉及到那些對象:
Validate domain 0
Validated domain 0, flags = 0x0
kwqmnich: current time:: 13: 31: 34
kwqmnich: instance no 0 check_only flag 1
kwqmnich: initialized job cache structure
row 0041edda.2e continuation at
file# 1 block# 126426 slot 47 not found
**************************************************
KDSTABN_GET: 0 ..... ntab: 1
curSlot: 47 ..... nrows: 175
**************************************************
*** 2015-10-03 13:31:40.864
ksedmp: internal or fatal error
ORA-00600: internal error code, arguments: [kdsgrp1], [], [], [], [], [], [], []
Current SQL statement for this session:
SELECT OWNER,NAME,TYPE,COUNT(*) FROM DBA_SOURCE WHERE SUBSTR(OWNER,1,4)='FMIS' GROUP BY OWNER,NAME,TYPE HAVING COUNT(*)>1000
----- PL/SQL Call Stack -----
Object id on Block? Y
seg/obj: 0x12 csc: 0x01.b1957474 itc: 3 flg: O typ: 1 - DATA
fsl: 0 fnx: 0x0 ver: 0x01
Itl Xid Uba Flag Lck Scn/Fsc
0x01 0x0009.00e.0003968d 0x0c07f50f.1ea7.5d --U- 1 fsc 0x0053.b1957475
0x02 0x0017.005.0006755c 0x0c0774de.1be3.38 C--- 0 scn 0x0001.b1957451
0x03 0x0002.042.00010d39 0x0080b268.0cf1.22 C--- 0 scn 0x0001.b1957429
data_block_dump,data header at 0x110f46074
===============
tsiz: 0x3f88
hsiz: 0x170
pbl: 0x110f46074
bdba: 0x0041edda
76543210
flag=--------
ntab=1
nrow=175
frre=0
fsbo=0x170
fseo=0x783
avsp=0x3cbe
tosp=0x3d13
0xe:pti[0] nrow=175 offs=0
0x12:pri[0] sfll=1
0x14:pri[1] sfll=2
我們可以發現,除開其他的非核心對象之後,這裡還涉及到一個obj#=18,也就是obj$ 這個核心的資料字典表。而該資料字典表上的幾個Index,
i_obj1,i_obj2,i_obj3 都是object_id 小於57的核心對象,這部分對象是屬於bootstrap$ 的核心資料字典對象。即是Index也無法通過
rebuild,38003 event或在upgrade模式下進行重建。
當然,這裡也不是說完全無法去重建上述資料字典表,我後面有一篇文章會相信講解如何去重建。
在分析過程中,我發現其中的前面2個Index都有問題,如下:
SQL> analyze table obj$ validate structure;
Table analyzed.
SQL> select index_name from dba_indexes where table_name='OBJ$';
INDEX_NAME
------------------------------
I_OBJ1
I_OBJ2
I_OBJ3
SQL> analyze index I_OBJ1 validate structure;
analyze index I_OBJ1 validate structure
*
ERROR at line 1:
ORA-08100: index is not valid - see trace file for diagnostics
SQL> analyze index I_OBJ2 validate structure;
analyze index I_OBJ2 validate structure
*
ERROR at line 1:
ORA-08100: index is not valid - see trace file for diagnostics
SQL> analyze index I_OBJ3 validate structure;
Index analyzed.
不過這裡我們也要注意的時,雖然前面2個index都有問題,然而上述錯誤產生時涉及到的index並不是2個都用到了,其實只是用到了第一個index就
報錯ora-00600錯誤了。由於客戶想通過expdp schema的方式去匯出資料,然而發現執行時報錯ora-00600 [kdsgrp1],包括exp執行時也報該
錯誤,不過exp tables的方式,不會報錯;由於對象太多,將近50萬個對象(包括表,index以及其他)。很明顯,只能通過使用者層級的匯出。
那麼也就在意味著我們必須修複這個錯誤才。
通過dump 相關的block,我們發現錯誤是很奇怪的,如下:
*** SESSION ID:(1052.1243) 2015-10-03 17:56:59.784
Block Checking: DBA = 4684328, Block Type = KTB-managed data block
*** previous block dba dba: 477a2adoes not match my previous block dba dba: 477a26
**** row 0: key out of order
---- end index block validation
---- for block 4684328 = 0x00477a28
Block header dump: 0x00477a28
Object id on Block? Y
seg/obj: 0x24 csc: 0x02.80026908 itc: 19 flg: O typ: 2 - INDEX
fsl: 0 fnx: 0x477a1c ver: 0x01
Itl Xid Uba Flag Lck Scn/Fsc
0x01 0x0015.01f.00020a0a 0x00867c38.0862.04 C--- 0 scn 0x0001.83f32656
0x02 0x002e.044.00002415 0x0082989a.0477.52 C--- 0 scn 0x0001.74d90109
0x03 0x002c.00a.000012c5 0x0b804b78.012b.70 C--- 0 scn 0x0001.74d9011c
0x04 0x0002.001.00003a72 0x00861aab.024b.0a C--- 0 scn 0x0001.74fa8cf1
0x05 0x002f.02e.000065eb 0x0b816866.0217.5b C--- 0 scn 0x0001.74fb0205
0x06 0x0026.011.00002000 0x0b814f02.016d.14 C--- 0 scn 0x0001.74fc3dbf
0x07 0x0006.011.00003b1c 0x0084c65b.01f9.10 C--- 0 scn 0x0001.74fde478
0x08 0x0029.05f.00003258 0x0084faa6.017c.2f C--- 0 scn 0x0001.74fdedad
0x09 0x001b.023.00003fe6 0x008a8718.025b.3b C--- 0 scn 0x0001.74fdfedc
0x0a 0x001b.006.00004057 0x0b804f24.025c.33 C--- 0 scn 0x0001.751a25ed
0x0b 0x001b.02a.00004039 0x0b804f24.025c.49 C--- 0 scn 0x0001.751a2609
0x0c 0x001b.024.0000404f 0x0b804f24.025c.63 C--- 0 scn 0x0001.751a2627
0x0d 0x000a.018.00021b0b 0x0b821c8b.04cc.45 C--- 0 scn 0x0001.751a263f
0x0e 0x000a.055.00021b0a 0x0b821c8b.04cc.5e C--- 0 scn 0x0001.751a265c
0x0f 0x000a.031.00021aec 0x0b821c8c.04cc.09 C--- 0 scn 0x0001.751a2678
0x10 0x0022.05e.00001dbd 0x008a88df.01e0.2e C--- 0 scn 0x0001.74d8ff09
0x11 0x0025.010.00001ead 0x00860446.01ac.20 C--- 0 scn 0x0001.74d8ff7d
0x12 0x002b.043.00001228 0x0084dc65.0164.11 C--- 0 scn 0x0001.74d8ffe2
0x13 0x001f.015.00001de8 0x0b81a76a.0183.4e C--- 0 scn 0x0001.74d90071
Leaf block dump
===============
header address 504403185228956148=0x70000064725c1f4
kdxcolev 0
KDXCOLEV Flags = - - -
kdxcolok 0
kdxcoopc 0x80: opcode=0: iot flags=--- is converted=Y
kdxconco 1
kdxcosdc 2
kdxconro 0
kdxcofbo 36=0x24
kdxcofeo 15816=0x3dc8
kdxcoavs 15780
kdxlespl 0
kdxlende 0
kdxlenxt 4684330=0x477a2a
kdxleprv 4684326=0x477a26
kdxledsz 6
kdxlebksz 15816
*** dummy key ***
row#0[9549] flag: ---D--, lock: 0, len=14, data:(6): 00 44 03 10 00 7c
col 0; len 5; (5): c4 04 24 25 2f
----- end of leaf block dump -----
dumping parent of corrupted subtree, row # = 1202
dumping parent of corrupted subtree, row # = 3
對於上述這個index 的錯誤,我是第一次遇見,跟老熊討論了一下,他認為可能是index split情況之下出現的。在遠程時,我也用過bbed
對前後將近10個index block進行了分析,通過比較index 的鏈表,發現確實不匹配。
對於這種情況之下,想通過bbed去修複 index,難度可想而知,因此果斷放棄這種方式。最後無奈之下,只能通過處理資料字典表
的方式來處理掉i_obj1,i_obj2 這2個index。最後再讓客戶進行exp 使用者層級的匯出,只不過這個匯出的時間比較漫長了。