oracle資料庫掉電導致系統崩潰的恢複過程

最後更新：2017-01-13 來源：互聯網

上載者：User

創建阿里雲帳戶，並獲得超過 40 款產品的免費試用版；而企業帳戶則可以享有總值 $1200 的免費試用版。立即註冊！

這裡簡單記錄一下，此次國慶加班恢複的某客戶的2套Oracle RAC資料庫，整個恢複過程中，2套rac差不多，因此這裡以其中一套資料庫的恢複過程為例進行簡單分析說明。資料庫由於為非歸檔，由於掉電導致重啟之後系統無法正常open。
在正常open的過程中，報錯如下錯誤：

SQL> alter database open;
alter database open
*
ERROR at line 1:
ORA-00600: internal error code, arguments: [kcratr1_lastbwr], [], [], [], [],
[], [], []

SQL> shutdown immediate
ORA-01109: database not open

Database dismounted.
ORACLE instance shut down.
SQL> startup
ORACLE instance started.

Total System Global Area 4294967296 bytes
Fixed Size                  2089576 bytes
Variable Size            2751466904 bytes
Database Buffers         1526726656 bytes
Redo Buffers               14684160 bytes
Database mounted.
ORA-00600: internal error code, arguments: [kcratr1_lastbwr], [], [], [], [],
[], [], []

對於該錯誤，網上的解決方案也很多，可惜都不管用。這種情況之下，往往都是需要強制開啟資料庫的，首先需要做一個不完全恢複，如下：

SQL> recover database
ORA-00279: change 236912204 generated at 09/29/2015 12:49:13 needed for thread
1
ORA-00289: suggestion : /xxxx/1_5112_877094801.dbf
ORA-00280: change 236912204 for thread 1 is in sequence #5112

Specify log: {<RET>=suggested | filename | AUTO | CANCEL}
auto
ORA-00308: cannot open archived log '/xxxx/1_5112_877094801.dbf'
ORA-27037: unable to obtain file status
IBM AIX RISC System/6000 Error: 2: No such file or directory
Additional information: 3

在進行相關操作之後，我備份了一下當前的控制檔案資訊，便於後面如果有問題，方便處理。強制open的過程中，發現報如下錯誤：

Sat Oct 3 11:49:31 2015
Instance recovery: looking for dead threads
Instance recovery: lock domain invalid but no dead threads
Sat Oct 3 11:49:33 2015
Errors in file /oracle/oracle/admin/cwdb/udump/cwdb1_ora_6029586.trc:
ORA-00600: internal error code, arguments: [kclchkblk_4], [1], [18446744072394632417], [1], [18446744072392296306], [], [], []
Sat Oct 3 11:49:34 2015
Errors in file /oracle/oracle/admin/xxxx/udump/xxxx1_ora_6029586.trc:
ORA-00600: internal error code, arguments: [kclchkblk_4], [1], [18446744072394632417], [1], [18446744072392296306], [], [], []
Sat Oct 3 11:49:34 2015
Error 600 happened during db open, shutting down database
USER: terminating instance due to error 600
Instance terminated by USER, pid = 6029586
這個錯誤已經處理過多次了。同樣，百度一下，會發現很多人都寫過相關的文章，包括Oracle mos的文章解釋也是說這是臨時塊的scn過大導致，通過drop tempfile即可繞過該問題。實際上，這種情況之下，根本不會起作用。
但是不管如何，這個問題很明顯都是跟block的scn有關係。既然是跟scn有關係，那麼處理就不難了，通過推進scn即可。
通過推進scn 之後，再次open resetlogs成功開啟資料庫，可惜的是alert log報了一堆錯誤，如下所示：

Sat Oct 3 13:10:34 2015
Errors in file /oracle/oracle/admin/xxxx/bdump/xxxx1_smon_10420246.trc:
ORA-00600: internal error code, arguments: [4137], [], [], [], [], [], [], []
Opening with internal Resource Manager plan
where NUMA PG = 1, CPUs = 40
Sat Oct 3 13:10:35 2015
ORACLE Instance xxxx1 (pid = 25) - Error 600 encountered while recovering transaction (23, 85).
Sat Oct 3 13:10:35 2015
Errors in file /oracle/oracle/admin/xxxx/bdump/xxxx1_smon_10420246.trc:
ORA-00600: internal error code, arguments: [4137], [], [], [], [], [], [], []
Sat Oct 3 13:10:35 2015
Trace dumping is performing id=[cdmp_20151003131035]
Sat Oct 3 13:10:35 2015
replication_dependency_tracking turned off (no async multimaster replication found)
Sat Oct 3 13:10:36 2015
Instance recovery: looking for dead threads
Instance recovery: lock domain invalid but no dead threads
Sat Oct 3 13:10:36 2015
Errors in file /oracle/oracle/admin/xxxx/bdump/xxxx1_smon_10420246.trc:
ORA-00600: internal error code, arguments: [4137], [], [], [], [], [], [], []
Sat Oct 3 13:10:37 2015
Starting background process QMNC
Sat Oct 3 13:10:37 2015
ORACLE Instance xxxx1 (pid = 25) - Error 600 encountered while recovering transaction (23, 85).
Sat Oct 3 13:10:37 2015
Errors in file /oracle/oracle/admin/xxxx/bdump/xxxx1_smon_10420246.trc:
ORA-00600: internal error code, arguments: [4137], [], [], [], [], [], [], []
QMNC started with pid=53, OS id=7536816
Sat Oct 3 13:10:41 2015
LOGSTDBY: Validating controlfile with logical metadata
Sat Oct 3 13:10:41 2015
LOGSTDBY: Validation complete
Sat Oct 3 13:10:46 2015
Errors in file /oracle/oracle/admin/xxxx/bdump/xxxx1_mmon_9110004.trc:
ORA-00600: internal error code, arguments: [qertbFetchByRowID], [], [], [], [], [], [], []
Sat Oct 3 13:10:48 2015
Errors in file /oracle/oracle/admin/xxxx/udump/xxxx1_ora_6619434.trc:
ORA-00600: internal error code, arguments: [kdsgrp1], [], [], [], [], [], [], []
Completed: alter database open resetlogs
這部分錯誤處理其實都不難。對於第一個ora-00600 [4137] 錯誤，很明顯這是跟undo有關係的，其中(23,85)中的23表現第23號復原段；通過屏蔽第23號復原段可以很容易解決該錯誤，當然，這會兒導致事務的不一致性，這是沒辦法的，已經undo異常，Oracle 已經沒有辦法進行正常的事務恢複了。
其次，對於第2個ora-00600 [qertbFetchByRowID] 錯誤，處理也很簡單，其大致意思是通過rowid訪問擷取資料有異常，很明顯這是跟index有關係，通過重建index 可以解決該問題，其次最後一個[kdsgrp1] 錯誤就更常見了，通常也是Index的問題，重建即可。
看上去一切的恢複過程都很簡單，很順利，然而這裡真正的難題，真正的問題才開始。
也就是最後一個看似很簡單的錯誤ora-00600 [kdsgrp1]錯誤，對我們產生了極大的困難。首先我們來看下產生該錯誤時涉及到那些對象：

Validate domain 0
Validated domain 0, flags = 0x0
kwqmnich: current time:: 13: 31: 34
kwqmnich: instance no 0 check_only flag 1
kwqmnich: initialized job cache structure
            row 0041edda.2e continuation at
            file# 1 block# 126426 slot 47 not found
**************************************************
KDSTABN_GET: 0 ..... ntab: 1
curSlot: 47 ..... nrows: 175
**************************************************
*** 2015-10-03 13:31:40.864
ksedmp: internal or fatal error
ORA-00600: internal error code, arguments: [kdsgrp1], [], [], [], [], [], [], []
Current SQL statement for this session:
SELECT OWNER,NAME,TYPE,COUNT(*) FROM DBA_SOURCE WHERE SUBSTR(OWNER,1,4)='FMIS' GROUP BY OWNER,NAME,TYPE HAVING COUNT(*)>1000
----- PL/SQL Call Stack -----

Object id on Block? Y
seg/obj: 0x12 csc: 0x01.b1957474 itc: 3 flg: O typ: 1 - DATA
     fsl: 0 fnx: 0x0 ver: 0x01

Itl           Xid                  Uba         Flag Lck        Scn/Fsc
0x01   0x0009.00e.0003968d 0x0c07f50f.1ea7.5d --U-    1 fsc 0x0053.b1957475
0x02   0x0017.005.0006755c 0x0c0774de.1be3.38 C---    0 scn 0x0001.b1957451
0x03   0x0002.042.00010d39 0x0080b268.0cf1.22 C---    0 scn 0x0001.b1957429

data_block_dump,data header at 0x110f46074
===============
tsiz: 0x3f88
hsiz: 0x170
pbl: 0x110f46074
bdba: 0x0041edda
     76543210
flag=--------
ntab=1
nrow=175
frre=0
fsbo=0x170
fseo=0x783
avsp=0x3cbe
tosp=0x3d13
0xe:pti[0]      nrow=175        offs=0
0x12:pri[0]     sfll=1
0x14:pri[1]     sfll=2

我們可以發現，除開其他的非核心對象之後，這裡還涉及到一個obj#=18，也就是obj$ 這個核心的資料字典表。而該資料字典表上的幾個Index，
i_obj1，i_obj2，i_obj3 都是object_id 小於57的核心對象，這部分對象是屬於bootstrap$ 的核心資料字典對象。即是Index也無法通過
rebuild，38003 event或在upgrade模式下進行重建。
當然，這裡也不是說完全無法去重建上述資料字典表，我後面有一篇文章會相信講解如何去重建。
在分析過程中，我發現其中的前面2個Index都有問題，如下：

SQL> analyze table obj$ validate structure;

Table analyzed.

SQL> select index_name from dba_indexes where table_name='OBJ$';

INDEX_NAME
------------------------------
I_OBJ1
I_OBJ2
I_OBJ3

SQL> analyze index I_OBJ1 validate structure;
analyze index I_OBJ1 validate structure
*
ERROR at line 1:
ORA-08100: index is not valid - see trace file for diagnostics

SQL> analyze index I_OBJ2 validate structure;
analyze index I_OBJ2 validate structure
*
ERROR at line 1:
ORA-08100: index is not valid - see trace file for diagnostics

SQL> analyze index I_OBJ3 validate structure;

Index analyzed.

不過這裡我們也要注意的時，雖然前面2個index都有問題，然而上述錯誤產生時涉及到的index並不是2個都用到了，其實只是用到了第一個index就
報錯ora-00600錯誤了。由於客戶想通過expdp schema的方式去匯出資料，然而發現執行時報錯ora-00600 [kdsgrp1],包括exp執行時也報該
錯誤，不過exp tables的方式，不會報錯；由於對象太多，將近50萬個對象（包括表，index以及其他）。很明顯，只能通過使用者層級的匯出。
那麼也就在意味著我們必須修複這個錯誤才。
通過dump 相關的block，我們發現錯誤是很奇怪的，如下：

*** SESSION ID:(1052.1243) 2015-10-03 17:56:59.784
Block Checking: DBA = 4684328, Block Type = KTB-managed data block
*** previous block dba dba: 477a2adoes not match my previous block dba dba: 477a26
**** row 0: key out of order
---- end index block validation
---- for block 4684328 = 0x00477a28
Block header dump: 0x00477a28
Object id on Block? Y
seg/obj: 0x24 csc: 0x02.80026908 itc: 19 flg: O typ: 2 - INDEX
     fsl: 0 fnx: 0x477a1c ver: 0x01

Itl           Xid                  Uba         Flag Lck        Scn/Fsc
0x01   0x0015.01f.00020a0a 0x00867c38.0862.04 C---    0 scn 0x0001.83f32656
0x02   0x002e.044.00002415 0x0082989a.0477.52 C---    0 scn 0x0001.74d90109
0x03   0x002c.00a.000012c5 0x0b804b78.012b.70 C---    0 scn 0x0001.74d9011c
0x04   0x0002.001.00003a72 0x00861aab.024b.0a C---    0 scn 0x0001.74fa8cf1
0x05   0x002f.02e.000065eb 0x0b816866.0217.5b C---    0 scn 0x0001.74fb0205
0x06   0x0026.011.00002000 0x0b814f02.016d.14 C---    0 scn 0x0001.74fc3dbf
0x07   0x0006.011.00003b1c 0x0084c65b.01f9.10 C---    0 scn 0x0001.74fde478
0x08   0x0029.05f.00003258 0x0084faa6.017c.2f C---    0 scn 0x0001.74fdedad
0x09   0x001b.023.00003fe6 0x008a8718.025b.3b C---    0 scn 0x0001.74fdfedc
0x0a   0x001b.006.00004057 0x0b804f24.025c.33 C---    0 scn 0x0001.751a25ed
0x0b   0x001b.02a.00004039 0x0b804f24.025c.49 C---    0 scn 0x0001.751a2609
0x0c   0x001b.024.0000404f 0x0b804f24.025c.63 C---    0 scn 0x0001.751a2627
0x0d   0x000a.018.00021b0b 0x0b821c8b.04cc.45 C---    0 scn 0x0001.751a263f
0x0e   0x000a.055.00021b0a 0x0b821c8b.04cc.5e C---    0 scn 0x0001.751a265c
0x0f   0x000a.031.00021aec 0x0b821c8c.04cc.09 C---    0 scn 0x0001.751a2678
0x10   0x0022.05e.00001dbd 0x008a88df.01e0.2e C---    0 scn 0x0001.74d8ff09
0x11   0x0025.010.00001ead 0x00860446.01ac.20 C---    0 scn 0x0001.74d8ff7d
0x12   0x002b.043.00001228 0x0084dc65.0164.11 C---    0 scn 0x0001.74d8ffe2
0x13   0x001f.015.00001de8 0x0b81a76a.0183.4e C---    0 scn 0x0001.74d90071

Leaf block dump
===============
header address 504403185228956148=0x70000064725c1f4
kdxcolev 0
KDXCOLEV Flags = - - -
kdxcolok 0
kdxcoopc 0x80: opcode=0: iot flags=--- is converted=Y
kdxconco 1
kdxcosdc 2
kdxconro 0
kdxcofbo 36=0x24
kdxcofeo 15816=0x3dc8
kdxcoavs 15780
kdxlespl 0
kdxlende 0
kdxlenxt 4684330=0x477a2a
kdxleprv 4684326=0x477a26
kdxledsz 6
kdxlebksz 15816
*** dummy key ***
row#0[9549] flag: ---D--, lock: 0, len=14, data:(6): 00 44 03 10 00 7c
col 0; len 5; (5): c4 04 24 25 2f
----- end of leaf block dump -----
dumping parent of corrupted subtree, row # = 1202
dumping parent of corrupted subtree, row # = 3

對於上述這個index 的錯誤，我是第一次遇見，跟老熊討論了一下，他認為可能是index split情況之下出現的。在遠程時，我也用過bbed
對前後將近10個index block進行了分析，通過比較index 的鏈表，發現確實不匹配。
對於這種情況之下，想通過bbed去修複 index，難度可想而知，因此果斷放棄這種方式。最後無奈之下，只能通過處理資料字典表
的方式來處理掉i_obj1，i_obj2 這2個index。最後再讓客戶進行exp 使用者層級的匯出，只不過這個匯出的時間比較漫長了。

本文章原先以中文撰寫並發佈於 aliyun.com，亦設英文版本，僅作資訊用途。本網站不對文章的準確性，完整性或可靠性或其任何翻譯作出任何明示或暗示的陳述或保證。如對該文章有任何疑慮或投訴，請傳送電郵至 info-contact@alibabacloud.com 並提供相關疑慮或投訴的詳細說明。職員會於 5 個工作天內與您聯絡，一經驗證之後，即會刪除該侵權內容。

A Free Trial That Lets You Build Big!

Start building with 50+ products and up to 12 months usage for Elastic Compute Service

Get Started for Free

Sales Support

1 on 1 presale consultation

Chat Contact Sales
After-Sales Support

24/7 Technical Support 6 Free Tickets per Quarter Faster Response

Open a Ticket
Alibaba Cloud offers highly flexible support services tailored to meet your exact needs.

Learn More