這是一個網友的問題,10201的windows環境,非歸檔,無備份,資料庫open的時候報常見的ORA-00600 4193錯誤。
這個錯誤很常見,我們也分析過多次,這裡再次來看下網友這裡的情況。其中alert log如下:
Fri Nov 07 13:58:18 2014
MTTR advisory is disabled because FAST_START_MTTR_TARGET is not set
Fri Nov 07 13:58:18 2014
SMON: enabling cache recovery
Fri Nov 07 13:58:19 2014
Errors in file e:\oracle\product\10.2.0\admin\orcl\udump\orcl_ora_7420.trc:
ORA-00600: 內部錯誤碼, 參數: [4193], [65], [71], [], [], [], [], []
Fri Nov 07 13:58:21 2014
Doing block recovery for file 1 block 415
Block recovery from logseq 6910, block 3 to scn 262520099
Fri Nov 07 13:58:21 2014
Recovery of Online Redo Log: Thread 1 Group 3 Seq 6910 Reading mem 0
Mem# 0 errs 0: E:\ORACLE\PRODUCT\10.2.0\ORADATA\ORCL\REDO03.LOG
Block recovery stopped at EOT rba 6910.5.16
Block recovery completed at rba 6910.5.16, scn 0.262520098
Doing block recovery for file 1 block 9
Block recovery from logseq 6910, block 3 to scn 262520097
Fri Nov 07 13:58:21 2014
Recovery of Online Redo Log: Thread 1 Group 3 Seq 6910 Reading mem 0
Mem# 0 errs 0: E:\ORACLE\PRODUCT\10.2.0\ORADATA\ORCL\REDO03.LOG
Block recovery completed at rba 6910.5.16, scn 0.262520098
Fri Nov 07 13:58:21 2014
Errors in file e:\oracle\product\10.2.0\admin\orcl\udump\orcl_ora_7420.trc:
ORA-00604: 遞迴 SQL 層級 1 出現錯誤
ORA-00607: 當更改資料區塊時出現內部錯誤
ORA-00600: 內部錯誤碼, 參數: [4193], [65], [71], [], [], [], [], []
Error 604 happened during db open, shutting down database
USER: terminating instance due to error 604
對於ORA-00600 4193錯誤,Oracle docs是這樣解釋的:
ERROR:
ORA-600 [4193] [a] [b]
VERSIONS:
versions 6.0 to 10.1
DESCRIPTION:
A mismatch has been detected between Redo records and Rollback (Undo)
records.
We are validating the Undo block sequence number in the undo block against
the Redo block sequence number relating to the change being applied.
This error is reported when this validation fails.
通過查看網友傳的alert log資訊,發現這哥們進行了大量的操作,幾乎把所有的恢複動作都試了一遍。如下:
System parameters with non-default values:
processes = 300
sessions = 335
event = 10513 trace name context forever,level 2 : 10512 trace name context forever,level 1: 10511 trace name context forever,level 2: 10510 trace name context forever,level 1
sga_max_size = 1619001344
__shared_pool_size = 150994944
__large_pool_size = 8388608
__java_pool_size = 8388608
__streams_pool_size = 0
sga_target = 1619001344
control_files = E:\ORACLE\PRODUCT\10.2.0\ORADATA\ORCL\CONTROL01.CTL, E:\ORACLE\PRODUCT\10.2.0\ORADATA\ORCL\CONTROL02.CTL, E:\ORACLE\PRODUCT\10.2.0\ORADATA\ORCL\CONTROL03.CTL
db_block_size = 8192
__db_cache_size = 1442840576
compatible = 10.2.0.1.0
db_file_multiblock_read_count= 16
db_recovery_file_dest = E:\oracle\product\10.2.0/flash_recovery_area
db_recovery_file_dest_size= 2147483648
_allow_resetlogs_corruption= TRUE
undo_management = MANUAL
undo_tablespace = UNDOTBS1
_smu_debug_mode = 4
remote_login_passwordfile= EXCLUSIVE
db_domain =
dispatchers = (PROTOCOL=TCP) (SERVICE=orclXDB)
job_queue_processes = 10
audit_file_dest = E:\ORACLE\PRODUCT\10.2.0\ADMIN\ORCL\ADUMP
background_dump_dest = E:\ORACLE\PRODUCT\10.2.0\ADMIN\ORCL\BDUMP
user_dump_dest = E:\ORACLE\PRODUCT\10.2.0\ADMIN\ORCL\UDUMP
core_dump_dest = E:\ORACLE\PRODUCT\10.2.0\ADMIN\ORCL\CDUMP
db_name = orcl
open_cursors = 300
pga_aggregate_target = 203423744
我們可以看到他使用了event 來屏蔽smon的復原段,使用了隱含參數強制開啟,使用了undo參數來修改資料表空間。據說還使用了強制
offline 復原段的一些參數。
實際上,針對這個問題,我們先不管怎麼解決,首先我們需要分析為什麼Oracle這裡會報這個錯誤?
首先有幾個問題:
1) Oracle 在open的過程之中是執行什麼sql報錯的?
2) 為什麼會報錯
3)如果強制屏蔽復原段是否有影響?
對於第一個問題,很簡單,我們搜尋trace就很容易定位到是這個SQL在執行時報錯的:
ORA-00600: 內部錯誤碼, 參數: [4193], [65], [71], [], [], [], [], []
Current SQL statement for this session:
update undo$ set name=:2,file#=:3,block#=:4,status$=:5,user#=:6,undosqn=:7,xactsqn=:8,scnbas=:9,scnwrp=:10,inst#=:11,ts#=:12,spare1=:13 where us#=:1
進一步搜尋,我們還可以定位到Oracle在執行這個遞迴SQL時,是在操作什麼復原段的時候報錯的(這裡是US#表示復原段編號):
Cursor#5(050D0F84) state=BOUND curiob=60F2AE90
curflg=d fl2=0 par=050D0E84 ses=7AB23D28
sqltxt(7A734404)=update undo$ set name=:2,file#=:3,block#=:4,status$=:5,user#=:6,undosqn=:7,xactsqn=:8,scnbas=:9,scnwrp=:10,inst#=:11,ts#=:12,spare1=:13 where us#=:1
hash=9caba1288112094d5553173dd30cd6c3
parent=0CF83F1C maxchild=01 plk=0EEB3834 ppn=n
cursor instantiation=60F2AE90
child#0(7A7342C0) pcs=0CF84120
clk=0EED3A90 ci=0CF83774 pn=0E493BC4 ctx=0C686938
kgsccflg=0 llk[60F2AE94,60F2AE94] idx=0
xscflg=e0100666 fl2=d100400 fl3=4022218c fl4=0
Bind bytecodes
Opcode = 5 Bind Rpi Scalar Sql In (not out) Nocopy
Offsi = 36, Offsi = 0
......
Opcode = 1 Unoptimized
Offsi = 36, Offsi = 240
kkscoacd
Bind#0
oacdty=01 mxl=32(09) mxlc=00 mal=00 scl=00 pre=00
oacflg=18 fl2=0001 frm=01 csi=852 siz=32 off=0
kxsbbbfp=7a7346ba bln=32 avl=09 flg=09
value="_SYSSMU1$"
Bind#1
oacdty=02 mxl=22(22) mxlc=00 mal=00 scl=00 pre=00
oacflg=08 fl2=0001 frm=00 csi=00 siz=24 off=0
kxsbbbfp=60f26c6c bln=24 avl=02 flg=05
value=2
..........
value=1
Bind#12
oacdty=02 mxl=22(22) mxlc=00 mal=00 scl=00 pre=00
oacflg=08 fl2=0001 frm=00 csi=00 siz=24 off=0
kxsbbbfp=60f26c90 bln=22 avl=02 flg=05
value=1
Frames pfr 60F27024 siz=4832 efr 60F2706C siz=4820
Cursor frame dump
enxt: 3.0x00000710 enxt: 2.0x00000104 enxt: 1.0x00000ac0
pnxt: 2.0x00000004 pnxt: 1.0x00000008
kxscphp 05471F28 siz=1000 inu=0 nps=416
kxscehp 05472450 siz=1000 inu=0 nps=744
----------------------------------------
很明顯,我們可以看到,本質上Oracle是在對復原段_SYSSMU1$ 進行update時出現問題了。
現在我們來回答第2個問題,Oracle為什麼會報錯呢? 我們再來看下這個ORA-00600錯誤:
ORA-00600: 內部錯誤碼, 參數: [4193], [65], [71], [], [], [], [], []
根據文檔的解釋,這個錯誤的意思是redo record的seq和undo record的seq不匹配導致。
那麼這裡的65和71到底是什麼意思呢 ? 這個錯誤的格式是這樣: ORA-600 [4193] [a] [b]
a 即 65,表示undo record seq
b 即 71,表示redo record seq.
那麼Oracle這裡為什麼會得出一個不一致的結果呢?很明顯,65 是不等於71的。
從網友提供的Trace檔案,我們可以看到這樣一段資訊:
ktudb redo: siz: 252 spc: 6372 flg: 0x0012 seq: 0x0047 rec: 0x08
xid: 0x0000.026.00000052
ktubl redo: slt: 38 rci: 0 opc: 11.1 objn: 15 objd: 15 tsn: 0
Undo type: Regular undo Begin trans Last buffer split: No
Temp Object: No
Tablespace Undo: No
0x00000000 prev ctl uba: 0x0040019f.0047.07
prev ctl max cmt scn: 0x0000.0f92fecd prev tx cmt scn: 0x0000.0f92fecf
很明顯,這是redo相關資訊,這裡的seq 為0×0047,轉換之後即為71. 原來,這就是ora-00600錯誤的b 值的來源.
那麼ora-00600錯誤的a值,65又是哪兒來的呢 ? 從前面的UBA資訊,我們可以知道,這個事務對應的undo block是:0x0040019f
那麼我們來看下這個undo block中的內容是什麼樣的? 搜尋block地址,我們發現這是一個system的block,顯然是system 復原段
的block,如下:
Block after image is corrupt:
buffer tsn: 0 rdba: 0x0040019f (1/415)
scn: 0x0000.0f468845 seq: 0x01 flg: 0x04 tail: 0x88450201
frmt: 0x02 chkval: 0x911f type: 0x02=KTU UNDO BLOCK
Hex dump of block: st=0, typ_found=1
Dump of memory from 0x08DB6000 to 0x08DB8000
8DB6000 0000A202 0040019F 0F468845 04010000 [......@.E.F.....]
.........
8DB7FF0 2E3004C1 332E02C1 323202C1 88450201 [..0....3..22..E.]
********************************************************************************
UNDO BLK:
xid: 0x0000.043.00000047 seq: 0x41 cnt: 0x22 irb: 0x22 icl: 0x0 flg: 0x0000
Rec Offset Rec Offset Rec Offset Rec Offset Rec Offset
---------------------------------------------------------------------------
0x01 0x1f2c 0x02 0x1e30 0x03 0x1d34 0x04 0x1c38 0x05 0x1b3c
0x06 0x1a40 0x07 0x1944 0x08 0x1848 0x09 0x174c 0x0a 0x1650
0x0b 0x1554 0x0c 0x1458 0x0d 0x135c 0x0e 0x1260 0x0f 0x1164
0x10 0x1068 0x11 0x0f6c 0x12 0x0e70 0x13 0x0d74 0x14 0x0c78
0x15 0x0b7c 0x16 0x0a80 0x17 0x09c4 0x18 0x0908 0x19 0x084c
0x1a 0x0790 0x1b 0x06d4 0x1c 0x0618 0x1d 0x055c 0x1e 0x04a0
0x1f 0x03e4 0x20 0x02e8 0x21 0x01ec 0x22 0x00f0
......
.......
*-----------------------------
* Rec #0x7 slt: 0x22 objn: 15(0x0000000f) objd: 15 tblspc: 0(0x00000000)
* Layer: 11 (Row) opc: 1 rci 0x00
Undo type: Regular undo Begin trans Last buffer split: No
Temp Object: No
Tablespace Undo: No
rdba: 0x00000000
*-----------------------------
uba: 0x0040019f.0041.06 ctl max scn: 0x0000.0f3d6a95 prv tx scn: 0x0000.0f3d6a97
txn start scn: scn: 0x0000.0f44c775 logon user: 0
prev brb: 4194714 prev bcl: 0
KDO undo record:
KTB Redo
op: 0x04 ver: 0x01
op: L itl: xid: 0x0000.031.00000047 uba: 0x0040019f.0041.06
flg: C--- lkc: 0 scn: 0x0000.0f44c774
KDO Op code: URP row dependencies Disabled
xtype: XA flags: 0x00000000 bdba: 0x0040006a hdba: 0x00400069
itli: 1 ispac: 0 maxfr: 4863
tabn: 0 slot: 3(0x3) flag: 0x2c lock: 0 ckix: 93
ncol: 17 nnew: 12 size: 0
col 1: [ 9] 5f 53 59 53 53 4d 55 33 24
col 2: [ 2] c1 02
col 3: [ 2] c1 03
col 4: [ 2] c1 2a
col 5: [ 6] c5 03 39 11 3a 05
col 6: [ 1] 80
col 7: [ 4] c3 13 4c 39
col 8: [ 4] c3 04 52 5b
col 9: [ 1] 80
col 10: [ 2] c1 03
col 11: [ 2] c1 02
col 16: [ 2] c1 02
我們根據前面的UBA:uba: 0x0040019f.0047.07 定位到該事務的資訊應該是undo block的第7個record中,當我們定位到第7個record時,
我們可以看到,這裡的seq其實卻是41,轉換為10進位就是65. 這也就是ORA-00600這個錯誤的由來.
最後我們來回答第3個問題,那麼就如果強制屏蔽復原段是否有影響呢?很明顯,網友這裡的對象是obj 15,這是一個核心對象。
強制的屏蔽復原段肯定是不妥的。其實處理方法有很多種,針對類似的問題我已經在道森Oracle培訓的特殊恢複課程中講過多次了。