Oracle database, 1.4 TB, ASM (RAC): a disk corruption recovery case

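This week I spent two days helping a customer recover a 10.2.0.5 RAC (ASM) database of nearly 1.4 TB. The database crashed outright on March 4.
As you can see, the database first reported errors on the redo log and the controlfile. The root cause was that the DISKGROUP had been dismounted. The details are as follows: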

Tue Mar 04 18:09:59 CST 2014
Errors in file /home/oraprod/10.2.0/db/admin/xxxx/bdump/xxxx_lgwr_15943.trc:
ORA-00345: redo log write error block 68145 count 5
ORA-00312: online log 6 thread 2: '+DATA/xxxx/onlinelog/o2_t2_redo3.log'
ORA-15078: ASM diskgroup was forcibly dismounted
Tue Mar 04 18:09:59 CST 2014
SUCCESS: diskgroup DATA was dismounted
SUCCESS: diskgroup DATA was dismounted
Tue Mar 04 18:10:00 CST 2014
Errors in file /home/oraprod/10.2.0/db/admin/xxxx/bdump/xxxx_lmon_15892.trc:
ORA-00202: control file: '+DATA/xxxx/controlfile/o1_mf_4g1zr1yo_.ctl'
ORA-15078: ASM diskgroup was forcibly dismounted
Tue Mar 04 18:10:00 CST 2014
KCF: write/open error block=0x1f41e online=1
 file=31 +DATA/xxxx/datafile/apps_ts_queues.310.692585175
 error=15078 txt: ''
Tue Mar 04 18:10:00 CST 2014
KCF: write/open error block=0x47d5d online=1
 file=51 +DATA/xxx/datafile/apps_ts_tx_data.353.692593409
 error=15078 txt: ''
Tue Mar 04 18:10:00 CST 2014
Errors in file /home/oraprod/10.2.0/db/admin/xxxx/bdump/xxxx_dbw2_15939.trc:
ORA-00202: control file: '+DATA/prod/controlfile/o1_mf_4g1zr1yo_.ctl'
ORA-15078: ASM diskgroup was forcibly dismounted
Tue Mar 04 18:10:00 CST 2014
KCF: write/open error block=0x47d5b online=1
 file=51 +DATA/prod/datafile/apps_ts_tx_data.353.692593409
 error=15078 txt: ''

Tue Mar 04 18:10:00 CST 2014

With the database instance down, let's look at the alert log of the ASM instance:

Tue Mar 04 18:10:04 CST 2014
NOTE: SMON starting instance recovery for group 1 (mounted)
Tue Mar 04 18:10:04 CST 2014
WARNING: IO Failed.  au:0 diskname:/dev/raw/raw5
 rq:0x200000000207b518 buffer:0x200000000235c600 au_offset(bytes):0 iosz:4096 operation:0
 status:2
WARNING: IO Failed.  au:0 diskname:/dev/raw/raw5
 rq:0x200000000207b518 buffer:0x200000000235c600 au_offset(bytes):0 iosz:4096 operation:0
 status:2
NOTE: F1X0 found on disk 0 fcn 0.160230519
WARNING: IO Failed.  au:33 diskname:/dev/raw/raw5
 rq:0x60000000002d64f0 buffer:0x400405df000 au_offset(bytes):0 iosz:4096 operation:0
 status:2
WARNING: cache failed to read gn 1 fn 3 blk 10752 count 1 from disk 2
ERROR: cache failed to read fn=3  blk=10752 from disk(s): 2
ORA-15081: failed to submit an I/O operation to a disk
NOTE: cache initiating offline of disk 2  group 1
WARNING: process 12863 initiating offline of disk 2.2526420198 (DATA_0002) with mask 0x3 in group 1
NOTE: PST update: grp = 1, dsk = 2, mode = 0x6
Tue Mar 04 18:10:04 CST 2014
ERROR: too many offline disks in PST (grp 1)
Tue Mar 04 18:10:04 CST 2014
ERROR: PST-initiated MANDATORY DISMOUNT of group DATA
Tue Mar 04 18:10:04 CST 2014
WARNING: Disk 2 in group 1 in mode: 0x7,state: 0x2 was taken offline
Tue Mar 04 18:10:05 CST 2014
NOTE: halting all I/Os to diskgroup DATA
NOTE: active pin found: 0x0x40045bb0fd0
Tue Mar 04 18:10:05 CST 2014
Abort recovery for domain 1
Tue Mar 04 18:10:05 CST 2014
NOTE: cache dismounting group 1/0xD916EC16 (DATA)
Tue Mar 04 18:10:06 CST 2014
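
In a long ASM alert log, the failing device can be found by tallying the `WARNING: IO Failed.` lines per device. The following is a minimal Python sketch and was not part of the original recovery. The function name `summarize_io_failures` and the regular expression are illustrative assumptions modeled on the excerpt above.

```python
import re
from collections import Counter

# Pattern modeled on the "WARNING: IO Failed." lines in the excerpt above.
IO_FAILED = re.compile(r"WARNING: IO Failed\.\s+au:(\d+)\s+diskname:(\S+)")


def summarize_io_failures(lines):
    """Return (failure count per device, sorted affected AUs per device)."""
    counts = Counter()
    aus = {}
    for line in lines:
        match = IO_FAILED.search(line)
        if match:
            au, disk = int(match.group(1)), match.group(2)
            counts[disk] += 1
            aus.setdefault(disk, set()).add(au)
    return counts, {disk: sorted(s) for disk, s in aus.items()}


# Lines copied from the alert log excerpt above.
sample = [
    "WARNING: IO Failed.  au:0 diskname:/dev/raw/raw5",
    "WARNING: IO Failed.  au:0 diskname:/dev/raw/raw5",
    "NOTE: F1X0 found on disk 0 fcn 0.160230519",
    "WARNING: IO Failed.  au:33 diskname:/dev/raw/raw5",
]
counts, aus = summarize_io_failures(sample)
print(dict(counts), aus)
```

Against a real log, the same function can be fed an open file object, since it only iterates over lines.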

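As you can see, ASM reported an ORA-15081 error, preceded by I/O errors against one of the disks, /dev/raw/raw5.
Careful readers will notice that after the I/O failures the disk was taken offline, and in the end the diskgroup could not be mounted.
In our tests, kfed read could not read the disk, and dd could not operate on it either. However, we could dd the physical disk behind it directly.
The diskgroup could not be mounted, and the trace clearly points to a corrupted disk header, as shown below: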


WARNING: cache read a corrupted block gn=1 dsk=2 blk=1 from disk 2
OSM metadata block dump:
kfbh.endian:                          0 ; 0x000: 0x00
kfbh.hard:                            0 ; 0x001: 0x00
kfbh.type:                            0 ; 0x002: KFBTYP_INVALID
kfbh.datfmt:                          0 ; 0x003: 0x00
kfbh.block.blk:                       0 ; 0x004: T=0 NUMB=0x0
kfbh.block.obj:                       0 ; 0x008: TYPE=0x0 NUMB=0x0
kfbh.check:                           0 ; 0x00c: 0x00000000
kfbh.fcn.base:                        0 ; 0x010: 0x00000000
kfbh.fcn.wrap:                        0 ; 0x014: 0x00000000
kfbh.spare1:                          0 ; 0x018: 0x00000000
kfbh.spare2:                          0 ; 0x01c: 0x00000000
 CE: (0x0x400417ee4e0)  group=1 (DATA) obj=2 (disk)  blk=1
 hashFlags=0x0002  lid=0x0002  lruFlags=0x0000  bastCount=1
 redundancy=0x11  fileExtent=-2147483648  AUindex=0 blockIndex=1
 copy #0:  disk=2  au=0
 BH: (0x0x40041795000)  bnum=4586 type=reading state=reading chgSt=not modifying
 flags=0x00000000  pinmode=excl  lockmode=share  bf=0x0x40041400000
 kfbh_kfcbh.fcn_kfbh = 0.0  lowAba=655.8572  highAba=0.0
 last kfcbInitSlot return code=null cpkt lnk is null
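
The same check can be reproduced offline on a block copied off the disk with dd. The following is a minimal Python sketch that decodes the 32-byte `kfbh` block header using the field offsets shown in the dump above. The helper name `parse_kfbh` is hypothetical. Two things are assumptions rather than facts from the trace: the mapping of `kfbh.endian` (1 is treated as little-endian, anything else as big-endian), and the sample "healthy" header, which is synthetic.

```python
import struct

KFBH_SIZE = 32  # 0x000 .. 0x01f, per the offsets in the dump above
FIELDS = ("endian", "hard", "type", "datfmt", "block_blk", "block_obj",
          "check", "fcn_base", "fcn_wrap", "spare1", "spare2")
# Only the two type codes relevant here; other codes are reported as "other".
BLOCK_TYPES = {0: "KFBTYP_INVALID", 1: "KFBTYP_DISKHEAD"}


def parse_kfbh(block: bytes) -> dict:
    """Decode the generic ASM block header at the start of a metadata block."""
    if len(block) < KFBH_SIZE:
        raise ValueError("need at least 32 bytes")
    # Assumption: endian byte 1 means little-endian, otherwise big-endian.
    order = "<" if block[0] == 1 else ">"
    header = dict(zip(FIELDS, struct.unpack(order + "4B7I", block[:KFBH_SIZE])))
    header["type_name"] = BLOCK_TYPES.get(header["type"], "other")
    header["all_zero"] = not any(block)
    return header


# A wiped block, like the one in the trace: every header field is zero.
wiped = parse_kfbh(bytes(4096))
print(wiped["type_name"], wiped["all_zero"])

# A synthetic disk-header-like block, for contrast (values are made up).
synthetic = struct.pack("<4B7I", 1, 0x82, 1, 1, 0, 2, 0, 0, 0, 0, 0)
healthy = parse_kfbh(synthetic + bytes(4096 - KFBH_SIZE))
print(healthy["type_name"], healthy["all_zero"])
```

A block that decodes as `KFBTYP_INVALID` with every byte zero matches what the trace shows for blk=1 on disk 2.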

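As you know, starting with version 10.2.0.5, Oracle ASM automatically backs up the ASM disk header, so if only the disk header had been damaged the recovery would have been easy.
It was not that simple, though: checking with dd showed that the first several blocks of the disk were corrupted.
In the end we used ODU to copy the datafiles directly from the disks to a file system and then started the database, which completed the recovery.
Note: during the recovery we found that ODU could not directly copy files such as test201402.dbf, yet a check of the ASM alias directory showed that it was intact, so ODU may still have a small issue here.
We manually read out the AUs holding that metadata, matched the entries, and extracted all the remaining files, including the redo logs and the controlfile, and the database opened without trouble.
I have to say, Brother Xiong's ODU is really powerful. It makes short work of all kinds of Oracle ASM database recovery cases!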
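
Reading an AU by hand comes down to seeking to `au_number * au_size` on the device or on a dd image of it. The following is a minimal Python sketch of that step, not the procedure ODU uses. The helper names `read_au` and `invalid_blocks` are hypothetical. The 1 MiB AU size is an assumption (the usual 10g default), and the 4096-byte block size is taken from the `iosz:4096` entries in the alert log. The type-byte check only makes sense for AUs that hold ASM metadata.

```python
import os
import tempfile

AU_SIZE = 1024 * 1024  # assumption: default 1 MiB allocation unit
BLOCK_SIZE = 4096      # metadata block size, per iosz:4096 in the alert log


def read_au(path, au_number, au_size=AU_SIZE):
    """Read one whole allocation unit from a device or image file."""
    with open(path, "rb") as f:
        f.seek(au_number * au_size)
        data = f.read(au_size)
    if len(data) != au_size:
        raise IOError("short read for AU %d" % au_number)
    return data


def invalid_blocks(au_data, block_size=BLOCK_SIZE):
    """Indexes of blocks whose kfbh.type byte (offset 0x002) is 0."""
    count = len(au_data) // block_size
    return [i for i in range(count) if au_data[i * block_size + 2] == 0]


# Demo on a synthetic two-AU image: AU 1 holds blocks with a non-zero type
# byte, except its first two blocks, which are wiped.
block = bytearray(BLOCK_SIZE)
block[2] = 1
au1 = bytearray(bytes(block) * (AU_SIZE // BLOCK_SIZE))
au1[0:2 * BLOCK_SIZE] = bytes(2 * BLOCK_SIZE)

with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(bytes(AU_SIZE))  # AU 0: all zeros
    tmp.write(bytes(au1))      # AU 1: mostly intact
    image_path = tmp.name

damaged = invalid_blocks(read_au(image_path, 1))
print(damaged)
os.remove(image_path)
```

On a real system the path would be the physical device or an image taken from it, opened read-only as here.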
