This week, after two days of hard work, we helped a customer successfully recover a nearly 1 TB 10.2.0.5 RAC (ASM) database. The database crashed outright on April 9.
As you can see from the database alert log, when the instance began reporting errors, redo log writes and controlfile access failed because the diskgroup had been forcibly dismounted. The information is as follows:
Tue Mar 04 18:09:59 CST 2014
Errors in file /home/oraprod/10.2.0/db/admin/xxxx/bdump/xxxx_lgwr_15943.trc:
ORA-00345: redo log write error block 68145 count 5
ORA-00312: online log 6 thread 2: '+DATA/xxxx/onlinelog/o2_t2_redo3.log'
ORA-15078: ASM diskgroup was forcibly dismounted.
Tue Mar 04 18:09:59 CST 2014
SUCCESS: diskgroup DATA was dismounted
SUCCESS: diskgroup DATA was dismounted
Tue Mar 04 18:10:00 CST 2014
Errors in file /home/oraprod/10.2.0/db/admin/xxxx/bdump/xxxx_lmon_15892.trc:
ORA-00202: control file: '+DATA/xxxx/controlfile/o1_mf_4g1zr1yo_.ctl'
ORA-15078: ASM diskgroup was forcibly dismounted.
Tue Mar 04 18:10:00 CST 2014
KCF: write/open error block=0x1f41e online=1
file=31 +DATA/xxxx/datafile/pai_ts_queues.310.692585175
error=15078 txt: ''
Tue Mar 04 18:10:00 CST 2014
KCF: write/open error block=0x47d5d online=1
file=51 +DATA/xxx/datafile/pai_ts_tx_data.353.692593409
error=15078 txt: ''
Tue Mar 04 18:10:00 CST 2014
Errors in file /home/oraprod/10.2.0/db/admin/xxxx/bdump/xxxx_dbw2_15939.trc:
ORA-00202: control file: '+DATA/prod/controlfile/o1_mf_4g1zr1yo_.ctl'
ORA-15078: ASM diskgroup was forcibly dismounted.
Tue Mar 04 18:10:00 CST 2014
KCF: write/open error block=0x47d5b online=1
file=51 +DATA/prod/datafile/pai_ts_tx_data.353.692593409
error=15078 txt: ''
Tue Mar 04 18:10:00 CST 2014
After the database instance crashed, let's take a look at the alert log of the ASM instance, shown below:
Tue Mar 04 18:10:04 CST 2014
NOTE: SMON starting instance recovery for group 1 (mounted)
Tue Mar 04 18:10:04 CST 2014
WARNING: IO Failed. au: 0 diskname:/dev/raw/raw5
Rq: 0x200000000207b518 buffer: 0x200000000235c600 au_offset (bytes): 0 iosz: 4096 operation: 0
Status: 2
WARNING: IO Failed. au: 0 diskname:/dev/raw/raw5
Rq: 0x200000000207b518 buffer: 0x200000000235c600 au_offset (bytes): 0 iosz: 4096 operation: 0
Status: 2
NOTE: F1X0 found on disk 0 fcn 0.160230519
WARNING: IO Failed. au: 33 diskname:/dev/raw/raw5
Rq: 0x60000000002d64f0 buffer: 0x400405df000 au_offset (bytes): 0 iosz: 4096 operation: 0
Status: 2
WARNING: cache failed to read gn=1 fn=3 blk=10752 count=1 from disk 2
ERROR: cache failed to read fn=3 blk=10752 from disk(s): 2
ORA-15081: failed to submit an I/O operation to a disk
NOTE: cache initiating offline of disk 2 group 1
WARNING: process 12863 initiating offline of disk 2.2526420198 (DATA_0002) with mask 0x3 in group 1
NOTE: PST update: grp = 1, dsk = 2, mode = 0x6
Tue Mar 04 18:10:04 CST 2014
ERROR: too many offline disks in PST (grp 1)
Tue Mar 04 18:10:04 CST 2014
ERROR: PST-initiated mandatory dismount of group DATA
Tue Mar 04 18:10:04 CST 2014
WARNING: Disk 2 in group 1 in mode: 0x7, state: 0x2 was taken offline
Tue Mar 04 18:10:05 CST 2014
NOTE: halting all I/Os to diskgroup DATA
NOTE: active pin found: 0x0x40045bb0fd0
Tue Mar 04 18:10:05 CST 2014
Abort recovery for domain 1
Tue Mar 04 18:10:05 CST 2014
NOTE: cache dismounting group 1/0xD916EC16 (DATA)
Tue Mar 04 18:10:06 CST 2014
We can see that ASM reported an ORA-15081 error; before that error appeared, I/O operations against one of the disks, /dev/raw/raw5, had already failed.
Careful readers will notice that the disk was taken offline because of these failed I/O operations, and in the end the diskgroup could not be mounted again.
Our tests showed that kfed could not read the disk and dd could not operate on the raw device either; however, dd against the underlying physical disk worked fine.
The diskgroup would not mount, and the trace shows that the disk header is corrupted, as shown below:
WARNING: cache read a corrupted block gn=1 dsk=2 blk=1 from disk 2
OSM metadata block dump:
kfbh.endian:     0 ; 0x000: 0x00
kfbh.hard:       0 ; 0x001: 0x00
kfbh.type:       0 ; 0x002: KFBTYP_INVALID
kfbh.datfmt:     0 ; 0x003: 0x00
kfbh.block.blk:  0 ; 0x004: T=0 NUMB=0x0
kfbh.block.obj:  0 ; 0x008: TYPE=0x0 NUMB=0x0
kfbh.check:      0 ; 0x00c: 0x00000000
kfbh.fcn.base:   0 ; 0x010: 0x00000000
kfbh.fcn.wrap:   0 ; 0x014: 0x00000000
kfbh.spare1:     0 ; 0x018: 0x00000000
kfbh.spare2:     0 ; 0x01c: 0x00000000
CE: (0x0x400417ee4e0) group = 1 (DATA) obj = 2 (disk) blk = 1
hashFlags = 0x0002 lid = 0x0002 lruFlags = 0x0000 bastCount = 1
redundancy = 0x11 fileExtent = -2147483648 AUindex = 0 blockIndex = 1
copy #0: disk = 2 au = 0
BH: (0x0x40041795000) bnum = 4586 type = reading state = reading chgSt = not modifying
flags = 0x00000000 pinmode = excl lockmode = share bf = 0x0x40041400000
kfbh_kfcbh.fcn_kfbh = 0.0 lowAba = 655.8572 highAba = 0.0
last kfcbInitSlot return code = null cpkt lnk is null
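For reference, the kfed and dd checks mentioned above were along the following lines. This is only a sketch: /dev/raw/raw5 comes from the logs, but the underlying physical device name (/dev/sdf here) and the output paths are assumptions, and the exact sizes depend on the diskgroup's AU and block size.

# Try to read the ASM disk header through the raw device with kfed;
# on a healthy disk this shows kfbh.type=KFBTYP_DISKHEAD, here it failed.
kfed read /dev/raw/raw5 aun=0 blkn=0

# Try to read the first allocation unit through the raw device with dd;
# in this case the read against the raw device also failed.
dd if=/dev/raw/raw5 of=/tmp/raw5_au0.dd bs=4096 count=256

# Read the same region directly from the underlying physical disk
# (device name is an assumption); this read succeeded.
dd if=/dev/sdf of=/tmp/sdf_au0.dd bs=4096 count=256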
As we all know, starting with Oracle ASM 10.2.0.5 the disk header is automatically backed up elsewhere on the disk, so header corruption is normally easy to repair. However, it was not that simple here: checking with dd confirmed that the first block of the disk was indeed damaged.
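When only the header block is broken, the backup copy can normally be checked and written back with kfed. The sketch below assumes the default 1 MB AU and 4 KB metadata block size (which place the backup in the last block of the second allocation unit) and a kfed version that supports the repair command; in our case this route did not apply, because the raw device itself could not be read.

# Read the backup copy of the disk header; with 1 MB AUs and 4 KB blocks
# it sits in the last block of AU 1.
kfed read /dev/raw/raw5 aun=1 blkn=254

# If the backup shows kfbh.type=KFBTYP_DISKHEAD, kfed can copy it back
# over the damaged header block.
kfed repair /dev/raw/raw5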
In the end, we used ODU to copy the data files from the disks directly to a file system, then brought the database up from those copies and completed the recovery.
Note: during the recovery, ODU could not directly copy certain files, such as test201402.dbf, even though the ASM alias directory showed that their metadata was actually intact; this may be a small issue in ODU's handling. After working around it manually, we extracted all of the remaining files, including the redo logs and the controlfile, and opened the database smoothly.
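For completeness, once the controlfile, redo logs, and data files have been extracted to a file system, the database still has to be pointed at the new locations before it can be opened. The following is only a rough sketch of that last step; every file-system path below is made up for illustration (only the old +DATA names come from the logs above), and the real case may have required additional work.

# Assumes the pfile has already been edited so that control_files points at
# the extracted controlfile copy (all paths below are examples).
sqlplus / as sysdba <<'EOF'
startup mount pfile='/recovery/xxxx/initxxxx.ora'
-- Repoint each datafile and online log from its old +DATA name to the
-- extracted copy; one RENAME FILE per file (only two shown here).
alter database rename file '+DATA/xxxx/datafile/pai_ts_queues.310.692585175'
  to '/recovery/xxxx/pai_ts_queues_310.dbf';
alter database rename file '+DATA/xxxx/onlinelog/o2_t2_redo3.log'
  to '/recovery/xxxx/o2_t2_redo3.log';
alter database open;
EOF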
I have to say that brother Xiong's ODU is truly powerful; it can handle just about any Oracle ASM database recovery case!