A customer's 50 TB ASM disk group failed and, after a joint effort, it was recovered, so I am recording it here. In fact it turned out to be much simpler than I expected. Below is a detailed description of the fault.
- DB alert log information
Mon Jul 6 15:17:43 2015
Errors in file /oraclehome/admin/xxxx/udump/xxxx_ora_262606.trc:
ORA-27063: number of bytes read/written is incorrect
IBM AIX RISC System/6000 Error: 6: No such device or address
Additional information: -1
Additional information: 1048576
ORA-19502: write error on file "+DG_xxxx/xxxx/datafile/xxxx_load_1.679.847025081", blockno 3136352 (blocksize = 32768)
ORA-27063: number of bytes read/written is incorrect
......
......
Errors in file /oraclehome/admin/xxxx/bdump/xxxx_lgwr_185246.trc:
ORA-00340: IO error processing online log 9 of thread 1
ORA-00345: redo log write error block 394389 count 2
ORA-00312: online log 9 thread 1: '+DG_xxxx/xxxx/onlinelog/group_9.433.789664647'
ORA-15078: ASM diskgroup was forcibly dismounted.
Mon Jul 6 15:18:46 2015
LGWR: terminating instance due to error 340.
Instance terminated by LGWR, pid = 185246
From the db alert log, an I/O error occurred that prevented the LGWR process from writing the online redo log, and LGWR finally terminated the database instance. Obviously, what we need to analyze here is why the LGWR process could not write. Check the ASM alert log:
Thu Jul 3 08:24:26 2014
NOTE: ASMB process exiting due to lack of ASM file activity
Mon Jul 6 15:18:44 2015
Errors in file /oraclehome/product/10.2.0/admin/+ASM/udump/+asm_ora_353008.trc:
ORA-27091: unable to queue I/O
ORA-27072: File I/O error.
IBM AIX RISC System/6000 Error: 6: No such device or address
Additional information: 7
Additional information: -939091968
Additional information: -1
Mon Jul 6 15:18:46 2015
NOTE: cache initiating offline of disk 32 group 1
WARNING: offlining disk 32.1115675731 (DG_xxxx_0032) with mask 0x3
NOTE: PST update: grp = 1, dsk = 32, mode = 0x6
Mon Jul 6 15:18:46 2015
ERROR: too many offline disks in PST (grp 1)
Mon Jul 6 15:18:46 2015
ERROR: PST-initiated mandatory dismount of group DG_xxxx
From the above information, it is obvious that an I/O problem occurred on disk 32 of the ASM disk group, which caused the disk group to be forcibly dismounted and the database instance to crash. (A generic way to check how ASM sees each disk is sketched below.)
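This is a routine check rather than output from the customer's system; with the disk group dismounted most columns are empty, but the discovered paths and the header status of each disk are still visible. A minimal sketch, run from the ASM instance:

-- Generic v$asm_disk check (not taken from this case)
SQL> select group_number, disk_number, name, path, total_mb,
            mount_status, header_status, mode_status, state
       from v$asm_disk
      order by group_number, disk_number;
-- A disk that ASM has offlined will no longer show MODE_STATUS = 'ONLINE'.

When the customer then tried to mount the disk group manually, the following errors were reported: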
SQL> alter diskgroup datadg mount
Mon Jul 6 15:33:50 2015
Errors in file /oraclehome/product/10.2.0/admin/+ASM/bdump/+asm_dbw0_275092.trc:
ORA-15066: offlining disk "DG_xxxx_0032" may result in a data loss
ORA-15066: offlining disk "DG_xxxx_0032" may result in a data loss
ORA-15066: offlining disk "DG_xxxx_0032" may result in a data loss
ORA-15066: offlining disk "DG_xxxx_0032" may result in a data loss
......
......
ORA-15066: offlining disk "DG_xxxx_0032" may result in a data loss
OR
Mon Jul 6 15:33:51 2015
Errors in file /oraclehome/product/10.2.0/admin/+ASM/bdump/+asm_b000_360654.trc:
ORA-00600: internal error code, arguments: [kfcDismount02], [], [], [], [], [], [], [], []
Mon Jul 6 15:33:52 2015
NOTE: cache dismounting group 1/0xDDDF2CC7 (DG_xxxx)
NOTE: dbwr not being msg'd to dismount
Mon Jul 6 15:33:52 2015
This ORA-600 [kfcDismount02] error is most likely due to a bug. After the corresponding patch was installed, the mount was attempted again and it still failed, because the damage to the disk group had already been done. (A generic way to verify that an interim patch is in place is sketched below, followed by the renewed mount attempt.)
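As an aside, whether a given one-off patch is actually installed in the Oracle home can be verified with OPatch; these are generic commands, not output from this case, and the patch number is a placeholder.

# List the interim (one-off) patches registered in this Oracle home.
$ $ORACLE_HOME/OPatch/opatch lsinventory
# Or look for a specific patch number (placeholder shown).
$ $ORACLE_HOME/OPatch/opatch lsinventory | grep -i 1234567

With the patch in place, the mount was attempted again: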
SQL> alter diskgroup dg_xxxx mount
Tue Jul 7 05:49:29 2015
NOTE: cache registered group DG_xxxx number = 1 incarn = 0x000061a1f
Tue Jul 7 05:49:31 2015
NOTE: Hbeat: instance first (grp 1)
Tue Jul 7 05:49:36 2015
NOTE: start heartbeating (grp 1)
NOTE: cache opening disk 0 of grp 1: DG_xxxx_0000 path:/dev/rhdiskpower41
NOTE: cache opening disk 1 of grp 1: DG_xxxx_0001 path:/dev/rhdiskpower42
NOTE: cache opening disk 2 of grp 1: DG_xxxx_0002 path:/dev/rhdiskpower43
NOTE: cache opening disk 3 of grp 1: DG_xxxx_0003 path:/dev/rhdiskpower44
NOTE: cache opening disk 4 of grp 1: DG_xxxx_0004 path:/dev/rhdiskpower45
......
......
NOTE: cache opening disk 33 of grp 1: DG_xxxx_0033 path:/dev/rhdiskpower15
NOTE: cache opening disk 34 of grp 1: DG_xxxx_0034 path:/dev/rhdiskpower14
NOTE: cache opening disk 35 of grp 1: DG_xxxx_0035 path:/dev/rhdiskpower13
NOTE: cache mounting (first) group 1/0x000061a1f (DG_xxxx)
NOTE: starting recovery of thread = 1 ckpt = 6295.7329 group = 1
NOTE: crash recovery signalled OER-15131
ERROR: ORA-15131 signalled during mount of diskgroup DG_xxxx
NOTE: cache dismounting group 1/0x000061a1f (DG_xxxx)
ERROR: diskgroup DG_xxxx was not mounted
Tue Jul 7 05:50:10 2015
Shutting down instance: further logons disabled
We can see that crash recovery is required when the ASM disk group is mounted, and the checkpoint is 6295.7329. Checking the trace file, the position that the recovery reads is as follows:
NOTE: starting recovery of thread = 1 ckpt = 6295.7329 group = 1
CE: (0x70000000000c9640) group = 1 (DG_xxxx) obj = 625 blk = 256 (indirect)
hashFlags = 0x0100 lid = 0x0002 lruFlags = 0x0000 bastCount = 1
redundancy = 0x11 fileExtent = 0 AUindex = 0 blockIndex = 0
copy #0: disk = 32 au = 1638611
BH: (0x70000001079c360) bnum = 143 type = rcv reading state = rcvRead chgSt = not modifying
flags = 0x00000000 pinmode = excl lockmode = null bf = 0x70000001048e000
kfbh_kfcbh.fcn_kfbh = 0.0 lowAba = 0.0 highAba = 0.0
last kfcbInitSlot return code = null cpkt lnk is null
*** 05:26:12.382
As you can see, Oracle needs to read AU number 1638611 on disk 32. The default AU size in 10g is 1 MB, so this location is roughly 1.6 TB into the disk. In fact, this checkpoint location is easy to find; kfed can read it directly, as follows:
[xxxx:/oraclehome]$ kfed read /xxx/rhdiskpower13 aun=3 blkn=0 | more
kfbh.endian: 0 ; 0x000: 0x00
kfbh.hard: 130 ; 0x001: 0x82
kfbh.type: 7 ; 0x002: KFBTYP_ACDC
kfbh.datfmt: 1 ; 0x003: 0x01
kfbh.block.blk: 0 ; 0x004: T=0 NUMB=0x0
kfbh.block.obj: 3 ; 0x008: TYPE=0x0 NUMB=0x3
kfbh.check: 1350450563 ; 0x00c: 0x507e3d83
kfbh.fcn.base: 0 ; 0x010: 0x00000000
kfbh.fcn.wrap: 0 ; 0x014: 0x00000000
kfbh.spare1: 0 ; 0x018: 0x00000000
kfbh.spare2: 0 ; 0x01c: 0x00000000
kfracdc.eyec[0]: 65 ; 0x000: 0x41
kfracdc.eyec[1]: 67 ; 0x001: 0x43
kfracdc.eyec[2]: 68 ; 0x002: 0x44
kfracdc.eyec[3]: 67 ; 0x003: 0x43
kfracdc.thread: 1 ; 0x004: 0x00000001
kfracdc.lastAba.seq: 4294967295 ; 0x008: 0xffffffff
kfracdc.lastAba.blk: 4294967295 ; 0x00c: 0xffffffff
kfracdc.blk0: 1 ; 0x010: 0x00000001
kfracdc.blks: 10751 ; 0x014: 0x000029ff
kfracdc.ckpt.seq: 6295 ; 0x018: 0x00001897   --- ckpt value
kfracdc.ckpt.blk: 7329 ; 0x01c: 0x00001ca1
kfracdc.fcn.base: 297751371 ; 0x020: 0x11bf534b
kfracdc.fcn.wrap: 0 ; 0x024: 0x00000000
kfracdc.bufBlks: 64 ; 0x028: 0x00000040
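As a quick cross-check (a back-of-the-envelope sketch; the device name below is only a placeholder): the kfracdc.ckpt values above are just the hex form of the checkpoint in the recovery message, and with the default 1 MB AU size the target AU translates directly into a byte offset on the disk.

# 0x1897 = 6295 and 0x1ca1 = 7329, i.e. exactly the "ckpt = 6295.7329" being recovered.
# With 1 MB (1048576-byte) AUs, the byte offset of AU 1638611 is:
$ echo $((1638611 * 1048576))
1718208167936                  # about 1.6 TB into the disk; the AU number is also the offset in MB
# On AIX the reported size of a disk (in MB) can be compared with, for example:
$ bootinfo -s hdiskpower16     # device name is a placeholder in this sketch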
After various attempts, the customer still hit the following error when mounting the disk group:
Tue Jul 7 18:03:03 2015
Errors in file /oraclehome/product/10.2.0/admin/+ASM/udump/+asm_ora_438636.trc:
ORA-00600: internal error code, arguments: [kfcChkAio01], [], [], [], [], [], [], [], []
ORA-15196: invalid ASM block header [kfc.c:5553] [blk_kfbl] [625] [2147483904] [2147483905 != 2147483904]
NOTE: crash recovery signalled OER-600
ERROR: ORA-600 signalled during mount of diskgroup DG_xxxx
NOTE: cache dismounting group 1/0xE70AB6D0 (DG_xxxx)
ERROR: diskgroup DG_xxxx was not mounted
Tue Jul 7 18:05:38 2015
This is exactly the situation described in the Oracle MOS note "Understanding and Fixing errors ORA-600 [kfcChkAio01] and ORA-15196" (Doc ID 757529.1), so we simply handled it as that document describes. The processing is in fact very simple: the note provides a shell script, and you only need to modify the block number for your case (a rough sketch of this style of kfed-based repair follows).
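I will not reproduce the MOS script here, but the general pattern of this kind of repair is: dump the suspect ASM metadata block with kfed, correct the inconsistent field in the text dump, and write it back with kfed merge. The sketch below is a hypothetical illustration only; the device, AU and block numbers are placeholders rather than the values used in this case, and the block should always be backed up first.

# Hypothetical illustration only -- device, aun and blkn are placeholders.
# 1. Dump the suspect metadata block to a text file and keep a backup copy.
$ kfed read /xxx/rhdiskpower16 aun=10 blkn=1 text=blk.txt
$ cp blk.txt blk.txt.bak
# 2. Edit blk.txt and correct the field that the ORA-15196 message reports as
#    mismatched (here blk_kfbl, the block number stored in the block header).
# 3. Write the corrected block back to the same location.
$ kfed merge /xxx/rhdiskpower16 aun=10 blkn=1 text=blk.txt

After processing according to the note, the disk group was mounted successfully: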
Tue Jul 7 18:05:38 2015
SQL> alter diskgroup dg_xxxx mount
Tue Jul 7 18:05:38 2015
NOTE: cache registered group DG_xxxx number = 1 incarn = 0xce0ab6d3
Tue Jul 7 18:05:38 2015
NOTE: Hbeat: instance first (grp 1)
Tue Jul 7 18:05:43 2015
NOTE: start heartbeating (grp 1)
......
......
NOTE: cache mounting (first) group 1/0xCE0AB6D3 (DG_xxxx)
NOTE: starting recovery of thread = 1 ckpt = 6295.7329 group = 1
NOTE: advancing ckpt for thread = 1 ckpt = 6295.8649
NOTE: cache recovered group 1 to fcn 0.297779775
Tue Jul 7 18:05:43 2015
NOTE: opening chunk 1 at fcn 0.297779775 ABA
NOTE: seq = 6296 blk = 8650
Tue Jul 7 18:05:43 2015
NOTE: cache mounting group 1/0xCE0AB6D3 (DG_xxxx) succeeded
SUCCESS: diskgroup DG_xxxx was mounted
Tue Jul 7 18:05:45 2015
NOTE: recovering COD for group 1/0xce0ab6d3 (DG_xxxx)
SUCCESS: completed COD recovery for group 1/0xce0ab6d3 (DG_xxxx)
Finally, let's go back and explain why this happened. The root cause is that the customer did not follow standard practice when adding disks to the disk group earlier, as shown below:
Tue May 20 15:43:26 2014
SQL> alter diskgroup dg_xxxx add disk '/xxx/rhdiskpower24' size 1677721 M, ......
'/xxx/rhdiskpower16' size 1677721 M, '/xxx/rhdiskpower15' size 1677721 M,
'/xxx/rhdiskpower14' size 1677721 M, '/xxx/rhdiskpower13' size 1677721 M rebalance power 8
Wed May 21 08:45:13 2014
Starting background process ASMB
ASMB started with pid = 13, OS id = 267028
Wed May 21 08:45:14 2014
NOTE: ASMB process exiting due to lack of ASM file activity
Wed May 21 12:24:34 2014
NOTE: stopping process ARB5
NOTE: stopping process ARB2
NOTE: stopping process ARB6
NOTE: stopping process ARB1
NOTE: stopping process ARB3
NOTE: stopping process ARB7
NOTE: stopping process ARB4
NOTE: stopping process ARB0
Wed May 21 12:24:38 2014
SUCCESS: rebalance completed for group 1/0x595ad46e (DG_xxxx)
Wed May 21 12:24:38 2014
SUCCESS: rebalance completed for group 1/0x595ad46e (DG_xxxx)
The problematic disk above is disk 32, which was added with a size of 1677721 MB; in fact, the physical size of the disk is only 1638400 MB.
In other words, when the disk was added, a value larger than the real capacity was specified, and Oracle simply took it at face value. Of course, this only shows that the validation of user-specified disk sizes in 10g ASM is not strict enough.
So the problem is now very clear: the AU number in the error, 1638611, is greater than 1638400, which means it points at a location that does not exist on the physical disk, and that is what brought ASM down.
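A quick back-of-the-envelope check shows how neatly the numbers line up (with 1 MB AUs the AU number is also the offset in MB from the start of the disk):

# Size declared when the disk was added:  1677721 MB  (~1.6 TiB)
# Actual physical size of the disk:       1638400 MB  (= 1600 GiB)
# AU that crash recovery tries to read:   AU 1638611  -> starts at offset 1638611 MB
# 1638611 - 1638400 = 211, so the target AU lies about 211 MB past the physical
# end of the disk, yet still well inside the 1677721 MB that ASM believes it has.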
Note: the ASM disk group has 36 disks of roughly 1.6 TB each, about 53 TB in total, and basically all of it was in use. Fortunately it could be repaired easily; otherwise the difficulty and workload of the recovery would have been enormous. It has to be said that Yunhe Enmo is still one of the strongest teams in China for this kind of work!