A friend asked me for help: they had moved an AIX system from direct-attached storage to a storage network behind a fibre switch, and afterwards the RAC failed to start. Analysis showed that the disk order was wrong after the link change, and that the maintenance staff had set PVIDs on the ASM disks. As a result the disk groups could not be mounted: the ASM disks of the voting disk DG were inaccessible, so the RAC's CSSD process could not start, and the disk groups holding the data files could not be mounted either. The disk headers were repaired with kfed, with zero data loss.
Platform version information (2 node RAC)
$ sqlplus -v
SQL*Plus: Release 11.2.0.4.0 Production
$ uname -a
AIX db2 1 7 00F9733E4C00
GI log error message
2014-12-20 16:44:08.769: [OHASD(6946818)]CRS-2769:Unable to failover resource 'ora.diskmon'.
2014-12-20 16:44:11.775: [CSSD(9502756)]CRS-1714:Unable to discover any voting files, retrying discovery in seconds; Details at (:CSSNM00070:) in /u01/app/11.2.0/grid/log/db1/cssd/ocssd.log
2014-12-20 16:44:26.791: [CSSD(9502756)]CRS-1714:Unable to discover any voting files, retrying discovery in seconds; Details at (:CSSNM00070:) in /u01/app/11.2.0/grid/log/db1/cssd/ocssd.log
2014-12-20 16:44:41.812: [CSSD(9502756)]CRS-1714:Unable to discover any voting files, retrying discovery in seconds; Details at (:CSSNM00070:) in /u01/app/11.2.0/grid/log/db1/cssd/ocssd.log
This shows that the RAC cannot start because CSSD cannot find any voting files during startup. Search the earlier log entries to find out which disks hold the voting files.
2014-12-15 17:36:15.424: [CSSD(10027070)]CRS-1605:CSSD voting file is online: /dev/rhdisk4; details in /u01/app/11.2.0/grid/log/db1/cssd/ocssd.log
2014-12-15 17:36:15.433: [CSSD(10027070)]CRS-1605:CSSD voting file is online: /dev/rhdisk5; details in /u01/app/11.2.0/grid/log/db1/cssd/ocssd.log
2014-12-15 17:36:15.445: [CSSD(10027070)]CRS-1605:CSSD voting file is online: /dev/rhdisk6; details in /u01/app/11.2.0/grid/log/db1/cssd/ocssd.log
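When the log is long, the voting-file devices can be pulled out of the CRS-1605 entries with a few lines of script. The following is only a sketch: the function name `voting_files` and the regular expression are my own assumptions, based solely on the message layout in the excerpt above, and are not part of any Oracle tool.

```python
import re

# Pattern assumed from the CRS-1605 lines shown above; adjust it if your
# log layout differs.
VOTING_FILE_RE = re.compile(
    r"CRS-1605:.*?voting file is online:\s*(\S+?);", re.IGNORECASE
)

def voting_files(log_text):
    """Return the distinct voting-file paths in order of first appearance."""
    seen = []
    for match in VOTING_FILE_RE.finditer(log_text):
        path = match.group(1)
        if path not in seen:
            seen.append(path)
    return seen

sample = """\
2014-12-15 17:36:15.424: [CSSD(10027070)]CRS-1605:CSSD voting file is online: /dev/rhdisk4; details in ocssd.log
2014-12-15 17:36:15.433: [CSSD(10027070)]CRS-1605:CSSD voting file is online: /dev/rhdisk5; details in ocssd.log
2014-12-15 17:36:15.445: [CSSD(10027070)]CRS-1605:CSSD voting file is online: /dev/rhdisk6; details in ocssd.log
2014-12-20 16:44:11.775: [CSSD(9502756)]CRS-1714:Unable to discover any voting files
"""

print(voting_files(sample))
```

The CRS-1714 line is ignored because it does not match the pattern, so only the three devices that were last seen online are returned.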
From here you can see that rhdisk4, rhdisk5 and rhdisk6 are the voting disks. Use kfed to look at their disk header information.
$ kfed read /dev/rhdisk4
kfbh.endian:201; 0x000:0xc9
kfbh.hard:194; 0x001:0xc2
kfbh.type:212; 0x002:*** Unknown Enum ***
kfbh.datfmt:193; 0x003:0xc1
kfbh.block.blk:0; 0x004:blk=0
kfbh.block.obj:0; 0x008:file=0
kfbh.check:0; 0x00c:0x00000000
kfbh.fcn.base:0; 0x010:0x00000000
kfbh.fcn.wrap:0; 0x014:0x00000000
kfbh.spare1:0; 0x018:0x00000000
kfbh.spare2:0; 0x01c:0x00000000
1102BEE00 C9C2D4C1 00000000 00000000 00000000 [................]
1102BEE10 00000000 00000000 00000000 00000000 [................]
  Repeat 6 times
1102BEE80 00F9733D 67553E0A 00000000 00000000 [..s=gU>.........]
1102BEE90 00000000 00000000 00000000 00000000 [................]
  Repeat 246 times
KFED-00322: Invalid content encountered during block traversal: [kfbtTraverseBlock][Invalid OSM block type][][212]
$ kfed read /dev/rhdisk4 blkn=1
kfbh.endian:0; 0x000:0x00
kfbh.hard:130; 0x001:0x82
kfbh.type:2; 0x002:KFBTYP_FREESPC
kfbh.datfmt:2; 0x003:0x02
kfbh.block.blk:1; 0x004:blk=1
kfbh.block.obj:2147483648; 0x008:disk=0
kfbh.check:3883664132; 0x00c:0xe77c0304
kfbh.fcn.base:0; 0x010:0x00000000
kfbh.fcn.wrap:0; 0x014:0x00000000
kfbh.spare1:0; 0x018:0x00000000
kfbh.spare2:0; 0x01c:0x00000000
kfdfsb.aunum:0; 0x000:0x00000000
kfdfsb.max:254; 0x004:0x00fe
kfdfsb.cnt:23; 0x006:0x0017
kfdfsb.bound:0; 0x008:0x0000
kfdfsb.flag:1; 0x00a:B=1
kfdfsb.ub1spare:0; 0x00b:0x00
kfdfsb.spare[0]:0; 0x00c:0x00000000
kfdfsb.spare[1]:0; 0x010:0x00000000
kfdfsb.spare[2]:0; 0x014:0x00000000
kfdfse[0].fse:119; 0x018:free=0x7 frag=0x7
kfdfse[1].fse:16; 0x019:free=0x0 frag=0x1
............
$ kfed read /dev/rhdisk4 blkn=510
kfbh.endian:0; 0x000:0x00
kfbh.hard:130; 0x001:0x82
kfbh.type:1; 0x002:KFBTYP_DISKHEAD
kfbh.datfmt:1; 0x003:0x01
kfbh.block.blk:254; 0x004:blk=254
kfbh.block.obj:2147483648; 0x008:disk=0
kfbh.check:3460116983; 0x00c:0xce3d31f7
kfbh.fcn.base:0; 0x010:0x00000000
kfbh.fcn.wrap:0; 0x014:0x00000000
kfbh.spare1:0; 0x018:0x00000000
kfbh.spare2:0; 0x01c:0x00000000
kfdhdb.driver.provstr:ORCLDISK; 0x000:length=8
kfdhdb.driver.reserved[0]:0; 0x008:0x00000000
kfdhdb.driver.reserved[1]:0; 0x00c:0x00000000
kfdhdb.driver.reserved[2]:0; 0x010:0x00000000
kfdhdb.driver.reserved[3]:0; 0x014:0x00000000
kfdhdb.driver.reserved[4]:0; 0x018:0x00000000
kfdhdb.driver.reserved[5]:0; 0x01c:0x00000000
kfdhdb.compat:186646528; 0x020:0x0b200000
kfdhdb.dsknum:0; 0x024:0x0000
kfdhdb.grptyp:2; 0x026:KFDGTP_NORMAL
kfdhdb.hdrsts:3; 0x027:KFDHDR_MEMBER
kfdhdb.dskname:CRS_0000; 0x028:length=8
kfdhdb.grpname:CRS; 0x048:length=3
kfdhdb.fgname:CRS_0000; 0x068:length=8
............
From this analysis it is fairly clear that the ASM disk header has been destroyed: block 0 is no longer a valid ASM block, while block 1 and the header copy in block 510 are intact. The next step is to find the cause of the damage.
[db2/dev#] lspv
hdisk0   00f9733ef7cf27e9   rootvg   active
hdisk1   00f9733e21b953e6   rootvg   active
hdisk2   00f9733e21b97a83   appvg    active
hdisk3   00f9733e21b98434   appvg    active
hdisk4   00f9733d67553e0a   None
hdisk5   00f9733d67553f31   None
hdisk6   00f9733d67554011   None
hdisk7   00f9733d67554165   None
hdisk8   00f9733d675541e5   None
hdisk9   00f9733d675542e4   None
hdisk10  none               None
[db2/dev#] ls -l rhdisk*
crw------- 2 root system, 1 Oct 11:45 rhdisk0
crw------- 1 root system, 3 Oct 13:27 rhdisk1
crw------- 1 root system, 5 Dec 20:02 rhdisk10
crw------- 1 root system, 2 Oct 13:32 rhdisk2
crw------- 1 root system, 0 Oct 13:32 rhdisk3
crw-rw---- 1 grid asmadmin, 8 Dec 20:02 rhdisk4
crw-rw---- 1 grid asmadmin, 9 Dec 20:02 rhdisk5
crw-rw---- 1 grid asmadmin 20:02 rhdisk6
crw-rw---- 1 grid asmadmin, 4 Dec 20:02 rhdisk7
crw-rw---- 1 grid asmadmin, 6 Dec 20:02 rhdisk8
crw-rw---- 1 grid asmadmin, 7 Dec 20:02 rhdisk9
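The PVID that lspv reports for hdisk4 (00f9733d67553e0a) is exactly the value the kfed dump of block 0 shows at offset 0x80, and the first four bytes of that block (C9C2D4C1) decode to "IBMA" in EBCDIC. A minimal sketch of that cross-check follows. It works on a synthetic 4096-byte block rebuilt from the dump rather than on a real device, the function names are hypothetical, and the offsets are taken from this one dump only, so treat them as assumptions.

```python
# Values read off the kfed dump of block 0 above; they are assumptions
# drawn from this single dump, not a documented on-disk format.
AIX_MAGIC = bytes.fromhex("c9c2d4c1")   # first four bytes of the block
PVID_OFFSET = 0x80                       # where 00F9733D 67553E0A appears
PVID_LENGTH = 8

def pvid_in_block(block0):
    """Return the PVID as a hex string if block 0 looks PVID-stamped, else None."""
    if not block0.startswith(AIX_MAGIC):
        return None
    pvid = block0[PVID_OFFSET:PVID_OFFSET + PVID_LENGTH]
    return pvid.hex() if any(pvid) else None

def lspv_pvids(lspv_output):
    """Map disk name to PVID for every lspv row that carries a PVID."""
    result = {}
    for line in lspv_output.splitlines():
        fields = line.split()
        if len(fields) >= 2 and fields[1].lower() != "none":
            result[fields[0]] = fields[1].lower()
    return result

# Synthetic block 0 rebuilt from the hex dump (all other bytes are zero).
block = bytearray(4096)
block[0:4] = AIX_MAGIC
block[PVID_OFFSET:PVID_OFFSET + PVID_LENGTH] = bytes.fromhex("00f9733d67553e0a")

lspv_sample = """\
hdisk3   00f9733e21b98434   appvg    active
hdisk4   00f9733d67553e0a   None
hdisk10  none               None
"""

found = pvid_in_block(bytes(block))
print(AIX_MAGIC.decode("cp037"), found, lspv_pvids(lspv_sample).get("hdisk4") == found)
```

If the value found in the block matches the lspv column for the same disk, the header was overwritten by a PVID assignment rather than by some other kind of corruption.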
This shows that the ASM disk headers were corrupted by PVIDs being written to the disks: hdisk4 through hdisk9 all carry a PVID, belong to no volume group, and are owned by grid:asmadmin. Next, check the ASM log to confirm which disks are used as ASM disks.
SQL> CREATE DISKGROUP CRS NORMAL REDUNDANCY DISK '/dev/rhdisk4',
'/dev/rhdisk5',
'/dev/rhdisk6' ATTRIBUTE 'compatible.asm' = '11.2.0.0.0', 'au_size' = '1M' /* ASMCA */
NOTE: Assigning number (1,0) to disk (/dev/rhdisk4)
NOTE: Assigning number (1,1) to disk (/dev/rhdisk5)
NOTE: Assigning number (1,2) to disk (/dev/rhdisk6)
NOTE: initializing header on grp 1 disk CRS_0000
NOTE: initializing header on grp 1 disk CRS_0001
NOTE: initializing header on grp 1 disk CRS_0002
SQL> CREATE DISKGROUP DATA EXTERNAL REDUNDANCY DISK
'/dev/rhdisk9' SIZE 614400M ATTRIBUTE 'compatible.asm' = '11.2.0.0.0', 'au_size' = '1M' /* ASMCA */
NOTE: Assigning number (2,0) to disk (/dev/rhdisk9)
NOTE: initializing header on grp 2 disk DATA_0000
SQL> CREATE DISKGROUP FBA EXTERNAL REDUNDANCY DISK
'/dev/rhdisk8' SIZE 204800M ATTRIBUTE 'compatible.asm' = '11.2.0.0.0', 'au_size' = '1M' /* ASMCA */
NOTE: Assigning number (3,0) to disk (/dev/rhdisk8)
NOTE: initializing header on grp 3 disk FBA_0000
SQL> CREATE DISKGROUP ARCH EXTERNAL REDUNDANCY DISK
'/dev/rhdisk7' SIZE 102400M ATTRIBUTE 'compatible.asm' = '11.2.0.0.0', 'au_size' = '1M' /* ASMCA */
NOTE: Assigning number (4,0) to disk (/dev/rhdisk7)
NOTE: initializing header on grp 4 disk ARCH_0000
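The same information can be collected from the ASM alert log mechanically. This is a rough sketch: the helper name `asm_disks` and the pattern are assumptions based on the "Assigning number" notes in the excerpt above, and the log layout may differ in other versions.

```python
import re

# Pattern assumed from the "Assigning number (group,disk) to disk (path)"
# notes in the excerpt above.
ASSIGN_RE = re.compile(
    r"assigning number \((\d+),(\d+)\) to disk \(([^)]+)\)", re.IGNORECASE
)

def asm_disks(alert_log_text):
    """Return sorted (group_number, disk_number, path) tuples found in the log."""
    return sorted(
        (int(group), int(disk), path.strip())
        for group, disk, path in ASSIGN_RE.findall(alert_log_text)
    )

sample_log = """\
NOTE: Assigning number (1,0) to disk (/dev/rhdisk4)
NOTE: Assigning number (1,1) to disk (/dev/rhdisk5)
NOTE: initializing header on grp 1 disk CRS_0000
NOTE: Assigning number (2,0) to disk (/dev/rhdisk9)
"""

for group, disk, path in asm_disks(sample_log):
    print(group, disk, path)
```

The group numbers in the log are those assigned at creation time; ASM can assign different numbers on a later mount, so only the device paths should be relied on.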
This confirms that the ASM disks are rhdisk[4-9]. kfed shows the same problem on all of them as on rhdisk4, which also matches the lspv output. Use kfed repair to repair the ASM disk headers, then mount the disk groups.
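The repair is possible because an intact copy of the disk header was found at blkn=510 in the kfed output earlier. As far as I know, that copy sits in the second-to-last block of allocation unit 1, which for the 1M AU and 4096-byte blocks used here gives block 510. The small calculation below is only an illustration of that arithmetic; the function name is hypothetical and the layout rule is stated as an assumption, not taken from this article.

```python
def backup_header_block(au_size=1024 * 1024, block_size=4096):
    """Block number of the header copy, assuming it is the second-to-last
    block of allocation unit 1 (allocation units are numbered from 0)."""
    blocks_per_au = au_size // block_size
    return 2 * blocks_per_au - 2

print(backup_header_block())                   # 1M AU, as in this cluster
print(backup_header_block(4 * 1024 * 1024))    # what a 4M AU would give
```

Before repairing, it is worth reading that block with kfed on every affected disk, as was done for rhdisk4 above, to confirm it is of type KFBTYP_DISKHEAD.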
SQL> alter diskgroup data mount;
Diskgroup altered.
SQL> alter diskgroup fba mount;
Diskgroup altered.
SQL> alter diskgroup arch mount;
Diskgroup altered.
SQL> alter diskgroup crs mount;
Diskgroup altered.
SQL> select group_number,disk_number,path from v$asm_disk;
GROUP_NUMBER DISK_NUMBER PATH
------------ ----------- --------------------------------------------------
           2           0 /dev/rhdisk4
           2           1 /dev/rhdisk5
           2           2 /dev/rhdisk6
           1           0 /dev/rhdisk7
           4           0 /dev/rhdisk8
           3           0 /dev/rhdisk9
6 rows selected.
SQL> select group_number,name from v$asm_diskgroup;
GROUP_NUMBER NAME
------------ ------------------------------------------------------------
           1 ARCH
           2 CRS
           3 DATA
           4 FBA
This shows that all ASM disk groups mounted successfully. With the disk headers repaired by kfed, GI is back to normal.
[db2/#] crsctl status res -t
--------------------------------------------------------------------------------
NAME           TARGET  STATE        SERVER                   STATE_DETAILS
--------------------------------------------------------------------------------
Local Resources
--------------------------------------------------------------------------------
ora.ARCH.dg
               ONLINE  ONLINE       db1
               ONLINE  ONLINE       db2
ora.CRS.dg
               ONLINE  ONLINE       db1
               ONLINE  ONLINE       db2
ora.DATA.dg
               ONLINE  ONLINE       db1
               ONLINE  ONLINE       db2
ora.FBA.dg
               ONLINE  ONLINE       db1
               ONLINE  ONLINE       db2
ora.LISTENER.lsnr
               ONLINE  ONLINE       db1
               ONLINE  ONLINE       db2
ora.asm
               ONLINE  ONLINE       db1                      Started
               ONLINE  ONLINE       db2                      Started
ora.gsd
               OFFLINE OFFLINE      db1
               OFFLINE OFFLINE      db2
ora.net1.network
               ONLINE  ONLINE       db1
               ONLINE  ONLINE       db2
ora.ons
               ONLINE  ONLINE       db1
               ONLINE  ONLINE       db2
ora.registry.acfs
               ONLINE  ONLINE       db1
               ONLINE  ONLINE       db2
--------------------------------------------------------------------------------
Cluster Resources
--------------------------------------------------------------------------------
ora.LISTENER_SCAN1.lsnr
      1        ONLINE  ONLINE       db1
ora.cvu
      1        ONLINE  ONLINE       db1
ora.db1.vip
      1        ONLINE  ONLINE       db1
ora.db2.vip
      1        ONLINE  ONLINE       db2
ora.nkora.db
      1        ONLINE  ONLINE       db1                      Open
      2        ONLINE  ONLINE       db2                      Open
ora.oc4j
      1        ONLINE  ONLINE       db1
ora.scan1.vip
      1        ONLINE  ONLINE       db1
One problem remains: the disk headers have been repaired, but the PVIDs were never cleared and are still recorded in the ODM.
[db2/dev#] lspv
hdisk0   00f9733ef7cf27e9   rootvg   active
hdisk1   00f9733e21b953e6   rootvg   active
hdisk2   00f9733e21b97a83   appvg    active
hdisk3   00f9733e21b98434   appvg    active
hdisk4   00f9733d67553e0a   None
hdisk5   00f9733d67553f31   None
hdisk6   00f9733d67554011   None
hdisk7   00f9733d67554165   None
hdisk8   00f9733d675541e5   None
hdisk9   00f9733d675542e4   None
hdisk10  none               None
Analysis showed that the FBA disk group held no data, so that disk group was used to test clearing the PVID directly.
$ sqlplus / as sysasm
SQL*Plus: Release 11.2.0.4.0 Production on Sun Dec 21 03:13:31 2014
Copyright (c) 1982, 2013, Oracle. All rights reserved.
Connected to:
Oracle Database 11g Enterprise Edition Release 11.2.0.4.0 - 64bit Production
With the Real Application Clusters and Automatic Storage Management options
SQL> alter diskgroup fba dismount;
Diskgroup altered.
SQL> exit
Disconnected from Oracle Database 11g Enterprise Edition Release 11.2.0.4.0 - 64bit Production
With the Real Application Clusters and Automatic Storage Management options
$ exit
You have mail in /usr/spool/mail/root
[db2/#] chdev -l hdisk8 -a pv=clear
hdisk8 changed
[db2/#] lspv
hdisk0   00f9733ef7cf27e9   rootvg   active
hdisk1   00f9733e21b953e6   rootvg   active
hdisk2   00f9733e21b97a83   appvg    active
hdisk3   00f9733e21b98434   appvg    active
hdisk4   00f9733d67553e0a   None
hdisk5   00f9733d67553f31   None
hdisk6   00f9733d67554011   None
hdisk7   00f9733d67554165   None
hdisk8   none               None
hdisk9   00f9733d675542e4   None
hdisk10  none               None
[db2/#] su - grid
$ sqlplus / as sysasm
SQL*Plus: Release 11.2.0.4.0 Production on Sun Dec 21 03:15:19 2014
Copyright (c) 1982, 2013, Oracle. All rights reserved.
Connected to:
Oracle Database 11g Enterprise Edition Release 11.2.0.4.0 - 64bit Production
With the Real Application Clusters and Automatic Storage Management options
SQL> alter diskgroup fba mount;
Diskgroup altered.
SQL> exit
Disconnected from Oracle Database 11g Enterprise Edition Release 11.2.0.4.0 - 64bit Production
With the Real Application Clusters and Automatic Storage Management options
The test shows that the ASM disk header still works after the PVID is cleared. GI was then shut down, chdev was used to clear the PVIDs on all of hdisk[4-9], and GI started normally afterwards.
[db1/#] crsctl status res -t
--------------------------------------------------------------------------------
NAME           TARGET  STATE        SERVER                   STATE_DETAILS
--------------------------------------------------------------------------------
Local Resources
--------------------------------------------------------------------------------
ora.ARCH.dg
               ONLINE  ONLINE       db1
               ONLINE  ONLINE       db2
ora.CRS.dg
               ONLINE  ONLINE       db1
               ONLINE  ONLINE       db2
ora.DATA.dg
               ONLINE  ONLINE       db1
               ONLINE  ONLINE       db2
ora.FBA.dg
               ONLINE  ONLINE       db1
               ONLINE  ONLINE       db2
ora.LISTENER.lsnr
               ONLINE  ONLINE       db1
               ONLINE  ONLINE       db2
ora.asm
               ONLINE  ONLINE       db1                      Started
               ONLINE  ONLINE       db2                      Started
ora.gsd
               OFFLINE OFFLINE      db1
               OFFLINE OFFLINE      db2
ora.net1.network
               ONLINE  ONLINE       db1
               ONLINE  ONLINE       db2
ora.ons
               ONLINE  ONLINE       db1
               ONLINE  ONLINE       db2
ora.registry.acfs
               ONLINE  ONLINE       db1
               ONLINE  ONLINE       db2
--------------------------------------------------------------------------------
Cluster Resources
--------------------------------------------------------------------------------
ora.LISTENER_SCAN1.lsnr
      1        ONLINE  ONLINE       db1
ora.cvu
      1        ONLINE  ONLINE       db1
ora.db1.vip
      1        ONLINE  ONLINE       db1
ora.db2.vip
      1        ONLINE  ONLINE       db2
ora.nkora.db
      1        ONLINE  ONLINE       db1                      Open
      2        ONLINE  ONLINE       db2                      Open
ora.oc4j
      1        ONLINE  ONLINE       db1
ora.scan1.vip
      1        ONLINE  ONLINE       db1
[db1/#] lspv
hdisk0   00f9733df7c7a9db   rootvg   active
hdisk1   00f9733d21dad8fe   rootvg   active
hdisk2   00f9733d21dbd08b   appvg    active
hdisk3   00f9733d21dbd2ab   appvg    active
hdisk4   none               None
hdisk5   none               None
hdisk6   none               None
hdisk7   none               None
hdisk8   none               None
hdisk9   none               None
hdisk10  none               None
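The cleanup over hdisk[4-9] can be scripted so that only ASM disks that still show a PVID are touched. The sketch below only prints the chdev commands (a dry run) instead of executing them; the function name and the sample input are hypothetical, and the command form is the one used for hdisk8 in the test above. As in the procedure described here, such commands would be run as root with GI stopped.

```python
def clear_pvid_commands(lspv_output, asm_disk_names):
    """Return chdev commands for ASM disks that still carry a PVID in lspv output."""
    commands = []
    for line in lspv_output.splitlines():
        fields = line.split()
        if len(fields) < 2:
            continue
        disk, pvid = fields[0], fields[1]
        if disk in asm_disk_names and pvid.lower() != "none":
            commands.append(f"chdev -l {disk} -a pv=clear")
    return commands

# ASM disks as identified from the ASM log: hdisk4 through hdisk9.
asm_disk_names = {f"hdisk{n}" for n in range(4, 10)}

lspv_before = """\
hdisk3   00f9733e21b98434   appvg    active
hdisk4   00f9733d67553e0a   None
hdisk8   none               None
hdisk9   00f9733d675542e4   None
"""

for command in clear_pvid_commands(lspv_before, asm_disk_names):
    print(command)
```

Disks in a volume group (hdisk3 here) and disks already cleared (hdisk8) are skipped, so the output can be reviewed before anything is run.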
This completes the recovery from the ASM disk header corruption caused by setting PVIDs, with zero data loss.
Reminder: never set a PVID on a disk used by ASM on AIX. Doing so damages the ASM disk header, and the disk group can no longer be mounted.
Original: http://www.xifenfei.com/5686.html