These days on business trip to Gansu, a customer cluster a little problem. The CRS detection of the nodes found a bit unusual, but it was strange that the business was not affected. The following are the results of my tests:
[Grid@tysy1 bin]$ Ocrcheck
prot-602:failed to retrieve data from the cluster registry
Proc-26:error while accessing the physical storage
[grid@tysy1 bin]$ crsctl Query css Votedisk
# # state File Universal Id file Name Disk Group
-- ----- ----------------- --------- ---------
1. ONLINE 6fe50eed90fe4f44bf68c2da8459bfc1 (ORCL:OCR_VOTE01) [CRS]
2. ONLINE 7afb21a9c8c94f27bf6aecb55d4e6e58 (ORCL:OCR_VOTE02) [CRS]
3. ONLINE 77c7c00e799a4f47bf41316f27a1d95c (ORCL:OCR_VOTE03) [CRS]
Located 3 voting disk (s).
I detected that CRSCT status res-t-init execution OK, execution crsctl status res-t error. Because the customer is using asmlib, I check that the asmlib disk is normal, as follows:
[Grid@tysy1 bin]$/etc/init.d/oracleasm Listdisks
Archvol
DATAVOL01
DATAVOL02
DATAVOL03
DATAVOL04
Fravol
Ocr_vote01
Ocr_vote02
Ocr_vote03
[Root@tysy1 disks]#/etc/init.d/oracleasm querydisk-d ocr_vote01
Disk "OCR_VOTE01" is a valid ASM disk on device [8,21]
[Root@tysy1 disks]#/etc/init.d/oracleasm querydisk-d ocr_vote02
Disk "OCR_VOTE02" is a valid ASM disk on device [8,22]
[Root@tysy1 disks]#/etc/init.d/oracleasm querydisk-d ocr_vote03
Disk "Ocr_vote03" is a valid ASM disk on device [8,23]
[Root@tysy1 disks]# ls-ltr/dev/sdb*
BRW-RW----1 root disk 8, Sep 1 20:23/dev/sdb
BRW-RW----1 root disk 8, Sep 1 20:23/dev/sdb1
BRW-RW----1 root disk 8, Sep 1 20:23/DEV/SDB5
BRW-RW----1 root disk 8, Sep 1 20:23/DEV/SDB6
BRW-RW----1 root disk 8, Sep 1 20:23/dev/sdb7
BRW-RW----1 root disk 8, Sep 1 20:23/DEV/SDB8
BRW-RW----1 root disk 8, Sep 1 20:23/DEV/SDB9
BRW-RW----1 root disk 8, Sep 1 20:23/DEV/SDB10
BRW-RW----1 root disk 8, Sep 1 20:23/DEV/SDB11
BRW-RW----1 root disk 8, Sep 1 20:23/dev/sdb12
Check that the status of the Asmlib is OK, and that the CRS Disk group is not recognized in the view of ASM disks at Node 1:
[Grid@tysy1 bin]$
NAME PATH State group_number disk_number mount_s Header_statu mode_st LABEL
---------- ----------------- -------- ------------ ----------- ------- ------------ ------- ------------
ORCL:OCR_VOTE01 NORMAL 0 0 CLOSED member ONLINE ocr_vote01
ORCL:OCR_VOTE02 NORMAL 0 1 CLOSED member ONLINE OCR_VOTE02
ORCL:OCR_VOTE03 NORMAL 0 2 CLOSED member ONLINE OCR_VOTE03
Archvol Orcl:archvol NORMAL 1 0 CACHED member ONLINE archvol
DATAVOL01 orcl:datavol01 NORMAL 5 0 CACHED member ONLINE DATAVOL01
DATAVOL02 orcl:datavol02 NORMAL 5 1 CACHED member ONLINE DATAVOL02
DATAVOL03 orcl:datavol03 NORMAL 6 0 CACHED member ONLINE DATAVOL03
DATAVOL04 orcl:datavol04 NORMAL 6 1 CACHED member ONLINE DATAVOL04
Fravol Orcl:fravol NORMAL 7 0 CACHED member ONLINE
Fravol
Viewing the CRS log also reveals that Oracle considers the CRS disk inaccessible:
015-09-02 14:06:19.302: [Ocrcheck][1069037312]ocrcheck starts ...
2015-09-02 14:06:19.369: [Ocrasm][1069037312]proprasmo:kgfocheckmount return [6]. Cannot proceed with dirty open.
2015-09-02 14:06:19.369: [Ocrasm][1069037312]proprasmo:error in Open/create file in DG [CRS]
[Ocrasm] [1069037312] Slos:slos:cat=6, Opn=kgfo, Dep=0, loc=kgfockmt03
2015-09-02 14:06:19.369: [Ocrasm][1069037312]asm Error Stack:
2015-09-02 14:06:19.391: [Ocrasm][1069037312]proprasmo:kgfocheckmount returned [6]
2015-09-02 14:06:19.391: [ocrasm][1069037312]proprasmo:the ASM Disk group CRS isn't found or not mounted
2015-09-02 14:06:19.392: [ocrraw][1069037312]proprioo:failed to open [+crs]. Returned Proprasmo () with [26]. Marking location as unavailable.
2015-09-02 14:06:19.392: [ocrraw][1069037312]proprioo:no OCR/OLR devices are usable
2015-09-02 14:06:19.392: [Ocrasm][1069037312]proprasmcl:asmhandle is NULL
2015-09-02 14:06:19.392: [ocrraw][1069037312]proprinit:could not open raw device
2015-09-02 14:06:19.392: [Ocrasm][1069037312]proprasmcl:asmhandle is NULL
2015-09-02 14:06:19.392: [default][1069037312]a_init:7!: Backend init unsuccessful: [26]
2015-09-02 14:06:19.392: [ocrcheck][1069037312]initreboot:failed to initialize OCR at REBOOT level. RETVAL:[26] Error:[proc-26:error while accessing the physical storage
]
[Ocrcheck] [1069037312]initreboot:attempting to initialize OCR at DEFAULT level and update a key so, vote information is updated .
2015-09-02 14:06:19.393: [Ocrmsg][1069037312]prom_waitconnect:conn not established (0,29,1,2)
2015-09-02 14:06:19.393: [OCRMSG][1069037312]GIPC error [+] msg [gipcretconnectionrefused]
2015-09-02 14:06:19.393: [Ocrmsg][1069037312]prom_connect:error while waiting for connection complete [24]
2015-09-02 14:06:19.394: [ocrcheck][1069037312]initreboot:failed to initialize OCR at DEFAULT level. RETVAL:[32] Error:[proc-32:cluster Ready Services on the local node isn't running messaging Error [gipcretconnectionrefu SED] [29]]
2015-09-02 14:06:19.394: [ocrcheck][1069037312]failed to access OCR repository. Retval [[] Error [proc-26:error while accessing the physical storage
]
2015-09-02 14:06:19.394: [ocrcheck][1069037312]failed to initialize OCRCHEK2
2015-09-02 14:06:19.394: [ocrcheck][1069037312]exiting [status=failed] ...
There is absolutely no train of thought here, so let's take a closer look at the ASM alert log as follows:
Tue Sep 01 23:07:03 2015
Note:process _B000_+ASM1 (19088) initiating offline of disk 0.3915955003 (OCR_VOTE01) with mask 0x7e in Group 4
Note:process _B000_+ASM1 (19088) initiating offline of disk 1.3915955004 (OCR_VOTE02) with mask 0x7e in Group 4
Note:process _B000_+ASM1 (19088) initiating offline of disk 2.3915955005 (OCR_VOTE03) with mask 0x7e in Group 4
Note:checking PST:GRP = 4
Gmon checking disk modes for Group 4 at a for PID Osid 19088
Error:no read Quorum in group:required 2, found 0 disks
Note:checking PST for GRP 4 is done.
note:initiating PST update:grp = 4, DSK = 0/0xe968bb3b, mask = 0x6a, op = clear
note:initiating PST update:grp = 4, DSK = 1/0xe968bb3c, mask = 0x6a, op = clear
note:initiating PST update:grp = 4, DSK = 2/0xe968bb3d, mask = 0x6a, op = clear
Gmon Updating disk modes for Group 4 at a for PID Osid 19088
Error:no read Quorum in group:required 2, found 0 disks
Tue Sep 01 23:07:03 2015
Note:cache dismounting (not clean) group 4/0XD7184BC4 (CRS)
Warning:offline for disk Ocr_vote01 in mode 0x7f failed.
Warning:offline for disk Ocr_vote02 in mode 0x7f failed.
Warning:offline for disk Ocr_vote03 in mode 0x7f failed.
Note:messaging CKPT to Quiesce pins Unix process pid:19090, image:oracle@tysy1 (B001)
Tue Sep 01 23:07:03 2015
Note:halting all I/Os to DiskGroup 4 (CRS)
Tue Sep 01 23:07:03 2015
NOTE:LGWR doing Non-clean dismount of Group 4 (CRS)
NOTE:LGWR sync aba=15.189 last written ABA 15.189
Tue Sep 01 23:07:03 2015
Kjbdomdet Send To Inst 2
Detach from Dom 4, sending detach message to Inst 2
Tue Sep 01 23:07:03 2015
List of instances:
1 2
Dirty Detach reconfiguration started (New Ddet Inc 1, Cluster Inc 4)
Global Resource Directory partially frozen for dirty detach
* Dirty Detach-domain 4 Invalid = TRUE
0 GCS Resources traversed, 0 cancelled
Dirty Detach Reconfiguration Complete
Tue Sep 01 23:07:03 2015
Warning:dirty detached from Domain 4
Note:cache dismounted Group 4/0XD7184BC4 (CRS)
Sql> Alter DiskGroup CRS dismount Force/ASM server:3608693700 * *
Tue Sep 01 23:07:03 2015
Note:cache deleting context for group CRS 4/0XD7184BC4
Gmon dismounting Group 4 at the for PID Osid 19090
Note:disk ocr_vote01 in mode 0x7f marked for de-assignment
Note:disk ocr_vote02 in mode 0x7f marked for de-assignment
Note:disk ocr_vote03 in mode 0x7f marked for de-assignment
Note:waiting for all pending writes to complete before De-registering:grpnum 4
Tue Sep 01 23:07:18 2015
ASM Health Checker found 1 new failures
Tue Sep 01 23:07:33 2015
Success:diskgroup CRS was dismounted
Success:alter diskgroup CRS Dismount Force/ASM server:3608693700 * *
Tue Sep 01 23:07:33 2015
Note:diskgroup resource Ora. CRS.DG is offline
Success:asm-initiated mandatory dismount of group CRS
Tue Sep 01 23:07:33 2015
Errors in FILE/U01/APP/GRID/DIAG/ASM/+ASM/+ASM1/TRACE/+ASM1_ORA_4208.TRC:
Ora-15078:asm DiskGroup was forcibly dismounted
......
......
Errors in FILE/U01/APP/GRID/DIAG/ASM/+ASM/+ASM1/TRACE/+ASM1_ORA_4208.TRC:
Ora-15078:asm DiskGroup was forcibly dismounted
Wed Sep 02 00:40:16 2015
Errors in FILE/U01/APP/GRID/DIAG/ASM/+ASM/+ASM1/TRACE/+ASM1_ORA_4208.TRC:
Ora-15078:asm DiskGroup was forcibly dismounted
Warning:requested Mirror Side 1 of virtual extent 4 logical Extent 0 offset 294912 is not allocated; I/O request failed
Warning:requested Mirror Side 2 of virtual extent 4 logical extent 1 offset 294912 is not allocated; I/O request failed
Errors in FILE/U01/APP/GRID/DIAG/ASM/+ASM/+ASM1/TRACE/+ASM1_ORA_4208.TRC:
Ora-15078:asm DiskGroup was forcibly dismounted
Ora-15078:asm DiskGroup was forcibly dismounted
Wed Sep 02 00:40:16 2015
sql> alter DiskGroup CRS Check/* Proxy *
Ora-15032:not all alterations performed
Ora-15001:diskgroup "CRS" does not exist or are not mounted
Error:alter diskgroup CRS Check/* Proxy *
Wed Sep 02 00:41:33 2015
Note:client exited [4197]
Wed Sep 02 00:41:34 2015
Note: [Crsd.bin@tysy1 (TNS v1-v3) 13821] opening OCR file
Wed Sep 02 00:41:36 2015
Note: [Crsd.bin@tysy1 (TNS v1-v3) 13834] opening OCR file
......
......
Wed Sep 02 14:06:08 2015
Note: [Ocrcheck.bin@tysy1 (TNS v1-v3) 16749] opening OCR file
Wed Sep 02 14:06:19 2015
Note: [Ocrcheck.bin@tysy1 (TNS v1-v3) 16815] opening OCR file
Wed Sep 02 14:13:58 2015
Note: [Ocrcheck.bin@tysy1 (TNS v1-v3) 18667] opening OCR file
Wed Sep 02 14:17:52 2015
Note: [Ocrcheck.bin@tysy1 (TNS v1-v3) 19633] opening OCR file
We can see that ASM made a check, found failure (ASM Health Checker found 1 new failures), and then forced the CRS disk group to dismount. We will further analyze the contents of the following trace file:
[Grid@tysy1 client]$ CAT/U01/APP/GRID/DIAG/ASM/+ASM/+ASM1/TRACE/+ASM1_ORA_4208.TRC
Trace FILE/U01/APP/GRID/DIAG/ASM/+ASM/+ASM1/TRACE/+ASM1_ORA_4208.TRC
Oracle Database 11g Enterprise Edition release 11.2.0.4.0-64bit Production
With the real application clusters and Automatic Storage Management options
Oracle_home =/u01/app/11.2.0/grid
System Name:linux
Node name:tysy1
Release:2.6.32-358.el6.x86_64
Version: #1 SMP Tue 11:47:41 EST 2013
Machine:x86_64
VM Name:vmware Version:6
Instance Name: +ASM1
Redo thread mounted by this instance:0 <none>
Oracle Process number:26
Unix process pid:4208, image:oracle@tysy1 (TNS v1-v3)
2015-09-01 23:07:33.936
Session ID: (289.3) 2015-09-01 23:07:33.936
CLIENT ID: () 2015-09-01 23:07:33.936
SERVICE NAME: () 2015-09-01 23:07:33.936
MODULE NAME: (crsd.bin@tysy1 (TNS v1-v3)) 2015-09-01 23:07:33.936
ACTION NAME: () 2015-09-01 23:07:33.936
warning:failed Xlate 1
Ora-15078:asm DiskGroup was forcibly dismounted
......
......
warning:failed Xlate 1
Ora-15078:asm DiskGroup was forcibly dismounted
2015-09-02 00:40:16.217
warning:failed Xlate 1
Ora-15078:asm DiskGroup was forcibly dismounted
Ksfdrfms:mirror Read file=+crs.255.4294967295 fob=0x9f403518 bufp=0x7f48c1381a00 blkno=1096 nbytes=4096
warning:failed Xlate 1
Warning:requested Mirror Side 1 of virtual extent 4 logical Extent 0 offset 294912 is not allocated; I/O request failed
Ksfdrfms:read failed from mirror side=1 logical extent number=0 dskno=65535
warning:failed Xlate 1
Warning:requested Mirror Side 2 of virtual extent 4 logical extent 1 offset 294912 is not allocated; I/O request failed
Ksfdrfms:read failed from mirror side=2 logical extent Number=1 dskno=65535
Ora-15078:asm DiskGroup was forcibly dismounted
Ora-15078:asm DiskGroup was forcibly dismounted
Further I analyzed the contents of the call stack in the Gmon process trace file content found as follows:
---+asm1_gmon_4091.trc
2015-09-01 22:55:11.686
Error:gmon could not validate no PST hbeat (GRP 6) for validate
2015-09-01 22:55:11.686182:kfdphbeatblk_transfer:found pending hbeat I/O in grp=6 @
Hbeatcblk=0x7f2bf4f337f8 iop=0x7f2bf4f33808 iostate=3.
Kfdphbeatcb_dump:at 0x7f2bf4f337f8 with ts=09/01/2015 22:55:11 iop=0x7f2bf4f33808, grp=6, disk=0/3915955008, isWrite=0 Hbeat #0 state=3 iostate=3
-----Abridged call Stack Trace-----
Ksedsts () +465<-kfdphbeatblk_transfer () +953<-kfdphbeat_docheck () +662<-kfdphbeat_timeout () +854<-KFDP _TIMEOUTBG () +1642<-ksbcti () +513<-ksbabs () +1735<-ksbrdp () +1045<-opirip () () +623<-opidrv Sou2o () +103<-opimai_real () +250<-ssthrdmain () +265<-main () +201<-__libc_start_main () +253
-----End of Abridged call Stack Trace-----
warning:waited secs for write IO to PST Disk 0 in group 1.
2015-09-01 23:07:03.571
warning:waited secs for write IO to PST Disk 0 in group 1.
warning:waited secs for write IO to PST Disk 0 in Group 4.
warning:waited secs for write IO to PST Disk 1 in Group 4.
warning:waited secs for write IO to PST Disk 2 in Group 4.
Note:set to IS offline flag for disk ocr_vote01 only Locally:flag 0x3211
Note:set to IS offline flag for disk OCR_VOTE02 only Locally:flag 0x3211
Note:set to IS offline flag for disk ocr_vote03 only Locally:flag 0x3211
warning:waited secs for write IO to PST Disk 0 in Group 4.
warning:waited secs for write IO to PST Disk 1 in Group 4.
warning:waited secs for write IO to PST Disk 2 in Group 4.
warning:waited secs for write IO to PST Disk 0 in Group 5.
warning:waited secs for write IO to PST Disk 0 in Group 5.
warning:waited secs for write IO to PST Disk 0 in group 6.
warning:waited secs for write IO to PST Disk 0 in group 6.
warning:waited secs for write IO to PST Disk 0 in group 7.
warning:waited secs for write IO to PST Disk 0 in group 7.
-----Abridged call Stack Trace-----
Ksedsts () +465<-kfdpgc_dotobeoflnasync () +37<-KFDPGC_CHECKTOBEOFLN () +623<-kfdpgc_timeout () +17<-kfdp_ TIMEOUTBG () +1657<-ksbcti () +513<-ksbabs () +1735<-ksbrdp () +1045<-opirip () () +623<-opidrv Sou2o () +103<-opimai_real () +250<-ssthrdmain () +265<-main () +201<-__libc_start_main () +253
。。。。。。
。。。。。。
---------------Hbeat------------------
Kfdphbeat_dump:state=1, Inst=1, ts=33023031.450354176,
rnd=1036706912.2853146286.3238582348.2977351792.
KFK Io-queue: (nil)
Kfdphbeatcb_dump:at 0x7f2bf4d28248 with ts=09/01/2015 23:06:47 iop=0x7f2bf4d28258, grp=4, disk=0/3915955003, isWrite=1 Hbeat #17 state=4 iostate=3
Kfdphbeatcb_dump:at 0x7f2bf4d28080 with ts=09/01/2015 23:06:47 iop=0x7f2bf4d28090, grp=4, disk=1/3915955004, isWrite=1 Hbeat #17 state=4 iostate=3
Kfdphbeatcb_dump:at 0x7f2bf4d27eb8 with ts=09/01/2015 23:06:47 Iop=0x7f2bf4d27ec8, grp=4, disk=2/3915955005, isWrite=1 Hbeat #17 state=4 iostate=3
NOTE:kfdp_updateInt:forceDismount GRP 4
NOTE:GMON:failed to update modes:triggering Force dismount of Group 4
2015-09-01 23:07:03.625783:kfdphbeatblk_transfer:found pending hbeat I/O in grp=4 @
hbeatcblk=0x7f2bf4d28248 iop=0x7f2bf4d28258 iostate=3.
Kfdphbeatcb_dump:at 0x7f2bf4d28248 with ts=09/01/2015 23:06:47 iop=0x7f2bf4d28258, grp=4, disk=0/3915955003, isWrite=1 Hbeat #17 state=4 iostate=3
2015-09-01 23:07:03.625823:kfdphbeatblk_transfer:found pending hbeat I/O in grp=4 @
hbeatcblk=0x7f2bf4d28080 iop=0x7f2bf4d28090 iostate=3.
Kfdphbeatcb_dump:at 0x7f2bf4d28080 with ts=09/01/2015 23:06:47 iop=0x7f2bf4d28090, grp=4, disk=1/3915955004, isWrite=1 Hbeat #17 state=4 iostate=3
2015-09-01 23:07:03.625844:kfdphbeatblk_transfer:found pending hbeat I/O in grp=4 @
Hbeatcblk=0x7f2bf4d27eb8 Iop=0x7f2bf4d27ec8 iostate=3.
Kfdphbeatcb_dump:at 0x7f2bf4d27eb8 with ts=09/01/2015 23:06:47 Iop=0x7f2bf4d27ec8, grp=4, disk=2/3915955005, isWrite=1 Hbeat #17 state=4 iostate=3
-----Abridged call Stack Trace-----
Ksedsts () +465<-kfdphbeatblk_transfer () +953<-kfdp_forcedismountasyncgmon () +619<-kfdp_updateint () +664 <-KFDP_UPDATEDSKBG () () +814<-ksbabs () +771<-ksbrdp () +1045<-opirip () +623<-opidrv () +603<-sou2o () + 103<-opimai_real () +250<-ssthrdmain () +265<-main () +201<-__libc_start_main () +253
-----End of Abridged call Stack Trace-----
2015-09-01 23:07:03.906
Gmon dismounting Group 4 at the for PID Osid 19090
NOTE:kfdp_doDismount:dismount GRP 4
2015-09-01 23:07:06.609
NOTE:KFDPUTIL_FREESLMSG:KSVTST for grp:4 KSV status 0x10
2015-09-01 23:07:36.732
NOTE:KFDPUTIL_FREESLMSG:KSVTST for grp:4 KSV status 0x10
Error:gmon could not validate no PST hbeat (GRP 1) for validate
2015-09-01 23:11:03.981
2015-09-01 23:11:03.981362:kfdphbeatblk_transfer:found pending hbeat I/O in grp=1 @
HBEATCBLK=0X7F2BF4F32F10 Iop=0x7f2bf4f32f20 iostate=3.
Kfdphbeatcb_dump:at 0x7f2bf4f32f10 with ts=09/01/2015 23:11:03 Iop=0x7f2bf4f32f20, Grp=1, disk=0/3915955002, isWrite=0 Hbeat #0 state=3 iostate=3
-----Abridged call Stack Trace-----
Ksedsts () +465<-kfdphbeatblk_transfer () +953<-kfdphbeat_docheck () +662<-kfdphbeat_timeout () +854<-KFDP _TIMEOUTBG () +1642<-ksbcti () +513<-ksbabs () +1735<-ksbrdp () +1045<-opirip () () +623<-opidrv Sou2o () +103<-opimai_real () +250<-ssthrdmain () +265<-main () +201<-__libc_start_main () +253
-----End of Abridged call Stack Trace-----
Based on the above, the analysis found exactly the same as Oracle Bug 18694414. The following is a description of the call stack for the bug:
Ksedsts () +465<-kfdphbeatblk_transfer () +953<-kfdphbeat_docheck () +662<-kfdphbeat
@ _timeout () +854<-KFDP_TIMEOUTBG () +1642<-ksbcti () +513<-ksbabs () +1735<-ksbrdp () +1
@ 045<-opirip () +623<-opidrv () +603<-sou2o () +103<-opimai_real () +250<-ssthrdmain () +
@ 265<-main () +201<-__libc_start_main () +253
According to the Oracle MOS documentation, it is recommended that you adjust the _asm_hbeatiowait parameter to 185, because the Linux environment disk defaults to timeout 180, as follows:
[root@tysy1 ~]# cat/sys/block/sdb/device/timeout
180
[root@tysy1 ~]#