Oracle database 11.2.0.4 RAC CRS diskgroup Auto Dismount problem

Source: Internet
Author: User
Tags session id stack trace oracle database

These days on business trip to Gansu, a customer cluster a little problem. The CRS detection of the nodes found a bit unusual, but it was strange that the business was not affected. The following are the results of my tests:


[Grid@tysy1 bin]$ Ocrcheck
prot-602:failed to retrieve data from the cluster registry
Proc-26:error while accessing the physical storage

[grid@tysy1 bin]$ crsctl Query css Votedisk
# # state File Universal Id file Name Disk Group
--  -----    -----------------                --------- ---------
1. ONLINE 6fe50eed90fe4f44bf68c2da8459bfc1 (ORCL:OCR_VOTE01) [CRS]
2. ONLINE 7afb21a9c8c94f27bf6aecb55d4e6e58 (ORCL:OCR_VOTE02) [CRS]
3. ONLINE 77c7c00e799a4f47bf41316f27a1d95c (ORCL:OCR_VOTE03) [CRS]
Located 3 voting disk (s).

I detected that CRSCT status res-t-init execution OK, execution crsctl status res-t error. Because the customer is using asmlib, I check that the asmlib disk is normal, as follows:




[Grid@tysy1 bin]$/etc/init.d/oracleasm Listdisks


Archvol


DATAVOL01


DATAVOL02


DATAVOL03


DATAVOL04


Fravol


Ocr_vote01


Ocr_vote02


Ocr_vote03





[Root@tysy1 disks]#/etc/init.d/oracleasm querydisk-d ocr_vote01


Disk "OCR_VOTE01" is a valid ASM disk on device [8,21]


[Root@tysy1 disks]#/etc/init.d/oracleasm querydisk-d ocr_vote02


Disk "OCR_VOTE02" is a valid ASM disk on device [8,22]


[Root@tysy1 disks]#/etc/init.d/oracleasm querydisk-d ocr_vote03


Disk "Ocr_vote03" is a valid ASM disk on device [8,23]


[Root@tysy1 disks]# ls-ltr/dev/sdb*


BRW-RW----1 root disk 8, Sep 1 20:23/dev/sdb


BRW-RW----1 root disk 8, Sep 1 20:23/dev/sdb1


BRW-RW----1 root disk 8, Sep 1 20:23/DEV/SDB5


BRW-RW----1 root disk 8, Sep 1 20:23/DEV/SDB6


BRW-RW----1 root disk 8, Sep 1 20:23/dev/sdb7


BRW-RW----1 root disk 8, Sep 1 20:23/DEV/SDB8


BRW-RW----1 root disk 8, Sep 1 20:23/DEV/SDB9


BRW-RW----1 root disk 8, Sep 1 20:23/DEV/SDB10


BRW-RW----1 root disk 8, Sep 1 20:23/DEV/SDB11


BRW-RW----1 root disk 8, Sep 1 20:23/dev/sdb12

Check that the status of the Asmlib is OK, and that the CRS Disk group is not recognized in the view of ASM disks at Node 1:




[Grid@tysy1 bin]$





NAME PATH State group_number disk_number mount_s Header_statu mode_st LABEL


---------- ----------------- -------- ------------ ----------- ------- ------------ ------- ------------


ORCL:OCR_VOTE01 NORMAL 0 0 CLOSED member ONLINE ocr_vote01


ORCL:OCR_VOTE02 NORMAL 0 1 CLOSED member ONLINE OCR_VOTE02


ORCL:OCR_VOTE03 NORMAL 0 2 CLOSED member ONLINE OCR_VOTE03


Archvol Orcl:archvol NORMAL 1 0 CACHED member ONLINE archvol


DATAVOL01 orcl:datavol01 NORMAL 5 0 CACHED member ONLINE DATAVOL01


DATAVOL02 orcl:datavol02 NORMAL 5 1 CACHED member ONLINE DATAVOL02


DATAVOL03 orcl:datavol03 NORMAL 6 0 CACHED member ONLINE DATAVOL03


DATAVOL04 orcl:datavol04 NORMAL 6 1 CACHED member ONLINE DATAVOL04


Fravol Orcl:fravol NORMAL 7 0 CACHED member ONLINE


Fravol


Viewing the CRS log also reveals that Oracle considers the CRS disk inaccessible:




015-09-02 14:06:19.302: [Ocrcheck][1069037312]ocrcheck starts ...


2015-09-02 14:06:19.369: [Ocrasm][1069037312]proprasmo:kgfocheckmount return [6]. Cannot proceed with dirty open.


2015-09-02 14:06:19.369: [Ocrasm][1069037312]proprasmo:error in Open/create file in DG [CRS]


[Ocrasm] [1069037312] Slos:slos:cat=6, Opn=kgfo, Dep=0, loc=kgfockmt03





2015-09-02 14:06:19.369: [Ocrasm][1069037312]asm Error Stack:


2015-09-02 14:06:19.391: [Ocrasm][1069037312]proprasmo:kgfocheckmount returned [6]


2015-09-02 14:06:19.391: [ocrasm][1069037312]proprasmo:the ASM Disk group CRS isn't found or not mounted


2015-09-02 14:06:19.392: [ocrraw][1069037312]proprioo:failed to open [+crs]. Returned Proprasmo () with [26]. Marking location as unavailable.


2015-09-02 14:06:19.392: [ocrraw][1069037312]proprioo:no OCR/OLR devices are usable


2015-09-02 14:06:19.392: [Ocrasm][1069037312]proprasmcl:asmhandle is NULL


2015-09-02 14:06:19.392: [ocrraw][1069037312]proprinit:could not open raw device


2015-09-02 14:06:19.392: [Ocrasm][1069037312]proprasmcl:asmhandle is NULL


2015-09-02 14:06:19.392: [default][1069037312]a_init:7!: Backend init unsuccessful: [26]


2015-09-02 14:06:19.392: [ocrcheck][1069037312]initreboot:failed to initialize OCR at REBOOT level. RETVAL:[26] Error:[proc-26:error while accessing the physical storage


]


[Ocrcheck] [1069037312]initreboot:attempting to initialize OCR at DEFAULT level and update a key so, vote information is updated .


2015-09-02 14:06:19.393: [Ocrmsg][1069037312]prom_waitconnect:conn not established (0,29,1,2)


2015-09-02 14:06:19.393: [OCRMSG][1069037312]GIPC error [+] msg [gipcretconnectionrefused]


2015-09-02 14:06:19.393: [Ocrmsg][1069037312]prom_connect:error while waiting for connection complete [24]


2015-09-02 14:06:19.394: [ocrcheck][1069037312]initreboot:failed to initialize OCR at DEFAULT level. RETVAL:[32] Error:[proc-32:cluster Ready Services on the local node isn't running messaging Error [gipcretconnectionrefu SED] [29]]


2015-09-02 14:06:19.394: [ocrcheck][1069037312]failed to access OCR repository. Retval [[] Error [proc-26:error while accessing the physical storage


]


2015-09-02 14:06:19.394: [ocrcheck][1069037312]failed to initialize OCRCHEK2


2015-09-02 14:06:19.394: [ocrcheck][1069037312]exiting [status=failed] ...

There is absolutely no train of thought here, so let's take a closer look at the ASM alert log as follows:




Tue Sep 01 23:07:03 2015


Note:process _B000_+ASM1 (19088) initiating offline of disk 0.3915955003 (OCR_VOTE01) with mask 0x7e in Group 4


Note:process _B000_+ASM1 (19088) initiating offline of disk 1.3915955004 (OCR_VOTE02) with mask 0x7e in Group 4


Note:process _B000_+ASM1 (19088) initiating offline of disk 2.3915955005 (OCR_VOTE03) with mask 0x7e in Group 4


Note:checking PST:GRP = 4


Gmon checking disk modes for Group 4 at a for PID Osid 19088


Error:no read Quorum in group:required 2, found 0 disks


Note:checking PST for GRP 4 is done.


note:initiating PST update:grp = 4, DSK = 0/0xe968bb3b, mask = 0x6a, op = clear


note:initiating PST update:grp = 4, DSK = 1/0xe968bb3c, mask = 0x6a, op = clear


note:initiating PST update:grp = 4, DSK = 2/0xe968bb3d, mask = 0x6a, op = clear


Gmon Updating disk modes for Group 4 at a for PID Osid 19088


Error:no read Quorum in group:required 2, found 0 disks


Tue Sep 01 23:07:03 2015


Note:cache dismounting (not clean) group 4/0XD7184BC4 (CRS)


Warning:offline for disk Ocr_vote01 in mode 0x7f failed.


Warning:offline for disk Ocr_vote02 in mode 0x7f failed.


Warning:offline for disk Ocr_vote03 in mode 0x7f failed.


Note:messaging CKPT to Quiesce pins Unix process pid:19090, image:oracle@tysy1 (B001)


Tue Sep 01 23:07:03 2015


Note:halting all I/Os to DiskGroup 4 (CRS)


Tue Sep 01 23:07:03 2015


NOTE:LGWR doing Non-clean dismount of Group 4 (CRS)


NOTE:LGWR sync aba=15.189 last written ABA 15.189


Tue Sep 01 23:07:03 2015


Kjbdomdet Send To Inst 2


Detach from Dom 4, sending detach message to Inst 2


Tue Sep 01 23:07:03 2015


List of instances:


1 2


Dirty Detach reconfiguration started (New Ddet Inc 1, Cluster Inc 4)


Global Resource Directory partially frozen for dirty detach


* Dirty Detach-domain 4 Invalid = TRUE


0 GCS Resources traversed, 0 cancelled


Dirty Detach Reconfiguration Complete


Tue Sep 01 23:07:03 2015


Warning:dirty detached from Domain 4


Note:cache dismounted Group 4/0XD7184BC4 (CRS)


Sql> Alter DiskGroup CRS dismount Force/ASM server:3608693700 * *


Tue Sep 01 23:07:03 2015


Note:cache deleting context for group CRS 4/0XD7184BC4


Gmon dismounting Group 4 at the for PID Osid 19090


Note:disk ocr_vote01 in mode 0x7f marked for de-assignment


Note:disk ocr_vote02 in mode 0x7f marked for de-assignment


Note:disk ocr_vote03 in mode 0x7f marked for de-assignment


Note:waiting for all pending writes to complete before De-registering:grpnum 4


Tue Sep 01 23:07:18 2015


ASM Health Checker found 1 new failures


Tue Sep 01 23:07:33 2015


Success:diskgroup CRS was dismounted


Success:alter diskgroup CRS Dismount Force/ASM server:3608693700 * *


Tue Sep 01 23:07:33 2015


Note:diskgroup resource Ora. CRS.DG is offline


Success:asm-initiated mandatory dismount of group CRS


Tue Sep 01 23:07:33 2015


Errors in FILE/U01/APP/GRID/DIAG/ASM/+ASM/+ASM1/TRACE/+ASM1_ORA_4208.TRC:


Ora-15078:asm DiskGroup was forcibly dismounted


......


......


Errors in FILE/U01/APP/GRID/DIAG/ASM/+ASM/+ASM1/TRACE/+ASM1_ORA_4208.TRC:


Ora-15078:asm DiskGroup was forcibly dismounted


Wed Sep 02 00:40:16 2015


Errors in FILE/U01/APP/GRID/DIAG/ASM/+ASM/+ASM1/TRACE/+ASM1_ORA_4208.TRC:


Ora-15078:asm DiskGroup was forcibly dismounted


Warning:requested Mirror Side 1 of virtual extent 4 logical Extent 0 offset 294912 is not allocated; I/O request failed


Warning:requested Mirror Side 2 of virtual extent 4 logical extent 1 offset 294912 is not allocated; I/O request failed


Errors in FILE/U01/APP/GRID/DIAG/ASM/+ASM/+ASM1/TRACE/+ASM1_ORA_4208.TRC:


Ora-15078:asm DiskGroup was forcibly dismounted


Ora-15078:asm DiskGroup was forcibly dismounted


Wed Sep 02 00:40:16 2015


sql> alter DiskGroup CRS Check/* Proxy *


Ora-15032:not all alterations performed


Ora-15001:diskgroup "CRS" does not exist or are not mounted


Error:alter diskgroup CRS Check/* Proxy *


Wed Sep 02 00:41:33 2015


Note:client exited [4197]


Wed Sep 02 00:41:34 2015


Note: [Crsd.bin@tysy1 (TNS v1-v3) 13821] opening OCR file


Wed Sep 02 00:41:36 2015


Note: [Crsd.bin@tysy1 (TNS v1-v3) 13834] opening OCR file


......


......


Wed Sep 02 14:06:08 2015


Note: [Ocrcheck.bin@tysy1 (TNS v1-v3) 16749] opening OCR file


Wed Sep 02 14:06:19 2015


Note: [Ocrcheck.bin@tysy1 (TNS v1-v3) 16815] opening OCR file


Wed Sep 02 14:13:58 2015


Note: [Ocrcheck.bin@tysy1 (TNS v1-v3) 18667] opening OCR file


Wed Sep 02 14:17:52 2015


Note: [Ocrcheck.bin@tysy1 (TNS v1-v3) 19633] opening OCR file

We can see that ASM made a check, found failure (ASM Health Checker found 1 new failures), and then forced the CRS disk group to dismount. We will further analyze the contents of the following trace file:




[Grid@tysy1 client]$ CAT/U01/APP/GRID/DIAG/ASM/+ASM/+ASM1/TRACE/+ASM1_ORA_4208.TRC


Trace FILE/U01/APP/GRID/DIAG/ASM/+ASM/+ASM1/TRACE/+ASM1_ORA_4208.TRC


Oracle Database 11g Enterprise Edition release 11.2.0.4.0-64bit Production


With the real application clusters and Automatic Storage Management options


Oracle_home =/u01/app/11.2.0/grid


System Name:linux


Node name:tysy1


Release:2.6.32-358.el6.x86_64


Version: #1 SMP Tue 11:47:41 EST 2013


Machine:x86_64


VM Name:vmware Version:6


Instance Name: +ASM1


Redo thread mounted by this instance:0 <none>


Oracle Process number:26


Unix process pid:4208, image:oracle@tysy1 (TNS v1-v3)





2015-09-01 23:07:33.936


Session ID: (289.3) 2015-09-01 23:07:33.936


CLIENT ID: () 2015-09-01 23:07:33.936


SERVICE NAME: () 2015-09-01 23:07:33.936


MODULE NAME: (crsd.bin@tysy1 (TNS v1-v3)) 2015-09-01 23:07:33.936


ACTION NAME: () 2015-09-01 23:07:33.936





warning:failed Xlate 1


Ora-15078:asm DiskGroup was forcibly dismounted


......


......


warning:failed Xlate 1


Ora-15078:asm DiskGroup was forcibly dismounted





2015-09-02 00:40:16.217


warning:failed Xlate 1


Ora-15078:asm DiskGroup was forcibly dismounted


Ksfdrfms:mirror Read file=+crs.255.4294967295 fob=0x9f403518 bufp=0x7f48c1381a00 blkno=1096 nbytes=4096


warning:failed Xlate 1


Warning:requested Mirror Side 1 of virtual extent 4 logical Extent 0 offset 294912 is not allocated; I/O request failed


Ksfdrfms:read failed from mirror side=1 logical extent number=0 dskno=65535


warning:failed Xlate 1


Warning:requested Mirror Side 2 of virtual extent 4 logical extent 1 offset 294912 is not allocated; I/O request failed


Ksfdrfms:read failed from mirror side=2 logical extent Number=1 dskno=65535


Ora-15078:asm DiskGroup was forcibly dismounted


Ora-15078:asm DiskGroup was forcibly dismounted

Further I analyzed the contents of the call stack in the Gmon process trace file content found as follows:




---+asm1_gmon_4091.trc


2015-09-01 22:55:11.686


Error:gmon could not validate no PST hbeat (GRP 6) for validate


2015-09-01 22:55:11.686182:kfdphbeatblk_transfer:found pending hbeat I/O in grp=6 @


Hbeatcblk=0x7f2bf4f337f8 iop=0x7f2bf4f33808 iostate=3.


Kfdphbeatcb_dump:at 0x7f2bf4f337f8 with ts=09/01/2015 22:55:11 iop=0x7f2bf4f33808, grp=6, disk=0/3915955008, isWrite=0 Hbeat #0 state=3 iostate=3


-----Abridged call Stack Trace-----


Ksedsts () +465<-kfdphbeatblk_transfer () +953<-kfdphbeat_docheck () +662<-kfdphbeat_timeout () +854<-KFDP _TIMEOUTBG () +1642<-ksbcti () +513<-ksbabs () +1735<-ksbrdp () +1045<-opirip () () +623<-opidrv Sou2o () +103<-opimai_real () +250<-ssthrdmain () +265<-main () +201<-__libc_start_main () +253





-----End of Abridged call Stack Trace-----


warning:waited secs for write IO to PST Disk 0 in group 1.





2015-09-01 23:07:03.571


warning:waited secs for write IO to PST Disk 0 in group 1.


warning:waited secs for write IO to PST Disk 0 in Group 4.


warning:waited secs for write IO to PST Disk 1 in Group 4.


warning:waited secs for write IO to PST Disk 2 in Group 4.


Note:set to IS offline flag for disk ocr_vote01 only Locally:flag 0x3211


Note:set to IS offline flag for disk OCR_VOTE02 only Locally:flag 0x3211


Note:set to IS offline flag for disk ocr_vote03 only Locally:flag 0x3211


warning:waited secs for write IO to PST Disk 0 in Group 4.


warning:waited secs for write IO to PST Disk 1 in Group 4.


warning:waited secs for write IO to PST Disk 2 in Group 4.


warning:waited secs for write IO to PST Disk 0 in Group 5.


warning:waited secs for write IO to PST Disk 0 in Group 5.


warning:waited secs for write IO to PST Disk 0 in group 6.


warning:waited secs for write IO to PST Disk 0 in group 6.


warning:waited secs for write IO to PST Disk 0 in group 7.


warning:waited secs for write IO to PST Disk 0 in group 7.


-----Abridged call Stack Trace-----


Ksedsts () +465<-kfdpgc_dotobeoflnasync () +37<-KFDPGC_CHECKTOBEOFLN () +623<-kfdpgc_timeout () +17<-kfdp_ TIMEOUTBG () +1657<-ksbcti () +513<-ksbabs () +1735<-ksbrdp () +1045<-opirip () () +623<-opidrv Sou2o () +103<-opimai_real () +250<-ssthrdmain () +265<-main () +201<-__libc_start_main () +253


。。。。。。


。。。。。。


---------------Hbeat------------------


Kfdphbeat_dump:state=1, Inst=1, ts=33023031.450354176,


rnd=1036706912.2853146286.3238582348.2977351792.


KFK Io-queue: (nil)


Kfdphbeatcb_dump:at 0x7f2bf4d28248 with ts=09/01/2015 23:06:47 iop=0x7f2bf4d28258, grp=4, disk=0/3915955003, isWrite=1 Hbeat #17 state=4 iostate=3


Kfdphbeatcb_dump:at 0x7f2bf4d28080 with ts=09/01/2015 23:06:47 iop=0x7f2bf4d28090, grp=4, disk=1/3915955004, isWrite=1 Hbeat #17 state=4 iostate=3


Kfdphbeatcb_dump:at 0x7f2bf4d27eb8 with ts=09/01/2015 23:06:47 Iop=0x7f2bf4d27ec8, grp=4, disk=2/3915955005, isWrite=1 Hbeat #17 state=4 iostate=3


NOTE:kfdp_updateInt:forceDismount GRP 4


NOTE:GMON:failed to update modes:triggering Force dismount of Group 4


2015-09-01 23:07:03.625783:kfdphbeatblk_transfer:found pending hbeat I/O in grp=4 @


hbeatcblk=0x7f2bf4d28248 iop=0x7f2bf4d28258 iostate=3.


Kfdphbeatcb_dump:at 0x7f2bf4d28248 with ts=09/01/2015 23:06:47 iop=0x7f2bf4d28258, grp=4, disk=0/3915955003, isWrite=1 Hbeat #17 state=4 iostate=3


2015-09-01 23:07:03.625823:kfdphbeatblk_transfer:found pending hbeat I/O in grp=4 @


hbeatcblk=0x7f2bf4d28080 iop=0x7f2bf4d28090 iostate=3.


Kfdphbeatcb_dump:at 0x7f2bf4d28080 with ts=09/01/2015 23:06:47 iop=0x7f2bf4d28090, grp=4, disk=1/3915955004, isWrite=1 Hbeat #17 state=4 iostate=3


2015-09-01 23:07:03.625844:kfdphbeatblk_transfer:found pending hbeat I/O in grp=4 @


Hbeatcblk=0x7f2bf4d27eb8 Iop=0x7f2bf4d27ec8 iostate=3.


Kfdphbeatcb_dump:at 0x7f2bf4d27eb8 with ts=09/01/2015 23:06:47 Iop=0x7f2bf4d27ec8, grp=4, disk=2/3915955005, isWrite=1 Hbeat #17 state=4 iostate=3


-----Abridged call Stack Trace-----


Ksedsts () +465<-kfdphbeatblk_transfer () +953<-kfdp_forcedismountasyncgmon () +619<-kfdp_updateint () +664 <-KFDP_UPDATEDSKBG () () +814<-ksbabs () +771<-ksbrdp () +1045<-opirip () +623<-opidrv () +603<-sou2o () + 103<-opimai_real () +250<-ssthrdmain () +265<-main () +201<-__libc_start_main () +253





-----End of Abridged call Stack Trace-----





2015-09-01 23:07:03.906


Gmon dismounting Group 4 at the for PID Osid 19090


NOTE:kfdp_doDismount:dismount GRP 4





2015-09-01 23:07:06.609


NOTE:KFDPUTIL_FREESLMSG:KSVTST for grp:4 KSV status 0x10





2015-09-01 23:07:36.732


NOTE:KFDPUTIL_FREESLMSG:KSVTST for grp:4 KSV status 0x10


Error:gmon could not validate no PST hbeat (GRP 1) for validate





2015-09-01 23:11:03.981


2015-09-01 23:11:03.981362:kfdphbeatblk_transfer:found pending hbeat I/O in grp=1 @


HBEATCBLK=0X7F2BF4F32F10 Iop=0x7f2bf4f32f20 iostate=3.


Kfdphbeatcb_dump:at 0x7f2bf4f32f10 with ts=09/01/2015 23:11:03 Iop=0x7f2bf4f32f20, Grp=1, disk=0/3915955002, isWrite=0 Hbeat #0 state=3 iostate=3


-----Abridged call Stack Trace-----


Ksedsts () +465<-kfdphbeatblk_transfer () +953<-kfdphbeat_docheck () +662<-kfdphbeat_timeout () +854<-KFDP _TIMEOUTBG () +1642<-ksbcti () +513<-ksbabs () +1735<-ksbrdp () +1045<-opirip () () +623<-opidrv Sou2o () +103<-opimai_real () +250<-ssthrdmain () +265<-main () +201<-__libc_start_main () +253





-----End of Abridged call Stack Trace-----

Based on the above, the analysis found exactly the same as Oracle Bug 18694414. The following is a description of the call stack for the bug:


Ksedsts () +465<-kfdphbeatblk_transfer () +953<-kfdphbeat_docheck () +662<-kfdphbeat
@ _timeout () +854<-KFDP_TIMEOUTBG () +1642<-ksbcti () +513<-ksbabs () +1735<-ksbrdp () +1
@ 045<-opirip () +623<-opidrv () +603<-sou2o () +103<-opimai_real () +250<-ssthrdmain () +
@ 265<-main () +201<-__libc_start_main () +253
According to the Oracle MOS documentation, it is recommended that you adjust the _asm_hbeatiowait parameter to 185, because the Linux environment disk defaults to timeout 180, as follows:


[root@tysy1 ~]# cat/sys/block/sdb/device/timeout
180
[root@tysy1 ~]#

Contact Us

The content source of this page is from Internet, which doesn't represent Alibaba Cloud's opinion; products and services mentioned on that page don't have any relationship with Alibaba Cloud. If the content of the page makes you feel confusing, please write us an email, we will handle the problem within 5 days after receiving your email.

If you find any instances of plagiarism from the community, please send an email to: info-contact@alibabacloud.com and provide relevant evidence. A staff member will contact you within 5 working days.

A Free Trial That Lets You Build Big!

Start building with 50+ products and up to 12 months usage for Elastic Compute Service

  • Sales Support

    1 on 1 presale consultation

  • After-Sales Support

    24/7 Technical Support 6 Free Tickets per Quarter Faster Response

  • Alibaba Cloud offers highly flexible support services tailored to meet your exact needs.