My colleague's databases run on AIX. After an OS patch was installed, the ASM instance could no longer mount its disk groups at startup and simply hung. We then ran hanganalyze against the hung process. The hang analysis file is as follows:
Chains most likely to have caused the hang:
 [a] Chain 1 Signature: <='rdbms ipc reply'<='enq: DD - contention'
     Chain 1 Signature Hash: 0x7bd12357
 [b] Chain 2 Signature: <='rdbms ipc reply'<='enq: DD - contention'
     Chain 2 Signature Hash: 0x7bd12357

===============================================================================
Non-intersecting chains:

-------------------------------------------------------------------------------
Chain 1:
-------------------------------------------------------------------------------
    Oracle session identified by:
    {
                instance: 1 (+asm.+asm)
                   os id: 5898472
              process id: 20, oracle@NG-CSDB (TNS V1-V3)
              session id: 1261
        session serial #: 1
    }
    is waiting for 'enq: DD - contention' with wait info:
    {
                      p1: 'name|mode'=0x44440006
                      p2: 'disk group'=0x0
                      p3: 'type'=0x1
            time in wait: 2 min 13 sec
           timeout after: never
                 wait id: 10
                blocking: 0 sessions
             current sql: ALTER DISKGROUP WPSDG MOUNT /* asm agent *//* {0:0:324} */
             short stack: ksedsts()+360<-ksdxfstk()+44<-ksdxcb()+3384<-sspuser()+116<-48b8<-sskgpwwait()+32<-skgpwwait()+180<-ksliwat()+11032<-kslwaitctx()+180<-ksqcmi()+5604<-ksqgtlctx()+2904<-ksqgelctx()+660<-kfgUseDmt()+424<-kfxdrvMount()+1092<-kfxdrvEntry()+2296<-opiexe()+18568<-opiosq0()+2864<-kpooprx()+316<-kpoal8()+692<-opiodr()+720<-ttcpip()+1028<-opitsk()+1508<-opiino()+940<-opiodr()+720<-opidrv()+1132<-sou2o()+136<-opimai_real()+608<-ssthrdmain()+268<-main()+204<-__start()+112
            wait history:
              * time between current wait and wait #1: 0.000979 sec
              1.       event: 'SQL*Net message from client'
                  time waited: 0.000099 sec
                      wait id: 9            p1: 'driver id'=0x62657100
                                            p2: '#bytes'=0x1
              * time between wait #1 and #2: 0.000002 sec
              2.       event: 'SQL*Net message to client'
                  time waited: 0.000000 sec
                      wait id: 8            p1: 'driver id'=0x62657100
                                            p2: '#bytes'=0x1
              * time between wait #2 and #3: 0.000021 sec
              3.       event: 'SQL*Net message from client'
                  time waited: 0.000815 sec
                      wait id: 7            p1: 'driver id'=0x62657100
                                            p2: '#bytes'=0x1
    }
    and is blocked by
 => Oracle session identified by:
    {
                instance: 1 (+asm.+asm)
                   os id: 5112446
              process id: 18, oracle@NG-CSDB (TNS V1-V3)
              session id: 1135
        session serial #: 1
    }
    which is waiting for 'rdbms ipc reply' with wait info:
    {
                      p1: 'from_process'=0xd
                      p2: 'timeout'=0x7fffff7b
            time in wait: 1.626263 sec
           timeout after: 0.373737 sec
                 wait id: 86
                blocking: 2 sessions
             current sql: ALTER DISKGROUP UATDG MOUNT /* asm agent *//* {0:0:324} */
             short stack: ksedsts()+360<-ksdxfstk()+44<-ksdxcb()+3384<-sspuser()+116<-48b8<-sskgpwwait()+32<-skgpwwait()+180<-ksliwat()+11032<-kslwaitctx()+180<-kslwait()+112<-ksarcr()+348<-ksbwcoex()+136<-kfgbSendWithPin()+564<-kfgbSendShallow()+244<-kfgDiscoverShallow()+392<-kfgGlobalOpen()+388<-kfgDiscoverDeep()+332<-kfgDiscoverGroup()+1160<-kfgFinalizeMount()+1236<-kfgscFinalize()+1552<-kfgForEachKfgsc()+272<-kfgsoFinalize()+200<-kfgFinalize()+500<-kfxdrvMount()+3436<-kfxdrvEntry()+2296<-opiexe()+18568<-opiosq0()+2864<-kpooprx()+316<-k
            wait history:
              * time between current wait and wait #1: 0.000040 sec
              1.       event: 'rdbms ipc reply'
                  time waited: 2.000011 sec
                      wait id: 85           p1: 'from_process'=0xd
                                            p2: 'timeout'=0x7fffff7d
              * time between wait #1 and #2: 0.000045 sec
              2.       event: 'rdbms ipc reply'
                  time waited: 2.000012 sec
                      wait id: 84           p1: 'from_process'=0xd
                                            p2: 'timeout'=0x7fffff7f
              * time between wait #2 and #3: 0.000038 sec
              3.       event: 'rdbms ipc reply'
                  time waited: 2.000010 sec
                      wait id: 83           p1: 'from_process'=0xd
                                            p2: 'timeout'=0x7fffff81
    }
    and is blocked by
 => Oracle session identified by:
    {
                instance: 1 (+asm.+asm)
                   os id: 4653624
              process id: 13, oracle@NG-CSDB (RBAL)
              session id: 820
        session serial #: 1
    }
    which is not in a wait:
    {
               last wait: 2 min 13 sec ago
                blocking: 3 sessions
             current sql:
             short stack: ksedsts()+360<-ksdxfstk()+44<-ksdxcb()+3384<-sspuser()+116<-48b8<-skgfrdevstat()+444<-skgfrdscdevs()+1100<-ksfddscdevs()+24<-kfk_dscv_ufs_init()+208<-kfk_discover_disks()+564<-kfdDiscoverString()+100<-kfdDiscoverShallow()+1764<-kfgbDriver()+1908<-ksbabs()+2800<-ksbrdp()+2044<-opirip()+1624<-opidrv()+608<-sou2o()+136<-opimai_real()+188<-ssthrdmain()+268<-main()+204<-__start()+112
            wait history:
              1.       event: 'CSS operation: action'
                  time waited: 0.003640 sec
                      wait id: 3            p1: 'function_id'=0x43
              * time between wait #1 and #2: 0.000428 sec
              2.       event: 'GPnP Get Item'
                  time waited: 0.091329 sec
                      wait id: 2
              * time between wait #2 and #3: 0.003109 sec
              3.       event: 'CSS initialization'
                  time waited: 0.046782 sec
                      wait id: 1
    }
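For readers who want to reproduce this kind of dump: a hanganalyze trace like the one above is normally produced with oradebug from a SQL*Plus session on the hung instance. The following is only a minimal sketch of that procedure; the level 3 argument and the +ASM SID are my assumptions, not something taken from the original diagnosis.

    #!/bin/sh
    # Sketch: generate a hanganalyze dump on the (hung) ASM instance.
    # Run as the Grid Infrastructure owner with the ASM environment sourced.
    export ORACLE_SID=+ASM
    sqlplus / as sysasm <<'EOF'
    oradebug setmypid
    oradebug unlimit
    REM level 3 is usually enough to show the blocking chains
    oradebug hanganalyze 3
    REM print the name of the trace file that now holds the dump
    oradebug tracefile_name
    exit
    EOF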
We can see that our session is waiting on 'enq: DD - contention' while executing the disk group mount, and that it is blocked by the ASM RBAL process. RBAL's recent wait history is 'CSS initialization' => 'GPnP Get Item' => 'CSS operation: action'. The hang analysis file also contains the stack information of the RBAL process. Normally we take the relevant stack and wait events and search My Oracle Support with them, and this time I found a document whose wait events and stack information match ours exactly: ASM Instance Hangs During The Diskgroup Mount Stage After aix OS Patch Install (Document ID 1633273.1). Let's take a look at some of the descriptions in this document. The hang is attributed to bug 16288155, which is said to be fixed in 11.2.0.3; strangely, our version is already 11.2.0.3. The document asks us to collect truss information for the RBAL process, as shown below:
1. Create a script called rbal.sh on /home/oracle directory. Add the following lines:

   #!/bin/csh
   /usr/bin/truss -o /tmp/rbal_strace.log -aefldD /bin/oracle

   chmod +x /home/oracle/rbal.sh

2. sqlplus / as sysasm
   SQL> create pfile='/tmp/xx.ora' from spfile;

3. Edit xx.ora and add the following parameter:
   _dbg_proc_startup=TRUE

4. Shutdown ASM

5. Startup ASM using the following commands:
   # csh
   # setenv RBAL /home/oracle/rbal.sh
   # sqlplus / as sysasm
   # SQL> startup pfile='/tmp/xx.ora'
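As an aside (my own sketch, not part of the MOS note): if the ASM instance is already up and RBAL is running, truss can simply be attached to the existing process instead of restarting ASM through the wrapper script. The process-name pattern below assumes the instance is called +ASM.

    #!/bin/sh
    # Sketch: attach truss to an already-running ASM RBAL process on AIX.
    # The '[a]sm' trick keeps grep from matching its own command line.
    RBAL_PID=$(ps -ef | grep '[a]sm_rbal_+ASM' | awk '{print $2}')
    /usr/bin/truss -o /tmp/rbal_strace.log -aefldD -p "$RBAL_PID"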
We collected the truss output of the RBAL process as required by the document. The content is as follows:
3866976: 20447683: 0.0002: statx("/oracle/app/grid_base/diag/asm/+asm/+ASM/alert/log.xml", 0x0FFFFFFFFFFF7FD0, 176, 01) = 0
3866976: 20447683: 0.0002: statx("/oracle/app/grid_base/diag/asm/+asm/+ASM/alert/log.xml", 0x0FFFFFFFFFFF8100, 176, 0) = 0
3866976: 20447683: 0.0004: kopen("/oracle/app/grid_base/diag/asm/+asm/+ASM/alert/log.xml", O_WRONLY|O_CREAT|O_APPEND|O_LARGEFILE, S_IRUSR|S_IWUSR|S_IRGRP|S_IWGRP) = 7
3866976: 20447683: 0.0002: kfcntl(7, F_SETFD, 0x0000000000000001) = 0
3866976: 20447683: kwrite(7, " < m s g   t i m e = ' 2".., 283) = 283
3866976: 20447683: 0.0002: close(7) = 0
3866976: 20447683: 0.0002: statx("/oracle/app/grid_base/diag/asm/+asm/+ASM/alert/log.xml", 0x0FFFFFFFFFFF7620, 176, 0) = 0
3866976: 20447683: 0.0002: kopen("/oracle/app/grid_base/diag/asm/+asm/+ASM/trace/alert_+ASM.log", O_WRONLY|O_CREAT|O_APPEND|O_LARGEFILE, S_IRUSR|S_IWUSR|S_IRGRP|S_IWGRP) = 7
3866976: 20447683: 0.0002: kfcntl(7, F_SETFD, 0x0000000000000001) = 0
3866976: 20447683: kwrite(7, " W e d   J u l   2 3   0".., 25) = 25
3866976: 20447683: kwrite(7, " R B A L   s t a r t e d".., 41) = 41   <<<<<<<<<<
3866976: 20447683: 0.0002: close(7) = 0
3866976: 20447683: 2.0001: _thread_wait(1152921504606830016) (sleeping...)
3866976: 20447683: 2.0001: _thread_wait(1152921504606830016) = 1
3866976: 20447683: 1.0002: kopen("/proc/4653566/psinfo", O_RDONLY|O_LARGEFILE) = 7
3866976: 20447683: kread(7, "\0 \0010204\001\0\0\001".., 448) = 448
3866976: 20447683: 0.0002: close(7) = 0
3866976: 20447683: 2.0001: _thread_wait(1152921504606830016) (sleeping...)
3866976: 20447683: 2.0001: _thread_wait(1152921504606830016) = 1
3866976: 20447683: 2.0001: _thread_wait(1152921504606830016) (sleeping...)
3866976: 20447683: 2.0001: _thread_wait(1152921504606830016) = 1
3866976: 20447683: 2.0001: _thread_wait(1152921504606830016) (sleeping...)
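To make this check easier to repeat, the truss log can be summarized with a couple of grep commands. This is my own quick filter, assuming the log path used in the script above:

    #!/bin/sh
    # Sketch: confirm RBAL is only sleeping rather than stuck in device calls.
    LOG=/tmp/rbal_strace.log
    echo "thread_wait (sleep) calls:"
    grep -c '_thread_wait' "$LOG"
    echo "open/read/write calls outside the diag directory:"
    grep -E 'kopen|kread|kwrite' "$LOG" | grep -vc '/diag/asm/'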
We can see that RBAL has started and then simply stays in a sleeping state, which suggests the RBAL process itself is not the problem. The MOS document describes RBAL having trouble accessing the disks, so our situation is different. What, then, is the problem? This is really strange. In the end we could only follow the final solution described in the document: roll back the OS patch and relink the Oracle software. The SR we opened for this problem made no particular progress; the suspicion is a conflict that has not yet been identified. After all, the Oracle software was built first and IBM's new patches came later, which can lead to incompatibilities. It is best to install the operating system patches first and then reinstall (or relink) the latest database software.
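For reference, relinking on AIX after OS library changes generally follows the pattern below. This is a generic sketch, not the exact procedure from the MOS note; the home path is an assumption, and a Grid Infrastructure home usually has to be unlocked first, so check the relink notes for your exact version before running it.

    #!/bin/sh
    # Sketch: relink the Oracle/Grid binaries after AIX OS library changes.
    # 1. Shut down ASM, listeners and anything else using this home.
    # 2. As root, flush cached shared libraries first:  slibclean
    # 3. As the software owner:
    export ORACLE_HOME=/oracle/app/11.2.0/grid    # assumed path
    export PATH=$ORACLE_HOME/bin:$PATH
    relink all                                    # logs to $ORACLE_HOME/install/relink.log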
Reference: ASM Instance Hangs During The Diskgroup Mount Stage After aix OS Patch Install (Document ID 1633273.1)
Original article: After the AIX operating system patch is installed, ASM hangs and the disk group cannot be mounted.