Pre-11gR2: "crsctl check crs" command hangs at EVMD check (Doc ID 1578875.1)
APPLIES TO:
Oracle Database - Enterprise Edition - Version 10.2.0.3 to 11.1.0.7 [Release 10.2 to 11.1]
Information in this document applies to any platform.
SYMPTOMS
In a two-node RAC environment running 11.1.0.7 CRS, the command "crsctl check crs" hangs at the EVMD check, on Node 1 only:
[oracle@srv03401 bin]$ ./crsctl check crs
Cluster Synchronization Services appears healthy
Cluster Ready Services appears healthy
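Because the hung command never returns to the prompt, it can help to run the check under a time limit when scripting it. The following is a minimal sketch, not part of Oracle Clusterware: the function name check_with_timeout is hypothetical, and it assumes the GNU coreutils "timeout" utility is installed, which may not be the case on older systems.

```shell
#!/bin/sh
# Hypothetical helper: run a command under a time limit and report a hang.
# Assumes GNU coreutils `timeout`, which exits with status 124 when the
# time limit is reached.
check_with_timeout() {
    limit="$1"
    shift
    rc=0
    timeout "$limit" "$@" || rc=$?
    if [ "$rc" -eq 124 ]; then
        echo "HUNG: '$*' did not finish within ${limit}s" >&2
    fi
    return "$rc"
}
```

For example, "check_with_timeout 30 ./crsctl check crs" would return 124 on the affected node instead of blocking the session; the 30-second limit is an arbitrary choice.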
On Node 1, the command "crsctl check crs" was traced with strace:
# strace -f -t -o /tmp/crschk.trc crsctl check crs
The generated output file, /tmp/crschk.trc, contains the following:
28268 11:47:03 execve("./crsctl", ["./crsctl", "check", "crs"], [/* 23 vars */]) = 0
28268 11:47:03 brk(0) = 0x193d2000
28268 11:47:03 mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x2b35b9436000
28268 11:47:03 uname({sys="Linux", node="srv03401.metra.com", ...}) = 0
28268 11:47:03 access("/etc/ld.so.preload", R_OK) = -1 ENOENT (No such file or directory)
28268 11:47:03 open("/etc/ld.so.cache", O_RDONLY) = 3
28268 11:47:03 fstat(3, {st_mode=S_IFREG|0644, st_size=92563, ...}) = 0
28268 11:47:03 mmap(NULL, 92563, PROT_READ, MAP_PRIVATE, 3, 0) = 0x2b35b9437000
28268 11:47:03 close(3) = 0
28268 11:47:03 open("/lib64/libtermcap.so.2", O_RDONLY) = 3
28268 11:47:03 read(3, "\177ELF\2\1\1\0\0\0\0\0\0\0\0\0\3\0>\0\1\0\0\0\0'\20\300z2\0\0\0"..., 832) = 832
28268 11:47:03 fstat(3, {st_mode=S_IFREG|0755, st_size=15840, ...}) = 0
28268 11:47:03 mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x2b35b944e000
28268 11:47:03 mmap(0x327ac00000, 2108944, PROT_READ|PROT_EXEC, MAP_PRIVATE|MAP_DENYWRITE, 3, 0) = 0x327ac00000
28268 11:47:03 mprotect(0x327ac03000, 2093056, PROT_NONE) = 0
28268 11:47:03 mmap(0x327ae02000, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED|MAP_DENYWRITE, 3, 0x2000) = 0x327ae02000
28268 11:47:03 close(3) = 0
28268 11:47:03 open("/lib64/libdl.so.2", O_RDONLY) = 3
..
..
28268 11:47:03 close(3) = 0
28268 11:47:03 write(1, "Cluster Ready Services appears h"..., 39) = 39
28268 11:47:03 socket(PF_INET6, SOCK_DGRAM, IPPROTO_IP) = 3
28268 11:47:03 setsockopt(3, SOL_SOCKET, SO_REUSEADDR, [1], 4) = 0
28268 11:47:03 bind(3, {sa_family=AF_INET6, sin6_port=htons(0), inet_pton(AF_INET6, "::1", &sin6_addr), sin6_flowinfo=0, sin6_scope_id=0}, 28) = 0
28268 11:47:03 getsockname(3, {sa_family=AF_INET6, sin6_port=htons(42027), inet_pton(AF_INET6, "::1", &sin6_addr), sin6_flowinfo=0, sin6_scope_id=0}, [140733193388060]) = 0
28268 11:47:03 getpeername(3, 0x7fff5f19e1e0, [140733193388060]) = -1 ENOTCONN (Transport endpoint is not connected)
28268 11:47:03 getsockopt(3, SOL_SOCKET, SO_SNDBUF, [5536382933839118336], [4]) = 0
28268 11:47:03 getsockopt(3, SOL_SOCKET, SO_RCVBUF, [5536382933843050496], [4]) = 0
28268 11:47:03 fcntl(3, F_SETFD, FD_CLOEXEC) = 0
28268 11:47:03 fcntl(3, F_SETFL, O_RDONLY|O_NONBLOCK) = 0
28268 11:47:03 geteuid() = 700
28268 11:47:03 times({tms_utime=1, tms_stime=2, tms_cutime=0, tms_cstime=0}) = 7422615891
28268 11:47:03 socket(PF_FILE, SOCK_STREAM, 0) = 4
28268 11:47:03 access("/var/tmp/.oracle/sSYSTEM.evm.acceptor.auth", F_OK) = 0
28268 11:47:03 connect(4, {sa_family=AF_FILE, path="/var/tmp/.oracle/sSYSTEM.evm.acceptor.auth"...}, 110
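In a long trace file, the call that never returned can be located by looking for system call lines that lack an "= <result>" suffix. The following is a rough sketch under stated assumptions: the function name unfinished_calls is hypothetical, and the line layout is assumed to be the "PID time call(...) = result" form produced by strace -f -t -o. In a trace of several processes it may also list calls that were interrupted and resumed later.

```shell
#!/bin/sh
# Hypothetical helper: print system call lines that have no return value.
# Assumes each line looks like "PID HH:MM:SS name(args) = result".
unfinished_calls() {
    # First keep only lines that start a system call, then drop the ones
    # that completed, i.e. that contain ") = " followed by a result.
    grep -E '^[0-9]+ +[0-9:]+ +[a-z_0-9]+ ?\(' "$1" | grep -v -E '\) += '
}
```

Run against /tmp/crschk.trc, this would be expected to print the final connect line shown above.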
CAUSE
The strace output shows the command trying to connect to a socket; the connect() call is the last entry in the trace and never returns:
==========
28268 11:47:03 socket(PF_FILE, SOCK_STREAM, 0) = 4
28268 11:47:03 access("/var/tmp/.oracle/sSYSTEM.evm.acceptor.auth", F_OK) = 0
28268 11:47:03 connect(4, {sa_family=AF_FILE, path="/var/tmp/.oracle/sSYSTEM.evm.acceptor.auth"...}, 110 <
==========
This indicates a problem with the network socket file.
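A first check on the file named in the connect() call is whether it exists and is still a socket. The sketch below is only an illustration: the function name check_socket_file is hypothetical, and the path to pass in is the one taken from the strace output. Note that in this case access() succeeded, so the file existed; a result of "socket" therefore does not prove that the process behind it is responding.

```shell
#!/bin/sh
# Hypothetical helper: classify a path as a socket, some other file type,
# or missing, using the standard `test` operators -S and -e.
check_socket_file() {
    path="$1"
    if [ -S "$path" ]; then
        echo "socket"
    elif [ -e "$path" ]; then
        echo "not-a-socket"
    else
        echo "missing"
    fi
}
```

For example: check_socket_file /var/tmp/.oracle/sSYSTEM.evm.acceptor.auth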
SOLUTION
Get the PID of the evmd.bin process and kill it:
$ ps -ef | grep 'd.bin'
oracle   21046 21045  0  2012 ?        00:07:46 /u01/app/ract/crs/bin/evmd.bin
root     21054 15845  0  2012 ?        11:34:47 /u01/app/ract/crs/bin/crsd.bin reboot
oracle   22072 21453  0  2012 ?        05:44:50 /u01/app/ract/crs/bin/ocssd.bin
root     22135     1  0  2012 ?        00:00:00 /u01/app/ract/crs/bin/oclskd.bin
oracle   22410     1  0  2012 ?        00:00:00 /u01/app/ract/crs/bin/oclskd.bin
oracle   29834 27854  0 00:00:00 pts/8 egrep d.bin
$ kill -9 21046
After the evmd.bin process is killed, the command "crsctl check crs" returns its complete output without hanging:
[oracle@srv03401 bin]$ ./crsctl check crs
CSS appears healthy
CRS appears healthy
EVM appears healthy
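The PID lookup in the solution can also be done without reading the ps listing by eye. The following is a minimal sketch, assuming the process is listed with a command path ending in /evmd.bin as in the output above; the function name evmd_pids is hypothetical. It only prints PIDs and leaves the kill step to the administrator.

```shell
#!/bin/sh
# Hypothetical helper: read "PID COMMAND [ARGS...]" lines on standard input
# (for example from `ps -eo pid,args`) and print the PID of every process
# whose command path ends in /evmd.bin.
evmd_pids() {
    awk '$2 ~ /\/evmd\.bin$/ { print $1 }'
}
```

For example: ps -eo pid,args | evmd_pids
Matching on the command path rather than on the pattern 'd.bin' avoids picking up crsd.bin, ocssd.bin, oclskd.bin, or the grep process itself.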