Background: This is a windows rac system. Database background log reporting ORA-00474: SMONprocessterminatedwitherror. Then reported ORA-00603, ORA-27501, ORA-27300, ORA-27301, ORA-27302 and other errors. Problem Analysis: 1. Database alert Log FriFeb2804: 12: 092014 Reconfigurationstarte
Background: This is a windows rac system. Database background log reported ORA-00474: SMON process terminated with error. Then reported ORA-00603, ORA-27501, ORA-27300, ORA-27301, ORA-27302 and other errors. Problem Analysis: 1. Database alert Log Fri Feb 28 04:12:09 2014 Reconfiguration starte
Background:
This is a windows rac system. Database background log reported ORA-00474: SMON process terminated with error. Then reported ORA-00603, ORA-27501, ORA-27300, ORA-27301, ORA-27302 and other errors.
Problem Analysis:
1. Database alert logs
Fri Feb 28 04:12:09 2014Reconfiguration started (old inc 32, new inc 34)List of nodes: 0 Global Resource Directory frozen * dead instance detected - domain 0 invalid = TRUE Communication channels reestablished Master broadcasted resource hash value bitmaps Non-local Process blocks cleaned outFri Feb 28 04:12:09 2014 LMS 2: 0 GCS shadows cancelled, 0 closedFri Feb 28 04:12:10 2014 LMS 1: 0 GCS shadows cancelled, 0 closedFri Feb 28 04:12:11 2014 LMS 3: 0 GCS shadows cancelled, 0 closedFri Feb 28 04:12:12 2014 LMS 0: 2 GCS shadows cancelled, 2 closed Set master node info Submitted all remote-enqueue requests Dwn-cvts replayed, VALBLKs dubious All grantable enqueues granted Post SMON to start 1st pass IRFri Feb 28 04:12:12 2014Instance recovery: looking for dead threadsFri Feb 28 04:12:12 2014 LMS 3: 10102 GCS shadows traversed, 0 replayedFri Feb 28 04:12:12 2014 LMS 1: 10083 GCS shadows traversed, 0 replayedFri Feb 28 04:12:13 2014Beginning instance recovery of 1 threadsFri Feb 28 04:12:14 2014Errors in file d:\oracle\product\10.2.0\admin\uaprac\bdump\uaprac1_smon_6456.trc:ORA-00603: ORACLE server session terminated by fatal errorORA-27501: IPC error creating a portORA-27300: OS system dependent operation:IPC_CreateNamedSocket failed with status: 10049ORA-27301: OS failure message: The requested address is not valid in its context.ORA-27302: failure occurred at: initport_1
From the alert log of the database, we can see that the rac instance has been reconfigured (another node is restarted), and the SMON process of Node 1 starts to resume transactions, an error occurred while creating the ipc Port during transaction recovery. During the process of creating the port, the Oracle database needs to call some functions of the operating system, when an error occurs while calling the function of the operating system, error code 10049 is returned. Let's further look at the log Content of SMON trace.
*** 2014-02-28 04:10:26.269*** SERVICE NAME:(SYS$BACKGROUND) 2014-02-28 04:10:26.238*** SESSION ID:(1091.1) 2014-02-28 04:10:26.238IPCSendMsg: could not initiate send on conn 0x1f157b40 to node [uap : 4920 : 6200 : 223515], err 10054IPCGetRequestInfo: failed a request rqh(0x13582080), type(6), status(2), bytes(0)*** 2014-02-28 04:12:12.928Start recovery for domain 0, valid = 0, flags = 0x0
IPC_CreateNamedSocket: bind failed for [10.0.0.3: 11716], err (10049)
IPCInitPort: could not create listening socket err 10049
In this trace, we can see some information more clearly. For example, bind failed for [10.0.0.3: 11716], err (10049 ). What does this mean? To understand this part, you must have a certain understanding of socket programming. In socket programming, there is a very important function called bind. During network communication, a socket must be associated with an address. This process is the process of Address binding. In many cases, the kernel will bind an address to us. However, sometimes users may need to complete the binding process on their own to meet their actual needs. The following is the definition of the bind function.
#includeint bind(int sockfd, struct sockaddr* addr, socklen_t addrlen)
Here we will not describe the meanings of these parameters one by one. We can see in our trace that the bind function has a problem trying to bind the ports 10.0.0.3 and 11716, and reports 10049. In this case, what is the 10049 error in the operating system. Enter
C:\>net helpmsg 10049The requested address is not valid in its context.
Here we can see that the request address is invalid in its context. From this point, we assume that the IP address 10.0.0.3 requested has changed. In this case, the NIC may be faulty or the vswitch may be faulty. The IP address is invalid.
2. CRS log
[ CSSD]2014-02-28 04:10:12.800 [5928] >TRACE: clssgmPeerDeactivate: node 2 (uap-sgs-db10), death 0, state 0x80000001 connstate 0xf[ CSSD]2014-02-28 04:10:12.800 [6080] >WARNING: clssnmeventhndlr: Receive failure with node 2 (uap-sgs-db10), state 3, con(00000000022453A0), probe(0000000000000000), rc=11[ CSSD]2014-02-28 04:10:12.800 [6080] >TRACE: clssnmDiscHelper: uap-sgs-db10, node(2) connection failed, con (00000000022453A0), probe(0000000000000000)[ CSSD]2014-02-28 04:10:35.801 [5940] >WARNING: clssnmPollingThread: node uap-sgs-db10 (2) at 50 1.135363e-320artbeat fatal, eviction in 29.594 seconds[ CSSD]2014-02-28 04:10:35.801 [5940] >TRACE: clssnmPollingThread: node uap-sgs-db10 (2) is impending reconfig, flag 1, misstime 30406[ CSSD]2014-02-28 04:10:35.801 [5940] >TRACE: clssnmPollingThread: diskTimeout set to (57000)ms impending reconfig status(1)[ CSSD]2014-02-28 04:10:36.801 [5940] >WARNING: clssnmPollingThread: node uap-sgs-db10 (2) at 50 1.138327e-320artbeat fatal, eviction in 28.594 seconds[ CSSD]2014-02-28 04:10:50.801 [5940] >WARNING: clssnmPollingThread: node uap-sgs-db10 (2) at 75 1.145244e-320artbeat fatal, eviction in 14.594 seconds[ CSSD]2014-02-28 04:10:51.801 [5940] >WARNING: clssnmPollingThread: node uap-sgs-db10 (2) at 75 1.146232e-320artbeat fatal, eviction in 13.594 seconds[ CSSD]2014-02-28 04:10:59.801 [5940] >WARNING: clssnmPollingThread: node uap-sgs-db10 (2) at 90 1.151173e-320artbeat fatal, eviction in 5.594 seconds[ CSSD]2014-02-28 04:11:00.801 [5940] >WARNING: clssnmPollingThread: node uap-sgs-db10 (2) at 90 1.152161e-320artbeat fatal, eviction in 4.594 seconds[ CSSD]2014-02-28 04:11:01.801 [5940] >WARNING: clssnmPollingThread: node uap-sgs-db10 (2) at 90 1.153149e-320artbeat fatal, eviction in 3.594 seconds[ CSSD]2014-02-28 04:11:02.801 [5940] >WARNING: clssnmPollingThread: node uap-sgs-db10 (2) at 90 1.154137e-320artbeat fatal, eviction in 2.594 seconds[ CSSD]2014-02-28 04:11:03.801 [5940] >WARNING: clssnmPollingThread: node uap-sgs-db10 (2) at 90 1.155125e-320artbeat fatal, eviction in 1.594 seconds[ CSSD]2014-02-28 04:11:04.801 [5940] >WARNING: clssnmPollingThread: node uap-sgs-db10 (2) at 90 1.158090e-320artbeat fatal, eviction in 0.594 seconds[ CSSD]2014-02-28 04:11:05.411 [5940] >TRACE: clssnmPollingThread: Eviction started for node uap-sgs-db10 (2), flags 0x0001, state 3, wt4c 0[ CSSD]2014-02-28 04:11:05.411 [5920] >TRACE: clssnmDoSyncUpdate: Initiating sync 109[ CSSD]2014-02-28 04:11:05.411 [5920] >TRACE: clssnmDoSyncUpdate: diskTimeout set to (57000)ms[ CSSD]2014-02-28 04:11:05.411 [5920] >TRACE: clssnmSetupAckWait: Ack message type (11) [ CSSD]2014-02-28 04:11:05.411 [5920] >TRACE: clssnmSetupAckWait: node(1) is ALIVE[ CSSD]2014-02-28 04:11:05.411 [5920] >TRACE: clssnmSendSync: syncSeqNo(109)[ CSSD]2014-02-28 04:11:05.411 [5920] >TRACE: clssnmWaitForAcks: Ack message type(11), ackCount(1)[ CSSD]2014-02-28 04:11:05.411 [6080] >TRACE: clssnmHandleSync: diskTimeout set to (57000)ms[ CSSD]2014-02-28 04:11:05.411 [6080] >TRACE: clssnmHandleSync: Acknowledging sync: src[1] srcName[uap-sgs-db09] seq[1] sync[109][ CSSD]2014-02-28 04:11:05.411 [5956] >USER: NMEVENT_SUSPEND [00][00][00][06][ CSSD]2014-02-28 04:11:05.411 [5920] >TRACE: clssnmWaitForAcks: done, msg type(11)[ CSSD]2014-02-28 04:11:05.411 [5920] >TRACE: clssnmDoSyncUpdate: Terminating node 2, uap-sgs-db10, misstime(60015) state(5)[ CSSD]2014-02-28 04:11:05.411 [5920] >TRACE: clssnmSetupAckWait: Ack message type (13) [ CSSD]2014-02-28 04:11:05.411 [5920] >TRACE: clssnmSetupAckWait: node(1) is ACTIVE[ CSSD]2014-02-28 04:11:05.411 [5920] >TRACE: clssnmWaitForAcks: Ack message type(13), ackCount(1)[ CSSD]2014-02-28 04:11:05.411 [6080] >TRACE: clssnmSendVoteInfo: node(1) syncSeqNo(109)[ CSSD]2014-02-28 04:11:05.411 [5920] >TRACE: clssnmWaitForAcks: done, msg type(13)[ CSSD]2014-02-28 04:11:05.411 [5920] >TRACE: clssnmCheckDskInfo: Checking disk info...[ CSSD]2014-02-28 04:11:05.411 [5920] >TRACE: clssnmCheckDskInfo: node 2, uap-sgs-db10, state 5 with leader 2 has smaller cluster size 1; my cluster size 1 with leader 1[ CSSD]2014-02-28 04:11:05.411 [5920] >TRACE: clssnmEvict: Start[ CSSD]2014-02-28 04:11:05.411 [5920] >TRACE: clssnmEvict: Evicting node 2, uap-sgs-db10, birth 96, death 109, impendingrcfg 1, stateflags 0x1
From the CRS log, we can see that node Eviction occurs here. Because Node 1 finds that node 2 cannot communicate, node 2 also finds that it cannot communicate with node 1. In the end, the entire Instance decides to remove Node 2. Here we can see that the misscount time of our network heartbeat is 60 seconds. At around 50%, a prompt appears. CRS software further confirmed that there was a problem at the operating system network layer.
3. Operating System Logs
4:10:12 Tcpip information without 4202 N/A UAP-SGS-DB09 system detected network card Broadcom BCM5709C NetXtreme II GigE (ndis vbd Client) #2 and network disconnected, and the network card network configuration has been released. If the network card is not disconnected, it may cause a fault. Please contact your vendor for updated drivers. 4:10:06 l2nd warning No 4 N/A UAP-SGS-DB09 Broadcom BCM5709C: The network link is down. Check to make sure the network cable is properly connected.
In the end, we found that the network card went down at 4:10:06 on the operating system. However, our CRS generates a 50% heartbeat alarm at 4:10:36. In the end, node 2 is removed at 4:11:05, that is, 60 seconds after the misscount setting is reached.
Conclusion
We have analyzed the entire process. Under normal circumstances, simply remove a node when the network card goes down. The biggest problem here is that the IP address is lost instantly, although it causes Node 2 to be removed smoothly, however, during the recovery process, the SMON process of Node 1 still calls the bind function in socket programming. This function needs to re-bind the IP address and port. During the binding process, if the IP address is different from the previous IP address (for example, if the NIC is disabled, the IP address changes from 10.0.0.3 to 0.0.0.0 ). In the end, bind will fail, bind will fail, smon will fail to recover the transaction, and node 1 will also be down. So this case tells us that it is best to use multiple redundancy measures to avoid this problem.
Original address: ORA-00603, ORA-27501, ORA-27300, ORA-27301, ORA-27302 so, thanks to the original author to share.