Diagnosis and repair of a RAC VIP drift
Background
The VIP of the customer's Oracle 10g database went down, causing the VIP to fail over to another node.
Event handling details
At 04:29:56, the VIP of machine 1 of 378 reported "went OFFLINE unexpectedly". To find the cause of the VIP going down, DEBUG level 5 was enabled for the VIP resource after the VIP drift on the same day:
./crsctl debug log res "ora.hostname1.vip:5"
At 04:38:36, the VIP of node 1 reported "went OFFLINE unexpectedly" again.
According to ora.hostname.vip.log, the VIP went down because of poor communication between the public IP address and the default gateway.
The parameter in the racgvip script was changed from
FAIL_WHEN_DEFAULTGW_NO_FOUND=1
to
FAIL_WHEN_DEFAULTGW_NO_FOUND=0
However, the fault persisted after the adjustment.
04:17:37.822: [CRSRES] [11025] 32ora.hostname1.vip on hostname1 went OFFLINE unexpectedly
To identify the cause, ora.hostname1.vip.log and the racgvip script were collected again for analysis.
The analysis results are as follows:
The racgvip script contains the following code:
# Check the status of the interface thro' pinging gateway
if [ -n "$DEFAULTGW" ]
then
  _RET=1
  # Get base IP address of the interface
  tmpIP=`$LSATTR -El ${_IF} -a netaddr | $AWK '{print $2}'`
  # Get RX packets numbers (bug8341569, 9157855 -> bug9743421)
  _O1=`$NETSTAT -n -I $_IF | $AWK "{if (/^$_IF/) {print \$(NF-4); exit}}"`
  x=$CHECK_TIMES
  while [ $x -gt 0 ]
  do
    if [ -n "$tmpIP" ]
    then
      logx "About to execute command: $PING -S $tmpIP $PING_TIMEOUT $DEFAULTGW"
      $PING -S $tmpIP $PING_TIMEOUT $DEFAULTGW > /dev/null 2>&1
    else
      logx "About to execute command: $PING $PING_TIMEOUT $DEFAULTGW"
      $PING $PING_TIMEOUT $DEFAULTGW > /dev/null 2>&1
    fi
    _O2=`$NETSTAT -n -I $_IF | $AWK "{if (/^$_IF/) {print \$(NF-4); exit}}"`
    if [ "$_O1" != "$_O2" ]
    then
      # RX packets numbers changed
      _RET=0
      break
    fi
    $SLEEP 1
    x=`$EXPR $x - 1`
  done
  if [ $_RET -ne 0 ]
  then
    logx "IsIfAlive: RX packets checked if=$_IF failed"
  else
    logx "IsIfAlive: RX packets checked if=$_IF OK"
  fi
else
  logx "IsIfAlive: Default gateway is not defined (host=$HOSTNAME)"
  if [ $FAIL_WHEN_DEFAULTGW_NO_FOUND -eq 1 ]
  then
    _RET=1
  else
    _RET=0
  fi
fi
From the source code, the default gateway is handled as follows:
1. If a default gateway is detected, the gateway check logic is executed.
2. _O1 records the RX packet count of the interface.
3. $PING -S $tmpIP $PING_TIMEOUT $DEFAULTGW pings the gateway.
4. _O2 records the RX packet count of the interface again.
5. If _O1 differs from _O2, communication between the NIC and the default gateway is normal, and _RET is set to 0.
6. If _O1 equals _O2 on every attempt, _RET is never reassigned and keeps its initial value of 1, and the log message "IsIfAlive: RX packets checked if=$_IF failed" indicates that the NIC check failed.
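The steps above can be sketched as a small simulation. This is only an illustration under stated assumptions: read_rx and ping_gw are hypothetical stand-ins for the netstat and ping calls in racgvip, RX_COUNT and GW_REPLIES are made-up variables, and the one-second sleep between attempts is left out.

```shell
#!/bin/sh
# Simulation of the RX-counter check; nothing here touches a real interface.

RX_COUNT=100          # simulated RX packet counter of the interface
GW_REPLIES=1          # 1: the gateway answers the ping, 0: no reply arrives

read_rx() {           # stand-in for: netstat -n -I <if> | awk ...
    echo "$RX_COUNT"
}

ping_gw() {           # stand-in for: ping -S <ip> -c 1 -w 1 <gateway>
    if [ "$GW_REPLIES" -eq 1 ]; then
        RX_COUNT=$((RX_COUNT + 1))   # a reply increments the RX counter
    fi
}

is_if_alive() {       # $1 = number of attempts (CHECK_TIMES in racgvip)
    ret=1                            # pessimistic default, as in racgvip
    o1=$(read_rx)
    x=$1
    while [ "$x" -gt 0 ]; do
        ping_gw
        o2=$(read_rx)
        if [ "$o1" != "$o2" ]; then
            ret=0                    # counter moved: interface is alive
            break
        fi
        x=$((x - 1))
    done
    return "$ret"
}
```

With GW_REPLIES=0 the counter never moves, so is_if_alive 2 returns 1, which corresponds to the "RX packets checked ... failed" case in step 6.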
The FAIL_WHEN_DEFAULTGW_NO_FOUND parameter was changed from 1 to 0 in order to skip the gateway ping check. However, the source code shows that this parameter takes effect only when $DEFAULTGW is empty, that is, when no gateway is configured on the host. Only in that case, and only if the parameter is not 1, is _RET set to 0.
Because DEFAULTGW is obtained successfully in our environment and is not empty, the script never reaches the branch that tests whether FAIL_WHEN_DEFAULTGW_NO_FOUND is 1.
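A minimal sketch of that branch selection, assuming a hypothetical helper gw_check_branch that only reports which path the original condition would take (it runs no ping and is not part of racgvip):

```shell
#!/bin/sh
# gw_check_branch is a hypothetical helper for illustration only.
gw_check_branch() {   # $1 = DEFAULTGW, $2 = FAIL_WHEN_DEFAULTGW_NO_FOUND
    if [ -n "$1" ]; then
        echo "ping-check"            # the flag is never consulted here
    elif [ "$2" -eq 1 ]; then
        echo "no-gateway:fail"       # _RET=1
    else
        echo "no-gateway:ok"         # _RET=0
    fi
}

gw_check_branch "10.0.241.254" 1
gw_check_branch "10.0.241.254" 0
gw_check_branch "" 0
```

The first two calls both print ping-check, which is why changing the flag alone had no effect on a host with a configured gateway.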
The DEBUG messages logged during the fault are as follows:
04:17:37.776: [RACG] [1] [18219068] [1] [ora.s9lp1.vip]: Wed Nov 6 04:17:33 CST 2013 [6422696] checkIf: start for if=en5
Wed Nov 6 04:17:33 CST 2013 [6422696] IsIfAlive: start for if=en5
Wed Nov 6 04:17:33 CST 2013 [6422696] defaultgw: started
04:17:37.776: [RACG] [1] [18219068] [1] [ora.s9lp1.vip]: Wed Nov 6 04:17:33 CST 2013 [6422696] defaultgw: completed with 10.0.241.254 (the gateway is obtained successfully: 10.0.241.254)
Wed Nov 6 04:17:33 CST 2013 [6422696] About to execute command: /usr/sbin/ping -S 10.0.241.150 -c 1 -w 1 10.0.241.254
04:17:37.777: [RACG] [1] [18219068] [1] [ora.s9lp1.vip]: Wed Nov 6 04:17:35 CST 2013 [6422696] About to execute command: /usr/sbin/ping -S 10.0.241.150 -c 1 -w 1 10.0.241.254 (ping the gateway)
Wed Nov 6 04:17:37 CST 2013 [6422696] IsIfAlive: RX packets checked if=en5 failed (en5 is judged as failed)
1. Every fault occurred at around 04:00. The times are as follows:
04:29:56
04:38:36
04:17:37
2. According to the source code analysis, the RX packet count of network adapter en5 stayed unchanged for one second at a stretch during the fault.
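To check whether the failures really cluster around the same hour, the "went OFFLINE unexpectedly" lines can be counted per hour. The sketch below is an assumption-laden illustration: offline_by_hour is a hypothetical helper, it reads log text from standard input, and it takes the first HH:MM:SS timestamp on each matching line as the event time.

```shell
#!/bin/sh
# offline_by_hour: print "<hour> <count>" for every hour in which a
# "went OFFLINE unexpectedly" message was logged.
offline_by_hour() {
    awk '/went OFFLINE unexpectedly/ {
        # take the hour from the first HH:MM:SS timestamp on the line
        if (match($0, /[0-9][0-9]:[0-9][0-9]:[0-9][0-9]/))
            print substr($0, RSTART, 2)
    }' | sort | uniq -c | awk '{ print $2, $1 }'
}
```

It would be used as offline_by_hour < crsd.log, where the log file name and location depend on the installation.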
Possible cause:
ping -S 10.0.241.150 -c 1 -w 1 10.0.241.254
When Oracle checks the gateway, ping cannot return a result within 1 second because of poor network quality.
As a result, the RX packet count of en5 does not change after the ping.
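One way to test this hypothesis would be to repeat the same ping around the time of the faults and count how often it fails. The following is a rough sketch, not part of racgvip: probe_gateway is a hypothetical helper that runs whatever command it is given a fixed number of times and prints the number of non-zero exits.

```shell
#!/bin/sh
# probe_gateway COUNT CMD [ARGS...]
# Run CMD COUNT times and print how many runs exited with a non-zero status.
probe_gateway() {
    n=$1
    shift
    fails=0
    i=0
    while [ "$i" -lt "$n" ]; do
        "$@" > /dev/null 2>&1 || fails=$((fails + 1))
        i=$((i + 1))
    done
    echo "$fails"
}
```

For example, probe_gateway 60 /usr/sbin/ping -S 10.0.241.150 -c 1 -w 1 10.0.241.254 would issue the ping from the log 60 times. A noticeably higher failure count around 04:00 than at other times would support the poor-network explanation.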
Based on the above analysis, we suggest:
1. Modify the racgvip source code to skip the gateway check
Before modification:
# Check the status of the interface thro' pinging gateway
if [ -n "$DEFAULTGW" ]
After modification:
# Check the status of the interface thro' pinging gateway
if [ -n "$DEFAULTGW" -a $FAIL_WHEN_DEFAULTGW_NO_FOUND -eq 1 ]
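The effect of the modified condition can be illustrated with a small sketch. modified_check is a hypothetical helper that mirrors only the branch selection; it uses two separate tests joined with &&, which behaves the same as the -a form for these inputs.

```shell
#!/bin/sh
# modified_check is a hypothetical helper for illustration only.
modified_check() {    # $1 = DEFAULTGW, $2 = FAIL_WHEN_DEFAULTGW_NO_FOUND
    if [ -n "$1" ] && [ "$2" -eq 1 ]; then
        echo "ping-check"            # gateway is pinged as before
    elif [ "$2" -eq 1 ]; then
        echo "else-branch _RET=1"    # no gateway and flag is 1
    else
        echo "else-branch _RET=0"    # check skipped, interface treated as OK
    fi
}

modified_check "10.0.241.254" 0      # gateway defined, flag set to 0
```

With a gateway defined and the flag set to 0, the call takes the else branch, which is the branch that logs "Default gateway is not defined" in racgvip even though a gateway exists.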
This change follows the racgvip code of Oracle 11.2.0.3, which was used as the reference for modifying the script again.
The following is the racgvip code of Oracle 11g:
if [ -n "$DEFAULTGW" -a $FAIL_WHEN_DEFAULTGW_NOT_FOUND -eq 1 ]
then
  _RET=1
  # Get RX packets numbers
  _O1=`$IFCONFIG $_IF | $AWK '{if (/RX packets:/) {sub("packets:", "", $2); print $2}}'`
  x=$CHECK_TIMES
  while [ $x -gt 0 ]
  do
    logx "About to execute $PING -r -I $_IF $DEFAULTGW $PING_TIMEOUT"
    $PING -r -I $_IF $DEFAULTGW $PING_TIMEOUT > /dev/null 2>&1
    rc=$?
    if [ $rc -eq 0 ]
    then
      _RET=0
      break
    else
      echo "ping to $DEFAULTGW via $_IF failed, rc=$rc (host=$HOSTNAME)"
    fi
    x=$(($x-1))
  done
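Conclusion and Solution
Modify the racgvip code as described above.
After the modification, the following message appears in ora.s9lp1.vip.log:
IsIfAlive: Default gateway is not defined (host=$HOSTNAME)
This indicates that the modification has taken effect: with FAIL_WHEN_DEFAULTGW_NO_FOUND set to 0, the modified condition is false, so the script takes the else branch and skips the gateway ping.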