Diagnosis and repair of an Oracle RAC VIP drift

Source: Internet
Author: User
Tags ping network

Background

A customer's Oracle 10g database VIP went down, causing the VIP to drift to another node.

Event support details

At 04:29:56, the VIP on machine 1 of 378 reported "went OFFLINE unexpectedly". To find the cause of the VIP outage after that day's drift fault, enable DEBUG level 5 for the VIP resource:

./crsctl debug log res "ora.hostname1.vip:5"

At 04:38:36, the VIP on node 1 again reported "went OFFLINE unexpectedly".

According to ora.hostname1.vip.log, the VIP went down because of poor communication between the public IP address and the default gateway.

Change the following parameter in the racgvip script from

FAIL_WHEN_DEFAULTGW_NO_FOUND=1

to

FAIL_WHEN_DEFAULTGW_NO_FOUND=0

However, after the adjustment, the fault still persists.

04:17:37.822: [CRSRES] [11025] 32ora.hostname1.vip on hostname1 went OFFLINE unexpectedly

To identify the cause, collect ora.hostname1.vip.log and the racgvip script again for analysis.

 

The analysis results are as follows:

 

The following code is available in the racgvip program:


# Check the status of the interface thro' pinging gateway
if [ -n "$DEFAULTGW" ]
then
  _RET=1
  # Get base IP address of the interface
  tmpIP=`$LSATTR -El ${_IF} -a netaddr | $AWK '{print $2}'`
  # Get RX packets numbers (bug8341569, 9157855 -> bug9743421)
  _O1=`$NETSTAT -n -I $_IF | $AWK "{if (/^$_IF/) {print \$(NF-4); exit}}"`
  x=$CHECK_TIMES
  while [ $x -gt 0 ]
  do
    if [ -n "$tmpIP" ]
    then
      logx "About to execute command: $PING -S $tmpIP $PING_TIMEOUT $DEFAULTGW"
      $PING -S $tmpIP $PING_TIMEOUT $DEFAULTGW > /dev/null 2>&1
    else
      logx "About to execute command: $PING $PING_TIMEOUT $DEFAULTGW"
      $PING $PING_TIMEOUT $DEFAULTGW > /dev/null 2>&1
    fi
    _O2=`$NETSTAT -n -I $_IF | $AWK "{if (/^$_IF/) {print \$(NF-4); exit}}"`
    if [ "$_O1" != "$_O2" ]
    then
      # RX packets numbers changed
      _RET=0
      break
    fi
    $SLEEP 1
    x=`$EXPR $x - 1`
  done
  if [ $_RET -ne 0 ]
  then
    logx "IsIfAlive: RX packets checked if=$_IF failed"
  else
    logx "IsIfAlive: RX packets checked if=$_IF OK"
  fi
else
  logx "IsIfAlive: Default gateway is not defined (host=$HOSTNAME)"
  if [ $FAIL_WHEN_DEFAULTGW_NO_FOUND -eq 1 ]
  then
    _RET=1
  else
    _RET=0
  fi
fi
 

The source code shows how the default gateway check works:

1. If a default gateway is detected, the gateway check logic runs.

2. _O1 records the RX packet count of the interface.

3. $PING -S $tmpIP $PING_TIMEOUT $DEFAULTGW pings the gateway.

4. _O2 records the RX packet count again.

5. If _O1 and _O2 differ, the interface is receiving packets, so communication between the NIC and the default gateway is considered normal and _RET is set to 0.

6. If _O1 and _O2 are the same after every attempt, _RET is never reassigned and keeps its initial value of 1; the log message "IsIfAlive: RX packets checked if=$_IF failed" then reports that the NIC check failed.
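To make the counter collection in steps 2 and 4 concrete, here is a minimal sketch of pulling the input-packet column out of netstat-style text with the same $(NF-4) field reference. The helper names rx_packets and sample_output are hypothetical, and the sample text (including the MAC address and counter values) is invented to resemble AIX `netstat -n -I` output. The sketch matches the interface name exactly on the first field instead of using the regex from the excerpt.

```shell
#!/bin/sh
# Print the Ipkts (input packets) column for one interface.
# $1 = interface name; netstat-style text is read from stdin.
rx_packets() {
    awk -v ifname="$1" '$1 == ifname { print $(NF-4); exit }'
}

# Invented sample resembling AIX `netstat -n -I en5` output.
sample_output() {
    cat <<'EOF'
Name  Mtu   Network     Address          Ipkts Ierrs    Opkts Oerrs  Coll
en5   1500  link#2      0.11.22.33.44.55 120345     0    98321     0     0
en5   1500  10.0.241    10.0.241.150     120345     0    98321     0     0
EOF
}

# With nine columns, NF-4 is column 5, the Ipkts counter.
sample_output | rx_packets en5
```

Because the check compares this counter before and after the ping, any received packet on the interface counts as success, not only the ping reply.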

The FAIL_WHEN_DEFAULTGW_NO_FOUND parameter was changed from 1 to 0 in an attempt to skip the gateway ping check. However, the source code shows that this parameter takes effect only when $DEFAULTGW is empty, that is, when no default gateway is configured on the host. In that case, any value other than 1 makes _RET return 0.

 

Because DEFAULTGW is obtained successfully in our environment and is not empty, the program never reaches the branch that tests whether FAIL_WHEN_DEFAULTGW_NO_FOUND is 1.
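The branch selection described above can be sketched in isolation. This is only an illustration: branch_taken is a hypothetical helper, and just the if/else shape mirrors the racgvip excerpt. It suggests why changing the flag had no effect while a gateway was defined.

```shell
#!/bin/sh
# Report which branch of the original check runs.
# $1 = default gateway (may be empty), $2 = flag value.
branch_taken() {
    DEFAULTGW=$1
    FAIL_WHEN_DEFAULTGW_NO_FOUND=$2
    if [ -n "$DEFAULTGW" ]
    then
        # The flag is never consulted on this path.
        echo "gateway-check"
    elif [ "$FAIL_WHEN_DEFAULTGW_NO_FOUND" -eq 1 ]
    then
        echo "no-gateway: _RET=1"
    else
        echo "no-gateway: _RET=0"
    fi
}

branch_taken 10.0.241.254 1
branch_taken 10.0.241.254 0
branch_taken "" 1
branch_taken "" 0
```

The first two calls take the same path regardless of the flag, which matches the behaviour seen after the first parameter change.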

 

The DEBUG error message during the fault is as follows:

 


04:17:37.776: [RACG] [1] [18219068] [1] [ora.s9lp1.vip]: Wed Nov 6 04:17:33 CST 2013 [6422696] checkIf: start for if=en5

Wed Nov 6 04:17:33 CST 2013 [6422696] IsIfAlive: start for if=en5

Wed Nov 6 04:17:33 CST 2013 [6422696] defaultgw: started

 

04:17:37.776: [RACG] [1] [18219068] [1] [ora.s9lp1.vip]: Wed Nov 6 04:17:33 CST 2013 [6422696] defaultgw: completed with 10.0.241.254 (the gateway is obtained successfully: 10.0.241.254)

Wed Nov 6 04:17:33 CST 2013 [6422696] About to execute command: /usr/sbin/ping -S 10.0.241.150 -c 1 -w 1 10.0.241.254

 

04:17:37.777: [RACG] [1] [18219068] [1] [ora.s9lp1.vip]: Wed Nov 6 04:17:35 CST 2013 [6422696] About to execute command: /usr/sbin/ping -S 10.0.241.150 -c 1 -w 1 10.0.241.254 (ping the gateway)

Wed Nov 6 04:17:37 CST 2013 [6422696] IsIfAlive: RX packets checked if=en5 failed (en5 is judged to have failed)
 

1. Each fault occurred at around 04:00. The times are:

 

04:29:56

04:38:36

04:17:37

2. According to the source code, the RX packet count of network adapter en5 did not change during the check, in which each ping waits at most 1 second.
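The fault times in item 1 can be gathered from a CRS log by filtering on the "went OFFLINE unexpectedly" message. The following is a sketch only: offline_times is a hypothetical helper, and the sample lines are invented for illustration (only the three times and the message text come from this article; the millisecond values and the unrelated second line are made up).

```shell
#!/bin/sh
# Print the HH:MM:SS part of every "went OFFLINE unexpectedly" line.
# Log text is read from stdin; the time is assumed to start the line.
offline_times() {
    grep 'went OFFLINE unexpectedly' | awk '{ print substr($1, 1, 8) }'
}

# Invented sample input in the style of the log lines quoted above.
offline_times <<'EOF'
04:29:56.000: [CRSRES] [11025] 32ora.hostname1.vip on hostname1 went OFFLINE unexpectedly
04:30:01.000: [CRSRES] [11025] 32some unrelated message
04:38:36.000: [CRSRES] [11025] 32ora.hostname1.vip on hostname1 went OFFLINE unexpectedly
04:17:37.822: [CRSRES] [11025] 32ora.hostname1.vip on hostname1 went OFFLINE unexpectedly
EOF
```

If the real log lines carry a date before the time, the field index in the awk expression would need to change accordingly.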

 

Possible cause:

ping -S 10.0.241.150 -c 1 -w 1 10.0.241.254

When Oracle checks the gateway, poor network quality prevents ping from returning a result within 1 second.

As a result, the RX packet count of en5 does not change after the ping.
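The failure mode just described can be reproduced without a network by feeding canned counter readings into the same comparison loop. In this sketch, check_rx is a hypothetical helper: the first argument stands for the reading taken before the first ping, and each further argument stands for the reading after one ping attempt (so the number of further arguments plays the role of CHECK_TIMES). The one-second sleep between attempts is left out.

```shell
#!/bin/sh
# Return code logic of the RX-counter check, using canned readings.
# $1 = counter before pinging; $2.. = counter after each attempt.
check_rx() {
    _O1=$1
    shift
    _RET=1
    for _O2 in "$@"
    do
        if [ "$_O1" != "$_O2" ]
        then
            # Counter moved: the interface received at least one packet.
            _RET=0
            break
        fi
    done
    echo "$_RET"
}

check_rx 120345 120345 120345   # no packet arrives on either attempt
check_rx 120345 120345 120346   # a packet arrives on the second attempt
```

The first call prints 1, which corresponds to the "RX packets checked ... failed" message; the second prints 0.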

Based on the above analysis, we suggest:

 

1. Modify the racgvip source code to skip the gateway check.

Before modification:

# Check the status of the interface thro' pinging gateway
if [ -n "$DEFAULTGW" ]

After modification:

# Check the status of the interface thro' pinging gateway
if [ -n "$DEFAULTGW" -a $FAIL_WHEN_DEFAULTGW_NO_FOUND -eq 1 ]
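To see what the modified condition does for each combination of gateway and flag, it can be evaluated on its own. This is a sketch under the assumption that only the test expression matters here; runs_gateway_check is a hypothetical helper wrapped around that expression.

```shell
#!/bin/sh
# Does the modified condition enter the gateway-check branch?
# $1 = default gateway (may be empty), $2 = flag value.
runs_gateway_check() {
    DEFAULTGW=$1
    FAIL_WHEN_DEFAULTGW_NO_FOUND=$2
    if [ -n "$DEFAULTGW" -a "$FAIL_WHEN_DEFAULTGW_NO_FOUND" -eq 1 ]
    then
        echo yes
    else
        echo no
    fi
}

runs_gateway_check 10.0.241.254 1
runs_gateway_check 10.0.241.254 0
runs_gateway_check "" 1
runs_gateway_check "" 0
```

Only the first combination prints yes. With the flag set to 0, a host that has a gateway now falls through to the else branch, which is the branch that logs "Default gateway is not defined".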
 

 

 

This modification follows the racgvip code shipped with Oracle 11.2.0.3.
 

The following is the racgvip code from Oracle 11g:


if [ -n "$DEFAULTGW" -a $FAIL_WHEN_DEFAULTGW_NOT_FOUND -eq 1 ]
then
  _RET=1
  # Get RX packets numbers
  _O1=`$IFCONFIG $_IF | $AWK '{if (/RX packets:/) {sub("packets:", "", $2); print $2}}'`
  x=$CHECK_TIMES
  while [ $x -gt 0 ]
  do
    logx "About to execute $PING -r -I $_IF $DEFAULTGW $PING_TIMEOUT"
    $PING -r -I $_IF $DEFAULTGW $PING_TIMEOUT > /dev/null 2>&1
    rc=$?
    if [ $rc -eq 0 ]
    then
      _RET=0
      break
    else
      echo "ping to $DEFAULTGW via $_IF failed, rc=$rc (host=$HOSTNAME)"
    fi
    x=$(($x - 1))
  done
 

Conclusion and Solution

Modify the racgvip code as described above.

After the modification, the following message appears in ora.s9lp1.vip.log:

IsIfAlive: Default gateway is not defined (host=$HOSTNAME)

This message comes from the else branch, which indicates that the modified condition is in effect and the gateway ping check is skipped.
