Analysis of a RAC node restart caused by a heartbeat NIC failure

Source: Internet
Author: User


Database and CRS version: 10.2.0.4

Analysis of the outage sequence

| No. | Node | Time | Action | Log source |
|-----|------|------|--------|------------|
| 1 | II | Jul 22:48:15 | XXdb2 kernel: NETDEV WATCHDOG: eth1: transmit timed out | OS |
|   |    |              | bnx2: fw sync timeout, reset code = 1020015 |    |
| 2 | II | Jul 22:48:29 -- Jul 4 | CRS-1612: node XXdb1 (1) at 50% heartbeat fatal, eviction in 29.118 seconds | CRS |
|   |    |              | CRS-1610: node XXdb1 (1) at 90% heartbeat fatal, eviction in 5.128 seconds |     |
| 3 | II | Jul 22:54:14 | XXdb2 syslogd 1.4.1: restart | OS |
| 4 | II | Jul 22:54:14 | XXdb2 ifup: Device eth1 has different MAC address than expected, ignoring. | OS |
|   |    |              | XXdb2 network: Bringing up interface eth1: failed |    |
| 5 | II | Jul 5 01:22:27 -- Jul 5 01:58:49 | XXdb2 logger: Cluster Ready Services waiting on dependencies. Diagnostics in /tmp/crsctl.5659 | OS |
| 6 | II | Jul 5 01:59:30 | XXdb2 shutdown: shutting down for system reboot | OS |
| 7 | II | Jul 5 03:00:08 | CRS-1605: CSSD voting file is online: /dev/raw/raw18. Details in /home/oracle/product/10.2.0/crs/log/XXdb2/cssd/ocssd.log | CRS |
| 8 | I | Jul 23:00:00 | CRS-1612: node XXdb2 (2) at 50% heartbeat fatal, eviction in 29.144 seconds | CRS |
| 9 | I | Jul 23:04:55 | XXdb1 syslogd 1.4.1: restart | OS |
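The CRS heartbeat warnings in the log table follow a regular shape, so they can be pulled out of a log mechanically. The following is a minimal sketch, assuming the messages look exactly like the ones quoted above; the function name `parse_heartbeat_warnings` and the regular expression are illustrative and not part of any Oracle tool.

```python
import re

# Matches missed-heartbeat warnings of the form quoted in the table,
# e.g. "CRS-1612: node XXdb1 (1) at 50% heartbeat fatal, eviction in 29.118 seconds".
HEARTBEAT_RE = re.compile(
    r"(?P<code>CRS-16\d\d):\s*node\s+(?P<node>\S+)\s+\((?P<num>\d+)\)\s+at\s+"
    r"(?P<pct>\d+)%\s+heartbeat fatal,\s+eviction in\s+(?P<secs>[\d.]+)\s+seconds"
)


def parse_heartbeat_warnings(lines):
    """Return (code, node, percent, seconds_to_eviction) for each matching line."""
    found = []
    for line in lines:
        m = HEARTBEAT_RE.search(line)
        if m:
            found.append(
                (m.group("code"), m.group("node"), int(m.group("pct")), float(m.group("secs")))
            )
    return found


# Sample input copied from the log table above.
sample = [
    "CRS-1612: node XXdb1 (1) at 50% heartbeat fatal, eviction in 29.118 seconds",
    "CRS-1610: node XXdb1 (1) at 90% heartbeat fatal, eviction in 5.128 seconds",
    "XXdb2 syslogd 1.4.1: restart",
]

for warning in parse_heartbeat_warnings(sample):
    print(warning)
```

Lines that are not heartbeat warnings, such as the syslogd restart marker, are simply skipped, which makes it easy to see which node was counted down toward eviction and how quickly.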

The logs above show the fault unfolding as follows:

(1) On node 2, the operating system detects a transmit timeout on eth1, the heartbeat NIC. Node 2's heartbeat connection to node 1 then times out, and once the timeout expires the database cluster software forces the operating system on node 2 to restart.

(2) After node 2 restarts, eth1 fails to come up. CRS therefore waits for the resources it depends on and never starts; the diagnostics recorded in /tmp/crsctl.5659 show that it is waiting for the internal heartbeat NIC to start.

(3) After node 2 has been restarted, node 1's heartbeat connection to node 2 times out, and node 1 forcibly restarts its own operating system.

(4) The root cause is the failure of the heartbeat network on node 2. In addition, because the MAC address expected for the eth1 NIC does not match the card's actual MAC address, eth1 cannot start after the server is restarted.
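The MAC address mismatch in point (4) can be checked before a reboot exposes it. The sketch below is a minimal illustration, assuming a Red Hat-style interface file with an `HWADDR=` line (commonly /etc/sysconfig/network-scripts/ifcfg-eth1) and the kernel-reported address (commonly /sys/class/net/eth1/address). Both paths, the helper names, and the sample MAC addresses are assumptions for illustration and do not come from the incident.

```python
import re


def configured_hwaddr(ifcfg_text):
    """Extract the HWADDR value from ifcfg-style text, or None if it is not set."""
    m = re.search(
        r'^\s*HWADDR\s*=\s*"?([0-9A-Fa-f:]{17})"?\s*$', ifcfg_text, re.MULTILINE
    )
    return m.group(1).lower() if m else None


def mac_matches(ifcfg_text, actual_mac):
    """Return True if the configured HWADDR agrees with the actual MAC address.

    When no HWADDR is configured there is nothing to contradict, so this
    sketch treats that case as a match.
    """
    expected = configured_hwaddr(ifcfg_text)
    if expected is None:
        return True
    return expected == actual_mac.strip().lower()


# Hypothetical file content and addresses, for illustration only.
sample_ifcfg = "DEVICE=eth1\nHWADDR=00:1A:2B:3C:4D:5E\nONBOOT=yes\n"

print(mac_matches(sample_ifcfg, "00:1a:2b:3c:4d:5e\n"))
print(mac_matches(sample_ifcfg, "00:1a:2b:3c:4d:5f\n"))
```

On a real host the two inputs would be read from the interface configuration file and from the kernel's view of the device. A mismatch is the condition that produces the "different MAC address than expected" message seen in the OS log.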


Author: stepping on, who works on system performance optimization across system architecture, operating systems, storage devices, databases, middleware, and applications.




A lab Linux (SUSE) machine reboots automatically every hour. How can I find the exception information and stop the automatic restarts?

Check the various logs, including the system, application, and database logs. Automatic restarts can be caused by hardware problems (such as a motherboard fault or a disk array connection problem), custom scheduled system tasks, application faults or bugs (such as an application exhausting memory), or database faults (such as a RAC heartbeat network failure that makes RAC trigger a restart automatically). The cause can only be found by investigating step by step.
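One way to start such a step-by-step investigation is to scan the logs for lines that typically surround a restart. The following is a minimal sketch, assuming the log lines resemble the ones quoted earlier in this article; the labels and the function name `find_reboot_clues` are hypothetical, and the pattern list would need extending for other systems.

```python
import re

# Hypothetical label -> pattern pairs; the patterns are taken from the
# log lines quoted in this article and are not an exhaustive list.
REBOOT_CLUES = [
    ("nic_timeout", re.compile(r"transmit timed out")),
    ("heartbeat_warning", re.compile(r"CRS-16\d\d: .*heartbeat fatal")),
    ("orderly_shutdown", re.compile(r"shutting down for system reboot")),
    ("boot_marker", re.compile(r"syslogd [\d.]+: restart")),
]


def find_reboot_clues(lines):
    """Return (label, line) pairs for lines that match a known restart clue."""
    hits = []
    for line in lines:
        for label, pattern in REBOOT_CLUES:
            if pattern.search(line):
                hits.append((label, line.strip()))
                break  # report each line once, under the first matching label
    return hits


# Sample lines copied from the logs quoted in this article.
sample_log = [
    "XXdb2 kernel: netdev watchdog: eth1: transmit timed out",
    "CRS-1612: node XXdb2 (2) at 50% heartbeat fatal, eviction in 29.144 seconds",
    "XXdb2 logger: Cluster Ready Services waiting on dependencies.",
    "XXdb2 shutdown: shutting down for system reboot",
    "XXdb1 syslogd 1.4.1: restart",
]

for label, line in find_reboot_clues(sample_log):
    print(label, "|", line)
```

A boot marker with no orderly shutdown line before it suggests a forced reset or power loss rather than a scheduled task, which narrows down which of the causes listed above to examine first.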

The computer has to be powered on twice before it starts. The first boot takes more than 10 minutes and nothing appears on the display; after a warm restart it starts normally.

Hello, this is a typical hardware fault, but don't worry, it is not a serious one. It is caused by insufficient CPU startup voltage, and there are two possible reasons:
1. The capacitors that supply power to the CPU on the motherboard are damaged. Because they cannot hold enough charge, the startup voltage is too low and the computer cannot start normally. The first boot takes a long time because a damaged capacitor needs longer to charge to the normal startup voltage. A restart works because the capacitors have warmed up during the first start and charge easily the second time, essentially reaching the required voltage. That warm-up, however, takes about as long as the cold start.
2. The power supply is damaged and its output voltage is too low. In other words, the fault described above may be in the power supply rather than on the motherboard. The reasoning is the same as above.
In short, this fault is caused by insufficient CPU startup voltage. It is not a major fault, so don't worry.
Solution: replace the damaged capacitors. Usually more than one is affected.
If you do not have plenty of experience disassembling and assembling computers and basic electrical knowledge, take it to a repair shop for the replacement. The cost is low, about 10 or 20 yuan (a capacitor itself costs very little, so the money is really a labor fee; do not let yourself be overcharged).
If you do have plenty of computer assembly experience and basic electrical knowledge, you can do it yourself as follows:
1. Locate the fault. Swap in the power supply from an identical or similar computer and check whether the computer starts normally. If it does, your power supply is faulty, and you can either buy a new one or repair it yourself. If it does not, continue with the next step.
2. Check all the capacitors around the CPU on the motherboard (if you need to ask what a capacitor is, stop reading and take the computer to a repair shop). You will find that some capacitors are bulging or have even burst. These are the ones that need replacing. Note their models, buy identical ones at an electronics market, and swap them in. Make sure the soldering iron is hot enough when soldering: the solder should melt in one pass, and the iron must not stay on the joint for long. The new capacitor in particular must not be heated for too long, or it will be damaged.
If the fault is in the power supply, the procedure is similar: open the power supply and replace the capacitors. A power supply may also contain another kind of capacitor, one that looks like a small stone, so check carefully.
That is the solution. Weigh the options yourself.
Final suggestions:
1. If you do it yourself, the cost is only a few yuan, but there is some risk, because most of us are not professionals.
2. To be on the safe side, we recommend taking it to a repair shop. It will not cost much.

I hope this answer solves your problem.
