Dual-node RAC: troubleshooting frequent automatic restarts of both node hosts

Source: Internet
Author: User
1) Background:

Recently, an Oracle 10g RAC dual-node test platform was built in VMware, and Oracle RAC was upgraded from 10.2.0.1 to 10.2.0.5. Afterwards, both Linux systems were found to restart automatically on a frequent basis.

2) Platform information:
VMware 7 + OEL 5.7 x64 + ASMLib 2.0 + Oracle 10.2.0.5

3) /var/log/messages log:
NODE1: Linux1
Apr 18 20:44:18 Linux1 syslogd 1.4.1: restart.
Apr 18 20:44:18 Linux1 kernel: klogd 1.4.1, log source = /proc/kmsg started.
Apr 18 20:44:18 Linux1 kernel: Initializing cgroup subsys cpuset
Apr 18 20:44:18 Linux1 kernel: Initializing cgroup subsys cpu
Apr 18 20:44:18 Linux1 kernel: Linux version 2.6.32-200.13.1.el5uek (mockbuild@ca-build9.us.oracle.com) (gcc version 4.1.2 20080704 (Red Hat 4.1.2-50)) #1 SMP Wed Jul 27 21:02:33 EDT 2011
Apr 18 20:44:18 Linux1 kernel: Command line: ro root=/dev/VolGroup00/LogVol00 rhgb quiet
Apr 18 20:44:18 Linux1 kernel: KERNEL supported cpus:
Apr 18 20:44:18 Linux1 kernel: Intel GenuineIntel
Apr 18 20:44:18 Linux1 kernel: AMD AuthenticAMD
Apr 18 20:44:18 Linux1 kernel: Centaur CentaurHauls
Apr 18 20:44:18 Linux1 kernel: BIOS-provided physical RAM map:
Apr 18 20:44:18 Linux1 kernel: BIOS-e820: 0000000000000000-000000000009f800 (usable)
Apr 18 20:44:18 Linux1 kernel: BIOS-e820: 000000000009f800-00000000000a0000 (reserved)
Apr 18 20:44:18 Linux1 kernel: BIOS-e820: 00000000000ca000-00000000000cc000 (reserved)
Apr 18 20:44:18 Linux1 kernel: BIOS-e820: 00000000000dc000-00000000000e4000 (reserved)
Apr 18 20:44:18 Linux1 kernel: BIOS-e820: 00000000000e8000-0000000000100000 (reserved)
Apr 18 20:44:18 Linux1 kernel: BIOS-e820: 0000000000100000-00000000bfef0000 (usable)
Apr 18 20:44:18 Linux1 kernel: BIOS-e820: 00000000bfef0000-00000000bfeff000 (ACPI data)
Apr 18 20:44:18 Linux1 kernel: BIOS-e820: 00000000bfeff000-00000000bff00000 (ACPI NVS)
Apr 18 20:44:18 Linux1 kernel: BIOS-e820: 00000000bff00000-00000000c0000000 (usable)
Apr 18 20:44:18 Linux1 kernel: BIOS-e820: 00000000e0000000-00000000f0000000 (reserved)
Apr 18 20:44:18 Linux1 kernel: BIOS-e820: 00000000fec00000-00000000fec10000 (reserved)
Apr 18 20:44:18 Linux1 kernel: BIOS-e820: 00000000fee00000-00000000fee01000 (reserved)
Apr 18 20:44:18 Linux1 kernel: BIOS-e820: 00000000fffe0000-0000000100000000 (reserved)
Apr 18 20:44:18 Linux1 kernel: BIOS-e820: 0000000100000000-0000000140000000 (usable)
Apr 18 20:44:18 Linux1 kernel: DMI present.
NODE2: Linux2
Apr 18 20:43:35 Linux2 kernel: o2net: connection to node Linux1 (num 0) at 192.168.3.131:7777 has been idle for 30.0 seconds, shutting it down.
Apr 18 20:43:35 Linux2 kernel: (swapper, 1498): o2net_idle_timer: 1334752985.559806 here are some times that might help debug the situation: (tmr 1334753015.306532 now 1334752985.559360 dr 1334752985.559806 adv: 1334752985.559807 func (b651ea27: 504) 1334752951.27068: 1334752951.27323)
Apr 18 20:43:35 Linux2 kernel: o2net: no longer connected to node Linux1 (num 0) at 192.168.3.131:7777
Apr 18 20:43:56 Linux2 kernel: o2net: connection to node Linux1 (num 0) at 192.168.3.131:7777 shutdown, state 7
Apr 18 20:44:05 Linux2 kernel: (o2net, 3480,0): o2net_connect_expired: 1659 ERROR: no connection established with node 0 after 30.0 seconds, giving up and returning errors.
Apr 18 20:44:24 Linux2 avahi-daemon[4341]: Registering new address record for 192.168.0.136 on eth0.
Apr 18 20:44:26 Linux2 kernel: o2net: connection to node Linux1 (num 0) at 192.168.3.131:7777 shutdown, state 7
Apr 18 20:44:28 Linux2 last message repeated 2 times
Apr 18 20:44:28 Linux2 kernel: (o2hb-9938799A41, 3564,1): o2dlm_eviction_cb: 267 o2dlm has evicted node 0 from group 9938799A418642218A66FE77029DE473
Apr 18 20:44:28 Linux2 kernel: (ocfs2rec, 19793,1): ocfs2_replay_journal: 1605 Recovering node 0 from slot 0 on device (8, 65)
Apr 18 20:44:30 Linux2 kernel: o2net: connection to node Linux1 (num 0) at 192.168.3.131:7777 shutdown, state 8
Apr 18 20:44:31 Linux2 kernel: (ocfs2rec, 19793,0): ocfs2_begin_quota_recovery: 407 Beginning quota recovery in slot 0
Apr 18 20:44:31 Linux2 kernel: (ocfs2_wq, 3567,1): ocfs2_finish_quota_recovery: 598 Finishing quota recovery in slot 0
Apr 18 20:44:31 Linux2 kernel: (dlm_reco_thread, 3573,0): dlm_get_lock_resource: 836 Timeout: $RECOVERY: at least one node (0) to recover before lock mastery can begin
Apr 18 20:44:31 Linux2 kernel: (dlm_reco_thread, 3573,0): dlm_get_lock_resource: 870 9938799A418642218A66FE77029DE473: recovery map is not empty, but must master $RECOVERY lock
Apr 18 20:44:31 Linux2 kernel: (dlm_reco_thread, 3573,0): dlm_do_recovery: 523 (3573) Node 1 is the Recovery Master for the Dead Node 0 for Domain 9938799A418642218A66FE77029DE473
The roles in these messages swap between the two machines from one incident to the next: it is not always the same node that times out on the other.
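
To check which node reported a timeout against which peer across a long messages file, the o2net idle-timeout lines can be filtered out with a short script. The following is only a minimal sketch in Python: the regular expression is derived from the single log line quoted above, and the function name find_idle_timeouts and the field names are assumptions made for illustration, not part of any Oracle or OCFS2 tool.

```python
import re

# Pattern derived from the o2net idle-timeout line quoted in the log excerpt.
# It tolerates an optional space between the IP address and the port.
IDLE_RE = re.compile(
    r"^(?P<stamp>\w{3}\s+\d+\s+[\d:]+)\s+(?P<host>\S+)\s+kernel:\s+o2net:\s+"
    r"connection to node (?P<peer>\S+) \(num (?P<num>\d+)\) at "
    r"(?P<addr>[\d.]+):\s*(?P<port>\d+) has been idle for "
    r"(?P<idle>[\d.]+) seconds"
)


def find_idle_timeouts(lines):
    """Return one dict per o2net idle-timeout message found in `lines`."""
    events = []
    for line in lines:
        m = IDLE_RE.match(line)
        if m:
            events.append({
                "time": m.group("stamp"),
                "reporter": m.group("host"),   # node that logged the timeout
                "peer": m.group("peer"),       # node that went silent
                "idle_seconds": float(m.group("idle")),
            })
    return events


# Two lines taken from the excerpt above; only the first is a timeout event.
sample = [
    "Apr 18 20:43:35 Linux2 kernel: o2net: connection to node Linux1 (num 0) "
    "at 192.168.3.131: 7777 has been idle for 30.0 seconds, shutting it down.",
    "Apr 18 20:44:18 Linux1 syslogd 1.4.1: restart.",
]

events = find_idle_timeouts(sample)
for e in events:
    print(e["time"], e["reporter"], "timed out on", e["peer"],
          "after", e["idle_seconds"], "s")
```

Running the same function over /var/log/messages from both nodes would show whether the reporter and the peer swap between incidents, as described above.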


4) Error analysis: according to the messages log, the failure is triggered when the o2cb network idle timeout expires. The O2CB service status on the system is as follows:
[Oracle@Linux1] service o2cb status
Driver for "configfs": Loaded
Filesystem "configfs": Mounted
Stack glue driver: Loaded
Stack plugin "o2cb": Loaded
Driver for "ocfs2_dlmfs": Loaded
Filesystem "ocfs2_dlmfs": Mounted
Checking O2CB cluster ocfs2: Online
Heartbeat dead threshold = 301
Network idle timeout: 30000    (unit: milliseconds; this is the 30.0 seconds reported in the messages log)
Network keepalive delay: 2000
Network reconnect delay: 2000
Checking O2CB heartbeat: Active
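
The relationship between these settings and the 30.0 seconds seen in the kernel messages can be checked with a small calculation. The sketch below, in Python, parses the numeric lines of the status output shown above and converts them to seconds. The function name parse_o2cb_status is hypothetical, and the heartbeat conversion assumes the formula given in the OCFS2 documentation, (threshold - 1) * 2 seconds; treat both as illustrative assumptions rather than as output of the o2cb tool itself.

```python
import re

# Numeric lines copied from the `service o2cb status` output above.
STATUS = """\
Heartbeat dead threshold = 301
Network idle timeout: 30000
Network keepalive delay: 2000
Network reconnect delay: 2000
"""


def parse_o2cb_status(text):
    """Collect 'name: number' and 'name = number' lines into a dict."""
    settings = {}
    for line in text.splitlines():
        m = re.match(r"\s*([A-Za-z ]+?)\s*[:=]\s*(\d+)", line)
        if m:
            settings[m.group(1).strip().lower()] = int(m.group(2))
    return settings


settings = parse_o2cb_status(STATUS)

# Network values are reported in milliseconds.
idle_seconds = settings["network idle timeout"] / 1000

# Assumption: disk heartbeat timeout = (threshold - 1) * 2 seconds,
# as described in the OCFS2 documentation.
heartbeat_seconds = (settings["heartbeat dead threshold"] - 1) * 2

print("network idle timeout:", idle_seconds, "s")
print("heartbeat dead timeout:", heartbeat_seconds, "s")
```

With the values above, the network idle timeout works out to 30 seconds, matching the "idle for 30.0 seconds" message, while the disk heartbeat threshold is far larger, so the network timeout is the limit reached first.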
