When running heartbeat in production there are many details to watch. Overlooking one of them can leave heartbeat unable to fail over, or can cause split-brain. This article describes a split-brain incident caused by iptables.
Master: 192.168.3.218
Heartbeat IP: 192.168.4.218
Host name: usvr-218
Standby: 192.168.3.128
Heartbeat IP: 192.168.4.128
Host name: usvr-128
Symptom: When heartbeat is started on the master, the VIP comes up on 218. When heartbeat is then started on the standby, the VIP also comes up on 128.
Troubleshooting steps:
1. Check the logs on the master and the standby.
Logs on master 218 (excerpt):
HEARTBEAT[27330]: 2015/01/27_09:05:29 error:message hist queue is filling up ($ messages in queue)
HEARTBEAT[27330]: 2015/01/27_09:05:30 error:message hist queue is filling up ($ messages in queue)
HEARTBEAT[27330]: 2015/01/27_09:05:30 error:message hist queue is filling up ($ messages in queue)
HEARTBEAT[27330]: 2015/01/27_09:05:31 error:message hist queue is filling up ($ messages in queue)
HEARTBEAT[27330]: 2015/01/27_09:05:32 error:message hist queue is filling up ($ messages in queue)
HEARTBEAT[27330]: 2015/01/27_09:05:32 error:message hist queue is filling up ($ messages in queue)
HEARTBEAT[27330]: 2015/01/27_09:05:33Warn:node Usvr-128:is Dead
HEARTBEAT[27330]: 2015/01/27_09:05:33 info:cancelling pending standby operation
HEARTBEAT[27330]: 2015/01/27_09:05:33 info:dead node usvr-128 gave up resources.
HEARTBEAT[27330]: 2015/01/27_09:05:33 info:all clients is now resumed
HEARTBEAT[27330]: 2015/01/27_09:05:33 error:lowseq cannnot be greater than ackseq
HEARTBEAT[27330]: 2015/01/27_09:05:33 info:hist->ackseq =74575, old_ackseq=0
HEARTBEAT[27330]: 2015/01/27_09:05:33 info:hist->lowseq =74576, hist->hiseq=74824, send_cluster_msg_level=1
HEARTBEAT[27333]: 2015/01/27_09:05:34 crit:emergency shutdown:master Control process died.
HEARTBEAT[27333]: 2015/01/27_09:05:34 crit:killing pid 27330 with SIGTERM
HEARTBEAT[27333]: 2015/01/27_09:05:34 crit:killing pid 27334 with SIGTERM
HEARTBEAT[27333]: 2015/01/27_09:05:34 crit:killing pid 27335 with SIGTERM
HEARTBEAT[27333]: 2015/01/27_09:05:34 crit:killing pid 27336 with SIGTERM
HEARTBEAT[27333]: 2015/01/27_09:05:34 crit:killing pid 27337 with SIGTERM
HEARTBEAT[27333]: 2015/01/27_09:05:34 crit:emergency Shutdown (MCP dead): killing ourselves.
Standby 128 logs are as follows (only partial logs are listed):
Jan 10:11:35 Heartbeat: [15999]: Info:glib:ucast:bound receive socket to Device:eth0
Jan 10:11:35 Heartbeat: [15999]: Info:glib:ucast:set so_reuseport (W)
Jan 10:11:35 Heartbeat: [15999]: info:glib:ucast:started on Ports 694 interface eth0 to 192.168.4.218
Jan 10:11:35 Heartbeat: [15999]: info:glib:ping Heartbeat started.
Jan 10:11:35 Heartbeat: [15999]: info:G_main_add_TriggerHandler:Added Signal Manual Handler
Jan 10:11:35 Heartbeat: [15999]: info:G_main_add_TriggerHandler:Added Signal Manual Handler
Jan 10:11:35 Heartbeat: [15999]: info:G_main_add_SignalHandler:Added signal handler for signal 17
Jan 10:11:35 Heartbeat: [15999]: info:local status now set to: ' Up '
Jan 10:11:35 Heartbeat: [15999]: Info:link 192.168.3.1:192.168.3.1 up.
Jan 10:11:35 Heartbeat: [15999]: Info:status Update for node 192.168.3.1:status Ping
Jan 10:13:35 Heartbeat: [15999]:Warn:node Usvr-218:is Dead
Jan 10:13:35 Heartbeat: [15999]: info:comm_now_up (): Updating status to Active
Jan 10:13:35 Heartbeat: [15999]: info:local status now set to: ' Active '
Jan 10:13:35 Heartbeat: [15999]: info:starting child Client "/usr/lib64/heartbeat/ipfail" (498,498)
Jan 10:13:35 Heartbeat: [15999]: Warn:no STONITH device configured.
Jan 10:13:35 Heartbeat: [15999]: warn:shared disks is not protected.
Jan 10:13:35 Heartbeat: [15999]: Info:resources being acquired from localsv218.
As the logs show, each node declares the other node dead and takes over the VIP, which is a split-brain.
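The same check can be scripted instead of reading the logs by eye. This is a minimal sketch: dead_nodes is a hypothetical helper (not part of heartbeat), and the log path in the example is an assumption, so substitute whatever log file your ha.cf points at.

```shell
# dead_nodes FILE: print each node name that the heartbeat log FILE
# declares dead, lower-cased and de-duplicated.
# Hypothetical helper, not part of heartbeat.
dead_nodes() {
    grep -ioE 'node [^ :]+: ?is dead' "$1" |
        awk '{ sub(/:.*/, "", $2); print tolower($2) }' |
        sort -u
}

# Example (the log path is an assumption):
#   dead_nodes /var/log/ha-log
```

Run it on both nodes. If each one prints the other's host name while both machines are actually up, the cluster is in split-brain.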
2. My first guess was that the master and the standby could not communicate, or that their clocks were out of sync. A small clock difference has little effect on heartbeat, but a large one is certain to cause problems, so I synchronized the time on both nodes:
/usr/sbin/ntpdate ntp.api.bz && hwclock -w
echo "0 * * * * root /usr/sbin/ntpdate ntp.api.bz && hwclock -w >/dev/null 2>&1" >> /etc/crontab
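To see how far apart the two clocks actually are, before and after syncing, their epoch times can be compared. A rough sketch: clock_skew is a hypothetical helper, and the example assumes ssh access from the master to usvr-128, which the original setup does not mention.

```shell
# clock_skew A B: absolute difference in seconds between two epoch times.
# Hypothetical helper for illustration.
clock_skew() {
    d=$(( $1 - $2 ))
    echo "${d#-}"    # strip a leading minus sign, if any
}

# Example, run on the master (ssh access is an assumption):
#   clock_skew "$(date +%s)" "$(ssh usvr-128 date +%s)"
```

The result includes the ssh round-trip time, so treat a difference of a second or two as noise.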
3. After that, the logs still showed the same errors. I checked the main configuration file again and found no problem. The only remaining factor was the firewall running on the master and the standby. Heartbeat is configured to communicate over UDP port 694, so UDP port 694 has to be allowed through the firewall.
On master 218, add:
/sbin/iptables -A INPUT -i eth0 -p udp -s 192.168.4.128 --dport 694 -m comment --comment "heartbeat-slave" -j ACCEPT
On standby 128, add:
/sbin/iptables -A INPUT -i eth0 -p udp -s 192.168.4.218 --dport 694 -m comment --comment "heartbeat-master" -j ACCEPT
Note: 1. If the firewall policy is strict, the heartbeat IP must be allowed through as well, or the UDP communication will still fail.
2. The interface named in the rule (eth0 here) must be the network adapter that carries the heartbeat IP.
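To confirm that the rule really is in place on a node, the output of iptables -S INPUT can be filtered for it. A sketch under the assumption that the rule is listed in the usual iptables -S form; has_heartbeat_rule is a hypothetical helper that reads the rule listing on standard input.

```shell
# has_heartbeat_rule PEER_IP: read `iptables -S INPUT` output on stdin and
# succeed if some rule accepts UDP port 694 from PEER_IP.
# Hypothetical helper for illustration.
has_heartbeat_rule() {
    awk -v peer="$1" '
        /-p udp/ && /--dport 694/ && /-j ACCEPT/ {
            for (i = 1; i < NF; i++)
                if ($i == "-s" && ($(i + 1) == peer || $(i + 1) == peer "/32"))
                    found = 1
        }
        END { exit !found }'
}

# Example, run as root on master 218:
#   /sbin/iptables -S INPUT | has_heartbeat_rule 192.168.4.128 && echo "rule present"
```

This only checks that a matching ACCEPT rule exists. It does not check rule order, so an earlier DROP or REJECT rule can still block the traffic.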
With the firewall configured, the master and the standby communicate normally and the master holds the VIP. When the master goes down, or its heartbeat service is stopped, the standby takes over the VIP.
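A quick way to verify this is to check on each node whether the VIP is currently configured. A minimal sketch: holds_vip is a hypothetical helper that reads ip -o -4 addr show output on standard input, and the VIP 192.168.3.100 in the example is made up, since the real VIP is not given here.

```shell
# holds_vip VIP: read `ip -o -4 addr show` output on stdin and succeed
# if VIP is configured on any interface.
# Hypothetical helper for illustration.
holds_vip() {
    awk -v vip="$1" '
        $3 == "inet" {
            split($4, a, "/")        # drop the /prefix length
            if (a[1] == vip) found = 1
        }
        END { exit !found }'
}

# Example (the VIP address is an assumption):
#   ip -o -4 addr show | holds_vip 192.168.3.100 && echo "this node holds the VIP"
```

In a healthy cluster exactly one node reports that it holds the VIP. If both do, the split-brain is back.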