Iptables causes Heartbeat split-brain
When deploying Heartbeat in a production environment, there are many details to watch. If you are not careful, Heartbeat may fail to switch over or may end up in split-brain. This article describes a split-brain problem caused by iptables.
Master: 192.168.3.218
  192.168.4.218 (heartbeat IP)
  usvr-218 (hostname)
Backup: 192.168.3.128
  192.168.4.128 (heartbeat IP)
  usvr-128 (hostname)
Symptom: After Heartbeat is started on the master, the VIP comes up on 218. When Heartbeat is then started on the slave, the VIP also comes up on 128. Both nodes now hold the VIP, which is split-brain, and client access becomes unreliable.
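A quick way to confirm the symptom is to check on each node whether the VIP is currently bound. The sketch below is only an illustration: the article does not give the real VIP, so 192.168.3.200 is a made-up address, and the helper name has_addr is hypothetical. It simply filters the output of `ip -o -4 addr show`.

```shell
#!/bin/sh
# has_addr ADDR: succeed if ADDR appears as an IPv4 address in the
# `ip -o -4 addr show` output supplied on stdin.
has_addr() {
    # -F treats the dots in the address literally; the trailing "/"
    # prevents 192.168.3.20 from matching 192.168.3.200.
    grep -qF " inet $1/"
}

# Example, run on each node (192.168.3.200 is a made-up VIP):
#   ip -o -4 addr show | has_addr 192.168.3.200 && echo "VIP is bound here"
```

If the check succeeds on both 218 and 128 at the same time, the cluster is in split-brain.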
Solution:
1. Check the logs on the master and the slave
Log on master 218 (excerpt):
heartbeat[27330]: 2015/01/27_09:05:29 ERROR: Message hist queue is filling up (500 messages in queue)
heartbeat[27330]: 2015/01/27_09:05:30 ERROR: Message hist queue is filling up (500 messages in queue)
heartbeat[27330]: 2015/01/27_09:05:30 ERROR: Message hist queue is filling up (500 messages in queue)
heartbeat[27330]: 2015/01/27_09:05:31 ERROR: Message hist queue is filling up (500 messages in queue)
heartbeat[27330]: 2015/01/27_09:05:32 ERROR: Message hist queue is filling up (500 messages in queue)
heartbeat[27330]: 2015/01/27_09:05:32 ERROR: Message hist queue is filling up (500 messages in queue)
heartbeat[27330]: 2015/01/27_09:05:33 WARN: node usvr-128: is dead
heartbeat[27330]: 2015/01/27_09:05:33 info: Cancelling pending standby operation
heartbeat[27330]: 2015/01/27_09:05:33 info: Dead node usvr-128 gave up resources.
heartbeat[27330]: 2015/01/27_09:05:33 info: all clients are now resumed
heartbeat[27330]: 2015/01/27_09:05:33 ERROR: lowseq cannnot be greater than ackseq
heartbeat[27330]: 2015/01/27_09:05:33 info: hist->ackseq = 74575, old_ackseq = 0
heartbeat[27330]: 2015/01/27_09:05:33 info: hist->lowseq = 74576, hist->hiseq = 74824, send_cluster_msg_level = 1
heartbeat[27333]: 2015/01/27_09:05:34 CRIT: Emergency Shutdown: Master Control process died.
heartbeat[27333]: 2015/01/27_09:05:34 CRIT: Killing pid 27330 with SIGTERM
heartbeat[27333]: 2015/01/27_09:05:34 CRIT: Killing pid 27334 with SIGTERM
heartbeat[27333]: 2015/01/27_09:05:34 CRIT: Killing pid 27335 with SIGTERM
heartbeat[27333]: 2015/01/27_09:05:34 CRIT: Killing pid 27336 with SIGTERM
heartbeat[27333]: 2015/01/27_09:05:34 CRIT: Killing pid 27337 with SIGTERM
heartbeat[27333]: 2015/01/27_09:05:34 CRIT: Emergency Shutdown (MCP dead): Killing ourselves.
Log on slave 128 (excerpt):
Jan 27 10:11:35 heartbeat: [15999]: info: glib: ucast: bound receive socket to device: eth0
Jan 27 10:11:35 heartbeat: [15999]: info: glib: ucast: set SO_REUSEPORT (w)
Jan 27 10:11:35 heartbeat: [15999]: info: glib: ucast: started on port 694 interface eth0 to 192.168.4.218
Jan 27 10:11:35 heartbeat: [15999]: info: glib: ping heartbeat started.
Jan 27 10:11:35 heartbeat: [15999]: info: G_main_add_TriggerHandler: Added signal manual handler
Jan 27 10:11:35 heartbeat: [15999]: info: G_main_add_TriggerHandler: Added signal manual handler
Jan 27 10:11:35 heartbeat: [15999]: info: G_main_add_SignalHandler: Added signal handler for signal 17
Jan 27 10:11:35 heartbeat: [15999]: info: Local status now set to: 'up'
Jan 27 10:11:35 heartbeat: [15999]: info: Link 192.168.3.1: 192.168.3.1 up.
Jan 27 10:11:35 heartbeat: [15999]: info: Status update for node 192.168.3.1: status ping
Jan 27 10:13:35 heartbeat: [15999]: WARN: node usvr-218: is dead
Jan 27 10:13:35 heartbeat: [15999]: info: Comm_now_up(): updating status to active
Jan 27 10:13:35 heartbeat: [15999]: info: Local status now set to: 'active'
Jan 27 10:13:35 heartbeat: [15999]: info: Starting child client "/usr/lib64/heartbeat/ipfail" (498,498)
Jan 27 10:13:35 heartbeat: [15999]: WARN: No STONITH device configured.
Jan 27 10:13:35 heartbeat: [15999]: WARN: Shared disks are not protected.
Jan 27 10:13:35 heartbeat: [15999]: info: Resources being acquired from localsv218.
As the logs show, the master and the slave each conclude that the other node is dead and take over the VIP, which results in split-brain.
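The "is dead" warnings are the key lines in both logs. As a minimal sketch, they can be pulled out of a log file with sed. The function name dead_peers is hypothetical, and the log path in the usage comment is an assumption (it depends on the logfile setting in ha.cf).

```shell
#!/bin/sh
# dead_peers LOGFILE: print each node name that the heartbeat log
# declared dead, one per line, without duplicates.
dead_peers() {
    # Matches lines such as "... WARN: node usvr-128: is dead"
    sed -n 's/.*WARN: node \([^:]*\): is dead.*/\1/p' "$1" | sort -u
}

# Example (log path is an assumption):
#   dead_peers /var/log/ha-log
```

If the master's log lists the slave and the slave's log lists the master, each side believes it is the only survivor.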
2. The preliminary conclusion is that the problem is caused by a communication failure or network latency between the master and the slave. Could unsynchronized clocks be the cause? A small time difference does not affect Heartbeat, but a large one will certainly cause problems, so first synchronize the time on both nodes.
/usr/sbin/ntpdate ntp.api.bz && hwclock -w
echo "0 23 * * * root /usr/sbin/ntpdate ntp.api.bz && hwclock -w >/dev/null 2>&1" >> /etc/crontab
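To see how far apart the two clocks actually are before and after syncing, the epoch time of both nodes can be compared. This is only a sketch: the helper abs_skew is hypothetical, and the usage line assumes passwordless ssh from the master to usvr-128, which the article does not mention.

```shell
#!/bin/sh
# abs_skew LOCAL_EPOCH REMOTE_EPOCH: print the absolute difference
# between two Unix timestamps, in seconds.
abs_skew() {
    d=$(( $1 - $2 ))
    if [ "$d" -lt 0 ]; then
        d=$(( 0 - d ))
    fi
    echo "$d"
}

# Example on the master (assumes ssh access to the slave):
#   abs_skew "$(date +%s)" "$(ssh usvr-128 date +%s)"
```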
3. After the time is synchronized, the same errors still appear in the logs. Checking the master and backup configuration files again turns up no problems. The only remaining suspect is the firewall, which runs on both the master and the backup. Heartbeat is configured to communicate over UDP port 694, so this port must be opened in the firewall.
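One way to check this suspicion is to look in the active ruleset for an INPUT rule that accepts UDP port 694. The sketch below filters `iptables-save` output. The function name has_hb_rule is hypothetical, and the check is a rough heuristic: it only finds an explicit rule for port 694 and will not notice a broader ACCEPT rule that happens to cover it.

```shell
#!/bin/sh
# has_hb_rule: succeed if the iptables-save output on stdin contains an
# INPUT rule that accepts UDP traffic to destination port 694.
has_hb_rule() {
    grep -Eq -- '^-A INPUT .*-p udp .*--dport 694 .*-j ACCEPT'
}

# Example (needs root):
#   iptables-save | has_hb_rule || echo "no explicit rule for udp/694"
```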
Add the following to master 218:
/sbin/iptables -A INPUT -i eth0 -p udp -s 192.168.4.128 --dport 694 -m comment --comment "heartbeat-slave" -j ACCEPT
Add the following to slave 128:
/sbin/iptables -A INPUT -i eth0 -p udp -s 192.168.4.218 --dport 694 -m comment --comment "heartbeat-master" -j ACCEPT
Notes:
1. If the firewall policy is strict, you must also allow the heartbeat IP addresses; otherwise UDP communication will still fail.
2. The interface given with -i must be the inbound NIC that carries the heartbeat IP address.
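The rules for the two nodes differ only in the peer address and the comment, so they can be generated from one helper. The sketch below is illustrative: the function name heartbeat_rule is hypothetical, and it only prints the command so it can be reviewed before being run as root.

```shell
#!/bin/sh
# heartbeat_rule PEER_IP IFACE LABEL: print (do not run) the iptables
# command that accepts heartbeat traffic on UDP port 694 from one peer.
heartbeat_rule() {
    peer=$1
    iface=$2
    label=$3
    printf '/sbin/iptables -A INPUT -i %s -p udp -s %s --dport 694 -m comment --comment "%s" -j ACCEPT\n' \
        "$iface" "$peer" "$label"
}

# On master 218:  heartbeat_rule 192.168.4.128 eth0 heartbeat-slave
# On slave 128:   heartbeat_rule 192.168.4.218 eth0 heartbeat-master
```

Rules added with iptables -A are lost on reboot unless they are saved through the distribution's own mechanism.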
With the firewall rules in place, the master and slave nodes communicate normally. In normal operation the master holds the VIP. When the master goes down or its heartbeat service is stopped, the slave takes over the VIP.