Heartbeat is a widely used open source high-availability cluster system for Linux. Many versions have been released since 1999, and it is currently the most successful example to come out of the open source Linux-HA project, with wide adoption in the industry.
1.1 What Heartbeat does
Heartbeat can move resources (such as IP addresses and application services) from a failed machine to another machine that is working normally. This is what is generally called a high-availability service. In real production scenarios, Heartbeat's functionality is much the same as that of keepalived, another open source high-availability tool, but the business scenarios in which the two are used differ.
Heartbeat official site: http://linux-ha.org/wiki/Main_Page
1.2 How Heartbeat works
By editing Heartbeat's configuration files, you specify which Heartbeat server is the primary server; the other automatically becomes the hot standby server. The Heartbeat daemon on the hot standby server is then configured to listen for heartbeat messages from the primary server. If the hot standby server does not hear a heartbeat from the primary server within a specified time, it starts the failover procedure, takes ownership of the resources and services that were running on the primary server, and continues to provide service in the primary's place without interruption. This is how high availability of resources and services is achieved.
In addition to primary/standby mode, Heartbeat also supports primary/primary mode, in which each of the two servers is both a primary and the standby for the other. The two servers send each other heartbeat messages to report their current state. If one side does not receive the other's heartbeat messages, it assumes the other side has failed or gone down, and the healthy host starts its own resource takeover module to take over the resources or services running on the other host and keep serving users. Business continuity is relative, because the failover itself takes time: Heartbeat's switchover time is generally around 5 to 20 seconds.
Common conditions that trigger a switchover:
(1) Server outage.
(2) Failure of the Heartbeat service itself.
(3) Failure of the heartbeat connection.
A failure of an application service does not by itself trigger a switchover. To force one in that case, the Heartbeat service can be stopped when the application service goes down.
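The timeout rule described above can be sketched in a few lines. This is only an illustration of the idea, not Heartbeat's actual implementation: the class `PeerMonitor` and its methods are hypothetical names, and `deadtime` is borrowed from the ha.cf directive of the same name to mean the number of seconds of silence tolerated before the peer is declared dead.

```python
from dataclasses import dataclass


@dataclass
class PeerMonitor:
    """Hypothetical helper: tracks when the peer's last heartbeat arrived."""

    deadtime: float         # seconds of silence tolerated before failover
    last_seen: float = 0.0  # timestamp (seconds) of the most recent heartbeat

    def on_heartbeat(self, now: float) -> None:
        # Called whenever a heartbeat message from the peer is received.
        self.last_seen = now

    def peer_is_dead(self, now: float) -> bool:
        # The standby starts the takeover only after the silence
        # has lasted longer than deadtime.
        return (now - self.last_seen) > self.deadtime


monitor = PeerMonitor(deadtime=30.0)
monitor.on_heartbeat(now=100.0)
```

With these values, the peer is still considered alive 20 seconds after its last heartbeat and is declared dead once more than 30 seconds have passed.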
1.3 Heartbeat connections
As described above, a Heartbeat deployment needs at least two hosts. So how do the two machines communicate with and monitor each other to provide a high-availability service?
Here are the common ways to connect two Heartbeat hosts:
(1) Serial cable (preferred; the drawback is that the two hosts cannot be far apart).
(2) An Ethernet cable connecting the network cards of the two hosts directly (the method commonly used in production).
(3) An Ethernet cable routed through a switch or other network equipment (the fallback choice).
This adds the switch as a point of failure. In addition, the link is not a dedicated heartbeat line, so it is affected by other traffic, which can cause heartbeat delivery problems.
1.4 Heartbeat split brain
1.4.1 What is split brain
When the two high-availability servers cannot detect each other's heartbeat within the specified time (for example because of a cable failure), each starts its failover function and takes ownership of the resources and services, even though both servers are still alive and working. The same IP or service is then started on both sides at once, which causes serious conflicts. The worst case is that both hosts hold the same VIP address, so user writes may land on either side, leaving the data on the two servers inconsistent or causing data loss. This situation is known as split brain, also called a partitioned cluster. The English term is split-brain.
1.4.2 Causes of split brain
In general, split brain occurs for the following reasons:
(1) The heartbeat link between the high-availability server pair fails, so the two cannot communicate normally.
A. The heartbeat cable is faulty (broken or aged).
B. The NIC or its driver is faulty, or there is an IP configuration or IP conflict problem (directly connected NICs).
C. A device on the heartbeat path (switch or NIC) has failed.
D. The arbitration machine has a problem.
(2) A firewall on the high-availability server pair blocks the heartbeat messages.
(3) The heartbeat NIC address or similar settings are misconfigured on the servers, so heartbeats cannot be sent.
(4) Other causes, such as mismatched configuration between the two nodes (for example different heartbeat modes), heartbeat broadcast conflicts, or software bugs.
1.4.3 8 ways to prevent split brain
In a production environment, split brain can be guarded against in the following ways.
(1) Use a serial cable and an Ethernet cable at the same time, so that there are two heartbeat lines. If one line breaks, the other still works and can still carry heartbeat messages. (This means redundant NICs and cabling; it is the recommended method.)
(2) When split brain is detected, forcibly shut down one of the heartbeat nodes. (This requires special device support, such as STONITH or a fence device.) In effect, when the standby node finds that the heartbeat line has failed, it sends a power-off command to the primary node.
(3) Monitor and alert on split brain (for example by mail and SMS, with someone on duty) so that a person can step in and arbitrate as soon as the problem occurs and limit the damage. Baidu's monitoring, for instance, has both upstream and downstream messages, so there is a human interaction step. Of course, if a high-availability setup has no human interaction step, you need to decide from the actual business requirements whether such a loss can be tolerated. For ordinary website businesses the loss is controllable. We can also take the faulty machine out of service right away, find the cause, and repair it if the problem is minor.
(4) Enable disk locking. The node that is currently serving locks the shared disk, so that when split brain occurs the other node cannot "grab" the shared resources at all. Disk locking has a problem of its own, though: if the node holding the shared disk does not actively unlock it, the other node can never obtain the disk. In practice, a serving node that suddenly freezes or crashes cannot run the unlock command, so the standby node cannot take over the shared resources and application services. For this reason some HA designs use a "smart" lock: the serving node enables the disk lock only when it finds that all heartbeat lines are down (it can no longer see the peer), and does not lock in normal operation.
(5) Raise the alarm before the server takes over, leaving enough time for a person to handle it. For example, alarm after 1 minute but let the server take over only after 5 minutes, so that the takeover is delayed. No data is lost this way, but users cannot write data during that window.
(6) After the alarm, do not let the server take over automatically. Have a person perform the takeover.
(7) Add an arbitration mechanism to decide which node should get the resources. A few ideas:
A. Add an arbitration mechanism such as a reference IP (for example the gateway IP). When the heartbeat line is completely disconnected, each of the two nodes pings the reference IP. A node that cannot reach it knows the break is on its own side: not only the heartbeat line but also its local network link for external service is down, so taking over would be pointless. That node gives up the competition and lets the node that can ping the reference IP take over. The node that cannot ping the reference IP can reboot itself to fully release any shared resources it may still hold (Heartbeat also has this feature).
B. Use third-party software to arbitrate which node should get the resources.
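The reference-IP arbitration in item (7) A comes down to a small decision table. The sketch below is an illustration under stated assumptions, not Heartbeat code: the function name `arbitrate` and the action strings are hypothetical, and the two reachability checks (peer heartbeat, ping to the reference IP) are assumed to be performed by the caller and passed in as booleans.

```python
def arbitrate(peer_heartbeat_ok: bool, reference_ip_reachable: bool) -> str:
    """Decide what this node should do, following the reference-IP idea.

    peer_heartbeat_ok: True if heartbeats from the peer still arrive.
    reference_ip_reachable: True if this node can ping the reference IP
        (for example the gateway).
    """
    if peer_heartbeat_ok:
        # The heartbeat link works, so there is nothing to arbitrate.
        return "no_action"
    if reference_ip_reachable:
        # Heartbeat is lost but our own network is fine: we may take over.
        return "take_over"
    # Heartbeat is lost and we cannot reach the reference IP either, so the
    # break is on our side: give up and reboot to release shared resources.
    return "give_up_and_reboot"
```

Because each node runs the same check independently, at most the node with a working external link takes over, which is the point of the arbitration.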
1.4.4 Notes on fence devices
"Fence" is a term used only in HA cluster environments. In hardware terms, a fence device is an intelligent power management device, typically reached through IPMI (Intelligent Platform Management Interface). If you say "fence" to a server sales agent, they will probably not know what you mean (the manufacturer may), but they will understand "intelligent management device" or "remote management card". Fence devices come in two types, external and internal, and both are attached to the server. Either type has an Ethernet port and is used to reboot the server over the network when an HA switchover is triggered.
First, internal fence devices go by different names on different vendors' servers.
As for external devices, one example is the PowerSwitch from APC (the well-known UPS manufacturer). It is a power strip with an Ethernet port in which each socket has an ID number; commands use that ID to specify which socket's power to cut or cycle.
1.5 Heartbeat message types
While running, Heartbeat uses three general types of messages:
(1) Heartbeat messages
(2) Cluster transition messages
(3) Retransmission requests
1.5.1 Heartbeat messages
Heartbeat messages are packets of roughly 150 bytes, which may be sent by unicast, broadcast, or multicast. Configuration controls how often heartbeats are sent and how long to wait before a failover occurs.
1.5.2 Cluster transition messages
ip-request and ip-request-resp
When the primary server comes back online, it uses an ip-request message to ask the standby server to release the resources that the standby took over when the primary failed. The standby server then shuts down the resources and services it acquired at that time.
Once the standby server has released those resources and services, it notifies the primary server through an ip-request-resp message that it no longer owns them. After receiving the ip-request-resp message from the standby node, the primary server starts the resources and services that were released when it failed and begins providing normal service again.
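The failback exchange described above can be traced with a toy simulation. It only models the order of events; the `Node` class, the `fail_back` function, and the host names are hypothetical and do not correspond to Heartbeat internals or its wire format.

```python
class Node:
    """Hypothetical cluster node that may or may not hold the resources."""

    def __init__(self, name: str, holds_resources: bool) -> None:
        self.name = name
        self.holds_resources = holds_resources


def fail_back(primary: Node, standby: Node) -> list:
    """Simulate the primary reclaiming resources after it comes back online."""
    log = []
    # 1. The recovered primary asks the standby to release the resources.
    log.append((primary.name, "ip-request"))
    # 2. The standby stops the resources and services it had taken over.
    standby.holds_resources = False
    # 3. The standby confirms that it no longer owns them.
    log.append((standby.name, "ip-request-resp"))
    # 4. Only after the confirmation does the primary start the resources.
    primary.holds_resources = True
    return log


primary = Node("node-a", holds_resources=False)
standby = Node("node-b", holds_resources=True)
messages = fail_back(primary, standby)
```

The ordering matters: the primary starts the resources only after the standby has confirmed the release, so the two nodes never hold them at the same time.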
1.5.3 Retransmission Request
rexmit-request messages request the retransmission of heartbeat messages. Heartbeat control messages are sent over UDP, either to the port specified in the /etc/ha.d/ha.cf file or to the specified multicast address and port.
1.6 Heartbeat IP address takeover and failover
Heartbeat performs failover through IP address takeover and ARP broadcasts.
ARP broadcast: ARP is the Address Resolution Protocol. When the primary server fails and the standby node takes over the resources, the standby node broadcasts on the LAN to force all clients to update their local ARP tables, that is, to clear the cached record that maps the failed server's VIP address to its MAC address. This ensures that clients talk to the new primary server.
Every machine on the LAN has an ARP table, which can be viewed with arp -a.
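To make the ARP broadcast more concrete, the sketch below builds the 28-byte ARP payload of a gratuitous ARP announcement, in which the sender and target IP are both the VIP. This is an illustration of the standard ARP packet layout, not necessarily the exact packet Heartbeat sends; the function name `build_gratuitous_arp` and the MAC address are made up for the example, and actually sending the packet (which needs a raw socket and root privileges) is left out.

```python
import socket
import struct

ARP_REQUEST = 1  # ARP operation code for a request


def build_gratuitous_arp(vip: str, mac: str) -> bytes:
    """Return the ARP payload announcing that `vip` now lives at `mac`."""
    sender_mac = bytes.fromhex(mac.replace(":", ""))
    vip_bytes = socket.inet_aton(vip)
    return struct.pack(
        "!HHBBH6s4s6s4s",
        1,            # hardware type: Ethernet
        0x0800,       # protocol type: IPv4
        6,            # hardware address length
        4,            # protocol address length
        ARP_REQUEST,
        sender_mac,   # sender MAC: the node that now owns the VIP
        vip_bytes,    # sender IP: the VIP
        b"\x00" * 6,  # target MAC: left empty in an announcement
        vip_bytes,    # target IP: the VIP again, which makes it gratuitous
    )


payload = build_gratuitous_arp("192.168.1.131", "00:11:22:33:44:55")
```

Hosts that receive such a broadcast and already have an entry for the VIP update it to the new MAC address, which is the effect the takeover relies on.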
1.7 VIP/IP alias/secondary IP
The real IP, also called the management IP, is usually the actual IP configured on the physical network card. In load-balancing and high-availability environments, the management IP is not used to serve external traffic. It is used only to manage the server, for example to connect to the server over SSH.
The VIP is a virtual IP. It is an alias IP that Heartbeat temporarily binds to the physical network card (Heartbeat 3 and later can also use a secondary IP), for example eth0:x, where x is any number from 0 to 255. Multiple aliases can be bound to one network card.
The benefit is that when the server providing the service goes down, the same VIP is automatically configured on the server that takes over. If the management IP were used for the service instead, it would be awkward to move back and forth, and once the management IP had migrated away we could only reach the server by going to the machine room. The essence of the VIP approach is that each of the two servers keeps its own management IP and can be reached at any time, with additional IPs bound on top. Even when the VIP migrates away, the server itself stays reachable because it still has its management IP.
Manually configuring a VIP:
Heartbeat 2 uses this command by default to add a VIP:
ifconfig eth0:1 192.168.1.131 netmask 255.255.255.0 up (IP alias)
keepalived uses this command by default to add a VIP, and Heartbeat 3 uses the same approach:
ip addr add 10.0.12.1/24 broadcast 10.0.12.255 dev eth1 (secondary IP)
Tip: ip addr shows both aliases and secondary IPs, while ifconfig cannot show secondary IPs.
Manually removing a VIP:
ip addr del 10.0.12.1/24 dev eth1 (secondary IP)
ifconfig eth0:1 10.0.0.1 netmask 255.255.255.0 down (IP alias)
Manually viewing a VIP:
A VIP configured with the alias method can be viewed with ifconfig. A VIP configured as a secondary IP has to be viewed with ip addr.
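When these commands are generated by a script, it is easy to mistype the broadcast address. The sketch below only builds the command strings and derives the broadcast address from the prefix with Python's standard ipaddress module; it does not run anything. The function names are hypothetical helpers, and the device names and addresses are taken from the examples above.

```python
import ipaddress


def alias_add_cmd(dev: str, index: int, address: str, netmask: str) -> str:
    """Command that adds a VIP as an IP alias such as eth0:1."""
    return f"ifconfig {dev}:{index} {address} netmask {netmask} up"


def secondary_add_cmd(dev: str, cidr: str) -> str:
    """Command that adds a VIP as a secondary IP, with a derived broadcast."""
    iface = ipaddress.ip_interface(cidr)
    broadcast = iface.network.broadcast_address
    return f"ip addr add {iface.with_prefixlen} broadcast {broadcast} dev {dev}"


def secondary_del_cmd(dev: str, cidr: str) -> str:
    """Command that removes a secondary IP again."""
    iface = ipaddress.ip_interface(cidr)
    return f"ip addr del {iface.with_prefixlen} dev {dev}"


add_cmd = secondary_add_cmd("eth1", "10.0.12.1/24")
del_cmd = secondary_del_cmd("eth1", "10.0.12.1/24")
```

Deriving the broadcast address from the prefix keeps the add command consistent with the address being configured.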
1.8 Default directories for Heartbeat scripts
Startup scripts: /etc/init.d/ (in particular when installed with yum)
Important resource directory: /etc/ha.d/resource.d/. Programs you develop yourself later can be placed here and then called directly from Heartbeat's haresources file.
# ll /etc/ha.d/resource.d/
total 96
-rwxr-xr-x 1 root root 828 December 3 apache
-rwxr-xr-x 1 root root 805 December 3 AudibleAlarm
-rwxr-xr-x 1 root root 760 December 3 db2
-rwxr-xr-x 1 root root 910 December 3 Delay
-rwxr-xr-x 1 root root 1903 December 3 Filesystem
-rwxr-xr-x 1 root root 2325 December 3 hto-mapfuncs
-rwxr-xr-x 1 root root 3488 March 3 23:31 httpd
-rwxr-xr-x 1 root root 951 December 3 ICP
-rwxr-xr-x 1 root root 3424 December 3 ids
-rwxr-xr-x 1 root root 2273 December 3 IPaddr
-rwxr-xr-x 1 root root 1825 December 3 IPaddr2
-rwxr-xr-x 1 root root 1391 December 3 2013 IPsrcaddr
-rwxr-xr-x 1 root root 1165 December 3 IPv6addr
-rwxr-xr-x 1 root root 1091 December 3 LinuxSCSI
-rwxr-xr-x 1 root root 790 December 3 LVM
-rwxr-xr-x 1 root root 1125 December 3 MailTo
-rwxr-xr-x 1 root root 2926 December 3 OCF
-rwxr-xr-x 1 root root 742 December 3 portblock
-rwxr-xr-x 1 root root 1160 December 3 Raid1
-rwxr-xr-x 1 root root 1563 December 3 SendArp
-rwxr-xr-x 1 root root 1012 December 3 ServeRAID
-rwxr-xr-x 1 root root 1294 December 3 WAS
-rwxr-xr-x 1 root root 1166 December 3 WinPopup
-rwxr-xr-x 1 root root 666 December 3 Xinetd
Tip: Place a script under either of the two paths above, then reference the script name in Heartbeat's haresources configuration file. Heartbeat will call the script to start and stop the corresponding resources and services.
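A custom resource script is essentially a program that accepts an action argument and starts or stops a service accordingly. The sketch below shows that shape in Python, purely as an illustration. It assumes the script is called with start, stop, or status, in the style of init scripts; the function `handle`, the returned messages, and the in-memory `running` flag are hypothetical stand-ins for real service control. The scripts shipped in resource.d are shell scripts, so treat this as a model of the control flow only.

```python
from typing import Tuple


def handle(action: str, running: bool) -> Tuple[bool, str]:
    """Return the new running state and a message for the given action.

    `running` stands in for the real service state, which an actual script
    would query from the system (for example by checking a PID file).
    """
    if action == "start":
        # Starting an already running service is treated as success.
        return True, "started"
    if action == "stop":
        # Stopping an already stopped service is treated as success too.
        return False, "stopped"
    if action == "status":
        return running, "running" if running else "stopped"
    raise ValueError(f"unsupported action: {action}")


state, message = handle("start", running=False)
```

Making start and stop succeed even when the service is already in the requested state keeps repeated calls during takeover and release harmless.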
1.9 Heartbeat configuration files
The default configuration directory for Heartbeat is /etc/ha.d. It holds three commonly used configuration files: ha.cf, authkeys, and haresources.
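As a rough idea of what goes into ha.cf, the sketch below renders a few well-known directives into the file's plain "name value" format. It is not a complete or recommended configuration: the function `render_ha_cf` is a hypothetical helper, the timing values and the eth1 heartbeat interface are illustrative assumptions, and the node names are made up (in a real cluster they must match each host's uname -n output).

```python
def render_ha_cf(nodes, keepalive=2, deadtime=30, warntime=10,
                 udpport=694, bcast_dev="eth1", auto_failback=True) -> str:
    """Render a minimal ha.cf body from a handful of settings."""
    lines = [
        f"keepalive {keepalive}",   # seconds between heartbeats
        f"warntime {warntime}",     # seconds before a late-heartbeat warning
        f"deadtime {deadtime}",     # seconds of silence before failover
        f"udpport {udpport}",       # UDP port used for heartbeat traffic
        f"bcast {bcast_dev}",       # interface used for broadcast heartbeats
        f"auto_failback {'on' if auto_failback else 'off'}",
    ]
    # One "node" line per cluster member.
    lines += [f"node {name}" for name in nodes]
    return "\n".join(lines) + "\n"


ha_cf = render_ha_cf(["node-a", "node-b"])
```

Printing ha_cf gives a text block that could serve as a starting point for /etc/ha.d/ha.cf after the values have been adapted to the actual environment.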
For an introduction to Heartbeat, see the following article:
MySQL DBA Advanced Operations Study Notes: Heartbeat Introduction