1. Heartbeat Introduction
Heartbeat is a widely used open-source high-availability cluster system for Linux. Many versions have been released since 1999. It is the most successful product of the open-source Linux-HA project and has been widely adopted in industry.
1.1 Role of heartbeat
Heartbeat can quickly move resources (such as IP addresses and program services) from a failed machine to a healthy one so that service continues. This is what is generally meant by a high-availability service. In production, Heartbeat has much in common with keepalived, another open-source high-availability tool, although the two are usually applied to different business scenarios.
Heartbeat official address: http://linux-ha.org/wiki/Main_Page
1.2 How heartbeat works
In the heartbeat configuration file you specify which server is the master; the other server automatically becomes the hot standby. The heartbeat daemon on the hot standby server then listens for heartbeat messages from the master server. If the standby does not receive a heartbeat from the master within the specified time, it starts the failover procedure and takes ownership of the resources and services that were running on the master. The standby then continues to provide the service, which is how high availability of resources and services is achieved.
In addition to the master-slave mode, heartbeat also supports a master-master mode, in which each of the two servers acts as both master and standby for the other. The two servers send messages to each other to report their current status. If a server does not receive a heartbeat packet from its peer within the specified time, it considers the peer failed or down, and starts its own resource management module to take over the resources or services that were running on the peer so that users continue to be served. Even a so-called uninterrupted service needs some switching time during failover. With heartbeat, the switching time is generally about 5-20 seconds.
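The timeout rule described above can be sketched as a small simulation. This is only an illustration, not Heartbeat's actual implementation: the class name StandbyMonitor, the 10-second deadtime, and the 2-second tick interval are all assumed values chosen for the example.

```python
from dataclasses import dataclass


@dataclass
class StandbyMonitor:
    """Toy model of the standby node watching the master's heartbeat."""
    deadtime: float               # assumed: seconds of silence before failover
    last_seen: float = 0.0        # time of the most recent heartbeat
    owns_resources: bool = False  # True once the standby has taken over

    def on_heartbeat(self, now: float) -> None:
        # Record that the master was alive at time `now`.
        self.last_seen = now

    def tick(self, now: float) -> bool:
        """Return True if this call triggered failover."""
        if not self.owns_resources and now - self.last_seen > self.deadtime:
            self.owns_resources = True
            return True
        return False


monitor = StandbyMonitor(deadtime=10.0)
events = []
for t in range(0, 31, 2):
    if t <= 6:                    # the master stops sending after t=6
        monitor.on_heartbeat(float(t))
    if monitor.tick(float(t)):
        events.append(t)
print(events)                     # times at which failover was triggered
```

In this sketch the last heartbeat arrives at t=6, so the first tick with more than 10 seconds of silence is at t=18, and failover fires exactly once.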
Common conditions that trigger failover:
(1) The server goes down.
(2) The heartbeat service itself fails.
(3) The heartbeat link fails.
A failure of the application service alone does not trigger failover. To fail over in that case, stop the heartbeat service when the application service goes down.
1.3 heartbeat connection
As described above, a heartbeat deployment needs at least two hosts. How, then, do the two machines communicate with each other and monitor each other to provide a high-availability service?
The following are common ways to connect two heartbeat hosts:
(1) A serial cable (preferred, but only over short distances).
(2) An Ethernet cable directly connecting a NIC on each host (commonly used in production environments).
(3) Ethernet cables connected through a switch or other network devices.
The third method adds the switch as another point of failure. In addition, the line is not a dedicated heartbeat link, so other traffic can easily interfere with heartbeat packet transmission.
1.4 heartbeat split brain
1.4.1 what is split brain
Split brain occurs when the two high-availability servers cannot detect each other's heartbeat within the specified time (for example, because of a cable failure) even though both are still alive and running normally. Each server then starts failover on its own and takes ownership of the resources and services. The result is a serious conflict: the same IP address and the same services are brought up on both sides. The worst case is that both hosts hold the same VIP address, so when users write data it may land on either side, leaving the two servers with inconsistent data or causing data loss. This situation is called split brain (split-brain), and is also known as a partitioned cluster.
1.4.2 causes of split brain
In general, split-brain occurs due to the following reasons:
(1) The heartbeat link between the high-availability server pair fails, so the two cannot communicate normally.
A. The heartbeat cable is faulty (broken or aged).
B. The NIC or its driver is faulty, or the IP configuration is wrong or conflicting (for direct NIC connections).
C. A device on the heartbeat path (NIC or switch) fails.
D. The arbitration machine has a problem.
(2) A firewall enabled on a high-availability server blocks heartbeat messages.
(3) A high-availability server fails to send heartbeat messages because the NIC address or other settings are configured incorrectly.
(4) Other causes, such as mismatched heartbeat methods, heartbeat broadcast conflicts, and software bugs.
1.4.3 eight methods to prevent split brain
In the actual production environment, we can prevent split-brain problems from the following aspects.
(1) Use a serial cable and an Ethernet cable at the same time, giving two heartbeat links. If one link fails, the other can still carry heartbeat messages. (This guards against NIC and cable failures and is the recommended method.)
(2) Forcibly shut down one node when split brain is detected. (This requires support from special devices, such as STONITH or fence devices.) In effect, when the standby node detects a heartbeat link failure, it sends a shutdown command to the master node.
(3) Monitor for split brain and raise alarms (email, SMS, on-call staff), so that a person can step in and arbitrate as soon as a problem occurs and losses are kept small. Baidu's monitoring, for example, includes a two-way (uplink and downlink) human interaction step. If there is no such step, decide when designing the high-availability solution whether this kind of loss is tolerable for the business. For ordinary website workloads the loss is controllable. You can also log on to the faulty machine as soon as possible to find the cause and fix it while the problem is still small.
(4) Enable a disk lock. The node that is currently serving locks the shared disk, so that when split brain occurs the other node cannot seize the shared resources at all. Disk locking has a major drawback, however: if the node holding the shared disk does not unlock it, the other node can never obtain it. In practice, if the serving node suddenly hangs or crashes, it cannot run the unlock command, and the backup node cannot take over the shared resources and application services. For this reason a "smart" lock was designed for HA: the serving node enables the disk lock only when it finds that all heartbeat links are down (that is, when it can no longer detect the peer).
(5) Leave enough time for staff to respond to the alarm before the standby server takes over.
For example, an alarm is raised within one minute, but the server does not take over until five minutes have passed. No data is lost during this window, but users cannot write data.
(6) After an alarm is raised, do not take over automatically; let a person perform the takeover.
(7) Add an arbitration mechanism to determine which node should obtain the resources. Some ideas:
A. Set a reference IP address (such as the gateway IP address). When the heartbeat link is completely disconnected, each of the two nodes pings the reference IP. A node that cannot reach it concludes that the break is on its own side: both its heartbeat link and its network link for external service are down, so starting (or continuing) the application service would be pointless. That node gives up the contest and lets the node that can ping the reference IP take over the service. To be safer, the node that cannot ping the reference IP can simply reboot itself to fully release any shared resources it may still hold (Heartbeat also has this function).
B. Let third-party software arbitrate which node obtains the resources.
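The reference-IP rule in item A can be written as a small decision function. This is a minimal sketch, not Heartbeat code: the names Action and arbitrate are hypothetical, and the two boolean inputs stand in for real checks (for example, a heartbeat timeout and a ping of the gateway).

```python
from enum import Enum


class Action(Enum):
    TAKE_OVER = "take over resources"
    STAND_DOWN = "give up and release resources"


def arbitrate(heartbeat_link_up: bool, can_ping_reference: bool):
    """Decide what one node does, following the reference-IP idea.

    Returns None while the heartbeat link is up, because no
    arbitration is needed in that case.
    """
    if heartbeat_link_up:
        return None
    # Heartbeat link is down: the reference IP decides who is isolated.
    if can_ping_reference:
        return Action.TAKE_OVER
    return Action.STAND_DOWN


# The node that lost both the heartbeat link and the gateway backs off.
print(arbitrate(heartbeat_link_up=False, can_ping_reference=False))
# The node that can still reach the gateway takes over.
print(arbitrate(heartbeat_link_up=False, can_ping_reference=True))
```

Note that this rule alone does not settle the case where only the heartbeat link is broken and both nodes can still reach the reference IP; both would choose to take over, so it is best combined with the other measures in this list.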
1.4.4 fence description
Fence is just a term used in HA cluster environments. In hardware terms, a fence device is an intelligent power-management device based on IPMI (Intelligent Platform Management Interface). If you say "fence" to a server sales agent, they will probably not know what it is (the manufacturer might), but if you say "intelligent management device" or "remote management card", they will understand. Fence devices come in two kinds: external devices, and internal ones inserted into the server. Both kinds have an Ethernet port, which is used to restart the server over the network when an HA switchover is triggered.
Internal fence devices have different names on servers from different vendors. The following list gives the fence device name for each vendor.
IBM: RSA
HP: iLO
Dell: iDRAC
An example of an external device is the PowerSwitch from APC (a well-known UPS manufacturer). It is a power strip with an Ethernet port in which each outlet has an ID number, and a command specifies the ID of the outlet whose power is to be cut or cycled.
1.5 Heartbeat message type
While the Heartbeat high-availability software is running, it generally uses three types of messages:
(1) Heartbeat message
(2) Cluster conversion message
(3) retransmission request
1.5.1 Heartbeat message
A heartbeat message is a packet of about 150 bytes that may be sent by unicast, broadcast, or multicast. These messages determine the heartbeat frequency and how long it takes before a failure leads to failover.
1.5.2 cluster conversion message
IP-request and IP-request-Resp
When the master server comes back online, it sends an IP-request message asking the backup server to release the resources that the backup acquired when the master failed. The backup server then shuts down those resources and services.
After the backup server has released the resources and services it took over, it sends an IP-request-Resp message to tell the master server that it no longer owns them. When the master server receives the IP-request-Resp message from the backup node, it starts the resources and services it gave up when it failed and resumes normal service.
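The two-message failback exchange can be modeled in a few lines. This is a toy sketch of the sequence described above, not the wire protocol: the Node class, the failback function, and the resource names "vip" and "httpd" are hypothetical.

```python
class Node:
    """A cluster node that holds a set of resource names."""

    def __init__(self, name, resources=()):
        self.name = name
        self.resources = set(resources)


def failback(master, backup, wanted):
    """Move `wanted` resources from backup to master; return the message log."""
    log = []
    # 1. The recovered master asks the backup to give the resources back.
    log.append((master.name, backup.name, "IP-request"))
    # 2. The backup stops the resources it took over during the failure.
    released = backup.resources & set(wanted)
    backup.resources -= released
    # 3. The backup confirms that it no longer owns them.
    log.append((backup.name, master.name, "IP-request-Resp"))
    # 4. Only after the confirmation does the master start them again.
    master.resources |= released
    return log


master = Node("master")
backup = Node("backup", {"vip", "httpd"})   # state after an earlier failover
for sender, receiver, message in failback(master, backup, {"vip", "httpd"}):
    print(sender, "->", receiver, ":", message)
```

The ordering matters: the master starts the resources only after the backup has confirmed the release, which is what keeps the VIP from being active on both nodes at once.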
1.5.3 retransmission request
A Rexmit-request message asks for heartbeat messages to be retransmitted. All of these Heartbeat control messages are sent over UDP to the port or multicast address specified in the /etc/ha.d/ha.cf file.
1.6 heartbeat IP address takeover and Failover
Heartbeat performs failover through IP address takeover and ARP broadcast.
ARP broadcast: when the master server fails and the backup node takes over the resources, the backup node immediately forces every client on the LAN to update its local ARP table (that is, to discard the cached mapping between the VIP address and the MAC address of the failed server). ARP is the Address Resolution Protocol. The backup node does this by broadcasting on the LAN, which ensures that clients communicate with the new master server.
On a LAN, every machine keeps an ARP table, which can be viewed with the arp command.
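To make the ARP update more concrete, the sketch below builds the bytes of a gratuitous ARP frame announcing a VIP. It only constructs the frame and does not send it. The function name gratuitous_arp_frame and the MAC address are hypothetical, and the request-style layout (opcode 1, zeroed target MAC) is one common form; it is not necessarily the exact packet Heartbeat emits.

```python
import socket
import struct


def gratuitous_arp_frame(vip: str, mac: str) -> bytes:
    """Build an Ethernet frame carrying a gratuitous ARP request for `vip`."""
    src_mac = bytes.fromhex(mac.replace(":", ""))
    ip = socket.inet_aton(vip)            # 4-byte IPv4 address
    broadcast = b"\xff" * 6               # every host on the LAN receives it

    # Ethernet header: destination, source, EtherType 0x0806 (ARP)
    ethernet = broadcast + src_mac + struct.pack("!H", 0x0806)

    # ARP header: Ethernet (1), IPv4 (0x0800), lengths 6 and 4, opcode 1
    arp = struct.pack("!HHBBH", 1, 0x0800, 6, 4, 1)
    # Sender MAC/IP, then target MAC (zeroed) and target IP.
    # Sender IP equals target IP: that is what makes the ARP "gratuitous".
    arp += src_mac + ip + b"\x00" * 6 + ip
    return ethernet + arp


frame = gratuitous_arp_frame("192.168.1.131", "00:11:22:33:44:55")
print(len(frame), frame.hex())
```

Hosts that receive such a frame replace any cached entry for the VIP with the sender's MAC address, which is the effect the failover relies on.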
1.7 VIP/IP alias/secondary IP
A real IP address, also called a management IP address, is generally the actual IP address configured on the physical NIC. In load-balancing and high-availability environments, the management IP address is used only to manage the server, not to provide external services. For example, you connect to the server over SSH through the management IP address.
A VIP is a virtual IP address. It is an alias IP address that heartbeat temporarily binds to the physical NIC (heartbeat 3 and later can also use a secondary IP address), such as eth0:X, where X is a number counting up from 0. You can bind multiple aliases to one NIC.
The advantage is that when the server providing the service goes down, the same VIP is automatically configured on the server that takes over. If the management IP address were used for the service instead, migrating it back and forth would be difficult.
If the management IP address were migrated, the only way to reach the machine would be to go to the data center. The point of a VIP is that each of the two servers keeps its own management IP address, so it can be reached at any time, while other IP addresses are added and bound on top. Even when the VIP migrates, the server itself remains reachable, because it still has its management IP address.
Manual VIP configuration:
heartbeat 2 uses this command by default to add the VIP:
ifconfig eth0:1 192.168.1.131 netmask 255.255.255.0 up (IP alias)
keepalived uses this command by default to add the VIP, as does heartbeat 3:
ip addr add 10.0.12.1/24 broadcast 10.0.12.255 dev eth1 (secondary IP)
Tip: ip addr shows both aliases and secondary IP addresses, whereas ifconfig cannot show secondary IP addresses.
Manual VIP deletion:
ip addr del 10.0.12.1/24 broadcast 10.0.12.255 dev eth1 (secondary IP)
ifconfig eth0:1 10.0.0.1 netmask 255.255.255.0 down (IP alias)
Manual VIP viewing method:
You can use ifconfig to view the VIP configured using the alias method.
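When scripting the secondary-IP commands above, the broadcast address can be derived from the CIDR rather than typed by hand, which avoids mismatches between the two. A minimal sketch follows; the helper name vip_command is hypothetical, and it only builds the argument list without running anything.

```python
import ipaddress


def vip_command(action, cidr, dev):
    """Build the `ip addr add/del` argument list for a secondary-IP VIP."""
    if action not in ("add", "del"):
        raise ValueError("action must be 'add' or 'del'")
    iface = ipaddress.ip_interface(cidr)
    # The broadcast address follows from the address and prefix length.
    broadcast = str(iface.network.broadcast_address)
    return ["ip", "addr", action, iface.with_prefixlen,
            "broadcast", broadcast, "dev", dev]


print(" ".join(vip_command("add", "10.0.12.1/24", "eth1")))
print(" ".join(vip_command("del", "10.0.12.1/24", "eth1")))
```

The resulting list could be passed to subprocess.run on a host where the ip tool is installed and the caller has root privileges.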
1.8 default directory of Heartbeat script
Startup script directory: /etc/init.d/ (in particular when heartbeat is installed with yum)
Resource script directory: /etc/ha.d/resource.d/. Put any programs you develop yourself here; heartbeat can then call them directly from the haresources file.
# ll /etc/ha.d/resource.d/
total 96
-rwxr-xr-x 1 root root  828 Dec  3  2013 apache
-rwxr-xr-x 1 root root  805 Dec  3  2013 AudibleAlarm
-rwxr-xr-x 1 root root  760 Dec  3  2013 db2
-rwxr-xr-x 1 root root  910 Dec  3  2013 Delay
-rwxr-xr-x 1 root root 1903 Dec  3  2013 Filesystem
-rwxr-xr-x 1 root root 2325 Dec  3  2013 hto-mapfuncs
-rwxr-xr-x 1 root root 3488 Mar  3 23:31 httpd
-rwxr-xr-x 1 root root  951 Dec  3  2013 ICP
-rwxr-xr-x 1 root root 3424 Dec  3  2013 ids
-rwxr-xr-x 1 root root 2273 Dec  3  2013 IPaddr
-rwxr-xr-x 1 root root 1825 Dec  3  2013 IPaddr2
-rwxr-xr-x 1 root root 1391 Dec  3  2013 IPsrcaddr
-rwxr-xr-x 1 root root 1165 Dec  3  2013 IPv6addr
-rwxr-xr-x 1 root root 1091 Dec  3  2013 LinuxSCSI
-rwxr-xr-x 1 root root  790 Dec  3  2013 LVM
-rwxr-xr-x 1 root root 1125 Dec  3  2013 MailTo
-rwxr-xr-x 1 root root 2926 Dec  3  2013 OCF
-rwxr-xr-x 1 root root  742 Dec  3  2013 portblock
-rwxr-xr-x 1 root root 1160 Dec  3  2013 Raid1
-rwxr-xr-x 1 root root 1563 Dec  3  2013 SendArp
-rwxr-xr-x 1 root root 1012 Dec  3  2013 ServeRAID
-rwxr-xr-x 1 root root 1294 Dec  3  2013 WAS
-rwxr-xr-x 1 root root 1166 Dec  3  2013 WinPopup
-rwxr-xr-x 1 root root  666 Dec  3  2013 Xinetd
Tip: place a script under either of the two paths above, then put the script name in heartbeat's haresources configuration file so that heartbeat calls the script to start and stop the corresponding resources and services.
1.9 heartbeat configuration file
The default configuration directory of heartbeat is /etc/ha.d. There are three commonly used configuration files: ha.cf, authkeys, and haresources.
For more information about heartbeat, see the following articles.
https://www.linuxidc.com/Linux/2017-02/140554.htm
https://www.aliyun.com/jiaocheng/131309.html