Heartbeat working principle and its deployment requirements (i)

Source: Internet
Author: User

I: Heartbeat introduction

1.1 Heartbeat Introduction

Heartbeat is open source software that provides high-availability services. It moves resources (such as IP addresses and application services) from a failed machine to a functioning one, so that the service keeps running. In production, Heartbeat and Keepalived have much in common, but they suit different kinds of business. Keepalived mainly controls IP drift (failover of an IP address) and is simple to configure and use. Heartbeat can control IP drift as well, is better at controlling resource services, and is more complex to configure and use.

1.2 Heartbeat Working principle

The Heartbeat configuration file specifies which server is the primary; the other server automatically becomes the hot standby. The Heartbeat daemon on the hot standby is then configured to listen for heartbeat messages from the primary server. If the standby hears no heartbeat from the primary within a specified time, it starts the failover procedure and takes ownership of the resources and services that were running on the primary, continuing to provide service in its place. This is how high availability of resources and services is achieved.
The description above is Heartbeat's primary/standby mode. Heartbeat also supports a dual-primary mode, in which each of the two servers is both a primary and the standby for the other. The servers send heartbeat messages to tell each other their current state. If one side receives no heartbeat message from the other within the specified time, it assumes the other side has failed or gone down, and the healthy host starts its resource takeover module to take over the resources or services that were running on the failed host and continues serving users. In the normal case, this lets the business keep running when one host fails.
Note: "uninterrupted" is relative, because failover itself takes time (for example, to stop the database and storage services). A Heartbeat primary/standby switchover generally takes about 5-20 seconds (a switchover caused by a server crash is faster than a manual one).
In addition, as with Keepalived, Heartbeat's high availability works at the operating-system level, not at the service (software) level. Service-level high availability can be achieved with a simple control script.
Common scenarios that trigger a switchover between highly available servers:
1) Physical outage of the primary server (hardware failure or operating system failure); this is the main case the solution targets
2) Failure of the Heartbeat service software itself
3) Failure of the heartbeat connection between the primary and standby servers
A failure of the application service itself does not cause a switchover. To handle that case, a script can stop the Heartbeat service when the application service goes down, which in turn triggers the switchover.
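
The last point (stopping Heartbeat when the service itself fails) can be sketched as a small watchdog. This is a minimal sketch and not part of Heartbeat: the function name watch_service, the nginx process name, and the init-script path in the example are assumptions for illustration. The check and stop commands are passed in as arguments so they can be replaced with whatever fits the environment.

```shell
#!/bin/sh
# watch_service CHECK_CMD STOP_CMD
# Runs CHECK_CMD; if it fails (the application service is down), runs
# STOP_CMD to stop heartbeat so that the peer takes over the resources.
# Both arguments are command strings chosen by the caller (assumptions).
watch_service() {
    check_cmd=$1
    stop_cmd=$2
    if sh -c "$check_cmd" >/dev/null 2>&1; then
        echo "service ok"
        return 0
    fi
    echo "service down, stopping heartbeat"
    sh -c "$stop_cmd"
}

# Example wiring (hypothetical service name and path; run from cron or a loop):
# watch_service "pgrep -x nginx" "/etc/init.d/heartbeat stop"
```

Passing the commands in keeps the decision logic separate from the service being watched, so the same function can guard a database or a web server.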

1.3 Heartbeat connections

Deploying the Heartbeat service requires at least two hosts. How do the two hosts communicate and monitor each other to provide a highly available service?

Common ways for two Heartbeat hosts to communicate:
1) A serial cable connecting the two servers (the so-called serial line)
2) One Ethernet cable directly connecting a network card on each server
3) Ethernet cables connected through network devices such as switches (not recommended)

How do I choose a heartbeat communication scheme for a highly available server?

1) A serial line does not share a path with Ethernet traffic and needs no IP address or similar configuration, so transmission is stable and rarely causes problems. Its drawback is that the two servers cannot be far apart. On the server, the serial line corresponds to the device /dev/ttyS0.
High-availability setups are generally deployed within one LAN; they cannot be used across the public network.
2) A direct NIC-to-NIC connection with an Ethernet cable (an ordinary cable will do; no special crossover cable is needed) is fairly simple to configure: give the two directly connected network cards addresses in a separate IP segment and they can communicate.
3) Running the heartbeat over ordinary Ethernet cabling through a switch is the second choice. The switch adds another point of failure, and the line is not a dedicated heartbeat line, so other Ethernet traffic can delay heartbeat messages or prevent them from being delivered.
In production, the serial line and the directly connected Ethernet line are the most commonly used.

Heartbeat media types:

The cluster heartbeat is configured in the /etc/ha.d/ha.cf file. There are 4 heartbeat types.

① Serial Port

serial [serial port name]
serial /dev/ttyS0 # Linux
serial /dev/cuaa0 # FreeBSD
serial /dev/cuad0 # FreeBSD 6.x
serial /dev/cua/a # Solaris

② Broadcast

Interfaces to broadcast heartbeats over
bcast eth0 # Linux
bcast eth1 eth2 # Linux
bcast le0 # Solaris
bcast le1 le2 # Solaris

③ Multicast

Set up a multicast heartbeat medium
mcast [dev] [mcast group] [port] [ttl] [loop]
[dev] the device used to send/receive heartbeats
[mcast group] the multicast group to join (a class D multicast address, 224.0.0.0-239.255.255.255)
[port] the UDP port used to send/receive (set this to the same value as udpport)
[ttl] the TTL of outgoing heartbeats. This affects how far a multicast packet can propagate (0-255); it must be greater than 0.
[loop] toggles loopback for multicast heartbeats. If enabled, an outgoing packet is looped back and received by the interface that sent it (0 or 1). Set this value to 0.
mcast eth1 225.0.0.10 694 1 0

④ Unicast

Set up a unicast (UDP) heartbeat medium
ucast [dev] [peer-ip-addr]
[dev] the device used to send/receive heartbeats
[peer-ip-addr] the IP address of the peer that packets are sent to
ucast eth0 172.10.25.27
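
The media directives above can be combined in one ha.cf. The shell sketch below writes a small sample file; it is an illustration under stated assumptions, not a recommended production configuration. The function name write_sample_ha_cf is hypothetical, the keepalive, deadtime, udpport, baud and node directives are ha.cf directives that the text above does not cover, and all timing values and the node names node01/node02 are assumptions.

```shell
#!/bin/sh
# write_sample_ha_cf FILE
# Writes a small sample ha.cf to FILE. All values are illustrative
# assumptions; adjust devices, addresses and timings to the environment.
write_sample_ha_cf() {
    cat > "$1" <<'EOF'
# Timing (seconds): send a heartbeat every 2 s, declare the peer dead after 30 s
keepalive 2
deadtime 30
# UDP port shared by the bcast/mcast/ucast media
udpport 694
# Heartbeat media: a serial line plus a dedicated, directly connected NIC
baud 19200
serial /dev/ttyS0
mcast eth1 225.0.0.10 694 1 0
# Cluster members; names must match `uname -n` on each host
node node01
node node02
EOF
}

# Example: write to a scratch file first and review it before copying
# it to /etc/ha.d/ha.cf.
# write_sample_ha_cf /tmp/ha.cf.sample
```

Using two independent media (serial plus a dedicated NIC) follows the advice in this section: a single broken cable then does not cut the nodes off from each other.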

Summary of choosing a scheme:

1. For data-related business with high requirements, use a serial line together with a direct Ethernet connection
2. For web-related business, choose a serial line, a direct Ethernet connection, or a shared network through a switch

1.4 Heartbeat split-brain

What is split-brain?

In a dual-machine hot-standby high-availability (HA) system, when the heartbeat line between the 2 nodes is disconnected, the HA system, which used to act as one coordinated whole, splits into 2 independent nodes. Having lost contact, each node assumes the other has failed. The HA software on both nodes then behaves like a split-brain patient and instinctively fights for the shared resources and tries to start the application services. The consequences are serious: either the shared resources are carved up and the services come up on neither side, or the services come up on both sides and both read and write the shared storage at the same time, corrupting data (a common example is errors in the database's rotating online logs).

What causes split-brain?

A: The heartbeat line between the highly available servers fails, so they cannot communicate normally
1) The heartbeat cable is faulty (broken or aged)
2) A bad NIC or driver, or an IP misconfiguration or conflict (with a direct NIC connection)
3) Failure of a device on the heartbeat path (NIC or switch)
4) Problems with the arbitration machine (when an arbitration scheme is used)
B: A firewall such as iptables is enabled on the highly available servers and blocks heartbeat messages
C: The heartbeat NIC address or related settings are configured incorrectly on the highly available servers, so heartbeats cannot be sent
D: Other causes such as improper service configuration: mismatched heartbeat modes, heartbeat broadcast conflicts, software bugs, etc.
Hint: With Keepalived, if the virtual_router_id parameter is inconsistent between the two ends, split-brain can also occur

Ways to prevent split-brain:

1) Use a serial cable and an Ethernet cable at the same time, giving two independent heartbeat lines (separate network cards and cables).
2) Forcibly shut down one node when split-brain is detected (this requires special device support such as STONITH or fence devices). In effect, when the standby node detects a heartbeat-line failure, it sends a shutdown command to the primary node (used, for example, in banking).
3) Monitor and alert on split-brain (by e-mail, mobile phone message, or on-duty staff) so that a person can step in and arbitrate as soon as the problem occurs, reducing losses. Baidu's alarm monitoring, for example, works in both directions (downstream alerts and upstream replies), which allows human interaction. When implementing a high-availability scheme, decide from the actual business needs whether such losses can be tolerated. For ordinary website business, the loss is manageable.
4) Enable disk locking. The node that is serving locks the shared disk, so that when split-brain occurs the other side cannot seize the shared resources. Disk locking has its own problem: if the node holding the shared disk does not actively release the lock, the other side can never obtain the disk. In reality, a serving node that suddenly freezes or crashes cannot run an unlock command, so the backup node cannot take over the shared resources and application services. For this reason some HA designs use an "intelligent" lock: the serving node enables the disk lock only when it finds that all heartbeat lines are down (it cannot detect the peer), and leaves the disk unlocked otherwise. This feature applies to shared-storage scenarios.
5) Raise an alarm before the server takes over, to leave staff enough time to react. For example, alert after 1 minute without taking over, and take over only after 5 minutes. The takeover takes longer; no data is lost, but users cannot write data in the meantime.
6) After the alarm, do not take over automatically; let staff perform the takeover.
7) Add an arbitration (quorum) mechanism to determine who should get the resources. Two reference ideas:
    a) Add a reference IP. When the heartbeat is completely disconnected, each of the 2 nodes pings the reference IP. A node that cannot reach it knows the break is on its own side and voluntarily gives up the contest, so the node that can ping the reference IP takes over the service.
    b) Use third-party software to arbitrate who should get the resources; similar software is used at Alibaba.
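
The reference-IP idea in 7a can be sketched in shell as follows. This is a minimal sketch under stated assumptions, not a Heartbeat feature: the function name arbitrate and the gateway address in the example are hypothetical, and the ping options assume the Linux iputils ping. The ping command is overridable so the decision logic can be exercised without a network.

```shell
#!/bin/sh
# arbitrate REF_IP [PING_CMD]
# Called only after every heartbeat line is down. Prints "takeover" when
# this node can still reach the reference IP, otherwise "standdown".
# PING_CMD defaults to three ICMP probes; it is overridable for testing.
arbitrate() {
    ref_ip=$1
    ping_cmd=${2:-"ping -c 3 -W 1"}
    if $ping_cmd "$ref_ip" >/dev/null 2>&1; then
        echo takeover
    else
        echo standdown
    fi
}

# Example (the gateway address is a hypothetical reference IP):
# arbitrate 172.10.25.1
```

A node that prints standdown should release or refuse the shared resources, because the failure is most likely on its own side of the network.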

Summary: how to write a script that detects split-brain

1) Simple check: alert whenever the VIP appears on the standby node. This covers two cases (a: the primary went down and the standby took over; b: the primary is not down and split-brain has occurred); in either case, inspect manually.
2) Strict check: the VIP appears on the standby node while the primary host and its service are still alive. This means split-brain has occurred.
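
The second, stricter check can be sketched as a script run on the standby node. This is an illustration only: the function names classify and has_vip are hypothetical, the addresses in the example are the ones used in section 2.1 of this article, and a ping stands in for a real service health check.

```shell
#!/bin/sh
# classify VIP_LOCAL PEER_ALIVE
# VIP_LOCAL:  "yes" if the VIP is bound on this (standby) node
# PEER_ALIVE: "yes" if the primary host and its service still respond
classify() {
    if [ "$1" != "yes" ]; then
        echo "normal"            # standby does not hold the VIP
    elif [ "$2" = "yes" ]; then
        echo "split-brain"       # both sides are active at once
    else
        echo "failover"          # primary is down, takeover is legitimate
    fi
}

# has_vip VIP: succeeds if VIP is configured on any local interface.
has_vip() {
    ip -o addr show | grep -qwF "inet $1"
}

# Example wiring on the standby (ping is a stand-in for a service check):
# vip=172.10.25.18; primary=172.10.25.26
# has_vip "$vip" && v=yes || v=no
# ping -c 2 -W 1 "$primary" >/dev/null 2>&1 && p=yes || p=no
# classify "$v" "$p"
```

For the simple check in 1), alerting whenever classify prints anything other than normal is enough.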

About fence devices and the arbitration mechanism

"Fence" is a term used only in HA cluster environments. In hardware terms, a fence device is an intelligent power management device (IPMI) or a remote management card. Fence devices are either external or internal (plugged into the server). Both kinds have an Ethernet port and are used, when an HA switchover is triggered, to restart the server over the network so that resource services can be provided.
Internal fence devices:
IBM: RSA, RSA II
HP: iLO, iLO2
Dell: iDRAC, iDRAC3
External fence devices:
APC

STONITH overview

STONITH is an acronym for "Shoot The Other Node In The Head". It is a component of the Heartbeat software package that lets a healthy server use a remote or "smart" power device connected to it to automatically power-cycle the failed server. STONITH devices can switch power off in response to software commands. A server running Heartbeat sends commands to the STONITH device over a serial cable or network cable, and the device controls the power supply of the other server in the high-availability pair. In other words, the primary server can reset the standby server's power, and the standby server can reset the primary server's power.
Note: In theory there is no limit on the number of servers connected to a remote or "smart" power device, but most STONITH implementations use only two servers. The two-server STONITH configuration is the simplest and easiest to understand, and it can run for long periods without reducing system reliability or availability.

Steps triggered by a STONITH event:

1) The STONITH event starts when the standby server hears no heartbeat.
Note: This does not necessarily mean that the primary server stopped sending heartbeats. The heartbeat may fail to reach the standby server for many reasons, which is why at least two physical paths are recommended for carrying the heartbeat, to avoid false alarms.
2) The standby server issues a STONITH reset command to the STONITH device.
3) The STONITH device cuts the power supply of the primary server.
4) Once its power is cut, the primary server can no longer access the cluster resources or provide them to clients. This guarantees that client computers cannot reach resources on the primary server and rules out a split-brain state.
5) The standby server then acquires the primary server's resources. Heartbeat runs the resource scripts with the start parameter and sends gratuitous ARP broadcasts (ARP spoofing) so that client computers send their requests to the standby's network interface.

1.5 Heartbeat Message Type

While running, Heartbeat generally uses three types of messages:
1) Heartbeat messages
2) Cluster transition messages
3) Retransmission requests

Heartbeat messages

A heartbeat message is a packet of about 150 bytes, sent by unicast, broadcast, or multicast. The configuration controls the heartbeat frequency and how long to wait before failover occurs.

Cluster transition messages

ip-request and ip-request-resp
When the primary server comes back online, it sends an ip-request message asking the standby server to release the resources and services the standby acquired when the primary failed. The standby server then shuts down the services it took over.
After the standby server has released those resources and services, it notifies the primary server with an ip-request-resp message that it no longer owns them. On receiving this notification, the primary server starts the resources and services it gave up when it failed and resumes normal service.

Retransmission requests

rexmit-request controls requests to retransmit heartbeats. This message is not very important.
Tip: The heartbeat control messages above are sent over UDP to the port specified in the /etc/ha.d/ha.cf file, or to the specified multicast address.

1.6 Heartbeat IP address takeover and failover

Heartbeat performs failover through IP address takeover and ARP broadcast.
ARP broadcast: When the primary server fails, the standby node immediately forces all clients to update their local ARP tables. This clears each client's cached mapping between the failed server's VIP address and its MAC address, so that clients talk to the new primary server.
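
Heartbeat sends these ARP updates itself during takeover. To show what such an announcement looks like when done by hand, the sketch below builds an arping command (from iputils, used here as a stand-in for Heartbeat's own mechanism) that sends unsolicited ARP packets for a VIP. The function name announce_vip_cmd is hypothetical, and the interface and address are examples taken from this article. The function only prints the command; running it requires root and the VIP already bound to the interface.

```shell
#!/bin/sh
# announce_vip_cmd IFACE VIP [COUNT]
# Prints the arping command that would announce VIP from IFACE with
# COUNT unsolicited (gratuitous) ARP packets. Printing instead of running
# keeps the sketch safe to try; pipe the output to sh (as root) to send.
announce_vip_cmd() {
    iface=$1
    vip=$2
    count=${3:-3}
    echo "arping -U -c $count -I $iface $vip"
}

announce_vip_cmd eth0 172.10.25.18
```

Sending several packets rather than one is a common precaution, since a single broadcast can be lost.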

1.7 VIP / IP alias / secondary IP

The real IP, also called the management IP, is the actual IP configured on the physical network card. In load-balancing and high-availability environments, the management IP does not serve user traffic; it is used only to manage the server, for example to connect to it over SSH.
The VIP is a virtual IP. It is an alias IP that Heartbeat temporarily binds to the physical network card (Heartbeat 3 and later use a secondary IP instead), such as eth0:x, where x is any number from 0 to 255. Several aliases can be bound to one network card. The VIP can be thought of as a nickname you use online. In a real production environment, DNS resolves the site's domain name to the VIP address, and the VIP is what serves users.
The advantage is that when the server providing the service goes down, the same VIP is automatically configured on the server that takes over. Migrating the management IP back and forth would be difficult, and once the management IP had moved away, the only way to reach the server would be to go to the machine room. With a VIP, each server keeps its own management IP and is always reachable, and the additional IPs are bound on top. Even when the VIP moves away, you do not lose access to the server itself, because the management IP stays where it is.

To configure a VIP manually:
ifconfig eth0:1 172.26.10.50 netmask 255.255.255.224 up (IP alias)
#==> Heartbeat 2 adds VIPs this way by default
ip addr add 10.25.16.30/24 broadcast 10.25.16.255 dev eth1 (secondary IP)
#==> the method used by Keepalived and Heartbeat 3
The ip addr command shows both aliases and secondary IPs; ifconfig cannot show secondary IPs.

To remove a VIP manually:
ifconfig eth0:1 172.26.10.50 netmask 255.255.255.224 down
ip addr del 10.25.16.30/24 broadcast 10.25.16.255 dev eth1

1.8 Heartbeat directories and files

Startup script: /etc/init.d/
Important resource directory: /etc/ha.d/resource.d/ (holds the scripts that control services)
Default configuration file directory: /etc/ha.d/
Heartbeat has 3 commonly used configuration files:

| Configuration file | Role | Note |
|---|---|---|
| ha.cf | Heartbeat parameter configuration file | Basic Heartbeat parameters are configured here |
| authkeys | Heartbeat authentication file | The highly available server pair authenticates each other based on authkeys |
| haresources | Heartbeat resource configuration file | Configures IP resources, service scripts, and so on |

1.9 Heartbeat Development Branch

Heartbeat 3 branches:
After version 2.1.4, Heartbeat was split into 3 separate projects: Heartbeat, Cluster Glue, and Resource Agents. Before that, the CRM functionality had already been split out into its own project, Pacemaker.

1.10 Heartbeat Production Application Scenario

1. Web high availability between two machines: Heartbeat works well with nginx or HAProxy; LVS is better paired with Keepalived
2. High availability for the primary database (Heartbeat is the best choice)
3. Storage (MFS): Heartbeat + DRBD works well

II: Requirements for deploying Heartbeat high availability

2.1 Business Requirements Description

Assume there are two servers, node01 and node02, with real IPs 172.10.25.26 (node01) and 172.10.25.27 (node02).
Configuration target: after the Heartbeat service starts, node01 initially brings up VIP 172.10.25.18 and node02 initially brings up VIP 172.10.25.10. If either node01 or node02 goes down, the VIP that was initially started on the failed machine automatically moves to the machine that is still working. IP resources are taken over automatically, which achieves high availability without affecting the business.
Here we use Heartbeat's dual-primary mode. A production environment can also use primary/standby mode, in which the VIP is configured only on the primary and the standby stays in hot-standby state.
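
For the dual-primary target described above, the resource file could look like the sketch below, which writes a sample haresources file. It is an illustration under stated assumptions: the function name write_sample_haresources is hypothetical, and the /24 prefix length and the eth0 interface are assumptions that the requirements above do not state. Each line names the node that normally owns a resource, followed by the resource; the file should be the same on both nodes.

```shell
#!/bin/sh
# write_sample_haresources FILE
# Writes a two-line haresources sketch to FILE: each line names the node
# that normally owns a resource, followed by the resource itself.
# The /24 prefix and the eth0 interface are assumptions.
write_sample_haresources() {
    cat > "$1" <<'EOF'
node01 IPaddr::172.10.25.18/24/eth0
node02 IPaddr::172.10.25.10/24/eth0
EOF
}

# Example: write to a scratch file and review before installing it as
# /etc/ha.d/haresources on both nodes.
# write_sample_haresources /tmp/haresources.sample
```

For primary/standby mode, a single line for the primary node would be enough, since the standby owns no resource in normal operation.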

2.2 Simplified layout of the deployment structure

2.3 Production Environment Server hardware configuration

Two Dell R430 servers, configured as follows:

| Device | Configuration |
|---|---|
| CPU | Xeon E5-2403 v3 * * |
| Memory | ECC DDR4, 2 x 8 GB |
| RAID | SATA/SAS/SSD controller: H310, RAID 5 |
| Disk | SAS 600 GB x 4 |
| NIC | Dual-port Gigabit network adapter |

2.4 System Selection and host resource planning

For the operating system, we can choose CentOS 5.8 or CentOS 6.5.

Heartbeat host resource planning

| Host name | Network interface | IP | Use |
|---|---|---|---|
| node01 | eth0 | 172.10.25.26 | Management IP, used for WAN data forwarding |
| | eth1 | 10.25.25.16 | Heartbeat connection between the servers (direct connection) |
| | VIP | 172.10.25.18 | Provides the mount service for application 01 |
| node02 | eth0 | 172.10.25.27 | Management IP, used for WAN data forwarding |
| | eth1 | 10.25.25.17 | Heartbeat connection between the servers (direct connection) |
| | VIP | 172.10.25.10 | Provides the mount service for application 02 |

Note: In production, the links between storage servers, and between a storage server and the switch, can use dual Gigabit NIC bonding to improve network performance.
