Implementation and working principles of a heartbeat high-availability cluster on CentOS 6.5



A Linux HA cluster is a high-availability server cluster. "High availability" here does not mean high availability of a particular host; it means high availability of the service.


What is high availability: a server can go down for many reasons, any component can fail, and downtime is usually very costly, especially for a website. High availability means that when the server currently providing a service goes down, the service itself keeps running because another node takes it over.


What is a heartbeat: the servers are connected over the network, and each server keeps sending a very short, small notification of its online status to the standby hosts on the same network, telling them it is still alive. When the other servers receive this heartbeat information they know that the machine, in particular the primary server, is still online.


How is heartbeat information sent, and who collects it? Two hosts cannot talk to each other process-to-process directly; they can only use the network, with a process listening on an agreed socket to send and receive data. So every server runs the same process, and these processes keep communicating: the primary node (master server) continuously sends its heartbeat information to the matching process on the other node. This software layer is the base layer of a high-availability cluster, also called the heartbeat information transport layer or messaging layer. It is a process, i.e. a service, running on every cluster node; it has to be started before the hosts can exchange this information, which normally flows from the master node to the standby node.


So-called resources: take a web service as an example. The VIP is a resource, the web service itself is a resource, and the web content is a resource; a service is made up of several resources. Shared storage for the web content is also a resource. Different services need different resources, and shared storage is the hardest problem to solve in a high-availability cluster.


If the master node fails, one of the standby nodes has to be chosen to take over, and the mechanism that decides which standby node is selected is called the cluster transaction (decision-making) process.


HA-aware: if an application can use the functions of the underlying heartbeat messaging layer to carry out the cluster decision-making process by itself, that software is called HA-aware.


DC (Designated Coordinator): the elected coordinator. When the host the DC runs on fails, a new DC is elected first, and that DC then makes the decisions. Note: in a high-availability cluster the most basic, lowest-level units of management are called resources; resources are grouped together to form a service.


No resource in a high-availability cluster should start itself; every resource is started by the CRM.
CRM (Cluster Resource Manager): the cluster resource manager; the real decisions are made by the CRM.


Heartbeat v1 already included resource management, and in v1 the resource manager ships with heartbeat itself and is called haresources; haresources is a configuration file, and the interface that reads it carries the same name.
In Heartbeat v2 the project was greatly improved: the resource manager can run as a standalone process and accept user requests through it, and it is called the CRM. At run time it needs a process named crmd running on every node, which normally listens on a socket on port 5560; the server side is therefore called crmd, and the client is called crm (the crm shell), a command-line interface for talking to the server-side CRM. Heartbeat also has a graphical tool, heartbeat-gui, which can be used to do the configuration.
The third edition, Heartbeat v3, was split into three independent projects: heartbeat, pacemaker, and cluster-glue. With the architecture separated this way, the components can be combined with other software.


RA (Resource Agent): a resource agent is a tool that can be invoked by the CRM to actually start, stop, and otherwise manage a resource on a node. This management tool is usually a script, so it is commonly called a resource agent script. Every resource agent follows the same style and accepts four arguments: {start|stop|restart|status}; configuring an IP address is handled the same way. The agent for each resource has to implement these four actions.
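To make the idea concrete, below is a minimal sketch of what such a resource agent script could look like for an httpd-style service. It is illustrative only; the paths and the use of pgrep are assumptions, not something heartbeat ships:

#!/bin/bash
# Minimal LSB-style resource agent sketch; heartbeat v1 looks for scripts
# like this under /etc/ha.d/resource.d/ or /etc/init.d/.
case "$1" in
    start)
        /etc/init.d/httpd start
        ;;
    stop)
        /etc/init.d/httpd stop
        ;;
    restart)
        /etc/init.d/httpd stop
        /etc/init.d/httpd start
        ;;
    status)
        # report whether the managed process is running
        if pgrep httpd >/dev/null; then
            echo "httpd is running"
        else
            echo "httpd is stopped"
        fi
        ;;
    *)
        echo "Usage: $0 {start|stop|restart|status}"
        exit 1
        ;;
esac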
When a node fails, the resources running on it are automatically moved to another healthy standby node; this process is called failover.
If the failed node comes back and we add it back into the cluster, the resources can move back to it; this process is called failback.


Resource contention and resource isolation:
When the cluster splits, the nodes that are no longer part of the surviving cluster must be prevented from continuing to use the resources; otherwise contention, for example over a mounted file system, can corrupt the data. The nodes that did not make it into the new cluster therefore get "shot in the head", so that the ejected nodes are effectively dead and no longer receive requests. This is called STONITH (Shoot The Other Node In The Head), and the capability is called resource isolation. The consequences of contention over shared storage are very serious: if the shared storage is corrupted, the whole file system crashes and data is lost.


There are two levels of resource isolation:
Node level: this is STONITH, which simply cuts the other node's power; usually the host is connected to a switched power device for this purpose.
Resource level: this relies on hardware, for example the fibre-channel switch in front of the shared storage, which can block the failed node's fibre port; this is called resource-level isolation.
The situation where the nodes split into groups that cannot coordinate with each other is commonly called split-brain. Avoiding resource contention through resource isolation is a problem that must be addressed when designing any high-availability cluster.


In a two-node cluster, once the cluster partitions and one node fails, we cannot tell from inside the cluster which node is the abnormal one. The healthy node, however, must still be reachable from the network so it can talk to the front-end router, so we treat the front-end router as a third party, here called the ping node. Each node first pings this node; the node that can reach it is considered healthy and gets the extra, deciding vote. The ping node acts as a quorum (arbitration) device that helps decide which node wins; a quorum device is used whenever the number of nodes is even.
RHCS does not use a ping node for this; it uses a shared storage device. With an even number of nodes, each active node keeps writing to the disk: at every heartbeat interval it updates its data bit on the device. As long as a node updates its bit every interval it is considered alive; if a node fails to write its bit for several intervals in a row it is considered dead. This is also a quorum device, called a qdisk. So there are two kinds of arbitration device: the ping node and the qdisk.


Messaging layer: the heartbeat information between the primary and standby nodes travels through the messaging layer, also called the infrastructure layer, which transports the heartbeats. Software implementing this layer includes corosync and heartbeat; corosync is a component of OpenAIS.
Resource allocation layer (resource manager layer): the core component of this layer is the CRM (Cluster Resource Manager). One of the CRM instances must be elected leader, called the DC (Designated Coordinator), whose job is to make all cluster-wide decisions. The DC runs two additional processes. One is the PE (Policy Engine): it collects status information from every node in the cluster, works out the placement decisions locally on the node where it runs, and the resource managers on the affected nodes are then told to start or stop resources accordingly. The other is the TE (Transition Engine), which delivers the PE's decisions to the CRM on the corresponding nodes.
The cluster resource manager advertises configuration to every node with the help of the messaging layer, by broadcast or multicast, so that every node holds the same information. The data exchanged between nodes is formatted as semi-structured data based on XML, so the configuration shared between nodes is kept in an XML file; the component that stores and interprets this XML information is the CIB (Cluster Information Base). Any client that can connect to the CRM can edit this XML configuration; changes are first written into the XML on the DC, and the DC then synchronizes the XML file to every other node.
Resource layer: the PE (Policy Engine) reads the resource configuration from the XML base and the nodes' current status through the messaging layer, and then makes a decision. Once a decision to start a resource is made, the PE, via the local messaging layer, notifies the CRM on the target node, for example asking it to start a resource. The CRM that receives the notification does not start the resource itself; it hands the work to the LRM (Local Resource Manager), which runs on every node, and the LRM manages the resource through an RA (Resource Agent). In short: the CRM collects information, the PE weighs all resources across the whole cluster and makes sure each resource runs on an appropriate node, the decision is advertised to the CRMs on the other nodes, the CRM on the target node calls its own LRM, and the LRM instructs the RA to carry out the actual operation.



Now let's walk through the Heartbeat v1 setup:


Installing a high-availability cluster: implementing the Heartbeat v1 process





1. Node names are critical; every node in the cluster must be able to resolve the others' names. Using the hosts file, the forward and reverse resolution of each host name in /etc/hosts must match the output of "uname -n" on that node (a quick check is shown below).
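For example, a quick consistency check on each node might look like this (the output lines are illustrative for this setup):
# uname -n
web1.chinasoft.com
# grep "$(uname -n)" /etc/hosts
192.168.8.39 web1.chinasoft.com web1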


2. Time must be synchronized. Synchronize a company-internal server against a public network time server, and then synchronize the other machines to it over the LAN (for the detailed setup of an NTP service on CentOS 7.2 with intranet time synchronization, see: http://blog.csdn.net/reblue520/article/details/51143450);
# yum install -y ntp
# vim /etc/ntp.conf


Add the following:
server 192.168.8.102
restrict 192.168.8.102 nomodify notrap noquery
server 127.127.1.0     # local clock
fudge 127.127.1.0 stratum 10


# service ntpd start
# service ntpd restart
Synchronize once manually; afterwards ntpd keeps the clock in sync automatically
# ntpdate -u 192.168.8.102


3. Configure the virtual IP (VIP) on the master node and install the httpd service. The two HA node IPs are:
Master node: 192.168.8.39 web1.chinasoft.com
Standby node: 192.168.8.40 web2.chinasoft.com
Configure the VIP on the master node:


# ifconfig eth0:0 192.168.8.66/16 up
# route add default gw 192.168.8.254     (only if the default route is not already set; ifconfig itself does not take a gateway argument)






Install Apache for testing, then stop the service and disable it at boot (the httpd service will be managed by heartbeat rather than started by init):

# yum install -y httpd
# service httpd stop
# chkconfig httpd off



4. Each node must be able to authenticate to the others over SSH using keys.

1) Configure the host names: the first node's host name is web1.chinasoft.com and the second node's is web2.chinasoft.com


# vim /etc/hosts     add the host name entries; note that both nodes need these lines (add an entry for every node that must be resolved)
192.168.8.40 web2.chinasoft.com web2
192.168.8.39 web1.chinasoft.com web1
# uname -n
web2.chinasoft.com
# uname -n
web1.chinasoft.com
# cat /etc/sysconfig/network     if the HOSTNAME here does not match web1/web2, edit this file so the host name is still correct after the next reboot (see the commands below).
# If the hostname was just changed, it does not show up in the current session; log out (Ctrl+D) and log back in.
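For reference, on CentOS 6 the hostname can be set both for the running system and persistently, for example on the first node (the commands assume this article's host names):
# hostname web1.chinasoft.com          temporary, takes effect immediately
# vim /etc/sysconfig/network           persistent across reboots
HOSTNAME=web1.chinasoft.com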


2) Set up passwordless key-based SSH between the two (or more) hosts
# ssh-keygen -t rsa -P ''     generates a key pair with an empty passphrase; the public key is then copied to the other node.
# ssh-copy-id -i .ssh/id_rsa.pub root@web2.chinasoft.com     use the login user and host name of the other node


-----------------
# ssh-copy-id -i .ssh/id_rsa.pub root@web2.chinasoft.com


Error:
@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@
@       WARNING: POSSIBLE DNS SPOOFING DETECTED!          @
@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@
The RSA host key for web2.chinasoft.com has changed,
and the key for the corresponding IP address 192.168.8.40
is unknown. This could either mean that
DNS SPOOFING is happening or the IP address for the host
and its host key have changed at the same time.
@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@
@    WARNING: REMOTE HOST IDENTIFICATION HAS CHANGED!     @
@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@
IT IS POSSIBLE THAT SOMEONE IS DOING SOMETHING NASTY!
Someone could be eavesdropping on you right now (man-in-the-middle attack)!
It is also possible that the RSA host key has just been changed.
The fingerprint for the RSA key sent by the remote host is
2b:58:e9:bd:28:e5:91:43:cb:67:47:dc:b9:92:e7:cd.
Please contact your system administrator.
Add correct host key in /root/.ssh/known_hosts to get rid of this message.
Offending key in /root/.ssh/known_hosts:2
RSA host key for web2.chinasoft.com has changed and you have requested strict checking.
Host key verification failed.


The host key recorded for the target host in .ssh/known_hosts no longer matches. This is the most common cause; deleting the corresponding host record restores normal operation.
Run: # rm /root/.ssh/known_hosts
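Deleting the whole known_hosts file discards every recorded host key; if only the stale entry should be removed, ssh-keygen can delete just the offending hosts:
# ssh-keygen -R web2.chinasoft.com
# ssh-keygen -R 192.168.8.40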
-----------------


Both hosts must be able to reach each other, so both hosts need to generate a key pair and copy their public key over, and the hosts file on each node must resolve the other node's host name: 192.168.8.40 web2.chinasoft.com web2
192.168.8.39 web1.chinasoft.com web1
# ssh web2.chinasoft.com 'date'; date
Thu Apr 14:51:36 CST 2016
Thu Apr 14:51:36 CST 2016


The ssh-copy-id command depends on the openssh-clients package:
# ssh-copy-id -i .ssh/id_rsa.pub root@web2.chinasoft.com
-bash: ssh-copy-id: command not found
# yum install -y openssh-clients


3) Install the Heartbeat v1 packages; both nodes need the heartbeat-related packages.
# These packages need to be installed, and their dependencies must be resolved first:
heartbeat-2.1.4-12.el6.x86_64.rpm, heartbeat-pils-2.1.4-12.el6.x86_64.rpm,
heartbeat-stonith-2.1.4-12.el6.x86_64.rpm
# Resolve the dependencies:
# yum -y install perl-TimeDate net-snmp-libs libnet PyXML gettext-devel
If a suitable package cannot be found, install the EPEL repository first:
# yum install -y epel-release


# rpm -ivh heartbeat-pils-2.1.4-12.el6.x86_64.rpm heartbeat-stonith-2.1.4-12.el6.x86_64.rpm heartbeat-2.1.4-12.el6.x86_64.rpm 

Error:
error: Failed dependencies:
    libltdl.so.7()(64bit) is needed by heartbeat-pils-2.1.4-12.el6.x86_64
    libltdl.so.7()(64bit) is needed by heartbeat-stonith-2.1.4-12.el6.x86_64
    libltdl.so.7()(64bit) is needed by heartbeat-2.1.4-12.el6.x86_64


Installing libtool-ltdl resolves this:
# yum install -y libtool-ltdl


A high-availability cluster depends on: 1. the messaging layer, 2. the resource manager, 3. the resource agents.
Our configuration only needs to cover these layers.
One note: how do we make sure that only the nodes we intend end up in the cluster? Cluster information must not be exchanged with arbitrary hosts. Heartbeat sends its messages to a multicast address, and if someone else also installed heartbeat and joined the same multicast address, that would be insecure. For this reason the information exchanged between nodes is authenticated, and the authentication is based on an HMAC (message authentication code). The message authentication code is computed with a one-way hash; the usual options are CRC (cyclic redundancy check), MD5 (message digest), and SHA1. Heartbeat uses UDP and listens on port 694.
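As a rough illustration of the HMAC idea (this is not heartbeat's exact wire format), a SHA1 message authentication code over an arbitrary payload can be computed with openssl; only nodes holding the same shared key can produce and verify the same code:
# echo -n "node=web1 status=active" | openssl dgst -sha1 -hmac "8beffe603880f0a8"
(stdin)= <40-character hex digest, depends on key and payload>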

4) Configuration Heartbeat


The configuration files live in the /etc/ha.d/ directory, but right after installation they are not there; /usr/share/doc/heartbeat-2.1.4/ contains a sample of the main configuration file, ha.cf, which can be copied into /etc/ha.d/ and modified. There is also the authkeys file, which holds the authentication mechanism and password shared by the nodes; its permissions are therefore important and must be 600, otherwise the service will not start. The third file is haresources: the resource manager reads this file when resources are defined, so it is needed as well.



# cp /usr/share/doc/heartbeat-2.1.4/{ha.cf,authkeys,haresources} /etc/ha.d/
# cd /etc/ha.d/
# openssl rand -hex 8     generate a random 16-character hex string
8beffe603880f0a8
# vim /etc/ha.d/authkeys
auth 2     the 2 must match the key index on the line below; which index is used does not matter
2 sha1 8beffe603880f0a8
# chmod 600 authkeys
# vim /etc/ha.d/ha.cf     enable the following parameters and features
logfile /var/log/ha-log     # log file where normal log messages are written
#logfacility local0     # syslog facility; left commented out because logfile is used instead
keepalive 1000ms     # how often heartbeat information is sent; the unit is seconds unless suffixed with ms
deadtime 8     # seconds without a heartbeat after which the other node is declared dead
warntime 10     # seconds after which a late-heartbeat warning is logged
udpport 694
mcast eth0 225.0.0.1 694 1 0     # multicast address used for the heartbeats
auto_failback on     # enable failback
node web1.chinasoft.com     # define the two nodes
node web2.chinasoft.com
ping 192.168.8.254     # ping node; use an IP that is reliably online, e.g. the router address
compression bz2     # compression format
compression_threshold 2     # messages smaller than 2KB are sent uncompressed




Define the resources: resources are defined in the resource manager's configuration file, /etc/ha.d/haresources. The resource agent scripts for the various resource types live under /etc/ha.d/resource.d; when a resource is listed in the configuration file, the matching script is called to run the corresponding program.


# vim /etc/ha.d/haresources
web1.chinasoft.com 192.168.8.66 httpd     # 192.168.8.66 is the floating (virtual) IP
Note: web1.chinasoft.com states which host is the primary node, i.e. which node the resources prefer.
[The line can also be written as web1.chinasoft.com 192.168.8.66/16/eth0 httpd,
or for the other node as web2.chinasoft.com 192.168.8.40 httpd. How httpd is looked up: the /etc/ha.d/resource.d directory is searched first, and if nothing is found there, /etc/init.d/ is searched for httpd; whichever script is found is used to start it.]
# scp -p authkeys haresources ha.cf web1.chinasoft.com:/etc/ha.d
# service heartbeat start
# ssh web2.chinasoft.com 'service heartbeat start'


Attention:



After the configuration is complete, it takes a short while for the resources to come up.
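A quick way to confirm that the master node has taken over the resources (illustrative commands for this setup):
# ip addr show eth0 | grep 192.168.8.66     the VIP should appear on the master node
# service httpd status                      httpd should have been started by heartbeat
# curl http://192.168.8.66/                 should return the active node's test page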





Result:


When a node fails, the other node takes over and becomes the master node, so the service stays available and is not stopped. For testing, prepare different web page content on the two nodes so they can be told apart; then terminate the web service (or heartbeat) on one of them and watch the effect: heartbeat automatically moves the resources to the node that is still working, with only a brief interruption of service.
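A simple failover test, assuming the two nodes serve distinguishable pages, might look like this:
On web1 (current master), simulate a failure:
# service heartbeat stop
On web2, watch the takeover and check the VIP:
# tail -f /var/log/ha-log
# ip addr show eth0 | grep 192.168.8.66
From a client, the VIP should now answer with web2's page:
# curl http://192.168.8.66/
Bring web1 back; with auto_failback on, the resources move back to it:
# service heartbeat start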






Error handling (/var/log/ha-log):



heartbeat[4294]: 2016/04/14_19:11:53 ERROR: should_drop_message: attempted replay attack [web2.chinasoft.com]? [gen = 1460624833, curgen = 1460624839]
heartbeat[4294]: 2016/04/14_19:11:54 ERROR: should_drop_message: attempted replay attack [web2.chinasoft.com]? [gen = 1460624833, curgen = 1460624839]
heartbeat[4294]: 2016/04/14_19:11:55 ERROR: should_drop_message: attempted replay attack [web2.chinasoft.com]? [gen = 1460624833, curgen = 1460624839]



Refer to: http://www.cerebris.com/blog/2011/02/14/cloning-a-heartbeat-server/
# rm /var/lib/heartbeat/hb_uuid
# service heartbeat restart





