A Brief Introduction to High-Availability Cluster Heartbeats

What is a high-availability cluster? How does a high-availability cluster work?

A high-availability cluster (HA, high availability) is designed to ensure the continued availability of services by using one or more standby hosts that automatically take over the primary server's work after it goes down.

The host that is actually doing the work is called the master node; a host that watches the master node and starts working only when it dies is called a standby node. An HA cluster with only two nodes is actually a rather special case; in general there are three or more.

The work these hosts do is generally to provide services of one kind or another, whether a web service, a mail service, or a database service. The things those services need in order to run are called resources (Resource). Common resource types are: disk partitions, file systems, IP addresses, service programs, and NFS file systems, as well as the less common primitive (local type), group, clone, master-slave, and so on.

In an HA cluster, the master node constantly broadcasts a message telling the other nodes that it is alive and working. When the other nodes stop receiving this message, they assume the master has died and take over its work. Sounds pretty good, right? This message from the master node is what we call the heartbeat.

But consider a couple of situations. What if a network problem means the master and standby nodes can no longer communicate? Or what if the master is simply too busy to send a heartbeat in time? What then, dear reader? In either case, a standby node that does not receive the master's heartbeat within the configured time period will start to usurp the throne. So what about the data? What about the service resources? Shared with the old master? Who is the master now? Everyone is! And so the data and the other resources become collateral damage.
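
To make the idea concrete, here is a minimal sketch in Python of what "send a heartbeat, and take over when it stops" could look like. The port number, interval, dead time, and node name are invented for illustration; real stacks such as Heartbeat or Corosync have their own protocols and tunables.

```python
import socket
import time

# Hypothetical parameters, for illustration only; real heartbeat stacks
# (Heartbeat, Corosync, keepalived) use their own protocols, ports and timings.
HEARTBEAT_PORT = 5405
INTERVAL = 1.0        # the master sends a heartbeat every second
DEADTIME = 5.0        # the standby declares the master dead after 5 s of silence

def take_over_resources():
    # Placeholder: in a real cluster the standby would fence the old master
    # first (see the fencing sketch further down) and only then grab resources.
    print("master heartbeat lost, starting takeover")

def master_loop():
    """Master node: broadcast 'I am alive' on the LAN at a fixed interval."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_BROADCAST, 1)
    while True:
        sock.sendto(b"node1 alive", ("<broadcast>", HEARTBEAT_PORT))
        time.sleep(INTERVAL)

def standby_loop():
    """Standby node: if no heartbeat arrives within DEADTIME, assume the
    master is gone and start the takeover."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    sock.bind(("", HEARTBEAT_PORT))
    sock.settimeout(DEADTIME)
    while True:
        try:
            sock.recvfrom(1024)        # heartbeat received, master is alive
        except socket.timeout:
            take_over_resources()      # no heartbeat within DEADTIME
            break
```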

Things that can go wrong in these clusters, such as a node's operating system failing, network connectivity failing, a NIC failing, or an application failing, are called events. The behavior described above, where each node believes it is the master, is commonly called split-brain.

To solve the split-brain problem, those clever designers came up with a rather domineering mechanism: STONITH, short for Shoot The Other Node In The Head. Rest assured, nobody actually takes a gun to a server, however cool that would be. Simply put: whichever node grabs the master role first, on finding that someone else is trying to grab it too, cuts the other side's power or forces it to reboot.

The device used to deliver that shot to the head is called a fence (isolation) device in the Red Hat RHCS suite. In practice, when the master node appears to be dead, the standby node does not rush to grab its resources; it first calls the fence device to isolate the old master (reboot it, power it off, or cut its network), and only after the isolation succeeds does it calmly take over.
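
The order of operations is the important part: fence first, take over second. Below is a rough sketch of that order, with ipmitool used as a stand-in for a real fence agent; the host, user and password are placeholders, not an actual fencing API.

```python
import subprocess

def fence_node(node_ip):
    """Hypothetical fence call: power-cycle the suspect node over IPMI.
    Real clusters use dedicated fence agents (fence_ipmilan, fence_vmware_soap,
    and so on); the credentials here are placeholders."""
    result = subprocess.run(
        ["ipmitool", "-H", node_ip, "-U", "admin", "-P", "secret",
         "chassis", "power", "reset"],
        capture_output=True)
    return result.returncode == 0

def take_over_resources():
    print("old master fenced, taking over the IP, file system and services ...")

def handle_master_failure(master_ip):
    # 1. Do NOT grab the resources immediately.
    # 2. Fence (isolate) the old master first: reboot it, power it off,
    #    or cut its network.
    if not fence_node(master_ip):
        raise RuntimeError("fencing failed, refusing to take over: "
                           "grabbing the resources now could corrupt data")
    # 3. Only after fencing succeeds is it safe to take over.
    take_over_resources()
```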

Now imagine a scenario: there are many servers in the high-availability cluster, and the master node dies. Whose resources go to whom? Whoever grabs them first? No. We can define rules for each resource while everything is still healthy. For the HTTP service, for example, we can define which node it should prefer to move to if its current node dies, or which node it must move to, or which node it must never run on, or that it should simply wait for its node to be repaired and keep running there. These rules are constraints on resources, and the preference of a resource to stay where it already runs is called resource stickiness.

The way a resource chooses where to go after its own node dies is called failover (resource transfer), and a resource moving back to its original node after the failed master recovers is called failback.
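
As a rough illustration of how constraints, stickiness, failover and failback interact, here is a toy scoring sketch. The score values and the scoring rule are invented for this example; Pacemaker uses a similar score-based approach, but with its own semantics.

```python
# Invented scores for one resource (say, the HTTP service). Higher score wins;
# NEG_INF means "this node must never run the resource".
NEG_INF = float("-inf")

location_preference = {      # constraints defined while everything is healthy
    "node1": 100,            # the resource's preferred "home" node
    "node2": 50,
    "node3": NEG_INF,        # the HTTP service must never run here
}
RESOURCE_STICKINESS = 200    # bonus for staying on the node it already runs on

def choose_node(online_nodes, current_node=None):
    """Pick the online node with the highest combined score."""
    best, best_score = None, NEG_INF
    for node in online_nodes:
        score = location_preference.get(node, 0)
        if node == current_node:
            score += RESOURCE_STICKINESS
        if score > best_score:
            best, best_score = node, score
    return best

# Failover: node1 dies, so the resource moves to node2.
print(choose_node(["node2", "node3"]))                    # -> node2
# Failback decision: node1 comes back, but stickiness (200) outweighs node1's
# preference (100), so the resource stays put instead of failing back.
print(choose_node(["node1", "node2", "node3"], "node2"))  # -> node2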

Suppose a cluster has 10 working nodes, and a mouse chews through a network cable, splitting the cluster so that 6 nodes can talk to each other on one side and 4 on the other. Who should keep working? Everybody? Obviously not, so what do we do? A vote, perhaps? Exactly: we vote. It is very simple: the side with more than half of the votes keeps working, and the side with fewer votes gives up its services. So the 6 must keep going and the 4 must give up, right? Wrong. Keep in mind, dear reader, that the number of votes (quorum) is not necessarily equal to the number of servers. The node that manages the voting is called the DC (Designated Coordinator).
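
To see why the bigger half of the cluster does not automatically win, here is a small sketch of the majority-vote rule with invented vote weights (Corosync, for example, lets you give a node more than one vote; the numbers below are made up).

```python
# Invented vote weights: quorum is counted in votes, not in servers.
votes_per_node = {
    "node1": 5,    # say the admin gave this node extra weight
    "node2": 1, "node3": 1, "node4": 1, "node5": 1, "node6": 1,
    "node7": 1, "node8": 1, "node9": 1, "node10": 1,
}
TOTAL_VOTES = sum(votes_per_node.values())   # 14 votes for 10 servers

def partition_has_quorum(nodes_in_partition):
    """A partition may keep running only if it holds more than half the votes."""
    votes = sum(votes_per_node[n] for n in nodes_in_partition)
    return votes * 2 > TOTAL_VOTES

# The mouse chews through the cable: 6 servers on one side, 4 on the other.
side_a = ["node2", "node3", "node4", "node5", "node6", "node7"]   # 6 servers, 6 votes
side_b = ["node1", "node8", "node9", "node10"]                    # 4 servers, 8 votes
print(partition_has_quorum(side_a))   # False: 6 of 14 votes is not a majority
print(partition_has_quorum(side_b))   # True: the smaller side actually wins
```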

After so much rambling from the author, you are probably asking: "So how does this thing actually work?" Let's clear things up a little. First, a picture, which of course the author did not draw himself ...
