High Availability Cluster principle

Source: Internet
Author: User
Tags failover

What is high availability

High availability builds on load-balanced clustering by also taking the quality and availability of the service into account. Put simply, when part of the cluster fails, a set of mechanisms restores service quickly, ideally without users noticing anything, so that the switchover is seamless. Almost any service can be made highly available, for example an IPVS cluster service, httpd, or MySQL. I think high availability can be divided into two cases.

1: Multiple servers are deployed and connected by heartbeat links. At fixed intervals the nodes send each other cluster transaction information through the messaging layer. If a node fails and the other nodes receive nothing from it within the time limit, a quorum vote starts, and based on the vote count a policy (restart, shutdown, and so on) is applied to the side with fewer votes.

2: A back-end node fails. The front-end load balancer health-checks the back-end servers at fixed intervals. If a node still fails after a set number of checks, it is taken offline and no longer receives requests; once the fault is cleared, requests are allocated to it again. (This case applies to the back-end nodes of a load-balanced cluster.)
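
The health check in case 2 could be sketched roughly as follows. This is a minimal illustration, not any particular load balancer's implementation: the names `Backend`, `HealthChecker`, `fall`, and `rise` are assumptions made for the example, and the actual probe (for instance an HTTP request) is left out.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Backend:
    """State the balancer keeps for one back-end node (hypothetical structure)."""
    name: str
    online: bool = True
    fail_streak: int = 0
    ok_streak: int = 0


class HealthChecker:
    """Marks a node offline after `fall` failed checks in a row,
    and online again after `rise` successful checks in a row."""

    def __init__(self, fall: int = 3, rise: int = 2) -> None:
        self.fall = fall
        self.rise = rise

    def record(self, backend: Backend, check_passed: bool) -> None:
        if check_passed:
            backend.ok_streak += 1
            backend.fail_streak = 0
            if not backend.online and backend.ok_streak >= self.rise:
                backend.online = True   # fault cleared: allocate requests again
        else:
            backend.fail_streak += 1
            backend.ok_streak = 0
            if backend.online and backend.fail_streak >= self.fall:
                backend.online = False  # stop allocating requests to this node


def eligible(backends: List[Backend]) -> List[str]:
    """Names of the nodes that may currently receive requests."""
    return [b.name for b in backends if b.online]
```

Requiring several consecutive results before changing state keeps a single lost probe from taking a healthy node out of rotation.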

How to judge a fault

1: Use another reference node, for example by pinging the gateway (any node can serve as the reference point). If a node can ping the reference point but cannot reach its peer, the peer has a problem; otherwise the node itself may have the problem.

2: Use a quorum device such as a quorum disk. Each node writes data to the disk at a fixed interval; if a peer stops writing, that peer has probably failed.
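
Both checks can be illustrated with a small sketch. It assumes one reading of method 1 (reference reachable but peer not means the peer is suspect; neither reachable means the local node is suspect), and the function names and return strings are hypothetical. The reachability results and write timestamps are passed in as plain values, so no real ping or disk access happens here.

```python
from typing import Dict, List


def diagnose(peer_reachable: bool, reference_reachable: bool) -> str:
    """Method 1: use a reference node (e.g. the gateway) to decide who is at fault."""
    if peer_reachable:
        return "healthy"
    if reference_reachable:
        return "peer-suspect"   # we can see the reference but not the peer
    return "self-suspect"       # we can see nobody: probably our own problem


def stale_writers(last_write: Dict[str, float], now: float, timeout: float) -> List[str]:
    """Method 2: nodes whose last quorum-disk write is older than `timeout` seconds."""
    return sorted(node for node, ts in last_write.items() if now - ts > timeout)
```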

Fault handling:


Quorum (the legal number of votes)

When a node fails, the nodes vote to decide which side is at fault. A side holds quorum when it has more than half of the total votes. Nodes do not all need the same number of votes: a node with better performance or some other advantage can be given more votes, planned according to need.
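
The "more than half" rule with per-node vote weights can be written down in a few lines. This is only a sketch; the function name and the `votes` and `reachable` parameters are assumptions made for illustration.

```python
from typing import Dict, Set


def has_quorum(votes: Dict[str, int], reachable: Set[str]) -> bool:
    """True if the nodes in `reachable` together hold more than half of all votes."""
    total = sum(votes.values())
    held = sum(v for node, v in votes.items() if node in reachable)
    # Compare 2 * held with total to avoid fractional arithmetic.
    return 2 * held > total
```

For example, with three nodes of one vote each, a partition of two nodes has quorum and the isolated node does not. In a two-node cluster with equal votes, neither side can exceed half on its own, which is one reason a reference node or quorum device is useful.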

Global cluster resource management policy: what cluster nodes that do not hold quorum should do (without_quorum_policy, named no-quorum-policy in Pacemaker)

    • Freeze: Established requests continue to be served, but no new requests are accepted

    • Stop: Stop all services and requests

    • Ignore: Continue serving normally, as if quorum were still held
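
As a rough sketch of how the three policies differ, the function below maps a policy to two flags. The enum name `QuorumLossPolicy` and the `(serve_existing, accept_new)` return shape are hypothetical and chosen only for this example.

```python
from enum import Enum
from typing import Tuple


class QuorumLossPolicy(Enum):
    FREEZE = "freeze"
    STOP = "stop"
    IGNORE = "ignore"


def service_mode(policy: QuorumLossPolicy, quorate: bool) -> Tuple[bool, bool]:
    """Return (serve_existing, accept_new) for a node under the given policy."""
    if quorate or policy is QuorumLossPolicy.IGNORE:
        return (True, True)    # normal service
    if policy is QuorumLossPolicy.FREEZE:
        return (True, False)   # keep established work, refuse new work
    return (False, False)      # STOP: shut everything down
```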


Failover

Failover transfers highly available cluster resources from a cluster node that does not hold quorum to a node in the failover domain (the nodes that are able to take over resources from a failed node). Resources on the side with fewer votes are moved according to the resource constraints. After the fault has been repaired, whether to fail back (failback) depends on the configured resource stickiness and resource constraints. A standby machine is usually used only for backup and is less powerful than the primary, so resources should normally move back once the primary recovers, but judge by the actual situation. This is achieved through resource placement preferences.
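
The two decisions in this paragraph, where to fail over to and whether to fail back, might be sketched as below. The ordered failover domain, the function names, and the comparison "primary score must exceed backup score plus stickiness" are simplifying assumptions for illustration, not a description of a specific cluster manager.

```python
from typing import List, Optional, Set


def pick_failover_target(domain: List[str], healthy: Set[str]) -> Optional[str]:
    """First healthy node in an ordered failover domain, or None if there is none."""
    for node in domain:
        if node in healthy:
            return node
    return None


def should_fail_back(primary_score: float, backup_score: float, stickiness: float) -> bool:
    """The resource currently runs on the backup node. Move it back to the
    recovered primary only if the primary's preference beats the backup's
    preference plus the resource's stickiness to where it is now."""
    return primary_score > backup_score + stickiness
```

With a stickiness of 0 the resource returns to a stronger primary as soon as it recovers; a large stickiness keeps it on the backup and avoids a second interruption.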


Highly available cluster resources (HA resources): for example, a service, a VIP, or a file system configuration can each be a highly available cluster resource

    • Primitive: A primary resource that can run on only one node at a time, such as a VIP

    • Group: A resource group, typically containing only primitive resources

    • Clone: A resource that can run on multiple nodes at once. For example, STONITH configured as a resource should run on all nodes, and so should the distributed lock manager (DLM) used by a cluster file system

    • Master/slave: A special clone resource that runs on two nodes, one as master and one as slave, for example DRBD (Distributed Replicated Block Device, merged into the kernel in 2.6.33)


Resource preferences (the basis for resource placement)

Resource stickiness: the relationship between a resource and its current node

    • Whether the resource prefers to stay on its current node, expressed as a score; a positive value favors the current node (evaluated together with location constraints)

Resource constraints (constraint): the relationships between resources

    • Colocation constraints (colocation): Define whether resources run on the same node, expressed as a score. A positive value means they should run on the same node; a negative value means they should not (mutual exclusion)

    • Location constraints (location): Each node has a score. A positive value favors that node; a negative value favors other nodes

    • Order constraints (order): Define the order in which resources start or stop. For example, the VIP should be configured first and the httpd service after it

Special score values: -inf is negative infinity and inf is positive infinity
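
To make the interplay of these scores concrete, here is a small sketch that picks a node for one resource from location scores, stickiness, and a single colocation constraint. The function names and parameters are hypothetical. The infinity handling follows the convention used by Pacemaker, where negative infinity wins over positive infinity; a plain float addition of the two would give NaN instead.

```python
import math
from typing import Dict, List, Optional

INF = math.inf


def add_scores(a: float, b: float) -> float:
    """Add two scores; -inf dominates everything, then +inf."""
    if a == -INF or b == -INF:
        return -INF
    if a == INF or b == INF:
        return INF
    return a + b


def choose_node(
    nodes: List[str],
    location: Dict[str, float],
    current: Optional[str] = None,
    stickiness: float = 0,
    partner_node: Optional[str] = None,
    colocation: float = 0,
) -> Optional[str]:
    """Return the highest-scoring node, or None if every node scores -inf."""
    best, best_score = None, -INF
    for node in nodes:
        score = location.get(node, 0)                  # location constraint
        if node == current:
            score = add_scores(score, stickiness)      # prefer staying put
        if node == partner_node:
            score = add_scores(score, colocation)      # with or away from partner
        if score > best_score:
            best, best_score = node, score
    return best
```

A colocation score of `-INF` therefore keeps the resource off the partner's node even when the location score there is high, and a node whose total is `-INF` is never chosen.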

High availability cluster management: a concrete implementation


Cluster messaging layer (message layer)

The messaging layer is the mechanism that transmits cluster information. It listens on UDP port 694 and can send information in real time by unicast, multicast, or broadcast. What it carries is the cluster transaction traffic of the highly available cluster, such as heartbeat information and resource information. It is responsible only for transmitting information, not for computing on it or comparing it.

Cluster resource manager (Cluster Resource Manager, CRM)

The CRM uses the messaging layer to collect node information, computes on and compares that information, and takes the corresponding actions. The CRM elects one node to do the computing and comparing, called the DC (Designated Coordinator). The computation is done by the PE (Policy Engine), and the resulting actions are directed by the TE (Transition Engine). Each node also runs an LRM (Local Resource Manager), a subcomponent of the CRM that receives the instructions passed down by the TE and carries them out on that node, for example by running an RA script.

Resource agent (Resource Agent, RA): a script that is able to manage a cluster resource. A node's actions include checking the status of all resources. If a resource still shows a problem after a set number of checks, the node tries to reset or restart it; if the restart does not help, the resource is transferred to another node. HA scripts should therefore follow the Linux Standard Base (LSB) scripting convention and support start, stop, restart, and status. Later, more refined script types appeared, such as OCF scripts, which add monitoring.
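
The shape of such an agent and the check, restart, transfer escalation can be sketched as follows. Real resource agents are usually shell scripts; this stand-in uses an in-memory `DummyAgent` (a hypothetical class, as are `dispatch` and `recover`) so that it runs anywhere. The return codes follow the LSB init-script convention: 0 for success, 3 from status when the service is not running, and 2 for an invalid argument.

```python
class DummyAgent:
    """In-memory stand-in for a managed service (hypothetical)."""

    def __init__(self, can_start: bool = True) -> None:
        self.can_start = can_start
        self.running = False

    def start(self) -> int:
        if not self.can_start:
            return 1            # generic failure
        self.running = True
        return 0

    def stop(self) -> int:
        self.running = False
        return 0

    def restart(self) -> int:
        self.stop()
        return self.start()

    def status(self) -> int:
        return 0 if self.running else 3   # 3: program is not running


def dispatch(agent: DummyAgent, action: str) -> int:
    """Map start|stop|restart|status to the agent, like an init script's case block."""
    handlers = {
        "start": agent.start,
        "stop": agent.stop,
        "restart": agent.restart,
        "status": agent.status,
    }
    if action not in handlers:
        return 2                # invalid argument
    return handlers[action]()


def recover(agent: DummyAgent, failed_checks: int, threshold: int = 3) -> str:
    """Escalate: wait until enough checks fail, then restart, then give up locally."""
    if failed_checks < threshold:
        return "wait"
    if dispatch(agent, "restart") == 0 and dispatch(agent, "status") == 0:
        return "restarted"
    return "failover"           # restart did not help: transfer the resource
```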





This article is from the "Call Me boxin" blog; please be sure to keep this source: http://boxinknown.blog.51cto.com/10435935/1673396
