Linux Clusters and high availability

Source: Internet
Author: User

The reliability of a computer system is measured by its mean time to failure (MTTF), that is, the average time the system runs normally before a fault occurs. The higher the reliability, the longer the MTTF. Maintainability is measured by the mean time to repair (MTTR), that is, the average time it takes to repair the system and restore it to normal operation after a fault occurs. The better the maintainability, the shorter the MTTR. Availability is defined as MTTF / (MTTF + MTTR) * 100%. In other words, availability is the percentage of time during which the system is running normally.
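As a small illustration of the formula, the sketch below computes availability from MTTF and MTTR. The function name and the sample figures are assumptions chosen for the example, not values from any real system.

```python
def availability(mttf_hours: float, mttr_hours: float) -> float:
    """Return availability as a percentage: MTTF / (MTTF + MTTR) * 100."""
    if mttf_hours < 0 or mttr_hours < 0 or mttf_hours + mttr_hours == 0:
        raise ValueError("MTTF and MTTR must be non-negative and not both zero")
    return mttf_hours / (mttf_hours + mttr_hours) * 100.0


# Illustrative figures only: 999 hours between faults, 1 hour to repair.
print(f"{availability(999, 1):.1f}%")  # prints 99.9%
```

Note that the same availability can come from very different systems: one that fails rarely but is slow to repair, or one that fails often but recovers quickly.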

The computer industry generally classifies system availability by the number of nines in the availability figure, as shown in the following table.

Availability classification                  Availability level (%)   Annual downtime
Fault-tolerant availability                  99.9999                  < 1 min
Very high availability                       99.999                   5 min
Availability with automatic fault recovery   99.99                    53 min
High availability                            99.9                     8.8 h
Commodity availability                       99                       87.6 h
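The annual downtime for a given availability level follows directly from the definition of availability. The sketch below assumes a 365-day year (8760 hours); the function name is hypothetical. Its results can be compared with the rounded figures in the table.

```python
HOURS_PER_YEAR = 365 * 24  # 8760 hours, ignoring leap years


def annual_downtime_hours(availability_percent: float) -> float:
    """Return the expected downtime per year, in hours."""
    if not 0.0 <= availability_percent <= 100.0:
        raise ValueError("availability must be between 0 and 100")
    return (100.0 - availability_percent) / 100.0 * HOURS_PER_YEAR


for level in (99.9999, 99.999, 99.99, 99.9, 99.0):
    hours = annual_downtime_hours(level)
    print(f"{level}% -> {hours:.4f} h ({hours * 60:.1f} min)")
```

Each additional nine cuts the permitted downtime by a factor of ten.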

Both hardware redundancy and software techniques can greatly improve system availability. Hardware redundancy keeps multiple redundant components in the system, such as hard disks and network cables, so that service continues when a working part fails. Software monitors the running status of the machines in the cluster and, when a machine fails and can no longer provide service, starts a standby machine to take over from it.

In general, both the Cluster Manager and the nodes must be made highly available. Eddie, Linux Virtual Server, Turbolinux, Piranha, and Ultramonkey all adopt high availability solutions similar to the one in figure 1.

High Availability of the Cluster Manager
To guard against failure of the Cluster Manager, set up a backup machine for it. Both the master manager and the backup manager run the heartbeat program, and each monitors the other's status by exchanging messages such as "I am alive". When the backup manager does not receive such a message within a certain period of time, it activates the fake program, which takes over the master manager's IP address so that the backup manager can continue providing services. When the backup manager again receives an "I am alive" message from the master manager, it deactivates the fake program and releases the IP address, so that the master manager resumes managing the cluster.
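The takeover logic on the backup side can be sketched as a small state machine. This is a simplified, single-threaded model written for illustration only: the class and method names are hypothetical, timestamps are passed in by hand, and it does not reproduce the actual heartbeat or fake programs.

```python
class BackupManager:
    """Backup side of a heartbeat pair (simplified illustrative model)."""

    def __init__(self, timeout_seconds: float, start_time: float = 0.0):
        self.timeout_seconds = timeout_seconds
        self.last_heartbeat = start_time
        self.active = False  # True while the backup holds the service IP

    def on_heartbeat(self, now: float) -> None:
        """Handle an "I am alive" message from the master manager."""
        self.last_heartbeat = now
        if self.active:
            self.active = False  # master is back: release the IP address

    def tick(self, now: float) -> None:
        """Periodic check: take over if the master has been silent too long."""
        if not self.active and now - self.last_heartbeat > self.timeout_seconds:
            self.active = True  # take over the master's IP address


backup = BackupManager(timeout_seconds=3.0)
backup.on_heartbeat(now=1.0)
backup.tick(now=2.0)
print(backup.active)  # False: the master was heard 1 s ago
backup.tick(now=5.0)
print(backup.active)  # True: 4 s of silence exceeds the 3 s timeout
backup.on_heartbeat(now=6.0)
print(backup.active)  # False: the master returned, so the backup stands down
```

The timeout is a trade-off: a short value shortens the outage after a real failure, while a long value reduces the chance of a false takeover when a single message is delayed or lost.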

High Availability of Nodes
Node high availability is achieved by continuously monitoring the status of each node and of the applications running on it. When a node is found to have failed, the system is reconfigured and the workload is assigned to the nodes that are still running normally. The system runs the mon daemon on the Cluster Manager to monitor the service programs on the real servers in the cluster. For example, fping.monitor checks at a fixed interval whether a real server is still up, http.monitor checks the HTTP service, and ftp.monitor checks the FTP service. If a real server or a service on it is found to have failed, all rules related to that server are deleted from the Cluster Manager. Conversely, once the server is found to be providing service again, all the corresponding rules are added back. In this way, the Cluster Manager automatically masks the failure of a server and the service programs running on it, and adds the server back to the cluster once it is running normally again.
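One monitoring pass of this kind can be sketched as follows. The sketch does not use mon or its monitor scripts: the health check is a plain callback, the "rules" are modeled as a set of server addresses, and all names and addresses are hypothetical.

```python
from typing import Callable, Iterable, Set


def update_active_servers(
    servers: Iterable[str],
    active: Set[str],
    is_healthy: Callable[[str], bool],
) -> Set[str]:
    """Return the set of servers that should receive traffic after one pass."""
    new_active = set(active)
    for server in servers:
        if is_healthy(server):
            new_active.add(server)      # healthy or recovered: keep/add rules
        else:
            new_active.discard(server)  # failed: delete its rules
    return new_active


# Simulated health results for one monitoring pass (hypothetical addresses).
status = {"10.0.0.1": True, "10.0.0.2": False, "10.0.0.3": True}
active = update_active_servers(status, {"10.0.0.1", "10.0.0.2"}, lambda s: status[s])
print(sorted(active))  # ['10.0.0.1', '10.0.0.3']
```

In a real deployment the callback would be replaced by actual probes (for example a ping or an HTTP request), and the function would be called at a fixed interval.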

