1. A critical aspect of a high-availability (HA) system is the elimination of single points of failure (SPOFs). A SPOF is a single device or piece of software whose failure can cause system downtime or data loss. To eliminate SPOFs, check that each of the following components is redundant:
Network modules, such as vswitches and vrouters
Automatic migration of applications and services
Storage modules
IDC facilities, such as power supply, air conditioning, and fire prevention
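The redundancy check above can be sketched as a simple inventory audit. This is a minimal sketch, assuming a hypothetical inventory in which each component records how many independent instances can take over its role; `Component`, `find_spofs`, and the sample entries are illustrative names, not part of OpenStack or any real tool.

```python
from dataclasses import dataclass


# Hypothetical inventory record: how many independent instances can
# take over a component's role. Names and counts are illustrative only.
@dataclass(frozen=True)
class Component:
    name: str
    category: str  # e.g. "network", "application", "storage", "facility"
    redundant_instances: int


def find_spofs(components: list[Component]) -> list[str]:
    """Return the names of components backed by fewer than two instances.

    A component with a single instance is a single point of failure:
    if it fails, nothing else can take over its role.
    """
    return [c.name for c in components if c.redundant_instances < 2]


inventory = [
    Component("vrouter", "network", 2),
    Component("vswitch", "network", 1),
    Component("shared-storage", "storage", 3),
    Component("power-supply", "facility", 1),
]

print(find_spofs(inventory))  # ['vswitch', 'power-supply']
```

Counting instances is only a first pass: two instances that share a dependency (for example, the same power feed) can still fail together, so a real audit also has to confirm that the redundant paths are independent.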
Most highly available systems fail when multiple independent (non-consequential) faults occur. A typical high-availability system achieves 99.99% availability or higher, which allows less than one hour of cumulative downtime per year. To meet this goal, a high-availability system should recover within one to two minutes after a fault occurs.
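The relationship between an availability target and the yearly downtime budget is simple arithmetic. The sketch below assumes a 365-day year (leap years ignored); `annual_downtime_minutes` is an illustrative helper, not an existing API.

```python
# Downtime budget implied by an availability target ("number of nines").
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes; leap years ignored


def annual_downtime_minutes(availability: float) -> float:
    """Maximum downtime per year, in minutes, for an availability in [0, 1]."""
    if not 0.0 <= availability <= 1.0:
        raise ValueError("availability must be between 0 and 1")
    return (1.0 - availability) * MINUTES_PER_YEAR


for target in (0.999, 0.9999, 0.99999):
    print(f"{target:.3%}: {annual_downtime_minutes(target):.1f} min/year")
```

For 99.99% the function returns about 52.56 minutes, consistent with "less than one hour". At one to two minutes per recovery, that budget covers roughly 26 to 52 faults a year, which is why fast automatic recovery matters as much as redundancy.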
OpenStack can meet these requirements for its own infrastructure services, so 99.99% availability is achievable for the OpenStack infrastructure itself. However, OpenStack does not guarantee 99.99% availability for individual customer (guest) instances.
How high availability is achieved depends on whether a service is stateless or stateful. A stateless service answers a request and then needs no further attention; to make it highly available, run redundant instances of it and load-balance requests across them. OpenStack's stateless services include nova-api, nova-conductor, glance-api, keystone-api, neutron-api, and nova-scheduler. In a stateful service, later requests depend on the results of earlier ones; OpenStack's stateful services include the database and the message queue. How a stateful service is made highly available depends on whether an active/passive (master-slave) or an active/active deployment is chosen.
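Redundancy for a stateless service works because any healthy instance can serve any request. The sketch below illustrates that idea with round-robin dispatch that skips failed instances. It is a minimal illustration only: `StatelessPool` and the instance names such as `nova-api@ctrl1` are hypothetical, and in real deployments this job is usually done by a dedicated load balancer such as HAProxy with its own health checks.

```python
import itertools


class StatelessPool:
    """Round-robin dispatch over redundant, interchangeable service instances.

    Because a stateless service keeps no per-client state between requests,
    losing one instance only shrinks the pool instead of causing an outage.
    """

    def __init__(self, instances: list[str]) -> None:
        self.instances = list(instances)
        self.healthy = set(self.instances)
        self._cycle = itertools.cycle(self.instances)

    def mark_down(self, instance: str) -> None:
        # Called when a (hypothetical) health check reports a failure.
        self.healthy.discard(instance)

    def mark_up(self, instance: str) -> None:
        # Only instances that belong to the pool can be restored.
        if instance in self.instances:
            self.healthy.add(instance)

    def pick(self) -> str:
        """Return the next healthy instance in round-robin order."""
        for _ in range(len(self.instances)):
            candidate = next(self._cycle)
            if candidate in self.healthy:
                return candidate
        raise RuntimeError("no healthy instances left")


pool = StatelessPool(["nova-api@ctrl1", "nova-api@ctrl2", "nova-api@ctrl3"])
pool.mark_down("nova-api@ctrl2")
print([pool.pick() for _ in range(4)])
# ['nova-api@ctrl1', 'nova-api@ctrl3', 'nova-api@ctrl1', 'nova-api@ctrl3']
```

This approach does not carry over to the database or the message queue: their instances hold state, so simply spreading requests across them is not enough, and they need an active/passive or active/active replication scheme instead.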
http://docs.openstack.org/high-availability-guide
Introduction to high availability in OpenStack