For a software system, availability comes first: if the system is not available, it does not matter how well everything else is done.
What typically makes software unavailable?
Failures on our side make the system unavailable. They may take down a single machine, or all N machines in a multi-machine group at once.
- Program failure: function errors, program exit
- System failure: CPU overload, memory overload, network overload
- Physical failure: machine crash, power loss, network outage
- Unrecoverable failure: earthquake, tsunami, etc.
The same kinds of failure can occur on the client side and make the system unavailable, either for individual users or for all users in a region.
Problems on our side must be solved through architecture. For problems on the customer's side, we try to find workarounds, addressing regional problems first and individual user problems second. Any solution has to weigh cost against strategy. For example, an early-stage startup without much funding is basically unable to guard against unrecoverable failures.
We start by addressing our own failures architecturally. The recurring solutions are analogous to design patterns and are called architectural patterns.
Single-machine unavailability has a technical name: single point of failure. The best remedy is to deploy multiple machines and balance load across them, so that no single machine is a point of failure.
- Distributed
- Load Balancing
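To illustrate how load balancing across several machines removes a single point of failure, here is a minimal sketch of a round-robin balancer that skips machines marked as down. The class name `RoundRobinBalancer`, its methods, and the backend names are hypothetical and chosen only for this example; a real deployment would normally rely on a dedicated load balancer with its own health checks.

```python
class RoundRobinBalancer:
    """Hypothetical round-robin balancer that skips unhealthy backends."""

    def __init__(self, backends):
        if not backends:
            raise ValueError("at least one backend is required")
        self._backends = list(backends)
        self._healthy = set(self._backends)
        self._next = 0  # index of the next backend to try

    def mark_down(self, backend):
        # Called when a health check or a request to this backend fails.
        self._healthy.discard(backend)

    def mark_up(self, backend):
        # Called when a previously failed backend recovers.
        if backend in self._backends:
            self._healthy.add(backend)

    def pick(self):
        # Try each backend at most once, starting from the rotation point.
        for _ in range(len(self._backends)):
            backend = self._backends[self._next]
            self._next = (self._next + 1) % len(self._backends)
            if backend in self._healthy:
                return backend
        raise RuntimeError("no healthy backend available")
```

As long as at least one machine remains healthy, `pick()` keeps returning a usable backend, so the loss of any single machine does not make the service unavailable.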
For multi-machine unavailability, the solution depends on the type of failure:
- Program failure: function errors, program exit. Some readers have pointed out that unit tests and functional tests can catch these errors. That is true, but it belongs to the development process, which we are not covering here. From an architectural perspective, the main solutions are:
  - Gray release (gradual rollout)
  - Exception monitoring
- System failure: CPU overload, memory overload, network overload
  - Flow control
  - Feature degradation
  - Exception monitoring
- Physical failure: machine crash, power loss, network outage
  - Multi-site active-active deployment
  - Hot or cold standby
  - Cross-site data synchronization
- Unrecoverable failure: earthquake, tsunami, etc.
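As a rough illustration of how flow control and feature degradation can work together against overload, the sketch below uses a token bucket to cap the request rate and falls back to a cheaper response once the bucket is empty. The token bucket is one common flow-control technique, not necessarily the one this series has in mind, and the names `TokenBucket`, `handle_request`, `full_handler`, and `degraded_handler` are hypothetical.

```python
import time


class TokenBucket:
    """Hypothetical token-bucket limiter: `rate` tokens per second, up to `capacity`."""

    def __init__(self, rate, capacity, clock=time.monotonic):
        self.rate = rate
        self.capacity = capacity
        self._clock = clock  # injectable so the limiter can be tested without sleeping
        self._tokens = float(capacity)
        self._last = clock()

    def allow(self):
        # Refill according to elapsed time, never exceeding capacity.
        now = self._clock()
        elapsed = now - self._last
        self._tokens = min(self.capacity, self._tokens + elapsed * self.rate)
        self._last = now
        if self._tokens >= 1:
            self._tokens -= 1
            return True
        return False


def handle_request(bucket, full_handler, degraded_handler):
    # Flow control: serve the full feature only while tokens remain.
    if bucket.allow():
        return full_handler()
    # Feature degradation: return a cheaper response instead of failing.
    return degraded_handler()
```

In this arrangement an overloaded system keeps answering every request, but only a bounded number per second pay the cost of the full feature; the rest receive the degraded result, such as cached or simplified data.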
I will explain each of these topics in detail in the articles that follow.
Copyright notice: This is an original article by the blogger and may not be reproduced without the blogger's permission.
Architect Express 8.3-availability