Reposted from: http://timyang.net/service/application-failure-managment/
The team recently held a technical salon on the fault-tolerance design issues we have run into in our projects. The topics discussed are summarized below.
Why does the application layer need a fault-tolerant design?
A complete system is made up of many small internal services, with remote calls between services and between services and resources. These calls run into the following problems:
- No individual system can reach 100% availability.
- Network and hardware problems of all kinds, such as network congestion, network interruption, hardware failure ...
- The average response time of the remote service becomes slow.
When the average response time of a server degrades, the slow calls can consume all of the caller's resources and make the entire system unusable. In a distributed system, therefore, it is not enough for the remote service itself to be fault tolerant; the application layer that makes the remote calls needs a good fault-tolerant design as well.
What fault-tolerant design methods are available to the application layer?
Here are some of the practices that the microblog team has used.
Fault-tolerant design for MySQL access
- Write operations: if the master is unavailable, throw an exception directly.
- Read operations: if there are multiple slaves, select one of them first; if getting a connection fails, select another slave; if none is available, fall back to the master.
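The read and write rules above can be sketched roughly as follows. This is a minimal illustration, not the team's actual client: `get_read_connection`, `get_write_connection`, and the injected `connect` function are hypothetical names, and choosing slaves in random order is an assumption, since the text only says that one slave is selected first.

```python
import random


def get_read_connection(slaves, master, connect, rng=random):
    """Try the slaves in random order; fall back to the master if all fail."""
    candidates = list(slaves)
    rng.shuffle(candidates)  # assumption: random choice among slaves
    for host in candidates:
        try:
            return connect(host)
        except ConnectionError:
            continue  # this slave is unavailable, try another one
    # Every slave failed: the master is the last resort for reads.
    return connect(master)


def get_write_connection(master, connect):
    """Writes go to the master only; a failure propagates to the caller."""
    return connect(master)
```

Here `connect` stands for whatever function opens a real database connection and raises `ConnectionError` when it cannot.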
Fault-tolerant design for Memcached/Redis access
First set so_timeout to avoid waiting indefinitely. If a server connection raises an IO exception, set an error flag and stop accessing that server for a period of time. After an error, recovery of the service is detected either actively (for example, by pinging Redis) or passively (when the server is accessed again).
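A minimal sketch of the error flag with passive recovery detection might look like the following. The class name `GuardedClient`, the `cooldown` parameter, and the injected `call` function are all hypothetical; the socket timeout itself is assumed to be set inside `call` (for example with `socket.settimeout`).

```python
import time


class GuardedClient:
    """Wraps calls to one server; after an I/O error, skips it for a cooldown."""

    def __init__(self, call, cooldown=5.0, clock=time.monotonic):
        self._call = call          # hypothetical function performing the real request
        self._cooldown = cooldown  # seconds to stop accessing the server after an error
        self._clock = clock
        self._failed_at = None     # error flag: time of the last I/O error, if any

    def available(self):
        if self._failed_at is None:
            return True
        return self._clock() - self._failed_at >= self._cooldown

    def request(self, *args):
        if not self.available():
            # Fail fast instead of waiting on a server known to be broken.
            raise ConnectionError("server marked down, failing fast")
        try:
            result = self._call(*args)
        except OSError:
            self._failed_at = self._clock()  # set (or refresh) the error flag
            raise
        self._failed_at = None  # passive detection: a successful call clears the flag
        return result
```

Active detection would replace the cooldown check with a periodic probe, such as a Redis ping, that clears the flag when it succeeds.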
Failover mechanism:
If a node fails to connect, the current pool uses consistent hashing to switch to a backup node; if the backup node has no data, the data is fetched from another service pool (which holds a copy of the data).
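One way to picture this failover path is the sketch below. It assumes that the backup node is simply the next distinct node on the hash ring, and that a miss on whichever node answers falls through to the other pool; `HashRing`, `get_with_failover`, `fetch`, and `fetch_from_replica_pool` are hypothetical names, not part of any real client.

```python
import bisect
import hashlib


def _hash(value):
    return int(hashlib.md5(value.encode("utf-8")).hexdigest(), 16)


class HashRing:
    """Minimal consistent-hash ring with virtual nodes."""

    def __init__(self, nodes, replicas=64):
        self._ring = sorted(
            (_hash(f"{node}#{i}"), node) for node in nodes for i in range(replicas)
        )
        self._points = [point for point, _ in self._ring]

    def nodes_for(self, key):
        """Return all nodes in ring order, starting at the key's position."""
        start = bisect.bisect(self._points, _hash(key))
        seen = []
        for offset in range(len(self._ring)):
            node = self._ring[(start + offset) % len(self._ring)][1]
            if node not in seen:
                seen.append(node)
        return seen  # seen[0] is the primary, seen[1] the backup, and so on


def get_with_failover(key, ring, fetch, fetch_from_replica_pool):
    for node in ring.nodes_for(key):
        try:
            value = fetch(node, key)
        except ConnectionError:
            continue  # node is down: switch to the next node on the ring
        if value is not None:
            return value
        break  # node is reachable but holds no data for this key
    # No data on the answering node (or every node was down): use the other pool.
    return fetch_from_replica_pool(key)
```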
Fault-tolerant design for accessing remote HTTP APIs
Set so_timeout. In some scenarios, use a short timeout and retry once. Because HTTP services behave in so many different ways, there is also a general downgrade mechanism at the business level.
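The combination of a short timeout, a single retry, and a downgrade could be sketched like this. The function names and the default values (0.5 seconds, one retry) are illustrative assumptions; only `urllib.request.urlopen` and its `timeout` parameter come from the Python standard library.

```python
import urllib.request


def http_get(url, timeout):
    """Plain GET with a socket timeout, using only the standard library."""
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return response.read()


def get_with_fallback(url, fallback, timeout=0.5, retries=1, fetch=http_get):
    """Short timeout, retry once, then degrade to a fallback value."""
    for _ in range(1 + retries):
        try:
            return fetch(url, timeout)
        except OSError:
            continue  # URLError and socket timeouts are both OSError subclasses
    return fallback  # business-level downgrade: serve a default instead of failing
```

What the fallback should be (a cached value, an empty result, a default) is a business decision, which is why it is passed in by the caller here.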
Problems with accessing each resource in a different way
The scenarios above show that the clients for different resources share common principles, yet each one has to implement them again in its own way. Because each client is implemented independently, and because remote service protocols and behaviors differ, these fault-tolerance principles cannot be reused directly. Beyond the code itself, different clients also depend on underlying libraries from different eras, and some early client implementations couple the data layer, connection layer, and protocol layer together, which further increases maintenance costs.
For example, some of the problems encountered in service development are as follows:
- The HBase client does not implement a fault-tolerant design, so access jitter affects other calls in the same service pool; it needs fault-tolerance and fail-fast strategies like those of the MySQL client.
- MySQL slave traffic is unbalanced; because there is no common load-balancing policy across multiple slave IPs, one has to be added, deployed, and validated again.
In addition, most remote resources in today's distributed systems are IO bound rather than CPU bound, yet most clients make synchronous calls. As a result, most calls spend their time waiting for the remote side to return while still occupying worker resources and causing a large number of thread context switches.
Is a unified client possible?
These strategies share common principles, so could a unified client layer solve the problem once and for all? Isn't that exactly the need that Twitter set out to address?
Finagle is more than an RPC framework in the usual sense; its goal is also to be a common client. Seen from another angle, access to remote resources in the broad sense can also be understood as RPC, which is why Finagle is often referred to as an RPC framework.
Finagle implements uniform client and server APIs for several protocols, and is designed for high performance and concurrency.
In Twitter's system, distributed services can be understood at three levels: Future, Service, and Filter. Mechanisms such as fault tolerance, timeout, authorization, tracing, and retry are embodied in filters. Futures free the client from multithreading, queues, connection pools, and resource management, shifting the focus from control flow to data flow and making calls asynchronous by default.
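To make the Future/Service/Filter idea concrete, here is a rough Python analogy built on asyncio. It is not Finagle's API (Finagle is a Scala library): a "service" here is just an async function from request to response, a "filter" is a function that wraps a service, and `timeout_filter`, `retry_filter`, and `compose` are hypothetical names.

```python
import asyncio


def timeout_filter(seconds):
    """Filter: fail the call if the wrapped service takes longer than `seconds`."""
    def apply(service):
        async def wrapped(request):
            return await asyncio.wait_for(service(request), seconds)
        return wrapped
    return apply


def retry_filter(attempts):
    """Filter: call the wrapped service up to `attempts` times on timeout."""
    def apply(service):
        async def wrapped(request):
            for remaining in range(attempts - 1, -1, -1):
                try:
                    return await service(request)
                except asyncio.TimeoutError:
                    if remaining == 0:
                        raise  # out of attempts: surface the timeout
        return wrapped
    return apply


def compose(service, *filters):
    """Wrap `service` so that the first filter listed is the outermost one."""
    for f in reversed(filters):
        service = f(service)
    return service
```

With these pieces, `compose(echo, retry_filter(2), timeout_filter(0.1))` would give a client that applies the timeout to each attempt and retries once, where `echo` is any async function. The business code only awaits the result, while timeout and retry behavior stay in the filters.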
Finagle's fail-fast module avoids dispatching requests to a problematic service. It flags a host by recording errors against it; once an error occurs, Finagle periodically reconnects from a background thread to check whether the host has recovered. While a host is down, the associated service is marked as unavailable.
If you were to redesign a generic network client, what elements should it include?
- A layered service design, modeled on the Future/Service/Filter concepts
- A layered network design that distinguishes the protocol layer, data layer, transport layer, and connection layer
- An independent, adaptable codec layer, so that support for HTTP, memcache, redis, mysql/jdbc, thrift, and other protocols can be added flexibly
- Years of high-availability experience with remote calls built into the implementation, such as load balancing, failover, multi-copy strategies, and degradation switches
- A universal remote invocation implementation that uses async calls to reduce the cost to business services, and uses futures to separate the concerns of remote calls and data flow
- Status views and statistics
- Of course, the ultimate goal is to provide common remote fault-tolerance capabilities: timeouts, retries, load balancing, failover ...
Fault tolerance and layered design of application layer