As software developers, we live at a fast pace. We use agile methods and develop features iteratively: a feature is finished, handed over for testing, passes QA, and is deployed to production. Then the terrible thing happens: the load in production exceeds what we designed for. This overload often shows up when invoking remote services. Without overload protection, requests block on the server waiting for the resource until system or server resources are exhausted. Often the failure starts out partial and small in scale, but for various reasons its scope keeps growing until it has global consequences. Murphy's law is particularly apt in software: "anything that can go wrong will go wrong." How do we solve this problem? There is a design pattern called the circuit breaker that provides overload protection.
There is a familiar phenomenon in daily life: if the electrical load at home gets too high, for example because many appliances are switched on at once, the circuit "trips" automatically and is disconnected. The older device for this is the fuse. When the load is too high, or the circuit fails or behaves abnormally, the current rises, and the rising current may damage important components or valuable devices, burn out the wiring, or even start a fire. When the current rises abnormally to a certain level and heats the fuse, the fuse melts and cuts off the current, protecting the circuit. The modern automatic tripping device is the circuit breaker, which usually uses an electromagnet to cut the circuit instead of burning out, so it can be reused. The software component pattern that imitates this device is the circuit breaker (CircuitBreaker).
In large-scale distributed systems, we often need to invoke remote services or resources. Because of factors outside the caller's control, such as slow network connections, resource contention, or temporary unavailability, these calls can fail. Such errors usually clear up on their own after a short time. In some cases, however, the outcome is hard to anticipate, and the remote method or resource may take a long time to repair. The error can be severe enough that part of the system becomes unresponsive, or even that the entire service becomes unusable. In that situation constant retrying may not solve the problem; instead, the application should return immediately and report an error.
In general, if a server is very busy, a partial failure in the system can lead to a cascading failure. For example, an operation might call a cloud service with a time-out configured, throwing an exception if the response takes longer than that. However, with this strategy, concurrent requests for the same operation all block until the time-out expires. These blocked requests hold valuable system resources such as memory, threads, and database connections. Eventually these resources are exhausted, starving other, unrelated parts of the system and dragging down the entire system. In this situation it may be better for the operation to return an error immediately rather than wait for the time-out, and to try again only when the call is likely to succeed.
Circuit breaker design pattern
Martin Fowler has summarized the circuit breaker pattern at http://martinfowler.com/bliki/CircuitBreaker.html. The pattern prevents an application from repeatedly trying an operation that is likely to fail, so the application can continue without waiting for the fault to be fixed and without wasting CPU time waiting for a long time-out. The pattern also lets the application detect whether the fault has been fixed; if it has, the application tries the operation again.
The circuit breaker acts as a proxy for operations that are prone to failure. The proxy records the number of recent failed calls and uses it to decide whether to let the operation proceed or to return an error immediately.
A circuit breaker can be implemented as a state machine with the following states:
- Closed state: A request from the application is passed straight through to the operation. The proxy maintains a count of recent failures and increments it whenever a call fails. If the number of recent failures exceeds the allowed threshold within a given period, the proxy switches to the open state. At that point the proxy starts a time-out timer, and when the timer expires the proxy switches to the half-open state. The time-out gives the system a chance to fix the error that caused the calls to fail.
- Open state: A request from the application fails immediately and an error response is returned.
- Half-open state: A limited number of requests from the application are allowed through to the service. If these requests succeed, the fault that caused the earlier failures is assumed to be fixed, and the circuit breaker switches to the closed state (and the failure counter is reset). If any of these requests fails, the fault is assumed to persist, so the circuit breaker switches back to the open state and restarts the timer to give the system more time to recover. The half-open state effectively prevents a recovering service from being brought down again by a sudden flood of requests.
The transitions between the states work as follows.
In the closed state, the failure counter is time-based: it resets automatically at a fixed interval. This prevents occasional errors from tripping the circuit breaker. The breaker enters the open state only if the number of failures reaches the threshold within that interval. In the half-open state, a success counter records consecutive successful calls. When the number of consecutive successes reaches a specified value, the breaker switches to the closed state; if any call fails, it switches back to the open state immediately, and the success counter is reset to zero the next time the half-open state is entered.
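The state machine described above can be sketched roughly as follows. This is a minimal illustration, not the implementation from any particular library: the names SimpleCircuitBreaker, CircuitOpenException, and all constructor parameters are hypothetical, the clock is injected only to make the behavior easy to test, and the class is deliberately not thread-safe.

```csharp
using System;

public enum CircuitState { Closed, Open, HalfOpen }

// Hypothetical exception type used to reject calls while the circuit is open.
public class CircuitOpenException : Exception
{
    public CircuitOpenException() : base("Circuit is open; call rejected.") { }
}

// Hypothetical minimal breaker for illustration only; not thread-safe.
public class SimpleCircuitBreaker
{
    private readonly int _failureThreshold;   // failures within the window that trip the breaker
    private readonly TimeSpan _failureWindow; // interval after which the failure count resets
    private readonly int _successThreshold;   // consecutive successes needed to close again
    private readonly TimeSpan _openDuration;  // how long to stay open before a trial call
    private readonly Func<DateTime> _clock;   // injected clock (an assumption, eases testing)

    private int _failures;
    private int _successes;
    private DateTime _windowStart;
    private DateTime _openedAt;

    public CircuitState State { get; private set; } = CircuitState.Closed;

    public SimpleCircuitBreaker(int failureThreshold, TimeSpan failureWindow,
                                int successThreshold, TimeSpan openDuration,
                                Func<DateTime> clock)
    {
        _failureThreshold = failureThreshold;
        _failureWindow = failureWindow;
        _successThreshold = successThreshold;
        _openDuration = openDuration;
        _clock = clock;
    }

    public T Execute<T>(Func<T> operation)
    {
        if (State == CircuitState.Open)
        {
            if (_clock() - _openedAt < _openDuration)
                throw new CircuitOpenException(); // fail fast while open
            State = CircuitState.HalfOpen;        // timer expired: allow trial calls
            _successes = 0;
        }

        try
        {
            T result = operation();
            OnSuccess();
            return result;
        }
        catch (Exception)
        {
            OnFailure();
            throw;
        }
    }

    private void OnSuccess()
    {
        // Only the half-open state counts successes; the closed-state
        // failure counter is reset by time, not by successful calls.
        if (State == CircuitState.HalfOpen && ++_successes >= _successThreshold)
        {
            State = CircuitState.Closed;
            _failures = 0;
        }
    }

    private void OnFailure()
    {
        DateTime now = _clock();
        if (State == CircuitState.HalfOpen) { Trip(now); return; } // any failure reopens
        if (now - _windowStart > _failureWindow)
        {
            _failures = 0;       // time-based reset of the failure counter
            _windowStart = now;
        }
        if (++_failures >= _failureThreshold) Trip(now);
    }

    private void Trip(DateTime now)
    {
        State = CircuitState.Open;
        _openedAt = now;
        _failures = 0;
    }
}
```

A caller would wrap each remote call, for example `breaker.Execute(() => client.GetData())` where `client.GetData` stands for whatever remote operation is being protected, and handle CircuitOpenException as the fast-failure case. In production code, passing `() => DateTime.UtcNow` as the clock would be the obvious choice.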
Implementing the circuit breaker pattern makes a system more stable and resilient: it provides stability while the system recovers from a failure and reduces the impact of failures on performance. It improves the system's response time by quickly rejecting requests for an operation that is likely to fail, rather than waiting for the operation to time out or never return. If the circuit breaker raises an event each time it changes state, this information can be used to monitor the health of the protected service and to alert an administrator when the breaker switches to the open state.
The circuit breaker pattern can be customized for the kinds of failure a particular remote service is likely to have. For example, the breaker can use an increasing time-out. When the breaker first opens, the time-out can be set to a few seconds; if the failure has not been resolved by then, it can be increased to a few minutes, and so on. In some cases, rather than throwing an exception in the open state, it may be useful to return a default value instead.
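One way to realize the increasing time-out is to double the open duration on each consecutive trip, up to a cap. The sketch below is only one possible policy; the BreakDurations class, the doubling factor, and the parameter names are assumptions made for illustration.

```csharp
using System;

// Hypothetical helper: computes how long the breaker should stay open.
public static class BreakDurations
{
    // Doubles the base duration for each consecutive trip (1st trip = base),
    // and never returns more than max.
    public static TimeSpan Next(TimeSpan baseDuration, int consecutiveTrips, TimeSpan max)
    {
        if (consecutiveTrips < 1)
            throw new ArgumentOutOfRangeException(nameof(consecutiveTrips));

        int exponent = Math.Min(consecutiveTrips - 1, 30); // bound the exponent
        double seconds = baseDuration.TotalSeconds * Math.Pow(2, exponent);
        return TimeSpan.FromSeconds(Math.Min(seconds, max.TotalSeconds));
    }
}
```

With a base of 5 seconds and a cap of 5 minutes, the first trip keeps the breaker open for 5 seconds, the third for 20 seconds, and later trips settle at the 5-minute cap. The consecutive-trip count would be reset once the breaker closes again.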
The content above comes from the MSDN article "Circuit Breaker Pattern". According to that article, the following factors may need to be considered when implementing the pattern:
- Exception handling: An application that calls a service through a circuit breaker must be prepared to handle the exceptions raised when the service is unavailable. How they are handled usually depends on the business context. For example, the application could degrade temporarily, switch to an alternative service that performs the same task or provides the same data, or report the error to the user and ask them to try again later.
- Types of exception: A request can fail for many reasons, some more serious than others. For example, a request might fail because the remote service has crashed and needs several minutes to recover, or because the server is temporarily overloaded and the call timed out. The circuit breaker should be able to examine the type of error and adjust its policy accordingly. For example, it might take many time-out exceptions to decide that the breaker should open, but only a few errors indicating that the service is completely unavailable.
- Logging: The circuit breaker should log all failed requests, and possibly successful ones too, so that administrators can monitor the operations it protects.
- Testing service availability: In the open state, rather than using a timer to decide when to switch to the half-open state, the circuit breaker can periodically ping the remote service or resource to determine whether it has recovered. The ping can replay a previously failed request, or it can call a health-check operation that the remote service provides for this purpose.
- Manual reset: When the recovery time of a failing operation is hard to predict, it helps to provide a manual reset that lets an administrator force the circuit breaker into the closed state. Similarly, if a protected service is temporarily unavailable, the administrator can force the breaker into the open state.
- Concurrency: The same circuit breaker may be accessed by a large number of concurrent requests. The implementation should not block concurrent requests or add significant overhead to each call.
- Resource differences: Be careful when using a single circuit breaker for a resource that is distributed across multiple places. For example, data might be stored across several partitions (shards), where one shard is fully accessible while another has a temporary problem. If the error responses from these shards are lumped together, the application may keep trying to access the failing shards even though failure is highly likely, while access to the healthy shards may be blocked.
- Accelerated circuit breaking: Sometimes the error returned by the service contains enough information for the circuit breaker to trip immediately and stay open for a certain period. For example, if the response from a shared resource indicates that it is overloaded, the application should not retry immediately but wait a few minutes first. (The HTTP protocol defines "HTTP 503 Service Unavailable" to indicate that the requested service is currently unavailable; the response can carry additional information, such as how long to wait before retrying.)
- Replaying failed requests: In the open state, rather than simply failing quickly, the circuit breaker can record the details of each request so that the failed requests can be replayed once the remote service is available again.
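As an illustration of the accelerated circuit breaking item above, a breaker could read the standard Retry-After header of an HTTP 503 response to decide how long to stay open. The sketch below assumes the protected call goes through HttpClient; the RetryAfterHint class and its SuggestedBreak method are hypothetical names, and how the returned value feeds into a breaker is left open.

```csharp
using System;
using System.Net;
using System.Net.Http;

// Hypothetical helper: extracts a suggested open duration from a 503 response.
public static class RetryAfterHint
{
    // Returns the wait time the server asked for, or null if the response
    // is not a 503 or carries no Retry-After header.
    public static TimeSpan? SuggestedBreak(HttpResponseMessage response, DateTimeOffset now)
    {
        if (response.StatusCode != HttpStatusCode.ServiceUnavailable) return null;

        var retryAfter = response.Headers.RetryAfter;
        if (retryAfter == null) return null;

        // Retry-After is either a number of seconds...
        if (retryAfter.Delta.HasValue) return retryAfter.Delta.Value;

        // ...or an absolute date; convert it to a non-negative wait time.
        if (retryAfter.Date.HasValue)
        {
            TimeSpan wait = retryAfter.Date.Value - now;
            return wait > TimeSpan.Zero ? wait : TimeSpan.Zero;
        }
        return null;
    }
}
```

When the method returns a value, the breaker could open immediately for that long instead of waiting for the failure threshold to be reached; when it returns null, the normal counting logic would apply.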
Circuit breaker usage scenarios
You should use this pattern to:
- Prevent an application from directly calling remote services or shared resources when the call is likely to fail.
This pattern is not suitable:
- For access to local private resources in an application, such as in-memory data structures. Using a circuit breaker there only adds overhead to the system.
- As a substitute for handling exceptions in the application's business logic.
Many libraries implement the circuit breaker pattern; here we introduce a project called Polly. It is a neat package that offers several circuit breaker options, and it covers most exception-handling policies, such as retry and wait-and-retry. Polly is also very simple to use. The following shows how:
using System;
using Polly;

// Break the circuit after the specified number of exceptions
// and keep the circuit broken for the specified duration.
var policy = Policy
    .Handle<DivideByZeroException>()
    .CircuitBreaker(2, TimeSpan.FromMinutes(1));
var result = policy.Execute(() => DoSomething());
If DoSomething() throws a DivideByZeroException twice in a row, the circuit breaker opens for one minute. It is very simple to use; for more detail, see the article "Circuit Breaking with Polly" at http://blog.jaywayco.co.uk/circuit-breaking-with-polly/. Microsoft has already built retries into some core components. One example is EF 6, which makes it very convenient to implement a retry strategy; see the article "Entity Framework Connection Resiliency and Polly" at http://blog.jaywayco.co.uk/entity-framework-connection-resiliency/.
In an application, we usually call remote services or resources (often provided by third parties). Calls to these remote services or resources can fail, or hang without responding until a time-out occurs. In extreme cases, a large number of requests block on calls to a failing remote service, exhausting critical system resources and causing cascading failures that can bring down the entire system. The circuit breaker pattern wraps the remote services that may cause requests to fail in a state machine. When a remote service misbehaves, the breaker immediately returns an error response to incoming requests and notifies the system administrator, keeping the error contained locally and thereby improving the stability and reliability of the system.
Protecting software with the circuit breaker design pattern