Protecting Software with the Circuit Breaker (Fuse) Design Pattern

Source: Internet
Author: User

As software developers, we live at a fast pace. We adopt agile methods, develop features iteratively, hand them over for testing, and deploy them to production once they pass QA. Then something terrible happens in production: the load exceeds what the system was designed for. Overload often shows up when calling remote services. Without overload protection, requests pile up waiting on blocked resources until the system or server runs out of them. Many incidents begin as small, local faults, but for various reasons their scope keeps growing until the consequences are global. Murphy's Law applies especially well to software: "Anything that can go wrong will go wrong." How can we solve this problem? A design pattern called Circuit Breaker (sometimes translated as "fuse") addresses overload protection.

We often see the same phenomenon in daily life. If the household electrical load is too large, for example because many appliances are switched on at once, the circuit trips automatically and the power is cut. The older approach was a fuse. When the load is too high, or the circuit is faulty, the current keeps rising. Left unchecked, it could damage important or expensive devices in the circuit, burn out the wiring, or even start a fire. When the current rises abnormally and generates enough heat, the fuse melts and cuts off the current, protecting the circuit. The modern automatic trip device is a circuit breaker, which usually cuts the circuit with an electromagnet rather than by burning out, so it can be reused. In software, the pattern that imitates this device is called Circuit Breaker.

In a large distributed system, you usually need to call remote services or access remote resources. The caller cannot control them: a call may fail because of a slow network connection, or because the resource is busy or temporarily unavailable. Such errors usually clear up on their own after a short while. In some cases, however, a failure has an unexpected cause and the remote service or resource takes a long time to recover. Such a failure can be severe enough that part of the system stops responding, or even that the whole service becomes unavailable. In that situation, retrying continuously is unlikely to help. Instead, the application should return immediately and report an error.

When a service is very busy, a failure in one part of the system can lead to cascading failures. For example, an operation may call a cloud service with a timeout configured, throwing an exception if the response takes longer than that. Under this policy, however, many concurrent requests to the same operation can block until the timeout expires. These blocked requests hold valuable system resources such as memory, threads, and database connections. Eventually those resources run out, other unrelated parts of the system that need the same resources fail too, and the whole system is dragged down. In this situation it is better for the operation to return an error immediately rather than wait for the timeout, and to call the service again only when the call is likely to succeed.

The Circuit Breaker Pattern

Martin Fowler describes the Circuit Breaker pattern at http://martinfowler.com/bliki/CircuitBreaker.html. The pattern prevents an application from repeatedly attempting an operation that is likely to fail, so the application can carry on without waiting for the fault to be fixed and without wasting CPU time on long timeouts. It also lets the application detect whether the fault has been corrected. If it has, the application can try the operation again.

A circuit breaker acts as a proxy for operations that are prone to failure. The proxy tracks the number of recent failed calls and uses it to decide whether to let the operation proceed or to return an error immediately.

A circuit breaker can be implemented as a state machine with the following states.

  • Closed: requests from the application are passed through to the operation. The proxy maintains a count of recent failures and increments it each time a call fails. If the number of recent failures exceeds the allowed threshold within a given period, the proxy switches to the Open state and starts a timeout timer. When the timer expires, the proxy switches to the Half-Open state. The timeout gives the system a chance to fix the fault that caused the failures.
  • Open: requests from the application fail immediately and an error is returned.
  • Half-Open: a limited number of requests from the application are allowed through to the operation. If these requests succeed, the fault that caused the earlier failures is assumed to be fixed, and the circuit breaker switches to the Closed state (resetting the failure counter). If any request fails, the fault is assumed to persist, so the circuit breaker returns to the Open state and restarts the timeout timer to give the system more time to recover. The Half-Open state prevents a recovering service from being overwhelmed by a sudden flood of requests.

The transitions between states are driven by two counters. In the Closed state, the failure counter is time-based: it is reset automatically at regular intervals, which prevents the circuit breaker from tripping because of occasional failures. The breaker enters the Open state only when the number of failures reaches the threshold within the specified interval. In the Half-Open state, a success counter records the number of consecutive successful calls. When it reaches a specified value, the circuit breaker switches to the Closed state. If any call fails, the circuit breaker switches to the Open state immediately, and the success counter is reset to zero the next time the Half-Open state is entered.
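The states and counters described above could be sketched in C# roughly as follows. This is a minimal, single-threaded illustration and not the implementation of any particular library: the names `SimpleCircuitBreaker`, `CircuitOpenException`, and all constructor parameters are hypothetical, and the time-based failure counter is approximated with a fixed window that restarts once it has elapsed.

```csharp
using System;

public enum CircuitState { Closed, Open, HalfOpen }

// Hypothetical exception returned to callers while the circuit is open.
public class CircuitOpenException : Exception
{
    public CircuitOpenException() : base("Circuit is open; call rejected.") { }
}

// Illustrative sketch only: not thread-safe, names and thresholds are assumptions.
public class SimpleCircuitBreaker
{
    private readonly int _failureThreshold;
    private readonly int _successThreshold;
    private readonly TimeSpan _failureWindow;
    private readonly TimeSpan _openDuration;
    private readonly Func<DateTime> _clock;

    private int _failures;
    private int _successes;
    private DateTime _windowStart;
    private DateTime _openedAt;

    public CircuitState State { get; private set; } = CircuitState.Closed;

    public SimpleCircuitBreaker(int failureThreshold, int successThreshold,
        TimeSpan failureWindow, TimeSpan openDuration, Func<DateTime> clock)
    {
        _failureThreshold = failureThreshold;
        _successThreshold = successThreshold;
        _failureWindow = failureWindow;
        _openDuration = openDuration;
        _clock = clock;
        _windowStart = clock();
    }

    public T Execute<T>(Func<T> operation)
    {
        DateTime now = _clock();
        if (State == CircuitState.Open)
        {
            // Open: fail fast until the timeout timer has expired.
            if (now - _openedAt < _openDuration) throw new CircuitOpenException();
            State = CircuitState.HalfOpen;
            _successes = 0; // success counter restarts on entering Half-Open
        }

        try
        {
            T result = operation();
            OnSuccess(now);
            return result;
        }
        catch (Exception)
        {
            OnFailure(now);
            throw;
        }
    }

    private void OnSuccess(DateTime now)
    {
        // Half-Open: enough consecutive successes close the circuit again.
        if (State == CircuitState.HalfOpen && ++_successes >= _successThreshold)
        {
            State = CircuitState.Closed;
            _failures = 0;
            _windowStart = now;
        }
    }

    private void OnFailure(DateTime now)
    {
        // Half-Open: any failure reopens the circuit immediately.
        if (State == CircuitState.HalfOpen) { Trip(now); return; }

        // Closed: the failure counter is reset once the window has elapsed.
        if (now - _windowStart > _failureWindow)
        {
            _failures = 0;
            _windowStart = now;
        }
        if (++_failures >= _failureThreshold) Trip(now);
    }

    private void Trip(DateTime now)
    {
        State = CircuitState.Open;
        _openedAt = now;
    }
}
```

Passing the clock in as a delegate (for example `() => DateTime.UtcNow`) is a design choice made here so that the timer can be controlled in tests. A production implementation would also need synchronization, since many requests may share one breaker.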

The Circuit Breaker pattern makes a system more stable and resilient: it provides stability while the system recovers from a fault and reduces the impact of the fault on performance. It improves response time by quickly rejecting calls that are likely to fail, instead of waiting for the operation to time out or never return. If the circuit breaker raises an event each time it changes state, that information can be used to monitor the health of the protected service and to alert an administrator when the breaker switches to the Open state.
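One way to expose state changes for monitoring is a plain .NET event. The sketch below is only an illustration under assumed names (`BreakerState`, `BreakerStateChangedEventArgs`, and `BreakerStateNotifier` are hypothetical); it shows the notification part alone, without the counting logic.

```csharp
using System;

public enum BreakerState { Closed, Open, HalfOpen }

// Hypothetical event payload describing one state transition.
public sealed class BreakerStateChangedEventArgs : EventArgs
{
    public BreakerState Previous { get; }
    public BreakerState Current { get; }

    public BreakerStateChangedEventArgs(BreakerState previous, BreakerState current)
    {
        Previous = previous;
        Current = current;
    }
}

// Holds the current state and raises an event whenever it actually changes.
public sealed class BreakerStateNotifier
{
    public BreakerState State { get; private set; } = BreakerState.Closed;

    public event EventHandler<BreakerStateChangedEventArgs> StateChanged;

    public void TransitionTo(BreakerState next)
    {
        if (next == State) return; // no event for a no-op transition

        var args = new BreakerStateChangedEventArgs(State, next);
        State = next;
        StateChanged?.Invoke(this, args);
    }
}
```

A monitoring component could subscribe with `notifier.StateChanged += (sender, e) => { ... }` and alert an administrator whenever `e.Current` is `BreakerState.Open`.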

The pattern can be customized for the specific kind of failure expected from a remote service. For example, a circuit breaker can use an increasing timeout: when it first enters the Open state the timeout is a few seconds, and if the fault has not been resolved by then it is raised to a few minutes, and so on. In some cases, rather than throwing an exception in the Open state, it is more useful to return a meaningful default value.

The description above comes from the MSDN article "Circuit Breaker Pattern". The article also lists factors to consider when implementing the pattern:
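An increasing timeout can be computed in many ways. The following sketch assumes a simple doubling rule with an upper cap; the doubling factor, the cap, and the names `EscalatingBreak` and `DurationFor` are assumptions made for illustration and are not prescribed by the pattern.

```csharp
using System;

public static class EscalatingBreak
{
    // Returns how long the breaker should stay Open after the given number of
    // consecutive trips: baseDelay, then 2x, 4x, ... capped at maxDelay.
    public static TimeSpan DurationFor(int consecutiveTrips, TimeSpan baseDelay, TimeSpan maxDelay)
    {
        if (consecutiveTrips < 1)
            throw new ArgumentOutOfRangeException(nameof(consecutiveTrips));

        // Cap the exponent so the multiplication stays well within range.
        int exponent = Math.Min(consecutiveTrips - 1, 30);
        double seconds = baseDelay.TotalSeconds * Math.Pow(2, exponent);

        return seconds >= maxDelay.TotalSeconds ? maxDelay : TimeSpan.FromSeconds(seconds);
    }
}
```

With a base delay of 5 seconds and a cap of 5 minutes, the first trip waits 5 seconds, the second 10 seconds, and from the seventh trip onward the wait stays at the 5-minute cap.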

  • Exception handling: an application that calls an operation through a circuit breaker must handle the exceptions raised when the operation is unavailable. How they are handled depends on the business context. For example, the application might degrade its functionality temporarily, switch to an alternative service that performs the same task or returns the same data, or report the error to the user and ask them to try again later.
  • Exception types: a request can fail for many reasons, some more serious than others. For example, a failure may be caused by a remote service crash that takes several minutes to recover from, or by a temporary overload of the server. The circuit breaker should be able to examine the type of error and adjust its strategy accordingly. For example, it may take a large number of timeout exceptions to trip the breaker to the Open state, whereas only a few errors of a more serious kind may be enough.
  • Logging: the circuit breaker should log all failed requests (and possibly some successful ones) so that administrators can monitor the health of the operation it protects.
  • Testing availability: in the Open state, instead of using a timer to decide when to switch to Half-Open, the circuit breaker can periodically ping the remote service or resource to find out whether it has recovered. The ping can be an attempt to invoke the operation that previously failed, or a call to an operation that the remote service provides for checking its availability.
  • Manual override: when the recovery time of a failing operation is hard to predict, it helps to provide a manual reset so that an administrator can force the circuit breaker into the Closed state. Similarly, an administrator can force the breaker into the Open state if the protected service is temporarily unavailable.
  • Concurrency: the same circuit breaker may be accessed by a large number of concurrent requests. The implementation should not block concurrent requests or add excessive overhead to each call.
  • Resource differentiation: be careful when using a single circuit breaker for a resource that is spread across multiple independent providers. For example, data may be stored in multiple shards, where one shard is fully accessible while another has a temporary problem. If the error responses from these shards are lumped together, the application may keep trying to access the failing shards even though failure is highly likely, while access to the healthy shards may be blocked.
  • Accelerated circuit breaking: sometimes the error returned by the service is enough for the circuit breaker to trip immediately and stay open for a period of time. For example, if a shared resource responds that it is overloaded, an immediate retry is not recommended and the application should wait a few minutes before trying again. (HTTP defines the "503 Service Unavailable" response to indicate that the requested service is currently unavailable. The response can carry additional information, such as how long to wait before retrying.)
  • Replaying failed requests: in the Open state, rather than simply failing fast, the circuit breaker can record the details of each request so that the requests can be replayed when the remote service recovers.
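As an illustration of accelerated circuit breaking, the sketch below inspects an HTTP response and, for a 503, derives how long the breaker might stay open from the standard `Retry-After` header. The class and method names (`AcceleratedBreak`, `ImmediateBreakDuration`) and the fallback parameter are hypothetical; only the `System.Net.Http` types are real framework APIs.

```csharp
using System;
using System.Net;
using System.Net.Http;
using System.Net.Http.Headers;

public static class AcceleratedBreak
{
    // Returns how long to open the circuit right away, or null when the
    // response does not call for an immediate break (normal counting applies).
    public static TimeSpan? ImmediateBreakDuration(
        HttpResponseMessage response, TimeSpan defaultBreak, DateTimeOffset now)
    {
        if (response.StatusCode != HttpStatusCode.ServiceUnavailable) return null;

        RetryConditionHeaderValue retryAfter = response.Headers.RetryAfter;
        if (retryAfter == null) return defaultBreak;

        // Retry-After given as a number of seconds.
        if (retryAfter.Delta.HasValue) return retryAfter.Delta.Value;

        // Retry-After given as an absolute date.
        if (retryAfter.Date.HasValue)
        {
            TimeSpan wait = retryAfter.Date.Value - now;
            return wait > TimeSpan.Zero ? wait : TimeSpan.Zero;
        }

        return defaultBreak;
    }
}
```

A circuit breaker could call this helper after each response and, when it returns a value, move straight to the Open state for that duration instead of waiting for the failure counter to reach its threshold.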

When to Use the Circuit Breaker Pattern

Use this pattern to:

  • Prevent an application from calling a remote service or accessing a shared resource when the call is highly likely to fail.

This pattern is not suitable:

  • For access to local private resources, such as in-memory data structures, where a circuit breaker would only add overhead.
  • As a substitute for handling exceptions in the application's business logic.

Many libraries implement the Circuit Breaker pattern. Here we introduce a .NET project called Polly. It is a very clean library that provides a circuit breaker and covers most common exception handling strategies, such as retry, retry forever, and wait and retry. Polly is also very simple to use, as the following example shows:

    using System;
    using Polly;

    // Break the circuit after the specified number of exceptions
    // and keep the circuit broken for the specified duration.
    var policy = Policy
        .Handle<DivideByZeroException>()
        .CircuitBreaker(2, TimeSpan.FromMinutes(1));

    // Run the protected operation through the circuit breaker.
    var result = policy.Execute(() => DoSomething());

If DoSomething() throws DivideByZeroException twice in a row, the circuit is broken for one minute. For more details, see the article "Circuit Breaking With Polly" at http://blog.jaywayco.co.uk/circuit-breaking-with-polly/. Microsoft has also built retry support into some of its core components. EF 6, for example, makes it easy to implement a retry policy; see the article "Entity Framework Connection Resiliency and Polly" at http://blog.jaywayco.co.uk/entity-framework-connection-resiliency.
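While the circuit is broken, Polly rejects calls by throwing `BrokenCircuitException` (from the `Polly.CircuitBreaker` namespace), so callers typically handle it separately from the original failure. The sketch below assumes the Polly NuGet package with its classic `Policy` syntax; the class `PollyBreakerDemo` and the method `RunAttempts` are hypothetical names used only for this example.

```csharp
using System;
using System.Collections.Generic;
using Polly;
using Polly.CircuitBreaker;

public static class PollyBreakerDemo
{
    // Calls a deliberately failing operation several times through one
    // circuit breaker and records what the caller observed on each attempt.
    public static List<string> RunAttempts(int attempts)
    {
        var policy = Policy
            .Handle<DivideByZeroException>()
            .CircuitBreaker(2, TimeSpan.FromMinutes(1));

        var outcomes = new List<string>();
        int divisor = 0; // forces a DivideByZeroException on every call

        for (int i = 0; i < attempts; i++)
        {
            try
            {
                policy.Execute(() => 10 / divisor);
                outcomes.Add("ok");
            }
            catch (DivideByZeroException)
            {
                // The operation itself ran and failed.
                outcomes.Add("failed");
            }
            catch (BrokenCircuitException)
            {
                // The circuit is open: the operation was not invoked at all.
                outcomes.Add("rejected");
            }
        }

        return outcomes;
    }
}
```

With three attempts, the first two calls should surface the original exception and the third should be rejected without running the operation. A caller could use the `BrokenCircuitException` branch to return a default value or switch to an alternative service.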

In an application, we often call remote services or resources, which usually belong to a third party. Such calls can fail or hang until a timeout occurs. In extreme cases, a large number of requests block on calls to a failing remote service, critical system resources are exhausted, and the resulting cascading failure drags the whole system down. The Circuit Breaker pattern uses an internal state machine to wrap calls to remote services that may fail. When the remote service is failing, incoming requests get an immediate error response and the system administrator can be notified, which contains the fault and improves the stability and reliability of the system.
