Cloud Computing Design Patterns (18) - Retry Pattern
Enable an application to handle anticipated, temporary failures when it tries to connect to a service or network resource, by transparently retrying an operation that has previously failed in the expectation that the cause of the failure is transient. This pattern can improve the stability of an application.
Background and problem
An application that communicates with elements running in the cloud must be sensitive to the transient faults that can occur in such environments. These faults include the momentary loss of network connectivity to components and services, the temporary unavailability of a service, and timeouts that occur when a service is busy.
These faults are typically self-correcting, and if the action that triggered the fault is repeated after a suitable delay, it is likely to succeed. For example, a database service that is processing a large number of concurrent requests may implement a throttling policy that temporarily rejects any further requests until its workload has eased. An application trying to access the database may fail to connect, but it may succeed if it tries again after a suitable delay.
Solution
In the cloud, transient faults are not uncommon, and an application should be designed to handle them elegantly and transparently, minimizing the impact such faults can have on the business tasks the application is performing.
If an application detects a failure when it tries to send a request to a remote service, it can handle the failure using the following strategies:
• Cancel. If the fault indicates that the failure is not transient or is unlikely to succeed if repeated (for example, an authentication failure caused by invalid credentials will not succeed no matter how many times it is attempted), the application should abort the operation and report an appropriate exception.
• Retry immediately. If the specific fault reported is unusual or rare, it may have been caused by exceptional circumstances, such as a network packet becoming corrupted while it was being transmitted. In this case, the application can immediately retry the failing request, because the same failure is unlikely to be repeated and the request will probably succeed.
• Retry after a delay. If the fault is caused by one of the more commonplace connectivity or "busy" failures, the network or service may need a short period while the connectivity issue is corrected or the backlog of work is cleared. The application should wait a suitable time before retrying the request.
For the more common transient faults, the period between retries should be chosen so as to spread requests from multiple instances of the application as evenly as possible. This reduces the chance that a busy service remains continuously overloaded. If many instances of an application continually bombard a service with retry requests, it will take the service longer to recover.
If the request still fails, the application can wait for a further period and try again. If necessary, this process can be repeated with increasing delays between attempts, until some maximum number of requests has failed. The delay can be increased incrementally, or a timing strategy such as exponential backoff can be used, depending on the nature of the fault and the likelihood that it will be corrected during this period. An illustrative delay calculation is sketched below.
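For example, an exponentially increasing delay with random jitter could be computed along the following lines. This is a minimal sketch, not part of the original article; the base of 2, the 30-second cap, and the one-second jitter range are illustrative choices. The jitter also helps spread retries from multiple application instances, as discussed above.

private static readonly Random Jitterer = new Random();

// Compute the wait time before retry attempt number 'attempt' (1-based):
// an exponentially growing delay, capped, plus random jitter so that many
// application instances do not all retry at the same moment.
private static TimeSpan GetRetryDelay(int attempt)
{
    double baseSeconds = Math.Min(Math.Pow(2, attempt), 30);  // 2, 4, 8, ... capped at 30s
    double jitterSeconds = Jitterer.NextDouble();             // up to 1s of jitter
    return TimeSpan.FromSeconds(baseSeconds + jitterSeconds);
}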
Figure 1 illustrates this pattern. If the request still fails after a predefined number of attempts, the application should treat the fault as an exception and handle it accordingly.
Figure 1 - Invoking an operation in a hosted service using the Retry pattern
The application should wrap all code that attempts to access a remote service in code that implements a retry policy. Different policies can apply to requests sent to different services. Some vendors provide class libraries that encapsulate this pattern. The policies implemented by these libraries are typically parameterized, so the application developer can specify values such as the number of retries and the time between retry attempts; an example with one such library appears below.
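As an illustration, the open-source Polly library for .NET provides parameterized retry policies of this kind. Polly is not named in the original article; this is a minimal sketch of the general shape of such a library, assuming TransientOperationAsync is the remote call being protected.

using System;
using System.Net;
using Polly;

// Retry up to 3 times on WebException, waiting 2, 4, then 8 seconds
// between attempts.
var retryPolicy = Policy
    .Handle<WebException>()
    .WaitAndRetryAsync(
        retryCount: 3,
        sleepDurationProvider: attempt => TimeSpan.FromSeconds(Math.Pow(2, attempt)));

// Execute the remote call under the policy.
await retryPolicy.ExecuteAsync(() => TransientOperationAsync());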
The application should log the details of faults and of operations that fail even after retrying. This information can be useful to operators. If a service is frequently reported as unavailable or busy, it is often because the service has exhausted its resources. The frequency of these faults can be reduced by scaling out the service. For example, if a database service is continually overloaded, it may be beneficial to partition the database and spread the load across multiple servers.
Note:
Microsoft Azure provides extensive support for the Retry pattern. The patterns & practices Transient Fault Handling Application Block enables applications to handle transient faults in many Azure services using a range of retry policies. Microsoft Entity Framework version 6 provides facilities for retrying database operations. In addition, many of the Azure Service Bus and Azure Storage APIs perform retry logic transparently.
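For instance, Entity Framework 6 can be configured to retry database operations through an execution strategy. The sketch below uses the SqlAzureExecutionStrategy that ships with EF6; the configuration class name is a hypothetical choice for illustration.

using System.Data.Entity;
using System.Data.Entity.SqlServer;

// EF6 discovers a code-based configuration class automatically when it
// lives in the same assembly as the DbContext.
public class AppDbConfiguration : DbConfiguration   // hypothetical name
{
    public AppDbConfiguration()
    {
        // Retry SQL Server operations that fail with known transient errors.
        SetExecutionStrategy("System.Data.SqlClient",
            () => new SqlAzureExecutionStrategy());
    }
}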
Issues and considerations
Consider the following points when deciding how to implement this pattern:
• The retry policy should be tuned to match the business requirements of the application and the nature of the failure. For some noncritical operations it is better to fail fast rather than retry several times and impact the throughput of the application. For example, in an interactive web application that accesses a remote service, it may be better to fail after a small number of retries with only a short delay between attempts, displaying a suitable message to the user (for example, "please try again later") so that the application does not become unresponsive. For a batch application, it may be more appropriate to increase the number of retry attempts with an exponentially increasing delay between attempts.
• An aggressive retry policy with minimal delay between attempts and a large number of retries could further degrade a busy service that is running close to or at capacity. Such a policy could also affect the responsiveness of the application if it is continually trying to perform a failing operation rather than doing useful work.
• If a request still fails after a significant number of retries, it may be better for the application to prevent further requests going to the same resource for a period and simply report the failure immediately. When the period expires, the application can tentatively allow one or more requests through to see whether they succeed. For more details of this strategy, see the Circuit Breaker pattern.
• The operations invoked through a retry policy may need to be idempotent. For example, a request sent to a service may be received and processed successfully but, because of a transient fault, the service may be unable to send a response indicating that the processing has completed. The retry logic in the application might then repeat the request on the assumption that the first request was never received.
• A request to a service may fail for a variety of reasons, raising different exceptions depending on the nature of the failure. Some exceptions indicate a failure that can be resolved quickly, while others indicate that the failure is longer lasting. It may be beneficial for the retry policy to adjust the time between retry attempts based on the type of the exception (see the sketch after this list).
• Consider how retrying an operation that is part of a transaction will affect overall transaction consistency. It may be useful to fine-tune the retry policy for transactional operations to maximize the chance of success and reduce the need to undo all the transaction steps.
• Ensure that all retry code is fully tested against a variety of failure conditions. Check that it does not severely impact the performance or reliability of the application, cause excessive load on services and resources, or generate race conditions or bottlenecks.
• Implement retry logic only where the full context of a failing operation is understood. For example, if a task that contains a retry policy invokes another task that also contains a retry policy, this extra layer of retries can add long delays to the processing. It may be better to configure the lower-level task to fail fast and report the reason for the failure back to the task that invoked it. The higher-level task can then decide how to handle the failure based on its own policy.
• It is important to log all connectivity failures that cause a retry so that underlying problems with the application, services, or resources can be identified.
• Investigate the faults that are most likely to occur for a service or resource, to discover whether they are likely to be long lasting or terminal. If they are, it may be better to handle the fault as an exception. The application can report or log the exception and then try to continue, either by invoking an alternative service (if one is available) or by offering degraded functionality. For more information on how to detect and handle long-lasting faults, see the Circuit Breaker pattern.
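To illustrate the point above about adjusting the delay by exception type, a retry loop could choose its wait time with a helper such as the following. This is a hypothetical sketch: ThrottlingException and the specific delays are illustrative, not part of any particular library.

// Hypothetical helper: choose a retry delay appropriate to the fault.
private static TimeSpan GetDelayFor(Exception ex)
{
    // A throttling response usually means the service needs time to
    // clear its backlog, so back off for longer.
    if (ex is ThrottlingException)   // illustrative exception type
        return TimeSpan.FromSeconds(30);

    var webException = ex as WebException;
    if (webException != null && webException.Status == WebExceptionStatus.Timeout)
        return TimeSpan.FromSeconds(5);   // timeouts often clear quickly

    return TimeSpan.FromSeconds(1);       // default short delay
}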
When to use this pattern
Use this pattern:
• When an application could experience transient faults as it interacts with a remote service or accesses a remote resource. These faults are expected to be short lived, and a request that has previously failed could succeed on a subsequent attempt.
This pattern may not be suitable:
• When a fault is likely to be long lasting, because this can affect the responsiveness of an application. The application may simply be wasting time and resources attempting to repeat a request that is most likely to fail.
• For handling failures that are not due to transient faults, such as internal exceptions caused by errors in the business logic of an application.
• As an alternative to addressing scalability issues in a system. If an application experiences frequent "busy" faults, this often indicates that the service or resource being accessed should be scaled up.
Example
This example illustrates an implementation of the Retry pattern. The OperationWithBasicRetryAsync method, shown below, invokes an external service asynchronously through the TransientOperationAsync method (the details of this method are specific to the service and are omitted from the sample code).
private int retryCount = 3;
private readonly TimeSpan delay = TimeSpan.FromSeconds(5);
...

public async Task OperationWithBasicRetryAsync()
{
    int currentRetry = 0;

    for (;;)
    {
        try
        {
            // Calling external service.
            await TransientOperationAsync();

            // Return or break.
            break;
        }
        catch (Exception ex)
        {
            Trace.TraceError("Operation Exception");

            currentRetry++;

            // Check if the exception thrown was a transient exception
            // based on the logic in the error detection strategy.
            // Determine whether to retry the operation, as well as how
            // long to wait, based on the retry strategy.
            if (currentRetry > this.retryCount || !IsTransient(ex))
            {
                // If this is not a transient error or we should not retry,
                // rethrow the exception.
                throw;
            }
        }

        // Wait to retry the operation.
        // Consider calculating an exponential delay here and
        // using a strategy best suited for the operation and fault.
        await Task.Delay(delay);
    }
}

// Async method that wraps a call to a remote service (details not shown).
private async Task TransientOperationAsync()
{
    ...
}
The statement that invokes TransientOperationAsync is encapsulated in a try/catch block wrapped in a for loop. If the call to TransientOperationAsync succeeds, no exception is thrown and the loop exits. If TransientOperationAsync fails, the catch block examines the reason for the failure; if it is considered to be a transient error, the code waits for a short delay and then retries the operation.
The for loop also tracks the number of times the operation has been attempted; if the code fails three times, the exception is assumed to be more long lasting. If the exception is not transient, or it is long lasting, the catch handler rethrows it. This exception exits the for loop and should be caught by the code that invokes the OperationWithBasicRetryAsync method, as sketched below.
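For illustration, the calling code could handle that final exception along these lines. This is a minimal sketch; the handling shown is only a placeholder.

try
{
    await OperationWithBasicRetryAsync();
}
catch (Exception ex)
{
    // The operation failed even after all retries, or the fault was not
    // transient. Handle it as a genuine failure here.
    Trace.TraceError("Operation failed permanently: {0}", ex.Message);
}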
The IsTransient method, shown below, checks whether the exception belongs to a specific set of exceptions relevant to the environment the code runs in. The definition of a transient exception varies according to the resources being accessed and the environment in which the operation is performed.
private bool IsTransient(Exception ex)
{
    // Determine if the exception is transient.
    // In some cases this may be as simple as checking the exception type; in
    // other cases it may be necessary to inspect other properties of the exception.
    if (ex is OperationTransientException)
        return true;

    var webException = ex as WebException;
    if (webException != null)
    {
        // If the web exception contains one of the following status values
        // it may be transient.
        return new[] { WebExceptionStatus.ConnectionClosed,
                       WebExceptionStatus.Timeout,
                       WebExceptionStatus.RequestCanceled }
            .Contains(webException.Status);
    }

    // Additional exception checking logic goes here.
    return false;
}
MSDN: http://msdn.microsoft.com/en-us/library/dn589788.aspx