Spring Cloud (IV): Service Fault Tolerance with Hystrix (Finchley Version)

Last Update:2018-05-11 Source: Internet

Author: User

Posted on 2018-04-15 | Updated on 2018-05-07

In a distributed system, it is common for the failure of one basic service to make the entire system unusable, a phenomenon known as the service avalanche effect. A common way to cope with service avalanches is manual service degradation. The advent of Hystrix gives us another option.

Hystrix [hɪst'rɪks] means "porcupine". A porcupine is covered in quills that protect it from predators, which represents a defense mechanism. This matches the function of Hystrix itself, so the Netflix team named the framework Hystrix and used a matching cartoon image as its logo.

Service Avalanche Effect

Definition

The service avalanche effect is a process in which the unavailability of a service provider makes its service callers unavailable, and the unavailability is gradually amplified. Consider the following scenario:

A is the service provider, B is a service caller of A, and C and D are service callers of B. A service avalanche forms when A becomes unavailable, causes B to become unavailable, and the unavailability then spreads to C and D.

Causes of formation

I simplify the participants in a service avalanche to service providers and service callers, and divide the formation of a service avalanche into the following three phases to analyze its causes:

Service provider becomes unavailable
Retries increase traffic
Service caller becomes unavailable

Each phase of a service avalanche can have different causes. For example, the causes of service provider unavailability include:

Hardware failure
Program bugs
Cache breakdown
A large number of user requests

Hardware failure may be a server host going down because of damaged hardware, or a network hardware failure that makes the service provider unreachable.
Cache breakdown typically occurs when a cache application restarts and all cached data is cleared, or when a large number of cache entries expire within a short period. The resulting cache misses send requests straight to the backend, overloading the service provider and making the service unavailable.
Before a flash sale or a big promotion starts, insufficient preparation means the large number of requests initiated by users can also make the service provider unavailable.

The causes of retries increasing traffic are:

User retries
Code logic retries

After the service provider becomes unavailable, users who cannot tolerate a long wait on the interface keep refreshing the page or even resubmitting forms.
On the service caller side, there is often retry logic that fires after service exceptions.
These retries further increase the request traffic.

Finally, the main cause of service caller unavailability is:

Resource exhaustion due to synchronous waiting

When a service caller uses synchronous calls, a large number of waiting threads occupy system resources. Once thread resources are exhausted, the services provided by the service caller also become unavailable, and the service avalanche effect occurs.

Coping strategies

Different coping strategies can be used for the different causes of service avalanches:

Flow control
Improved caching
Automatic service scaling
Service degradation by the service caller

Specific measures for flow control include:

Rate limiting at the gateway
Throttling through user interaction
Disabling retries

Because of the high performance of Nginx, first-tier internet companies currently make heavy use of Nginx+Lua gateways for flow control, and OpenResty is becoming more and more popular as a result.

The specific measures for throttling through user interaction are: 1. Use a loading animation to increase the time users are willing to wait. 2. Add a forced waiting period to the submit button.

Measures to improve caching include:

Cache preloading
Changing synchronous refresh to asynchronous refresh

Measures for automatic service scaling include:

AWS Auto Scaling

Measures for service degradation by the service caller include:

Resource isolation
Classifying dependent services
Failing fast on calls to unavailable services

Resource isolation mainly means isolating the thread pools used to invoke services.

Depending on the business, we divide dependent services into strong dependencies and weak dependencies. The unavailability of a strongly dependent service causes the current business operation to abort, whereas the unavailability of a weakly dependent service does not.

Failing fast on calls to unavailable services is generally implemented through a timeout mechanism, a circuit breaker, and a degradation (fallback) method that runs after the circuit breaker trips.

Using Hystrix to Prevent Service Avalanches

Service Degradation (Fallback)

For query operations, we can implement a fallback method. When a request to the backend service throws an exception, the value returned by the fallback method is used instead. The return value of a fallback method is typically a preset default value or a value taken from a cache.
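The fallback idea can be sketched in plain Java without Hystrix itself. This is only a minimal illustration under assumptions: the class FallbackDemo, the method callWithFallback, and the 200ms timeout are hypothetical names and values chosen for the example and are not part of the Hystrix API. It shows a query that returns a default value when the backend call throws or takes too long.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;
import java.util.function.Supplier;

// Hypothetical helper for illustration only; this is not Hystrix API.
class FallbackDemo {

    // Runs the primary query asynchronously and returns the fallback value
    // if the query throws or does not finish within timeoutMillis.
    static <T> T callWithFallback(Supplier<T> primary, Supplier<T> fallback, long timeoutMillis) {
        CompletableFuture<T> future = CompletableFuture.supplyAsync(primary);
        try {
            return future.get(timeoutMillis, TimeUnit.MILLISECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt(); // preserve the interrupt for the caller
            return fallback.get();
        } catch (ExecutionException | TimeoutException e) {
            // Stop waiting for the result; the backend call itself may still be running.
            future.cancel(true);
            return fallback.get();
        }
    }

    public static void main(String[] args) {
        Supplier<String> healthy = () -> "Hello, windmt!";
        Supplier<String> broken = () -> { throw new IllegalStateException("backend down"); };
        Supplier<String> defaultValue = () -> "Hello World!";

        String fromBackend = callWithFallback(healthy, defaultValue, 200);
        String fromFallback = callWithFallback(broken, defaultValue, 200);
        System.out.println(fromBackend);
        System.out.println(fromFallback);
    }
}
```

In a real application the fallback supplier would typically read a preset default or a cached value, as described above.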

Resource Isolation

To keep leaks and fire from spreading, a cargo ship is divided into multiple compartments.

This way of reducing risk through resource isolation is known as the bulkhead pattern.
Hystrix applies the same pattern to the service caller.

In Hystrix, resource isolation is achieved primarily through thread pools. Typically, we divide calls into multiple thread pools according to the remote service being invoked. For example, commands that call the product service go into thread pool A, and commands that call the account service go into thread pool B. The main advantage is that the runtime environments are isolated. Even if the code that invokes a service has a bug, or its own thread pool is exhausted for some other reason, the other services of the system are not affected.
Isolating the thread pools of dependent services brings the following benefits:

The application itself is fully protected from uncontrolled dependent services. Even if the thread pool allocated to a dependent service fills up, the rest of the application is not affected.
The risk of integrating a new service is effectively reduced. If the new service is unstable or has problems after integration, it does not affect the application's other requests at all.
When a dependent service recovers from a failure, its thread pool is cleaned up and can resume healthy service immediately, which is much faster than recovery at the container level.
When a dependent service is misconfigured, the thread pool quickly reflects the problem (through increases in metrics such as failures, latency, timeouts, and rejections). We can then handle it through real-time dynamic property refreshes without affecting the application's functionality (described later using Spring Cloud Config together with Spring Cloud Bus).
When a dependent service's performance changes significantly because of an adjustment to its implementation or for similar reasons, the thread pool's monitoring metrics reflect the change. We can then adjust the thresholds for that dependent service through real-time dynamic refreshes in our own application to adapt to the change.
Beyond the isolation benefits above, each dedicated thread pool provides built-in concurrency, which can be used to build asynchronous access on top of synchronous dependent services.

In summary, thread pool isolation for dependent services makes our application more robust, because a problem in one dependent service does not cause exceptions in unrelated services. It also makes our application more flexible, because performance settings can be adjusted through dynamic configuration refreshes without stopping the service.
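The bulkhead idea can be illustrated with the standard java.util.concurrent classes. The sketch below is not how Hystrix is implemented; BulkheadDemo, newBulkhead, call, and the pool sizes are hypothetical names and values chosen for the example. It gives each dependency its own small bounded pool, so a hung product service fills only its own pool while calls to the account service still succeed.

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.Callable;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Future;
import java.util.concurrent.RejectedExecutionException;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

// Hypothetical sketch of thread pool isolation; this is not Hystrix API.
class BulkheadDemo {

    // One bounded pool per dependency. The default rejection policy throws
    // RejectedExecutionException once all threads and queue slots are in use.
    static ThreadPoolExecutor newBulkhead(int threads, int queueSize) {
        return new ThreadPoolExecutor(threads, threads, 0L, TimeUnit.MILLISECONDS,
                new ArrayBlockingQueue<>(queueSize));
    }

    // Submits the task to the dependency's own pool and returns the fallback
    // when the pool is saturated, the task fails, or the timeout elapses.
    static <T> T call(ExecutorService pool, Callable<T> task, T fallback, long timeoutMillis) {
        Future<T> future;
        try {
            future = pool.submit(task);
        } catch (RejectedExecutionException e) {
            return fallback; // pool is full: fail fast instead of waiting
        }
        try {
            return future.get(timeoutMillis, TimeUnit.MILLISECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            future.cancel(true);
            return fallback;
        } catch (ExecutionException | TimeoutException e) {
            future.cancel(true); // interrupt the worker so its slot can be reused
            return fallback;
        }
    }

    public static void main(String[] args) {
        ThreadPoolExecutor productPool = newBulkhead(1, 1);
        ThreadPoolExecutor accountPool = newBulkhead(1, 1);

        // Simulate a hung product service: one running call plus one queued call fill its pool.
        CountDownLatch stuck = new CountDownLatch(1);
        productPool.submit(() -> { stuck.await(); return "slow"; });
        productPool.submit(() -> { stuck.await(); return "slow"; });

        String product = call(productPool, () -> "product", "product-fallback", 100);
        String account = call(accountPool, () -> "account", "account-fallback", 1000);
        System.out.println(product);
        System.out.println(account);

        stuck.countDown();
        productPool.shutdown();
        accountPool.shutdown();
    }
}
```

The design choice to reject immediately when the pool is full, rather than block the caller, is what keeps the exhausted dependency from consuming the caller's own threads.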

Although thread pool isolation has so many benefits, many users may worry that allocating a thread pool for each dependent service adds too much load and overhead to the system. There is no need to worry too much: most engineers would raise the same concern, and Netflix considered it when designing Hystrix, concluding that the overhead of thread pools is small compared with the benefits of isolation. Netflix also measured the overhead of thread pools to address concerns about the performance impact of Hystrix.

Netflix Hystrix provides performance monitoring data for one Hystrix command that accesses a single service instance at 60 requests per second (QPS), where the service instance peaks at about 350 threaded executions per second.

From the statistics, the latency difference between not using and using thread pool isolation is shown in the following table:

Comparison	Without thread pool isolation	With thread pool isolation	Latency gap
Median	2ms	2ms	0ms
90th percentile	5ms	8ms	3ms
99th percentile	28ms	37ms	9ms

At the 99th percentile, thread pool isolation adds 9ms of latency, which is negligible for most requirements, especially given the large gains in system stability and flexibility. For most requests we can ignore the extra overhead of the thread pool, but for requests whose own latency is very low (perhaps only 1ms), an extra 9ms is very expensive. For this case Hystrix provides another solution: semaphores.

Besides thread pools, Hystrix can also use semaphores to control the concurrency of a single dependent service. A semaphore costs much less than a thread pool, but it cannot enforce timeouts or provide asynchronous access. Semaphores should therefore be used only when the dependent service is sufficiently reliable. HystrixCommand and HystrixObservableCommand support semaphores in two places:

Command execution: If the isolation strategy parameter execution.isolation.strategy is set to SEMAPHORE, Hystrix uses a semaphore instead of a thread pool to control the concurrency of the dependent service.
Fallback logic: When Hystrix attempts to run the fallback logic, it uses a semaphore on the calling thread.

The default semaphore value is 10, and we can also control the concurrency limit by dynamically refreshing the configuration. Estimating the semaphore size is similar to estimating thread pool concurrency. Requests that only access in-memory data typically take less than 1ms and can reach 5000rps, so for them the semaphore can be set to 1 or 2. For other requests, we can set the semaphore according to this guideline and the actual request latency.
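To make the difference from thread pool isolation concrete, here is a minimal sketch of semaphore-based concurrency control using java.util.concurrent.Semaphore. It is an assumption-laden illustration rather than Hystrix's own code: SemaphoreIsolationDemo and execute are hypothetical names. The call runs on the calling thread, and a request that cannot get a permit is rejected at once and served by the fallback.

```java
import java.util.concurrent.Semaphore;
import java.util.function.Supplier;

// Hypothetical sketch of semaphore isolation; this is not Hystrix API.
class SemaphoreIsolationDemo {

    private final Semaphore permits;

    SemaphoreIsolationDemo(int maxConcurrentRequests) {
        this.permits = new Semaphore(maxConcurrentRequests);
    }

    // Executes the call on the calling thread if a permit is available,
    // otherwise returns the fallback value without waiting.
    <T> T execute(Supplier<T> call, Supplier<T> fallback) {
        if (!permits.tryAcquire()) {
            return fallback.get(); // concurrency limit reached: reject immediately
        }
        try {
            // No extra thread is involved, so this sketch cannot enforce a timeout.
            return call.get();
        } catch (RuntimeException e) {
            return fallback.get();
        } finally {
            permits.release();
        }
    }

    public static void main(String[] args) {
        SemaphoreIsolationDemo isolation = new SemaphoreIsolationDemo(1);

        // The nested call runs while the outer call still holds the only permit,
        // so the nested call is rejected and returns its fallback.
        Supplier<String> nested = () -> isolation.execute(() -> "inner", () -> "rejected");
        String whileBusy = isolation.execute(nested, () -> "outer-fallback");

        // The permit has been released again, so this call goes through.
        String whenFree = isolation.execute(() -> "ok", () -> "fallback");

        System.out.println(whileBusy);
        System.out.println(whenFree);
    }
}
```

Because nothing is handed to another thread, the overhead is little more than acquiring and releasing a permit, which matches the trade-off described above.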

Circuit Breaker Pattern

The circuit breaker pattern comes from Martin Fowler's article "CircuitBreaker". A circuit breaker is itself a switching device used to protect electrical circuits from overload. When a short circuit occurs, the circuit breaker cuts off the faulty circuit in time to prevent serious consequences such as overload, overheating, or even fire.

In a distributed architecture, the circuit breaker pattern plays a similar role. When a service unit fails (like a short circuit in an appliance), the circuit breaker's fault monitoring (like a fuse blowing) directly cuts off the original main logic call. The Hystrix circuit breaker, however, has more complex logic than simply cutting off the main logic. Let's look at its processing logic in more depth.

The circuit breaker switches between states as follows:

When the number of failed backend requests made by a Hystrix command exceeds a certain threshold, the circuit breaker switches to the open state (OPEN). From then on, all requests fail immediately and are not sent to the backend service.

This threshold involves three important parameters: the snapshot time window, the minimum total number of requests, and the error percentage threshold. These parameters work as follows:
Snapshot time window: To decide whether to open, the circuit breaker needs to collect request and error statistics, and the time range of those statistics is the snapshot time window. The default is the most recent 10 seconds.
Minimum total number of requests: Within the snapshot time window, the total number of requests must reach this minimum before the circuit breaker is eligible to open. The default is 20, which means that if the Hystrix command is called fewer than 20 times within 10 seconds, the circuit breaker does not open even if all of those requests time out or fail for other reasons.
Error percentage threshold: When the total number of requests in the snapshot time window exceeds the minimum, for example 30 calls, and 16 of those 30 calls end in timeout exceptions, the error percentage exceeds 50%. With the default threshold of 50%, the circuit breaker opens.

The circuit breaker stays in the open state for a period of time (5 seconds by default) and then automatically switches to the half-open state (HALF-OPEN). In this state it examines the result of the next request: if the request succeeds, the circuit breaker switches back to the closed state (CLOSED); otherwise it switches back to the open state (OPEN).
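The state transitions described above can be sketched as a small state machine. This is a simplified illustration under assumptions, not Hystrix's implementation: SimpleCircuitBreaker and its method names are hypothetical, it uses a single fixed statistics window instead of rolling buckets, it checks the thresholds after every recorded result, and the clock is injected so the behavior can be checked without real waiting.

```java
import java.util.function.LongSupplier;

// Simplified, hypothetical circuit breaker for illustration; this is not Hystrix API.
class SimpleCircuitBreaker {

    enum State { CLOSED, OPEN, HALF_OPEN }

    private final int requestVolumeThreshold;   // minimum requests per window, e.g. 20
    private final int errorThresholdPercentage; // error percentage threshold, e.g. 50
    private final long windowMillis;            // snapshot time window, e.g. 10_000
    private final long sleepWindowMillis;       // time spent OPEN before a trial, e.g. 5_000
    private final LongSupplier clock;           // current time in milliseconds

    private State state = State.CLOSED;
    private long windowStart;
    private long openedAt;
    private int total;
    private int failures;

    SimpleCircuitBreaker(int requestVolumeThreshold, int errorThresholdPercentage,
                         long windowMillis, long sleepWindowMillis, LongSupplier clock) {
        this.requestVolumeThreshold = requestVolumeThreshold;
        this.errorThresholdPercentage = errorThresholdPercentage;
        this.windowMillis = windowMillis;
        this.sleepWindowMillis = sleepWindowMillis;
        this.clock = clock;
        this.windowStart = clock.getAsLong();
    }

    // Returns true if the caller may send the request to the backend.
    synchronized boolean allowRequest() {
        if (state == State.OPEN && clock.getAsLong() - openedAt >= sleepWindowMillis) {
            state = State.HALF_OPEN; // let exactly one trial request through
            return true;
        }
        return state == State.CLOSED;
    }

    synchronized void recordSuccess() {
        if (state == State.HALF_OPEN) {
            transitionTo(State.CLOSED); // trial request succeeded
        } else {
            count(false);
        }
    }

    synchronized void recordFailure() {
        if (state == State.HALF_OPEN) {
            transitionTo(State.OPEN); // trial request failed
        } else {
            count(true);
        }
    }

    synchronized State state() {
        return state;
    }

    private void count(boolean failed) {
        if (clock.getAsLong() - windowStart >= windowMillis) {
            startNewWindow();
        }
        total++;
        if (failed) {
            failures++;
        }
        boolean enoughRequests = total >= requestVolumeThreshold;
        boolean tooManyErrors = failures * 100 >= errorThresholdPercentage * total;
        if (state == State.CLOSED && enoughRequests && tooManyErrors) {
            transitionTo(State.OPEN);
        }
    }

    private void transitionTo(State next) {
        state = next;
        if (next == State.OPEN) {
            openedAt = clock.getAsLong();
        }
        startNewWindow();
    }

    private void startNewWindow() {
        windowStart = clock.getAsLong();
        total = 0;
        failures = 0;
    }

    public static void main(String[] args) {
        long[] now = {0L}; // fake clock in milliseconds
        SimpleCircuitBreaker breaker = new SimpleCircuitBreaker(20, 50, 10_000L, 5_000L, () -> now[0]);
        for (int i = 0; i < 20; i++) {
            breaker.recordFailure();
        }
        System.out.println("after 20 failures: " + breaker.state());
        now[0] += 5_000L; // the sleep window elapses
        System.out.println("trial allowed: " + breaker.allowRequest());
        breaker.recordSuccess();
        System.out.println("after successful trial: " + breaker.state());
    }
}
```

A caller would check allowRequest() before each backend call, run the fallback when it returns false, and report the outcome through recordSuccess() or recordFailure().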

Using Hystrix with Feign

Because the circuit breaker only acts on the service caller side, we only need to change the code of the eureka-consumer-feign project from the example code in the previous article.

POM Configuration

Because Feign already depends on Hystrix, no changes to the Maven configuration are needed.

Configuration file

Modify the original application.yml configuration as follows:

spring:
  application:
    name: eureka-consumer-feign-hystrix
eureka:
  client:
    service-url:
      defaultZone: http://localhost:7000/eureka/
server:
  port: 9003
feign:
  hystrix:
    enabled: true

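Create the fallback class

Create a HelloRemoteHystrix class that implements HelloRemote and provides the fallback method:

import org.springframework.stereotype.Component;
import org.springframework.web.bind.annotation.RequestParam;

// Fallback implementation used when the call to eureka-producer fails or the circuit breaker is open
@Component
public class HelloRemoteHystrix implements HelloRemote {

    @Override
    public String hello(@RequestParam(value = "name") String name) {
        return "Hello World!";
    }

}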

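Add the fallback property

Add the fallback class to the HelloRemote interface, so that the content of the fallback class is returned when the circuit breaker trips.

import org.springframework.cloud.openfeign.FeignClient;
import org.springframework.web.bind.annotation.GetMapping;
import org.springframework.web.bind.annotation.RequestParam;

// Feign client for eureka-producer; falls back to HelloRemoteHystrix on failure
@FeignClient(name = "eureka-producer", fallback = HelloRemoteHystrix.class)
public interface HelloRemote {

    @GetMapping("/hello/")
    String hello(@RequestParam(value = "name") String name);

}

No other changes are needed. It's that easy!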

Test

Start the three projects eureka-server, eureka-producer, and the just-modified eureka-consumer-hystrix in turn.

Visit: http://localhost:9003/hello/windmt
Return: [0]Hello, windmt! Sun Apr 15 23:14:25 CST 2018

This shows that normal access is not affected after adding Hystrix. Next, we manually stop the eureka-producer project and test again:

Visit: http://localhost:9003/hello/windmt
Return: Hello World!

Now we start the eureka-producer project again and test:

Visit: http://localhost:9003/hello/windmt
Return: [0]Hello, windmt! Sun Apr 15 23:14:52 CST 2018

The returned results show that the circuit breaker worked.

Summary

With Hystrix, we can easily prevent the avalanche effect, and the system also gains automatic degradation and automatic service recovery.
