Ramble About Avalanches


An avalanche refers to the situation where system A normally calls system B, and suddenly system A's traffic to system B exceeds system B's capacity, causing system B to crash.

Note that the avalanche effect differs from a denial-of-service attack: both overload system B, but the latter is a deliberate, man-made attack.

Note also that the avalanche effect differs from an ordinary traffic surge: if system A directly faces users, a surge of users produces a surge of traffic on both A and B, but that case can be handled by estimating the load and scaling A and B in advance.

An Avalanche Case

If system A's traffic is steady, what could cause a surge of traffic to B? Avalanches have many causes; this article presents one fairly typical case. First, look at an architecture I call "a matchstick propping up a truck":


System A depends on system B's read service. System A is a cluster of 60 machines and system B a cluster of 6. The reason 6 machines can carry the traffic from 60 is that system A does not hit B on every access: it queries the cache first, and only requests B when the data is not in the cache.

That is the whole point of the cache: it saves system B a lot of machines. Without it, B would also have to be a 60-machine cluster. And what if A also depended on a system C? C would need 60 machines too, and the amplified traffic would soon drain the company's resources.

This is not perfect, however. The structure is like the matchstick propping up the truck: it works not because the matchstick is so strong, but because the truck is tied to a hydrogen balloon, and the balloon is the cache. You can probably see the problem already: what if the balloon pops? Yes, that is exactly how avalanches happen.

Back to the A and B architecture, there are at least three possible causes of avalanches:

1. System B's front-end proxy fails, or B becomes temporarily unavailable for some other reason; the moment B's service recovers, system A's traffic comes flooding in.

2. The cache system fails, and system A's accesses all crash straight into system B.

3. The cache recovers from a failure, but at that moment it is empty, its instantaneous hit rate is 0, and it is effectively penetrated.

The first cause is not so easy to understand: why does traffic soar after system B recovers? The main reason is cache expiry. When cached data expires, system A goes to system B for it, but B is unavailable, so the data stays expired. By the time A finds that B has recovered, everything in the cache has long since expired and become stale, so of course all the requests hit B.

Perhaps you are quick to propose a solution: since introducing a cache brings so many hidden avalanche risks, remove the cache and make system B's service capacity greater than or equal to system A's.

This does eliminate avalanches, but it is not the right way to do it. The reasoning takes a moment, so here is an analogy: capitalism was an advance over feudal society, yet capitalism brings new problems of its own, problems feudal society never had. Should those problems be solved by going back to feudalism?

Preventing Avalanches: Client-Side Solutions

"Client side" here means system A in the architecture above: relative to system B, system A is B's client.

For the three causes of avalanches, B recovering from failure, cache failure, and the cache recovering from failure, let us see what system A can do.

Using the Cache Properly to Handle a B-System Outage

Each key in the cache has, besides its value, an expiration time T. Within T, a get operation simply reads the key's value from the cache and returns it. Once T is reached, however, the get operation can behave in five main ways:

1. Simple (stupid) timeout-based mode

When T arrives, the key and its value are cleared from the cache (or marked unavailable); any thread whose get operation finds this calls the remote service to fetch the value for the key and writes it back to the cache.
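A minimal sketch of this mode in Java, assuming a hypothetical Function<K, V> loader that stands in for the remote call to system B:

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Function;

public class SimpleTimeoutCache<K, V> {
    private static class Entry<V> {
        final V value;
        final long expireAt;
        Entry(V value, long ttlMillis) {
            this.value = value;
            this.expireAt = System.currentTimeMillis() + ttlMillis;
        }
        boolean expired() { return System.currentTimeMillis() > expireAt; }
    }

    private final ConcurrentHashMap<K, Entry<V>> map = new ConcurrentHashMap<>();
    private final long ttlMillis;
    private final Function<K, V> loader;   // stands in for the remote call to system B

    public SimpleTimeoutCache(long ttlMillis, Function<K, V> loader) {
        this.ttlMillis = ttlMillis;
        this.loader = loader;
    }

    public V get(K key) {
        Entry<V> e = map.get(key);
        if (e == null || e.expired()) {
            // every thread that sees a miss or expiry calls B on its own -- no coordination at all
            V v = loader.apply(key);
            map.put(key, new Entry<>(v, ttlMillis));
            return v;
        }
        return e.value;
    }
}
```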

2. Regular timeout-based mode

When T arrives, the key and its value are cleared from the cache (or marked unavailable), and the get operation calls the remote service to fetch the value for the key and writes it back to the cache.

The difference: if another thread's get finds the key and value already unavailable, it also checks whether some other thread has already started the remote call. If so, it waits until that thread's remote fetch finishes and the key becomes available in the cache again, then reads the value straight from the cache and is done.

To make this concrete, an analogy: 5 workers (threads) go to the port to pick up goods for the same key (get) and find that the goods have expired and been thrown away. They send just one person to the distant port to fetch new goods while the other four wait at the local port, instead of all five going. In the simple timeout-based mode, by contrast, all 5 workers set off for the distant port themselves, with no communication at all, as if the others did not exist.

Implementing the regular timeout-based mode calls for the classic Java synchronized double-checked locking idiom.
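A minimal sketch of that idiom applied to the cache, again with a hypothetical loader standing in for the remote call to system B; only one thread per key goes remote while the others wait on the same lock:

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Function;

public class RegularTimeoutCache<K, V> {
    private static class Entry<V> {
        final V value;
        final long expireAt;
        Entry(V value, long ttlMillis) {
            this.value = value;
            this.expireAt = System.currentTimeMillis() + ttlMillis;
        }
        boolean expired() { return System.currentTimeMillis() > expireAt; }
    }

    private final ConcurrentHashMap<K, Entry<V>> map = new ConcurrentHashMap<>();
    private final ConcurrentHashMap<K, Object> locks = new ConcurrentHashMap<>();
    private final long ttlMillis;
    private final Function<K, V> loader;   // stands in for the remote call to system B

    public RegularTimeoutCache(long ttlMillis, Function<K, V> loader) {
        this.ttlMillis = ttlMillis;
        this.loader = loader;
    }

    public V get(K key) {
        Entry<V> e = map.get(key);
        if (e != null && !e.expired()) return e.value;      // first check, no locking
        Object lock = locks.computeIfAbsent(key, k -> new Object());
        synchronized (lock) {                                // one thread per key goes remote, the rest wait here
            e = map.get(key);
            if (e != null && !e.expired()) return e.value;   // second check: someone else already reloaded
            V v = loader.apply(key);                         // the single remote call to system B
            map.put(key, new Entry<>(v, ttlMillis));
            return v;
        }
    }
}
```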

3. Simple (stupid) refresh-based mode

When T arrives, the key and its value stay in the cache untouched, but if a thread calls get, a refresh operation is triggered. Depending on the synchronization relationship between get and refresh, there are two sub-modes (a minimal sketch follows this list):

    • Synchronous mode: any thread that finds the key expired triggers a refresh, and its get operation waits for the refresh to finish, then returns whatever value the cache currently holds for that key. Note that the refresh finishing does not mean it succeeded; it may have thrown an exception and left the cache unchanged, in which case the get returns the old value.
    • Asynchronous mode: any thread that finds the key expired triggers a refresh, but its get does not wait for the refresh to complete; it returns the old value from the cache immediately.
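A minimal sketch of the asynchronous sub-mode, with the same hypothetical loader; note that nothing stops several threads from submitting refreshes for the same key, which is exactly what makes this the "stupid" variant:

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.function.Function;

public class SimpleRefreshCache<K, V> {
    private static class Entry<V> {
        volatile V value;
        volatile long refreshAt;
        Entry(V value, long ttlMillis) {
            this.value = value;
            this.refreshAt = System.currentTimeMillis() + ttlMillis;
        }
    }

    private final ConcurrentHashMap<K, Entry<V>> map = new ConcurrentHashMap<>();
    private final ExecutorService refreshPool = Executors.newFixedThreadPool(4);
    private final long ttlMillis;
    private final Function<K, V> loader;   // stands in for the remote call to system B

    public SimpleRefreshCache(long ttlMillis, Function<K, V> loader) {
        this.ttlMillis = ttlMillis;
        this.loader = loader;
    }

    public V get(K key) {
        Entry<V> e = map.get(key);
        if (e == null) {                                  // first access: load synchronously
            V v = loader.apply(key);
            map.put(key, new Entry<>(v, ttlMillis));
            return v;
        }
        if (System.currentTimeMillis() > e.refreshAt) {
            // every thread that sees the expiry submits a refresh -- no deduplication
            refreshPool.submit(() -> {
                e.value = loader.apply(key);
                e.refreshAt = System.currentTimeMillis() + ttlMillis;
            });
        }
        return e.value;                                   // the stale value is returned immediately
    }
}
```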

4. Regular refresh-based mode

When T arrives, the key and its value stay in the cache untouched, but if a thread calls get, a refresh operation is triggered. Depending on the synchronization relationship between get and refresh, there are again two sub-modes (a sketch follows this list):

    • Synchronous mode: the thread that triggers the refresh waits for it to finish and then returns whatever value the cache currently holds for the key (again, a finished refresh may have failed and left the old value in place). If another thread calls get, finds the key expired, and sees that some thread has already triggered the refresh, it does not wait; it returns the old value right away.
    • Asynchronous mode: the thread that triggers the refresh does not wait for it either; it returns the old value from the cache immediately. Likewise, any other thread whose get finds the key expired and sees a refresh already in flight returns the old value directly without waiting.
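A minimal sketch of the asynchronous sub-mode of the regular refresh-based pattern; an AtomicBoolean per entry deduplicates the refresh so only the first thread that notices the expiry triggers it:

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.atomic.AtomicBoolean;
import java.util.function.Function;

public class RegularRefreshCache<K, V> {
    private static class Entry<V> {
        volatile V value;
        volatile long refreshAt;
        final AtomicBoolean refreshing = new AtomicBoolean(false);   // deduplication flag
        Entry(V value, long ttlMillis) {
            this.value = value;
            this.refreshAt = System.currentTimeMillis() + ttlMillis;
        }
    }

    private final ConcurrentHashMap<K, Entry<V>> map = new ConcurrentHashMap<>();
    private final ExecutorService refreshPool = Executors.newFixedThreadPool(4);
    private final long ttlMillis;
    private final Function<K, V> loader;   // stands in for the remote call to system B

    public RegularRefreshCache(long ttlMillis, Function<K, V> loader) {
        this.ttlMillis = ttlMillis;
        this.loader = loader;
    }

    public V get(K key) {
        Entry<V> e = map.computeIfAbsent(key, k -> new Entry<>(loader.apply(k), ttlMillis));
        if (System.currentTimeMillis() > e.refreshAt
                && e.refreshing.compareAndSet(false, true)) {        // only the first thread wins
            refreshPool.submit(() -> {
                try {
                    e.value = loader.apply(key);
                    e.refreshAt = System.currentTimeMillis() + ttlMillis;
                } finally {
                    e.refreshing.set(false);
                }
            });
        }
        return e.value;   // every caller, including the trigger, gets the current (possibly stale) value
    }
}
```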

Picking up the dockworker example again, now for the refresh-based modes: the 5 workers go to the port for the goods and the goods are there, just old. The workers then have two choices:

send one of the five to the distant port to fetch new goods while the other four carry the old goods back first (synchronous mode);

or ask a hired hand to go fetch the new goods while all five carry the old goods back first (asynchronous mode).

The difference between the simple and the regular refresh-based mode mirrors the difference between the simple and regular timeout-based modes, so it is not repeated here.

5. Refresh-based renewal mode

The only difference between this mode and the regular refresh-based mode is how a refresh timeout or failure is handled. In the regular refresh-based mode, when the refresh times out or fails with an exception, the old key-value stays in the cache, so the next get operation triggers another refresh.

In the refresh-based renewal mode, if the refresh fails, the refresh writes the old value back as if it were new, effectively renewing the old value for another period T; get operations during the following T return the renewed old value without triggering any refresh.

Like the regular mode, the refresh-based renewal mode comes in synchronous and asynchronous variants, which are not repeated here.
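In code, relative to the RegularRefreshCache sketch above, only the refresh task changes; a hedged sketch of that delta (same fields and types as in that class):

```java
// Renewal-mode variant of the refresh task in the RegularRefreshCache sketch above.
// The only change is the failure branch: on error the old value is kept and its
// refresh deadline is pushed forward by T, so the next T of get calls will not
// trigger another refresh against a struggling system B.
refreshPool.submit(() -> {
    try {
        e.value = loader.apply(key);                               // success: store the new value
    } catch (Exception ex) {
        // failure: leave e.value untouched -- the old value is "renewed"
    } finally {
        e.refreshAt = System.currentTimeMillis() + ttlMillis;      // either way, a fresh deadline
        e.refreshing.set(false);
    }
});
```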

Now let us see how these five get modes perform during an avalanche. First, some assumptions:

    • system A receives m accesses per minute;
    • the cache can hold c keys, and the key space contains n keys;
    • under normal conditions, the number of accesses to system B per minute is w = misses + expirations, which is roughly bounded by w < c + m*(n-c)/n.

Now suppose that for some reason, say a long outage of B, every key in the cache has expired, and B then recovers from the failure. The five get modes perform as follows:

1. In the simple timeout-based and simple refresh-based modes, the instantaneous traffic to B will be roughly equal to A's instantaneous traffic m; the cache is effectively penetrated. An avalanche occurs, and the freshly recovered B is certain to be knocked down again.

2. In the regular timeout-based and regular refresh-based modes, the instantaneous traffic to B will be roughly equal to the cache's key space n (one remote call per key). Whether an avalanche occurs, and whether B can survive it, depends on whether n exceeds B's traffic limit.

3. In the refresh-based renewal mode, B's instantaneous traffic stays at w, the same as under normal conditions, and there is no avalanche at all. In fact, in this mode the cache keys never "all expire": even if B stayed dead forever, system A would keep running smoothly on the old cached values.

From system B's point of view, the refresh-based renewal mode is the outright winner at resisting avalanches.

From system A's point of view, since A is usually a high-traffic online web application, the words it hates most are "thread wait", so the asynchronous variants of the refresh-based modes are preferable.

In general, the refresh-based asynchronous renewal mode is the first choice.

Everything has pros and cons, however, and there are two caveats:

1. The biggest drawback of the refresh-based modes is that once a key-value pair enters the cache it is never removed; each update merely overwrites the old value with the new one, so the GC can never reclaim it. In the timeout-based modes, if no new access arrives after a key-value expires, the memory can eventually be garbage-collected. So if your cache lives in precious local heap memory, be careful.

2. The refresh-based renewal mode needs good monitoring; otherwise the cached value may drift far from the real value while the application still believes it is fresh, and nobody notices.

As for concrete caches, Google's Guava local cache library supports the second, fourth and fifth get modes above.
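A hedged sketch of how these modes map onto Guava's LoadingCache; the key name, the 60-second window and loadFromB are illustrative, not from the article:

```java
import com.google.common.cache.CacheBuilder;
import com.google.common.cache.CacheLoader;
import com.google.common.cache.LoadingCache;
import com.google.common.util.concurrent.Futures;
import com.google.common.util.concurrent.ListenableFuture;
import java.util.concurrent.TimeUnit;

public class GuavaRefreshExample {
    // hypothetical remote call to system B
    static String loadFromB(String key) { return "value-for-" + key; }

    static final LoadingCache<String, String> CACHE = CacheBuilder.newBuilder()
            .maximumSize(100_000)
            .refreshAfterWrite(60, TimeUnit.SECONDS)   // refresh-based: stale values are served while one thread reloads
            .build(new CacheLoader<String, String>() {
                @Override
                public String load(String key) {
                    return loadFromB(key);             // first access: blocks the caller
                }
                @Override
                public ListenableFuture<String> reload(String key, String oldValue) {
                    try {
                        return Futures.immediateFuture(loadFromB(key));
                    } catch (Exception e) {
                        // renewal behavior (mode 5): if the reload fails, keep serving the old value
                        return Futures.immediateFuture(oldValue);
                    }
                }
            });

    public static void main(String[] args) throws Exception {
        System.out.println(CACHE.get("user:42"));
    }
}
```

Roughly speaking, expireAfterWrite gives the regular timeout-based mode (Guava deduplicates concurrent loads per key), refreshAfterWrite gives the regular refresh-based mode (one thread reloads while the rest keep getting the old value), catching the failure inside reload and returning oldValue approximates the renewal mode, and wrapping the loader with CacheLoader.asyncReloading(loader, executor) makes the refresh asynchronous for the triggering thread as well.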

Distributed caches such as Redis, however, provide only the raw get and set commands; their get merely fetches the value and has nothing to do with the five get modes above. Developers who want one of the five modes have to wrap and implement it themselves.
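Such a wrapper might look like the hedged sketch below, which layers the refresh-based asynchronous renewal mode on top of Redis's plain get/set using the Jedis client; the endpoint, the embedded-deadline encoding and the loader are all illustrative, and refresh deduplication across threads and hosts is omitted for brevity:

```java
import redis.clients.jedis.Jedis;
import redis.clients.jedis.JedisPool;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.function.Function;

public class RedisRenewalCache {
    private final JedisPool pool = new JedisPool("localhost", 6379);  // assumed Redis endpoint
    private final ExecutorService refreshPool = Executors.newFixedThreadPool(4);
    private final long ttlMillis;
    private final Function<String, String> loader;                    // stands in for the remote call to system B

    public RedisRenewalCache(long ttlMillis, Function<String, String> loader) {
        this.ttlMillis = ttlMillis;
        this.loader = loader;
    }

    public String get(String key) {
        String payload;
        try (Jedis jedis = pool.getResource()) {
            payload = jedis.get(key);
        }
        if (payload == null) {                                         // never cached: load synchronously once
            String v = loader.apply(key);
            put(key, v);
            return v;
        }
        int sep = payload.indexOf('|');
        long refreshAt = Long.parseLong(payload.substring(0, sep));
        String value = payload.substring(sep + 1);
        if (System.currentTimeMillis() > refreshAt) {
            refreshPool.submit(() -> {
                try {
                    put(key, loader.apply(key));                       // success: new value, new deadline
                } catch (Exception ex) {
                    put(key, value);                                   // failure: renew the old value for another T
                }
            });
        }
        return value;                                                  // always answer with the current (possibly stale) value
    }

    // the logical refresh deadline is embedded in the stored string ("<epochMillis>|<value>");
    // no physical TTL is set, so stale values survive a long outage of system B
    private void put(String key, String value) {
        try (Jedis jedis = pool.getResource()) {
            jedis.set(key, (System.currentTimeMillis() + ttlMillis) + "|" + value);
        }
    }
}
```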

Of the five get modes, the simple timeout-based and simple refresh-based modes are the easiest to implement, and unfortunately they are also completely helpless against avalanches, which may be an important reason avalanches occur so often in systems that depend heavily on caches.

Dealing with a Distributed Cache Outage

If the cache itself goes down, even the refresh-based asynchronous renewal mode above is useless: system A cannot reach the cache at all, so the full traffic goes straight to system B, and B is doomed to face an avalanche...

The cache outages discussed in this section are limited to distributed caches, because a local cache shares memory and process with system A's application: if the local cache dies, system A dies with it, so there is no case where the local cache is down while system A keeps running normally.

First, when a request thread in system A finds that the distributed cache does not respond and concludes that it is down, it must not simply turn around and request system B, or the avalanche will crush B. Instead, the following options are available:

1. The current thread of system A does not request B at all; it just logs the event and returns a default value.

2. The current thread of system A decides whether to request B according to a certain probability.

3. The current thread of system A checks how B is doing and requests B only if B is healthy.

Option 1 is the simplest: system A knows that without the cache B cannot carry A's full traffic, so it simply stops requesting B and waits for the cache to recover. But then B's utilization is 0, which is clearly not optimal, and when a sensible default value is hard to choose, this option does not work at all.

Option 2 lets a fraction of threads through to system B, a load B can certainly handle. A conservative choice for the probability is u = B's average traffic / A's peak traffic.
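As a concrete illustration of option 2, a hedged Java sketch; the helper names loadFromB and defaultValue are hypothetical:

```java
import java.util.concurrent.ThreadLocalRandom;

// Option 2: when the distributed cache is down, let each request through to
// system B only with probability u = B's average traffic / A's peak traffic.
public class ProbabilisticFallback {
    private final double u;

    public ProbabilisticFallback(double bAverageQps, double aPeakQps) {
        this.u = bAverageQps / aPeakQps;
    }

    public String get(String key) {
        if (ThreadLocalRandom.current().nextDouble() < u) {
            return loadFromB(key);      // the lucky fraction still gets fresh data from B
        }
        return defaultValue(key);       // the rest get a degraded default, and B stays alive
    }

    // hypothetical helpers standing in for the real remote call and fallback
    private String loadFromB(String key) { return "value-from-B:" + key; }
    private String defaultValue(String key) { return "default:" + key; }
}
```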

Option 3 is smarter: if B is running well, the current thread sends the request; if B is overloaded, it does not. System A thus keeps B hovering at the critical point just short of falling over, squeezing out B's maximum capacity. This option requires B to expose a health evaluation interface that returns YES or NO: YES means B is healthy and may be requested; NO means B is in bad shape and should be left alone. This interface will be called very frequently, so it must be efficient.

The key to option 3 is how to evaluate a system's health. A host's performance indicators include CPU load, memory usage, swap usage, GC frequency and GC pause time, the average response time of each interface, and so on; the health interface has to turn these into a yes-or-no answer. Is that not a binary classification problem in machine learning? That topic deserves an article of its own and is not expanded here. You can also imagine a relatively simple, conservative rule-of-thumb strategy, but then system A can only roughly approximate B's real capacity.

Weighing the above, option 2 is the more reliable choice. If option 3 is chosen, it is recommended that a dedicated team research and provide a unified, real-time system health evaluation scheme and tooling.

Coping with Recovery After a Distributed Cache Outage

Do not assume that surviving the distributed cache outage means everything is fine. The real test comes when the distributed cache process recovers from the outage, because at that moment the distributed cache holds nothing.

Even the refresh-based asynchronous renewal policy described above is of no help here: the distributed cache is empty, so every request has to go to system B anyway. At this point B's maximum instantaneous traffic equals the size of the key space.

If the key space is small, all is peaceful; if the key space is larger than system B's capacity, the avalanche is still unavoidable.

System A has a hard time handling this case. The crux is that when A asks the cache for a key and gets back null, A cannot tell whether the cache has just been initialized and is entirely empty, or whether the requested key simply is not in the cache.

If it is the former, A should apply some avoidance policy, just as it does for a cache outage; if it is the latter, A should request B directly, because that is the normal way a cache is used.

So for recovery after a cache outage, system A is essentially powerless and can only pin its hopes on measures inside system B.

Server-Side Solutions

The client side has to deal with all sorts of messy issues; the server side has to solve one simply stated problem: how to handle overload calmly. Whether it is an avalanche or a denial-of-service attack, from the server side it is an overload protection problem. For overload protection there are two down-to-earth solutions and one rather futuristic one.

Flow Control

Flow control means that system B monitors its current traffic in real time and, if it exceeds a preset threshold or the system's capacity, rejects a portion of the requests outright in order to protect itself.

Depending on the data it is based on, flow control comes in two kinds:

1. Flow control based on a traffic threshold: the threshold is the traffic ceiling for each host; a host whose traffic exceeds it becomes unstable. The threshold is set in advance, and if the host's current traffic exceeds it, a portion of requests is rejected so that the traffic actually processed always stays below the threshold.

2. Flow control based on host state: before each request is accepted, the current host state is checked; if the host is not in good shape, the request is rejected.

Threshold-based flow control is simple to implement, but its biggest problem is that the threshold must be set in advance. As the business logic grows more complex and the interfaces multiply, the host's real service capacity actually drops, so the threshold has to be lowered again and again, which raises maintenance costs; and if someone forgets to adjust it, well...

The threshold for a host can be determined by stress testing, and it is wise to choose it conservatively.
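As an illustration of the threshold mechanism itself (wherever it is ultimately deployed), a minimal sketch using Guava's RateLimiter as the token bucket; MAX_QPS and the request/response types are placeholders:

```java
import com.google.common.util.concurrent.RateLimiter;

// Threshold-based flow control: requests beyond MAX_QPS are rejected outright,
// so the traffic actually processed stays below the stress-tested ceiling.
public class ThresholdFlowControl {
    private static final double MAX_QPS = 500.0;                   // assumed stress-test result
    private final RateLimiter limiter = RateLimiter.create(MAX_QPS);

    public Response handle(Request request) {
        if (!limiter.tryAcquire()) {
            // reject with an explicit reason -- see the note below about always
            // telling the caller that flow control refused the request
            return Response.rejected("request rejected by flow control: QPS limit exceeded");
        }
        return doBusinessLogic(request);
    }

    // hypothetical types and business entry point, just to keep the sketch self-contained
    static class Request {}
    static class Response {
        static Response rejected(String reason) { return new Response(); }
    }
    private Response doBusinessLogic(Request request) { return new Response(); }
}
```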

Host-state-based flow control removes the manual tuning, but its biggest difficulty has already been mentioned: how do you judge the host's state from its current metrics? There is no easy, satisfying answer to that question yet, so until there is one I recommend threshold-based flow control.

Flow control can also be classified by where it is implemented:

1. Flow control in the reverse proxy: implement throttling policies in a reverse proxy such as Nginx. This typically targets HTTP services.

2. Flow control in a service governance system: if the server side exposes RMI, RPC or similar services, a dedicated service governance system can provide load balancing, flow control and the like.

3. Flow control in the service container: implement traffic management just before the business logic.

The third way, implementing flow control inside the server container (a Java container, say), is not recommended. First, flow control code gets mixed with business code; second, the traffic has already fully entered the business code, so flow control merely keeps it out of the real business logic and its effect is discounted; third, if the traffic policy changes frequently, the system has to change frequently with it.

Therefore, the first two methods are recommended.

Finally, one note: when a request is rejected because of flow control, be sure to include the reason in the response (for example "the current request was rejected because the traffic limit was exceeded"); returning nothing is a huge trap. There are many reasons a caller's request might get no response: a caller bug, a server bug, network instability. After a full day of troubleshooting you may finally discover that flow control was the ghost all along...

Service Degradation

Service degradation is generally triggered manually and really belongs with avalanche recovery, but it is placed here so it can be contrasted with flow control.

Flow control essentially reduces the incoming traffic while the service's processing capacity stays the same; service degradation essentially raises the service's processing capacity while the incoming traffic stays the same.

Service degradation means shutting down unimportant interfaces when the service is overloaded (rejecting their requests outright) while keeping the important ones. For example, if a service exposes 10 interfaces and 5 of them are closed during degradation, the host's processing capacity for the remaining 5 roughly doubles.

But an avalanche can bring more than 10 times the system's processing capacity; can service degradation multiply a host's capacity by 10? Obviously not easily. So service degradation cannot replace flow control as the main overload-protection strategy.

Dynamic Expansion

Dynamic expansion means that when traffic exceeds the system's service capacity, cluster expansion, deployment and bring-up are triggered automatically, and once the traffic passes, the surplus machines are automatically reclaimed; fully elastic.

This scheme sounds wonderful, but even during the peak of the cloud-computing hype a couple of years ago, no large domestic company was seen actually using it. Amazon and Google reportedly can; for the rest of us, building it in-house is a heavy undertaking.

Recovery of Avalanches

When an avalanche happens, operations staff should apply flow control, and after the backend systems restart, release traffic gradually, mainly so the cache can warm up slowly. Flow control can start at 10%, then 20%, then 50%, then 80%, and finally the full amount. The exact ratios, especially the initial one, depend on the ratio of backend capacity to frontend traffic and differ from system to system.

If the backend has a dedicated tool for pre-warming the cache, this manual ramp-up can be skipped: warm the cache first, then release the backend. However, if the cache's key space is large, building such a pre-warming tool is itself difficult.

Conclusion

"Prevention" in the avalanche response is also suitable, prevention, remedial supplement. Comprehensive analysis of the above, the specific prevention points are as follows:

1. The caller (system A) uses the cache in the refresh-based asynchronous renewal mode, or at the very least not in the simple (stupid) timeout-based or refresh-based modes.

2. The caller (system A) checks on every request whether the cache is available and, if it is not, falls back to the backend with a conservative probability rather than recklessly hitting the backend directly.

3. The service provider (system B) sets up flow control in its reverse proxy for overload protection, with the threshold obtained through stress testing.

If there is spare energy, the problems of judging a host application's health and of dynamic, elastic operations are worth further study.

As for recovering from an avalanche, it mainly relies on operations and development releasing traffic gradually.

I have been studying avalanches for more than two weeks, and this is where it ends. Next I may translate some open-source books for Turing Press; suggestions for what you would like to read are warmly welcome!

Copyright notice: this is the blogger's original article and may not be reproduced without the blogger's permission.
