How to avoid failure?


For every programmer, failure hangs overhead like the sword of Damocles: something we dread yet cannot fully avoid. How to prevent failure is a problem every programmer struggles with. It can be approached from many angles: requirements analysis, architecture design, coding, testing, code review, release, and online operations. Here I draw on my limited two years of internet back-end experience to share my understanding of the problem from one particular perspective. There will certainly be gaps, and I welcome corrections.

Most of our services share the same basic structure: they serve users on one side and depend on third-party services on the other, with all kinds of business, algorithm, and data logic in between, any of which can be a source of failure. How to avoid failure? I summarize it in one sentence: "distrust third parties, guard against callers, and do your own part well."

1 Distrust third parties

Hold on to the belief that "all third-party services are unreliable", regardless of what the third party promises. Based on this belief, we need to take the following actions.

1.1 Have a backstop: prepare a business degradation plan

What if a third-party service goes down? Does our business go down with it? Obviously that is not the result we want to see. If we prepare a good degradation plan, it greatly improves the reliability of our service. A few examples will make this clearer.

For example, our personalized recommendation service needs to fetch each user's personalization data from the user center so that the model can score and rank items. If the user center goes down and we cannot get the data, do we simply stop recommending? Obviously not: we can keep a list of hot items in a cache as a backstop (a sketch of this fallback follows).
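
A minimal sketch of this fallback, assuming hypothetical UserCenterClient, HotItemCache, UserProfile, and Item types (none of these are the real service's API):

import java.util.Collections;
import java.util.List;

// Hypothetical placeholder types for illustration only.
interface UserCenterClient { UserProfile getProfile(long userId); }
interface HotItemCache { List<Item> topN(int n); }
class UserProfile { /* personalization features */ }
class Item { /* recommended item */ }

public class RecommendService {
    private final UserCenterClient userCenter;
    private final HotItemCache hotItems;

    public RecommendService(UserCenterClient userCenter, HotItemCache hotItems) {
        this.userCenter = userCenter;
        this.hotItems = hotItems;
    }

    public List<Item> recommend(long userId) {
        try {
            UserProfile profile = userCenter.getProfile(userId);
            return rankByModel(profile);      // normal, personalized path
        } catch (Exception e) {
            return hotItems.topN(20);         // degraded path: serve cached hot items
        }
    }

    private List<Item> rankByModel(UserProfile profile) {
        return Collections.emptyList();       // model scoring omitted in this sketch
    }
}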

Another example: a data synchronization service needs to fetch the latest data from a third party and update MySQL. The third party offers two channels: 1) a message notification service that pushes only the changed data; 2) an HTTP service that we must call proactively to pull data. We initially chose the message channel alone because it was more real-time, but then the messages stopped arriving, with no error at all; by the time we noticed, a whole day had passed and the problem had escalated into a failure. A more reasonable approach is to use both channels: messages for real-time updates, with a scheduled HTTP pull (for example, every hour) as a backstop. Even if the messages break, the active sync still guarantees hourly updates (see the sketch below).
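
A minimal sketch of the dual-channel idea; SyncClient is a hypothetical stand-in for both the message consumer and the HTTP puller:

import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

// SyncClient and its methods are illustrative, not the real integration API.
interface SyncClient {
    void onChangeMessage(java.util.function.Consumer<Object> handler);
    void applyToMysql(Object change);
    void pullAllAndApply();
}

public class DualChannelSync {
    private final ScheduledExecutorService scheduler = Executors.newSingleThreadScheduledExecutor();

    public void start(SyncClient client) {
        // Channel 1: real-time, triggered by each change message.
        client.onChangeMessage(change -> client.applyToMysql(change));

        // Channel 2: backstop, a full pull every hour even if no message arrived.
        scheduler.scheduleAtFixedRate(() -> client.pullAllAndApply(), 1, 1, TimeUnit.HOURS);
    }
}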

Sometimes a third-party service looks normal on the surface, but the data it returns is contaminated. How do we backstop that? Some say that apart from asking the third party to restore the data quickly, all you can do is wait. For example, our mobile search service calls a third-party interface to fetch data and build an inverted index; if the third-party data is wrong, our index is wrong too, and our retrieval service surfaces the wrong content. It took the third party nearly half an hour to restore the data at best, and rebuilding our index takes another half hour, so retrieval could be broken for more than an hour, which is unacceptable. Our backstop is to keep a snapshot of the full index file at regular intervals. Once the third-party data source is contaminated, we first flip the "stop index build" switch and quickly roll back to an earlier healthy snapshot. The data may not be fresh (perhaps an hour old), but at least the results are reasonable and the impact on transactions is limited.

1.2 Follow the fail-fast principle and always set a timeout

One of our services called a third-party interface whose normal response time was 50ms. One day the third-party interface had a problem and about 15% of requests took more than 2s to respond. Before long our service load soared above 10 and response times became very slow; in other words, the third-party service dragged ours down.

Why were we dragged down? Because no timeout was set. We made synchronous calls through a thread pool with a maximum of 50 threads; when all threads were busy, extra requests were placed in the queue. When the third-party interface responds in about 50ms, each thread finishes its work quickly and moves on to the next request. Unfortunately, once a certain proportion of third-party responses take 2s, those 50 threads soon all end up stuck waiting, requests pile up in the queue, and overall service capacity drops sharply.

The correct approach is to negotiate a short timeout with the third party, for example 200ms, so that even if their service has problems it does not significantly affect ours. A minimal client sketch follows.
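
A minimal sketch of a fast-failing call using java.net.http (Java 11+); the 200ms budget and the URL handling are illustrative:

import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.time.Duration;

public class FastFailClient {
    private final HttpClient client = HttpClient.newBuilder()
            .connectTimeout(Duration.ofMillis(200))   // fail fast if we cannot even connect
            .build();

    public String fetch(String url) throws Exception {
        HttpRequest request = HttpRequest.newBuilder(URI.create(url))
                .timeout(Duration.ofMillis(200))      // cap the whole call at 200ms
                .GET()
                .build();
        // If the third party is slow, this throws HttpTimeoutException instead of
        // tying up one of our worker threads for seconds.
        return client.send(request, HttpResponse.BodyHandlers.ofString()).body();
    }
}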

1.3 Protect the third party appropriately: choose the retry mechanism carefully

Whether to retry needs to be considered carefully, in light of your own business and the specific exception. For example, when a call to a third-party service throws an exception, some engineers retry unconditionally. That is wrong: if the exception indicates a business-logic error, retrying will only produce the same exception again; if it is a processing timeout, you need to judge based on the business, because retries often just add more pressure to an already struggling backend and make things worse. A selective-retry sketch follows.
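
A minimal sketch of selective retry; ThirdPartyClient and BusinessException are hypothetical stand-ins, and the attempt count and backoff are illustrative:

import java.net.http.HttpTimeoutException;

// Hypothetical types for illustration only.
interface ThirdPartyClient { String call(String arg) throws Exception; }
class BusinessException extends RuntimeException {}

public class SelectiveRetry {
    public String callWithRetry(ThirdPartyClient client, String arg) throws Exception {
        int maxAttempts = 3;
        for (int attempt = 1; ; attempt++) {
            try {
                return client.call(arg);
            } catch (BusinessException e) {
                // A business-logic error will fail the same way every time: do not retry.
                throw e;
            } catch (HttpTimeoutException e) {
                // A timeout might be transient, but retrying also adds load on the
                // already-slow backend, so cap the attempts and back off a little.
                if (attempt >= maxAttempts) throw e;
                Thread.sleep(100L * attempt);
            }
        }
    }
}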

2 Guard against callers

There is a second belief: "all callers are unreliable", regardless of what they promise. Based on this belief, we need to take the following actions.

2.1 Design good APIs (RPC, RESTful) to avoid misuse

Over the past two years I have seen many failures caused, directly or indirectly, by badly designed interfaces. If your interface is misused by many people, you should reflect seriously on its design. Interface design looks simple but runs deep; I recommend Joshua Bloch's talk "How to Design a Good API & Why it Matters" and the "Java API Design Checklist".

Here is a brief discussion of my experience.

a) Follow the principle of minimal interface exposure

Expose only as many interfaces as callers actually need: the more interfaces we provide, the more chaotic and error-prone their use becomes. Furthermore, the more interfaces are exposed, the higher the maintenance cost.

b) Do not make callers do what the interface can do for them

If a caller needs to invoke our interface several times to complete a single operation, the interface design is probably flawed. Take a data-fetching interface: if we only provide getData(int id), a caller that wants 20 records has to call our interface 20 times in a loop; not only is the caller's performance poor, but our service also takes unnecessary extra load. Providing getDataList(List<Integer> idList) is clearly necessary.

c) Avoid long-running interfaces

Take the data-fetching method getDataList(List<Integer> idList) as an example. Suppose a caller passes 10,000 IDs at once: our service probably cannot produce a result within a few seconds, and the call routinely ends in a timeout; no matter how the caller retries, the result is a timeout exception. What to do? Limit the list length, for example to 100, so that at most 100 IDs can be passed per call; long executions are avoided, and if the caller passes a list longer than 100 we throw an exception.

When adding such a limit, it is essential to make callers clearly aware of it. We once had a misuse case: an order contained more than 100 items, and the order service called the product center interface to fetch the information for every item in the order, but every call failed and the exception carried no useful information. Only after a long troubleshooting session did they discover that the product center interface had a length limit.

So how do we add a limit without letting callers misuse the interface?

Two ideas: 1) the interface splits the call for the caller: for example, if the caller passes 10,000 IDs, the interface splits them into 100 lists of 100 IDs each and calls the underlying logic in a loop, hiding the internal mechanism and making it transparent to the caller; 2) the caller does the splitting itself and writes the loop explicitly, which requires making the limit obvious, by a) renaming the method, for example getDataListWithLimitLength(List<Integer> idList), b) adding documentation comments, and c) throwing a clear exception when the length exceeds 100, telling the caller exactly what went wrong. Both ideas are sketched below.
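
A sketch of both ideas; DataService, Data, getDataBatch, and the limit of 100 are illustrative names and values:

import java.util.ArrayList;
import java.util.List;

public class DataService {
    private static final int MAX_BATCH = 100;

    // Idea 1: the interface splits an oversized id list into chunks of at most
    // 100 internally, so callers never hit the length limit themselves.
    public List<Data> getDataList(List<Integer> idList) {
        List<Data> result = new ArrayList<>(idList.size());
        for (int from = 0; from < idList.size(); from += MAX_BATCH) {
            int to = Math.min(from + MAX_BATCH, idList.size());
            // Each underlying call stays short, avoiding long-running requests.
            result.addAll(getDataBatch(idList.subList(from, to)));
        }
        return result;
    }

    // Idea 2: keep the limit visible and fail loudly so callers learn about it
    // immediately instead of debugging silent failures.
    public List<Data> getDataListWithLimitLength(List<Integer> idList) {
        if (idList.size() > MAX_BATCH) {
            throw new IllegalArgumentException(
                    "idList length " + idList.size() + " exceeds the limit of " + MAX_BATCH);
        }
        return getDataBatch(idList);
    }

    private List<Data> getDataBatch(List<Integer> idList) {
        return new ArrayList<>();   // actual lookup omitted in this sketch
    }
}

class Data {}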

d) Make parameters easy to use

Avoid overly long parameter lists; beyond three or so parameters, a method becomes hard to use correctly. What if you genuinely need that many parameters? Write a parameter class (a small sketch follows after this list of tips).

Also avoid consecutive parameters of the same type; they are easy to pass in the wrong order.

Prefer specific types such as int over the String type wherever possible; this is another way to prevent misuse.
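
A minimal parameter-class sketch; the query scenario and all field names are illustrative assumptions:

public class QueryRequest {
    // Instead of query(String city, String category, int page, int pageSize),
    // group the arguments into one object that is hard to pass in the wrong order.
    private final String city;
    private final String category;
    private final int page;
    private final int pageSize;

    public QueryRequest(String city, String category, int page, int pageSize) {
        this.city = city;
        this.category = category;
        this.page = page;
        this.pageSize = pageSize;
    }
    // getters omitted in this sketch
}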

e) Exceptions

An interface should reflect problems in its implementation truthfully, rather than papering over them with "clever" code. I often see engineers wrap the whole interface body in a try/catch and, whatever exception is thrown inside, return an empty collection after catching it:

public List<Integer> test() {
    try {
        ...
    } catch (Exception e) {
        return Collections.emptyList();
    }
}

This leaves callers helpless: much of the time they cannot tell whether the problem lies in their own parameters or inside the service, and once they cannot tell, misuse follows.

2.2 Flow control: allocate traffic quotas per calling service to avoid misuse

I believe many engineers who run high-concurrency services have seen something like this: someone suddenly finds the request volume on his interface has jumped tenfold; soon the interface is nearly unusable and triggers a chain reaction that brings the whole system down.

Why would traffic jump tenfold? Was the interface attacked from outside? In my experience, an internal "culprit" is far more likely. I once saw a colleague's MapReduce job call an online service and kill it within minutes.

How do we handle this? Everyday life gives us the answer: an old-style fuse box has a fuse, and if someone plugs in an over-powered appliance, the fuse blows to protect the circuit from being burnt out by the surge. In the same spirit, our interfaces need a "fuse" to keep unexpected request volumes from crushing the system: when traffic is too high, we can reject or divert it. Specific rate-limiting algorithms are described in the "Interface Current limit Practice" article; a minimal per-caller limiter is sketched below.
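
A minimal sketch of a per-caller "fuse", using a simplified fixed-window counter; the class name, quota, and one-second window are illustrative, and a real deployment would use the algorithms from the article above:

import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicInteger;

public class PerCallerRateLimiter {
    private final int permitsPerSecond;
    private final ConcurrentHashMap<String, AtomicInteger> counters = new ConcurrentHashMap<>();
    private volatile long windowStart = System.currentTimeMillis();

    public PerCallerRateLimiter(int permitsPerSecond) {
        this.permitsPerSecond = permitsPerSecond;
    }

    // Returns false when the caller has used up its quota for the current window,
    // so the request can be rejected instead of dragging the whole system down.
    public boolean tryAcquire(String caller) {
        long now = System.currentTimeMillis();
        if (now - windowStart >= 1000) {
            synchronized (this) {
                if (now - windowStart >= 1000) {   // start a new one-second window
                    counters.clear();
                    windowStart = now;
                }
            }
        }
        AtomicInteger count = counters.computeIfAbsent(caller, k -> new AtomicInteger());
        return count.incrementAndGet() <= permitsPerSecond;
    }
}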

3 Do your own part well

Doing your own part well is a very big topic; it could be discussed stage by stage, from requirements analysis, architecture design, and coding to testing, code review, release, and online operations. Here I simply share a few guiding principles for architecture design and coding.

3.1 The single responsibility principle

Engineers with more than two years of experience should study design patterns properly. To me, the individual patterns matter less than the principles behind them. The single responsibility principle, for example, is instructive at the requirements analysis, architecture design, and coding stages alike.

At the requirements analysis stage, the single responsibility principle helps define the boundary of our service. If the service boundary is not clearly drawn, every demand, reasonable or not, gets bolted on, and in the end the service becomes unmaintainable and unextensible, and the tragedy of recurring failures follows.

For architecture, single responsibility is just as important. Putting the read and write paths in one module makes the read service jitter badly; separating reads from writes greatly improves the stability of the read path (read/write separation). If one service exposes order, search, and recommendation interfaces, a problem in recommendation can affect ordering; splitting the interfaces into separate, independently deployed services keeps one problem from spreading to the others (resource isolation). Likewise, our image service uses its own domain and sits on a CDN, independent of other services (static/dynamic separation).

From the code perspective, a class should do one thing; if your class does more than one, consider splitting it. The benefit is obvious: later changes are easy and their impact on other code is small. At a finer granularity, a method should also do only one thing, that is, have a single function; if it does two, split it, because changing one function may otherwise break the other.

3.2 Controlling the use of resources

When writing code, keep one thing firmly in mind: our machine resources are finite. Which resources? CPU, memory, network, disk, and so on. If you do no protective work, the moment one resource hits full load you are likely to have a production problem.

3.2.1 How to limit CPU resources?

a) Computational algorithm optimization

If a service is computation-heavy, such as a recommendation ranking service, you must optimize its algorithms. For example, we optimized the heavily used geographic distance calculation with good results; see the "Geographical Space distance computing optimization" article.

b) Locks

Many services do not have such computation-heavy algorithms, yet their CPU usage is still very high; in that case, look at how locks are used. My advice: unless it is truly necessary, avoid explicit locks as far as possible.

c) Habitual coding problems

For example, when writing a loop, always check that it can exit correctly; a careless condition can turn it into an infinite loop, the well-known case being the infinite loop in HashMap under multi-threaded access. When traversing a collection, avoid poor-performing traversal patterns; and when concatenating several strings in a loop, use a StringBuilder/StringBuffer append rather than "+" (see the sketch below).
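
A small sketch of the string-concatenation habit; the join method and the delimiter are illustrative:

import java.util.List;

public class StringJoinExample {
    // Building a string in a loop with "+" creates a new String every iteration;
    // a single StringBuilder reuses one buffer instead.
    public String join(List<String> parts) {
        StringBuilder sb = new StringBuilder();
        for (String part : parts) {
            sb.append(part).append(',');
        }
        return sb.toString();
    }
}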

d) Use the thread pool as much as possible

Thread pools limit the number of threads, avoiding the thread-context-switch overhead caused by having too many threads. A bounded configuration is sketched below.
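
A minimal sketch of a bounded ThreadPoolExecutor; the pool sizes, queue length, and rejection policy are illustrative choices, not values from the original services:

import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;

public class WorkerPool {
    public static ThreadPoolExecutor create() {
        return new ThreadPoolExecutor(
                10,                                   // core threads
                50,                                   // maximum threads: caps context-switch overhead
                60, TimeUnit.SECONDS,                 // idle threads above the core size are reclaimed
                new ArrayBlockingQueue<>(1000),       // bounded queue: protects memory (see 3.2.2 d)
                new ThreadPoolExecutor.AbortPolicy()  // fail fast instead of queuing without limit
        );
    }
}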

e) JVM parameter tuning

JVM parameters also affect CPU usage; see, for example, the article on resolving the jitter problem when publishing or restarting online services.

3.2.2 How to limit memory resources?

a) JVM parameter settings

Use JVM parameter settings to limit memory usage. JVM tuning depends heavily on experience; a friend of mine wrote a good article worth consulting, "Linux and JVM Memory relationship analysis".

b) Initialize Java collection sizes

When using Java collection classes, initialize their size whenever possible, especially in memory-hungry services such as long-connection services (see the sketch below).
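
A small sketch of sizing a HashMap up front; the session map and the load-factor arithmetic are illustrative:

import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class CollectionSizing {
    public Map<Long, String> buildSessionMap(List<Long> userIds) {
        // HashMap resizes once it passes capacity * loadFactor (0.75 by default),
        // so reserve enough buckets for all entries up front to avoid repeated
        // rehashing and over-allocation.
        Map<Long, String> sessions = new HashMap<>((int) (userIds.size() / 0.75f) + 1);
        for (Long id : userIds) {
            sessions.put(id, "session-" + id);
        }
        return sessions;
    }
}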

c) Use a memory pool / object pool

d) Be sure to set the maximum queue length when using the thread pool

I have seen many failures caused by a queue with no maximum length eventually leading to memory overflow; a bounded queue, as in the thread-pool sketch above, avoids this.

e) Avoid local caches when the data volume is large

If the data volume is large, consider putting it into a distributed cache such as Redis or Tair; otherwise GC pauses may freeze your own service.

f) Compress cached data

For example, an earlier recommendation-related service needed to keep user preference data that would have taken about 12G if stored directly; compressing it brought it down to about 6G. But you must weigh the CPU cost and speed of compression and decompression against the compression ratio; some algorithms compress very well but are too slow for real-time online use.

Sometimes you can also serialize the data with protobuf before saving it, which likewise saves memory.

g) Understand third-party software internals and tune precisely

When using third-party software, you can only save memory if you understand its details. I have felt this deeply in practice: for example, while reading the Lucene source I found that our index files could be compressed further, something the documentation never mentioned; see the "Lucene index file size optimization" summary for details.

3.2.3 How to limit network resources?

a) Reduce the number of calls

I often see engineers calling Redis/Tair get inside a loop; if you are aware of the network overhead, you should batch those calls instead. Recommendation services also often need to fetch data from many places, usually in parallel across threads, which consumes both CPU and network resources. A method often used in practice is to assemble the bulk of the data offline in advance, so that the online service can fetch everything it needs with a single request. A batched-read sketch follows.
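
A minimal sketch of batching the reads, assuming the Jedis client for Redis; the key naming scheme is illustrative:

import java.util.List;
import redis.clients.jedis.Jedis;

public class BatchFetch {
    public List<String> fetchProfiles(Jedis jedis, List<Long> userIds) {
        String[] keys = userIds.stream()
                .map(id -> "profile:" + id)
                .toArray(String[]::new);
        // One MGET instead of userIds.size() GETs: same data, one network round trip.
        return jedis.mget(keys);
    }
}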

b) Reduce the amount of data transferred

One way is to compress data for transmission; another is to transmit on demand. Take the familiar getData(int id): if we always return every field of the record, the caller usually does not need most of them and the payload is unnecessarily large. Change it to getData(int id, List<String> fields): the caller names the fields it wants, and the server returns only those (see the sketch below).
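
A sketch of what such an on-demand interface could look like; Data, the field names, and the Map return type are assumptions for illustration:

import java.util.List;
import java.util.Map;

public interface DataApi {
    // Old style: always returns every field, however large the record is.
    Data getData(int id);

    // On-demand style: e.g. getData(42, List.of("title", "price")) returns just
    // those two fields, cutting the payload for callers that need little.
    Map<String, Object> getData(int id, List<String> fields);
}

class Data {}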

3.2.4 How to limit disk resources?

Control the volume of logs and clean them up regularly: 1) print only critical exception logs; 2) monitor and alert on log size. I once had a third-party service go down, and my side kept printing exception logs for the failed calls; my service actually had a degradation plan and would automatically switch to another provider, yet I suddenly got an alert saying my own service was down. Only after logging into the machine did I find that the disk had filled up and crashed the service. 3) clean logs periodically, for example with crontab every few days; 4) ship important logs to remote storage, for example printing them directly to HDFS.

3.3 Avoid single points of failure

Don't put all your eggs in one basket! At the macro level, a service can be deployed across multiple data centers with multi-site active-active; at the design level, the service itself should be able to scale horizontally.

For stateless services, horizontal scaling is easy to achieve with Nginx or ZooKeeper.

For job-type services that can only run on one node, see the "Quartz application and Cluster principle analysis" article for how to avoid a single point.

For data services, how do we avoid a single point? In short, through sharding, layering, and similar techniques; I will summarize this in a later post.

4 Summary

How to avoid failure? My experience condenses into one sentence: "distrust third parties, guard against callers, and do your own part well." I encourage you to reflect on, summarize, and share your own experience as well.

