Let it crash philosophy part II

Last Update: 2015-12-30 Source: Internet

Author: User

Designing fault tolerant systems is extremely difficult. You can try to anticipate and reason about all of the things that can go wrong with your software and code defensively for these situations, but in a complex system it's very likely that some combination of events or inputs will eventually conspire against you to cause a failure or bug in the system.

In certain areas of the software community, such as Erlang and Akka, there's a philosophy that rather than trying to handle and recover from all possible exceptional and failure states, you should instead simply fail early and let your processes crash, but then recycle them back into the pool to serve the next request. This gives the system a kind of self-healing property where it recovers from failure without ceremony, whilst freeing up the developer from overly defensive error handling.
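The idea of crashing and recycling a process can be sketched in a few lines of Python. This is only an analogy for what Erlang and Akka supervisors do, not their API: the `Supervisor` class, the `make_worker` factory and the `"bad"` request are all hypothetical names chosen for illustration. The worker contains no defensive code at all; the supervisor throws away a crashed worker and builds a fresh one for the next request.

```python
from typing import Callable, Optional

Worker = Callable[[str], str]


class Supervisor:
    """Hypothetical supervisor: replaces a worker when it crashes."""

    def __init__(self, make_worker: Callable[[], Worker]) -> None:
        self._make_worker = make_worker
        self._worker = make_worker()
        self.restarts = 0

    def handle(self, request: str) -> Optional[str]:
        try:
            return self._worker(request)
        except Exception:
            # Let it crash: discard the worker (and whatever state it held)
            # and recycle a fresh one to serve the next request.
            self._worker = self._make_worker()
            self.restarts += 1
            return None


def make_worker() -> Worker:
    """Build a worker with its own private state and no error handling."""
    state = {"handled": 0}

    def worker(request: str) -> str:
        if request == "bad":
            raise ValueError("unexpected input")
        state["handled"] += 1
        return f"ok:{request}:{state['handled']}"

    return worker


supervisor = Supervisor(make_worker)
results = [supervisor.handle(r) for r in ["a", "bad", "b"]]
```

The request that triggered the crash is simply lost here; whether to retry it, report it or drop it is a design decision that sits in the supervisor rather than in the worker.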

I believe that implementing let it crash semantics and working within this mindset will improve almost any application, not just the real-time telecoms systems where Erlang was born. By adopting let it crash, redundancy and defence against errors will be baked into the architecture rather than trying to defensively anticipate scenarios right down in the guts of the code. It will also encourage you to implement more redundancy throughout your system.

Also ask yourself, if the components or services in your application do crash, how well would your system recover, with or without human intervention? Very few applications will have a fully automatic recoverability property, and yet implementing this feels like relatively low-hanging fruit compared to writing 100% fault tolerant code.

So how do we start to put this into practice?

At the hardware level, you can obviously look towards the 'Google model' of commodity servers, whereby the failure of any given server supporting the system does not lead to a fatal degradation of service. This is easier in the cloud world, where the economics encourage us to use a larger number of small virtualised servers. Just let them crash and design for the fact that servers can die at a moment's notice.

Your application might be comprised of different logical services. Think of a user authentication service or a shopping cart system. Design the system to let entire services crash. Where appropriate, your application should be able to proceed and degrade gracefully whilst the service is not available, or to fall back onto another instance of the service whilst the first one is recycling. Nothing should be in the critical code path because it might crash!
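As a rough sketch of falling back to another instance and then degrading gracefully, the helper below tries each instance of a service in turn and returns a degraded default if all of them are down. The function `call_with_fallback` and the two cart instances are hypothetical stand-ins, not part of any real shopping cart system.

```python
from typing import Callable, List, Sequence

CartService = Callable[[str], List[str]]


def call_with_fallback(
    instances: Sequence[CartService], user_id: str, degraded: List[str]
) -> List[str]:
    """Try each instance in order; return the degraded value if all crash."""
    for instance in instances:
        try:
            return instance(user_id)
        except Exception:
            # This instance is down or recycling, so move on to the next one.
            continue
    # No instance answered: carry on with a degraded result instead of failing.
    return degraded


def crashed_cart(user_id: str) -> List[str]:
    """Hypothetical instance that is currently recycling."""
    raise ConnectionError("cart instance is recycling")


def healthy_cart(user_id: str) -> List[str]:
    """Hypothetical instance that is up."""
    return ["book"]


cart = call_with_fallback([crashed_cart, healthy_cart], "user-1", degraded=[])
empty = call_with_fallback([crashed_cart], "user-1", degraded=[])
```

In this sketch the degraded result is an empty cart; a real page would more likely hide the cart widget or show a "temporarily unavailable" notice.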

Ideally, your distributed system would be organised to scale horizontally across different server nodes. The system should load balance or intelligently route between processes in the pool, and different nodes should be able to join or leave the pool without too much ceremony or impact to the application. When you have this style of horizontal scalability, let nodes within your application crash and rejoin the pool when they're ready.
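A pool like this might be sketched as follows, assuming a simple round-robin router held in memory. `NodePool`, `join`, `leave` and `route` are hypothetical names; a real deployment would normally rely on a load balancer or service registry instead.

```python
from typing import Callable, Dict, List

Handler = Callable[[str], str]


class NodePool:
    """Hypothetical round-robin pool whose nodes can join, crash and rejoin."""

    def __init__(self) -> None:
        self._nodes: Dict[str, Handler] = {}
        self._next = 0

    def join(self, name: str, handler: Handler) -> None:
        self._nodes[name] = handler

    def leave(self, name: str) -> None:
        self._nodes.pop(name, None)

    def members(self) -> List[str]:
        return list(self._nodes)

    def route(self, request: str) -> str:
        # Try each current member at most once, starting at the cursor.
        names = list(self._nodes)
        for offset in range(len(names)):
            name = names[(self._next + offset) % len(names)]
            try:
                result = self._nodes[name](request)
            except Exception:
                # The node crashed: drop it from the pool and try the next.
                self.leave(name)
                continue
            self._next = (self._next + offset + 1) % len(names)
            return result
        raise RuntimeError("no healthy nodes in the pool")


def healthy(name: str) -> Handler:
    return lambda request: f"{name}:{request}"


def crashing(request: str) -> str:
    raise RuntimeError("node crashed")


pool = NodePool()
pool.join("a", healthy("a"))
pool.join("b", crashing)
pool.join("c", healthy("c"))
answers = [pool.route(r) for r in ["r1", "r2", "r3"]]
pool.join("b", healthy("b"))  # the node rejoins once it is ready again
```

The caller never sees the crash of node `b`; the request is quietly routed to the next member, and `b` rejoins the pool later with a single `join` call.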

What if we go further and apply let it crash semantics to our infrastructure?

For instance, say we have some messaging system or message broker that transports messages between the components of your application. What if we let this crash and come back online later? Could you design the application so that this is not as fatal as it sounds, perhaps by allowing the application to write to, or dynamically switch between, message brokers?
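One way to picture switching between brokers is a publisher that moves to the next broker when the active one refuses a message. The sketch below uses a hypothetical in-memory `InMemoryBroker` rather than any real broker client, and `FailoverPublisher` is likewise an illustrative name.

```python
from typing import List, Sequence


class InMemoryBroker:
    """Hypothetical stand-in for a message broker connection."""

    def __init__(self, name: str) -> None:
        self.name = name
        self.up = True
        self.messages: List[str] = []

    def send(self, message: str) -> None:
        if not self.up:
            raise ConnectionError(f"{self.name} is down")
        self.messages.append(message)


class FailoverPublisher:
    """Publishes to the active broker and switches when it crashes."""

    def __init__(self, brokers: Sequence[InMemoryBroker]) -> None:
        self._brokers = list(brokers)
        self._active = 0

    def publish(self, message: str) -> str:
        # Give every broker one chance, starting with the active one.
        for _ in range(len(self._brokers)):
            broker = self._brokers[self._active]
            try:
                broker.send(message)
                return broker.name
            except ConnectionError:
                # The active broker crashed: switch to the next one.
                self._active = (self._active + 1) % len(self._brokers)
        raise ConnectionError("all brokers are down")


primary = InMemoryBroker("primary")
backup = InMemoryBroker("backup")
publisher = FailoverPublisher([primary, backup])

first = publisher.publish("m1")   # delivered by the primary
primary.up = False                # the primary crashes
second = publisher.publish("m2")  # delivered by the backup
```

The sketch says nothing about messages already queued inside the crashed broker or about ordering across brokers; consumers would need to read from both, which is part of what makes this design question harder than it looks.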

Distributed NoSQL data stores give us let it crash capability at the data persistence level. Data will be stored in some distributed grid of nodes and replicated to at least 2 different hardware nodes. At this point, it's easier to let the database nodes crash than to try to achieve 100% uptime.
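To make the replication idea concrete, here is a toy in-memory store that writes each key to 2 nodes and reads from whichever replica is still reachable. `ReplicatedStore` and its methods are hypothetical and do not correspond to any particular NoSQL product; real stores add repair, consistency levels and rebalancing that are left out here.

```python
import zlib
from typing import Dict, List, Sequence, Set


class ReplicatedStore:
    """Toy key-value store that keeps each key on `replicas` nodes."""

    def __init__(self, node_names: Sequence[str], replicas: int = 2) -> None:
        self._order = list(node_names)
        self._data: Dict[str, Dict[str, str]] = {n: {} for n in self._order}
        self._down: Set[str] = set()
        self._replicas = replicas

    def owners(self, key: str) -> List[str]:
        # Pick consecutive nodes starting from a stable hash of the key.
        start = zlib.crc32(key.encode("utf-8")) % len(self._order)
        return [
            self._order[(start + i) % len(self._order)]
            for i in range(self._replicas)
        ]

    def crash(self, name: str) -> None:
        self._down.add(name)

    def recover(self, name: str) -> None:
        self._down.discard(name)

    def put(self, key: str, value: str) -> None:
        live = [n for n in self.owners(key) if n not in self._down]
        if not live:
            raise RuntimeError("no replica available for this key")
        for name in live:
            self._data[name][key] = value

    def get(self, key: str) -> str:
        # Any reachable replica holding the key can answer the read.
        for name in self.owners(key):
            if name not in self._down and key in self._data[name]:
                return self._data[name][key]
        raise KeyError(key)


store = ReplicatedStore(["n1", "n2", "n3"])
store.put("cart:42", "book")
store.crash(store.owners("cart:42")[0])  # one replica dies
value = store.get("cart:42")             # the other replica still answers
```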

At the network level, we can design topologies such that we don't care if routers or network links crash because there's always some alternate route through the network. Let them crash, and when they come back the optimal routes will be there ready for our application again in the future.
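The routing argument can be illustrated with a breadth-first search over a small made-up topology: when a link fails, the search finds a longer alternate route, and when the link comes back the shortest route is chosen again. The node names and the `find_route` helper are assumptions for the example; real networks do this with routing protocols rather than application code.

```python
from collections import deque
from typing import Dict, FrozenSet, List, Optional, Sequence, Tuple

Link = Tuple[str, str]


def find_route(
    links: Sequence[Link],
    source: str,
    target: str,
    failed: FrozenSet[Link] = frozenset(),
) -> Optional[List[str]]:
    """Return a shortest path by hop count that avoids failed links."""
    graph: Dict[str, List[str]] = {}
    for a, b in links:
        if (a, b) in failed or (b, a) in failed:
            continue  # a crashed link is simply left out of the graph
        graph.setdefault(a, []).append(b)
        graph.setdefault(b, []).append(a)

    queue = deque([[source]])
    seen = {source}
    while queue:
        path = queue.popleft()
        node = path[-1]
        if node == target:
            return path
        for neighbour in graph.get(node, []):
            if neighbour not in seen:
                seen.add(neighbour)
                queue.append(path + [neighbour])
    return None  # no route left at all


# Hypothetical topology: a short route via r1 and a longer one via r2 and r3.
links = [("app", "r1"), ("r1", "db"), ("app", "r2"), ("r2", "r3"), ("r3", "db")]

optimal = find_route(links, "app", "db")
detour = find_route(links, "app", "db", failed=frozenset({("r1", "db")}))
```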

Let it crash is more than simple redundancy. It's about implementing self recoverability of the application. It's about putting your site reliability efforts into your architecture rather than low level defensive coding. It's about decoupling your application and introducing asynchronicity in recognition that things go wrong in surprising ways. Ironically, sitting back and coolly letting your software crash can lead to better software!

This article is an English version of an article which is originally in the Chinese language on aliyun.com and is provided for information purposes only. This website makes no representation or warranty of any kind, either expressed or implied, as to the accuracy, completeness ownership or reliability of the article or any translations thereof. If you have any concerns or complaints relating to the article, please send an email, providing a detailed description of the concern or complaint, to info-contact@alibabacloud.com. A staff member will contact you within 5 working days. Once verified, infringing content will be removed immediately.
