Let it crash philosophy part II

Source: Internet
Author: User

Designing fault-tolerant systems is extremely difficult. You can try to anticipate and reason about all of the things that can go wrong with your software and code defensively for these situations, but in a complex system it's very likely that some combination of events or inputs will eventually conspire against you and cause a failure or bug in the system.

In certain areas of the software community, such as Erlang and Akka, there's a philosophy that, rather than trying to handle and recover from all possible exceptional and failure states, you should instead simply fail early and let your processes crash, but then recycle them back into the pool to serve the next request. This gives the system a kind of self-healing property where it recovers from failure without ceremony, whilst freeing up the developer from overly defensive error handling.
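A minimal sketch of the idea in Python, where an exception stands in for a process crash. The names `handle` and `supervise` are hypothetical and are not taken from Erlang or Akka; real supervisors restart separate processes or actors rather than catching exceptions in a loop.

```python
from typing import Callable, Iterable, List


def handle(request: str) -> str:
    """Hypothetical worker: no defensive checks, it simply raises on bad input."""
    return str(100 // int(request))


def supervise(worker: Callable[[str], str], requests: Iterable[str]) -> List[str]:
    """Call the worker once per request; a crash loses only that one request."""
    results = []
    for request in requests:
        try:
            results.append(worker(request))
        except Exception as exc:
            # The supervisor, not the worker, records the failure and moves on.
            results.append(f"crashed: {type(exc).__name__}")
    return results


results = supervise(handle, ["5", "0", "abc", "20"])
print(results)
```

The worker stays free of error handling, and the two bad requests do not stop the fourth one from being served.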

I believe that implementing let it crash semantics and working within this mindset would improve almost any application, not just the real-time telecoms systems where Erlang was born. By adopting let it crash, redundancy and defence against errors are baked into the architecture, rather than you trying to defensively anticipate scenarios right down in the guts of the code. It will also encourage you to implement more redundancy throughout your system.

Also ask yourself: if the components or services in your application do crash, how well would your system recover, with or without human intervention? Very few applications have a fully automatic recoverability property, and yet implementing this feels like relatively low-hanging fruit compared to writing 100% fault-tolerant code.

So how do we start to put this into practice?

At the hardware level, you can obviously look towards the 'Google model' of commodity servers, whereby the failure of any given server supporting the system does not lead to a fatal degradation of service. This is easier in the cloud world, where the economics encourage us to use a larger number of small virtualised servers. Just let them crash, and design for the fact that servers can die at a moment's notice.

Your application might be composed of different logical services. Think of a user authentication service or a shopping cart system. Design the system to let entire services crash. Where appropriate, your application should be able to proceed and degrade gracefully whilst a service is not available, or fall back onto another instance of the service whilst the first one is recycling. Nothing should be in the critical code path, because it might crash!
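As a rough illustration of falling back to another instance and degrading gracefully, here is a small Python sketch. The shopping cart instances are plain functions, and all names (`call_with_fallback`, `render_page`, `crashed_cart`, `healthy_cart`) are hypothetical; a real system would make network calls and would likely add timeouts.

```python
from typing import Callable, List, Optional, Sequence

CartInstance = Callable[[str], List[str]]


def call_with_fallback(instances: Sequence[CartInstance], user: str) -> Optional[List[str]]:
    """Try each instance in turn; return None if every one of them is down."""
    for instance in instances:
        try:
            return instance(user)
        except ConnectionError:
            continue  # this instance is recycling, try the next one
    return None


def crashed_cart(user: str) -> List[str]:
    raise ConnectionError("cart instance is recycling")


def healthy_cart(user: str) -> List[str]:
    return ["book"]


def render_page(user: str, cart_instances: Sequence[CartInstance]) -> str:
    """The page still renders when the cart service is completely unavailable."""
    cart = call_with_fallback(cart_instances, user)
    if cart is None:
        return f"Hello {user}. Your cart is temporarily unavailable."
    return f"Hello {user}. Cart: {', '.join(cart)}"


print(render_page("ann", [crashed_cart, healthy_cart]))
print(render_page("ann", [crashed_cart]))
```

The cart is kept out of the critical path: the worst case is a page without cart contents, not a failed page.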

Ideally, your distributed system would be organised to scale horizontally across different server nodes. The system should load balance or intelligently route between processes in the pool, and different nodes should be able to join or leave the pool without too much ceremony or impact on the application. When you have this style of horizontal scalability, let nodes within your application crash and rejoin the pool when they're ready.
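The pool behaviour described above might look something like the following Python sketch. It assumes simple round-robin routing and an in-memory list of node names; the `Pool` class and its methods are hypothetical, and a real load balancer would also need health checks to notice that a node has crashed.

```python
from typing import List


class Pool:
    """Round-robin router over whichever nodes are currently in the pool."""

    def __init__(self) -> None:
        self._nodes: List[str] = []
        self._next = 0

    def join(self, node: str) -> None:
        if node not in self._nodes:
            self._nodes.append(node)

    def leave(self, node: str) -> None:
        if node in self._nodes:
            self._nodes.remove(node)

    def route(self) -> str:
        """Pick the next live node; fail only if the pool is empty."""
        if not self._nodes:
            raise RuntimeError("no live nodes in the pool")
        node = self._nodes[self._next % len(self._nodes)]
        self._next += 1
        return node


pool = Pool()
for name in ("a", "b", "c"):
    pool.join(name)

pool.leave("b")                        # node b crashes
during_outage = [pool.route() for _ in range(4)]
pool.join("b")                         # node b has restarted and rejoins
after_rejoin = [pool.route() for _ in range(3)]
print(during_outage, after_rejoin)
```

Callers only ever ask the pool for a node, so a crash and a later rejoin need no special handling on their side.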

What if we go further and apply let it crash semantics to our infrastructure?

For instance, say we have some messaging system or message broker that transports messages between the components of your application. What if we let this crash and come back online later? Could you design the application so that this is not as fatal as it sounds, perhaps by allowing the application to write to, or dynamically switch between, multiple message brokers?
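One possible shape for this, sketched in Python with in-memory stand-ins for the brokers. The `FakeBroker` and `Publisher` classes are hypothetical and do not correspond to any real messaging library. The local outbox, which holds messages while every broker is down, is an extra assumption on top of the broker switching suggested above.

```python
from collections import deque
from typing import Deque, List, Sequence


class FakeBroker:
    """In-memory stand-in for a message broker that can be switched off."""

    def __init__(self, name: str) -> None:
        self.name = name
        self.up = True
        self.delivered: List[str] = []

    def send(self, message: str) -> None:
        if not self.up:
            raise ConnectionError(f"{self.name} is down")
        self.delivered.append(message)


class Publisher:
    """Sends via the first live broker; keeps messages locally if none is live."""

    def __init__(self, brokers: Sequence[FakeBroker]) -> None:
        self.brokers = list(brokers)
        self.outbox: Deque[str] = deque()

    def publish(self, message: str) -> None:
        self.outbox.append(message)
        self.flush()

    def flush(self) -> None:
        """Drain the outbox in order, stopping as soon as no broker accepts."""
        while self.outbox:
            if not self._try_send(self.outbox[0]):
                return
            self.outbox.popleft()

    def _try_send(self, message: str) -> bool:
        for broker in self.brokers:
            try:
                broker.send(message)
                return True
            except ConnectionError:
                continue
        return False


primary, backup = FakeBroker("primary"), FakeBroker("backup")
publisher = Publisher([primary, backup])

publisher.publish("m1")   # primary is up
primary.up = False
publisher.publish("m2")   # switches to backup
backup.up = False
publisher.publish("m3")   # nobody is up, message waits in the outbox
primary.up = True
publisher.publish("m4")   # primary is back, outbox drains in order
```

This sketch only preserves ordering per broker, and the outbox is lost if the publishing process itself crashes, so a real design would have to decide how much of that matters.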

Distributed NoSQL data stores give us let it crash capability at the data persistence level. Data is stored in a distributed grid of nodes and replicated to at least 2 different hardware nodes. At this point, it's easier to let the database nodes crash than to try to achieve 100% uptime.
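To make the replication idea concrete, here is a toy Python sketch in which every key is written to 2 nodes chosen by hashing. The `ReplicatedStore` class is hypothetical and is not modelled on any particular NoSQL product. It assumes the replica count is no larger than the number of nodes, and it leaves out the re-syncing a real store performs when a crashed node returns.

```python
import hashlib
from typing import Dict, List, Sequence, Set


class ReplicatedStore:
    """Toy key-value grid: each key lives on `replicas` consecutive nodes."""

    def __init__(self, node_names: Sequence[str], replicas: int = 2) -> None:
        self.nodes: Dict[str, Dict[str, str]] = {name: {} for name in node_names}
        self.down: Set[str] = set()   # names of nodes that are currently crashed
        self.replicas = replicas

    def owners(self, key: str) -> List[str]:
        """Pick the nodes responsible for a key from a stable hash of the key."""
        names = sorted(self.nodes)
        start = int(hashlib.sha256(key.encode()).hexdigest(), 16) % len(names)
        return [names[(start + i) % len(names)] for i in range(self.replicas)]

    def put(self, key: str, value: str) -> None:
        for name in self.owners(key):
            if name not in self.down:
                self.nodes[name][key] = value

    def get(self, key: str) -> str:
        """Read from the first owner that is up and holds the key."""
        for name in self.owners(key):
            if name not in self.down and key in self.nodes[name]:
                return self.nodes[name][key]
        raise KeyError(key)


store = ReplicatedStore(["n1", "n2", "n3"])
store.put("user:1", "ann")

first_owner = store.owners("user:1")[0]
store.down.add(first_owner)            # one replica crashes
value_after_crash = store.get("user:1")
print(value_after_crash)
```

With two copies of each key, the read succeeds from the surviving replica while the other node is down.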

At the network level, we can design topologies such that we don't care if routers or network links crash, because there's always some alternate route through the network. Let them crash, and when they come back the optimal routes will be there, ready for our application again in the future.
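A small Python sketch of the alternate-route idea, using a breadth-first search over a made-up topology. The node names (`app`, `r1`, `r2`, `r3`, `db`) and the `shortest_route` function are hypothetical; real networks delegate this to routing protocols rather than application code.

```python
from collections import deque
from typing import Dict, FrozenSet, List, Optional, Sequence, Tuple

Link = Tuple[str, str]


def shortest_route(
    links: Sequence[Link],
    source: str,
    target: str,
    failed: FrozenSet[Link] = frozenset(),
) -> Optional[List[str]]:
    """Breadth-first search for the fewest-hop route that avoids failed links."""
    graph: Dict[str, List[str]] = {}
    for a, b in links:
        if (a, b) in failed or (b, a) in failed:
            continue  # a crashed link is simply left out of the graph
        graph.setdefault(a, []).append(b)
        graph.setdefault(b, []).append(a)

    queue = deque([[source]])
    visited = {source}
    while queue:
        path = queue.popleft()
        if path[-1] == target:
            return path
        for neighbour in graph.get(path[-1], []):
            if neighbour not in visited:
                visited.add(neighbour)
                queue.append(path + [neighbour])
    return None  # no route left at all


links = [("app", "r1"), ("r1", "db"), ("app", "r2"), ("r2", "r3"), ("r3", "db")]

normal = shortest_route(links, "app", "db")
detour = shortest_route(links, "app", "db", failed=frozenset({("r1", "db")}))
print(normal, detour)
```

When the short link fails, traffic takes the longer path through `r2` and `r3`; once the link is restored, the same call returns the optimal route again.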

Let it crash is more than simple redundancy. It's about implementing self-recoverability in the application. It's about putting your site reliability efforts into your architecture rather than into low-level defensive coding. It's about decoupling your application and introducing asynchronicity, in recognition that things go wrong in surprising ways. Ironically, sitting back and coolly letting your software crash can lead to better software!
