Windows Azure service disruption on Feb 29th

Source: Internet
Author: User

Ten days after the incident, Microsoft published a summary of the Windows Azure outage; see http://blogs.msdn.com/b/windowsazure/archive/2012/03/10/summary-of-windows-azure-service-disruption-on-feb-29th-2012.aspx

 

The summary gives the approximate sequence of events:

1. The guest agent in an application VM could not generate its certificate because of a leap year bug.

2. When the host OS agent found that the VM's guest agent had not started successfully, it restarted the VM, retrying up to three times.

3. After three retries, the host OS agent concluded that the VM could not be started.

4. The host OS agent had received no meaningful error report from the guest agent, so it assumed the machine's hardware was faulty.

5. The host OS agent reported this to the fabric controller, which marked the host as unusable and in need of manual repair.

6. The fabric controller moved VM creation to other machines. The same sequence repeated on each of them, so before long all of those machines were considered to have hardware problems.

7. Once the fabric controller considered all machines faulty, the entire cluster, and even the data center, could no longer provide service.
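
The leap year bug in step 1 can be illustrated with a small Python sketch. Microsoft's summary attributes the failure to computing the certificate's valid-to date by adding one to the current year; the code below only reproduces that class of bug with Python's standard library and is not Azure's code. The function names naive_valid_to and safe_valid_to are hypothetical, and falling back to Feb 28 is just one possible policy.

```python
from datetime import date


def naive_valid_to(start: date) -> date:
    """Add one to the year and keep month and day; fails for Feb 29."""
    return start.replace(year=start.year + 1)


def safe_valid_to(start: date) -> date:
    """Same idea, but fall back to Feb 28 when the target year has no Feb 29."""
    try:
        return start.replace(year=start.year + 1)
    except ValueError:
        # Only Feb 29 can fail here, so clamping the day to 28 is enough.
        return start.replace(year=start.year + 1, day=28)


if __name__ == "__main__":
    leap_day = date(2012, 2, 29)
    try:
        naive_valid_to(leap_day)
    except ValueError as exc:
        print("naive calculation failed:", exc)
    print("safe calculation:", safe_valid_to(leap_day))
```

The naive version works on every other day of the year, which is why a bug like this can survive testing until the leap day itself.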

 

Given this chain of events, there are at least two points to consider at design time:

1. The condition for removing a machine. The host OS agent decided the hardware was faulty simply because the VM failed to initialize after three retries. That rule is too coarse: it ignores bugs in the VM itself and attacks by an application on the VM.

2. The alarm for mass removals. Within a short period, a large number of machines were judged to have hardware problems and taken out of service, yet the fabric controller did not raise an alarm quickly. According to the summary, the fabric controller triggers an alarm when the number of available machines has dropped by 70%. That threshold is clearly too high, and it takes no account of time. If we instead measured the rate at which machines are removed per unit of time, I think 5% of machines failing would be enough to raise an alarm.
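
The rate-based alarm suggested in point 2 could look roughly like the following Python sketch. It assumes a sliding time window and the 5% figure proposed above; the class name RemovalRateAlarm, the one-hour window, and the fleet size are all hypothetical choices for illustration, not anything the fabric controller is known to do.

```python
from collections import deque


class RemovalRateAlarm:
    """Fire when too large a fraction of the fleet is removed within a time window."""

    def __init__(self, fleet_size: int, window_seconds: float,
                 max_fraction: float = 0.05):
        self.fleet_size = fleet_size
        self.window_seconds = window_seconds
        self.max_fraction = max_fraction
        self._removals = deque()  # timestamps of recent removals, oldest first

    def record_removal(self, timestamp: float) -> bool:
        """Record one machine removal; return True if the alarm should fire."""
        self._removals.append(timestamp)
        # Drop removals that have fallen out of the sliding window.
        while timestamp - self._removals[0] > self.window_seconds:
            self._removals.popleft()
        return len(self._removals) / self.fleet_size >= self.max_fraction


if __name__ == "__main__":
    alarm = RemovalRateAlarm(fleet_size=100, window_seconds=3600)
    # Five removals within one hour out of 100 machines reaches the 5% limit.
    for t in (0, 600, 1200, 1800, 2400):
        fired = alarm.record_removal(t)
        print(f"t={t}s alarm={fired}")
```

Because old removals age out of the window, occasional genuine hardware failures spread over days would not trip the alarm, while a burst like the one on Feb 29th would.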

 

The events above look like a state machine: in each state, the system takes the corresponding action and moves to the next state. If such a state machine were drawn out clearly at design time, developers might realize that some states are more important than others and that some paths are more costly than others. In that case, the bug above might have been caught during design review or development.
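
One way to make such a state machine explicit is a transition table that also records how costly each path is. The Python sketch below is only an illustration: the state names, event names, the intermediate SUSPECT state, and the cost numbers are all assumptions of mine, not taken from Microsoft's summary.

```python
from enum import Enum, auto


class HostState(Enum):
    HEALTHY = auto()
    RETRYING = auto()
    SUSPECT = auto()              # hypothetical extra state: gather evidence first
    NEEDS_MANUAL_REPAIR = auto()


# (current state, event) -> (next state, cost); cost values are illustrative.
TRANSITIONS = {
    (HostState.HEALTHY, "vm_start_failed"): (HostState.RETRYING, 1),
    (HostState.RETRYING, "vm_start_failed"): (HostState.RETRYING, 1),
    (HostState.RETRYING, "vm_started"): (HostState.HEALTHY, 0),
    (HostState.RETRYING, "retries_exhausted"): (HostState.SUSPECT, 5),
    (HostState.SUSPECT, "software_fault_found"): (HostState.HEALTHY, 10),
    (HostState.SUSPECT, "hardware_fault_confirmed"):
        (HostState.NEEDS_MANUAL_REPAIR, 100),
}


def step(state: HostState, event: str):
    """Return (next_state, cost) for an event, or raise if it is undefined."""
    try:
        return TRANSITIONS[(state, event)]
    except KeyError:
        raise ValueError(f"no transition from {state.name} on {event!r}") from None


def costly_transitions(threshold: int = 50):
    """List the transitions whose cost says they deserve extra review."""
    return [(src.name, event, dst.name)
            for (src, event), (dst, cost) in TRANSITIONS.items()
            if cost >= threshold]


if __name__ == "__main__":
    print(step(HostState.HEALTHY, "vm_start_failed"))
    print(costly_transitions())
```

With the table written down, a reviewer can ask of every high-cost transition what evidence justifies it, which is exactly the question the three-retry rule never had to answer.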

This way of thinking applies to programming in general. A program's state transitions start from main. Can we clearly define the high-level state transitions and carefully identify which states matter most and which paths need the most care? This may be a good way to reduce serious bugs.
