Root cause analysis of Windows Azure service disruption

Source: Internet
Author: User

On February 22, 2013, at 12:29 Pacific Standard Time (PST), a service interruption began in all regions, affecting customers who used HTTPS to access Windows Azure Storage blobs, tables, and queues. Worldwide availability was restored on February 23, 2013, at 00:09 PST.

We apologize to the affected customers for this service interruption and are proactively issuing service credits to them, as outlined below.

Here we describe the components involved in the interruption, its root cause, the recovery process, the lessons we learned from the incident, and the work we are doing to improve the reliability of our service for customers.

Windows Azure Overview

Before going into the details of this service interruption, we first share some background on the internal Windows Azure components involved, to put the event in context.

Windows Azure runs many cloud services across data centers and geographic regions around the world. Windows Azure Storage runs as a cloud service on Windows Azure. Each geographic region contains multiple physical storage service deployments, which we call stamps. Each storage stamp has multiple racks of storage nodes.

The Windows Azure Fabric Controller is the resource provisioning and management layer that manages the hardware and provides resource allocation, deployment and upgrade, and management capabilities for cloud services on the Windows Azure platform.

Windows Azure uses an internal service called the Secret Store to securely manage the certificates required to run a service. This internal management service automatically stores, distributes, and updates platform and customer certificates in the system. It also automates the handling of certificates so that Microsoft employees do not have direct access to the secrets, which supports compliance and security requirements.

Root cause analysis of service interruption

Windows Azure Storage uses a separate Secure Sockets Layer (SSL) certificate to secure customer data traffic for each of its main storage types: blobs, tables, and queues. These certificates allow traffic to be encrypted over HTTPS for all subdomains that represent customer accounts, such as myaccount.blob.core.windows.net. Both internal and external services rely on these certificates to encrypt traffic to and from the storage system. The certificates originate in the Secret Store, are stored locally on each Windows Azure Storage node, and are deployed by Fabric Controller. The blob, table, and queue certificates are the same across all regions and stamps.
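To illustrate how a single wildcard certificate can cover every customer account subdomain, here is a minimal Python sketch of single-label wildcard matching. The function name `wildcard_matches` is hypothetical, and the rule is a simplification of real TLS hostname verification; it is not the code Windows Azure Storage uses.

```python
def wildcard_matches(pattern: str, hostname: str) -> bool:
    """Return True if hostname is covered by a certificate name such as
    '*.blob.core.windows.net'.

    Simplified rule: '*' may stand in for exactly one leftmost DNS label.
    """
    p_labels = pattern.lower().split(".")
    h_labels = hostname.lower().split(".")
    if len(p_labels) != len(h_labels):
        return False
    if p_labels[0] == "*":
        # The wildcard needs a non-empty label; all other labels must be identical.
        return bool(h_labels[0]) and p_labels[1:] == h_labels[1:]
    return p_labels == h_labels
```

Under this rule, myaccount.blob.core.windows.net is covered by *.blob.core.windows.net but not by the table or queue certificate, which is why each storage type needs its own certificate.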

The expiration time for the certificate used last week is as follows:

*.blob.core.windows.net: Friday, February 22, 2013, 12:29:53 PM PST
*.queue.core.windows.net: Friday, February 22, 2013, 12:31:22 PM PST
*.table.core.windows.net: Friday, February 22, 2013, 12:32:52 PM PST

Once the certificates expired, they were no longer valid, and HTTPS connections to the storage servers were rejected. HTTP transactions continued to work normally throughout.

Although the certificate expiration is what directly affected customers, the root cause was a breakdown in the process of maintaining and monitoring these certificates. In addition, because the certificates were identical in every region and expired within minutes of each other, they were a single point of failure for the storage system.

Why the storage certificates were not updated

For background, as part of its normal operation the Secret Store scans the certificates it manages every week. Starting 180 days before a certificate expires, it sends alerts to the team that manages the service. From that point on, the Secret Store keeps notifying the team that owns the certificate. Once notified, the team renews the certificate, includes the renewed certificate in a new build of the service that is scheduled for deployment, and updates the certificate in the Secret Store database. This process runs routinely, hundreds of times a month, across the many services on Windows Azure.
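The weekly scan described above can be sketched roughly as follows. This is an illustration only: the inventory format (a dict of certificate name to expiration time) and the function name `certificates_needing_alert` are assumptions, and only the 180-day lead time comes from the text.

```python
from datetime import datetime, timedelta

ALERT_WINDOW = timedelta(days=180)  # alert lead time stated in the text


def certificates_needing_alert(certs, now):
    """Return the names of certificates that expire within ALERT_WINDOW of `now`.

    `certs` maps a certificate name to its expiration datetime
    (a hypothetical inventory format). Already-expired certificates
    are included as well.
    """
    return sorted(name for name, expires in certs.items()
                  if expires - now <= ALERT_WINDOW)
```

For example, scanning on December 1, 2012 an inventory that holds the blob certificate (expiring February 22, 2013) and a made-up certificate expiring in mid-2014 would flag only the blob certificate.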

In this case, the Secret Store notified the Windows Azure Storage team that the certificates would expire on the dates given above. On January 7, 2013, the storage team updated the three certificates in the Secret Store and included them in a future release of the service. However, the team failed to flag that release as one containing certificate updates.

Subsequently, the release of the storage service that contained the time-critical certificate updates was delayed behind updates flagged as higher priority and was not deployed before the certificates expired. In addition, because the certificates had already been updated in the Secret Store, the team received no further alerts, which is a gap in our alerting system.
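The alerting gap can be made concrete with a small sketch. It assumes, hypothetically, that the monitor can see both the expiration of the copy held in the store and the expiration of the copy actually installed on the nodes. The function name, parameters, and alert texts are all illustrative and do not describe the real alerting system.

```python
from datetime import datetime, timedelta


def expiry_alerts(stored_expiry, deployed_expiry, now, window=timedelta(days=180)):
    """Return alert messages for one certificate.

    stored_expiry:   expiration of the copy held in the store
    deployed_expiry: expiration of the copy installed on the nodes
    """
    alerts = []
    if stored_expiry - now <= window:
        alerts.append("renew certificate in store")
    # Checking the deployed copy separately keeps an alert active
    # until the renewed certificate has actually been rolled out.
    if deployed_expiry - now <= window and deployed_expiry < stored_expiry:
        alerts.append("renewed certificate not yet deployed")
    return alerts
```

A monitor that looked only at the stored copy would go quiet as soon as the store was updated, which matches the situation described above; the second check is one possible way to close that gap.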

Restoring the storage service

The incident was detected at 12:44 PST through normal monitoring and was diagnosed as being caused by the expired certificates. By 13:15 PST, the engineering team had triaged the problem and set up multiple workstreams to find the fastest path to restoring service.

During normal operation, Fabric Controller drives each node to its desired state, also known as the target state.

The service definition of a service describes the desired state of the deployment, which lets Fabric Controller determine the target state of each node (server) that is part of the deployment. A service definition contains the role instances and their endpoints, the configuration and fault/update domains, and references to other artifacts such as code, virtual hard disk (VHD) names, and certificate thumbprints.
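The idea of driving nodes to a target state can be sketched as a simple reconciliation step. This is a toy model, not Fabric Controller's implementation: the `Node` class, its single `cert_thumbprint` field, and the `reconcile` function are hypothetical and reduce the target state to one certificate reference.

```python
from dataclasses import dataclass


@dataclass
class Node:
    name: str
    cert_thumbprint: str


def reconcile(nodes, target_thumbprint):
    """Set every node to the target thumbprint; return the names of changed nodes."""
    changed = []
    for node in nodes:
        if node.cert_thumbprint != target_thumbprint:
            node.cert_thumbprint = target_thumbprint
            changed.append(node.name)
    return changed
```

In this model, a certificate changed by hand on one node is overwritten on the next reconciliation unless the target state itself is updated, which is the kind of behavior any out-of-band fix has to account for.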

During normal operation, a service updates its build to include the new certificate and then has Fabric Controller deploy it by walking through the update domains and systematically deploying the service to all nodes. The process is designed to update the software in a way that gives external customers a seamless experience and meets the published service-level agreements (SLAs). Although some of this work can be done in parallel, deploying an update to a global service takes several hours in total.
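A rollout that proceeds one update domain at a time might look like the following sketch. The function name and the `deploy` and `is_healthy` hooks are hypothetical placeholders supplied by the caller; the real deployment logic is not described in the text.

```python
def rollout_by_update_domain(nodes_by_domain, deploy, is_healthy):
    """Deploy one update domain at a time and stop at the first unhealthy domain.

    nodes_by_domain: dict mapping an update-domain index to a list of node names
    deploy, is_healthy: caller-supplied callables that take a node name
    Returns the list of update domains that were fully updated.
    """
    completed = []
    for domain in sorted(nodes_by_domain):
        for node in nodes_by_domain[domain]:
            deploy(node)
        if not all(is_healthy(node) for node in nodes_by_domain[domain]):
            break  # halt so the remaining domains stay on the old build
        completed.append(domain)
    return completed
```

Updating only a slice of the nodes at a time keeps most of the service available during a deployment, at the cost of a longer total rollout, which is consistent with the several hours mentioned above.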

Throughout the HTTPS outage, the Windows Azure Storage service itself kept running and continued to serve customers who accessed their data over HTTP, and some customers quickly mitigated their HTTPS problems by temporarily switching to HTTP. We were very careful while restoring service for other users so that customers using HTTP would not be affected.

After analyzing several options for restoring HTTPS service, we chose two approaches: 1) updating the certificate on each storage node, and 2) performing a full update of the storage service. The first approach was optimized to restore customer service as quickly as possible.

1) Update the certificate

The development team worked through the manual steps needed to update the certificate in order to validate the remediation approach and restore service. This was complicated by Fabric Controller trying to return each node to its target state. By 18:22 PST, the team had put together a process that successfully updated the certificate and had tested it. Drawing on lessons from previous outages, we took the time to test and validate the fix thoroughly to avoid complications or secondary outages that could affect other services. Several problems were identified and corrected during testing, before the fix was validated for production deployment.

After verifying the automated update process, we applied it to the storage nodes in the western United States data center at 19:20 PST and successfully restored service in that region at 20:50 PST. We then rolled it out to all storage nodes worldwide. This rollout was completed at 22:45 PST and restored HTTPS service for most customers. Additional monitoring and validation were completed by 00:09 PST on February 23, 2013, and the Azure dashboard was marked green.
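The intervals between the milestones above can be checked with a short calculation. The timestamps are taken from this account (all PST); the event labels and the `elapsed` helper are just illustrative names.

```python
from datetime import datetime

# Timestamps (PST) taken from the incident description above.
timeline = {
    "first certificate expired": datetime(2013, 2, 22, 12, 29),
    "incident detected": datetime(2013, 2, 22, 12, 44),
    "fix rollout started (western United States)": datetime(2013, 2, 22, 19, 20),
    "HTTPS restored for most customers": datetime(2013, 2, 22, 22, 45),
    "dashboard marked green": datetime(2013, 2, 23, 0, 9),
}


def elapsed(start_event, end_event):
    """Return the time between two named events as a timedelta."""
    return timeline[end_event] - timeline[start_event]
```

With these values, detection came 15 minutes after the first expiration, and the dashboard was marked green 11 hours 40 minutes after the interruption began.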
