Learn about common Azure disasters

Source: Internet
Author: User

 

The following sections cover several types of disaster scenarios. Datacenter failures are not the only cause of application-wide failures. Poor design or administration errors can also lead to outages. It is important to consider the possible causes of a failure during both the design and testing phases of your recovery plan. A good plan takes advantage of Azure features and augments them with application-specific strategies. The response you choose is dictated by the importance of the application, its recovery point objective (RPO), and its recovery time objective (RTO).

Application Failure

As mentioned earlier, the Azure fabric controller automatically handles failures caused by the underlying hardware or operating system software in the host virtual machine. Azure creates a new role instance on a functioning server and adds it to the load balancer rotation. If more than one role instance is configured, Azure shifts processing to the other running role instances while it replaces the failed node.

However, serious application errors can also occur that are unrelated to any hardware or operating system failure. An application may fail because of a catastrophic exception caused by a logic error or a data integrity problem. You must include sufficient telemetry in your code so that a monitoring system can detect failure conditions and notify an application administrator. An administrator who fully understands the disaster recovery process can then decide whether to invoke a failover or to accept an availability outage while the critical errors are resolved.
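The telemetry requirement above can be illustrated with a minimal sketch. It assumes a sliding window of recent request outcomes and a caller-supplied `notify` callback; the class name `FailureMonitor`, the window size, and the threshold are all hypothetical choices for illustration, not part of any Azure API.

```python
from collections import deque
from typing import Callable, Deque


class FailureMonitor:
    """Tracks recent request outcomes and alerts once when failures pile up."""

    def __init__(self, notify: Callable[[str], None],
                 window: int = 20, threshold: float = 0.5) -> None:
        self.notify = notify          # hypothetical hook, e.g. page the administrator
        self.window = window          # number of recent requests to consider
        self.threshold = threshold    # failure rate that triggers an alert
        self.outcomes: Deque[bool] = deque(maxlen=window)
        self.alerted = False

    def record(self, success: bool) -> None:
        """Record one request outcome and alert if the failure rate is too high."""
        self.outcomes.append(success)
        if len(self.outcomes) < self.window:
            return  # not enough data yet to judge
        failure_rate = self.outcomes.count(False) / len(self.outcomes)
        if failure_rate >= self.threshold:
            if not self.alerted:      # avoid sending the same alert repeatedly
                self.alerted = True
                self.notify(f"failure rate {failure_rate:.0%} "
                            f"over last {len(self.outcomes)} requests")
        else:
            self.alerted = False      # recovered; allow future alerts
```

The monitor only reports the condition. Whether to fail over or accept the outage remains the administrator's decision, as described above.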

Data Corruption

Azure automatically stores your Azure SQL Database and Azure Storage data three times, redundantly, in different fault domains within the same datacenter. If you use geo-replication, the data is stored an additional three times in a different datacenter. However, if a user or the application corrupts the data in the primary copy, the corruption quickly replicates to the other copies. Unfortunately, the result is three copies of corrupted data.

To cope with possible data corruption, you may need to manage your own backups in a way that maintains transactional consistency. You can store the backups in Azure or on-premises, depending on your business needs or governance requirements. For more information, see the Data Policies for Disaster Recovery section.
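As a rough illustration of why self-managed backups matter here, the sketch below picks a restore point after corruption has been detected. It assumes you keep a list of timestamps for transactionally consistent backups and know (or can estimate) when the corruption began; the function name `latest_clean_backup` is hypothetical.

```python
from datetime import datetime
from typing import Optional, Sequence


def latest_clean_backup(backup_times: Sequence[datetime],
                        corrupted_at: datetime) -> Optional[datetime]:
    """Return the most recent backup taken strictly before the corruption.

    Backups taken at or after `corrupted_at` are assumed to contain the
    corrupted data, just like the replicated copies do.
    """
    candidates = [t for t in backup_times if t < corrupted_at]
    return max(candidates) if candidates else None
```

The gap between the chosen backup and the corruption time is the data you lose on restore, so the backup interval should be set with the application's RPO in mind.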

Network Outage

When parts of the Azure network are down, you may not be able to reach your application or its data. If one or more role instances are unavailable because of network problems, Azure uses the remaining available instances of your application. If your application cannot access its data because of an Azure network outage, you may be able to run locally in degraded mode by using cached data. To do so, you must build a strategy for running in degraded mode into your application's disaster recovery design. For some applications this is not practical. Another option is to store data in an alternate location until connectivity is restored. If running in degraded mode is not an option, the remaining choices are to accept application downtime or to fail over to a standby datacenter. Designing an application to run in degraded mode is as much a business decision as a technical one. The Application Feature Demotion section discusses this issue in more depth.
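One possible shape for degraded-mode operation is sketched below. It assumes the remote data store is reached through two caller-supplied functions that raise `ConnectionError` during an outage; the class `DegradedModeStore` and its methods are hypothetical names, and the local write queue stands in for the "alternate location" mentioned above.

```python
from typing import Any, Callable, Dict, List, Tuple


class DegradedModeStore:
    """Serves cached reads and queues writes while the remote store is unreachable."""

    def __init__(self, remote_read: Callable[[str], Any],
                 remote_write: Callable[[str, Any], None]) -> None:
        self._remote_read = remote_read
        self._remote_write = remote_write
        self._cache: Dict[str, Any] = {}
        self._pending: List[Tuple[str, Any]] = []  # writes awaiting connectivity

    def read(self, key: str) -> Tuple[Any, bool]:
        """Return (value, is_stale). Falls back to the cache on network errors."""
        try:
            value = self._remote_read(key)
        except ConnectionError:
            if key in self._cache:
                return self._cache[key], True   # degraded: possibly stale data
            raise                               # nothing cached; surface the outage
        self._cache[key] = value
        return value, False

    def write(self, key: str, value: Any) -> bool:
        """Return True if written remotely, False if queued for later."""
        self._cache[key] = value
        if not self._pending:                   # keep ordering: never skip the queue
            try:
                self._remote_write(key, value)
                return True
            except ConnectionError:
                pass
        self._pending.append((key, value))
        return False

    def flush(self) -> int:
        """Replay queued writes in order; stop at the first failure."""
        flushed = 0
        while self._pending:
            key, value = self._pending[0]
            try:
                self._remote_write(key, value)
            except ConnectionError:
                break
            self._pending.pop(0)
            flushed += 1
        return flushed
```

Returning the `is_stale` flag lets the application decide, per feature, whether stale data is acceptable. That choice is the business decision described above.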

Dependent Service Failure

Many of the services that Azure offers can experience periodic downtime. Consider Azure Shared Caching as an example. This multi-tenant service provides caching capabilities to your application. It is important to consider what happens in your application if a service it depends on is unavailable. In many ways this scenario is similar to the network outage scenario, but considering each service individually can improve your overall plan.

For example, with Caching there is a relatively new alternative to the multi-tenant Shared Caching model. Azure Caching on roles provides caching to your application from within your cloud service deployment. (This is also the recommended way to use Caching going forward.) Although it has the limitation that it can only be accessed from within a single deployment, it offers potential disaster recovery benefits. First, the service now runs on roles that are local to your deployment, so you can better monitor and manage the state of the cache as part of the overall management of your cloud service. This type of caching also exposes new features, one of which is high availability for cached data. This feature helps preserve cached data when a single node fails by maintaining duplicate copies on other nodes. Note that high availability reduces throughput and increases latency, because the secondary copy must be updated on every write. It also doubles the amount of memory used for each item, so plan for that. This specific example shows that each dependent service may have capabilities that improve your overall availability and your resistance to catastrophic failures.

For each dependent service, you should understand the implications of a total outage. In the Caching example, you might be able to access the data directly from the database until the Caching service is restored. This would be a degraded mode in terms of performance, but it would provide full functionality in terms of data.
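The cache-outage fallback described above can be sketched as follows. The functions `cache_get` and `db_get` are hypothetical stand-ins for your cache client and database access layer, and the sketch assumes the cache client raises `ConnectionError` when the caching service is unavailable.

```python
from typing import Any, Callable, Optional, Tuple


def get_with_fallback(key: str,
                      cache_get: Callable[[str], Optional[Any]],
                      db_get: Callable[[str], Any]) -> Tuple[Any, str]:
    """Return (value, source), where source is "cache" or "database".

    A cache miss (None) and a cache outage (ConnectionError) are both
    handled by reading from the database, which is slower but complete.
    """
    try:
        value = cache_get(key)
        if value is not None:
            return value, "cache"
    except ConnectionError:
        pass  # caching service is down; continue in degraded mode
    return db_get(key), "database"
```

Reporting the source makes it easy to count database fallbacks in your telemetry, which is one way to notice that the caching service is down and the application is running with degraded performance.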

Datacenter Failure

The failures described so far can mostly be handled within a single Azure datacenter. However, you must also prepare for the possibility that an entire datacenter fails. When a datacenter fails, the locally redundant copies of your data are not available. If geo-replication is enabled, there are three additional copies of your blobs and tables in a datacenter in a different location. If Microsoft declares the datacenter lost, Azure remaps all DNS entries to the geo-replicated datacenter. Note that you have no control over this process, and that it occurs only for datacenter-wide failures. Therefore, you must also rely on other application-specific backup strategies to achieve the highest level of availability. For more information, see the Data Policies for Disaster Recovery section.

Azure-Wide Failure

In disaster planning, you must consider the entire range of possible disasters. One of the most severe failures would involve all Azure datacenters at the same time. As with any other failure, you may decide to accept the risk of temporary downtime in this case. Widespread failures that span multiple datacenters should be much rarer than isolated failures involving dependent services or individual datacenters. However, for some mission-critical applications, you may decide that you need a backup plan for this scenario as well. A plan for this event could include failing over to services in an alternate cloud or to a hybrid on-premises and cloud solution.
