While browsing the program for an upcoming cloud computing conference, I noticed several sessions dedicated to negotiating contracts and SLAs with cloud vendors. Reading the session descriptions, you would conclude that careful SLA development is the foundation of successful cloud computing.
The sessions promise to walk participants through topics such as:
· Defining uptime, availability, and performance
· Negotiation skills for developing SLAs
· What factors to include in the SLA: virtual machine availability, response time, network latency, and so on
· Negotiating penalties for SLA breaches
After reading description after description of these SLA topics, one conclusion is inescapable: SLAs have nothing to do with increasing availability. The purpose of an SLA is to provide a basis for legal argument after an outage.
Yet none of the sessions acknowledges this point. The descriptions imply that shrewd SLA negotiation can somehow make an application immune to outages. It cannot.
The reality is that every infrastructure suffers outages of one kind or another. Careful evaluation of cloud vendors' capabilities may help you choose a more robust provider, and paying more will probably buy you faster response or direct access to a dedicated response team, but none of these measures makes you immune to outages. No matter how much time you spend crafting an SLA, you cannot guarantee 100% uptime.
So why are people so obsessed with SLAs?
First, an SLA gives people a sense of control. Sitting in a room, insisting on special treatment, striking clauses and substituting your own language makes people feel as if they are asserting their power. And it's a great feeling. But don't imagine you are fundamentally changing the contract the cloud provider offers. I learned this at an SLA meeting with a lawyer. After spending 90 minutes fine-tuning a contract, he concluded: "Of course, you can't change the standard contract very much, because the contract exists to limit the cloud provider's liability. What you're really negotiating is how much service credit you'll get."
The other reason people obsess over SLAs is that the SLA provides the basis for argument after an outage. Being shrewd beforehand may mean getting more compensation later. But keep in mind that no matter how well you argue, you will never recover the full business loss an outage causes.
To repeat: SLA compensation is based on the cost of the service, not the cost of the service interruption, and the service fee typically represents a tiny fraction of the loss a disruption causes.
A former employee of one of the largest outsourcing companies shared an example with me: a very large retail client's website went down on Black Friday. The application was out for six hours, causing as much as $50 million in lost revenue. How much compensation do you think the outsourcer owed the retailer? Six hours of service credit, roughly $300.
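The arithmetic of that anecdote is worth spelling out. The implied hourly service rate below is back-calculated from the figures in the story, not taken from any real contract:

```python
# Figures from the anecdote; the hourly rate is inferred from them.
outage_hours = 6
lost_revenue = 50_000_000      # $50M in lost Black Friday sales
service_credit = 300           # six hours of service credit

implied_hourly_rate = service_credit / outage_hours    # $50 per hour of service fees
recovered_fraction = service_credit / lost_revenue     # share of the loss recovered

print(f"Implied hourly service rate: ${implied_hourly_rate:.2f}")
print(f"Fraction of the loss recovered: {recovered_fraction:.6%}")
```

The credit covers six ten-thousandths of one percent of the loss, which is the whole point: the SLA pays back service fees, not business value.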
What's the moral of the story? It puts SLA discussions in the right perspective: if your application goes down, you will not recover anywhere near your full loss.
The worst consequence of pouring effort into SLA wrangling is that it distracts you from the issue that actually matters: how to ensure uptime. If you're on the Titanic and it hits an iceberg and starts sinking, spending time rearranging the deck chairs doesn't solve anything.
The questions that matter are these: how will your application handle a service outage, and what options do you have to improve uptime?
A good starting point is to keep Voltaire's observation in mind: "The best is the enemy of the good." Roughly, perfection is the enemy of progress. Applied to cloud computing, the advice reads: don't reject a cloud vendor whose data center delivers perfectly acceptable uptime just because it cannot guarantee 99.999% availability.
If moving to cloud computing significantly improves your uptime, it is the right choice. If you have no accurate statistics on the availability of your own computing environment, that alone is a clear signal that migrating to a cloud provider is a step in the right direction. The cloud vendor may not be perfect, but it is far better than not even being able to track uptime. And believe me, there are far too many IT organizations that cannot even measure their applications' uptime.
You can take the following steps to improve the uptime of your application:
1. Design the application architecture to tolerate resource failure. Perhaps the most effective step you can take to improve uptime is to design the application so that it keeps running when a single resource, such as a server, fails. Even if a server outage takes down a virtual machine, application-server redundancy keeps the application running. Likewise, a replicated database server means the application does not stall when one database node goes down. Using an application-management framework to launch a new instance in place of a failed one ensures the redundant topology is maintained through an outage.
2. Design the topology for infrastructure failure. Clever application design can protect availability when a hardware element fails, but it is no help if the entire hosting environment fails. If the data center hosting the application suffers an outage, redundancy within that data center is futile. The solution is to distribute the application across zones, so that even if part of it becomes unavailable during a large-scale outage at the cloud vendor, the application keeps functioning. This makes the design more complex, of course, but it provides much stronger protection against downtime.
3. Design the deployment to survive cloud vendor failure. A cloud vendor's entire infrastructure can also go down. However small the likelihood, it is possible: the vendor's whole network infrastructure could fail, or the vendor could abruptly shut down. That may sound far-fetched, but it has happened to online services before. The solution is to spread the application architecture across multiple cloud vendors. This is admittedly challenging: the semantics of different vendors' services vary, which makes it hard to design an application that spans them. But with sufficient planning and careful design, it can be done.
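As a concrete illustration of step 1, here is a minimal sketch of a supervisor loop that replaces failed instances to preserve a redundant topology. The `Instance` class and `launch_replacement` helper are hypothetical stand-ins for a real cloud API, not part of any vendor's SDK:

```python
class Instance:
    """Hypothetical stand-in for a cloud compute instance."""
    def __init__(self, name):
        self.name = name
        self.healthy = True

def launch_replacement(pool, failed):
    """Start a new instance to take over for a failed one."""
    replacement = Instance(failed.name + "-replacement")
    pool.remove(failed)
    pool.append(replacement)
    return replacement

def reconcile(pool):
    """One supervisor pass: evict unhealthy instances, keep the count up."""
    for instance in list(pool):
        if not instance.healthy:
            launch_replacement(pool, instance)
    return pool

# Two redundant app servers; one fails, and the loop restores redundancy.
pool = [Instance("app-1"), Instance("app-2")]
pool[0].healthy = False
reconcile(pool)
print([i.name for i in pool])   # the failed server has been replaced
```

Real application-management frameworks (auto-scaling groups, orchestrators) run exactly this kind of reconciliation continuously; the point of the sketch is only the shape of the loop.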
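Steps 2 and 3 can be sketched as a failover routing policy that prefers the primary vendor's zones and falls back to a second vendor only when none are reachable. The vendor and zone names below are invented for illustration:

```python
# Hypothetical multi-zone, multi-vendor deployment map.
DEPLOYMENTS = {
    "vendor-a": ["vendor-a/zone-1", "vendor-a/zone-2"],   # step 2: multi-zone
    "vendor-b": ["vendor-b/zone-1"],                      # step 3: second vendor
}

def pick_endpoint(deployments, is_up):
    """Return the first live endpoint, preferring earlier vendors and zones."""
    for vendor, endpoints in deployments.items():
        for endpoint in endpoints:
            if is_up(endpoint):
                return endpoint
    raise RuntimeError("total outage: no endpoint available")

# Simulate vendor-a's first zone being down: traffic shifts to its second zone.
down = {"vendor-a/zone-1"}
endpoint = pick_endpoint(DEPLOYMENTS, lambda e: e not in down)
print(endpoint)   # falls through to vendor-a/zone-2
```

The hard part in practice is not this routing logic but keeping state replicated across zones and vendors so that a fallback endpoint can actually serve traffic.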
It should be clear from this discussion that higher levels of uptime require greater technical complexity, and greater technical complexity means greater investment.
Deciding what level of investment a given application warrants is a risk assessment. It should be an explicit risk-assessment exercise, weighing business value against the cost of the investment and the operational complexity it brings. The work isn't easy, and there is probably no simple answer. But it is far more likely to produce an acceptable outcome than a protracted yet futile debate over SLA contract language.