The 10 worst cloud outages and their lessons

Guide: To help businesses avoid failures in cloud services, Network World has compiled the 10 most severe cloud service outages to date and the lessons we can learn from them.

Severe cloud outage 1: Amazon Web Services goes down. Eliminating tedious network maintenance work is a main selling point of doing business in the cloud. The downside is that you are helpless when a routine configuration change by your cloud vendor disrupts your business.

This is what many Amazon Web Services users experienced in April of this year, when Amazon's Northern Virginia data center failed and became completely unavailable.

The failure occurred during a network upgrade. A traffic shift was routed incorrectly, and a large number of Amazon EBS (Elastic Block Store) volumes, each searching for available storage on which to re-mirror itself as a backup, set off a "re-mirroring storm." This was abnormal behavior, and it triggered a chain of events that ultimately disrupted many of Amazon's services in the eastern United States.

The outage lasted about four days. But while many companies were in trouble, others, such as Netflix, stayed trouble-free. What was the key to survival? Accounting for this type of failure when designing the system.

"Our architecture avoids using EBS as our primary data storage service," Netflix engineers wrote in a blog post titled "Lessons Netflix Learned from the AWS Outage." "The SimpleDB, S3 and Cassandra services we relied on were not affected by the outage." Stateless services and multiple redundant hot copies of data spread across availability zones were key to riding out the failure in Amazon Web Services.

Think you have to be a Netflix-scale enterprise to stay safe? Think again. Twilio, a company that helps developers integrate communications into their web applications, uses Amazon's EC2 service to host its core infrastructure. Even so, the April outage had little impact on its stability.

Twilio co-founder and chief technology officer Evan Cooke said the premise of building in the cloud is to assume that the network will fail. The company built its infrastructure around the idea that a host can and will fail, he said, and therefore it does not rely on any single machine or component of the core architecture.
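The two ideas above, hot copies in several availability zones and no reliance on any single machine, can be sketched in a few lines of Python. This is only an illustration, not Netflix's or Twilio's implementation: the `Zone` class, the zone names, and the `write_to_all` and `read_with_failover` helpers are hypothetical names invented for the example, and each zone is modeled as an in-memory dictionary.

```python
class ZoneUnavailable(Exception):
    """Raised when a (simulated) availability zone cannot be reached."""


class Zone:
    """Hypothetical stand-in for a data store replica in one availability zone."""

    def __init__(self, name):
        self.name = name
        self.data = {}
        self.up = True

    def put(self, key, value):
        if not self.up:
            raise ZoneUnavailable(self.name)
        self.data[key] = value

    def get(self, key):
        if not self.up:
            raise ZoneUnavailable(self.name)
        return self.data[key]


def write_to_all(zones, key, value):
    """Write a hot copy to every reachable zone; return how many copies were stored."""
    stored = 0
    for zone in zones:
        try:
            zone.put(key, value)
            stored += 1
        except ZoneUnavailable:
            continue
    return stored


def read_with_failover(zones, key):
    """Return the value from the first reachable zone that holds the key."""
    for zone in zones:
        try:
            return zone.get(key)
        except (ZoneUnavailable, KeyError):
            continue
    raise ZoneUnavailable("no zone could serve the key")


zones = [Zone("zone-a"), Zone("zone-b"), Zone("zone-c")]
copies = write_to_all(zones, "profile:42", "queue=3 titles")
zones[0].up = False  # simulate the loss of one zone
value = read_with_failover(zones, "profile:42")
```

With one zone marked down, the read is still served from the next zone in the list. A real system would also have to handle replicas that missed a write while they were unreachable, which this sketch leaves out.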

Severe cloud outage 2: Sidekick shutdown. Smartphones make it easy to access your data on the go. However, having the word "smart" in the name does not make something immune to stupidity. Case in point: the T-Mobile Sidekick outage in the fall of 2009.

Remember that fiasco? The Microsoft-owned Sidekick service suffered nearly a week of disruption, leaving users unable to access e-mail, calendar information, and other personal data. Microsoft later admitted that it had completely lost the data stored in the cloud and might not be able to recover it. Apparently, Microsoft's people had forgotten to make backups.

The technology may have advanced since then, but the lesson is the same: when it comes to important data, never assume that others will automatically protect it for you. Make sure you understand your cloud provider's disaster recovery arrangements. Better still, plan to back up your important data independently.
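As a rough illustration of an independent backup, the Python sketch below copies an exported file to storage you control and verifies the copy by checksum. It assumes you can export your data from the provider as ordinary files; the `backup_file` and `sha256_of` helpers, the file name, and the directory names are hypothetical, and temporary directories stand in for real storage.

```python
import hashlib
import shutil
import tempfile
from pathlib import Path


def sha256_of(path):
    """Return the SHA-256 hex digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()


def backup_file(source, backup_dir):
    """Copy source into backup_dir and verify the copy by checksum."""
    backup_dir = Path(backup_dir)
    backup_dir.mkdir(parents=True, exist_ok=True)
    target = backup_dir / Path(source).name
    shutil.copy2(source, target)
    if sha256_of(source) != sha256_of(target):
        raise IOError(f"backup of {source} failed verification")
    return target


# Temporary directories stand in for "data exported from the provider"
# and "storage you control".
with tempfile.TemporaryDirectory() as workdir:
    exported = Path(workdir) / "contacts.csv"
    exported.write_text("name,phone\nAda,555-0100\n")
    saved = backup_file(exported, Path(workdir) / "independent-backup")
    backup_ok = saved.exists() and saved.read_text() == exported.read_text()
```

The checksum comparison is the important part: a backup that has never been verified is only an assumption, which is exactly what the Sidekick incident warns against.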

The same operating rules apply even in the cloud, says Ken Godskind, a vice president at AlertSite who oversees its products. An organization that uses the cloud cannot simply assume that, because it is in the cloud, full responsibility for business continuity planning has been handed over to the provider.

Severe cloud outage 3: Gmail failure. Of all the cloud services, Google's Gmail is one of the biggest threats to Microsoft's stronghold of in-house enterprise mail. Replace your expensive-to-maintain Exchange server with an inexpensive hosted e-mail service backed by Postini. What could go wrong?

Plenty of annoying outages, for one. The most recent outage left 150,000 Gmail users looking at a blank page after logging into their accounts: no mail, no folders, and nothing to show that they were actually looking at their inbox. To its credit, Google provided regular updates and promised to fix the failure quickly. For some affected users, however, the fix took four days.

"If you have multiple copies of your data, how could this happen?" Ben Treynor, Google's vice president of engineering, wrote in a blog post. "In rare cases, software bugs can affect several copies of the data. That's what happened here."

Google eventually had to restore the data from physical tape backups. In the end, Google's multi-tier data protection did work, but it left thousands of users unable to access their e-mail for several days.

Is the failure a reason not to use the cloud? Probably not. However, it is a reason to verify your own data protection and to consider setting up a backup or offline access solution before an urgent need arises.

Looked at as a broad average, the cloud's success rate is much higher than what you would achieve on your own, says Ken Godskind of AlertSite. It is just that at web scale, the impact of a failure is greatly magnified.

Severe cloud outage 4: Hotmail in a mess. Of course, Microsoft has not exactly been providing the best advertisement for its own cloud services either. Microsoft Hotmail suffered a database error at the end of 2010 that left tens of thousands of inboxes empty at the new year.

Microsoft said the failure resulted from a scripting error. The script was meant to delete dummy accounts created for automated testing, but it mistakenly deleted 17,000 real accounts.
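The article does not describe Microsoft's script, so the following Python sketch shows only generic safeguards that a cleanup script of this kind could apply: an explicit test flag plus a naming convention, a cap on deletions per run, and a dry-run mode. The `TEST_PREFIX` convention, the `MAX_DELETIONS` cap, and the account records are all assumptions made up for the example.

```python
TEST_PREFIX = "autotest-"   # assumed naming convention for test accounts
MAX_DELETIONS = 100         # assumed safety cap per run


def select_test_accounts(accounts):
    """Return only accounts flagged as test AND matching the test prefix."""
    return [
        a for a in accounts
        if a["is_test"] and a["name"].startswith(TEST_PREFIX)
    ]


def delete_test_accounts(accounts, dry_run=True):
    """Delete test accounts, refusing to act if the selection looks too large.

    Returns (remaining_accounts, names_selected_for_deletion).
    In dry-run mode nothing is removed; the selection is only reported.
    """
    doomed = select_test_accounts(accounts)
    if len(doomed) > MAX_DELETIONS:
        raise RuntimeError(
            f"refusing to delete {len(doomed)} accounts (cap is {MAX_DELETIONS})"
        )
    doomed_names = sorted(a["name"] for a in doomed)
    if dry_run:
        return accounts, doomed_names
    survivors = [a for a in accounts if a["name"] not in set(doomed_names)]
    return survivors, doomed_names


accounts = [
    {"name": "autotest-001", "is_test": True},
    {"name": "autotest-fan", "is_test": False},  # real user with an unlucky name
    {"name": "alice", "is_test": False},
]
remaining, deleted = delete_test_accounts(accounts, dry_run=False)
```

Here only `autotest-001` is removed; the real account whose name happens to match the prefix survives because it is not flagged as a test account. Defaulting to `dry_run=True` means an operator must opt in to destructive behavior.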

Microsoft took three days to restore most of the users' accounts. About 8% of the unlucky users had to wait another three days to recover their data.

Severe cloud outage 5: Intuit's two outages. Intuit suffered serious failures last year. Its cloud-based services, including the popular TurboTax, Quicken and QuickBooks platforms, went down twice in one month. The worst was a 36-hour outage last June. A power failure apparently forced the primary equipment onto standby power, and the company's primary and backup systems both went completely offline.

Worse, another significant power failure occurred a few weeks later. Unsurprisingly, the second outage drew angry complaints from users.

One user wrote in a microblog post that the 25-hour outage was hard to swallow and that Intuit's passive, opaque, and unacceptable communication was not helping.

"The fact is, if you need absolute availability, there are better solutions than the cloud," said Chris Whitener, chief strategist of the HP Security Advantage program. You don't have to back up everything, he added, but taking an extra step (perhaps keeping your own copy of only your most important data) can produce a completely different outcome.

Severe cloud outage 6: Microsoft BPOS (Business Productivity Online Suite) failure. When your cloud-based office suite fails, it is hard to work efficiently. That is what happened a few weeks ago to companies that rely on Microsoft's business cloud service. Around May 10, Microsoft's BPOS service began to suffer intermittent failures. Some users' e-mails were delayed for nine hours before they were delivered.

Two days later, just as BPOS seemed to have cleared the fault, the delays returned and outgoing messages were held up. As if that were not enough, Microsoft then suffered another failure that prevented users from logging on to the web-based Outlook portal.

"I want to apologize to you, our customers and partners, for the inconvenience caused by this malfunction," a vice president of Microsoft Online Services wrote in a blog post.

Severe cloud outage 7: Salesforce service outage. A one-hour outage may not sound serious. However, if your company holds the keys to the customer service operations of tens of thousands of businesses, many of those organizations will certainly feel those 60 minutes as a lifetime.

Salesforce.com learned this lesson when its data center went down last January. Just four days into the new year, Salesforce.com reported a total failure: its services, backups included, were all down.

Annoying? Absolutely. Unexpected? Not entirely.

The reality is that cloud-based data centers go down too, said Tim Crawford, chief information officer of All Covered, a subsidiary of Konica Minolta. That has always been the case and always will be, he said, and we must be realistic about it.

Crawford says successful cloud computing requires a different mindset than traditional servers do. You have to decide for yourself whether your business data can tolerate an occasional outage. If it cannot, make sure your configuration is flexible enough to ride out a loss of connectivity.

When you choose a cloud provider, you need to do your homework: understand how it delivers its services and whether it can build a level of redundancy better than what you could build yourself. If the answer is no, then why use that provider?

Severe cloud outage 8: Cloud provider Terremark's terrible day. The recent $1 billion deal between Terremark and Verizon may have made big headlines. In early 2010, however, the headlines were about a Terremark outage.

On St. Patrick's Day, March 17, 2010, Terremark's luck turned bad. The company's vCloud Express service went down that day, with an outage of about seven hours at its Miami data center. During that time, users could not access data stored in the data center.

The incident drives home the value of redundancy: keeping your important data available on multiple servers in different data centers, or better still, on multiple servers in different regions. As a fail-safe, you can also take the additional step of spreading data across different providers.

Harold Moss, chief technology officer of IBM's cloud security strategy program, says you can choose a range of vendors: one vendor to host a workload, one or two vendors responsible for backup, and one vendor selected as your primary provider. You can then run your workload in a safe and secure environment and begin to build in your resilience.
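A back-of-the-envelope calculation shows why spreading a workload across data centers or providers pays off. The Python sketch below assumes that the copies fail independently, which is optimistic (a shared software bug or network change can take down several at once), and the 99.9% availability figure is purely illustrative, not a number from any provider named here. Under that assumption, the service is down only when every copy is down, so the combined availability is 1 minus the product of the individual unavailabilities.

```python
HOURS_PER_YEAR = 365 * 24


def combined_availability(availabilities):
    """Availability of a service that works if at least one copy is up.

    Assumes independent failures: the service is down only when every
    copy is down at the same time.
    """
    all_down = 1.0
    for a in availabilities:
        if not 0.0 <= a <= 1.0:
            raise ValueError("availability must be between 0 and 1")
        all_down *= 1.0 - a
    return 1.0 - all_down


def expected_downtime_hours(availability):
    """Expected hours of downtime per year at the given availability."""
    return (1.0 - availability) * HOURS_PER_YEAR


single = combined_availability([0.999])        # one provider at 99.9%
dual = combined_availability([0.999, 0.999])   # two independent providers

single_downtime = expected_downtime_hours(single)
dual_downtime = expected_downtime_hours(dual)
```

With these illustrative numbers, one provider implies roughly 8.8 hours of downtime a year, while two independent providers bring the expected figure down to about half a minute. Real-world gains are smaller whenever failures are correlated.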

Severe cloud outage 9: PayPal failure. Want a cloud outage with wide-ranging, serious effects? Try taking PayPal offline for a few hours and see.

This is not a hypothetical exercise: PayPal's outage in the summer of 2009 was real, and it left millions of merchants around the world unable to sell goods. The service was completely unavailable for about an hour and was intermittent for several hours afterward. PayPal said a hardware failure caused the incident.

There is no doubt that outages of this kind are rare. Still, this unfortunate outage easily earns PayPal a place in the cloud computing hall of shame.

Severe cloud outage 10: Rackspace's rough year. When you provide cloud services to websites such as TechCrunch and Justin Timberlake's, you had better believe that people will notice when your servers stop working.

Rackspace learned several lessons in 2009. The cloud provider suffered four high-profile outages over the course of the year, taking its customers offline for several hours. Rackspace had to pay users nearly $3 million in service credits.

Rackspace described the incidents as "painful and very disappointing" and promised to deliver a high level of service for a long time to come. Today the company continues to focus on uptime, but it also helps users prepare plans for dealing with the inevitable disruptions in cloud services.

"If you want to build a server cluster or create geographic redundancy, it is easier than ever," says Lew Moorman of Rackspace. But you have to take these steps, he adds. If you were already doing this in your business before, the cloud does not introduce any new weaknesses.

Given all these failures, the biggest lesson is that no single server, data center, or service is absolutely reliable. If you do not build your business with that in mind, then, my friend, you are not living in the real world.

Original link: Networkworld

(Responsible editor: Liu Fen)
