Are data center downtime and outages unavoidable?
In recent weeks, we have heard many reports about the impact of data center outages on well-known American enterprises: the Wall Street Journal, the New York Stock Exchange, and United Airlines were all affected to varying degrees within a single week. Although it is impossible to prevent every outage, highly publicized incidents like these can cost a great deal of money and significantly influence how customers view a company, and thus its image and reputation. To this end, we interviewed industry experts and asked them a series of questions: What should companies do to maintain a high level of uptime? What are the common causes of outages? What does the average customer expect in terms of data center security and resilience? Or is an occasional outage simply a fact of life in normal enterprise operations?
Brian Kirsch, Milwaukee Area Technical College
Balancing availability against everything else is one of the cornerstones of IT. We all want our systems to be up and running whenever we need them. The problem arises when you have to weigh the availability of a system against what it takes to achieve that availability. The issue is not only cost, but also complexity and testing. The idea that a single hardware or software product can deliver availability no longer holds. Although the backup and disaster recovery products we use today have become more capable and more effective, the applications they protect have also become more complex.
This constant tension between applications and their availability leads to widespread outages when an enterprise's disaster recovery products cannot keep up with the requirements and design of its applications.
However, hardware and software account for only a small part of outages. Many outages are caused by system faults and changes. We design systems to prevent failures, and our security systems prevent unauthorized changes. Yet all of this up-front effort cannot prevent every outage. We still need to actively look for new ways to deal with disaster recovery and downtime. Let us design our systems around the idea that companies will inevitably experience outages, rather than simply trying to block them. Then we can provide real application resilience in the face of faults and operational failures, because failure protection is no longer just a superficial exercise, and we can test and prove our ability to handle failures.
Nowhere is this more evident than at Netflix and its Chaos Monkey engineering team. Netflix faced a massive reboot of Amazon EC2 instances and needed to keep its online services running. For many companies, an EC2 reboot on that scale was something they thought they would never see, and few had specific plans to prevent an outage. Netflix and its uniquely named Chaos Monkey engineering team, on the other hand, had a plan. At Netflix, the role of Chaos Monkey is to test and exercise failures regularly and repeatedly. Through continuous testing and correction, Netflix has created a service design built specifically to handle failures before they cause a large-scale outage.
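The practice of deliberately exercising failures can be illustrated with a small simulation. The sketch below is not Netflix's Chaos Monkey; it is a minimal, hypothetical Python model in which `Service`, `kill_random_instance`, and `run_chaos_experiment` are names invented for this illustration. It terminates one random instance per round and records whether the service could still answer requests from the survivors.

```python
import random


class Service:
    """A toy service backed by a pool of interchangeable instances (hypothetical model)."""

    def __init__(self, instance_names, min_healthy=1):
        self.healthy = set(instance_names)
        self.min_healthy = min_healthy

    def kill_random_instance(self, rng):
        """Terminate one randomly chosen healthy instance and return its name."""
        victim = rng.choice(sorted(self.healthy))
        self.healthy.remove(victim)
        return victim

    def replace_instance(self, name):
        """Simulate automated recovery bringing an instance back."""
        self.healthy.add(name)

    def is_available(self):
        """The service answers requests while enough instances remain healthy."""
        return len(self.healthy) >= self.min_healthy


def run_chaos_experiment(service, rounds, rng, auto_recover=True):
    """Kill one instance per round; return the rounds in which the service was down."""
    outages = []
    for round_no in range(1, rounds + 1):
        if not service.healthy:
            # Nothing left to kill: the service is still down.
            outages.append(round_no)
            continue
        victim = service.kill_random_instance(rng)
        if not service.is_available():
            outages.append(round_no)
        if auto_recover:
            service.replace_instance(victim)
    return outages


if __name__ == "__main__":
    rng = random.Random(42)  # fixed seed so the experiment is repeatable
    redundant = Service(["a", "b", "c"])
    fragile = Service(["only"])
    print("redundant outages:", run_chaos_experiment(redundant, 10, rng))
    print("fragile outages:", run_chaos_experiment(fragile, 10, rng))
```

In this toy model the redundant service survives every round, while the single-instance service is down in every round. That is the kind of weakness such an exercise is meant to expose before a real failure does.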
Dave Sauber, LogicNow
For companies and organizations such as the New York Stock Exchange, the Wall Street Journal, and United Airlines, any form of outage is almost shameful. The losses caused by an outage can be extremely high. Given how inexpensive computing resources have become, advance planning can minimize the chance of an outage. Enterprises with critical needs can now easily create backup systems in the cloud and use them only in emergencies. For example, Microsoft's Windows Azure charges only for active computing loads, which means that an entire backup network can wait in cold standby until it is needed. You can also keep a minimum level of usage running as a hot backup to make sure you are ready for failover. Monitoring and management software should always be in place, with more advanced predictive analytics used to anticipate possible outages.
However, communication is the most important part of mitigating the impact of an outage. What is most frustrating for a passenger stranded by the United Airlines outage is the lack of transparent and useful information. Companies should proactively acknowledge the problem and then deliver on their commitments. Silence on social media and a lack of information among employees are among the worst contributors to a poor customer service experience.
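The standby idea can be sketched as a simple routing decision. This is an illustrative model only: `FailoverRouter`, the site names, and the failure threshold are assumptions made for the example and are not part of Azure or any other cloud provider's API. The router sends traffic to the primary site until a configurable number of consecutive health checks fail, then switches to the standby, and fails back once the primary recovers.

```python
class FailoverRouter:
    """Route traffic to a primary site, switching to a standby after repeated failures.

    Hypothetical sketch: the names and threshold are illustrative assumptions.
    """

    def __init__(self, primary, standby, failure_threshold=3):
        self.primary = primary
        self.standby = standby
        self.failure_threshold = failure_threshold
        self.consecutive_failures = 0
        self.active = primary

    def record_health_check(self, primary_ok):
        """Feed in one health-check result for the primary; return the active site."""
        if primary_ok:
            self.consecutive_failures = 0
            self.active = self.primary  # fail back once the primary recovers
        else:
            self.consecutive_failures += 1
            if self.consecutive_failures >= self.failure_threshold:
                self.active = self.standby  # bring the standby into service
        return self.active


if __name__ == "__main__":
    router = FailoverRouter("primary-dc", "cloud-standby", failure_threshold=2)
    for ok in [True, False, False, True]:
        print(router.record_health_check(ok))
```

Requiring several consecutive failures before switching is a design choice that avoids flipping to the standby on a single missed check; the right threshold depends on how quickly the standby can actually take load.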
Jim O'Reilly, Volanto
How can systems that are built to be safe fail to run? It sounds like a contradiction, but for United Airlines, the New York Stock Exchange, and other companies, that is exactly what happened recently. What went wrong with their IT infrastructure?
Increasing complexity must be part of the cause. Enterprises typically run old systems that have been patched and extended many times, which leads to hardware and software vulnerabilities. At United Airlines, the failure of one router turned out to be a single point of failure in an otherwise highly redundant system.
Of course, networking problems are not exclusive to United Airlines. Cloud computing giant Amazon Web Services (AWS) also lost several zones when router software was updated incorrectly. Such failures are often caused by poor operating procedures, a lack of checks and balances, or poor installation.
As with AWS, the outage that threw the New York Stock Exchange into chaos was caused by a bad software update, in this case to the "matching engines" that pair up trading orders.
Although hardware or software gets the blame, the real culprit in all of these cases is human error. In highly evolved systems, faults are to be expected, and administrators must make changes across different platforms and applications. Poor network topology, untested updates, and misapplied updates can all be avoided. The question now is how to avoid them without causing other problems.
Automation is the answer to the update problem. Anyone who uses Windows is familiar with how its upgrades work. Sometimes an upgrade runs automatically in the background, and sometimes the user has to answer a few questions, but most of the work, including any reconfiguration for the new code, is handled by the software.
By contrast, in the traditional command-line world, system administrators often have to type an astonishing amount of input to carry out an update. Scripting is considered the state of the art, yet scripts always need to be adjusted before they work properly.
With such high levels of manual interaction, failures are bound to occur frequently. The events at AWS and the New York Stock Exchange are typical results. United Airlines had a different problem, one clearly caused by a single point of failure. Preventing such failures is not rocket science. A manual review of the routing structure alone should have been able to identify that one router could paralyze the system. Frankly speaking, though, manual checks are not easy when the topology and underlying platforms of an application suite are constantly changing.
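A review of this kind can be automated. One standard way to find a device whose loss splits a network is to look for the articulation points (cut vertices) of the topology graph. The sketch below is a minimal Python illustration that uses a brute-force check, which is adequate for small topologies; a linear-time algorithm would be preferable for large networks. The device names and the example topology are hypothetical and do not describe any real airline network.

```python
from collections import deque


def build_graph(links):
    """Build an undirected adjacency map from (device, device) link pairs."""
    graph = {}
    for a, b in links:
        graph.setdefault(a, set()).add(b)
        graph.setdefault(b, set()).add(a)
    return graph


def is_connected(graph, removed):
    """Return True if all devices other than `removed` can still reach each other."""
    nodes = [n for n in graph if n != removed]
    if len(nodes) <= 1:
        return True
    seen = {nodes[0]}
    queue = deque([nodes[0]])
    while queue:
        current = queue.popleft()
        for neighbor in graph[current]:
            if neighbor != removed and neighbor not in seen:
                seen.add(neighbor)
                queue.append(neighbor)
    return len(seen) == len(nodes)


def single_points_of_failure(links):
    """List devices whose failure alone would split the network.

    Assumes the topology is connected to begin with.
    """
    graph = build_graph(links)
    return sorted(n for n in graph if not is_connected(graph, n))


if __name__ == "__main__":
    # Hypothetical topology: redundant core and distribution routers,
    # but a single edge router in front of the check-in kiosks.
    links = [
        ("core-1", "core-2"),
        ("core-1", "dist-1"), ("core-2", "dist-1"),
        ("core-1", "dist-2"), ("core-2", "dist-2"),
        ("dist-1", "edge-1"), ("dist-2", "edge-1"),
        ("edge-1", "kiosks"),
    ]
    print(single_points_of_failure(links))  # ['edge-1']
```

Rerunning such a check whenever the topology changes addresses the difficulty noted above: the structure keeps moving, so a one-time manual review goes stale.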
This is where software that detects problems in a system can prove its value. Enterprises tend to address configuration problems with continuity software. Big data analytics may make this approach more sophisticated.
Even so, poor application design, especially in legacy systems with little resilience, remains a problem and will continue to plague us. The answer to this problem is "sandbox testing" and "more rigorous testing".
Can a data center with zero downtime exist? We are still a long way from that ideal, but we can do better.