Emergency Shutdown Technology in Data Centers
A data center must run continuously throughout the year to provide application services to the outside world. It houses a large number of electronic devices, and, like people, those devices cannot work forever without a break; if they never rest, problems will appear sooner or later. Emergency shutdown is one way for a data center to protect itself. The modifier "emergency" changes the meaning of "shutdown" considerably: an emergency shutdown is not as easy as pressing the power button or pulling the plug. Because the data center works around the clock, shutting down its devices means considering how the operation will affect the services of the entire data center, whether preparations were made beforehand, whether a detailed shutdown plan has been developed, and whether a rollback mechanism exists. The rest of this article describes emergency shutdown in the data center in detail.
An emergency shutdown is a plan, drawn up in advance, for shutting down some of the running devices. As one of the emergency response processes, it is an issue that every data center must consider, and there are many special cases in which it is needed. One case is an external threat. When an exposed defect leaves the data center seriously threatened from outside, or when an earthquake or fire strikes the region where the data center is located, and data in the data center is at risk of being damaged, an emergency shutdown should be triggered as a last resort, and some or even all external services should be closed temporarily to protect the data center. Another case is device bugs. A data center contains a great many electronic devices, and all of them have problems to a greater or lesser degree. No software in the world is free of bugs; if the devices you use seem fine, it only means that their bugs have not been encountered yet. Once the data center does hit such a bug, the device software usually has to be patched, and if a patch cannot fix the problem, the software version has to be upgraded. Many devices cannot upgrade their software without restarting, so an emergency shutdown and restart of the device is required. A third case is long uptime. Servers, storage, and other devices in the data center run for long periods, and a device that is not restarted for a long time can accumulate a large amount of garbage in memory. Proactively restarting these devices can improve their running efficiency and can keep some bugs from being exposed, which avoids device exceptions caused by those bugs and their impact on data center services. Emergency shutdown is therefore an indispensable part of data center operation. It is a proactive protective measure that every data center will have to use at some point.
An emergency shutdown achieves the expected result only if three stages of work are done properly. In an emergency, a well-prepared shutdown plan can often protect the data center from serious loss. First, the shutdown procedure should be worked out before the shutdown. In a data center there are dependencies between applications, between applications and devices, and between devices, so the shutdown must be executed in a fixed order to avoid the shutdown itself damaging the data center. For example, before shutting down network devices, you should switch over or shut down important applications such as database services, storage services, and payment systems, and disable the external access portal, compute nodes, and management nodes, so that taking the network down does not cause disorder or data loss in systems that are still providing services. Only after these steps are complete should the network devices be disabled. The general procedure is to disable the application layer services first, then the underlying data transmission devices, and finally the physical links. The higher a service sits in the stack, the earlier it is shut down. The operation steps should be fixed before the emergency shutdown and then executed in sequence. At the same time, it is necessary to estimate the time each step will consume, so that the time spent in each stage is known and every stage of the emergency shutdown stays under control. If the actual time deviates from the estimate, the corresponding rollback or workaround plan needs to be started. Because the shutdown is an emergency one, it may be sudden or improvised, and execution exceptions that differ from the expected results are inevitable. In that case, you need to respond flexibly based on the actual situation.
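The fixed shutdown order and the per-step time estimates described above can be derived from a dependency map. The following is a minimal sketch in Python, not a definitive implementation. The component names, the dependency map, and the time estimates are all hypothetical and chosen only for illustration. It uses the standard library's graphlib.TopologicalSorter (Python 3.9 or later) to compute a startup order and reverses it, so that every component is shut down before anything it depends on.

```python
from graphlib import TopologicalSorter

# Hypothetical dependency map: each key depends on the components in its set.
DEPENDS_ON = {
    "external_portal": {"payment_system"},
    "payment_system": {"database"},
    "database": {"storage", "core_switch"},
    "storage": {"core_switch"},
    "core_switch": {"physical_link"},
    "physical_link": set(),
}

# Hypothetical time estimates, in seconds, for shutting down each component.
ESTIMATED_SECONDS = {
    "external_portal": 30,
    "payment_system": 120,
    "database": 300,
    "storage": 180,
    "core_switch": 60,
    "physical_link": 10,
}


def shutdown_order(depends_on):
    """Return components so that each one precedes everything it depends on."""
    # static_order() yields dependencies first, which is a valid startup order.
    startup = list(TopologicalSorter(depends_on).static_order())
    # Reversing it shuts down dependents before their dependencies.
    return startup[::-1]


def planned_schedule(order, estimates):
    """Return (component, start_offset, end_offset) for a sequential shutdown."""
    schedule = []
    elapsed = 0
    for name in order:
        end = elapsed + estimates[name]
        schedule.append((name, elapsed, end))
        elapsed = end
    return schedule


if __name__ == "__main__":
    for name, start, end in planned_schedule(
        shutdown_order(DEPENDS_ON), ESTIMATED_SECONDS
    ):
        print(f"{start:>4}s to {end:>4}s  shut down {name}")
```

With this map, application-layer services come first, then storage and the switch, and the physical link last. The cumulative offsets give a baseline against which the actual duration of each step can be compared during the shutdown.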
Spare parts should be prepared before an emergency shutdown, and key equipment should be backed up with its configuration prepared in advance, so that if an exception occurs the faulty unit can be replaced with a spare directly. When some loss cannot be avoided, every decision should give priority to the key data. This is where the judgment of the data center staff is tested. The shutdown duration is another important factor. In many cases, after the shutdown steps are complete, you need to watch the external conditions of the data center closely to decide when to start up again. Sometimes the systems are started again soon after the shutdown, so the duration should be evaluated and decided according to the nature of the emergency. Second, during the shutdown process, the result of each step must be confirmed after it completes and compared with the original expectation. If the result does not match the plan, or the system is found to be out of control, the rollback plan must be enabled immediately to restore the original running state. Finally, after an emergency shutdown, the system is started again according to the preset shutdown duration. Starting all the devices is not the end of the work: after the data center is started, you need to watch its running status closely. In many cases you need to observe for several days and evaluate whether the data center is running normally and stably. If problems or risks remain, a second emergency shutdown may be required.
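The confirm-then-roll-back discipline described above can be sketched as follows. This is an illustrative Python sketch under stated assumptions, not a prescribed tool. The Step structure, its field names, and the run_with_rollback function are hypothetical. Each step carries an action, a verification check, an undo operation, and a time budget. If a step overruns its budget or fails verification, every step attempted so far is undone in reverse order.

```python
import time
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Step:
    """One shutdown step (hypothetical structure for illustration)."""
    name: str
    action: Callable[[], None]   # performs the shutdown operation
    verify: Callable[[], bool]   # True if the result matches the expectation
    undo: Callable[[], None]     # restores the previous running state
    budget_seconds: float        # estimated time allowed for this step


def run_with_rollback(steps: List[Step]) -> bool:
    """Run steps in order; on overrun or failed check, undo all attempted steps.

    Returns True if every step completed and verified, False if rolled back.
    """
    attempted: List[Step] = []
    for step in steps:
        started = time.monotonic()
        step.action()
        attempted.append(step)  # the failing step itself is also undone
        elapsed = time.monotonic() - started
        if elapsed > step.budget_seconds or not step.verify():
            # Undo in reverse order to restore the original running state.
            for finished in reversed(attempted):
                finished.undo()
            return False
    return True
```

A caller would build the list of steps in the planned shutdown order and treat a False return value as the signal to investigate before trying again. Treating a time overrun as a rollback trigger is one possible policy; a real plan might instead switch to a workaround.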
Of the three stages of an emergency shutdown, the most important work is done before the shutdown, and this preparation is what gives an emergency shutdown its value. After the emergency shutdown policy is formulated, regular emergency drills should be organized, and any defects they expose should be fixed immediately, so that the final plan has no gaps. The plan is not static either: as time passes and personnel change, it needs to be revised continually. Holding emergency shutdown drills periodically is very important, because only in this way can deficiencies in the plan be found.
No data center wants to carry out an emergency shutdown. Once the decision has to be made, however, sufficient preparation and a detailed emergency shutdown plan must already be in place. Otherwise, when the emergency comes, the personnel will be disorganized and in disarray, and such a shutdown often causes serious losses to the data center instead of protecting it.