Background:
The server room on the 6th floor of the group's Block B has been in use for many years, and its hardware conditions had begun to restrict the company's IT development. One day in mid-June, during maintenance on the server room's air conditioning, a worker switched off the main power breaker, and part of the server room lost power unexpectedly. The two Active Directory domain controllers (DCs) at group headquarters went down with it.
Symptoms:
The engineer in charge of the server room responded promptly and started the affected servers as soon as power was restored. Once most of the servers were running again, however, most clients were still experiencing a variety of network access problems. For example, some web sites could not be opened even though ping returned correct replies, and network shares could be reached by IP address but not by machine name.
Problem:
Taken together, the symptoms suggested that the DNS service was unavailable. Command-line checks and log analysis showed that both DCs had started but were not serving clients normally.
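One way to tell a name-resolution failure apart from a plain connectivity failure is to test the two independently. The following is a minimal Python sketch, not the procedure actually used in this incident; the function names, the result labels, and the choice of TCP port 445 (SMB, matching the network-share symptom) are assumptions made for illustration.

```python
import socket


def tcp_reachable(ip, port=445, timeout=2.0):
    """Return True if a TCP connection to ip:port succeeds within the timeout."""
    try:
        with socket.create_connection((ip, port), timeout=timeout):
            return True
    except OSError:
        return False


def classify_symptom(hostname, ip, resolve=socket.gethostbyname, reachable=tcp_reachable):
    """Roughly classify one host: is the name failing, the host, or neither?

    `resolve` and `reachable` are injectable so the logic can be tried
    without touching the network.
    """
    try:
        resolved = resolve(hostname)
    except OSError:  # socket.gaierror is a subclass of OSError
        resolved = None

    ip_ok = reachable(ip)

    if resolved is None and ip_ok:
        # The host answers by IP but its name does not resolve: suspect DNS.
        return "dns-suspect"
    if resolved is None:
        return "unreachable"
    if not ip_ok:
        return "host-down"
    return "ok"
```

Calling `classify_symptom` with a file server's name and its known IP address (both hypothetical here) and getting "dns-suspect" for many hosts at once would be consistent with the symptoms described above.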
Careful analysis of the AD-related logs showed that, because both DCs lost power suddenly, neither could notify the other to take over the load when its services stopped. When both DCs were powered on again, neither could obtain state information from the other server. As a result, each DC considered itself the last DC in the AD, which produced a conflict in the replication relationship between the two and ultimately left the entire AD service unavailable, including the DNS service mentioned above.
Solution:
Because the AD service was already completely down, there was no need to worry that restarting the DCs would cause an AD outage.
Shut down both DCs, then power on DC-01, which holds the five operations master (FSMO) roles. After it has fully booted, check the AD-related logs to make sure all AD services started properly. If you see warnings or errors related to Active Directory replication, ignore them for now.
Once the first DC has started successfully, power on the second DC. After it has fully booted, check the AD-related logs to make sure its AD services started normally.
Open the Active Directory Sites and Services console on DC-01, manually trigger immediate replication on every replication connection, and monitor the event logs.
Once the logs show that site replication has completed successfully, reboot both DCs so that services and data are refreshed.
Restart them in this order: while DC-01 keeps working normally, restart DC-02 first; once DC-02 has fully rebooted and is serving normally, restart DC-01.
If site replication fails at this point, the problem is serious. Fortunately I did not run into that, so I am setting it aside for now. (A site replication failure needs further handling; refer to the relevant standard troubleshooting procedure, or open a case directly with Microsoft GTSC.)
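The ordering of the steps above can be sketched in code. This is only an illustration of the sequence, under stated assumptions: the callables `start`, `restart`, `is_healthy`, `force_replication`, and `replication_ok` are hypothetical hooks that an administrator would have to implement (for example by checking event logs or running their own tooling); nothing here talks to Active Directory directly.

```python
import time


def wait_until(check, timeout_s=600, interval_s=10):
    """Poll `check()` until it returns True or the timeout expires."""
    deadline = time.monotonic() + timeout_s
    while True:
        if check():
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval_s)


def staged_recovery(start, restart, is_healthy, force_replication, replication_ok,
                    first="DC-01", second="DC-02", wait=wait_until):
    """Run the recovery steps in order and return a log of completed steps.

    `first` is the DC holding the operations master roles. All callables
    are hypothetical hooks supplied by the caller.
    """
    log = []

    # 1. Bring the DCs up one at a time, role holder first.
    for dc in (first, second):
        start(dc)
        if not wait(lambda: is_healthy(dc)):
            raise RuntimeError(f"{dc} did not become healthy after start")
        log.append(f"started {dc}")

    # 2. Force replication from the role holder and wait for it to succeed.
    force_replication(first)
    if not wait(replication_ok):
        raise RuntimeError("replication did not complete; escalate")
    log.append("replication ok")

    # 3. Rolling restart: second DC first, so one DC is always serving.
    for dc in (second, first):
        restart(dc)
        if not wait(lambda: is_healthy(dc)):
            raise RuntimeError(f"{dc} did not become healthy after restart")
        log.append(f"restarted {dc}")

    return log
```

The point of the design is that each step blocks until the previous one is verified, and a replication failure stops the procedure instead of proceeding to the reboots.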
Results:
After both DCs were rebooted, no further Active Directory replication warnings or errors appeared on either DC from the restart onward. All AD-related services also started normally.
Tests from other servers and some clients showed that DNS resolution was back to normal, users could log in to the domain normally, shared user folders were accessible, and AD searches for printers, computers, users, and other AD objects also worked.
Over the following two weeks of operation, no further system problems caused by the AD service occurred.
At this point, we can consider the AD service fully functional again.
Summary:
In a distributed system, the components depend on one another, so shutting down and starting up the whole system has to follow a certain order. Whether malicious or accidental, a near-simultaneous power loss is very damaging to a distributed system: the components get no chance to tell each other that they are stopping, which can cause services on other components to stop or to conflict.
Thankfully, a simultaneous power loss does not cause data loss on the storage devices in most cases. In this incident, at least, no loss was found when the AD data was checked. But this is by no means guaranteed, and luck played a large part.
Ensuring a highly available power supply is therefore the most basic requirement for stable operation. In addition, after a power failure, the components of a distributed system should be powered on in a defined startup sequence rather than all at once, especially in environments with attached storage devices.
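A defined startup sequence can be derived from the dependencies between components with a topological sort. The sketch below uses `graphlib.TopologicalSorter` from the Python standard library (Python 3.9 or later). The component names and the dependency map are hypothetical examples, not the actual inventory from this incident.

```python
from graphlib import TopologicalSorter


def startup_order(depends_on):
    """Return components ordered so that every dependency starts first.

    `depends_on` maps each component to the set of components that must
    already be running before it is started.
    """
    return list(TopologicalSorter(depends_on).static_order())


def shutdown_order(depends_on):
    """Shut down in the reverse of the startup order."""
    return list(reversed(startup_order(depends_on)))


# Hypothetical dependency map: storage first, then the DC holding the
# operations master roles, then the second DC, then member servers.
EXAMPLE_DEPENDENCIES = {
    "DC-01": {"storage"},
    "DC-02": {"DC-01"},
    "file-server": {"DC-02"},
}
```

With this example map, `startup_order(EXAMPLE_DEPENDENCIES)` yields storage, DC-01, DC-02, file-server, and `shutdown_order` yields the reverse. A circular dependency makes `static_order()` raise `graphlib.CycleError`, which is a useful signal that the documented sequence is inconsistent.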
Source: http://bisheng.blog.51cto.com/409831/171780