IT ops masters are what everyone aspires to become: their keen nose always seems to lead them straight to the root cause of a system failure. This ability to respond quickly and pinpoint problems comes from years of accumulated experience and hard-won knowledge of complex data center infrastructure, and it is difficult to replicate. Unsurprisingly, no institution is willing to grant a certification for this almost supernatural sense of judgment.
Nonetheless, high-intensity troubleshooting tends to follow some common, unwritten rules of practice. In this article I will sum up six immutable laws drawn from my own experience, in the hope that they help in your day-to-day work. Note that these laws apply to most situations, not all of them.
1. Never modify the interface you are currently connected through
This sounds obvious, yet people routinely modify the very network interface a device is using to communicate, and it is the root cause of many failures. Sometimes it cannot be avoided, but other mechanisms can remove the risk: if necessary, configure a secondary IP on the interface, and temporarily connect through another device, subnet, serial console, or KVM. This is all the more important for devices sitting in remote offices with no IT staff nearby.
Sometimes I get a little lazy and use a pre-written script to change the IP on a Linux box: it pings a test target and rolls the change back if anything fails. Admittedly, that feels a bit like cheating.
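To make the idea concrete, here is a minimal Python sketch of that revert-on-failure pattern, assuming a Linux host with the iproute2 `ip` tool. The interface name, addresses, and test target are placeholders, not values from any real setup.

```python
#!/usr/bin/env python3
"""Sketch of a revert-on-failure address change (assumes Linux + iproute2).
Interface, addresses, and test target below are placeholders."""
import subprocess
import time

IFACE = "eth0"              # hypothetical interface
OLD_ADDR = "192.0.2.10/24"  # current address (placeholder)
NEW_ADDR = "192.0.2.20/24"  # address we want to switch to (placeholder)
TEST_HOST = "192.0.2.1"     # something we must still reach, e.g. the gateway

def run(cmd):
    return subprocess.run(cmd, shell=True).returncode

# Apply the change.
run(f"ip addr add {NEW_ADDR} dev {IFACE}")
run(f"ip addr del {OLD_ADDR} dev {IFACE}")
time.sleep(2)

# Verify we can still reach the network; if not, roll back immediately.
if run(f"ping -c 3 -W 2 {TEST_HOST}") != 0:
    run(f"ip addr add {OLD_ADDR} dev {IFACE}")
    run(f"ip addr del {NEW_ADDR} dev {IFACE}")
    print("Change failed the connectivity test; reverted to the old address.")
else:
    print("Change verified; keeping the new address.")
```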
2. Make sure every operation leaves room for recovery
Whenever possible, prepare a recovery path before you act. That may mean backing up an entire directory tree before working on a failing disk; it seems tedious, but it preserves any data that might still be valuable. Likewise, you can pull one disk out of a physical server's RAID 1 array before repairing a damaged operating system. All of this is easier in a virtual machine environment, where saving a snapshot is enough.
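As an illustration of the "back up the whole tree first" step, here is a minimal Python sketch that copies a directory into a timestamped tarball before any risky work begins; the paths are placeholders.

```python
#!/usr/bin/env python3
"""Minimal sketch: snapshot a directory tree to a timestamped tarball
before doing anything risky to it. Paths are placeholders."""
import tarfile
import time
from pathlib import Path

SOURCE = Path("/srv/data")    # directory we are about to operate on (placeholder)
BACKUP_DIR = Path("/backup")  # ideally on a different disk (placeholder)

stamp = time.strftime("%Y%m%d-%H%M%S")
target = BACKUP_DIR / f"{SOURCE.name}-{stamp}.tar.gz"

BACKUP_DIR.mkdir(parents=True, exist_ok=True)
with tarfile.open(target, "w:gz") as tar:
    tar.add(SOURCE, arcname=SOURCE.name)

print(f"Backed up {SOURCE} to {target}")
```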
3. Record, record, and record again
Of the laws mentioned today, this one is probably the hardest to follow. Admittedly, calmly recording problems and decisions in the middle of a crisis is somewhat unrealistic. Even so, once the incident is over, write up an analysis for yourself that documents the steps you took and how the problem was resolved. Keep the records somewhere safe, ideally a wiki hosted on your intranet, and keep extra copies in other locations.
4. IT does not run on magic, but it does run on luck
As Thomas Jefferson put it, "I find that the harder I work, the more luck I seem to have." The same applies to the IT world. The more time you invest in studying your infrastructure, the more familiar you become with how your routers, switches, and servers behave, and the easier they are to manage. Doing this regularly develops that sharp nose, lets you make accurate judgments early, and speeds up your response when problems occur. There are many ways to train this kind of "good luck." For example, automating backups of network device configurations lets you deploy a replacement in minutes instead of spending hours reconfiguring a switch by hand when it dies.
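As one possible shape for that kind of automation, here is a rough Python sketch that pulls configurations from a list of devices over plain SSH and keeps dated copies. The hostnames, account, and command are placeholders, and many devices will need a vendor-specific command or a management API instead of plain ssh.

```python
#!/usr/bin/env python3
"""Sketch: pull running configs from a list of devices over SSH and keep
timestamped copies. Hostnames, user, and command are placeholders."""
import subprocess
import time
from pathlib import Path

DEVICES = ["switch1.example.com", "router1.example.com"]  # placeholders
USER = "backup"                                           # placeholder account
COMMAND = "show running-config"                           # vendor-dependent
OUT_DIR = Path("/backup/network-configs")                 # placeholder path

OUT_DIR.mkdir(parents=True, exist_ok=True)
stamp = time.strftime("%Y%m%d")

for device in DEVICES:
    # Run the command remotely and capture the output.
    result = subprocess.run(["ssh", f"{USER}@{device}", COMMAND],
                            capture_output=True, text=True)
    if result.returncode == 0:
        (OUT_DIR / f"{device}-{stamp}.cfg").write_text(result.stdout)
    else:
        print(f"Failed to back up {device}: {result.stderr.strip()}")
```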
5. Make a backup of each configuration file before making changes
This rule applies mainly to UNIX servers and network devices, because nearly every aspect of their configuration lives in plain configuration files. Before changing a sensitive configuration, keep a copy of the switch's config in flash or on a TFTP host. On a UNIX system, simply save an extra copy of *.conf as *.conf.orig.
That way, at a critical moment you can restore the service to its last healthy state simply by copying the file back and restarting the service. This helps less in a Windows environment, where the registry and the way the system is built add a great deal of complexity to this simple idea. Even so, you can export the relevant registry key before making a change, so that when trouble arises you have something to fall back on and no reason to panic. Note: because the Windows registry is so critical, any change to it puts the lifeblood of the server in your hands; never be sloppy with it.
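For the UNIX case, a small Python sketch of the *.conf.orig habit might look like the following; the path is a placeholder. On Windows, exporting the relevant key with `reg export` before editing serves the same purpose.

```python
#!/usr/bin/env python3
"""Sketch: keep a pristine copy of a config file before editing it, so a
rollback is just copying the file back. The path below is a placeholder."""
import shutil
from pathlib import Path

CONF = Path("/etc/myservice/myservice.conf")  # placeholder config file
ORIG = CONF.with_name(CONF.name + ".orig")

# Only keep the very first copy; never overwrite an existing .orig.
if not ORIG.exists():
    shutil.copy2(CONF, ORIG)
    print(f"Saved original to {ORIG}")
else:
    print(f"{ORIG} already exists; leaving it untouched")
```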
6. Monitor, monitor, and monitor again
Prevention is better than cure, so check the operating environment carefully at least once a month. Monitor every aspect of the data center, from room temperature to the racks to the servers, plus server process checks, uptime checks, and so on. It is endless and slightly dull work, but it is critical. You also need centralized logging for all network devices, along with trending and graphing tools that track bandwidth usage, temperature, disk partition usage, and other important metrics. All of these monitoring mechanisms should warn you when a value crosses a reasonable threshold.
When a disk is about to run out of space and corrupt a database, an email or text message sent an hour in advance can spare you the nightmare of emergency overtime and system downtime. There is no reason not to take full advantage of monitoring in the data center.
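As a minimal illustration of threshold-based alerting, here is a Python sketch that checks disk usage and sends a warning when a partition crosses a limit. The threshold, mount points, and recipient are placeholders, and a real deployment would normally rely on a dedicated monitoring system rather than a cron script like this.

```python
#!/usr/bin/env python3
"""Sketch: warn before a disk fills up. Threshold, mount points, and the
recipient are placeholders; 'mail' must be installed and configured."""
import shutil
import subprocess

THRESHOLD = 90                  # warn above this percentage used (placeholder)
MOUNTS = ["/", "/var", "/srv"]  # partitions to watch (placeholders)
ALERT_TO = "ops@example.com"    # placeholder address

for mount in MOUNTS:
    usage = shutil.disk_usage(mount)
    percent = usage.used / usage.total * 100
    if percent >= THRESHOLD:
        msg = f"{mount} is {percent:.1f}% full"
        # Send the warning; swap this for your own alerting mechanism.
        subprocess.run(["mail", "-s", f"Disk alert: {msg}", ALERT_TO],
                       input=msg, text=True)
```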
That wraps up today's rules. They are less commandments to be strictly obeyed than guiding principles to be internalized in everyday IT work. To technical people who truly understand what IT work means, these six rules are simply part of the job; to everyone else, they look like the elusive magic of the IT guru.