Troubleshooting O&M Faults with "Medical Thinking"

Source: Internet
Author: User

In day-to-day work we always run into problems in the production environment. When a problem hits an important business, the business needs the system back to normal in the shortest possible time, and that puts a great deal of pressure on whoever is handling it. Under that pressure it is easy to become confused. So how can we troubleshoot faults quickly and effectively? The problems themselves are endless. I personally think that only a clear way of thinking about troubleshooting lets you withstand the pressure each time and keep fault recovery time to a minimum. I also think that troubleshooting and recovering from a fault is somewhat like going to the hospital when we are ill. So I have combined "how a doctor thinks" with "how to think about troubleshooting" to share my understanding of the approach.

Traditional Chinese medicine emphasizes the four diagnostic methods (wang, wen, wen, qie): look, listen, ask, and take the pulse. The routine of Western medicine is: draw blood, run tests, read the reports, and treat. People who care about wellness focus on daily health care and exercise. A system fault is like an illness, and troubleshooting is the equivalent of seeing a doctor. The principle of O&M troubleshooting is to combine the theories of traditional Chinese medicine, Western medicine, and wellness practitioners: not only "look, listen, ask, and take the pulse" and "diagnose and treat" when you are ill, but also "health care" on ordinary days. I split this "seeing a doctor" process into five steps:

1. Daily health care and prevention: daily O&M specifications

2. Where is the lesion? What is the situation?

3. Blood tests and reading the report: identify the cause

4. Prescribe the remedy: resolve the fault, then summarize and report

5. Post-illness conditioning: fix what is lacking

(Note: Throughout the troubleshooting process, you must back up system information, data, and the fault site to avoid secondary damage caused by the troubleshooting itself. The same care applies when seeing a doctor: a patient who arrived alive must not be lost to the treatment.)


1. Daily health care and prevention: daily O&M specifications

To better avoid production faults caused by misoperation or abuse of permissions in daily work, routine O&M rules should be detailed and strictly enforced. Many companies have drawn up good O&M rules but never put them into practice, which makes the rules no better than waste paper. O&M rules are the equivalent of regular exercise: they greatly reduce your chances of getting sick. A few simple, good operating habits can avoid a lot of trouble and shorten system recovery time when a fault does occur. For example: detailed records of operations, detailed permission control rules, standard operating habits and rules, an emergency fault recovery manual, scheduled backups of systems and data, and skills training for the relevant personnel are all good preventive measures. Doing all of these well removes a great deal of unnecessary trouble.
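As one way to make "detailed records of operations" concrete, here is a minimal Python sketch. The function names `record_operation` and `read_operations`, the JSON-lines file layout, and the field names are all assumptions made for illustration, not a prescribed format.

```python
import getpass
import json
from datetime import datetime, timezone
from pathlib import Path


def record_operation(log_path, host, command, reason, operator=None):
    """Append one operation record to a JSON-lines file and return it."""
    entry = {
        "time": datetime.now(timezone.utc).isoformat(),
        # Fall back to the current login name when no operator is given.
        "operator": operator or getpass.getuser(),
        "host": host,
        "command": command,
        "reason": reason,
    }
    with Path(log_path).open("a", encoding="utf-8") as f:
        f.write(json.dumps(entry) + "\n")
    return entry


def read_operations(log_path):
    """Return all recorded operations, oldest first."""
    with Path(log_path).open(encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]
```

When a fault occurs, reading this file back answers the question "who changed what, where, and why" without relying on anyone's memory.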


2. Where is the lesion? What is the situation?

Look and listen (wang, wen): Observe. What fault has been observed, and what state is the system in at this moment? The most direct examples: customer service reports that the website cannot be accessed or is slow, data is lost, server load is too high, traffic is abnormal, or a server raises an alarm. When any of these occurs, we must first pin down the symptoms and record them. If even the basic symptoms of the fault are unclear, how can we carry out the next step of troubleshooting? At the very least you need to know what the error message is, which part of the website is slow, and which data is lost. In other words, when you are ill you should at least be able to say where it hurts, how it hurts, and how uncomfortable you feel, so that the doctor can judge what disease you may have.

Ask (wen): Query information. Based on the symptoms, we check the system logs and ask the relevant people what operations were performed: whether the program or a configuration file has been modified, or whether the system architecture has been adjusted. Also check whether this problem has occurred before. Some of this information comes from other people; some we must obtain ourselves from system log files or monitoring data. Record all of it. (This is equivalent to writing a medical record.) This is very helpful for troubleshooting O&M faults, because later in the process we can see clearly which possible causes exist, which of them have already been checked, and which can be identified from the log files.

Recording information as you go keeps the whole troubleshooting process clear.
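The "medical record" idea can be sketched as a small data structure. This is only an illustration: the class name `IncidentRecord`, its fields, and its methods are hypothetical, chosen to mirror the kinds of information described above (symptoms, what people reported, and which causes have already been checked).

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone


def _now():
    return datetime.now(timezone.utc).isoformat()


@dataclass
class IncidentRecord:
    """A 'medical record' for one fault."""
    symptom: str
    opened_at: str = field(default_factory=_now)
    observations: list = field(default_factory=list)    # what logs and monitoring show
    recent_changes: list = field(default_factory=list)  # what people report was changed
    checked: dict = field(default_factory=dict)         # possible cause -> result

    def observe(self, note):
        self.observations.append(note)

    def add_change(self, note):
        self.recent_changes.append(note)

    def mark_checked(self, cause, result):
        self.checked[cause] = result

    def unchecked(self, candidates):
        """Return the candidate causes that have not been checked yet."""
        return [c for c in candidates if c not in self.checked]
```

For example, after `rec.mark_checked("network", "ruled out")`, calling `rec.unchecked(["network", "configuration", "database"])` leaves only the causes still to be examined.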

Take the pulse (qie): Locate possible causes. Use the information collected above to identify one or more areas where the fault may lie: hardware, network, program, configuration, system architecture, database, misoperation, and so on. Many people think this is the hardest part. I personally think it is not hard to do well. First, once you have done the two steps above, you can usually make some basic judgments about the fault you are facing. That does, of course, depend on a basic understanding of every area involved.

Another method is the regular "walk through it once" approach. In theory, simulate the normal operation of the system and the steps it has to go through, for example starting from a customer's access request --> network transmission on the client side --> network transmission on the server side --> the operating system --> the program --> the other applications involved. First, walk through the principles at work in normal operation. Second, assume a fault at each step, one at a time, and work out in theory whether it would produce the same symptoms as the current fault. If it would, that step is a possible point of failure. Record every possibility you find during this theoretical fault simulation.

This is the same as seeing a doctor. Doctors always form a preliminary diagnosis based on your symptoms. Their theoretical system tells them what a healthy person looks like and, if something goes wrong somewhere, what accompanying symptoms to expect. They compare that with your symptoms and give a preliminary diagnosis, which may of course point to several possible causes.

In addition, accumulated day-to-day experience sharpens your sense for fault location. The more faults you have seen, the more you can draw on past symptoms to spot similarities and locate the likely fault.
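The "walk through it once" method can be sketched in Python as follows. Everything here is hypothetical: the stages in `REQUEST_PATH`, the symptom table `EXPECTED_SYMPTOMS`, and the function `candidate_stages` are invented for illustration, and a real table would come from your own knowledge of the system. The sketch keeps a stage as a candidate when a fault assumed there would produce at least one of the observed symptoms.

```python
# Hypothetical request path, in the order a request travels.
REQUEST_PATH = [
    "client network",
    "server network",
    "operating system",
    "application",
    "database",
]

# Hypothetical table: symptoms we would expect if that stage were faulty.
EXPECTED_SYMPTOMS = {
    "client network": {"site unreachable for some users"},
    "server network": {"site unreachable for all users", "ping fails"},
    "operating system": {"high load", "site slow"},
    "application": {"http 5xx errors", "site slow"},
    "database": {"http 5xx errors", "site slow", "data missing"},
}


def candidate_stages(observed, path=REQUEST_PATH, expected=EXPECTED_SYMPTOMS):
    """Walk the path in order and assume a fault at each stage in turn.

    Returns (stage, matched_symptoms) for every stage whose assumed fault
    would explain at least one observed symptom.
    """
    candidates = []
    for stage in path:
        matched = set(observed) & expected.get(stage, set())
        if matched:
            candidates.append((stage, sorted(matched)))
    return candidates
```

The returned list is the set of possibilities to write down before moving on to verification; stages that match more of the observed symptoms are natural ones to check first.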


3. Blood tests and reading the report: identify the cause

When we go to the hospital, we usually have a batch of blood tests done first, and the cause of the illness is finally determined from the reports. The role of these reports is to confirm or rule out the possible causes one by one. Likewise, having determined where the fault may lie, the next step is to eliminate the candidates one at a time. Of course, if you can directly determine what caused the fault, you can go straight to "treatment".

Eliminate the candidates one by one and verify each at a specific point. Having determined the areas where the fault may lie, we can use substitution, comparison, monitoring, and other methods to verify whether there really is an error in each area. For example:

Substitution: Replace the suspect configuration or program with a known-good one from another environment. (Note: back up the fault site first.) If the fault disappears, the replaced part was the cause; otherwise it was not.

Comparison: Take two machines in the same environment, one faulty and one normal, and compare them on the same aspect. If they are the same in that aspect, it is probably correct, and the chance that the fault lies there is very small. For example, a data center has one faulty machine and one normal machine. Check the network by accessing the normal machine. If it is reachable, and the two machines share the same network environment and configuration, the fault is unlikely to be in the network.

Monitoring: Monitor a particular aspect of the faulty machine and watch what changes as it goes from normal to faulty. If a specific change is found in that aspect, it can be identified as a likely cause of the problem.

Through one-by-one elimination and point-by-point verification, you can rule out possible faults and locate the actual cause. The next step is to take recovery measures.
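The comparison method in particular lends itself to a short sketch. Assuming the relevant settings of each machine have already been collected into a dictionary (how they are collected is outside this example), the hypothetical helper `compare_settings` below lists only what differs between the normal and the faulty machine. A key that is missing on one side is reported as `None`.

```python
def compare_settings(healthy, faulty):
    """Return {key: (healthy_value, faulty_value)} for every differing setting."""
    diffs = {}
    for key in sorted(set(healthy) | set(faulty)):
        h = healthy.get(key)
        f = faulty.get(key)
        if h != f:
            diffs[key] = (h, f)
    return diffs
```

If the result is empty for an aspect such as network settings, that aspect is unlikely to be the cause; anything that does appear in the result is worth verifying next.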


4. Prescribe the remedy: resolve the fault, then summarize and report

Once we know the cause of the fault, we take recovery measures. Note: during fault recovery, you must back up the original system information, data, and fault site. This lets you roll back if the recovery fails and also helps with reproducing the fault later. In addition, do not apply a fix without authorization. Get approval from your superior first and act only after it is granted, even if the action seems small. Make this a habit. At the very least, send a brief report so that the leaders know about the matter and agree to it. If several people are involved and the matter needs discussion, hold an emergency meeting. Also notify the affected parties of the recovery measures you are about to take, the possible impact, and how long it will last. Do not underestimate this part. Sometimes, if you do not coordinate within the company, other departments will tear you apart before the users do.
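The "back up first, roll back if recovery fails" habit can be sketched for the simple case of a single file. This is a minimal illustration under stated assumptions: the context manager `backed_up` and the `.bak` suffix are hypothetical choices, and a real system would also need to cover data and service state, not just one file.

```python
import shutil
from contextlib import contextmanager
from pathlib import Path


@contextmanager
def backed_up(path):
    """Copy `path` aside before a fix and restore it if the fix raises."""
    path = Path(path)
    backup = path.with_name(path.name + ".bak")
    shutil.copy2(path, backup)  # preserve the fault site before touching it
    try:
        yield backup
    except Exception:
        shutil.copy2(backup, path)  # roll back to the original state
        raise
```

Used as `with backed_up("app.conf"): ...`, any exception raised inside the block, for instance from a verification step that fails after the change, restores the original file. The backup copy is kept in both cases so the fault can be reproduced later.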

After the fault is resolved, do not forget to write a fault summary report promptly, submit it to your superiors, and share it with peers online.


5. Post-illness conditioning: fix what is lacking

We have handled the fault, understood its cause, and completed the report. Finally, we need to think about the root cause: was it a permission issue, an operation mistake, a lack of O&M standards, a system problem, or was our own technical skill insufficient, monitoring inadequate, the system not automated enough, and so on? Then, for that cause, propose an effective way to guard against it, prevent it, or even eliminate it next time, and put it into practice. This makes the entire system more robust.

It is just like recovering from a serious illness: we still need to take good care of our bodies, know what we lack, exercise more, and become stronger.


PS: I have little O&M experience, so I have rarely run into faults, apart from some network faults. I personally think the same reasoning applies, so I wrote this up to share. Suggestions are very welcome.

This article is from the "Start from" blog, please be sure to keep this source http://atong.blog.51cto.com/2393905/1349768

