When our team handled operations, optimization, and scaling at our previous company, we dealt with poorly performing systems and infrastructure of all sizes, most of them large (such as CNN or the World Bank system). Add tight deadlines, unusual technology stacks, and missing information and documentation, and the process becomes painful and leaves deep memories.
When a server fails, the cause is rarely obvious. We usually start with the following steps:
I. Find out as much of the background to the problem as possible
Do not rush to the server right away. First, find out how much is already known about the server and what exactly the fault looks like. Otherwise, you will probably end up fumbling in the dark.
The following questions need to be answered:
What exactly is the fault? No response? Errors?
When was the fault discovered?
Can the fault be reproduced?
Is there a pattern (for example, does it happen once every hour)?
What was the last change to the platform (code, servers, etc.)?
Which specific user groups are affected (logged in, logged out, in a certain region ...)?
Can any architecture documentation (physical and logical) be found?
Is there a monitoring platform available? (For example, Munin, Zabbix, Nagios, New Relic ... anything will do.)
Are there any logs to look at? (For example, Loggly, Airbrake, Graylog ...)
The last two are the most convenient sources of information, but do not get your hopes up. More often than not, neither exists, and you can only keep digging.
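The questions above can be kept as a small structured record so that it is obvious which ones are still unanswered before anyone logs in to the server. The following is only a minimal sketch: the class name `IncidentContext`, its field names, and the helper `open_questions` are hypothetical and chosen for illustration, and the sample values are made up.

```python
from dataclasses import dataclass, fields
from typing import List, Optional


@dataclass
class IncidentContext:
    """One field per context question; None means "not known yet"."""
    symptom: Optional[str] = None            # What exactly is the fault?
    first_seen: Optional[str] = None         # When was the fault discovered?
    reproducible: Optional[bool] = None      # Can the fault be reproduced?
    pattern: Optional[str] = None            # Is there a recurring pattern?
    last_change: Optional[str] = None        # Last change to code or servers
    affected_users: Optional[str] = None     # Which user groups are affected?
    architecture_docs: Optional[str] = None  # Where the architecture docs are
    monitoring: Optional[str] = None         # Monitoring platform, if any
    logs: Optional[str] = None               # Log source, if any


def open_questions(ctx: IncidentContext) -> List[str]:
    """Return the names of the fields that are still unanswered."""
    return [f.name for f in fields(ctx) if getattr(ctx, f.name) is None]


# Made-up example: only two of the questions have been answered so far.
ctx = IncidentContext(symptom="HTTP 500 on checkout", reproducible=False)
print(open_questions(ctx))
```

Note that an answer of `False` (for example, "the fault cannot be reproduced") counts as answered; only `None` marks a question as still open.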