Continued from "What Is Happening Right Now: If You Do Not Plan for the Worst, Things Often Turn Out for the Worst" (4)
The following took place on the morning of September 23, up until about 12:00.
I arrived on site five minutes earlier than dy. The first thing to do was sort out the division of labor: do not let the fault throw everyone into disarray!
- Align everyone's thinking: Project Manager A and the colleagues handling telephone and QQ support must not get distracted; they keep providing service as usual, and service quality must not drop!
- If we again spot signs of a suspected "hang" in advance, we keep following the current fault-announcement process so that the owners and end users learn of it promptly;
- If an end user or an owner discovers the problem first, apologize and communicate with them, and ask the technical staff to follow up. Acknowledge that the situation has indeed disrupted everyone's work, and tell them the system will be working normally again within 10 minutes. (We can promise this because the current recovery method is a restart.)
- Keep pushing the latest fault-handling updates to the service team.
By then dy had arrived on site. Knowing that the Java connection pool problem must still be on his mind, I arranged for him to implement a parameterized configuration of the connection pool.
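The post does not show dy's actual change. As an illustration of what "parameterizing" a connection pool typically means in a WebLogic environment, the pool limits and health checks can be moved out of code and into the JDBC data source descriptor, so they can be tuned per environment without redeploying. Every value below is a hypothetical placeholder, not the team's real setting.

```xml
<!-- Illustrative fragment of a WebLogic jdbc-data-source module.
     Element names follow the WebLogic JDBC descriptor schema;
     all values are examples only. -->
<jdbc-connection-pool-params>
  <initial-capacity>5</initial-capacity>        <!-- connections opened at startup -->
  <max-capacity>50</max-capacity>               <!-- hard upper bound on the pool -->
  <capacity-increment>5</capacity-increment>    <!-- growth step under load -->
  <inactive-connection-timeout-seconds>300</inactive-connection-timeout-seconds>
  <test-connections-on-reserve>true</test-connections-on-reserve>
  <test-table-name>SQL SELECT 1 FROM DUAL</test-table-name> <!-- Oracle-style probe -->
</jdbc-connection-pool-params>
```

With the limits in the descriptor rather than hard-coded, pool sizing becomes an operations decision instead of a code change.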
Next, I talked with CDW, who is responsible for health monitoring, and went through his monitoring methods looking for improvements. In the past, the check was simply whether the logon page could be accessed, which only tells us whether a fault has already happened; what we need is an early warning that a fault is coming. I gave CDW and wxy on-the-spot training in WebLogic monitoring methods, so they know we can also raise early warnings and the colleagues in charge of service know when a fault may be on the way, and then handed the monitoring work over to wxy.
(Next week we will run system-monitoring training for the WebLogic servers so that more colleagues can perform routine checks.)
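The shift described above — from "is the logon page up?" to "is a fault coming?" — can be sketched minimally. The URL and thresholds below are hypothetical, not the team's actual monitoring setup; the point is that a slow-but-alive response is reported as a warning rather than as "OK", because degradation often precedes a full hang.

```python
import time
import urllib.request

def classify(reachable, seconds, warn_after=3.0):
    """Turn one probe result into an early-warning level.

    A response that is slow but still served is the interesting case:
    it often precedes a hang, so it is WARN rather than OK.
    """
    if not reachable:
        return "DOWN"
    if seconds >= warn_after:
        return "WARN"
    return "OK"

def probe(url, timeout=10.0):
    """Fetch the logon page once, time it, and classify the result."""
    start = time.monotonic()
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            resp.read()
            reachable = (resp.status == 200)
    except OSError:
        reachable = False
    return classify(reachable, time.monotonic() - start)

# Hypothetical usage: probe("http://appserver:7001/logon.jsp")
```

Run on a schedule (cron, or a loop), this gives the service colleagues a "WARN" signal they can act on before users call in.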
I then followed up with CDW on the daily service logs, access logs, program logs, and so on, hoping to find out what the owners or users were accessing, and what business processing was being performed, just before or exactly when a fault occurred. The service log showed the same situation as yesterday, September 22, with identical error messages. The access log, however, did not record the time-taken of each request, so it was evidently not configured completely; dy and Project Manager A adjusted the log settings.
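The exact adjustment dy and Project Manager A made is not shown in the post. As a sketch of what enabling time-taken usually looks like on WebLogic, the HTTP access log can be switched to the W3C extended format with an explicit field list; the field list below is an example, not the team's actual setting.

```xml
<!-- Illustrative fragment of a server's <web-server-log> settings in
     config.xml (the same options are exposed in the admin console).
     time-taken records how long each request took to serve. -->
<web-server-log>
  <log-file-format>extended</log-file-format>
  <elf-fields>date time cs-method cs-uri sc-status time-taken</elf-fields>
</web-server-log>
```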
Then CDW and I analyzed the access logs from the periods when the system was suspected to be "hung". Although those logs did not contain the time-taken of each request, three qualitative conclusions could still be drawn:
- The log shows that some owners and end users had been using our IT systems since half past six;
- Eyeballing that day's log, the per-minute access volume from these owners and end users was not very large. Because the log records each visitor's IP address, and combined with CDW's explanation, we could see that the number of concurrent requests was not large;
- In particular, at the times when the system was suspected to be "hung", concurrency was not high: even during the "busy" periods around the suspected hangs, the number of concurrent accesses recorded in the log was not large, and the IP addresses in the log records basically stayed in sequence.
If this basic analysis is correct, then once the time-taken field in the access log is re-enabled, we should be able to pick out a list of business functions carrying potential performance risks, access-congestion risks, and access-deadlock risks. After about 40 minutes of log analysis, we identified 10 suspicious function points from the access logs that did contain time-taken.
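The post does not include the analysis itself. A minimal sketch of the kind of pass described, assuming a W3C extended log format whose fields are `date time cs-method cs-uri sc-status time-taken` (time-taken in seconds): aggregate the worst observed time-taken per URI and surface the slowest candidates as suspicious function points. The sample lines are invented for illustration.

```python
from collections import defaultdict

def slow_function_points(lines, top_n=10):
    """Rank request URIs by their worst observed time-taken.

    Each data line is assumed to look like:
        date time cs-method cs-uri sc-status time-taken
    Blank lines and '#' directive lines are skipped.
    """
    worst = defaultdict(float)
    for line in lines:
        if not line.strip() or line.startswith("#"):
            continue
        fields = line.split()
        uri, taken = fields[3], float(fields[-1])
        worst[uri] = max(worst[uri], taken)
    return sorted(worst.items(), key=lambda kv: kv[1], reverse=True)[:top_n]

sample = [
    "#Fields: date time cs-method cs-uri sc-status time-taken",
    "2008-09-23 09:10:01 GET /app/report.do 200 12.40",
    "2008-09-23 09:10:02 GET /app/logon.jsp 200 0.08",
    "2008-09-23 09:10:05 POST /app/report.do 200 9.90",
]
print(slow_function_points(sample, top_n=2))
# -> [('/app/report.do', 12.4), ('/app/logon.jsp', 0.08)]
```

Sorting by worst case rather than average is a deliberate choice here: a function that occasionally takes 12 seconds can pin a server thread even if its average looks healthy.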
Drawing the conclusions above, and picking out the 10 suspicious function points, relied on past experience and lessons learned:
- A system suspected of "hanging" usually has multiple complications. Perform the "autopsy" as carefully as CSI: suspect every plausible cause, then look for evidence to disprove or confirm each one. This way of thinking is usually more effective. Like the rescue efforts in the current financial crisis, it takes a combination of punches.
- Do not underestimate an improvement of even one second. There are 10 suspicious function points, each consuming several seconds per request. Every second shaved off means more service resources freed up to serve other requests!
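The "every second counts" point can be made concrete with simple capacity arithmetic. Assuming a hypothetical pool of 25 server threads (a number invented for illustration) and requests that hold a thread for their whole duration, shaving one second off a slow function directly raises how many requests the same threads can serve:

```python
def throughput(threads, seconds_per_request):
    """Requests per second a fixed thread pool can sustain when each
    request occupies one thread for its full duration."""
    return threads / seconds_per_request

# Hypothetical numbers: 25 threads, a 5-second function vs a 4-second one.
print(throughput(25, 5.0))  # 5.0 requests/second
print(throughput(25, 4.0))  # 6.25 requests/second
```

A one-second saving on a five-second function is a 25% capacity gain on those threads, which is why each of the 10 function points is worth chasing.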
(Supplement: how the 10 suspicious function points were analyzed involves many technical terms. A CSI-style fault-analysis report may be posted later, but the mindset and method are the same.)