Continued from "What is happening right now: if you do not plan for the worst, things often turn out for the worst (5)"
The following happened on the afternoon of September 23, until about 3:30 pm
After lunch, I adjusted the division of labor. Apart from Project Manager A, who handled telephone and QQ support, the other three colleagues and ydy tracked the 10 suspicious issues identified in the morning.
In less than two hours, the divided follow-up produced its first results. Two of the first three items turned out to be very valuable:
- The most time-consuming operation is the Excel file import. This feature dates from the Phase I system. Because it has not been changed in this phase, it should not have been under suspicion. However, log analysis shows that it is used many times a day and that its processing time is huge. Code analysis shows that the asynchronous processing is not complicated and the AJAX integration is not convoluted. But in the test environment we found that every time this feature is used, other users' requests are blocked. The same phenomenon exists in the production system! This is a very meaningful clue. The next step is to analyze the specific cause;
- The most frequently used feature is the pending-items list. It runs when users log on and is refreshed frequently. It has not been changed in this phase either, but the logs show that its usage count is as striking as its time consumption. Accessing the production system at an idle time showed that, even with 16 CPUs and 32 GB of memory, one execution takes 2.5 seconds, which clearly explains a lot. According to the working principle of WebLogic, eight worker threads serve requests at default startup. So if the number of worker threads cannot grow, only eight users can be served at the same time; the long execution time also reduces the efficiency of connection pooling;
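The arithmetic behind the second item can be sketched in a few lines. This is only a back-of-the-envelope model, not a measurement: it takes the eight worker threads and the 2.5-second execution time from the text above, and assumes (for illustration) that the pool cannot grow and that every request costs the same. The function names are hypothetical.

```python
def max_throughput(worker_threads: int, seconds_per_request: float) -> float:
    """Upper bound on requests per second for a fixed-size worker pool."""
    if worker_threads <= 0 or seconds_per_request <= 0:
        raise ValueError("worker_threads and seconds_per_request must be positive")
    return worker_threads / seconds_per_request


def wait_behind_burst(queued_ahead: int, worker_threads: int,
                      seconds_per_request: float) -> float:
    """Seconds a new request waits before it starts running.

    Assumption: all workers were idle, then `queued_ahead` equal-cost
    requests arrived at the same instant just before this one.
    """
    if queued_ahead < 0:
        raise ValueError("queued_ahead must not be negative")
    # Requests run in batches of `worker_threads`; every full batch
    # ahead of us has to finish before a worker becomes free.
    full_batches_ahead = queued_ahead // worker_threads
    return full_batches_ahead * seconds_per_request


# Figures from the text: 8 worker threads, 2.5 s per pending-items request.
print(max_throughput(8, 2.5))         # requests per second, at best
print(wait_behind_burst(8, 8, 2.5))   # the 9th simultaneous user
print(wait_behind_burst(16, 8, 2.5))  # the 17th simultaneous user
```

Under these assumptions the server tops out at a little over three such requests per second, and the ninth user in a burst already waits a full execution time before being served at all.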
Two of the first three suspects have already yielded small results. Because both belong to Phase I functionality and have not been changed in this phase, I adjusted the tracking work immediately:
- ydy follows up on the Excel import problem. Like a newborn calf that is not afraid of tigers, he carries no baggage and did not work on Phase I, so he may be able to find the problem;
- QPS and wxy, two project team members who did not take part in Phase I, follow up on the pending-items feature;
- CDW, who originally helped discover the Excel import problem, is reassigned to speed up the analysis of the remaining suspects;
At the outset, I told all the colleagues who had worked on Phase I that "Excel import blocks requests from other users". They found this hard to believe, so to make them deeply aware of the problem I demonstrated it on the production system, letting them feel what the end user experiences! To show my colleagues how serious this fault really is, I used an Excel file with only 100 rows of records, which is enough to make an end user wait 10 to 20 seconds for any request. Based on this evidence, we can infer in theory that when the server appears to be "dead", what we see in the thread dump is threads waiting in getConnection!
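One way to see why a single import could stall everyone is a toy queueing model of a connection pool. The cause had not been established at this point in the story, so everything here is an assumption made for illustration: that the import ties up every pooled connection, that the pool holds two connections, that the import holds them for 15 seconds (a value picked from the observed 10 to 20 second range), and that waiting requests are served first come, first served. The function name is hypothetical.

```python
import heapq


def simulate_pool(pool_size, requests):
    """Return how long each request waits for a connection.

    `requests` is a list of (arrival_time, hold_seconds) tuples sorted by
    arrival time. Requests are served in arrival order, and each one takes
    whichever connection becomes free first.
    """
    if pool_size <= 0:
        raise ValueError("pool_size must be positive")
    # Min-heap of the times at which each connection becomes free.
    free_at = [0.0] * pool_size
    heapq.heapify(free_at)
    waits = []
    for arrival, hold in requests:
        earliest_free = heapq.heappop(free_at)
        start = max(arrival, earliest_free)
        waits.append(start - arrival)
        heapq.heappush(free_at, start + hold)
    return waits


# Hypothetical scenario: an import grabs both connections at t=0 for 15 s,
# then two ordinary 0.1 s requests arrive at t=1 and t=2.
with_import = [(0.0, 15.0), (0.0, 15.0), (1.0, 0.1), (2.0, 0.1)]
print(simulate_pool(2, with_import))

# Same ordinary requests with no import running: nobody waits.
print(simulate_pool(2, [(1.0, 0.1), (2.0, 0.1)]))
```

In this model the ordinary requests spend almost all of their time waiting for a connection rather than doing work, which is consistent with threads parked in getConnection in a thread dump. It does not prove that this is what happened in production.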
During the afternoon's investigation, similar "crashes" occurred. We showed the Phase I colleagues how long each one lasted. This time, because colleagues were monitoring the system, they saw it recover in time, and Project Manager A could follow the agreed procedure: inform the owner and the end users, and explain accordingly.
The following happened on September 23, from about 3:30 pm to 6:00 pm
After one hour, I checked the progress on the two confirmed suspects before the end of the working day, but neither had been solved. So I quickly looked for more help and found dzm, a colleague on another team with experience in SQL optimization. He soon helped CDW make another SQL statement 4 times faster, and then helped the colleagues in charge of the pending-items feature bring it down from 2.5 seconds to 0.25 seconds, a 10-fold improvement.
The Excel import problem is at an impasse, and ydy has not found any possible cause for the moment. So, after coordinating by phone, LZ, a colleague who took part in Phase I, took over the follow-up of the Excel import feature. Because this feature uses open-source software, one remark in particular: "It is unwise, and dangerous, to use Java open-source components for development without questioning them!" When software is not fully tested before it is used, "sooner or later you have to pay it back"! This is the worst of times and the best of times. My colleague told me that the same problem was found in another project two days ago and has not been solved there yet. However, there are some ideas to try, so the trail has not gone cold.
Toward the end of this period, the system ran into another "crash"-like situation. It was reported by an end user on QQ and was not detected by our monitoring staff, because all our effort was focused on issue tracking and analysis.
The following happened on September 23, from about 6:00 pm to midnight
After dinner, the first thing to do was arrange the fix deployment plan and the announcement based on the situation. (For why this is needed, refer to previous blog posts.) After the afternoon's tracking and troubleshooting, four of our 10 issues had been ruled out and six were included in today's repair plan. I discussed the timing with colleagues in the team and notified the owner via QQ and phone.
There were also some minor episodes while fixing bugs and performing the hotfix. For example, some team members had questions about version control (refer to the blog post on how to perform a hotfix for production systems). I found that, for Project Manager A, understanding it is one thing and doing it is another. Truly, "practice is the sole criterion for testing truth"!
Still planning for the worst
The team completed the hotfix according to the plan, and the installation went smoothly in the end, but verification was not yet finished. And you know what? Although it was already past midnight, we still had to make a two-shift work plan before the team dispersed. The specific arrangements are as follows:
- We have all worked very hard, but one shift has to stay on watch. Including me, this shift continues to follow up on the production environment until two o'clock tomorrow, plans for the worst, makes sure today's hotfix is effective, and plans to stand down tomorrow;
- I am in charge of the relief shift. Colleagues on this shift can rest now, come to work tomorrow, and take over from the previous shift!
(This story is finally complete, and we won the day's fierce battle! We should learn from Go masters, who replay the game after a match in order to understand it better. In fact, there are many more stories here that can be written up separately and published later. Details determine success or failure.)