"Prevention is better than Fire", the truth is understood, but when faced with cost, time and other pressures, the most easily abandoned is the quality, in HW did many times such things, although each pressure Alexander, but a lot of harvest.
The first case I want to share concerns our product in 2008. After we won the bid for an Indian carrier, the project was delivered and maintained by colleagues in India. It could bring the operator millions of dollars a month, and since the project was a joint venture, the profits were good. But the Indian colleagues' understanding of the product was not deep, and the product's own code quality was not high, so after delivery at the end of 2008, production problems kept coming. Because the profits were good, and because both sides treated each other as family, the quality problems never rose to become the number-one issue.
But in 2009 the customer's senior leadership changed. The new leader put the matter on the table and even declared that if these quality problems were not resolved, the subsequent contract would not be renewed.
My colleagues in India had worked with the product for over a year; I myself had only about two years with it, but because I had access to more information domestically, and my skills were somewhat stronger than the rest of the team at the time, the leadership threw this "pseudo-expert" into the fray. Every beginning is hard. Beyond the language barrier (my Indian colleagues' heavily accented spoken English was often hard for me to follow), they had done nearly two years of custom development, so many features had diverged from the original product. How to visibly fix the problems in a short time? I honestly had no confidence, and I did not know whether I could complete the task.
After taking on the task, I communicated thoroughly with them by email and phone and learned their most prominent problem: the production system was so unstable that every night they had to assign someone to restart every machine in the cluster. In most cases that kept things running for a day, but restarting daily was no way to live. After discussing it with them, we split the plan into two steps:
1. Stop the bleeding: make sure no one has to restart the machines manually every night.
2. Build an identical environment in the lab, stress-test the main flows there, and reproduce the problems.
The first needle was to stop the bleeding, and their requirement was simple: stop waking up every night to restart by hand. I wrote a watchdog script in Perl. The script was simple, but it played a very important role: not only could it detect that the service had hung, it also collected and packaged a snapshot of diagnostic information, so that by analyzing the restart records each day we could locate problems faster and more accurately. The watchdog restarted the service under two conditions: CPU utilization hitting 100%, or the heartbeat page failing to respond. To judge CPU usage it did not call vmstat or top directly but read /proc/stat, because there you can see each individual CPU's utilization; sometimes a single CPU pegged at 100% already signals a problem. Before each restart it collected every potentially useful piece of information: GC logs, memory, application logs, disk I/O, and so on. The whole Perl script was no more than 80 lines, but I later reused it in many places. Our system actually had its own watchdog, but we found that it would sometimes hang as well, so running this script as a separate process was more stable; later our product's watchdog was redesigned along the same lines.
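The original was an ~80-line Perl script that I no longer have; the following is a minimal Python sketch of the same two-condition idea. All function names, thresholds, and the heartbeat URL are my own illustration, not the actual script.

```python
import urllib.request

def read_cpu_times(stat_text):
    """Parse /proc/stat text into {cpu_name: (busy_ticks, total_ticks)},
    one entry per individual CPU (the aggregate 'cpu' line is skipped)."""
    times = {}
    for line in stat_text.splitlines():
        fields = line.split()
        if fields and fields[0].startswith("cpu") and fields[0] != "cpu":
            ticks = list(map(int, fields[1:]))
            # ticks[3] is idle; ticks[4] (if present) is iowait
            idle = ticks[3] + (ticks[4] if len(ticks) > 4 else 0)
            times[fields[0]] = (sum(ticks) - idle, sum(ticks))
    return times

def per_cpu_usage(before, after):
    """Usage ratio of each CPU between two /proc/stat samples."""
    usage = {}
    for cpu in after:
        busy = after[cpu][0] - before[cpu][0]
        total = after[cpu][1] - before[cpu][1]
        usage[cpu] = busy / total if total else 0.0
    return usage

def heartbeat_alive(url, timeout=5):
    """True if the heartbeat page answers with HTTP 200."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.getcode() == 200
    except Exception:
        return False

def should_restart(usage, heartbeat_ok, cpu_limit=0.99):
    # Restart if ANY single CPU is pegged, or the heartbeat page is dead.
    return (not heartbeat_ok) or any(u >= cpu_limit for u in usage.values())
```

In a real loop one would sample `/proc/stat` twice a few seconds apart, call `should_restart(...)`, and, before restarting, archive GC logs, memory stats, application logs, and disk I/O figures, as the paragraph above describes.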
With the watchdog in place, we found that many of the restarts were caused by running out of database connections, and the connections ran out either because of slow SQL or because certain poorly performing queries were executed far too often. Optimizing these bit by bit, we watched the watchdog's restart frequency steadily drop.
The second move was performance testing in the lab. We collected typical scenarios and stress-tested them. The main finding was that some flows performed poorly. Take dynamic ad pages: every page render queried the database, and whether or not an ad was actually displayed, the page still ran some complex logic and database queries. The general approach was: whatever could be made asynchronous, make asynchronous. For example, ad click counts did not really need to be that real-time, so we changed them to be computed once by a nightly background job; the display then did a direct lookup of the precomputed result instead of a real-time query on every page render.
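The nightly precompute-then-lookup pattern can be sketched as follows. This is only an illustration: the table and column names (`ad_clicks`, `ad_click_summary`) are hypothetical, and I use SQLite here purely so the sketch is self-contained, not because it matches the production database.

```python
import sqlite3

def nightly_rollup(conn):
    """Run once per night: aggregate the raw click log into a small
    summary table, so page rendering never scans the raw data."""
    conn.execute("""CREATE TABLE IF NOT EXISTS ad_click_summary (
                        ad_id INTEGER PRIMARY KEY,
                        clicks INTEGER NOT NULL)""")
    conn.execute("DELETE FROM ad_click_summary")
    conn.execute("""INSERT INTO ad_click_summary (ad_id, clicks)
                    SELECT ad_id, COUNT(*) FROM ad_clicks GROUP BY ad_id""")
    conn.commit()

def clicks_for_ad(conn, ad_id):
    """Cheap point lookup used at page-render time."""
    row = conn.execute(
        "SELECT clicks FROM ad_click_summary WHERE ad_id = ?",
        (ad_id,)).fetchone()
    return row[0] if row else 0
```

The trade-off is exactly the one described above: the displayed count can be up to a day stale, but each page render costs one indexed point lookup instead of an aggregation over the raw click data.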
With the first needle to stop the bleeding, plus the second move of lab testing, things gradually stabilized after about a month. Along the way I learned how to read AWR reports, how to use Linux commands to locate performance problems, and how to do performance testing with LoadRunner. Once things were stable I wrote up a summary for my Indian colleagues, and the matter was basically closed. For this, I skipped the 2009 National Day holiday entirely; I spent it all working on the problem with my Indian colleagues.
Stories of the Firefighters (1)