It has been through countless bug fixes and N+ business version upgrades, growing from the original SSH architecture, expected to support just over 100,000 monthly active users, into an independently developed architecture that now serves 3 million monthly active users and is pushing toward 10 million. The development team started out as a small group of 5 people and is still a 5-person team today; the hardware grew from just two blade servers at the time into today's 10+ PC servers. It accompanied me through the most memorable years of my IT career, it witnessed my growth, and it is the most memorable "mentor" of my career.
Fate truly brought us together when I graduated. From having no idea what the system was supposed to do, to finally understanding a little of what it actually does, I took root and germinated in that ignorant soil, and it helped me find the direction of my career. The road has not been easy. Fresh out of a training institution, a complete novice, I took over the project from scratch, and was nearly scared off by the intensity of working through two nights out of three consecutive days. Fortunately I persisted, because I knew it would make me grow.
Over the following years, as promotion intensified and the user base grew rapidly, the system hit one bottleneck after another. Each bottleneck was an opportunity for us to grow, and I would get genuinely excited; I believed I would treat it "well", a confidence that came from the trust between us. I am very grateful to the comrades who passed through the team over those years; it was everyone's unity and harmony that kept us advancing and improving.
On the eve of the storm it seemed unusually calm. Sure enough, right after a relaxed and happy National Day holiday the workload rose sharply. A small team of 5 people covering development, operations, implementation, and data analysis and statistics was averaging 13 hours of work a day, everyone running at 100%, with no room to rest. I knew everyone was already exhausted to the limit. Yet out of a sense of responsibility and mutual understanding nobody complained; at most we would grumble a few sentences out of habit. I know this is one way of relieving stress: being a bit "rough". To make things worse, trouble tends to arrive at exactly such moments. With an overwhelming promotional push from operations, the system ran into a bottleneck again: thread counts rose sharply and we received a flood of complaints and accusations. Because the business is complex, the outside support we could get was limited and never went beyond rhetoric, wasting manpower, material resources, and, most wastefully of all, time. We have always believed that the only people we can truly rely on are ourselves.
In a state where we had almost nothing left to give even without overtime, comrades volunteered to work extra hours to keep the project on schedule while solving the problem; I am truly glad to have met them. We all knew that, hard as it was, this was a rare opportunity and another step in our growth, because there is no way to predict what kind of bottleneck or problem we will run into next. The less familiar the problem, the more excited we get; we are just that "cheap".
From discovering the problem to solving it took us only two days; compared with before, in both efficiency and skill we have made real progress. I still remember the bottleneck caused by `synchronized` back in June, which took us 7 days and 7 nights to resolve. This bottleneck was rooted in the Xfire WebService stack, and below I record the process of dealing with it, so that we can slowly savour it later and perhaps help people who run into similar problems.
The development load in October was indeed very heavy, and many of the requirements involved integrating with third-party interfaces, which is the most time-consuming and talk-intensive part of the job, though it also turned out to be an effective way to sharpen our communication skills. With everyone tied up in development and human resources stretched thin, the operations workload was cut back sharply, and even the system monitoring and maintenance cycle was stretched out. While we were buried in code, the system's thread count kept climbing; we assumed it was only a temporary effect of the promotional campaign and did not pay much attention. A few days later the system was raising alerts frequently, and an urgent e-mail from management said that many users were reporting failures and that complaints were piling up.

At that point I stopped development work and went back to monitoring the system. Threads were indeed higher than ever, but the system as a whole was still under control, and measured against the number of visits and users there was a genuinely continuous upward trend. I checked with the relevant operations departments and confirmed that the recent promotion really was intense. It seemed we would have to slow down development and keep a close eye on the system's performance. Then one day the thread count blew through the top of the thread pool and large numbers of requests were rejected. I tried restarting the servers and the thread pools several times, with poor results; the pools filled up again almost immediately. Monitoring port 80 showed each server averaging around 1,000 effective connections. With the experience of the last bottleneck, I immediately suspected the system had hit a bottleneck somewhere, so I started a "global scan", ruling out possible bottleneck points one by one.
The database was the first thing I ruled out. It had the worst "record" before, but after the careful tuning during the last incident its performance had improved a lot, and the monitoring data showed no problems. The next "suspect" was the performance of the interfaces to the third-party platforms. The system integrates with quite a few of them, so each had to be checked. The network tests showed nothing wrong, so I went back to reviewing and aggregating the logs; because the whole system was now struggling, the picture the logs gave was not very precise. Late one night I closed the main entrance on the load balancer and observed that the service's thread count slowly dropped, which proved the threads were not dead as they had been last time; it had to be resource contention causing threads to wait. I then analyzed the javacore file (the thread dump), and I was sure I had found the problem:
Thousands of threads were waiting for a resource, and 12 threads were hung dead. If this resource contention was not resolved, more and more threads would time out and hang, eventually crashing the system. Next I analyzed what each thread was actually doing, and the culprit finally appeared:
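As a side note for readers who want to do the same kind of analysis: the javacore file is produced by sending the JVM a SIGQUIT (`kill -3 <pid>`) on the IBM JDK, and `jstack <pid>` does the job on HotSpot. A rough in-process equivalent, just to illustrate how WAITING/BLOCKED threads and the locks they sit on are tallied, is the ThreadMXBean sketch below; this is only an illustration of the approach, not the tooling we actually used:

```java
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadInfo;
import java.lang.management.ThreadMXBean;
import java.util.EnumMap;
import java.util.Map;

public class ThreadStateSummary {
    public static void main(String[] args) {
        ThreadMXBean mx = ManagementFactory.getThreadMXBean();
        // Dump every live thread, including the monitor/synchronizer each one holds or waits on.
        ThreadInfo[] infos = mx.dumpAllThreads(true, true);

        Map<Thread.State, Integer> counts = new EnumMap<Thread.State, Integer>(Thread.State.class);
        for (ThreadInfo info : infos) {
            Thread.State state = info.getThreadState();
            Integer c = counts.get(state);
            counts.put(state, c == null ? 1 : c + 1);
            // Many threads waiting on the same lock is the signature of a contended resource.
            if (state == Thread.State.WAITING || state == Thread.State.BLOCKED) {
                System.out.println(info.getThreadName() + " -> waiting on " + info.getLockName());
            }
        }
        System.out.println("Thread state summary: " + counts);
    }
}
```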
The thread monitoring showed that a large number of threads were waiting for an HttpConnection, and they all belonged to a high-traffic third-party platform business. Our system's WebService client uses Xfire, and from the stacks it was clear that Xfire relies on Apache's HttpClient toolkit, with connections managed by a MultiThreadedHttpConnectionManager. Looking it up, I found that the MultiThreadedHttpConnectionManager default for the per-host connection pool is 2, and our Xfire instance is a singleton. In other words, that singleton had only 2 connections with which to serve this many users. Are you kidding? Because the traffic on this WebService interface is so heavy, and because an earlier non-singleton version of it had already caused memory overflow and system crashes, I am thoroughly fed up with the design of this business platform and with the WebService protocol.
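To put the numbers in context: in Commons HttpClient 3.x the defaults really are 2 connections per host and 20 in total. A minimal sketch of how the pool would be enlarged if you constructed the HttpClient yourself (we could not, since Xfire builds its own HttpClient internally, which is why the next step was needed):

```java
import org.apache.commons.httpclient.HttpClient;
import org.apache.commons.httpclient.MultiThreadedHttpConnectionManager;
import org.apache.commons.httpclient.params.HttpConnectionManagerParams;

public class PoolDefaultsDemo {
    public static void main(String[] args) {
        MultiThreadedHttpConnectionManager manager = new MultiThreadedHttpConnectionManager();
        HttpConnectionManagerParams params = manager.getParams();

        // Commons HttpClient 3.x defaults: 2 connections per host, 20 connections in total.
        System.out.println("default per host: " + params.getDefaultMaxConnectionsPerHost());
        System.out.println("default total   : " + params.getMaxTotalConnections());

        // If we owned the HttpClient, raising the pool would be this simple (sizes are illustrative).
        params.setDefaultMaxConnectionsPerHost(200);
        params.setMaxTotalConnections(500);

        HttpClient client = new HttpClient(manager);
        // client is now backed by the enlarged pool; Xfire, however, creates its own
        // HttpClient internally, so these values have to reach it another way.
    }
}
```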
At that moment I thought: an open-source framework cannot be so unreliable that its connection pool size cannot be changed at all. Since HttpClient is created and called inside Xfire's own code, changing the pool could only start from the Xfire code itself, to see whether there was a way in. After half a day of decompiling and tracing through the Xfire classes, I finally saw a little hope:
As the decompiled code shows, Xfire itself defaults the per-host connection pool to 6 and the global maximum number of connections to 20. But the key is getIntValue(): following that method, it turned out that the Xfire service carries a parameter map, and these two values can be supplied as parameters in it. That meant I could change the connection counts of the MultiThreadedHttpConnectionManager's HTTP pool from the outside. With that hope I kept tracing in, and finally found this method on the Xfire service instance:
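Roughly, the pattern getIntValue() follows is the one sketched below. The identifiers and defaults here are written from memory of the decompiled transport code, so treat them as illustrative rather than the exact names in org.codehaus.xfire.transport.http.CommonsHttpMessageSender:

```java
// Illustrative reconstruction only: a property from the Xfire context map overrides the
// built-in default; when nothing is configured, the hard-coded default (6 per host, 20 total) wins.
public class GetIntValuePattern {

    static final int DEFAULT_MAX_CONN_PER_HOST = 6;
    static final int DEFAULT_MAX_TOTAL_CONN = 20;

    static int getIntValue(Object contextValue, int defaultValue) {
        if (contextValue == null) {
            return defaultValue; // nothing set in the parameter map -> Xfire default applies
        }
        return Integer.parseInt(contextValue.toString());
    }

    public static void main(String[] args) {
        System.out.println(getIntValue(null, DEFAULT_MAX_CONN_PER_HOST));  // 6  (our situation at first)
        System.out.println(getIntValue("200", DEFAULT_MAX_CONN_PER_HOST)); // 200 (default overridden)
    }
}
```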
This HashMap is the carrier of the whole Xfire context. By putting the two parameters that control the HTTP connection pool into this HashMap, I raised the number of connections in the pool:
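I won't paste our exact code here; roughly, the idea looks like the sketch below, assuming the usual Xfire route of setting properties on the client. The two key strings are placeholders: the real keys are the constants found in the decompiled CommonsHttpMessageSender (the class whose getIntValue() reads them), and the pool sizes are illustrative:

```java
import org.codehaus.xfire.client.Client;

public class RaiseXfirePool {
    // Placeholder keys: substitute the actual constants from the decompiled CommonsHttpMessageSender.
    private static final String MAX_CONN_PER_HOST_KEY = "<max-connections-per-host key>";
    private static final String MAX_TOTAL_CONN_KEY    = "<max-total-connections key>";

    public static void configure(Client client) {
        // These properties end up in the Xfire context HashMap that getIntValue() reads,
        // so they override the built-in 6 per host / 20 total defaults.
        client.setProperty(MAX_CONN_PER_HOST_KEY, "200");
        client.setProperty(MAX_TOTAL_CONN_KEY, "500");
    }
    // The Client is typically obtained from the generated service proxy via Client.getInstance(proxy).
}
```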
After restarting the system the thread count returned to a stable state. Two days of monitoring later the threads were still just as stable, and this bottleneck episode was over.
Attached is an Xfire flow chart; it gives a very intuitive picture of how Xfire works, and it was quite helpful to me in finding the solution.
2014-10-31 Fault handling: Xfire's HttpClient bottleneck