20141031 troubleshooting: XFire: HttpClient bottleneck, 20141031 httpclient

Source: Internet
Author: User

This system has been through numerous vulnerability fixes and N+ business version upgrades. It has gone from the SSH architecture that was expected to support more than 10 million active users to the independently developed architecture that now supports 3 million, and even 10 million, monthly active users. The development team numbered five developers then and numbers five now; there were only two blade servers then, and there are now 10 PCs and smaller servers. It has accompanied me through the most memorable years of my IT career, it has witnessed my growth, and it is the most memorable "mentor" of my career.

Indeed, fate had me encounter it just as I graduated. I went from developing a system without knowing what I was doing to being somewhat familiar with it; it took root in me from ignorant soil and let me find a career direction. It has not been easy along the way. I took over this project from scratch as a newbie fresh out of a training institution, and I was almost scared away by the relentless intensity of working two or three days in a row. Fortunately I persisted, knowing that it would make me grow.

Over the past few years, as promotion of the system increased and the number of users grew dramatically, the system ran into numerous bottlenecks. Every bottleneck is an opportunity for us to grow. Each one gets me excited, because I believe I will "cure" it, and that confidence comes from the trust between us. I am very grateful to the friends who have been on the team for these years. Thanks to that unity and harmony, we have made constant progress.

The eve of the storm seemed exceptionally calm. After a relaxed and pleasant National Day holiday, the workload increased sharply. A team of five developers also covers O&M, implementation, and data analysis and statistics; almost everyone was averaging 13 hours of work a day, heads 100% in work mode, with no room to rest. I know everyone was exhausted. Yet out of a sense of responsibility and mutual understanding nobody complained, apart from an occasional habitual outburst of strong language, which I know is one way to relieve pressure. But trouble tends to arrive at exactly such moments. With the Operations Department's promotion, the system once again hit a bottleneck and the thread count rose sharply. Many complaints and accusations came in. Because of the complexity of the business, the support available to us is limited, and waiting on it would only waste manpower, material resources, and time. We have always believed that the most reliable people are ourselves.

With no overtime compensation and no chance to take a rest, my comrades still volunteered to work overtime to keep the project on schedule while solving the problem. I am really glad to have met them. As we all know, although the work is hard, this is a rare opportunity and another step in our growth, because we cannot predict which bottlenecks or problems we will run into next. The more we encounter something we have never met before, the more excited we are. We are just that "guilty".

From discovering the problem to solving it took us only two days. Compared with last time, both our efficiency and our skill have improved greatly: the bottleneck caused by synchronized in March took us seven days and seven nights to solve. This time the bottleneck originated in XFire, the webservice framework. Below I record how we handled it, so that we can review it later and help people who run into similar problems.

The development workload in March was indeed heavy, and many requirements involved integrating with third-party interfaces. That is the most time-consuming and frustrating part, although it has also been an effective way to improve our communication skills. With everyone tied up in development and staffing limited, the effort spent on O&M dropped sharply and the interval between system monitoring checks grew longer. While we were wrestling with code, the system's thread count kept rising; we assumed it was only a temporary effect of promotional activities and did not pay much attention. A few days later, many of us began receiving system alarms and urgent emails from management saying that the system's failure rate had risen and complaints were increasing. At that point I stopped my work and ran a round of monitoring on the system. The thread count was indeed higher than before, but the system as a whole was still under control. Traffic and user numbers showed a steady upward trend, and I learned from the operations departments that their promotion had been going very well recently. It looked as though we would have to slow development down and keep a closer watch on system performance.

Then one day the thread count suddenly hit the maximum size of the thread pool, and many requests were rejected. I tried restarting the servers several times and enlarging the thread pool, but it had little effect: the pool soon filled up again. I monitored port 80 and found that each server had reached 1000 valid connections. After the experience of the last bottleneck, I immediately suspected that the system had hit a bottleneck somewhere. So we began a "global scan", checking the system's possible bottlenecks one by one.

The database was the first suspect I set out to rule out, because it had too many past "cases" on record. But it had since been tuned well, its performance had improved a lot, and its monitoring data looked fine. I then moved on to the next "criminal": the performance of the interfaces to third-party platforms. Because the system connects to several third-party platforms, they had to be checked one by one. First I completed the network tests. Then I checked and collected the logs again, but because the performance of the whole system was degraded, what the logs reflected was not very precise. Late one night I shut off the load entry point and found that the number of service threads slowly declined, proving that the threads had not died as they did last time. Thread waiting caused by resource contention was the more likely cause. Analyzing the javacore file revealed the problem.


Thousands of threads were waiting for a resource, and 12 threads were suspended. If the contention for this resource were not resolved, more threads would hang and bring the system down. Next I analyzed the running state of each thread to find the culprit.
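A similar per-thread tally can be taken from inside a running JVM. The following is a minimal sketch using the standard `java.lang.management` API, not the javacore tooling used in this investigation; the class name `ThreadStateTally` and its helper methods are hypothetical. It counts threads by state and groups the waiting or blocked ones by the first non-JDK frame on their stack, which is usually enough to show which call everyone is queued behind.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadInfo;
import java.util.EnumMap;
import java.util.Map;
import java.util.TreeMap;

public class ThreadStateTally {

    /** Counts threads by state (RUNNABLE, WAITING, BLOCKED, ...). */
    static Map<Thread.State, Integer> countByState(ThreadInfo[] infos) {
        Map<Thread.State, Integer> counts = new EnumMap<>(Thread.State.class);
        for (ThreadInfo info : infos) {
            if (info == null) continue; // thread ended while being sampled
            counts.merge(info.getThreadState(), 1, Integer::sum);
        }
        return counts;
    }

    /** Returns the first frame outside java.*, jdk.* and sun.*, or the top frame if none. */
    static String firstApplicationFrame(StackTraceElement[] stack) {
        for (StackTraceElement frame : stack) {
            String cls = frame.getClassName();
            if (!cls.startsWith("java.") && !cls.startsWith("jdk.") && !cls.startsWith("sun.")) {
                return cls + "." + frame.getMethodName();
            }
        }
        return stack.length > 0
                ? stack[0].getClassName() + "." + stack[0].getMethodName()
                : "(no stack)";
    }

    /** Groups waiting or blocked threads by the application frame they are stuck in. */
    static Map<String, Integer> countStalledByFrame(ThreadInfo[] infos) {
        Map<String, Integer> counts = new TreeMap<>();
        for (ThreadInfo info : infos) {
            if (info == null) continue;
            Thread.State state = info.getThreadState();
            boolean stalled = state == Thread.State.WAITING
                    || state == Thread.State.TIMED_WAITING
                    || state == Thread.State.BLOCKED;
            if (stalled) {
                counts.merge(firstApplicationFrame(info.getStackTrace()), 1, Integer::sum);
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        // false, false: skip lock details, which keeps the dump cheap
        ThreadInfo[] infos = ManagementFactory.getThreadMXBean().dumpAllThreads(false, false);
        System.out.println("By state: " + countByState(infos));
        System.out.println("Stalled in: " + countStalledByFrame(infos));
    }
}
```

If most stalled threads share one frame, that frame points at the contended resource.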


The thread monitoring shows that a large number of threads were waiting for an HTTP connection to one third-party platform service that carries heavy traffic. The webservice client of our system uses XFire, and from the stack information we can see that XFire uses the Apache HttpClient toolkit, which manages connections with MultiThreadedHttpConnectionManager. Looking it up, I found that the default maximum number of connections per host in MultiThreadedHttpConnectionManager is 2, while our XFire instance is a singleton. That means only two connections in this singleton were serving all those users, which is a joke. This business's set of webservice interfaces is very large: in the past, when I did not use a singleton, it caused memory overflow and crashed the system. For that reason I strongly dislike the design of this business platform and its webservice protocol.
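The effect of a small per-host limit behind a shared singleton client can be reproduced without any HTTP library. The sketch below is a simulation under stated assumptions, not HttpClient or XFire code: a `java.util.concurrent.Semaphore` stands in for the connection manager's per-host limit, and the class name `SharedPoolContention`, the caller count, and the hold time are all hypothetical.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Semaphore;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

public class SharedPoolContention {

    /**
     * Runs `callers` concurrent tasks against a shared pool of `permits`
     * connections, each holding one for `holdMillis`, and returns the largest
     * number of callers observed queued for a connection at one time.
     */
    static int peakWaiters(int permits, int callers, long holdMillis) {
        Semaphore pool = new Semaphore(permits, true); // fair: FIFO hand-out
        AtomicInteger peak = new AtomicInteger();
        CountDownLatch start = new CountDownLatch(1);
        ExecutorService workers = Executors.newFixedThreadPool(callers);
        for (int i = 0; i < callers; i++) {
            workers.submit(() -> {
                try {
                    start.await();          // release all callers together
                    pool.acquire();         // blocks while the pool is exhausted
                    try {
                        // record how many callers are still queued behind us
                        peak.accumulateAndGet(pool.getQueueLength(), Math::max);
                        Thread.sleep(holdMillis); // simulated remote call
                    } finally {
                        pool.release();
                    }
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
            });
        }
        start.countDown();
        workers.shutdown();
        try {
            if (!workers.awaitTermination(30, TimeUnit.SECONDS)) {
                throw new IllegalStateException("simulation did not finish in time");
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
        return peak.get();
    }

    public static void main(String[] args) {
        System.out.println("peak waiters, 2 connections:  " + peakWaiters(2, 50, 20));
        System.out.println("peak waiters, 50 connections: " + peakWaiters(50, 50, 20));
    }
}
```

With two permits almost every caller spends its time queued, which matches the picture of thousands of threads waiting for a connection; with as many permits as callers, nobody queues.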

At that point I thought an open-source framework could not be that unreliable; surely it could not be impossible to change the size of a pool. Because the XFire code calls HttpClient internally, any change had to start from the XFire code, to see whether it offered a solution. After half a day of decompiling and tracing the XFire code, I finally saw a glimmer of hope.


It turns out that XFire sets the maximum number of connections per host to 6 by default, and the global maximum number of connections to 20. The key, however, is getIntValue(). Looking at this method, the XFire service has a parameter Map in which these two parameters can be set, which means I can change the number of connections in the HTTP connection pool of MultiThreadedHttpConnectionManager. With that hope I kept tracing and finally found the relevant method in the service of the XFire instance.


This HashMap is the carrier of the entire XFire context. I added the two parameters that control the HTTP connection pool to this HashMap to increase the number of pooled connections.
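The lookup-with-default pattern described here can be sketched in plain Java. This is an illustration under assumptions, not the XFire source: the class `PoolSizeLookup` and this `getIntValue` signature are hypothetical, the two key strings are what I believe XFire's CommonsHttpMessageSender uses and should be checked against your XFire version, and the override values 200 and 400 are made up. Only the defaults of 6 and 20 come from the text above.

```java
import java.util.HashMap;
import java.util.Map;

public class PoolSizeLookup {

    // Assumed key names; verify against the XFire version in use.
    static final String MAX_CONN_PER_HOST = "max.connections.per.host";
    static final String MAX_TOTAL_CONNECTIONS = "max.total.connections";

    // Defaults as reported in the text: 6 per host, 20 in total.
    static final int DEFAULT_MAX_CONN_PER_HOST = 6;
    static final int DEFAULT_MAX_TOTAL_CONNECTIONS = 20;

    /** Reads an int setting from the context map, falling back to a default when absent. */
    static int getIntValue(Map<String, Object> context, String key, int defaultValue) {
        Object value = context.get(key);
        if (value == null) {
            return defaultValue;
        }
        return Integer.parseInt(value.toString().trim());
    }

    public static void main(String[] args) {
        Map<String, Object> context = new HashMap<>();

        // Nothing configured: the defaults apply.
        System.out.println(getIntValue(context, MAX_CONN_PER_HOST, DEFAULT_MAX_CONN_PER_HOST));
        System.out.println(getIntValue(context, MAX_TOTAL_CONNECTIONS, DEFAULT_MAX_TOTAL_CONNECTIONS));

        // Hypothetical overrides, sized for a singleton shared by many threads.
        context.put(MAX_CONN_PER_HOST, "200");
        context.put(MAX_TOTAL_CONNECTIONS, "400");
        System.out.println(getIntValue(context, MAX_CONN_PER_HOST, DEFAULT_MAX_CONN_PER_HOST));
        System.out.println(getIntValue(context, MAX_TOTAL_CONNECTIONS, DEFAULT_MAX_TOTAL_CONNECTIONS));
    }
}
```

The point of the pattern is that the pool size becomes a configuration concern: nothing in the framework has to be patched, only the context map populated before the HTTP client is first created.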


After restarting the system, the thread count returned to a stable level. Over two days of monitoring it remained stable, and the bottleneck was resolved.

The XFire flowchart shows how XFire works; it helped me find the solution.




