Uneven process distribution across CPU cores causes TCP connections to accumulate


I handled a customer incident today that is fairly representative of web troubleshooting, so I am sharing it here.

Background:

The customer runs an e-commerce site. The architecture is a front-end load balancer using a round-robin algorithm, behind which sit two web servers running Gunicorn (a Python WSGI HTTP server for Unix).

Failure phenomena:

Users complained that the website responded slowly and that many pages would not open. System monitoring showed that the TCP connection counts on the two back-end servers were unequal: over a few tens of minutes, the connection-count curve of one server drifted further and further away from that of the other, normal server. At the worst point, the failing server had three times as many connections as the normal one. Since the front-end load balancer always distributes requests evenly (round robin), the gap in connection counts was not caused by the load balancer.

Troubleshooting process:

We ran the top command on both servers and compared the performance data. The differences were clear: the normal server's load average was about 2, while the failing server's was close to 6. On the memory side, the normal server had 64 GB of physical memory with roughly 50 GB in use and no swap in use; the failing server had 32 GB of physical memory, all of it in use, plus more than 500 MB of swap.
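
For reference, the same load and memory picture can be captured from a shell with a few standard commands (an illustrative sketch, not the exact commands used during the incident):

uptime                 # 1/5/15-minute load averages, the figures compared above
free -h                # physical memory and swap usage in human-readable units
vmstat 1 5             # the si/so columns show whether the machine is actively swapping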

Our preliminary judgment was that the failure occurred because that server had only 32 GB of memory; with memory short, requests were not processed promptly, hence the high connection count and load average. We then checked the connection counts on both servers with the ss -s command: the normal server had about 2,800 TCP connections, the failing server about 3,600. This also matched the monitoring graph.
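
If you want to reproduce the check, ss prints a per-protocol summary, and a state filter counts established connections directly (illustrative commands; the exact totals will naturally differ):

ss -s                               # per-protocol summary, the totals quoted above
ss -nt state established | wc -l    # established TCP connections (output includes one header line)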

The failing server's higher number of concurrent connections was therefore not caused by the front-end load balancer distributing requests unevenly, but by the server processing requests more slowly than the normal one, so more connections stayed alive at any moment. This also made the failure progressively worse.

We therefore recommended that the customer upgrade the failing server to 64 GB of memory, matching the other server. Since this is a public cloud service, a configuration upgrade is quick. After the upgrade, the connection counts, memory usage, and other metrics of the two servers were very close, and users no longer ran into pages that would not open. That seemed to be the end of the matter.

In the afternoon, however, the customer's site traffic was much heavier than in the morning. The customer found that the load averages of the two machines still differed considerably (0.8 on one, 2.0 on the other) and asked us to find the reason for the inconsistency, worried that the servers would not hold up during the next big promotion.

So we could not just treat memory as the root cause. Since the load averages were inconsistent, the difference had to be related to CPU load. The output of top showed that the overall CPU usage figures were basically the same on both servers. Because both machines have 16 cores, we focused on how processes were distributed across the cores. We learned that the customer was running the Gunicorn WSGI server, so we looked at how that service was running. Running ps -ef | grep gunicorn | wc -l printed 35; subtracting the grep process itself, the customer runs 34 Gunicorn processes on each server. With 16 cores, that averages roughly two Gunicorn processes per core. We then looked at how the processes were distributed across the CPU cores. For this you can use the ps -eF command, whose seventh column (PSR) is the CPU core number.
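
As an aside, the grep process itself appears in that pipeline, which is why 35 was printed for 34 workers. Two illustrative ways to count without it (not from the original post):

ps -ef | grep '[g]unicorn' | wc -l    # the bracket pattern keeps grep itself out of the match
pgrep -fc gunicorn                    # or let pgrep match the full command line and count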

[Screenshot from the original post: ps -eF output with the per-process CPU core (PSR) column]

In this environment, run the following command:

ps -eF | grep gunicorn | sort -k7 -n

This time we could see that on the normal server the 34 Gunicorn processes were spread out, with roughly 3 to 5 processes on each of several cores (many cores ran no Gunicorn at all), while on the suspect server 13 Gunicorn processes were concentrated on a single core. That obviously overloads one core, which in turn drags down overall processing efficiency.
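
A compact way to see this skew is to tally workers per core. The following one-liner (an illustrative sketch, not taken from the original post) prints how many Gunicorn processes sit on each core, busiest first:

ps -eF | grep '[g]unicorn' | awk '{print $7}' | sort -n | uniq -c | sort -rn
# left column: number of processes, right column: CPU core (PSR)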

Once the cause was found, the fix was clear: spread these processes evenly across the CPU cores. This can be done with a shell script (sketched below); anyone interested can look into it. It relies on Linux CPU affinity. Reference: http://www.ibm.com/developerworks/cn/linux/l-cn-linuxkernelint/
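
The original post does not include the script, but a minimal sketch of the idea, assuming the taskset utility from util-linux is available and that every matching PID should be pinned round-robin to one core, could look like this:

#!/bin/bash
# Illustrative only: pin each Gunicorn process to one core, cycling through all cores.
cores=$(nproc)                            # number of CPU cores, 16 in this case
i=0
for pid in $(pgrep -f gunicorn); do
    taskset -cp "$((i % cores))" "$pid"   # set this PID's CPU affinity to a single core
    i=$((i + 1))
done

Hard-pinning each process to one core is only one option; widening the mask to all cores (for example taskset -cp 0-15 on each PID) and letting the scheduler rebalance would also remove the skew.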

After the script ran, the distribution of the processes improved noticeably. Checking system performance again with top and the TCP connection counts with ss -s, the two servers were almost identical. The customer also reported that the "page cannot open" complaints had disappeared, so the shopaholics could enjoy their online shopping that evening.

This article is from the "Website Operation Technology Exchange" blog. Please keep this source when reposting: http://victor1980.blog.51cto.com/3664622/1682832
