Server time_wait and close_wait understanding and solutions

Last Update:2016-04-04 Source: Internet

Author: User

From: http://blog.csdn.net/shootyou/article/details/6622226

Yesterday I resolved a server exception caused by an HttpClient call error. The details are here:

http://blog.csdn.net/shootyou/article/details/6615051

The analysis in that post mentions that, by checking the server's network status, I found a large number of connections in the CLOSE_WAIT state.

The following command is often used during routine server maintenance:

netstat -n | awk '/^tcp/ {++S[$NF]} END {for (a in S) print a, S[a]}'

It displays information such as the following:

TIME_WAIT 814
CLOSE_WAIT 1
FIN_WAIT1 1
ESTABLISHED 634
SYN_RECV 2
LAST_ACK 1

The three most common states are: ESTABLISHED means the connection is communicating, TIME_WAIT means this side closed the connection actively, and CLOSE_WAIT means this side is being closed passively.
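The same per-state tally can be sketched in Python, which is convenient when the counts feed a monitoring script. This is a minimal sketch, not the author's tooling: the function name `count_tcp_states` and the sample addresses are made up for illustration, and it assumes input laid out like `netstat -n` output, with the state in the last column.

```python
from collections import Counter


def count_tcp_states(netstat_output: str) -> Counter:
    """Count TCP connections per state from `netstat -n` style text."""
    counts = Counter()
    for line in netstat_output.splitlines():
        fields = line.split()
        # Only lines whose first field starts with "tcp" describe TCP sockets.
        # The state is the last field, which mirrors $NF in the awk version.
        if fields and fields[0].startswith("tcp"):
            counts[fields[-1]] += 1
    return counts


# Hypothetical sample input; the addresses are placeholders.
sample = """\
Proto Recv-Q Send-Q Local Address           Foreign Address         State
tcp        0      0 10.0.0.5:80             10.0.0.9:51812          TIME_WAIT
tcp        0      0 10.0.0.5:80             10.0.0.9:51813          TIME_WAIT
tcp        0      0 10.0.0.5:80             10.0.0.7:40022          ESTABLISHED
tcp        0      0 10.0.0.5:43110          10.0.0.8:80             CLOSE_WAIT
"""

for state, count in sorted(count_tcp_states(sample).items()):
    print(state, count)
```

On a real server the text would come from running `netstat -n` and capturing its output instead of the `sample` string.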

What each state means needs little explanation; the following diagram makes it clear. Note that the "server" here means the party that accepts and processes business requests:

You don't have to remember all of these states; knowing the three most common ones mentioned above is enough. You usually check the network status only as a last resort, and when the server does misbehave, 80% or 90% of the time it is one of the following two situations:

1. The server holds a large number of connections in the TIME_WAIT state

2. The server holds a large number of connections in the CLOSE_WAIT state

Because the number of file handles Linux assigns to a user is limited (see http://blog.csdn.net/shootyou/article/details/6579139), connections that stay in TIME_WAIT or CLOSE_WAIT keep a corresponding number of handles occupied without doing any useful work. Once the maximum number of handles is reached, new requests can no longer be processed, a flood of "Too many open files" exceptions follows, and Tomcat crashes...
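To see how close a process is to its handle limit, something like the following can help. It is a rough sketch under stated assumptions: the helper name `fd_usage` is invented here, and counting the entries of `/proc/self/fd` (or `/dev/fd` as a fallback) only works on Unix-like systems.

```python
import os
import resource


def fd_usage():
    """Return (open_fds, soft_limit, hard_limit) for the current process."""
    soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
    # Each entry in this directory is one open descriptor of this process.
    # /proc/self/fd is Linux-specific; /dev/fd is a fallback on other Unixes.
    fd_dir = "/proc/self/fd" if os.path.isdir("/proc/self/fd") else "/dev/fd"
    open_fds = len(os.listdir(fd_dir))
    return open_fds, soft, hard


open_fds, soft, hard = fd_usage()
print(f"open descriptors: {open_fds}, soft limit: {soft}, hard limit: {hard}")
```

When `open_fds` approaches the soft limit, the next `accept()` or `open()` is the one that fails with "Too many open files".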

Below I discuss how to handle the two cases. Much of the material on the Internet confuses them and assumes that tuning the system's kernel parameters solves both. That is not quite right: tuning kernel parameters may easily resolve TIME_WAIT, but dealing with CLOSE_WAIT has to start from the program itself. Let's look at the two situations separately:

1. The server holds a large number of connections in the TIME_WAIT state

This situation is the more common one. Crawler servers and web servers often run into it (if the system administrator did not tune the kernel parameters at installation time). How does the problem arise?

As the diagram above shows, TIME_WAIT is the state held by the side that actively closes the connection. A crawler server is itself the "client": after completing a crawl task it initiates the close and thus enters TIME_WAIT. It stays in this state for 2MSL (twice the maximum segment lifetime) and only then closes completely and releases the resources. Why is it done this way? The connection has already been closed actively, so why keep the resources for a while? This is by design in TCP/IP, mainly for the following two reasons:

1. To prevent packets from the previous connection that went astray from reappearing and affecting a new connection (after 2MSL, all duplicate packets from the previous connection will have disappeared).
2. To close the TCP connection reliably. The last ACK that the active closer sends (in reply to the peer's FIN) may be lost, in which case the passive side resends its FIN. If the active side were already in the CLOSED state, it would respond with RST rather than ACK, so the active side has to be in TIME_WAIT, not CLOSED. In addition, by design TIME_WAIT resources are reclaimed periodically and do not occupy many resources, unless a large number of requests arrive in a short period or the server is under attack.

The following passage explains the MSL:

The MSL is the maximum time a TCP segment (a block of TCP data) may take to travel from source to destination, that is, how long a network packet can survive on the Internet. RFC 793, which defines TCP, dates from 1981, when the Internet was far slower than it is now, and it sets the MSL to 2 minutes, so 2MSL is 4 minutes. Can you imagine entering a URL in your browser and waiting 4 minutes for the first byte? This is almost impossible in today's network environment, so we can greatly shorten the time spent in the TIME_WAIT state and let connection ports be freed for other connections more quickly.

Another passage, quoted from an online resource:

It is worth noting that for HTTP, which runs over TCP, it is the server side that closes the TCP connection, so it is the server side that enters the TIME_WAIT state. A high-traffic web server will therefore hold a large number of TIME_WAIT connections. If the server receives 1000 requests per second, a backlog of 240*1000=240,000 TIME_WAIT records accumulates (each record lasts 240 seconds), and maintaining them is a burden on the server. Modern operating systems do use fast lookup algorithms to manage these TIME_WAIT records, so checking whether a new TCP connection request hits one does not take much time, but keeping so many states around is never good.
HTTP version 1.1 makes keep-alive the default behavior, that is, it reuses one TCP connection to carry multiple request/response exchanges, and this problem is one of the main reasons for that.
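The backlog figure in the quoted passage is simply the close rate multiplied by the 2MSL hold time. A small sketch of that arithmetic, assuming the RFC 793 value of 120 seconds for the MSL (the function name `time_wait_backlog` is invented for illustration):

```python
MSL_SECONDS = 120  # MSL value from RFC 793, so 2MSL is 240 seconds


def time_wait_backlog(closes_per_second, msl_seconds=MSL_SECONDS):
    """Steady-state number of TIME_WAIT records.

    Each closed connection lingers for 2 * MSL seconds, so at a constant
    close rate that many seconds' worth of connections are held at once.
    """
    return closes_per_second * 2 * msl_seconds


print(time_wait_backlog(1000))  # 1000 closes/s held for 240 s each
```

A system that holds TIME_WAIT for a shorter period scales the backlog down proportionally; passing a smaller `msl_seconds` shows the effect.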

In other words, HTTP interaction differs from the diagram above: the side that closes the connection is the server, not the client, so web servers also accumulate a lot of TIME_WAIT connections.
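The keep-alive reuse mentioned in the quoted passage can be observed with Python's standard library. The sketch below is illustrative only: the class names `OkHandler` and `CountingServer` and the function `fetch_twice_on_one_connection` are made up here. It starts a throwaway HTTP/1.1 server on the loopback interface, sends two requests over one client connection, and counts how many TCP connections the server accepted.

```python
import http.client
import threading
from http.server import BaseHTTPRequestHandler, ThreadingHTTPServer


class OkHandler(BaseHTTPRequestHandler):
    protocol_version = "HTTP/1.1"  # lets the handler keep the connection open

    def do_GET(self):
        body = b"ok"
        self.send_response(200)
        # Content-Length is needed so the client knows where the body ends
        # without the server closing the connection.
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, *args):
        pass  # keep the demo quiet


class CountingServer(ThreadingHTTPServer):
    accepted = 0

    def get_request(self):
        request = super().get_request()
        self.accepted += 1  # one increment per accepted TCP connection
        return request


def fetch_twice_on_one_connection():
    server = CountingServer(("127.0.0.1", 0), OkHandler)
    threading.Thread(target=server.serve_forever, daemon=True).start()
    conn = http.client.HTTPConnection("127.0.0.1", server.server_address[1], timeout=5)
    try:
        statuses = []
        for _ in range(2):
            conn.request("GET", "/")
            response = conn.getresponse()
            response.read()  # drain the body before reusing the connection
            statuses.append(response.status)
        return statuses, server.accepted
    finally:
        conn.close()
        server.shutdown()
        server.server_close()


statuses, accepted = fetch_twice_on_one_connection()
print(statuses, "connections accepted:", accepted)
```

Two requests share one accepted connection, so only one close, and therefore only one TIME_WAIT record, results from the pair.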

Now, how do we solve this problem?

The solution is simple: let the server reclaim and reuse those TIME_WAIT resources quickly.

Here are the changes our system administrator made to the /etc/sysctl.conf file:

# For a new connection, how many SYN requests the kernel sends before giving up. Should not exceed 255; the default is 5, which corresponds to about 180 seconds.
net.ipv4.tcp_syn_retries=2
#net.ipv4.tcp_synack_retries=2
# When keepalive is enabled, how long a connection stays idle before TCP sends keepalive messages. The default is 2 hours; it is lowered to 1200 seconds here.
net.ipv4.tcp_keepalive_time=1200
net.ipv4.tcp_orphan_retries=3
# If the socket is closed by the local end, this parameter determines how long it stays in the FIN-WAIT-2 state.
net.ipv4.tcp_fin_timeout=30
# Length of the SYN queue. The default is 1024; a longer queue (4096 here) can hold more connections waiting to be established.
net.ipv4.tcp_max_syn_backlog = 4096
# Enables SYN cookies. When the SYN wait queue overflows, cookies are used to protect against small-scale SYN attacks. The default is 0 (off).
net.ipv4.tcp_syncookies = 1
# Enables reuse: allows TIME-WAIT sockets to be reused for new TCP connections. The default is 0 (off).
net.ipv4.tcp_tw_reuse = 1
# Enables fast recycling of TIME-WAIT sockets. The default is 0 (off). Note: this option is known to break connections from clients behind NAT and was removed in Linux 4.12.
net.ipv4.tcp_tw_recycle = 1
# Reduce the number of probes sent before timing out
net.ipv4.tcp_keepalive_probes=5
# Tune the network device receive queue
net.core.netdev_max_backlog=3000

Run /sbin/sysctl -p after the modification to make the parameters take effect.
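To confirm that the values took effect, the running kernel's settings can be read back from `/proc/sys`, where each dotted sysctl key maps to a file path. This is a sketch under assumptions: the helper names `sysctl_path` and `read_sysctl` are invented here, the mapping ignores the rare keys that contain a dot inside one component, and it only works on Linux.

```python
def sysctl_path(key: str) -> str:
    """Map a dotted sysctl key to its file under /proc/sys (Linux only)."""
    return "/proc/sys/" + key.replace(".", "/")


def read_sysctl(key: str):
    """Return the current value as a string, or None if the key is absent."""
    try:
        with open(sysctl_path(key)) as handle:
            return handle.read().strip()
    except OSError:
        # The parameter does not exist on this kernel or is not readable.
        return None


for key in (
    "net.ipv4.tcp_tw_reuse",
    "net.ipv4.tcp_fin_timeout",
    "net.ipv4.tcp_keepalive_time",
):
    print(key, "=", read_sysctl(key))
```

A result of `None` means the running kernel does not expose that parameter, which is worth checking before relying on a setting copied from an older configuration.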

The main parameters to note here are:

net.ipv4.tcp_tw_reuse
net.ipv4.tcp_tw_recycle
net.ipv4.tcp_fin_timeout
net.ipv4.tcp_keepalive_*

net.ipv4.tcp_tw_reuse and net.ipv4.tcp_tw_recycle are turned on to reclaim resources held in the TIME_WAIT state.

net.ipv4.tcp_fin_timeout shortens the time the server takes to go from FIN-WAIT-2 to TIME_WAIT in abnormal cases.

The net.ipv4.tcp_keepalive_* parameters configure how the server probes whether a connection is still alive.

For more on using keepalive, see: http://hi.baidu.com/tantea/blog/item/580b9d0218f981793812bb7b.html
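The kernel-wide keepalive settings only apply to sockets that have keepalive switched on, and a program can also override them per socket. A minimal sketch, assuming a Linux-like platform: the helper name `enable_keepalive` is made up, the `idle` and `probes` defaults mirror the sysctl values above, and the `interval` of 75 seconds is only an illustrative choice.

```python
import socket


def enable_keepalive(sock, idle=1200, interval=75, probes=5):
    """Turn on TCP keepalive for one socket and, where supported, tune it."""
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1)
    # The per-socket tuning options are not available on every platform,
    # so each one is applied only if the socket module defines it.
    for name, value in (
        ("TCP_KEEPIDLE", idle),       # idle seconds before the first probe
        ("TCP_KEEPINTVL", interval),  # seconds between probes
        ("TCP_KEEPCNT", probes),      # failed probes before giving up
    ):
        option = getattr(socket, name, None)
        if option is not None:
            sock.setsockopt(socket.IPPROTO_TCP, option, value)


sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
enable_keepalive(sock)
print("keepalive on:", sock.getsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE) != 0)
sock.close()
```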

2. The server holds a large number of connections in the CLOSE_WAIT state

Take a break and catch your breath. I started out only wanting to explain the difference between TIME_WAIT and CLOSE_WAIT and did not expect to dig this deep. That is one benefit of summarizing things in a blog: there is always an unexpected reward.

The TIME_WAIT state can be dealt with by tuning server parameters, because TIME_WAIT is something the server itself controls: it comes either from the peer's connection ending abnormally or from the server not reclaiming resources quickly enough. In short, it is not caused by a bug in your own program.

CLOSE_WAIT is different. As the diagram above shows, a connection stays in CLOSE_WAIT in only one situation: after the peer closed the connection, the server program did not go on to close its own end, so its FIN was never sent. In other words, the program either did not detect that the peer had closed the connection or simply forgot to close the connection itself, so the resource stays occupied by the program. In my view this situation cannot be resolved through kernel parameters: the system has no right to reclaim resources held by a program unless the program terminates.
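At the socket level, the fix amounts to noticing that the peer has closed and then closing our own end. The following is a minimal sketch over a loopback connection; the function names `read_until_peer_closes` and `demo_peer_closes_first` are invented for illustration. An empty result from `recv` signals that the peer's FIN has arrived, and the `finally` block guarantees that `close()` runs, which is what moves the socket out of CLOSE_WAIT.

```python
import socket


def read_until_peer_closes(sock):
    """Read everything the peer sends, then always close our side."""
    chunks = []
    try:
        while True:
            data = sock.recv(4096)
            if not data:
                # b"" means the peer has closed; from here on the connection
                # sits in CLOSE_WAIT until this program closes the socket.
                break
            chunks.append(data)
    finally:
        sock.close()  # sends our FIN and releases the descriptor
    return b"".join(chunks)


def demo_peer_closes_first():
    listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    listener.bind(("127.0.0.1", 0))
    listener.listen(1)
    client = socket.create_connection(listener.getsockname())
    peer, _ = listener.accept()
    peer.sendall(b"bye")
    peer.close()      # the peer closes first, so `client` is the passive side
    listener.close()
    payload = read_until_peer_closes(client)
    return payload, client.fileno()  # fileno() is -1 once the socket is closed


print(demo_peer_closes_first())
```

Leaving out the `finally` clause, or returning early on an error path without closing, reproduces the stuck CLOSE_WAIT connections described above.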

If you are using HttpClient and are seeing a lot of CLOSE_WAIT, this post may be useful to you: http://blog.csdn.net/shootyou/article/details/6615051

In that post I used a scenario to illustrate the difference between CLOSE_WAIT and TIME_WAIT. Here it is again:

Server A is a crawler server. It uses a simple HttpClient to request resources from Apache on server B. Under normal circumstances, if the request succeeds, server A actively closes the connection after fetching the resource, and the connection on server A shows up as TIME_WAIT. What if something goes wrong? Suppose the requested resource does not exist on server B. Then server B issues the close, and server A is the side being closed passively. If the programmer forgot to have HttpClient release the connection after that passive close, the connection stays in CLOSE_WAIT.
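The scenario can be imitated with Python's standard library rather than the Java HttpClient the post refers to, so treat it as an analogy and not the author's code. The names `MissingResourceHandler`, `fetch_status` and `demo_missing_resource` are made up: a throwaway local server plays server B and answers 404, and the client releases its connection in a `finally` block so that the error path cannot leave it behind.

```python
import http.client
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer


class MissingResourceHandler(BaseHTTPRequestHandler):
    """Stands in for server B: every resource is reported as missing."""

    def do_GET(self):
        self.send_error(404)  # the server closes the connection afterwards

    def log_message(self, *args):
        pass  # keep the demo quiet


def fetch_status(host, port, path):
    """Plays server A: request a resource and always release the connection."""
    conn = http.client.HTTPConnection(host, port, timeout=5)
    try:
        conn.request("GET", path)
        response = conn.getresponse()
        response.read()
        return response.status
    finally:
        # Runs on success, on a 404, and on exceptions alike.
        conn.close()


def demo_missing_resource():
    server = HTTPServer(("127.0.0.1", 0), MissingResourceHandler)
    threading.Thread(target=server.serve_forever, daemon=True).start()
    try:
        return fetch_status("127.0.0.1", server.server_address[1], "/missing")
    finally:
        server.shutdown()
        server.server_close()


print(demo_missing_resource())
```

The important part is the `finally` block in `fetch_status`: the release does not depend on the request having succeeded.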

So the solution to a large number of CLOSE_WAIT connections comes down to one sentence: check the code, because the problem lies in the server program itself.

This article is an English version of an article which is originally in the Chinese language on aliyun.com and is provided for information purposes only. This website makes no representation or warranty of any kind, either expressed or implied, as to the accuracy, completeness ownership or reliability of the article or any translations thereof. If you have any concerns or complaints relating to the article, please send an email, providing a detailed description of the concern or complaint, to info-contact@alibabacloud.com. A staff member will contact you within 5 working days. Once verified, infringing content will be removed immediately.
