Handling TIME_WAIT and CLOSE_WAIT in an application environment



Yesterday I resolved a server exception caused by an HttpClient call error; the troubleshooting process is written up here:

http://blog.csdn.net/shootyou/article/details/6615051

During that analysis, inspecting the server's network state revealed a large number of connections in the CLOSE_WAIT state.


The following command is frequently used in day-to-day server maintenance:

netstat -n | awk '/^tcp/ {++S[$NF]} END {for (a in S) print a, S[a]}'

It will display information such as the following:

TIME_WAIT 814
CLOSE_WAIT 1
FIN_WAIT1 1
ESTABLISHED 634
SYN_RECV 2
LAST_ACK 1

The three states you will see most often are: ESTABLISHED, which means the connection is communicating; TIME_WAIT, which means this side actively closed the connection; and CLOSE_WAIT, which means this side was passively closed by the peer.
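On newer systems where netstat is deprecated, ss can produce the same tally. A minimal equivalent sketch, assuming the iproute2 ss tool is installed (note that ss prints slightly different state names, such as ESTAB and TIME-WAIT):

# Count sockets per TCP state; the first column of `ss -ant` is the state
ss -ant | awk 'NR > 1 {++s[$1]} END {for (a in s) print a, s[a]}'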


What each state means is easiest to see from the TCP state diagram; note that the "server" here refers to the party that accepts and processes the business requests:


Connection shutdown state transitions:
1. Both sides start in the ESTABLISHED state. The server sends a FIN packet to the client first, and the server enters the FIN_WAIT_1 state.
2. The client acknowledges the server's FIN by sending back an ACK, and the client enters the CLOSE_WAIT state.
3. On receiving that ACK from the client, the server enters the FIN_WAIT_2 state.
4. The client is now in the passive-close phase: its operating system waits for the application to close its side of the connection. Once the application closes it, the client sends its own FIN packet to the server.
5. When the server receives that FIN, it sends an ACK back to the client and then enters the famous TIME_WAIT state. Although the connection is closed, it cannot be guaranteed that the final ACK arrived, nor that every old segment from the connection has drained from the network (packets do not necessarily arrive in order), hence the TIME_WAIT state: the server keeps waiting in case packets sent by the client are still in flight. This state lasts 2*MSL, where MSL is the maximum time a TCP segment can survive on the network; typically 2*MSL = 240 seconds.
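If you want to watch this shutdown handshake live, tcpdump can filter for just the FIN and RST segments. A hedged sketch; port 80 is an assumption, substitute your own service port:

# Show only TCP segments with the FIN or RST flag set on port 80, without name resolution
tcpdump -nn 'tcp port 80 and tcp[tcpflags] & (tcp-fin|tcp-rst) != 0'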

In practice, FIN_WAIT_1, FIN_WAIT_2, and CLOSE_WAIT in large numbers are not a normal phenomenon; only ESTABLISHED and TIME_WAIT are normal.

If you find a large number of connections stuck in FIN_WAIT states, it may be a DDoS attack, or a recent change to the program may have caused the problem; if the client IP distribution looks normal, the latter is more likely, and you should contact the programmers to resolve it.

There is no need to memorize every state; knowing the meaning of the three common states above is enough. You generally will not need to inspect network state until things go wrong, and when a server misbehaves, 80 to 90 percent of the time it is one of the following two situations:

1. The server maintains a large number of TIME_WAIT states

2. The server maintains a large number of CLOSE_WAIT states

Because Linux limits the number of file handles allocated to each user (see: http://blog.csdn.net/shootyou/article/details/6579139), connections that stay in TIME_WAIT or CLOSE_WAIT keep the corresponding channels occupied, like a dog in the manger. Once the handle limit is reached, new requests can no longer be processed, followed by a flood of "Too many open files" exceptions and a Tomcat crash...
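To check how close a process is to its handle limit, a quick sketch (the PID 1234 is a placeholder for your own Tomcat process):

# Open-file limit for the current shell/user
ulimit -n
# System-wide maximum number of file handles
cat /proc/sys/fs/file-max
# Number of file descriptors a given process currently holds (PID is hypothetical)
ls /proc/1234/fd | wc -l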

Here's how to deal with these two situations.

Much of the material online conflates the handling of these two cases, claiming that tuning kernel parameters solves both. That is not quite right: tuning kernel parameters may easily resolve TIME_WAIT, but dealing with CLOSE_WAIT requires starting from the program itself. Let's look at how to handle each of the two situations:


1. The server maintains a large number of TIME_WAIT states

This situation is quite common. Crawler servers, and web servers whose administrators did not tune kernel parameters at install time, run into it frequently. How does this problem arise?

As the diagram above shows, TIME_WAIT is the state of the side that actively closed the connection. A crawler server is itself the "client": after finishing a crawl task, it initiates the active close of the connection and enters TIME_WAIT, maintains that state for 2*MSL (max segment lifetime), and only then fully closes the connection and reclaims its resources. Why do that? Why keep resources around after having actively closed the connection? The designers of TCP/IP had two main considerations:

1. To prevent packets from the previous connection from getting lost and reappearing later, where they would interfere with a new connection (after 2*MSL, every duplicate segment from the old connection has disappeared).
2. To close the TCP connection reliably. The final ACK sent by the active closer (acknowledging the peer's FIN) may be lost, in which case the passive side retransmits its FIN; if the active side were already in CLOSED, it would respond with an RST rather than an ACK. So the active side must remain in TIME_WAIT rather than going straight to CLOSED. Sockets in TIME_WAIT are also reclaimed on a regular schedule and do not normally tie up many resources, unless a large number of requests, or an attack, arrives in a short period.

For reference, here is a passage on MSL:

MSL is the time a TCP segment (a block of TCP network data) can continue to exist between its source and destination, that is, how long a packet can survive on the network. TCP was defined in RFC 793 back in 1981, when network speeds were nowhere near today's; imagine entering a URL in your browser and waiting four minutes for the first byte to appear. That is almost impossible in today's web environment, so we can drastically shorten the TIME_WAIT duration, letting ports be freed for new connections much more quickly than before.


And a related passage from the web:

It is worth noting that for HTTP, which runs over TCP, it is the server side that closes the TCP connection, so the server ends up in the TIME_WAIT state. For a heavily accessed web server this adds up: if the server handles 1000 requests per second, a backlog of 240 * 1000 = 240,000 TIME_WAIT records accumulates, and maintaining these states is a burden on the server. Modern operating systems do use fast lookup algorithms to manage them, so deciding whether a new TCP connection request hits a TIME_WAIT entry is not very expensive, but having so many states to maintain is still bad.


HTTP/1.1 makes keep-alive the default behavior, meaning a single TCP connection is reused to carry multiple request/response exchanges; the discovery of this very problem was one of the main reasons for that design.
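You can observe keep-alive reuse with curl: when it fetches two URLs from the same host in one invocation, the verbose output reports that the existing connection is re-used. A small sketch, with example.com standing in for a real host:

# Fetch the same host twice; look for the "Re-using existing connection" line
curl -sv http://example.com/ http://example.com/ -o /dev/null -o /dev/null 2>&1 | grep -iE 'connected|re-using'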


In other words, HTTP interaction differs from the diagram above: the party that closes the connection is not the client but the server, so a web server will also accumulate a large number of TIME_WAIT states.
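To see which remote peers account for a TIME_WAIT backlog, a one-liner sketch (the addresses printed include the port):

# Top remote addresses holding TIME_WAIT sockets
netstat -n | awk '/TIME_WAIT/ {print $5}' | sort | uniq -c | sort -rn | head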
Now, how do we solve this problem?
The solution is simply to let the server reclaim and reuse those TIME_WAIT resources more quickly.
Here are the changes our sysadmin made to /etc/sysctl.conf:
# For a new connection, how many SYN retransmissions the kernel sends before giving up;
# should not be greater than 255. The default is 5, corresponding to roughly 180 seconds.
net.ipv4.tcp_syn_retries=2
# Number of SYN+ACK retransmissions
#net.ipv4.tcp_synack_retries=2
# When keepalive is enabled, how often TCP sends keepalive messages. The default is 2 hours.
net.ipv4.tcp_keepalive_time=1200

# How many retries before the local end drops a TCP connection. The default is 7, equivalent to
# 50 seconds to 16 minutes depending on the RTO. If your system is a heavily loaded web server,
# consider lowering this value, as such sockets can be expensive. See also tcp_max_orphans.
# (When NAT is involved, lowering this value is significant; in my own environment I lowered it to 3.)
# tcp_orphan_retries mainly applies to orphaned sockets (sockets already detached from the process
# context but with cleanup work still to do); it is the maximum number of retries for such sockets.
# The default is 7; setting it to 0 means do not retry at all. A recent test, however, found that
# with 0 the socket stays in FIN_WAIT_1 forever and is never reset after the retransmission timeout.
# On Linux, that is the effect of setting tcp_orphan_retries to 0.

# Looking through the documentation, I found this explanation (from the kernel source):

/* Do not allow orphaned sockets to eat all our resources.
 * This is direct violation of TCP specs, but it is required
 * to prevent DoS attacks. It is called when a retransmission timeout
 * or zero probe timeout occurs on orphaned socket.
 *
 * Criteria is still not confirmed experimentally and may change.
 * We kill the socket, if:
 * 1. If number of orphaned sockets exceeds an administratively configured
 *    limit.
 * 2. If we have strong memory pressure.
 */

Set to 0, it doesn't mean "try forever"; it means "don't try at all". This is the server trying to politely tell the client that the server is getting ready to close its socket, and would it please do an orderly disconnect, or send some more data, which would be wonderful. It'll try X times to get the client to respond, and after X, it reclaims the socket on the system side.
Setting that number to 0 would suggest to me that the system is heavily utilized, with a zero tolerance policy for orphans. It may also have been a response to a DDoS: a lot of DDoSes work by opening a socket connection and then doing nothing with it.

net.ipv4.tcp_orphan_retries=3
# If the socket is closed by the local end, this parameter determines how long it remains in the
# FIN-WAIT-2 state. Note: tcp_fin_timeout governs FIN-WAIT-2, not the 2*MSL TIME_WAIT duration.
net.ipv4.tcp_fin_timeout=30
# Length of the SYN queue (the half-open connection queue: connections that have received a SYN
# and replied with a SYN+ACK, but have not yet received the client's ACK). The default is 1024;
# increasing the queue length accommodates more connections waiting to complete the handshake.
net.ipv4.tcp_max_syn_backlog = 4096
# Enable SYN cookies. When the SYN wait queue overflows, cookies are used to handle the overflow,
# which protects against small-scale SYN flood attacks. The default is 0 (disabled).
net.ipv4.tcp_syncookies = 1

# Enable reuse: allow TIME-WAIT sockets to be reused for new TCP connections. Default is 0 (disabled).
net.ipv4.tcp_tw_reuse = 1
# Enable fast recycling of TIME-WAIT sockets. Default is 0 (disabled).
net.ipv4.tcp_tw_recycle = 1

# Reduce the number of keepalive probes sent before the connection is timed out
net.ipv4.tcp_keepalive_probes=5
# Tune the network device receive queue: the number of unprocessed input packets allowed to queue
# before the kernel starts dropping them; the default is 300. As I understand it, when a network
# interface receives packets faster than the kernel can process them, this is the maximum number
# allowed to queue; anything beyond that is dropped.
net.core.netdev_max_backlog=3000


Execute /sbin/sysctl -p after the modification for it to take effect.
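A single parameter can also be changed and verified at runtime, without editing the file; a small sketch:

# Set one parameter immediately (this does not persist across a reboot)
/sbin/sysctl -w net.ipv4.tcp_fin_timeout=30
# Verify the current value
cat /proc/sys/net/ipv4/tcp_fin_timeout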
The parameters to pay particular attention to here are net.ipv4.tcp_tw_reuse, net.ipv4.tcp_tw_recycle, net.ipv4.tcp_fin_timeout, and the net.ipv4.tcp_keepalive_* family.
net.ipv4.tcp_tw_reuse and net.ipv4.tcp_tw_recycle are enabled to reclaim resources held in the TIME_WAIT state.
net.ipv4.tcp_fin_timeout shortens the time the server takes to move from FIN-WAIT-2 to TIME_WAIT under abnormal circumstances.
The net.ipv4.tcp_keepalive_* parameters configure how the server probes whether a connection is still alive.
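The current keepalive settings can be inspected together with sysctl:

# Print the keepalive-related parameters in one go
sysctl net.ipv4.tcp_keepalive_time net.ipv4.tcp_keepalive_intvl net.ipv4.tcp_keepalive_probes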
For more on using keepalive, refer to: http://hi.baidu.com/tantea/blog/item/580b9d0218f981793812bb7b.html
[2015.01.13 UPDATE] Note the risks of enabling tcp_tw_recycle: http://blog.csdn.net/wireless_tech/article/details/6405755
2. The server maintains a large number of CLOSE_WAIT states

Take a breath here. At first I only intended to discuss the difference between TIME_WAIT and CLOSE_WAIT; I did not expect to dig this deep. That is the benefit of writing blog summaries: there is always an unexpected harvest.
The TIME_WAIT state can be addressed by tuning server kernel parameters, because TIME_WAIT is under the server's own control: either the peer's connection misbehaved, or the server is not reclaiming resources quickly enough; in short, it is not caused by a bug in your own program. CLOSE_WAIT is different. As the diagram above shows, if connections stay in CLOSE_WAIT, there is only one explanation: after the peer closed the connection, the server side never closed its own end, so its FIN was never sent. In other words, the program never detected that the peer had closed the connection, or simply forgot that the connection needed to be closed, and the resource remains occupied by the program. Personally I believe this situation cannot be resolved through kernel parameters; the kernel has no right to forcibly reclaim resources a program is holding, short of terminating the program.
If you use HttpClient and run into a lot of CLOSE_WAIT, this post may help: http://blog.csdn.net/shootyou/article/details/6615051. There I gave a scenario illustrating the difference between CLOSE_WAIT and TIME_WAIT; here is a fresh description: Server A is a crawler server. It uses a simple HttpClient to request file resources from the Apache instance on resource server B. Normally, if the request succeeds, server A actively closes the connection after crawling the resource; that is an active close, and server A's connections show up as TIME_WAIT. But what if an exception occurs? Suppose the requested resource does not exist on server B; then server B initiates the close, and server A becomes the passive side. If server A is passively closed and the programmer forgets to have HttpClient release the connection, the result is the CLOSE_WAIT state.

So the solution to a pile of CLOSE_WAIT connections can be summed up in one sentence: check the code. The problem is in the server program.
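Before reading the code, it helps to confirm which process actually owns the CLOSE_WAIT sockets; a hedged sketch (root may be needed to see other users' processes):

# Map CLOSE_WAIT sockets to the owning process and PID
netstat -antp | grep CLOSE_WAIT
# Or list only TCP sockets currently in CLOSE_WAIT with lsof
lsof -iTCP -sTCP:CLOSE_WAIT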
Reference materials:
1. Handling TIME_WAIT on Windows: http://blog.miniasp.com/post/2010/11/17/How-to-deal-with-TIME_Wait-problem-under-windows.aspx
2. WebSphere server tuning, of some reference value: http://publib.boulder.ibm.com/infocenter/wasinfo/v6r0/index.jsp?topic=/com.ibm.websphere.express.doc/info/exp/ae/tprf_tunelinux.html
3. Meanings of the various kernel parameters: http://haka.sharera.com/blog/blogtopic/32309.htm
