Socket: CLOSE_WAIT and TIME_WAIT state problems

Source: Internet
Author: User
Tags: socket, error

Not long ago my socket client ran into a very embarrassing error. It is supposed to send data to the server continuously over a persistent socket connection, and if the connection drops, the program automatically reconnects.
One day I found that the program kept trying to establish a connection but always failed. Checking with netstat showed thousands of socket connections stuck in the CLOSE_WAIT state; the upper limit had been reached, so no new socket connection could be established.
Why?
Why were they all in the CLOSE_WAIT state?
Why the CLOSE_WAIT state appears
First, we know that if our client program is in the CLOSE_WAIT state, the socket was closed passively: the other side initiated the close.
If the server actively closes the current connection, the two sides exchange four segments to tear down the TCP connection:
Server ---> FIN ---> Client
Server <--- ACK <--- Client
At this point the server is in FIN_WAIT_2 and our client program is in CLOSE_WAIT.
Server <--- FIN <--- Client
When the client sends its own FIN to the server, the client moves to LAST_ACK.
Server ---> ACK ---> Client
The server responds with an ACK, and the client socket finally moves to CLOSED.

 
Our program is stuck in CLOSE_WAIT rather than LAST_ACK, which means it has never sent its own FIN to the server. Most likely it still has a lot of data to send, or other work to finish before closing the connection, and as a result the FIN segment is never sent.
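To make this concrete, here is a minimal sketch (Winsock-style C, reusing the sockConnected handle that appears later in this post; error handling trimmed) of a receive loop that notices the server's FIN, which shows up as recv() returning 0, and calls closesocket() right away so the socket does not sit in CLOSE_WAIT:

char buf[4096];
for (;;)
{
    int n = recv(sockConnected, buf, sizeof(buf), 0);
    if (n > 0)
    {
        // ... process n bytes of application data ...
        continue;
    }
    if (n == 0)
    {
        // The server sent its FIN: we are now in CLOSE_WAIT.
        // Calling closesocket() sends our FIN and moves us to LAST_ACK.
    }
    // n == SOCKET_ERROR: a real error; closing is also the right response.
    closesocket(sockConnected);
    break;
}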
 
So the immediate cause is clear. But why does my program never send its FIN? Does it really have that much to do before closing the connection?
Another question: why are thousands of connections in this state? Was the server actively dropping our connections over and over during that period?
 
In any case, we must prevent similar situations from happening again!
First, we want to keep the client from consuming a new local port on every reconnect attempt. This can be done with the SO_REUSEADDR socket option:
Reuse the local address and port
Previously, because the current local port was stuck in CLOSE_WAIT, each reconnect attempt switched to another port, which is how thousands of ports ended up in that state. Next time I want to keep using the same port, even if that port is still in CLOSE_WAIT.
After creating the socket with
sockConnected = socket(AF_INET, SOCK_STREAM, 0);
we set the reuse option on it:

/// Allow reuse of the local address and port:
/// The benefit is that even if a socket on this port is still hanging (for example in
/// CLOSE_WAIT or TIME_WAIT), the new socket can still use the same port instead of
/// being forced onto a different one, which is what the old approach kept doing.
int nReuseAddr = 1;
setsockopt(sockConnected,
           SOL_SOCKET,
           SO_REUSEADDR,
           (const char *)&nReuseAddr,
           sizeof(int));

The textbooks say SO_REUSEADDR is most useful when a server is restarted while its local address and port are still in TIME_WAIT: the new instance can bind to the same port immediately.
We may not be able to avoid getting stuck in CLOSE_WAIT, but at least we can make sure each stuck socket does not consume yet another local port.
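Note that for a client, SO_REUSEADDR only helps keep the same port if we also bind() to a fixed local port before connecting. A minimal sketch of that (Winsock-style C; the local port 5150 and the variable names are illustrative assumptions, and it presumes WSAStartup() has already been called):

SOCKET sockConnected = socket(AF_INET, SOCK_STREAM, 0);

int nReuseAddr = 1;
setsockopt(sockConnected, SOL_SOCKET, SO_REUSEADDR,
           (const char *)&nReuseAddr, sizeof(int));

// Bind to one fixed local port so every reconnect reuses the same port
// instead of consuming a fresh ephemeral one. Error checks omitted for brevity.
struct sockaddr_in local = {0};
local.sin_family      = AF_INET;
local.sin_addr.s_addr = htonl(INADDR_ANY);
local.sin_port        = htons(5150);   // illustrative fixed local port
bind(sockConnected, (struct sockaddr *)&local, sizeof(local));

// ... then connect(sockConnected, ...) to the server as usual ...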
Next, we set the SO_LINGER socket option:
Close gracefully or forcibly?
Linger means "to delay" or "to loiter".
By default (on Windows 2000), the SO_DONTLINGER option is 1 and SO_LINGER is { l_onoff: 0, l_linger: 0 }.
If closesocket() is called while a send is still in progress (send() has not completed and there is unsent data), we usually take the following measures:
Because I already call
/// Shut down both directions of the connection first
shutdown(sockConnected, SD_BOTH);
/// To be safe, close any old connection before establishing each new socket connection
closesocket(sockConnected);
 
This time we will also do the following:
If SO_LINGER is set with a zero timeout (that is, the l_onoff field of the linger structure is non-zero but l_linger is 0), closesocket() will never block in a "lingering" state waiting for completion, regardless of whether there is queued data that has not yet been sent or acknowledged. This is called a "hard" or abortive close, because the socket's virtual circuit is reset immediately and any unsent data is lost. Remote recv() calls will fail with WSAECONNRESET.
Set this option after connect() has successfully established the connection:

linger m_sLinger;
m_sLinger.l_onoff  = 1;   // enable lingering; with l_linger == 0 this means an abortive close
m_sLinger.l_linger = 0;   // allowed linger time: 0 seconds
setsockopt(sockConnected,
           SOL_SOCKET,
           SO_LINGER,
           (const char *)&m_sLinger,
           sizeof(linger));
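If losing queued data on close is not acceptable, a common alternative (my sketch, not from the original author; same Winsock-style names as above) is a graceful shutdown: announce that we are done sending, drain whatever the peer still has to say, and only then close:

// Graceful close: our FIN goes out after the queued data, and we can still receive.
shutdown(sockConnected, SD_SEND);

char drain[512];
while (recv(sockConnected, drain, sizeof(drain), 0) > 0)
{
    // discard (or process) remaining data until the peer's FIN arrives (recv returns 0)
}

closesocket(sockConnected);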

 
Summary
We may not be able to prevent sockets from getting stuck in CLOSE_WAIT again, but we can minimize the impact. With the reuse option set, we hope a connection left in CLOSE_WAIT will simply be displaced the next time we establish a connection on the same port.

Feedback
# Reply: [socket] the embarrassing CLOSE_WAIT state and response strategy, by Yun.Zheng
Reply to: elssann (smelly asshole and his pistachio) Credit: 51 14:00:00 Score: 0

What I mean is: one side closes the connection, the other side fails to detect it, and that is what produces CLOSE_WAIT. The same thing happened to a friend of mine recently. He wrote a client that connected to Apache; after Apache dropped the connection, his client never detected it and CLOSE_WAIT appeared. I told him where to look, and after he added the call to closesocket() there, the problem was eliminated.
If CLOSE_WAIT still appears even though you close the connection, I suggest dropping the shutdown() call and simply calling closesocket() on both sides.

Another problem:

For example:
After the client connects to the server, it sends an authentication request. The server receives the data and verifies the client's identity. If the password is wrong, the server should first send a "wrong password" response to the client and then disconnect.

If you set
m_sLinger.l_onoff = 1;
m_sLinger.l_linger = 0;
then in many cases the client never receives the "wrong password" message; the connection is simply reset.
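A sketch of how the server side can avoid this (Winsock-style C; the sockClient handle and the error string are illustrative assumptions, not code from the original post): send the rejection, half-close with shutdown() so the FIN follows the data, wait for the client to finish, then close:

// Reject the client, but make sure the message arrives before the connection ends.
const char reply[] = "ERR wrong password";     // illustrative application-level message
send(sockClient, reply, sizeof(reply) - 1, 0);

shutdown(sockClient, SD_SEND);                 // our FIN is queued after the data

// Wait until the client closes its side (recv() returns 0) or an error occurs.
char buf[256];
while (recv(sockClient, buf, sizeof(buf), 0) > 0)
{
    // ignore any trailing data from the client
}

closesocket(sockClient);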

# Reply: [socket] the embarrassing CLOSE_WAIT state and response strategy, by Yun.Zheng
elssann (ODPS and his pistachio) Credit: 51 13:24:00 Score: 0

The reason CLOSE_WAIT occurs is very simple: after the other side closes the connection, this side fails to detect it and never calls closesocket(), so it stays in that state. This is plain to see in the TCP state transition diagram. There is also a corresponding state on the side that closes first, TIME_WAIT.

In addition, setting a socket's SO_LINGER to a zero-second timeout (that is, resetting the connection immediately on close) is often harmful.
Also, making ports reusable is not a safe network programming practice.

# Reply: [socket] the embarrassing CLOSE_WAIT state and response strategy, by Yun.Zheng
elssann (ODPS and his pistachio) Credit: 51 14:48:00 Score: 0

For more information, see here.
http://blog.csdn.net/cqq/archive/2005/01/26/269160.aspx

Let's look at the figure again:

http://tech.ccidnet.com/pub/attachment/2004/8/322252.png

When the connection is being torn down:
The side that initiates the close (the left side in the figure) sends a FIN, and the passively closing side (the right side) responds with an ACK. That ACK is generated by TCP itself, not by the application, and at this point the passively closing side is in CLOSE_WAIT. If the passively closing side never calls closesocket(), it never sends its own FIN, so it stays in CLOSE_WAIT forever. Only when the passively closing side calls closesocket() does it send a FIN to the actively closing side and move to LAST_ACK.

# Reply: [socket] the embarrassing CLOSE_WAIT state and response strategy, by Yun.Zheng
elssann (ODPS and his pistachio) Credit: 51 15:39:00 Score: 0

For example, suppose the client is the side being closed passively...

When the other side calls closesocket(), your program is sitting in

int nRet = recv(s, ...);
if (nRet == SOCKET_ERROR)
{
    // closesocket(s);
    return FALSE;
}

Many people forget that closesocket() call; it is an extremely common mistake.

My understanding is: when the active side sends its FIN to the passive side, the passive side's TCP immediately responds with an ACK and reports the event up to the application, so the application sees it on its next socket call (for an orderly close, recv() actually returns 0 rather than SOCKET_ERROR, and subsequent send() calls may fail). Normally, if closesocket() is called once this is detected, the passive side's TCP sends its FIN and the state changes to LAST_ACK.

# Reply: [socket] the embarrassing CLOSE_WAIT state and response strategy, by Yun.Zheng
int nRecvBufLength =
    recv(sockConnected,
         szRecvBuffer,
         sizeof(szRecvBuffer),
         0);
// zhengyun 20050130:
/// As elssann says, when the other side calls closesocket(), my recv() here may not
/// see the other side's FIN directly, but TCP has already returned an ACK on my behalf,
/// so my program has entered CLOSE_WAIT.
/// Therefore the recommendation is: whenever an error is detected here, actively
/// call closesocket().
/// Since we set the recv() timeout to 30 seconds, a timeout shows up here as
/// WSAETIMEDOUT, and in that case it is also fine to close the connection.
if (nRecvBufLength == SOCKET_ERROR)
{
    TRACE_INFO(_T("== socket error while receiving with recv =="));
    closesocket(sockConnected);
    continue;
}

Can this happen?
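One refinement worth noting (my sketch, not from the thread; it reuses the sockConnected and szRecvBuffer names from the snippet above): a peer's orderly close shows up as recv() returning 0, while a receive timeout shows up as SOCKET_ERROR with WSAGetLastError() == WSAETIMEDOUT, so the two cases can be told apart before deciding what to do:

int nRecvBufLength = recv(sockConnected, szRecvBuffer, sizeof(szRecvBuffer), 0);

if (nRecvBufLength == 0)
{
    // recv() == 0: the peer sent FIN (orderly close) and we are in CLOSE_WAIT.
    // Close right away so we move on to LAST_ACK instead of lingering.
    closesocket(sockConnected);
}
else if (nRecvBufLength == SOCKET_ERROR)
{
    if (WSAGetLastError() == WSAETIMEDOUT)
    {
        // Receive timeout (SO_RCVTIMEO expired): the connection may still be
        // healthy, so retrying is an option; the original code closes anyway.
    }
    closesocket(sockConnected);
}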

Network connections that cannot be released: CLOSE_WAIT
Keywords: TCP, CLOSE_WAIT, Java, SocketChannel

Problem description: a problem ran into during a recent performance test. The client uses NIO; the server still uses ordinary sockets. After the test had run for a while, the server system accumulated a large number of unreleased network connections. Checking with netstat -na showed the connections in the CLOSE_WAIT state. This was strange: the socket had been closed, so why was the connection still not released?

Solution: after half a day on Google I found that CLOSE_WAIT questions are usually asked about C, and Java cases seem rare (one article was good, but it only worked around CLOSE_WAIT rather than fixing the root cause; it settled for a compromise). Since NIO was in use, I also suspected it might be the culprit. Several posts mentioned the same issue: one end calls close on its socket, but the other end never calls close. So I checked the code and found that in some exception paths the server did not close the socket. After fixing that, the problem was solved.

Most of the time was spent on Google, but I learned a lot. The following figure shows the state transitions of a TCP connection:

Note: the dashed line and the solid line correspond to the server (the side being connected to) and the client (the side actively connecting), respectively.

The netstat -na command shows the current TCP connection states. LISTEN, ESTABLISHED, and TIME_WAIT are the ones commonly seen.

Analysis:

The problem I ran into above was mainly that the TCP termination handshake never completed, so the connection was never released. In my case the client disconnects first; the process is as follows:

Client                                     Server

close()
            ------ FIN ------>
FIN_WAIT_1                                 CLOSE_WAIT
            <----- ACK -------
FIN_WAIT_2
                                           close()
            <----- FIN -------
TIME_WAIT                                  LAST_ACK
            ------ ACK ------>
                                           CLOSED
CLOSED

As shown above, because close() was never called on the server socket after the client closed, the connection was left "hanging" on the server while the client kept waiting for a response. The typical signature of this problem is one end in FIN_WAIT_2 and the other end in CLOSE_WAIT. The root cause, though, is simply that the program is not written carefully enough and needs to be fixed.
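The shape of the fix, as a minimal sketch (Winsock-style C with illustrative names; the code in question was actually Java, so this only shows the principle): make sure every exit path of the per-connection handler, including error paths, closes the accepted socket, so it never gets stuck in CLOSE_WAIT:

void HandleClient(SOCKET sockClient)
{
    char buf[4096];
    for (;;)
    {
        int n = recv(sockClient, buf, sizeof(buf), 0);
        if (n <= 0)
        {
            // n == 0: the client closed (we are in CLOSE_WAIT until we close too);
            // n < 0: an error occurred. Either way, fall through and close.
            break;
        }
        // ... process n bytes; on any processing error, also break ...
    }
    closesocket(sockClient);   // the one close() that every path reaches
}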

The TIME_WAIT state
According to the TCP protocol, the side that actively initiates the close enters the TIME_WAIT state and stays there for 2 * MSL (maximum segment lifetime), which is 240 seconds by default on Windows. There is a post that briefly explains why this state is necessary.

It is worth noting that for HTTP over TCP it is the server side that closes the TCP connection, so it is the server that ends up in TIME_WAIT. For a busy web server it is easy to see that there will be a huge number of TIME_WAIT entries: if the server handles 1000 requests per second, a backlog of 240 * 1000 = 240,000 TIME_WAIT records builds up. Maintaining these states is a burden on the server. Modern operating systems do use fast lookup algorithms to manage them, so checking whether a new TCP connection request hits an existing TIME_WAIT entry is not especially expensive, but keeping that much state around is still unpleasant.

HTTP/1.1 makes keep-alive the default behavior, i.e. multiple request/response exchanges are carried over one TCP connection, and this problem is one major reason for that. Another way to relieve TIME_WAIT pressure is to shorten the system's 2 * MSL time, since 240 seconds really is rather long. On Windows this is done in the registry: add a DWORD value named TcpTimedWaitDelay under HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Services\Tcpip\Parameters. Generally the value should not be set below about 60, or it may cause trouble.

For a large service one server may not be enough, so a load balancer (LB) is used to distribute traffic across several backend servers. If the LB works in NAT mode, problems can arise. The source address of every IP packet going from the LB to a backend server is the same (the LB's internal address), so the TCP connections from the LB to that backend are constrained: connections are opened and closed frequently, each closed connection leaves a TIME_WAIT entry on the server, and the remote address in all of those entries is the LB. The LB has a little over 60,000 usable source ports (2^16 = 65536, of which ports 1-1023 are reserved and a few others are not used by default). Once a given LB port lands on the server's TIME_WAIT "blacklist", it cannot be used to establish a new connection to that server for 240 seconds, so the LB and the server can sustain only about 60,000 / 240, roughly 250 to 300, new connections per second. Without an LB this problem does not arise, because the remote addresses the server sees span the whole Internet, and each address has its own 60,000-plus ports to draw from.

At first I assumed that going through the LB would therefore sharply limit the TCP connection rate, but experiments showed otherwise: the Windows Server 2003 machine behind the LB handled far more requests per second than that limit allows. Does the TIME_WAIT state not take effect? Watching with Network Monitor and netstat, I found that after a connection between the server and some LB port xxxx entered TIME_WAIT, a new SYN from that same LB port xxxx was still received and processed by the server instead of being dropped as I expected. I dug out the copy of UNIX Network Programming, Volume 1, Second Edition: Networking APIs: Sockets and XTI bought back in my dusty college days: for BSD-derived implementations, as long as the SYN's sequence number is greater than the largest sequence number seen before the previous close, the TIME_WAIT state will accept the SYN. Could Windows be BSD-derived in this respect? Following that clue and the keyword BSD, I found a post saying that NT 4.0 does not behave like the BSD-derived stacks, but Windows Server 2003 is NT 5.2, so it may well differ slightly.

So I ran an experiment: I wrote a client with the Socket API that binds to a fixed local port such as 2345 every time and repeatedly opens TCP connections to the server, each sending an HTTP request with keep-alive = false. The Windows implementation keeps the sequence numbers increasing, and although the server holds a TIME_WAIT entry for the connection from the client's port 2345, it always accepts the new requests and never rejects them. What if the SYN's sequence number decreases instead? For that I did not use the Socket API; instead I used raw IP to send a SYN packet with a small sequence number. Network Monitor showed that the SYN reached the server, but the system did not respond at all; the packet was dropped.

According to the book, the BSD-derived behavior (which Windows Server 2003 apparently shares) carries a security risk, but at least it keeps TIME_WAIT from blocking new TCP requests. Of course, the client has to cooperate by ensuring that the sequence numbers of successive TCP connections keep increasing, or at least never decrease.

This article is from a CSDN blog. If you reproduce it, please credit the source: http://blog.csdn.net/jamex/archive/2009/11/17/4823405.aspx
