Original: How TCP backlog works in Linux
My understanding is limited, so mistakes are inevitable; corrections are welcome!
The following is the translation:
When an application puts a socket into the LISTEN state via the listen system call, it specifies a backlog parameter for that socket. The backlog is usually described as a limit on the queue of incoming connections.
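For reference, this is where the backlog is set in application code. Below is a minimal sketch of a server that listens on port 9999 (the port used in the packet trace later in this article) but never calls accept, which is one way to let the connection queue fill up; the backlog value of 128 is an arbitrary illustrative choice:

    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>
    #include <unistd.h>
    #include <netinet/in.h>
    #include <sys/socket.h>

    int main(void) {
        int fd = socket(AF_INET, SOCK_STREAM, 0);
        if (fd < 0) { perror("socket"); exit(1); }

        struct sockaddr_in addr;
        memset(&addr, 0, sizeof(addr));
        addr.sin_family = AF_INET;
        addr.sin_addr.s_addr = htonl(INADDR_LOOPBACK);
        addr.sin_port = htons(9999);

        if (bind(fd, (struct sockaddr *)&addr, sizeof(addr)) < 0) {
            perror("bind");
            exit(1);
        }

        /* The second argument is the backlog this article is about. */
        if (listen(fd, 128) < 0) {
            perror("listen");
            exit(1);
        }

        pause(); /* keep listening without ever calling accept() */
        return 0;
    }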
Because of the TCP protocol's three-way handshake, an incoming socket connection goes through an intermediate state, SYN RECEIVED (see the TCP state diagram), before it reaches the ESTABLISHED state and can be returned to the application by the accept system call. This means that a TCP stack has two options for implementing the backlog queue:
- Use a single queue, whose size is specified by the backlog parameter of the listen system call. When the server receives a SYN packet, it sends back a SYN/ACK packet and adds the connection to the queue. When the server then receives the corresponding ACK from the client, the connection changes its state to ESTABLISHED and becomes eligible for handover to the application.
- Use two queues: a SYN queue and an accept queue. Connections in state SYN RECEIVED are added to the SYN queue and moved to the accept queue when the subsequent ACK changes their state to ESTABLISHED. As the name implies, the accept system call is then implemented simply to consume connections from the accept queue, and in this case the backlog parameter of the listen call determines the size of the accept queue (see the sketch after this list).
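To make the second option concrete, here is a highly simplified sketch of the two-queue model in C. The array sizes, the conn type, and the function names (on_syn, on_ack, do_accept) are illustrative inventions, not actual kernel data structures:

    #include <stdio.h>

    #define SYN_QUEUE_MAX    256  /* system-wide limit in the two-queue model */
    #define ACCEPT_QUEUE_MAX 128  /* the listen() backlog in the two-queue model */

    struct conn { int id; };

    static struct conn syn_queue[SYN_QUEUE_MAX];
    static int syn_len;
    static struct conn accept_queue[ACCEPT_QUEUE_MAX];
    static int accept_len;

    /* SYN received: the connection enters the SYN queue in state SYN RECEIVED. */
    static int on_syn(struct conn c) {
        if (syn_len == SYN_QUEUE_MAX)
            return -1;                  /* SYN queue full: drop the SYN */
        syn_queue[syn_len++] = c;
        return 0;
    }

    /* Final ACK received: move the now-ESTABLISHED connection to the accept
     * queue. (Ordering is ignored here for brevity; real queues are FIFO.) */
    static int on_ack(void) {
        if (syn_len == 0 || accept_len == ACCEPT_QUEUE_MAX)
            return -1;                  /* nothing pending, or accept queue full */
        accept_queue[accept_len++] = syn_queue[--syn_len];
        return 0;
    }

    /* accept() simply consumes one connection from the accept queue. */
    static int do_accept(struct conn *out) {
        if (accept_len == 0)
            return -1;                  /* no established connection available */
        *out = accept_queue[--accept_len];
        return 0;
    }

    int main(void) {
        struct conn c = { .id = 1 }, got;
        on_syn(c);                      /* client SYN arrives */
        on_ack();                       /* client ACK completes the handshake */
        if (do_accept(&got) == 0)
            printf("accepted connection %d\n", got.id);
        return 0;
    }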
Historically, BSD-derived TCP implementations used the first approach. That choice implies that when the queue reaches the maximum indicated by the backlog, the system no longer sends SYN/ACK packets in response to SYN packets. Typically, the TCP implementation simply discards the received SYN packet (rather than responding with a RST packet) so that the client will retry. This is also the scenario W. Richard Stevens describes in section 14.5 (the listen backlog queue) of his classic textbook TCP/IP Illustrated, Volume 3.
Note that W. Richard Stevens actually explains that BSD implementations do use two separate queues, but they behave as a single queue with a fixed maximum size determined by (but not necessarily equal to) the backlog parameter; that is, BSD logically behaves as described in option one.
The situation on Linux is somewhat different; the man page of the listen system call says:
    The behavior of the backlog argument on TCP sockets changed with Linux 2.2. Now it specifies the queue length for completely established sockets waiting to be accepted, instead of the number of incomplete connection requests. The maximum length of the queue for incomplete sockets can be set using /proc/sys/net/ipv4/tcp_max_syn_backlog.
This means that current Linux versions use the second option with two distinct queues: a SYN queue with a size specified by a system-wide setting, and an accept queue with a size specified by the application. An interesting question about option two is: what does the TCP implementation do if the accept queue is full and a connection needs to be moved from the SYN queue to the accept queue? This case is handled by the tcp_check_req function in net/ipv4/tcp_minisocks.c. The relevant code reads:
    child = inet_csk(sk)->icsk_af_ops->syn_recv_sock(sk, skb, req, NULL);
    if (child == NULL)
        goto listen_overflow;
For IPv4, the first line of code actually calls tcp_v4_syn_recv_sock in net/ipv4/tcp_ipv4.c, which contains the following code:
    if (sk_acceptq_is_full(sk))
        goto exit_overflow;
The check of the accept queue is visible in the code above. The code after the exit_overflow label performs some cleanup, updates the ListenOverflows and ListenDrops statistics in /proc/net/netstat, and returns NULL, which triggers the execution of the listen_overflow code in tcp_check_req:
    listen_overflow:
        if (!sysctl_tcp_abort_on_overflow) {
            inet_rsk(req)->acked = 1;
            return NULL;
        }
This means that unless /proc/sys/net/ipv4/tcp_abort_on_overflow is set to 1 (in which case a RST packet is sent, as the code following this snippet shows), the TCP implementation basically does nothing!
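For completeness, here is one simple way to check this flag from a C program by reading the proc file directly (this is just an illustration; the same can be done with the sysctl command):

    #include <stdio.h>

    int main(void) {
        FILE *f = fopen("/proc/sys/net/ipv4/tcp_abort_on_overflow", "r");
        if (!f) { perror("fopen"); return 1; }
        int value = 0;
        if (fscanf(f, "%d", &value) == 1)
            printf("tcp_abort_on_overflow = %d\n", value);
        fclose(f);
        return 0;
    }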
In summary, if the TCP implementation in Linux (on the server side) receives the ACK packet of the three-way handshake from the client while the accept queue is full, it basically ignores that packet. This may sound strange at first, but remember that the SYN RECEIVED state has an associated timer: if the server does not receive the ACK (or ignores it, as considered here), the TCP implementation will retransmit the SYN/ACK packet (the number of retries is determined by /proc/sys/net/ipv4/tcp_synack_retries, using an exponential backoff algorithm).
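To see how those retries add up, here is a small sketch of the schedule, assuming an initial retransmission timeout of roughly 1 second that doubles on each retry; the initial timeout is an assumption that matches the SYN/ACK timestamps in the trace below, not a value read from the kernel:

    #include <stdio.h>

    int main(void) {
        int retries = 5;      /* e.g. the value in /proc/sys/net/ipv4/tcp_synack_retries */
        double timeout = 1.0; /* assumed initial SYN/ACK retransmission timeout, seconds */
        double elapsed = 0.0;

        for (int i = 1; i <= retries; i++) {
            elapsed += timeout;
            printf("SYN/ACK retry %d at ~%.0fs\n", i, elapsed);
            timeout *= 2;     /* exponential backoff: double the timeout each time */
        }
        return 0;
    }

This prints retries at roughly 1s, 3s, 7s, 15s, and 31s, which lines up with the SYN/ACK retransmissions visible at 1.199, 3.399, 7.599, 15.599, and 31.599 in the trace.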
The above behavior can be seen in the following packet trace, in which a client attempts to connect (and send data) to a socket that has reached its maximum backlog:
    0.000   127.0.0.1 -> 127.0.0.1  TCP 53302 > 9999 [SYN] seq=0 len=0
    0.000   127.0.0.1 -> 127.0.0.1  TCP 9999 > 53302 [SYN, ACK] seq=0 ack=1 len=0
    0.000   127.0.0.1 -> 127.0.0.1  TCP 53302 > 9999 [ACK] seq=1 ack=1 len=0
    0.000   127.0.0.1 -> 127.0.0.1  TCP 53302 > 9999 [PSH, ACK] seq=1 ack=1 len=5
    0.207   127.0.0.1 -> 127.0.0.1  TCP [TCP Retransmission] 53302 > 9999 [PSH, ACK] seq=1 ack=1 len=5
    0.623   127.0.0.1 -> 127.0.0.1  TCP [TCP Retransmission] 53302 > 9999 [PSH, ACK] seq=1 ack=1 len=5
    1.199   127.0.0.1 -> 127.0.0.1  TCP 9999 > 53302 [SYN, ACK] seq=0 ack=1 len=0
    1.199   127.0.0.1 -> 127.0.0.1  TCP [TCP Dup ACK 6#1] 53302 > 9999 [ACK] seq=6 ack=1 len=0
    1.455   127.0.0.1 -> 127.0.0.1  TCP [TCP Retransmission] 53302 > 9999 [PSH, ACK] seq=1 ack=1 len=5
    3.123   127.0.0.1 -> 127.0.0.1  TCP [TCP Retransmission] 53302 > 9999 [PSH, ACK] seq=1 ack=1 len=5
    3.399   127.0.0.1 -> 127.0.0.1  TCP 9999 > 53302 [SYN, ACK] seq=0 ack=1 len=0
    3.399   127.0.0.1 -> 127.0.0.1  TCP [TCP Dup ACK 10#1] 53302 > 9999 [ACK] seq=6 ack=1 len=0
    6.459   127.0.0.1 -> 127.0.0.1  TCP [TCP Retransmission] 53302 > 9999 [PSH, ACK] seq=1 ack=1 len=5
    7.599   127.0.0.1 -> 127.0.0.1  TCP 9999 > 53302 [SYN, ACK] seq=0 ack=1 len=0
    7.599   127.0.0.1 -> 127.0.0.1  TCP [TCP Dup ACK 13#1] 53302 > 9999 [ACK] seq=6 ack=1 len=0
    13.131  127.0.0.1 -> 127.0.0.1  TCP [TCP Retransmission] 53302 > 9999 [PSH, ACK] seq=1 ack=1 len=5
    15.599  127.0.0.1 -> 127.0.0.1  TCP 9999 > 53302 [SYN, ACK] seq=0 ack=1 len=0
    15.599  127.0.0.1 -> 127.0.0.1  TCP [TCP Dup ACK 16#1] 53302 > 9999 [ACK] seq=6 ack=1 len=0
    26.491  127.0.0.1 -> 127.0.0.1  TCP [TCP Retransmission] 53302 > 9999 [PSH, ACK] seq=1 ack=1 len=5
    31.599  127.0.0.1 -> 127.0.0.1  TCP 9999 > 53302 [SYN, ACK] seq=0 ack=1 len=0
    31.599  127.0.0.1 -> 127.0.0.1  TCP [TCP Dup ACK 19#1] 53302 > 9999 [ACK] seq=6 ack=1 len=0
    53.179  127.0.0.1 -> 127.0.0.1  TCP [TCP Retransmission] 53302 > 9999 [PSH, ACK] seq=1 ack=1 len=5
    106.491 127.0.0.1 -> 127.0.0.1  TCP [TCP Retransmission] 53302 > 9999 [PSH, ACK] seq=1 ack=1 len=5
    106.491 127.0.0.1 -> 127.0.0.1  TCP 9999 > 53302 [RST] seq=1 len=0
Because the client-side TCP implementation receives multiple SYN/ACK packets, it assumes that the ACK packet it sent was lost and resends it (see the lines marked TCP Dup ACK in the trace above).
If the server-side application reduces the backlog (i.e. consumes an entry from the accept queue) before the maximum number of SYN/ACK retries has been reached, the TCP implementation will eventually process one of the client's duplicate ACKs, transition the connection's state from SYN RECEIVED to ESTABLISHED, and add it to the accept queue. Otherwise, the client will eventually receive a RST packet (as in the trace shown above).
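On the server side, consuming entries from the accept queue is simply a matter of calling accept in a loop. Below is a minimal fragment complementing the listener sketch shown earlier; listen_fd is assumed to be a listening socket such as the one created there:

    #include <stdio.h>
    #include <unistd.h>
    #include <sys/socket.h>

    /* Drain the accept queue: every accept() call removes one ESTABLISHED
     * connection from the queue, making room for connections waiting in
     * the SYN queue to be moved over. */
    void serve(int listen_fd) {
        for (;;) {
            int conn_fd = accept(listen_fd, NULL, NULL);
            if (conn_fd < 0) {
                perror("accept");
                continue;
            }
            /* ... handle the connection ... */
            close(conn_fd);
        }
    }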
The packet trace also shows another interesting aspect of this behavior. From the client's point of view, the connection is in state ESTABLISHED after receipt of the first SYN/ACK packet. If the client then sends data to the server (without first waiting for data from the server), that data will be retransmitted as well. Fortunately, TCP slow start limits the number of segments sent during this retransmission phase.
On the other hand, if the client first waits for data from the server and the server's backlog is never reduced, the end result is that on the client side the connection is in state ESTABLISHED, while on the server side it is in state SYN_RCVD (translator's note: the original text says CLOSED, which appears to be a mistake); in other words, a half-open connection!
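As an aside, the client-side view of the connection state can be inspected with the TCP_INFO socket option. A small fragment follows; note that the half-open situation itself cannot be detected this way, since the client simply sees ESTABLISHED:

    #include <stdio.h>
    #include <netinet/in.h>
    #include <netinet/tcp.h>
    #include <sys/socket.h>

    /* Print the client-side view of the connection state. In the half-open
     * case described above, this reports TCP_ESTABLISHED even though the
     * server still considers the connection to be in SYN_RCVD. */
    void print_client_state(int fd) {
        struct tcp_info info;
        socklen_t len = sizeof(info);
        if (getsockopt(fd, IPPROTO_TCP, TCP_INFO, &info, &len) == 0)
            printf("tcpi_state = %u (TCP_ESTABLISHED = %d)\n",
                   (unsigned)info.tcpi_state, TCP_ESTABLISHED);
    }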
There is one more aspect we have not discussed yet. The quote from the listen man page suggests that every SYN packet causes a connection to be added to the SYN queue unless that queue is full. That is not exactly how things work, because the tcp_v4_conn_request function in net/ipv4/tcp_ipv4.c contains the following code:
    /* Accept backlog is full. If we have already queued enough
     * of warm entries in syn queue, drop request. It is better than
     * clogging syn queue with openreqs with exponentially increasing
     * timeout.
     */
    if (sk_acceptq_is_full(sk) && inet_csk_reqsk_queue_young(sk) > 1) {
        NET_INC_STATS_BH(sock_net(sk), LINUX_MIB_LISTENOVERFLOWS);
        goto drop;
    }
What this code means is that if the accept queue is full, the kernel effectively limits the rate at which SYN packets are accepted. If too many SYN packets arrive, some of them will be dropped, causing the client to retry sending its SYN packet, and we end up with the same behavior as in BSD-derived implementations.
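Restated outside the kernel: a new SYN is dropped only if the accept queue is full and more than one "young" request (one whose SYN/ACK has not yet had to be retransmitted) is already sitting in the SYN queue. A hedged restatement in plain C, with invented types and names that are not the real kernel structures:

    #include <stdio.h>

    /* Illustrative restatement of the kernel's drop condition above. */
    struct listener {
        int accept_queue_len;
        int accept_queue_max;
        int young_requests;  /* SYN-queue entries not yet retransmitted */
    };

    static int should_drop_syn(const struct listener *l) {
        int acceptq_full = l->accept_queue_len >= l->accept_queue_max;
        /* Mirrors sk_acceptq_is_full(sk) && inet_csk_reqsk_queue_young(sk) > 1 */
        return acceptq_full && l->young_requests > 1;
    }

    int main(void) {
        struct listener l = { .accept_queue_len = 128,
                              .accept_queue_max = 128,
                              .young_requests = 5 };
        printf("drop new SYN? %s\n", should_drop_syn(&l) ? "yes" : "no");
        return 0;
    }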
Finally, let's see why the Linux design is better than the traditional BSD behavior. Stevens makes the following interesting point:
    The backlog can be reached if the completed connection queue fills (i.e., the server process or the server host is so busy that the process cannot call accept fast enough to take the completed entries off the queue) or if the incomplete connection queue fills. The latter is the problem that HTTP servers face, when the round-trip time between the client and server is long, compared to the arrival rate of new connection requests, because a new SYN occupies an entry on this queue for one round-trip time. [...]

    The completed connection queue is almost always empty because when an entry is placed on this queue, the server's call to accept returns, and the server takes the completed connection off the queue.
The solution proposed by Stevens is simply to increase the backlog. The problem with this is that it assumes that in order to tune the backlog, an application must take into account not only how it handles newly established incoming connections, but also traffic characteristics such as the round-trip time. The Linux implementation effectively separates these two concerns: the application is only responsible for tuning the backlog so that it can accept connections fast enough to avoid filling the accept queue; the system administrator can then tune /proc/sys/net/ipv4/tcp_max_syn_backlog based on traffic characteristics.