Linux TCP Delayed Acknowledgment

Source: Internet
Author: User
Tags: sendfile

Case 1: A colleague wrote a stress-testing program whose logic is: every second, send n 132-byte packets back to back, then receive n 132-byte packets back from the backend service. The code, simplified:

char sndbuf[132];
char rcvbuf[132];

while (1) {
    for (int i = 0; i < n; i++) {
        send(fd, sndbuf, sizeof(sndbuf), 0);
        ...
    }
    for (int i = 0; i < n; i++) {
        recv(fd, rcvbuf, sizeof(rcvbuf), 0);
        ...
    }
    sleep(1);
}

In actual tests, we found that when n >= 3, from the 2nd second onward, every third recv call blocked for about 40 milliseconds. Yet the server logs showed that every request took less than 2 ms of server-side processing.
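As an aside, such stalls can be spotted from user space by timing each call. Below is a minimal sketch (hypothetical instrumentation, not part of the original program) using CLOCK_MONOTONIC:

/* Hypothetical instrumentation: time each recv() to spot ~40 ms stalls. */
#include <stdio.h>
#include <sys/socket.h>
#include <time.h>

static ssize_t timed_recv(int fd, void *buf, size_t len)
{
    struct timespec t0, t1;
    ssize_t n;

    clock_gettime(CLOCK_MONOTONIC, &t0);
    n = recv(fd, buf, len, 0);
    clock_gettime(CLOCK_MONOTONIC, &t1);

    long ms = (t1.tv_sec - t0.tv_sec) * 1000 +
              (t1.tv_nsec - t0.tv_nsec) / 1000000;
    if (ms >= 10)                        /* flag calls that blocked */
        fprintf(stderr, "recv blocked for %ld ms\n", ms);
    return n;
}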

The debugging process went as follows. We first tried strace on the client process, but oddly, as soon as strace attached, all sends and receives completed normally with no blocking; the moment strace detached, the problem reappeared. A colleague pointed out that strace may change something in the program or the system (we never got to the bottom of this), so we switched to tcpdump. The capture showed that after the server returned a response packet, the client did not ACK the data immediately but waited nearly 40 milliseconds before acknowledging. After some searching, and rereading "TCP/IP Illustrated, Volume 1: The Protocols", we identified the culprit: TCP's delayed acknowledgment (delayed ACK) mechanism.

The solution: after each recv call, call setsockopt once to set TCP_QUICKACK. The final code:

char sndbuf[132];
char rcvbuf[132];

while (1) {
    for (int i = 0; i < n; i++) {
        send(fd, sndbuf, 132, 0);
        ...
    }
    for (int i = 0; i < n; i++) {
        recv(fd, rcvbuf, 132, 0);
        setsockopt(fd, IPPROTO_TCP, TCP_QUICKACK, (int[]){1}, sizeof(int));
    }
    sleep(1);
}

 

Case 2: During performance tests of the in-memory cdkey version of the marketing platform, the latency distribution of requests was abnormal: 90% of requests completed within 2 ms, while the other 10% always took between 38 and 42 ms. 40 ms is a suspiciously familiar number. Having been through Case 1, I guessed the delay was again caused by the delayed ACK mechanism; a quick packet capture confirmed it, and setting the TCP_QUICKACK option solved the latency problem.

 

The delayed ACK mechanism

Chapter 19 of "TCP/IP Illustrated, Volume 1: The Protocols" describes in detail how TCP handles interactive data flows, as opposed to bulk data flows. Typical interactive flows, such as telnet and rlogin, use the delayed ACK mechanism and the Nagle algorithm to reduce the number of small packets on the wire.

The book explains the principles of both mechanisms clearly, so they are not repeated here. The rest of this article examines the delayed ACK mechanism through the Linux TCP/IP implementation.

 

1. Why does delayed ACK cause latency?

The delayed ACK mechanism by itself does not add request latency (I initially assumed that recv would not return until the ACK had been sent, which is wrong). Delay generally appears only when delayed ACK interacts with the Nagle algorithm or with congestion control (slow start or congestion avoidance). Let's look at each interaction in detail.

Delayed ACK and the Nagle algorithm

First, the rules of the Nagle algorithm (see the comment on the tcp_nagle_check function in tcp_output.c):

1) If the packet has reached a full MSS, it can be sent.

2) If the packet contains FIN, it can be sent.

3) If the TCP_NODELAY option is set, it can be sent.

4) If the TCP_CORK option is not set, and every previously transmitted small packet (shorter than MSS) has been acknowledged, it can be sent.

Rule 4 means that a TCP connection may have at most one unacknowledged small packet in flight; no further small packets may be sent until that packet's ACK arrives. So if the ACK for a small packet is delayed (by 40 ms, as in our case), the sending of subsequent small packets is delayed by the same amount. In other words, the delayed ACK does not delay the packet it acknowledges; it delays the peer's next small packet, i.e. the response.

1  00:44:37.878027 IP 172.25.38.135.44792 > 172.25.81.16.9877: S 3512052379:3512052379(0) win 5840 <mss 1448,wscale 7>
2  00:44:37.878045 IP 172.25.81.16.9877 > 172.25.38.135.44792: S 3581620571:3581620571(0) ack 3512052380 win 5792 <mss 1460,wscale 2>
3  00:44:37.879080 IP 172.25.38.135.44792 > 172.25.81.16.9877: . ack 1 win 46
......
4  00:44:38.885325 IP 172.25.38.135.44792 > 172.25.81.16.9877: P 1321:1453(132) ack 1321 win 86
5  00:44:38.886037 IP 172.25.81.16.9877 > 172.25.38.135.44792: P 1321:1453(132) ack 1453 win 2310
6  00:44:38.887174 IP 172.25.38.135.44792 > 172.25.81.16.9877: P 1453:2641(1188) ack 1453 win 102
7  00:44:38.887888 IP 172.25.81.16.9877 > 172.25.38.135.44792: P 1453:2476(1023) ack 2641 win 2904
8  00:44:38.925270 IP 172.25.38.135.44792 > 172.25.81.16.9877: . ack 2476 win 118
9  00:44:38.925276 IP 172.25.81.16.9877 > 172.25.38.135.44792: P 2476:2641(165) ack 2641 win 2904
10 00:44:38.926328 IP 172.25.38.135.44792 > 172.25.81.16.9877: . ack 2641 win 134

From the tcpdump capture above: packet 8, the client's ACK, is delayed by about 40 ms. The data in packet 9 was already sitting in the server's (172.25.81.16) TCP send buffer (the application's send call had already returned), but under Nagle rule 4, packet 9 could not be transmitted until the ACK for packet 7 (which is smaller than MSS) arrived.

 

Delayed ACK and congestion control

Now disable the Nagle algorithm with the TCP_NODELAY option, and look at how delayed ACK interacts with TCP congestion control.
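For reference, a minimal sketch of what disabling Nagle looks like (standard sockets API; fd is assumed to be a connected TCP socket):

/* Disable the Nagle algorithm: small packets are sent immediately
 * instead of waiting for previously sent small packets to be ACKed. */
#include <netinet/in.h>
#include <netinet/tcp.h>
#include <sys/socket.h>

static int disable_nagle(int fd)
{
    int one = 1;
    return setsockopt(fd, IPPROTO_TCP, TCP_NODELAY, &one, sizeof(one));
}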

Slow start: the TCP sender maintains a congestion window, denoted cwnd. When the connection is established, cwnd is initialized to one segment; each received ACK increases it by one segment. The sender uses the minimum of the congestion window and the advertised window (from the sliding-window mechanism) as its sending limit: the congestion window is flow control imposed by the sender, the advertised window is flow control imposed by the receiver. The sender begins by transmitting one segment; when its ACK arrives, cwnd grows from 1 to 2, so two segments may be sent; when those two ACKs arrive, cwnd grows to 4. The growth is exponential per round trip: in the first RTT one packet is sent and one ACK received (cwnd +1); in the second RTT two packets are sent, and each of the two ACKs adds 1, bringing cwnd to 4; and so on.
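A toy sketch of that growth pattern (an illustration only, not kernel code; it assumes one ACK arrives per in-flight segment and no losses):

#include <stdio.h>

int main(void)
{
    unsigned cwnd = 1;                    /* initial congestion window */

    for (int rtt = 1; rtt <= 4; rtt++) {
        unsigned acks = cwnd;             /* one ACK per segment sent this RTT */
        for (unsigned i = 0; i < acks; i++)
            cwnd++;                       /* +1 segment per received ACK */
        printf("after RTT %d: cwnd = %u segments\n", rtt, cwnd);
    }
    return 0;                             /* prints 2, 4, 8, 16 */
}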

Note that in Linux, cwnd is not increased on every received ACK: if an ACK arrives while no other packet is awaiting acknowledgment, cwnd is left unchanged.

Using the test code from Case 1, in actual tests cwnd started from an initial value of 2 and eventually settled at 3 segments. The tcpdump output:

1  16:46:14.288604 IP 172.16.1.3.1913 > 172.16.1.2.20001: S 1324697951:1324697951(0) win 5840 <mss 1460,wscale 2>
2  16:46:14.289549 IP 172.16.1.2.20001 > 172.16.1.3.1913: S 2866427156:2866427156(0) ack 1324697952 win 5792 <mss 1460,wscale 2>
3  16:46:14.288690 IP 172.16.1.3.1913 > 172.16.1.2.20001: . ack 1 win 1460
......
4  16:46:15.327493 IP 172.16.1.3.1913 > 172.16.1.2.20001: P 1321:1453(132) ack 1321 win 4140
5  16:46:15.329749 IP 172.16.1.2.20001 > 172.16.1.3.1913: P 1321:1453(132) ack 1453 win 2904
6  16:46:15.330001 IP 172.16.1.3.1913 > 172.16.1.2.20001: P 1453:2641(1188) ack 1453 win 4140
7  16:46:15.333629 IP 172.16.1.2.20001 > 172.16.1.3.1913: P 1453:1585(132) ack 2641 win 3498
8  16:46:15.337629 IP 172.16.1.2.20001 > 172.16.1.3.1913: P 1585:1717(132) ack 2641 win 3498
9  16:46:15.340035 IP 172.16.1.2.20001 > 172.16.1.3.1913: P 1717:1849(132) ack 2641 win 3498
10 16:46:15.371416 IP 172.16.1.3.1913 > 172.16.1.2.20001: . ack 1849 win 4140
11 16:46:15.371461 IP 172.16.1.2.20001 > 172.16.1.3.1913: P 1849:2641(792) ack 2641 win 3498
12 16:46:15.371581 IP 172.16.1.3.1913 > 172.16.1.2.20001: . ack 2641 win 4536

The trace above was captured with TCP_NODELAY set and cwnd already grown to 3. After packets 7, 8, and 9 are sent, the congestion window is used up: even though more data is waiting in the TCP send buffer, it cannot be transmitted. Packet 11 can only go out after packet 10, the client's ACK, arrives, and packet 10 is delayed by roughly 40 ms.

 

Note: you can use getsockopt with the TCP_INFO option (man 7 tcp) to inspect TCP connection details such as the current congestion window and MSS.
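A minimal sketch of that query (Linux-specific; assumes a connected socket fd):

/* Read TCP_INFO to inspect the connection's congestion window and MSS. */
#include <netinet/in.h>
#include <netinet/tcp.h>
#include <stdio.h>
#include <sys/socket.h>

static void print_tcp_info(int fd)
{
    struct tcp_info info;
    socklen_t len = sizeof(info);

    if (getsockopt(fd, IPPROTO_TCP, TCP_INFO, &info, &len) == 0)
        printf("cwnd = %u segments, snd_mss = %u bytes, rtt = %u us\n",
               info.tcpi_snd_cwnd, info.tcpi_snd_mss, info.tcpi_rtt);
}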

 

2. Why 40 ms? Can the interval be adjusted?

First, the Red Hat official documentation notes the following:

Applications that send small packets may experience delays caused by the TCP delayed ACK mechanism, whose default is 40 ms. The minimum delayed ACK timeout can be adjusted system-wide through tcp_delack_min. For example:

# echo 1 > /proc/sys/net/ipv4/tcp_delack_min

This sets the minimum delayed ACK timeout to 1 ms.

However, this sysctl does not exist on Slackware or SUSE systems; on those two systems the 40 ms minimum cannot be changed through configuration.

 

linux-2.6.39.1/include/net/tcp.h contains the following macro definition:

#define TCP_DELACK_MIN ((unsigned)(HZ/25)) /* minimal time to delay before sending an ACK */

Note: the Linux kernel raises a timer interrupt (IRQ 0) at a fixed rate, and HZ defines the number of timer interrupts per second; with HZ = 1000 there are 1000 timer ticks per second. HZ is chosen at kernel compile time; the kernels on our servers use HZ = 250.

So TCP_DELACK_MIN is HZ/25 timer ticks. Since each tick lasts 1/HZ seconds, this works out to (HZ/25) * (1/HZ) = 1/25 s = 40 ms regardless of the HZ value; with HZ = 250 it is 10 jiffies. This is where the 40 ms minimum delayed ACK time comes from.

A connection's delayed ACK timeout is generally initialized to this 40 ms minimum and then continuously adjusted based on parameters such as the connection's retransmission timeout (RTO) and the interval between the previous and current received packets. For the exact adjustment algorithm, see the tcp_event_data_recv function at line 564 of linux-2.6.39.1/net/ipv4/tcp_input.c.

 

3. Why must TCP_QUICKACK be reset after every recv call?

man 7 tcp says:

TCP_QUICKACK
    Enable quickack mode if set or disable quickack mode if cleared. In quickack mode, ACKs are sent immediately, rather than delayed if needed in accordance to normal TCP operation. This flag is not permanent, it only enables a switch to or from quickack mode. Subsequent operation of the TCP protocol will once again enter/leave quickack mode depending on internal protocol processing and factors such as delayed ack timeouts occurring and data transfer. This option should not be used in code intended to be portable.

The manual states plainly that TCP_QUICKACK is not permanent. How is this implemented? See the handling of the TCP_QUICKACK option in the kernel's setsockopt path:

case TCP_QUICKACK:
    if (!val) {
        icsk->icsk_ack.pingpong = 1;
    } else {
        icsk->icsk_ack.pingpong = 0;
        if ((1 << sk->sk_state) &
            (TCPF_ESTABLISHED | TCPF_CLOSE_WAIT) &&
            inet_csk_ack_scheduled(sk)) {
            icsk->icsk_ack.pending |= ICSK_ACK_PUSHED;
            tcp_cleanup_rbuf(sk, 1);
            if (!(val & 1))
                icsk->icsk_ack.pingpong = 1;
        }
    }
    break;

In the Linux implementation, each socket carries a pingpong flag indicating whether the connection is an interactive data flow: when it is 1, the connection is treated as interactive and the delayed ACK mechanism is used. The pingpong value changes dynamically, however. For example, whenever the connection transmits a data packet, the following function runs (linux-2.6.39.1/net/ipv4/tcp_output.c, line 156):

/* Congestion state accounting after a packet has been sent. */
static void tcp_event_data_sent(struct tcp_sock *tp,
                                struct sk_buff *skb, struct sock *sk)
{
    ......
    tp->lsndtime = now;

    /* If it is a reply for ato after last received
     * packet, enter pingpong mode.
     */
    if ((u32)(now - icsk->icsk_ack.lrcvtime) < icsk->icsk_ack.ato)
        icsk->icsk_ack.pingpong = 1;
}

The last two lines say: if the interval between now and the time the last packet was received is smaller than the computed delayed ACK timeout (ato), the connection re-enters interactive (pingpong) mode. Put differently, whenever the delayed ACK mechanism takes effect, the connection automatically slips back into interactive mode.

Because of this, the TCP_QUICKACK option has to be set again after every recv call.
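A minimal sketch of this pattern, wrapping the two calls in a helper (quick_recv is a hypothetical name, not a system API):

/* quick_recv: hypothetical wrapper that re-arms TCP_QUICKACK after
 * every read, since the kernel may silently fall back into pingpong mode. */
#include <netinet/in.h>
#include <netinet/tcp.h>
#include <sys/socket.h>

static ssize_t quick_recv(int fd, void *buf, size_t len)
{
    ssize_t n = recv(fd, buf, len, 0);
    int one = 1;

    /* TCP_QUICKACK is not permanent: set it again after each receive. */
    setsockopt(fd, IPPROTO_TCP, TCP_QUICKACK, &one, sizeof(one));
    return n;
}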

 

4. Why aren't all ACKs delayed?

The TCP implementation uses the function tcp_in_quickack_mode (linux-2.6.39.1/net/ipv4/tcp_input.c, line 197) to decide whether to send an ACK immediately:

/* Send ACKs quickly, if "quick" count is not exhausted
 * and the session is not interactive.
 */
static inline int tcp_in_quickack_mode(const struct sock *sk)
{
    const struct inet_connection_sock *icsk = inet_csk(sk);

    return icsk->icsk_ack.quick && !icsk->icsk_ack.pingpong;
}

Two conditions must both hold for the connection to be in quickack mode:

1. pingpong is 0;

2. the quick counter is non-zero.

pingpong was described above. The quick field is commented in the code as the "scheduled number of quick acks", i.e. how many ACKs may still be sent without delay. Each time the connection enters quickack mode, quick is initialized to the receive window divided by 2*MSS (linux-2.6.39.1/net/ipv4/tcp_input.c, line 174), and every ACK sent decrements it by 1; once the counter is exhausted, ACKs are delayed again. This is why only some packets have their ACKs delayed.
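For reference, the initialization mentioned above is done by tcp_incr_quickack; a lightly abridged excerpt from the 2.6.39 sources (treat the details as approximate):

/* Maximal number of ACKs sent quickly to accelerate slow-start. */
#define TCP_MAX_QUICKACKS 16U

static void tcp_incr_quickack(struct sock *sk)
{
    struct inet_connection_sock *icsk = inet_csk(sk);
    unsigned quickacks = tcp_sk(sk)->rcv_wnd / (2 * icsk->icsk_ack.rcv_mss);

    if (quickacks == 0)
        quickacks = 2;
    if (quickacks > icsk->icsk_ack.quick)
        icsk->icsk_ack.quick = min(quickacks, TCP_MAX_QUICKACKS);
}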

 

5. About the TCP_CORK option

Both TCP_NODELAY and TCP_CORK control the Nagle-style batching of small packets:

1. Enabling TCP_NODELAY means every packet is sent immediately, no matter how small (the congestion window still applies).

2. If a TCP connection is a pipe, TCP_CORK acts like a cork: setting the option plugs the pipe, and clearing it pulls the cork back out. For example:

int on = 1;
setsockopt(sockfd, SOL_TCP, TCP_CORK, &on, sizeof(on)); /* set TCP_CORK */

write(sockfd, ...);    /* e.g., HTTP header */

sendfile(sockfd, ...); /* e.g., HTTP body */

on = 0;
setsockopt(sockfd, SOL_TCP, TCP_CORK, &on, sizeof(on)); /* unset TCP_CORK */

While TCP_CORK is set, the connection does not send partial frames: data goes out only once a full MSS worth has accumulated. When the transmission is complete, the option normally needs to be cleared so that any remaining data shorter than an MSS is flushed immediately. If an application knows that several pieces of data should go out together (for example an HTTP response header and body), setting TCP_CORK avoids introducing latency between them. The option is typically used by web and file servers to improve performance and throughput.

The well-known high-performance web server nginx enables TCP_CORK when operating in sendfile mode: set tcp_nopush to on in nginx.conf. (TCP_NOPUSH and TCP_CORK serve the same purpose; NOPUSH is the BSD implementation, CORK the Linux one.) In addition, to reduce system calls and maximize performance, for short connections (where the connection is closed once the data is transmitted, i.e. everything except HTTP keep-alive persistent connections) nginx does not call setsockopt to clear TCP_CORK: closing the connection clears the option automatically and flushes the remaining data.
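A sketch of the relevant nginx.conf fragment (assuming a stock nginx build; tcp_nopush only takes effect when sendfile is enabled):

http {
    sendfile   on;
    tcp_nopush on;   # on Linux this corks the connection via TCP_CORK
}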

 
