[Reprint] Details of TCP data retransmission time and data center optimization

Source: Internet
Author: User

Original: http://weibo.com/p/1001603821691477346388

In a data center network, the round-trip time (RTT) between machines is generally within 10 ms, and internal service timeouts are typically set to 50 ms, 200 ms, 500 ms and so on. If a packet is lost in transit, does the TCP layer get a chance to detect the loss and retransmit before the service timeout expires? If the timeout is 200 ms or less, the answer is no. On Linux, the first retransmission timeout equals the round-trip time plus a predicted deviation of at least 200 ms; with an RTT of 7 ms, for example, the first retransmission fires no earlier than 207 ms after the send. So an interface whose timeout is 200 ms or less cannot tolerate even a single lost packet, however small the RTT, because the interface has already timed out before TCP notices the loss.

This article examines how Linux calculates the first retransmission time of a TCP packet; the result is surprising. The optimization proposed here can, in theory, reduce both the latency and the error count of internal service calls.

After TCP sends a packet, it starts a timer; if no acknowledgment (ACK) has arrived when the timer expires, the packet is retransmitted. The interval between sending a packet and its first retransmission is called the retransmission timeout (RTO). The RTO is calculated from the packet's round-trip time (RTT) plus the predicted deviation (fluctuation) of the RTT.

That is, RTO = srtt + rttvar, where srtt is the smoothed RTT and rttvar is the fluctuation value, which represents the possible prediction error.

Next we'll do an experiment.

First ping www.weibo.com and look at the round-trip time of the packets:

[xiaohong@localhost ~]$ ping www.weibo.com
PING www.weibo.com (123.125.104.197) bytes of data.
bytes from 123.125.104.197: icmp_seq=1 ttl=55 time=3.65 ms
bytes from 123.125.104.197: icmp_seq=2 ttl=55 time=3.38 ms
bytes from 123.125.104.197: icmp_seq=3 ttl=55 time=4.34 ms
bytes from 123.125.104.197: icmp_seq=4 ttl=55 time=7.82 ms

Next, look at the RTT-related data that TCP holds for www.weibo.com. The following command is for CentOS 7 (on earlier versions, run ip route list tab cache instead):

[xiaohong@localhost ~]$ sudo ip tcp_metrics
123.125.104.197 age 22.255sec rtt 7375us rttvar 7250us cwnd 10

As shown above, the smoothed RTT is about 7 ms and rttvar is about 7 ms, so the RTO should be about 14 ms; that is, if no reply arrives within 14 ms, the data should be retransmitted. Is that what actually happens?

In one terminal window, run the following command:

[xiaohong@localhost ~]$ nc www.weibo.com 80
GET / HTTP/1.1
Host: www.weibo.com
Connection:

While the connection is open, run the following command in a second terminal window:

[xiaohong@localhost iproute2-3.19.0]$ ss -eipn '( dport = :www )'
tcp ESTAB 0 0 10.209.80.111:56486 123.125.104.197:80 users:(("nc",1713,3)) uid:1000 ino:14243 sk:ffff88002c992d00 <->
    ts sack cubic wscale:0,7 rto:207 rtt:7.375/7.25 mss:1448 cwnd:10 send 15.7Mbps rcv_space:14600

As the output shows, the actual RTO is 207 ms, which is the RTT plus 200 ms. Why?

The kernel's TCP source code explains it.

The function that sets the timeout is tcp_set_rto, in net/ipv4/tcp_input.c:

static inline void tcp_set_rto(struct sock *sk)
{
    const struct tcp_sock *tp = tcp_sk(sk);

    inet_csk(sk)->icsk_rto = __tcp_set_rto(tp);
    tcp_bound_rto(sk);
}

As can be seen, the retransmission timer value icsk_rto is computed by __tcp_set_rto. Its source, in include/net/tcp.h, is as follows:

static inline u32 __tcp_set_rto(const struct tcp_sock *tp)
{
    return (tp->srtt >> 3) + tp->rttvar;
}

To avoid floating-point arithmetic, srtt is stored in the socket data structure multiplied by 8, which is why the code shifts it right by 3. The code therefore confirms that:

icsk_rto = srtt + rttvar
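As a small illustration of this fixed-point convention, the sketch below recomputes the RTO from the two stored fields. It is a userspace approximation, not kernel code: the name rto_from_scaled is made up for this example, and values are assumed to be in milliseconds (the kernel discussed here counts in jiffies, which equal 1 ms only when HZ is 1000).

```c
#include <stdint.h>

/* Hypothetical helper mirroring the arithmetic of __tcp_set_rto().
 * srtt_x8 is the smoothed RTT stored multiplied by 8; rttvar is unscaled.
 * Units are assumed to be milliseconds for illustration only. */
static uint32_t rto_from_scaled(uint32_t srtt_x8, uint32_t rttvar)
{
    return (srtt_x8 >> 3) + rttvar;
}
```

With a smoothed RTT of 7 ms (stored as 56) and an rttvar of 200 ms, rto_from_scaled(56, 200) evaluates to 207, the value that ss reported for the test connection.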

The function that computes and updates srtt and rttvar is tcp_rtt_estimator, in net/ipv4/tcp_input.c:

static void tcp_rtt_estimator(struct sock *sk, const __u32 mrtt)
{
    struct tcp_sock *tp = tcp_sk(sk);
    long m = mrtt; /* RTT */

    /* The following amusing code comes from Jacobson's
     * article in SIGCOMM '88. Note that rtt and mdev
     * are scaled versions of rtt and mean deviation.
     * This is designed to be as fast as possible
     * m stands for "measurement".
     *
     * On a 1990 paper the rto value is changed to:
     * RTO = rtt + 4 * mdev
     *
     * Funny. This algorithm seems to be very broken.
     * These formulae increase RTO, when it should be decreased, increase
     * too slowly, when it should be increased quickly, decrease too quickly
     * etc. I guess in BSD RTO takes ONE value, so that it is absolutely
     * does not matter how to _calculate_ it. Seems, it was trap
     * that VJ failed to avoid. 8)
     */
    if (m == 0)
        m = 1;
    if (tp->srtt != 0) {
        m -= (tp->srtt >> 3);   /* m is now error in rtt est */
        tp->srtt += m;          /* rtt = 7/8 rtt + 1/8 new */
        if (m < 0) {
            m = -m;             /* m is now abs(error) */
            m -= (tp->mdev >> 2);   /* similar update on mdev */
            /* This is similar to one of Eifel findings.
             * Eifel blocks mdev updates when rtt decreases.
             * This solution is a bit different: we use finer gain
             * for mdev in this case (alpha*beta).
             * Like Eifel it also prevents growth of rto,
             * but also it limits too fast rto decreases,
             * happening in pure Eifel.
             */
            if (m > 0)
                m >>= 3;
        } else {
            m -= (tp->mdev >> 2);   /* similar update on mdev */
        }
        tp->mdev += m;          /* mdev = 3/4 mdev + 1/4 new */
        if (tp->mdev > tp->mdev_max) {
            tp->mdev_max = tp->mdev;
            if (tp->mdev_max > tp->rttvar)
                tp->rttvar = tp->mdev_max;
        }
        if (after(tp->snd_una, tp->rtt_seq)) {
            if (tp->mdev_max < tp->rttvar)
                tp->rttvar -= (tp->rttvar - tp->mdev_max) >> 2;
            tp->rtt_seq = tp->snd_nxt;
            tp->mdev_max = tcp_rto_min(sk);
        }
    } else {
        /* no previous measure. */
        tp->srtt = m << 3;      /* take the measured time to be rtt */
        tp->mdev = m << 1;      /* make sure rto = 3*rtt */
        tp->mdev_max = tp->rttvar = max(tp->mdev, tcp_rto_min(sk));
        tp->rtt_seq = tp->snd_nxt;
    }
}

As the code above shows, srtt = 7/8 old srtt + 1/8 new RTT. This matches the RFC and needs no further comment.

The first round-trip sample usually arrives when the connection is established: on the client, when the SYN-ACK answering its SYN is received; on the server, when the client's ACK answering the SYN-ACK is received. The branch that handles this first sample is annotated below:

} else {
    /* no previous measure. */
    /* No RTT data yet: this branch handles the first RTT sample. */
    /* m is the RTT just measured; it is stored in srtt multiplied by 8. */
    tp->srtt = m << 3;      /* take the measured time to be rtt */
    /* The initial RTT deviation mdev is twice the RTT. */
    tp->mdev = m << 1;      /* make sure rto = 3*rtt */
    /* rttvar and the maximum deviation mdev_max both start at the larger
     * of twice the RTT and tcp_rto_min().
     */
    tp->mdev_max = tp->rttvar = max(tp->mdev, tcp_rto_min(sk));
    tp->rtt_seq = tp->snd_nxt;
}

Now look at the tcp_rto_min code, in include/net/tcp.h:

static inline u32 tcp_rto_min(struct sock *sk)
{
    struct dst_entry *dst = __sk_dst_get(sk);
    u32 rto_min = TCP_RTO_MIN;  /* 200ms */

    if (dst && dst_metric_locked(dst, RTAX_RTO_MIN))
        rto_min = dst_metric_rtt(dst, RTAX_RTO_MIN);
    return rto_min;
}

Putting this together: if the first measured round-trip time is within 100 ms, the initial predicted deviation is fixed at 200 ms; if it exceeds 100 ms, the initial predicted deviation is twice the RTT. In other words, the initial rttvar is at least 200 ms.
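The first-sample rule can be checked with a few lines of userspace C. This is only a sketch of the "no previous measure" branch: first_sample_rto and DEFAULT_RTO_MIN_MS are names invented for the example, and all values are assumed to be in milliseconds.

```c
#include <stdint.h>

#define DEFAULT_RTO_MIN_MS 200u  /* assumed to match TCP_RTO_MIN (200 ms) */

/* Hypothetical helper: RTO right after the first RTT sample, following the
 * "no previous measure" branch of tcp_rtt_estimator(). */
static uint32_t first_sample_rto(uint32_t rtt_ms, uint32_t rto_min_ms)
{
    uint32_t srtt_x8 = rtt_ms << 3;   /* srtt is stored multiplied by 8 */
    uint32_t mdev    = rtt_ms << 1;   /* initial deviation: twice the RTT */
    uint32_t rttvar  = mdev > rto_min_ms ? mdev : rto_min_ms;

    return (srtt_x8 >> 3) + rttvar;   /* same formula as __tcp_set_rto() */
}
```

Under these assumptions, first_sample_rto(7, DEFAULT_RTO_MIN_MS) is 207, while a 150 ms first sample gives 150 + 300 = 450, because twice the RTT now exceeds the 200 ms floor.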

Next, look again at the part of tcp_rtt_estimator that updates rttvar on later samples:

if (tp->mdev > tp->mdev_max) {
    /* Track the RTT deviation and record its maximum in mdev_max. */
    tp->mdev_max = tp->mdev;
    if (tp->mdev_max > tp->rttvar)  /* maximum deviation exceeds rttvar: rttvar grows */
        tp->rttvar = tp->mdev_max;
}
if (after(tp->snd_una, tp->rtt_seq)) {
    /* When the maximum deviation is below rttvar, rttvar is reduced. */
    if (tp->mdev_max < tp->rttvar)
        tp->rttvar -= (tp->rttvar - tp->mdev_max) >> 2;
    tp->rtt_seq = tp->snd_nxt;
    /* At the end of each send cycle, reset mdev_max to tcp_rto_min(). */
    tp->mdev_max = tcp_rto_min(sk);
}

That is, rttvar follows the actual deviation of the RTT: it grows when the fluctuation grows and shrinks when the fluctuation shrinks. However, because mdev_max is reset to tcp_rto_min() at the end of each send cycle, and rttvar only ever shrinks toward mdev_max, rttvar never drops below 200 ms.
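To see the floor hold over many samples, the estimator can be replayed in userspace. The following is a simplified sketch, not the kernel function: struct rtt_state, rtt_sample and simulate_rto are names invented here, values are assumed to be in milliseconds, and every sample is treated as the end of a send cycle (that is, after(snd_una, rtt_seq) is assumed to be true each time).

```c
struct rtt_state {
    long srtt;      /* smoothed RTT, stored multiplied by 8 */
    long mdev;      /* mean deviation, stored multiplied by 4 */
    long mdev_max;  /* largest mdev seen in the current send cycle */
    long rttvar;    /* deviation term that is added to the RTO */
};

/* Feed one RTT sample m (ms) into the state, following tcp_rtt_estimator(). */
static void rtt_sample(struct rtt_state *s, long m, long rto_min)
{
    if (m == 0)
        m = 1;
    if (s->srtt != 0) {
        m -= (s->srtt >> 3);        /* m is now the error in the estimate */
        s->srtt += m;               /* srtt = 7/8 srtt + 1/8 new */
        if (m < 0) {
            m = -m;                 /* absolute error */
            m -= (s->mdev >> 2);
            if (m > 0)
                m >>= 3;            /* finer gain when the RTT decreases */
        } else {
            m -= (s->mdev >> 2);
        }
        s->mdev += m;               /* mdev = 3/4 mdev + 1/4 new */
        if (s->mdev > s->mdev_max) {
            s->mdev_max = s->mdev;
            if (s->mdev_max > s->rttvar)
                s->rttvar = s->mdev_max;
        }
        /* Assumed end of send cycle: decay rttvar, then reset mdev_max. */
        if (s->mdev_max < s->rttvar)
            s->rttvar -= (s->rttvar - s->mdev_max) >> 2;
        s->mdev_max = rto_min;
    } else {
        /* First sample. */
        s->srtt = m << 3;
        s->mdev = m << 1;
        s->mdev_max = s->rttvar = (s->mdev > rto_min) ? s->mdev : rto_min;
    }
}

/* Replay `samples` identical RTT measurements and return the resulting RTO. */
static long simulate_rto(long rtt_ms, int samples, long rto_min)
{
    struct rtt_state s = {0, 0, 0, 0};

    for (int i = 0; i < samples; i++)
        rtt_sample(&s, rtt_ms, rto_min);
    return (s.srtt >> 3) + s.rttvar;
}
```

In this model a perfectly steady 7 ms RTT still yields an RTO of 207 ms after 100 samples with a 200 ms floor, and a steady 2 ms RTT with a 20 ms floor yields 22 ms. mdev itself shrinks to a few milliseconds, but rttvar cannot follow it below the floor.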

So is there a simple way to adjust the 200 ms limit? Look again at the tcp_rto_min code posted earlier:

static inline u32 tcp_rto_min(struct sock *sk)
{
    struct dst_entry *dst = __sk_dst_get(sk);
    u32 rto_min = TCP_RTO_MIN;  /* 200ms */

    if (dst && dst_metric_locked(dst, RTAX_RTO_MIN))
        rto_min = dst_metric_rtt(dst, RTAX_RTO_MIN);
    return rto_min;
}

As the code shows, if rto_min is set (and locked) in the routing table entry for the destination, that value takes precedence. It can be changed through the netlink mechanism, for example by passing the rto_min option to the ip route command.

With the source analyzed, let's try it out.

Run the following command to change the minimum to 20 ms:

sudo ip route add 123.125.104.197/32 via 10.209.83.254 rto_min 20

Check the result of the change:

[xiaohong@localhost ~]$ ip route list
default via 10.209.83.254 dev enp0s3 proto static metric 1024
10.209.80.0/22 dev enp0s3 proto kernel scope link src 10.209.80.111
123.125.104.197 via 10.209.83.254 dev enp0s3 rto_min lock 20ms

Flush the cached TCP metrics so that the effect is visible immediately:

sudo ip tcp_metrics flush

Test access to weibo.com again:

[xiaohong@localhost ~]$ nc www.weibo.com 80
GET /

Confirm the result in another terminal:

[xiaohong@localhost iproute2-3.19.0]$ ss -eipn '( dport = :www )'
tcp ESTAB 0 0 10.209.80.111:56487 123.125.104.197:80 users:(("nc",1786,3)) uid:1000 ino:14606 sk:ffff88002c992d00 <->
    ts sack cubic wscale:0,7 rto:22 rtt:2/1 mss:1448 cwnd:10 send 57.9Mbps rcv_space:14600

As shown, this time the RTT is 2 ms and the RTO is 22 ms, so the change has taken effect.
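To tie the measurement back to service timeouts, here is a rough budget check. It is a deliberately crude model: survives_one_loss is a name invented for the example, and it assumes exactly one lost packet, a retransmission that fires exactly at the RTO, one further round trip for the reply, and no server processing time.

```c
#include <stdbool.h>

/* Hypothetical helper: can a call that loses one packet still complete
 * before the caller's timeout? All arguments are in milliseconds. */
static bool survives_one_loss(unsigned timeout_ms, unsigned rto_ms,
                              unsigned rtt_ms)
{
    /* The retransmission leaves at rto_ms; its reply needs one more RTT. */
    return rto_ms + rtt_ms < timeout_ms;
}
```

Under this model, survives_one_loss(200, 207, 7) is false, so a 200 ms timeout cannot absorb a loss with the default floor, whereas survives_one_loss(50, 22, 2) is true once rto_min is lowered to 20 ms.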

Discussion is welcome, and so is criticism.
