Implementation of RTO in Linux
I recently ran into a network timeout problem and troubleshot it as follows:
1. Ruled out code logic issues, TCP-related bugs, kernel parameters, and similar causes.
2. While investigating KVM, the timeout reproduced on different KVM guests on the same host.
Most of the abnormal connections lasted about 1 second. Packet captures showed that packets were being retransmitted, and that the retransmission interval was fixed at 1 second.
Why is the retransmission timeout 1 second? What do the relevant standards say, and what do actual implementations do?
This article focuses on that question (based on kernel 2.6.32-358 on CentOS).
RFC standard
The RTO is derived by an algorithm from the current network conditions (the RTT). "TCP/IP Illustrated, Volume 1" covers this topic, but its treatment is outdated.
According to the RFC index, RFC 6298 is the latest document on the retransmission timeout. It updates RFC 1122 and obsoletes RFC 2988.
A brief summary follows; read the RFC itself if you want the details.
RFC6298
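1. It restates the basic RTO calculation:
First, there is a time parameter RTO_MIN derived from the clock (the RFC calls it the clock granularity G).
Initialization (no RTT sample yet):
RTO = 1 second
First measurement R:
SRTT = R
RTTVAR = R / 2
RTO = SRTT + max(RTO_MIN, 4 * RTTVAR)
Subsequent measurements R':
RTTVAR = (1 - 1/4) * RTTVAR + 1/4 * |SRTT - R'|
SRTT = (1 - 1/8) * SRTT + 1/8 * R'
RTO = SRTT + max(RTO_MIN, 4 * RTTVAR)
The computed RTO should be rounded up to at least 1 second. An upper bound may be imposed, but it must be at least 60 seconds.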
2. For repeated retransmissions of the same segment, Karn's algorithm must be used, that is, the RTO is doubled on every retransmission.
In addition, RTT samples must not be taken from retransmitted segments unless the timestamps option is enabled (with timestamps, the RTT can be measured unambiguously).
3. When 4 * RTTVAR tends to 0, the variance term must be rounded up to RTO_MIN.
The finer the clock granularity the better; an error within the millisecond range is preferable.
4. RTO timer management
(1) When sending data (including retransmissions), check whether the timer is running; if it is not, start it. Stop the timer when the ACK for the outstanding data is received.
(2) On timeout, back off with RTO = RTO * 2.
(3) New fallback feature: if the timer expires while waiting for the ACK of a SYN and the TCP implementation is using an RTO of less than 3 seconds, the RTO of the connection must be reset to 3 seconds. The reset RTO is used once real data transmission begins (after the three-way handshake completes).
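The update rules summarised in items 1 to 4 can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the class name `Rfc6298Estimator`, its parameter names, and the use of floating-point seconds are mine, not taken from the RFC or from any kernel; `clock_floor` plays the role of the RTO_MIN term used in the formulas of this article.

```python
class Rfc6298Estimator:
    """Illustrative RTO estimator (hypothetical names, times in seconds)."""

    ALPHA = 1 / 8  # gain applied to SRTT
    BETA = 1 / 4   # gain applied to RTTVAR
    K = 4          # multiplier of RTTVAR in the RTO formula

    def __init__(self, clock_floor=0.001, min_rto=1.0, max_rto=60.0):
        self.clock_floor = clock_floor  # the RTO_MIN term inside max(...)
        self.min_rto = min_rto          # lower bound on the final RTO
        self.max_rto = max_rto          # upper bound, at least 60 s per the RFC
        self.srtt = None
        self.rttvar = None
        self.rto = 1.0                  # initial RTO before any sample

    def on_rtt_sample(self, r):
        """Feed one RTT sample that was NOT taken from a retransmission."""
        if self.srtt is None:
            # First measurement
            self.srtt = r
            self.rttvar = r / 2
        else:
            # RTTVAR is updated first, using the old SRTT
            self.rttvar = (1 - self.BETA) * self.rttvar + self.BETA * abs(self.srtt - r)
            self.srtt = (1 - self.ALPHA) * self.srtt + self.ALPHA * r
        rto = self.srtt + max(self.clock_floor, self.K * self.rttvar)
        self.rto = min(max(rto, self.min_rto), self.max_rto)
        return self.rto

    def on_timeout(self):
        """Exponential backoff when the retransmission timer expires."""
        self.rto = min(self.rto * 2, self.max_rto)
        return self.rto


if __name__ == "__main__":
    est = Rfc6298Estimator()
    print(est.on_rtt_sample(0.1))  # clamped to the 1-second lower bound
    print(est.on_timeout())        # doubled after a timeout
```

With the default 1-second lower bound, a 100 ms RTT never pulls the RTO below 1 second, which is exactly where the RFC and the Linux behaviour examined in the rest of the article diverge.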
Analysis of the actual Linux implementation
Sending the SYN packet of the three-way handshake
01:00:00.129688 IP 172.16.3.14.1868 > 172.16.10.40.80: Flags [S], seq 3774079837, win 14600, options [mss 1460,nop,nop,sackOK,nop,wscale 7], length 0
[…].129065 IP 172.16.3.14.1868 > 172.16.10.40.80: Flags [S], seq 3774079837, win 14600, options [mss 1460,nop,nop,sackOK,nop,wscale 7], length 0
[…].129063 IP 172.16.3.14.1868 > 172.16.10.40.80: Flags [S], seq 3774079837, win 14600, options [mss 1460,nop,nop,sackOK,nop,wscale 7], length 0
[…].129074 IP 172.16.3.14.1868 > 172.16.10.40.80: Flags [S], seq 3774079837, win 14600, options [mss 1460,nop,nop,sackOK,nop,wscale 7], length 0
[…].129072 IP 172.16.3.14.1868 > 172.16.10.40.80: Flags [S], seq 3774079837, win 14600, options [mss 1460,nop,nop,sackOK,nop,wscale 7], length 0
[…].129128 IP 172.16.3.14.1868 > 172.16.10.40.80: Flags [S], seq 3774079837, win 14600, options [mss 1460,nop,nop,sackOK,nop,wscale 7], length 0
The interval starts at 1 second and doubles each time.
Note that after the fifth retransmission, the upper layer is notified of the connection timeout only once a sixth timeout has expired, which is 63 seconds in total (1 + 2 + 4 + 8 + 16 + 32).
Sending the SYN-ACK packet of the three-way handshake
01:17:20.084839 IP 172.16.3.15.2535 > 172.16.3.14.80: Flags [S], seq 1297135388, win 14600, options [mss 1460,nop,nop,sackOK,nop,wscale 7], length 0
[…].084908 IP 172.16.3.14.80 > 172.16.3.15.2535: Flags [S.], seq 1194120443, ack 1297135389, win 14600, options [mss 1460,nop,nop,sackOK,nop,wscale 7], length 0
[…].284093 IP 172.16.3.14.80 > 172.16.3.15.2535: Flags [S.], seq 1194120443, ack 1297135389, win 14600, options [mss 1460,nop,nop,sackOK,nop,wscale 7], length 0
[…].284088 IP 172.16.3.14.80 > 172.16.3.15.2535: Flags [S.], seq 1194120443, ack 1297135389, win 14600, options [mss 1460,nop,nop,sackOK,nop,wscale 7], length 0
[…].284095 IP 172.16.3.14.80 > 172.16.3.15.2535: Flags [S.], seq 1194120443, ack 1297135389, win 14600, options [mss 1460,nop,nop,sackOK,nop,wscale 7], length 0
[…].284097 IP 172.16.3.14.80 > 172.16.3.15.2535: Flags [S.], seq 1194120443, ack 1297135389, win 14600, options [mss 1460,nop,nop,sackOK,nop,wscale 7], length 0
[…].284093 IP 172.16.3.14.80 > 172.16.3.15.2535: Flags [S.], seq 1194120443, ack 1297135389, win 14600, options [mss 1460,nop,nop,sackOK,nop,wscale 7], length 0
Again the interval starts at 1 second and doubles each time.
Normal data packet transmission
01:32:20.443757 IP 172.16.3.15.2548 > 172.16.3.14.80: Flags [P.], seq 3319667389:3319667400, ack 1233846614, win 115, length 11
[…].644600 IP 172.16.3.15.2548 > 172.16.3.14.80: Flags [P.], seq 3319667389:3319667400, ack 1233846614, win 115, length 11
01:32:21.046579 IP 172.16.3.15.2548 > 172.16.3.14.80: Flags [P.], seq 3319667389:3319667400, ack 1233846614, win 115, length 11
01:32:21.850632 IP 172.16.3.15.2548 > 172.16.3.14.80: Flags [P.], seq 3319667389:3319667400, ack 1233846614, win 115, length 11
[…].458555 IP 172.16.3.15.2548 > 172.16.3.14.80: Flags [P.], seq 3319667389:3319667400, ack 1233846614, win 115, length 11
[…].674594 IP 172.16.3.15.2548 > 172.16.3.14.80: Flags [P.], seq 3319667389:3319667400, ack 1233846614, win 115, length 11
[…].106601 IP 172.16.3.15.2548 > 172.16.3.14.80: Flags [P.], seq 3319667389:3319667400, ack 1233846614, win 115, length 11
[…].970567 IP 172.16.3.15.2548 > 172.16.3.14.80: Flags [P.], seq 3319667389:3319667400, ack 1233846614, win 115, length 11
01:33:11.698415 IP 172.16.3.15.2548 > 172.16.3.14.80: Flags [P.], seq 3319667389:3319667400, ack 1233846614, win 115, length 11
01:34:03.154300 IP 172.16.3.15.2548 > 172.16.3.14.80: Flags [P.], seq 3319667389:3319667400, ack 1233846614, win 115, length 11
01:35:46.065892 IP 172.16.3.15.2548 > 172.16.3.14.80: Flags [P.], seq 3319667389:3319667400, ack 1233846614, win 115, length 11
01:37:46.065382 IP 172.16.3.15.2548 > 172.16.3.14.80: Flags [P.], seq 3319667389:3319667400, ack 1233846614, win 115, length 11
01:39:46.064917 IP 172.16.3.15.2548 > 172.16.3.14.80: Flags [P.], seq 3319667389:3319667400, ack 1233846614, win 115, length 11
[…].064466 IP 172.16.3.15.2548 > 172.16.3.14.80: Flags [P.], seq 3319667389:3319667400, ack 1233846614, win 115, length 11
[…].064060 IP 172.16.3.15.2548 > 172.16.3.14.80: Flags [P.], seq 3319667389:3319667400, ack 1233846614, win 115, length 11
01:45:46.063675 IP 172.16.3.15.2548 > 172.16.3.14.80: Flags [P.], seq 3319667389:3319667400, ack 1233846614, win 115, length 11
The interval starts at 0.2 seconds and doubles up to a cap of 120 seconds, for 15 retransmissions in total.
Note that the sequence starts at minute 32 and the connection is given up at minute 47, that is, after about 15 minutes and 25 seconds.
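The totals quoted for the SYN case (63 seconds) and for the data case (about 15 minutes and 25 seconds) can be checked with a small calculation. The helper below is a sketch, not kernel code: the function name `backoff_intervals` is made up for illustration, and the values 1 s, 0.2 s, and 120 s are simply the ones read off the captures above.

```python
def backoff_intervals(initial_rto, timeouts, max_rto=120.0):
    """Return the successive timer intervals: doubling each time, capped at max_rto."""
    intervals = []
    rto = initial_rto
    for _ in range(timeouts):
        intervals.append(rto)
        rto = min(rto * 2, max_rto)
    return intervals


# SYN: 5 retransmissions, then one more timeout before the caller is notified.
syn = backoff_intervals(1.0, 6)
# Data: 15 retransmissions, then one more timeout before the connection is dropped.
data = backoff_intervals(0.2, 16)

print(syn, sum(syn))
print(round(sum(data), 1), divmod(round(sum(data)), 60))
```

The SYN intervals add up to 63 seconds, and the data intervals add up to 924.6 seconds, which matches the observed 15 minutes and 25 seconds.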
Does Linux support the fallback feature? Let's run a simple test.
# Server enables iptables, the client connects, then iptables is disabled
23:35:01.036565 IP 172.16.3.14.6071 > 172.16.10.40.12345: Flags [S], seq 2364912154, win 14600, options [mss 1460,nop,nop,sackOK,nop,wscale 7], length 0
[…].036152 IP 172.16.3.14.6071 > 172.16.10.40.12345: Flags [S], seq 2364912154, win 14600, options [mss 1460,nop,nop,sackOK,nop,wscale 7], length 0
23:35:04.036126 IP 172.16.3.14.6071 > 172.16.10.40.12345: Flags [S], seq 2364912154, win 14600, options [mss 1460,nop,nop,sackOK,nop,wscale 7], length 0
23:35:08.036127 IP 172.16.3.14.6071 > 172.16.10.40.12345: Flags [S], seq 2364912154, win 14600, options [mss 1460,nop,nop,sackOK,nop,wscale 7], length 0
[…].036131 IP 172.16.3.14.6071 > 172.16.10.40.12345: Flags [S], seq 2364912154, win 14600, options [mss 1460,nop,nop,sackOK,nop,wscale 7], length 0
[…].036842 IP 172.16.10.40.12345 > 172.16.3.14.6071: Flags [S.], seq 3634006739, ack 2364912155, win 14600, options [mss 1460], length 0
[…].036896 IP 172.16.3.14.6071 > 172.16.10.40.12345: Flags [.], ack 3634006740, win 14600, length 0
# Server enables iptables, the client sends a data packet, and iptables is disabled before the 15 retransmissions run out
23:35:48.129273 IP 172.16.3.14.6071 > 172.16.10.40.12345: Flags [P.], seq 2364912155:2364912156, ack 3634006740, win 14600, length 1
[…].129120 IP 172.16.3.14.6071 > 172.16.10.40.12345: Flags [P.], seq 2364912155:2364912156, ack 3634006740, win 14600, length 1
[…].129070 IP 172.16.3.14.6071 > 172.16.10.40.12345: Flags [P.], seq 2364912155:2364912156, ack 3634006740, win 14600, length 1
[…].129068 IP 172.16.3.14.6071 > 172.16.10.40.12345: Flags [P.], seq 2364912155:2364912156, ack 3634006740, win 14600, length 1
[…].129802 IP 172.16.10.40.12345 > 172.16.3.14.6071: Flags [.], …
# Server has iptables disabled, the client sends a data packet
23:36:15.217231 IP 172.16.3.14.6071 > 172.16.10.40.12345: Flags [P.], seq 2364912156:2364912157, ack 3634006740, win 14600, length 1
[…].217766 IP 172.16.10.40.12345 > 172.16.3.14.6071: Flags [.], ack 2364912157, win 14600, length 0
# Server enables iptables, the client sends a data packet
23:36:26.658172 IP 172.16.3.14.6071 > 172.16.10.40.12345: Flags [P.], seq 2364912157:2364912158, ack 3634006740, win 14600, length 1
[…].859055 IP 172.16.3.14.6071 > 172.16.10.40.12345: Flags [P.], seq 2364912157:2364912158, ack 3634006740, win 14600, length 1
[…].261065 IP 172.16.3.14.6071 > 172.16.10.40.12345: Flags [P.], seq 2364912157:2364912158, ack 3634006740, win 14600, length 1
[…].065106 IP 172.16.3.14.6071 > 172.16.10.40.12345: Flags [P.], seq 2364912157:2364912158, ack 3634006740, win 14600, length 1
[…].673132 IP 172.16.3.14.6071 > 172.16.10.40.12345: Flags [P.], seq 2364912157:2364912158, ack 3634006740, win 14600, length 1
[…].889068 IP 172.16.3.14.6071 > 172.16.10.40.12345: Flags [P.], seq 2364912157:2364912158, ack 3634006740, win 14600, length 1
[…].321091 IP 172.16.3.14.6071 > 172.16.10.40.12345: Flags [P.], seq 2364912157:2364912158, ack 3634006740, win 14600, length 1
[…].185135 IP 172.16.3.14.6071 > 172.16.10.40.12345: Flags [P.], seq 2364912157:2364912158, ack 3634006740, win 14600, length 1
[…].913091 IP 172.16.3.14.6071 > 172.16.10.40.12345: Flags [P.], seq 2364912157:2364912158, ack 3634006740, win 14600, length 1
This test shows that when the RTT exceeds 1 second during the three-way handshake, the RTO in the data-sending phase starts at 3 seconds (the same happens when the server's SYN-ACK times out).
After a normal RTT has been measured, the RTO converges back to roughly 200 ms, as the last group of retransmissions in the capture shows.
Now let's see what happens when the timestamps option is enabled.
# Server enables iptables, the client connects, then iptables is disabled
23:47:47.754316 IP 172.16.3.14.8603 > 172.16.10.40.12345: Flags [S], seq 479022248, win 14600, options [mss 1460,sackOK,TS val 2336007392 ecr 0,nop,wscale 7], length 0
[…].754079 IP 172.16.3.14.8603 > 172.16.10.40.12345: Flags [S], seq 479022248, win 14600, options [mss 1460,sackOK,TS val 2336008392 ecr 0,nop,wscale 7], length 0
[…].754088 IP 172.16.3.14.8603 > 172.16.10.40.12345: Flags [S], seq 479022248, win 14600, options [mss 1460,sackOK,TS val 2336010392 ecr 0,nop,wscale 7], length 0
[…].754083 IP 172.16.3.14.8603 > 172.16.10.40.12345: Flags [S], seq 479022248, win 14600, options [mss 1460,sackOK,TS val 2336014392 ecr 0,nop,wscale 7], length 0
[…].754094 IP 172.16.3.14.8603 > 172.16.10.40.12345: Flags [S], seq 479022248, win 14600, options [mss 1460,sackOK,TS val 2336022392 ecr 0,nop,wscale 7], length 0
[…].754683 IP 172.16.10.40.12345 > 172.16.3.14.8603: Flags [S.], seq 697602971, ack 479022249, win 14480, options [mss 1460,nop,nop,TS val 4044659641 ecr 2336022392], length 0
[…].754742 IP 172.16.3.14.8603 > 172.16.10.40.12345: Flags [.], ack 697602972, win 14600, options [nop,nop,TS val 2336022392 ecr 4044659641], length 0
# Server enables iptables, the client sends a data packet, and iptables is disabled before the 15 retransmissions run out
23:48:11.944170 IP 172.16.3.14.8603 > 172.16.10.40.12345: Flags [P.], seq 479022249:479022250, ack 697602972, win 14600, options [nop,nop,TS val 2336031582 ecr 4044659641], length 1
[…].145036 IP 172.16.3.14.8603 > 172.16.10.40.12345: Flags [P.], seq 479022249:479022250, ack 697602972, win 14600, options [nop,nop,TS val 2336031783 ecr 4044659641], length 1
[…].547084 IP 172.16.3.14.8603 > 172.16.10.40.12345: Flags [P.], seq 479022249:479022250, ack 697602972, win 14600, options [nop,nop,TS val 2336032185 ecr 4044659641], length 1
[…].351106 IP 172.16.3.14.8603 > 172.16.10.40.12345: Flags [P.], seq 479022249:479022250, ack 697602972, win 14600, options [nop,nop,TS val 2336032989 ecr 4044659641], length 1
[…].959080 IP 172.16.3.14.8603 > 172.16.10.40.12345: Flags [P.], seq 479022249:479022250, ack 697602972, win 14600, options [nop,nop,TS val 2336034597 ecr 4044659641], length 1
[…].175092 IP 172.16.3.14.8603 > 172.16.10.40.12345: Flags [P.], seq 479022249:479022250, ack 697602972, win 14600, options [nop,nop,TS val 2336037813 ecr 4044659641], length 1
[…].607088 IP 172.16.3.14.8603 > 172.16.10.40.12345: Flags [P.], seq 479022249:479022250, ack 697602972, win 14600, options [nop,nop,TS val 2336044245 ecr 4044659641], length 1
With timestamps enabled, the fallback mechanism that resets the RTO to 3 seconds does not take effect.
Linux's fine-tuning of the RTO calculation
The actual RTO calculation in Linux differs from the one in the RFC. If you rely on the RFC alone for the details, your estimate of the real RTO will go astray.
1. As the previous section shows, Linux sets the minimum RTO to 200 ms (even 50 ms on Ubuntu, whereas the RFC recommends 1 second) and the maximum to 120 seconds (the RFC requires 60 seconds or more).
2. From my reading of the Linux code, when the RTT jitters sharply, the implementation damps the disturbance caused by the abrupt change, which makes the RTO curve smoother.
This shows up in two fine-tuning points:
Fine-tuning 1
When the following condition holds:
SRTT - R' > RTTVAR
the new sample R' has swung too far: it lies below the smoothed RTT by more than RTTVAR.
Linux therefore uses:
RTTVAR = (1 - 1/32) * RTTVAR + 1/32 * |SRTT - R'|
whereas the RFC specifies:
RTTVAR = (1 - 1/4) * RTTVAR + 1/4 * |SRTT - R'|
As you can see, the smoothing factor is multiplied by 1/8 compared with the RFC, so R' has less influence on RTTVAR. This makes RTTVAR, and therefore the RTO, smoother.
Fine-tuning 2
When RTTVAR decreases, Linux smooths the decrease so that the RTO does not drop too abruptly and produce a steep curve.
Here, the RTTVAR computed from the RTT samples is first clamped to the lower bound (RTO_MIN). If the result is smaller than the RTTVAR from the previous RTT measurement, the difference is applied with a smoothing coefficient of 1/4.
Why is an increase not smoothed in the same way? In my view a larger RTO does no harm, whereas an RTO that shrinks too far may cause spurious retransmissions (for this term, see the RFC mentioned above).
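To make the two fine-tuning points concrete, here is a small floating-point model of them. It is a sketch of the behaviour as described in this section, not a copy of the kernel source: the function names are hypothetical, the damped gain of 1/32 is the RFC gain of 1/4 multiplied by 1/8 as stated above, and the trigger condition (the sample lies below SRTT by more than RTTVAR) is an assumption. Values are in milliseconds.

```python
RFC_GAIN = 1 / 4              # gain used by the RFC for RTTVAR
DAMPED_GAIN = RFC_GAIN / 8    # 1/32, the reduced gain of fine-tuning 1


def update_rttvar(rttvar, srtt, sample):
    """Fine-tuning 1: use a much smaller gain when the sample drops sharply."""
    err = abs(srtt - sample)
    sharp_drop = (srtt - sample) > rttvar  # assumed trigger condition
    gain = DAMPED_GAIN if sharp_drop else RFC_GAIN
    return (1 - gain) * rttvar + gain * err


def apply_decrease_smoothing(prev_rttvar, new_rttvar, rto_min):
    """Fine-tuning 2: clamp to rto_min, then let decreases through at 1/4 strength."""
    floored = max(new_rttvar, rto_min)
    if floored < prev_rttvar:
        return prev_rttvar - (prev_rttvar - floored) / 4
    return floored  # increases take effect immediately


if __name__ == "__main__":
    # A sample far below SRTT moves RTTVAR much less than the RFC rule would.
    print(update_rttvar(10, 100, 20))               # damped update
    print(0.75 * 10 + 0.25 * abs(100 - 20))         # plain RFC update, for comparison
    print(apply_decrease_smoothing(400, 100, 200))  # decrease is smoothed
```

In this model a sudden drop of the RTT from 100 ms to 20 ms raises RTTVAR only slightly, while the plain RFC rule would more than double it.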
Manually adjusting the RTO
Back to the initial question: can we shorten the RTO, and how can it be estimated given Linux's actual implementation?
Obviously, the initial RTO values (including the fallback) cannot be changed. They are hard-coded.
The RTO outside the three-way handshake, however, can be estimated.
Assume that the network is stable during the estimate, so that the RTT stays constant at R (otherwise fine-tuning 1 and 2 make the analysis extremely complicated).
Then SRTT is always R, and RTTVAR is always 0.5R.
It follows that RTO = SRTT + max(RTO_MIN, 4 * RTTVAR) = R + max(RTO_MIN, 2R).
Therefore, changing RTO_MIN alone is enough to significantly affect the RTO.
RTO_MIN settings
RTO_MIN is configured through ip route.
[root@localhost.localdomain ~]# ping www.baidu.com
PING www.a.shifen.com (180.97.33.107) 56(84) bytes of data.
64 bytes from 180.97.33.107: icmp_seq=1 ttl=51 time=30.8 ms
64 bytes from 180.97.33.107: icmp_seq=2 ttl=51 time=29.9 ms
# After obtaining Baidu's IP address
[root@localhost.localdomain ~]# ip route add 180.97.33.108/32 via 172.16.3.1 rto_min 20
[root@localhost.localdomain ~]# nc www.baidu.com 80
[root@localhost.localdomain ~]# ss -eipn '( dport = :www )'
State Recv-Q Send-Q Local Address:Port Peer Address:Port
ESTAB 0 0 172.16.3.14:14149 180.97.33.108:80 users:(("nc",7162,3)) ino:48057454 sk:ffff88023905adc0
sack cubic wscale:… rto:81 rtt:27/13.5 cwnd:10 send 4.3Mbps rcv_space:14600
Because RTO_MIN < 2R, RTO = 3R = 27 * 3 = 81.
On an intranet, the RTT is very small:
[root@localhost.localdomain ~]# ip route add 172.16.3.16/32 via 172.16.3.1 rto_min 20
[root@localhost.localdomain ~]# nc 172.16.3.16 22
SSH-2.0-OpenSSH_5.3
[root@localhost.localdomain ~]# ss -eipn '( dport = :22 )'
State Recv-Q Send-Q Local Address:Port Peer Address:Port
ESTAB 0 0 172.16.3.14:57578 172.16.3.16:22 users:(("nc",7272,3)) ino:48059707 sk:ffff88023b7c7000
sack cubic wscale:7,7 rto:21 rtt:1/0.5 ato:40 cwnd:10 send 116.8Mbps rcv_space:14600
Because RTO_MIN > 2R, RTO = R + RTO_MIN = 1 + 20 = 21.
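Both results can be reproduced from the stable-network assumption made earlier (SRTT = R, RTTVAR = 0.5R). The function below is only a worked calculation of that estimate; its name is made up for illustration, and it does not model fine-tuning 1 and 2.

```python
def steady_state_rto(rtt_ms, rto_min_ms):
    """Estimate the RTO for a constant RTT, following the article's simplification."""
    srtt = float(rtt_ms)
    rttvar = 0.5 * rtt_ms
    return srtt + max(rto_min_ms, 4 * rttvar)


print(steady_state_rto(27, 20))  # Internet case: RTO_MIN < 2R, so RTO = 3R
print(steady_state_rto(1, 20))   # Intranet case: RTO_MIN > 2R, so RTO = R + RTO_MIN
```

The two calls give 81 ms and 21 ms, the rto values reported by ss in the two sessions above.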
If you trust the entire intranet, you can apply the setting to all connections without specifying a destination IP address:
ip route change dev eth0 rto_min 20ms
Summary
1. Linux's retransmission timeout implementation broadly follows the RFC, with some minor adjustments:
The RFC has a single initial RTO of 1 second. In Linux, the RTO for three-way handshake packets starts at 1 second, while for other packets it starts at 0.2 seconds.
Because the RFC algorithm is imperfect, Linux's implementation damps the disturbance caused by sharp RTT jitter, making the RTO curve smoother.
2. The SYN retransmission timeout of a connection cannot be adjusted without recompiling the kernel, but the retransmission timeout of data (PSH) packets can be.
3. In a stable network, with the minimum RTO set to RTO_MIN:
if RTO_MIN < 2RTT, RTO = 3RTT; if RTO_MIN > 2RTT, RTO = RTT + RTO_MIN.