TCP TIME_WAIT quick recovery and reuse


Declaration: in Linux, the TIME_WAIT duration cannot be modified without recompiling the kernel -- at least I have not found a way to change it. Note that the net.ipv4.tcp_fin_timeout parameter controls FIN_WAIT_2, not TIME_WAIT. I do not know why so many people mistake it for the TIME_WAIT value; I suspect two reasons: 1. TIME_WAIT is so prominent that any timeout attached to a TCP tunable is assumed to belong to it; 2. FIN_WAIT_2 is so obscure that few people even know it is a state. So do not take things for granted when learning.

The role of TIME_WAIT

TCP appeared in the early days of the Internet, when networks were very unstable and packet loss was common. For redundancy, multiple copies of a packet were transmitted along multiple paths, and links were slow -- which is exactly why TCP made sense. To preserve TCP's strict semantics, the problems caused by this redundancy and by slow links had to be avoided, and they show up mainly in the four-way wave that closes a connection. Because TCP is full-duplex, the connection must be closed in both directions. The side that initiates the close is the active closer; the other side is the passive closer. Many people get dizzy here, but the four-way wave is actually easier to follow than the three-way handshake. It breaks into three steps:

1. The active closer sends a FIN; the passive closer receives it and sends an ACK for that FIN.
2. The passive closer sends its own FIN; the active closer receives it and sends an ACK for that FIN.
3. The passive closer receives that ACK.

After these three steps, the circle is closed!
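The three steps above can be sketched as a toy state machine. This is an illustrative model of the simplified view described here, not kernel code; the event names are invented for clarity, while the state names are the standard TCP ones.

```python
# Toy model of the TCP four-way close. Event strings are illustrative;
# state names follow the standard TCP state machine.

ACTIVE_CLOSE = {
    ("ESTABLISHED", "send FIN"):      "FIN_WAIT_1",
    ("FIN_WAIT_1",  "recv ACK"):      "FIN_WAIT_2",
    ("FIN_WAIT_2",  "recv FIN"):      "TIME_WAIT",   # sends ACK, then waits 2*MSL
    ("TIME_WAIT",   "2*MSL elapsed"): "CLOSED",
}

PASSIVE_CLOSE = {
    ("ESTABLISHED", "recv FIN"):      "CLOSE_WAIT",  # sends ACK for the FIN
    ("CLOSE_WAIT",  "send FIN"):      "LAST_ACK",
    ("LAST_ACK",    "recv ACK"):      "CLOSED",      # 100% sure: no TIME_WAIT here
}

def run(table, events, state="ESTABLISHED"):
    """Walk a close sequence through the transition table."""
    for ev in events:
        state = table[(state, ev)]
    return state

print(run(ACTIVE_CLOSE,  ["send FIN", "recv ACK", "recv FIN"]))   # TIME_WAIT
print(run(PASSIVE_CLOSE, ["recv FIN", "send FIN", "recv ACK"]))   # CLOSED
```

Note the asymmetry the model makes visible: the passive closer reaches CLOSED directly, while the active closer is left holding TIME_WAIT.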
That is, after step 3 the passive closer can be 100% sure the connection is closed, so it can enter CLOSED directly. The active closer, however, cannot know whether the final ACK it sent was received. Per the TCP specification, an ACK is never itself ACKed, and the active closer can no longer receive data from the passive closer, so we seem to be deadlocked: how can the active closer ensure the circle is closed? Here something outside the protocol comes into play. Like STP (Spanning Tree Protocol), which converges on various timeout values, IP has a timeout of its own: the MSL (Maximum Segment Lifetime). This value matters because it provides an insurmountable physical boundary -- the only external input to an otherwise self-consistent protocol. MSL is the longest time an IP packet can survive on Earth; on Mars, the constant in the Linux code would have to be redefined and the kernel recompiled. That solves the problem: the active closer waits an MSL-based period before releasing the connection, and during that wait its state is TIME_WAIT. The passive closer, after sending its FIN, sits in LAST_ACK; since its FIN is already out and only an ACK is missing, the connection is effectively closed for it, so the passive closer never has a TIME_WAIT state. In fact it takes two MSLs to guarantee a segment is completely gone, because the returning ACK must also be accounted for within an MSL. The problems of TIME_WAIT need little elaboration: because of it, the sockets closed by short-lived connections occupy tuple space for a long time.
TIME_WAIT fast recycling

Linux implements a fast-recycling mechanism for the TIME_WAIT state: instead of waiting two MSLs, it waits only one retransmission timeout (usually so short that you will not even see the TIME_WAIT entry in netstat -ant) and then releases the connection. Once released, the tuple information of the connection is lost, and a newly established TCP connection over the same tuple is in danger. What danger? For example:

1. It may be terminated by a late FIN from the old connection.
2. It may be hijacked by the previous connection's old data.

So some safeguards are needed. What safeguards? Although the old connection's tuple information is gone, per-peer information can be kept at the IP layer. Note that this information is not only used by layer-4 TCP; it is also used by the routing logic. Its fields include, among others, the peer IP address and the timestamp of the last TCP segment seen from that peer. After a TIME_WAIT connection is quickly released, this peer entry is retained; only the port information is lost. But the peer IP address plus the timestamp of its last TCP contact is enough. The implementation then refuses a new SYN only when all of the following hold at once, so services can still be reached quickly even while connections sit in (quickly recycled) TIME_WAIT:

1. TCP connections from that peer machine carry timestamps.
2. TCP data from the same peer machine (identified by IP address only, since the port information was lost with the fast release) arrived at the local machine within the last MSL seconds.
3. The timestamp of the new connection is smaller than the timestamp recorded when the peer's last segment arrived -- that is, the timestamp appears to have gone backwards.
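The three-condition check can be sketched as a small predicate. This is an illustrative model of the rule described above, not the kernel's actual code; the MSL value of 60 seconds is an assumption for the sketch.

```python
MSL = 60  # seconds; an assumed value for this sketch

def syn_rejected(has_timestamp, seconds_since_last_seen, ts_new, ts_last):
    """Sketch of the per-peer check described above (not kernel code):
    a SYN from a fast-recycled peer is refused only if ALL three hold."""
    return (has_timestamp                        # 1. peer uses TCP timestamps
            and seconds_since_last_seen < MSL    # 2. peer was seen within MSL
            and ts_new < ts_last)                # 3. timestamp went backwards

# A SYN whose timestamp moved forward gets through:
print(syn_rejected(True, 5, ts_new=1000, ts_last=900))   # False
# A SYN whose timestamp went backwards within MSL is refused:
print(syn_rejected(True, 5, ts_new=800, ts_last=900))    # True
```

On a normal single machine the clock is monotonic, so condition 3 never fires; the trouble starts when several clocks hide behind one IP address, as the NAT scenario below shows.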
So a new connection is rejected only when all three points hold at once, which is far less likely than the rejections caused by the barriers of the full TIME_WAIT mechanism -- remember, the fast-release mechanism has no port information, which multiplies the tolerance by 65535. For a single machine this is nothing, since a single machine's timestamp cannot run backwards: by the time all three conditions are met, any old duplicate packets are long gone. But once a NAT device is involved, things get miserable, because the NAT rewrites the source IP of packets to one address (or a small pool of addresses) while leaving the TCP timestamp essentially untouched. That causes problems. Suppose TCP timestamps are enabled on both PC1 and PC2, and both reach port 22 of server S1 through NAT device N1:

PC1: 192.168.100.1
PC2: 192.168.100.2
N1 Internet-facing port (the address after NAT): 172.16.100.1
S1: 172.16.100.2

Configuration on all machines involved:

net.ipv4.tcp_tw_recycle = 1
net.ipv4.tcp_tw_reuse = 1
net.ipv4.tcp_timestamps = 1

The TCP timestamp is derived from the local machine's jiffies and uptime, and here I can ensure that PC2's timestamp is much smaller than PC1's. Now run telnet 172.16.100.2 22 on PC1; the connection is established, and a capture on S1 shows the timestamp: TS val 698583769. Then, on S1, run something like kill $(ps -ef | grep '[s]sh' | grep acce | awk -F ' ' '{print $2}') so that S1 tears the connection down from its side without touching the established ssh session. Now telnet from PC2: telnet 172.16.100.2 22 fails! A capture on S1 shows its timestamp: TS val 27727766 -- smaller than PC1's! Because of the NAT device, both connections appear to S1 to come from the same machine; the timestamp has gone backwards, so the connection is rejected!
Check the counters on S1: cat /proc/net/netstat shows the PAWSPassive value incremented by 1, and it increases by 1 each time PC2 retransmits its SYN; the connection only succeeds after an MSL has passed. In the reverse order there is no problem: telnet from PC2 first, let S1 actively close, then telnet from PC1 -- because the timestamp is now increasing, the third condition above is not met. With only two machines this ordering can be arranged, but how could a large fleet of source machines behind a NAT device at the server's entrance ever guarantee it? Think of a layer-3 NAT device deployed in front of a high-load website... nobody can ensure that the machine with the smaller timestamp always connects first, and with machines connecting and disconnecting constantly, they will certainly not line up in ascending timestamp order!!

TIME_WAIT fast recycling is enabled on Linux via net.ipv4.tcp_tw_recycle. Since the decision is based on timestamps, TCP timestamps must be enabled for it to take effect. Recommendation: if a layer-3 or layer-4 NAT device is deployed on the front end, disable fast recycling to avoid SYN rejections caused by the jumbled timestamps of the real machines behind the NAT.

TIME_WAIT reuse

If TIME_WAIT (hereafter TW -- switching input methods is too annoying) recycling is merely an optimization of one particular implementation, TW reuse has an actual specification behind it: a four-tuple (i.e., a socket connection) in the TW state may be used again by a new SYN if either of the following can be guaranteed:

1. The initial sequence number of the new connection is larger than the last sequence number of the old TW connection.
2. If timestamps are enabled, the timestamp of the new connection is greater than the timestamp of the old connection.

Linux implements these rules faithfully.
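The PAWSPassive counter mentioned above can be watched from a script by parsing /proc/net/netstat. The parser below is a minimal sketch assuming the usual procfs layout (a TcpExt header line of field names followed by a TcpExt line of values); the sample string is illustrative, and the exact counter names vary by kernel version.

```python
# Minimal parser for the TcpExt record in /proc/net/netstat: a header
# line of counter names, then a line of values, both prefixed "TcpExt:".

def tcpext_counters(netstat_text):
    lines = netstat_text.splitlines()
    for i, line in enumerate(lines):
        if line.startswith("TcpExt:") and i + 1 < len(lines):
            names  = line.split()[1:]            # skip the "TcpExt:" prefix
            values = lines[i + 1].split()[1:]
            return dict(zip(names, map(int, values)))
    return {}

# Illustrative sample, not real output:
sample = ("TcpExt: SyncookiesSent PAWSPassive PAWSEstab\n"
          "TcpExt: 0 3 1\n")
print(tcpext_counters(sample)["PAWSPassive"])   # 3

# On a real box:  tcpext_counters(open("/proc/net/netstat").read())
```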
We can confirm this with the following experiment.

Server program on S1: listen on port 1234, accept new connections, send a piece of data, then call close.
Additional configuration on S1: use iptables to keep RESET packets out of layer 4, because a RESET would terminate the TW state.
Client program on PC1: bind 192.168.100.1 port 2000 and connect to S1; after receiving the data, call close.
Client program on PC2: use the IP_TRANSPARENT option to bind the (non-local) address 192.168.100.1 and port 2000; otherwise identical to the program on PC1.

Start S1 (172.16.100.2) listening on port 1234. Start C1 on PC1 (192.168.100.1, port 2000), connect to S1's port 1234, and capture packets on S1: a normal three-way handshake, data transfer, and four-way close. At this point netstat -ant on S1 shows one TW connection. Now start C2 on PC2 (binding 192.168.100.1, port 2000) and connect to S1's port 1234. The capture on S1 shows a SYN with sequence number seq 3934898078, later than the last sequence number of PC1's connection, [F.], seq 2513913083, ack 3712390788, and S1 replies with a normal SYNACK: Flags [S.], seq 3712456325, ack 3934898079, ... In this TW-reuse case, the initial sequence number of S1's SYNACK is computed from the last ack of the old TW connection, 3712390788, plus the constant 65535 + 2! The experiment above was done with timestamps disabled; with timestamps enabled, reuse is even more likely -- after all, a TW connection can be reused if either of the two conditions holds!

Killing TIME_WAIT from the outside

TIME_WAIT is an appendix! On Linux, apart from enabling TW recycling, there is no way to shorten the TW interval, yet 80% of the problems are caused by the TW state. On Windows you have to add an undocumented registry entry, and the slightest spelling error fails silently!
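The either/or reuse rule the experiment exercises can be sketched as a predicate. This is an illustrative model of the two conditions stated above, not the kernel's actual logic; the numbers in the example are the sequence numbers from the capture.

```python
# Sketch of the TW-reuse rules: a new SYN may take over a TW four-tuple
# if EITHER condition holds (illustrative model, not kernel code).

def tw_reusable(new_isn, old_last_seq, new_ts=None, old_ts=None):
    if new_isn > old_last_seq:          # rule 1: sequence number moved on
        return True
    if new_ts is not None and old_ts is not None:
        return new_ts > old_ts          # rule 2: timestamp moved on
    return False

# The SYN from the experiment: seq 3934898078 > old last seq 2513913083
print(tw_reusable(3934898078, 2513913083))   # True
```

Note that with timestamps enabled, rule 2 can rescue a SYN whose initial sequence number happens to fall behind the old connection's, which is why reuse becomes even more likely.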
TW is genuinely irritating, so I have always wanted to kill connections stuck in TW -- especially TW connections on servers! We can use a TCP RESET to terminate a TW connection. How? Per the TCP specification, any data arriving at a non-listening port, or with a sequence number outside the window, must be answered with a RESET. That is usable! A connection waiting in TW can do nothing itself, but it can be killed from the outside. Concretely, use the IP_TRANSPARENT socket option, which allows binding a non-local address: from any machine, bind the peer address and port of the TW connection and initiate a connection to it. When the TW endpoint receives the SYN, the out-of-window sequence number makes it reply with a bare ACK, and that ACK travels to the TW connection's real peer. Since that peer has almost certainly (99% likely) released the connection already -- the passive closer never learns whether its final ACK arrived and does not wait an MSL for old data to die out -- it has no such connection and replies with a RESET. When the TW connection receives that RESET, it is released, freeing space for subsequent connections!

Linux tips

Linux handles one special situation: killing a process. When the system kills a process, the close function is called on each of its connections, unilaterally closing them; it does not wait for the peer to finish the close in the other direction before the process exits. Here the TCP specification collides head-on with the UNIX file-descriptor model: when a process exits, its sockets must be closed, but TCP is full-duplex, and you cannot guarantee that the peer will agree to close, and carry out the close, at the same moment. If the connection could not be closed, the process, as the descriptor's owner, could never fully exit!
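The outside-kill trick can be sketched in a few lines of socket code. This is a hedged sketch only: it assumes Linux, a privileged process (IP_TRANSPARENT requires CAP_NET_ADMIN), and a routing setup that actually delivers the spoofed traffic; the function name and the addresses in the comment are illustrative, not from the original article.

```python
# Hedged sketch of killing a TW connection from the outside, as described
# above. Requires Linux and privileges; addresses are illustrative.
import socket

# IP_TRANSPARENT is 19 in <linux/in.h>; fall back to the raw value if the
# Python build does not expose the constant.
IP_TRANSPARENT = getattr(socket, "IP_TRANSPARENT", 19)

def poke_time_wait(peer_ip, peer_port, tw_ip, tw_port):
    """Bind the TW connection's peer address non-locally (IP_TRANSPARENT)
    and send a SYN at the TW endpoint; the stale ACK this elicits should
    draw a RESET from the real peer, releasing the TW connection."""
    s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    s.setsockopt(socket.SOL_IP, IP_TRANSPARENT, 1)   # allow non-local bind
    s.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
    s.bind((peer_ip, peer_port))      # the TW connection's peer address
    s.settimeout(3)
    try:
        s.connect((tw_ip, tw_port))   # SYN toward the TW endpoint
    except OSError:
        pass                          # an RST or timeout here is expected
    finally:
        s.close()

# e.g. poke_time_wait("192.168.100.1", 2000, "172.16.100.2", 1234)
```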
Therefore, Linux uses a "sub-state" mechanism: when a process exits, its socket unilaterally sends a FIN, and the rest of the close sequence is handed over by copying the connection to a TW socket that consumes far fewer resources. The state moves directly to TIME_WAIT, with FIN_WAIT_2 recorded as a sub-state, and this socket has nothing further to do with the original process's descriptor. When segments for the connection arrive later, they are matched against these TW sockets -- major state TW, sub-state FIN_WAIT_2 -- which handle the remaining FIN and its ACK.

TIME_WAIT fast recycling versus reuse

From the descriptions above, a connection in the TW state can be both quickly recycled and reused, but the side effects differ. With fast recycling, the port information of the TW connection is lost and everything is keyed on the IP address, so the whole IP address -- that is, the whole machine -- becomes the unit being tested. That is fine in itself, because fast recycling only consults the timestamp, and as long as it grows monotonically (an ordinary machine's clock does not run backwards) all is well; but it breaks down under NAT aggregation, where the NAT device acts as the IP-address (host identity) proxy for all the machines behind it while leaving their timestamps untouched, and the timestamps of the different machines follow no common order... TW reuse avoids this whole-machine denial problem but costs more resources. One of its premises is that a single host will generally not use the same port within an MSL to connect to the same service, unless it has deliberately bound that port; given that, betting on stray old data having been lost or delivered is a reasonable prospect.
One thing I disagree with: if a SYN arrives at a TW socket that neither advances the sequence number nor advances the timestamp, the socket should discard it silently rather than send an ACK -- because if it sends an ACK and the other side has no such connection, a RESET comes back and terminates the TW state.

TIME_WAIT's 80/20 tragedy

80% of the problems are caused by the 20% that is TW, and in every TCP implementation a great deal of code exists just to handle it! Personally I think this is overdone. The TW state was introduced to make sure old data had either arrived or vanished, and the long wait was sized for the problems of many years ago -- I may have only just been born then, and my home may not even have had a telephone... The network conditions of that era genuinely required these mechanisms, but as network technology developed, TW gradually became a weakness. So what if a new TCP connection is terminated by an old FIN? So what if a new connection is hijacked by an old one? Even setting that aside, the MSL is far too long; the DDN era is long gone... And do not try to make TCP itself secure -- against a man-in-the-middle, would we not use SSL anyway? TCP, as a bottom-layer transport protocol, should be simple. But now? Although the core remains as it was, the details of TCP are too complicated for anyone eager to learn them; even the best writer could not produce a thoroughly understandable book on TCP's details. Look at the specifications: formulas everywhere, non-pluggable algorithms everywhere, magic numbers everywhere -- I suspect even their authors would struggle to explain every detail. I have to say TCP is a little over-designed. A design product of its time, it is ill-suited to an era that keeps migrating up the stack. Nowadays, more and more protocols and open-source software build on simple UDP instead.
To get in-order arrival, acknowledgment, negative acknowledgment, timestamps, reliable connections, and other such mechanisms, we implement exactly what we need rather than all of it; from TLS to OpenVPN, UDP is treated as the darling of the next generation. I hate TCP. You may refute me, but I would say you have been brainwashed: if you set out to design a reliable connection protocol yourself, you might actually do better than TCP.
