This article is a follow-up, so if you are not familiar with TCP, please read the previous part first, where we covered the TCP header, the state machine, and data retransmission. Beyond those, TCP solves one very big problem: it dynamically adjusts its sending rate according to network conditions. In the small, that makes each connection more stable; in the large, it makes the whole network more stable. Before you read on, be prepared: this article contains a number of algorithms and strategies that may trigger all sorts of thinking and make your brain allocate a lot of memory and computing resources, so it is not suitable for reading on the toilet.

TCP's RTT Algorithms
From the retransmission mechanism in the previous article we know that the timeout setting matters a great deal for retransmission. Set it too long and retransmission is slow: low efficiency, poor performance. Set it too short and retransmission fires too eagerly, which increases network congestion, causing more timeouts, and more timeouts cause yet more retransmissions.
Moreover, the right timeout differs with network conditions; there is no way to hard-code it, so it can only be set dynamically. To do that, TCP introduces RTT (Round Trip Time): the time from when a packet is sent out to when its acknowledgement comes back. With it, the sender knows roughly how long a round trip takes and can easily set the RTO (Retransmission TimeOut) accordingly, making the retransmission mechanism more efficient. It sounds simple: note t0 when a segment is sent, note t1 when its ACK comes back, and RTT = t1 - t0. But it is not that simple; that is just one sample, and one sample does not represent the general situation.
The Classic Algorithm
The classic algorithm defined in RFC 793 works like this:

1) First sample the RTT, recording the values of the last several round trips.

2) Then compute a smoothed RTT, called SRTT (Smoothed RTT). The formula is (where α is between 0.8 and 0.9; in English this technique is called an exponential weighted moving average):

SRTT = (α * SRTT) + ((1 - α) * RTT)

3) Then compute the RTO. The formula:

RTO = min[UBOUND, max[LBOUND, (β * SRTT)]]

where UBOUND is the maximum timeout (an upper bound), LBOUND is the minimum timeout (a lower bound), and β is generally between 1.3 and 2.0.
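To make the arithmetic concrete, here is a minimal sketch of the classic estimator in Python. The constants ALPHA, BETA, UBOUND, and LBOUND are illustrative values picked from the ranges above, not from any particular implementation:

```python
# A minimal sketch of the RFC 793 classic RTO computation.
ALPHA = 0.9      # smoothing gain; the RFC suggests 0.8 to 0.9
BETA = 2.0       # delay variance factor; the RFC suggests 1.3 to 2.0
UBOUND = 60.0    # upper bound on RTO, seconds (illustrative)
LBOUND = 1.0     # lower bound on RTO, seconds (illustrative)

def classic_rto(rtt_samples, initial_srtt=1.0):
    """Fold each RTT sample (seconds) into SRTT, then derive the RTO."""
    srtt = initial_srtt
    rto = min(UBOUND, max(LBOUND, BETA * srtt))
    for rtt in rtt_samples:
        # exponential weighted moving average: old history dominates
        srtt = ALPHA * srtt + (1 - ALPHA) * rtt
        rto = min(UBOUND, max(LBOUND, BETA * srtt))
    return srtt, rto

srtt, rto = classic_rto([0.5, 0.5, 0.5])
```

Note how heavily the old SRTT dominates: with α = 0.9, a single unusual sample barely moves the estimate, which is exactly the smoothing weakness the later algorithms address.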
The Karn/Partridge Algorithm

But the algorithm above runs into a fatal problem as soon as retransmission happens: when a segment is retransmitted, do you sample the RTT from the first transmission to the ACK, or from the retransmission to the ACK? Whichever you pick, another problem pops up. As the following figure shows: in case (a) the ACK never came back and a retransmission was sent; if you measure from the first transmission to the ACK, the sample is clearly too large. In case (b) the ACK was merely slow, a retransmission went out, and then the original ACK arrived; if you measure from the retransmission to the ACK, the sample is too small.
So in 1987 the Karn/Partridge algorithm appeared. Its defining feature is simple: ignore retransmissions, and never take an RTT sample from a retransmitted segment (you see: no need to solve the problem if you refuse to face it). But that in turn opens a big hole. Suppose the network flaps and suddenly slows down, producing much larger delays that cause every segment to be retransmitted (since the old RTO is now far too small); because retransmissions are never sampled, the RTO never gets updated, which is a disaster. So the Karn algorithm uses a trick: whenever a retransmission occurs, double the current RTO value. This is the well-known exponential backoff.

The Jacobson/Karels Algorithm
The two algorithms above both use a weighted moving average, and the biggest problem with that approach is that a large swing in RTT is hard to notice because it gets smoothed away. So in 1988 a new algorithm was proposed, the Jacobson/Karels algorithm (see RFC 6298). It introduces the gap between the newest RTT sample and the smoothed SRTT as a factor in the computation. The formulas are as follows (DevRTT means Deviation RTT):

SRTT = SRTT + α * (RTT - SRTT)

DevRTT = (1 - β) * DevRTT + β * |RTT - SRTT|

RTO = μ * SRTT + ∂ * DevRTT

(On Linux, α = 0.125, β = 0.25, μ = 1, ∂ = 4. These are the algorithm's "well-tuned parameters": nobody knows why, it just works...) This final algorithm is what today's TCP uses (see the Linux source: tcp_rtt_estimator).
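Here is a sketch of that estimator with the Linux constants quoted above. This illustrates the formulas, not the actual tcp_rtt_estimator code; seeding SRTT and DevRTT from the first sample is a common convention, not something the text specifies:

```python
# A sketch of the Jacobson/Karels estimator with the constants
# quoted in the text (alpha=0.125, beta=0.25, mu=1, phi=4).
def jacobson_karels(rtt_samples, srtt=None, devrtt=0.0,
                    alpha=0.125, beta=0.25, mu=1.0, phi=4.0):
    """Fold RTT samples (seconds) into (srtt, devrtt, rto)."""
    for rtt in rtt_samples:
        if srtt is None:            # first measurement seeds the estimator
            srtt, devrtt = rtt, rtt / 2
            continue
        srtt = srtt + alpha * (rtt - srtt)                     # smoothed RTT
        devrtt = (1 - beta) * devrtt + beta * abs(rtt - srtt)  # deviation
    rto = mu * srtt + phi * devrtt
    return srtt, devrtt, rto

# On a steady network the deviation shrinks and RTO approaches the true RTT.
srtt, devrtt, rto = jacobson_karels([0.1] * 50)
```

Because the deviation term feeds the RTO directly, a sudden RTT swing widens the timeout immediately instead of being averaged away.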
TCP Sliding Window

Let me be blunt: if you do not understand the TCP sliding window, you do not really understand the TCP protocol. We all know that TCP must solve reliable transmission and out-of-order packets, so TCP must know the network's actual processing bandwidth or data-processing speed, so that it does not cause network congestion and packet loss. TCP therefore introduced some techniques and designs for flow control, and the sliding window is one of them. As we said earlier, the TCP header has a field called Window, also called the Advertised Window, with which the receiver tells the sender how much buffer space it has left for receiving data. The sender can then send no more than the receiver can handle, instead of overwhelming it. To explain the sliding window, we need to look at some of the data structures around the TCP buffers. In the figure above, on the receiver side, LastByteRead points to the position up to which the application has read from the TCP buffer, NextByteExpected points to the end of the longest run of contiguously received bytes, and LastByteRcvd points to the last byte received; as you can see, some data in between has not yet arrived, leaving holes. On the sender side, LastByteAcked points to the last byte acknowledged by the receiver (confirmed delivered), LastByteSent points to the last byte sent out, for which the ACK may still be outstanding, and LastByteWritten points to where the application above is writing.
So, when the receiver sends back an ACK, it reports its window:

AdvertisedWindow = MaxRcvBuffer - LastByteRcvd - 1

and the sender limits the amount of data in flight to this window, ensuring the receiver can cope.
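As a sketch, here is the window arithmetic in Python. The helper uses the fuller textbook form, in which only the bytes received but not yet read by the application occupy the buffer; the function and names are ours, mirroring the figure:

```python
# A sketch of the receiver's advertised-window arithmetic.
def advertised_window(max_rcv_buffer, last_byte_rcvd, last_byte_read):
    """Free receive-buffer space the receiver can advertise.

    The article's one-liner is MaxRcvBuffer - LastByteRcvd - 1; the
    fuller textbook form, sketched here, subtracts only the bytes that
    are received but not yet consumed by the application.
    """
    buffered = last_byte_rcvd - last_byte_read   # bytes held in the buffer
    return max_rcv_buffer - buffered

# 12000 bytes received, application has read up to byte 8000:
win = advertised_window(max_rcv_buffer=65535,
                        last_byte_rcvd=12000,
                        last_byte_read=8000)
```

If the application stops reading, `buffered` grows and the advertised window shrinks toward zero, which is exactly the Zero Window situation discussed below.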
Now let's look at a sketch of the sender's sliding window:
(image source)
The figure above is divided into four parts (the black box is the sliding window):

#1 Data already sent and acknowledged by ACK.

#2 Data sent but not yet acknowledged.

#3 Data within the window but not yet sent (the receiver still has room for it).

#4 Data outside the window (the receiver has no room for it).
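The four categories can be sketched as a small classifier over the sender's pointers. The function and the sample offsets are illustrative, chosen to match an ACK covering everything up to byte 36:

```python
# A sketch of classifying sender-side bytes into the four categories
# of the diagram, given the pointers described in the text.
def categorize(byte_off, last_byte_acked, last_byte_sent, window):
    """Return which region (1-4) of the sender's window a byte is in."""
    if byte_off <= last_byte_acked:
        return 1            # sent and ACKed
    if byte_off <= last_byte_sent:
        return 2            # sent, ACK still outstanding
    if byte_off <= last_byte_acked + window:
        return 3            # within window, not yet sent
    return 4                # outside window, must wait

# ACKed up to byte 36, sent up to byte 45, 15-byte window:
regions = [categorize(b, 36, 45, 15) for b in (30, 40, 50, 60)]
```

When a new ACK arrives, last_byte_acked advances and the whole window slides right, converting some #4 bytes into #3.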
Below is a diagram of the window sliding (an ACK covering byte 36 arrives, and bytes 46-51 are sent). Next, let's look at a diagram of the receiver throttling the sender:
(image source)

Zero Window
Above, we can see how a slow server grinds the TCP sliding window down to 0. At this point you will surely ask: what does TCP do once the window becomes 0? Does the sender stop sending data? Yes, the sender stops sending; you can picture the window as closed. Then you will surely ask: if the sender stops sending, how is it notified when window space opens up on the receiver again?
To solve this, TCP uses the Zero Window Probe technique, abbreviated ZWP: the sender sends ZWP packets to the receiver to make it ACK its window size. Generally the probe is retried 3 times, roughly every 30-60 seconds (depending on the implementation). If after 3 tries the window is still 0, some TCP implementations send an RST to tear down the connection.
Note: wherever there is waiting, a DDoS attack can occur, and the zero window is no exception. Some attackers establish a connection, issue an HTTP GET, then set their window to 0, so the server can only sit there sending ZWPs; the attacker issues a large number of such requests concurrently and exhausts the server's resources. (For this attack, see Wikipedia's sockstress entry.)
Also, in Wireshark you can filter packets with tcp.analysis.zero_window, then use Follow TCP Stream in the right-click menu to see the ZeroWindowProbe and ZeroWindowProbeAck packets.

Silly Window Syndrome
Silly Window Syndrome translates into Chinese as "糊涂窗口综合症" (confused window syndrome). As you saw above, if the receiver is too busy to drain its receive window, the sender's window shrinks and shrinks. Eventually, if the receiver frees up a few bytes and advertises a window of just those few bytes, our sender will ship them without hesitation. But remember, the TCP+IP headers cost 40 bytes; shipping them for a few bytes of payload is hopelessly uneconomical.

Also, you should know about the network's MTU. For Ethernet the MTU is 1500 bytes; subtract the 40 bytes of TCP+IP headers and 1460 bytes remain for actual data. That is the so-called MSS (Max Segment Size). Note that TCP's RFC-defined default MSS is 536, because RFC 791 says any IP device must be able to receive packets of at least 576 bytes (576 was in fact the MTU of dial-up networks). If your packets fill the MTU, you use the full bandwidth; if not, you waste it. (Packets larger than the MTU meet one of two fates: they are dropped outright, or fragmented and re-sent.) Picture the MTU as the capacity of an airplane: a full plane is maximally efficient, while a plane carrying one passenger has sky-high per-seat costs. Silly Window Syndrome is flying a 200-seat plane with one or two passengers aboard.

The cure is also straightforward: avoid responding with small window sizes until the window is large enough to be worth advertising, and this can be implemented on both the sender and the receiver ends. If the problem is on the receiver end, David D. Clark's scheme is used: when the received data would make the window smaller than a certain value, the receiver ACKs a window of 0, closing the window and stopping the sender from sending more.
Once the receiver has processed enough data that the window size is >= MSS, or half of the receive buffer is empty, it reopens the window and lets the sender transmit again. If the problem is on the sender end, the famous Nagle's algorithm is used. Its idea is also to delay: it has two main conditions (for more, see the tcp_nagle_check function): 1) wait until window size >= MSS or data size >= MSS; 2) wait for a timeout of about 200 ms. As soon as either condition is satisfied, it sends the data; otherwise it keeps accumulating it.
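The two Nagle conditions above can be sketched as follows. The MSS and 200 ms timer are the values quoted in the text; the function name and shape are ours, not the kernel's tcp_nagle_check:

```python
# A sketch of the two send conditions described for Nagle's algorithm.
MSS = 1460           # bytes, the Ethernet-derived MSS from the text
NAGLE_TIMEOUT = 0.2  # seconds, the ~200 ms timer from the text

def nagle_should_send(data_size, window_size, waited_s):
    """Decide whether buffered data may go out now."""
    if data_size >= MSS or window_size >= MSS:
        return True          # a full-sized segment is possible: send
    if waited_s >= NAGLE_TIMEOUT:
        return True          # timer expired: flush the small dribble
    return False             # keep accumulating small writes
```

For an interactive session, a 5-byte keystroke with a wide-open window goes out immediately (window >= MSS), but a 5-byte write into a tiny window waits up to 200 ms, which is precisely the latency that bothers telnet and ssh users.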
Also, Nagle's algorithm is on by default, so programs that need small packets with low latency, such as highly interactive ones like telnet or ssh, need to turn it off. You can set the TCP_NODELAY socket option to disable the algorithm:

setsockopt(sock_fd, IPPROTO_TCP, TCP_NODELAY, (char *)&value, sizeof(int));
Additionally, some articles online claim the TCP_CORK socket option also disables Nagle's algorithm. That is not quite accurate. TCP_CORK forbids sending small packets; it does not forbid sending packets in general, but rather holds back the flood of small ones. It is best not to set both options together. Honestly, I think Nagle's algorithm just adds latency and little else; in my view it is best to turn it off and control the data from your own application layer, rather than leave everything to kernel algorithms.

TCP Congestion Handling
As we saw above, TCP uses the sliding window for flow control, but TCP does not consider that enough, because the sliding window depends only on the sender and receiver at the connection's two ends and knows nothing about what happens in the network in between. TCP's designers felt that merely being great and awesome was not enough, because flow control sits at layer 4 of the network model, and TCP should be smarter about the network as a whole.

Concretely: we know TCP samples the RTT with a timer and computes the RTO. But if delays on the network suddenly spike, TCP's only answer is retransmission, and retransmission adds load to the network, which produces even larger delays and more packet loss, so the situation spirals into a vicious circle that keeps amplifying itself. Imagine thousands of TCP connections in one network all behaving this way: a "network storm" soon forms, and the TCP protocol drags down the entire network. That would be a disaster.

Therefore TCP cannot ignore what is happening on the network and mindlessly retransmit, doing ever greater damage. TCP's design philosophy here is: TCP is not a selfish protocol; when congestion occurs, it sacrifices itself. Like a traffic jam, every car should yield rather than grab the road. For the paper on congestion control, see "Congestion Avoidance and Control" (PDF).

Congestion control has four main algorithms: 1) slow start, 2) congestion avoidance, 3) handling when congestion occurs, 4) fast recovery. These four were not all invented in one day; their development took a long time, and they are still being optimized today.

Note: in 1988, TCP Tahoe introduced 1) slow start, 2) congestion avoidance, and 3) fast retransmit when congestion occurs; in 1990, TCP Reno built on Tahoe and added 4) fast recovery.

The Slow Start Algorithm
First, let's look at TCP's slow start. Slow start means that a connection which has just joined the network should speed up little by little, not barge in and hog the road like those privileged cars. A new driver merging onto the highway had better start slow rather than disrupt the traffic already flowing in order. The slow start algorithm is as follows (cwnd is short for Congestion Window):
1) At connection start, initialize cwnd = 1, meaning one MSS worth of data may be transmitted.

2) Whenever an ACK is received, cwnd++; this rises linearly within an RTT.

3) Whenever an RTT elapses, cwnd = cwnd * 2; so the rise is exponential per round trip.

4) There is also an ssthresh (slow start threshold), an upper bound: when cwnd >= ssthresh, the connection enters the "congestion avoidance algorithm" (discussed below).
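The doubling phase can be sketched in a few lines (cwnd counted in MSS units; the ssthresh value in the example is arbitrary):

```python
# A sketch of slow start: cwnd doubles each RTT until it reaches ssthresh.
def slow_start(ssthresh, init_cwnd=1):
    """Return cwnd at each RTT round until the congestion-avoidance handoff."""
    cwnd = init_cwnd
    rounds = [cwnd]
    while cwnd < ssthresh:
        cwnd *= 2                 # one cwnd++ per ACK ~ doubling per RTT
        rounds.append(cwnd)
    return rounds

growth = slow_start(ssthresh=16)
```

Four round trips take cwnd from 1 to 16 segments, which shows why slow start, on a fast network, is not slow at all.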
So, we can see that on a fast network, ACKs return quickly and the RTT is short, and then this slow start is not slow at all. The following figure illustrates the process. Here I should mention a Google paper, "An Argument for Increasing TCP's Initial Congestion Window". Linux 3.0 adopted this paper's recommendation and initializes cwnd to 10 MSS. Before Linux 3.0 (for example in 2.6), Linux followed RFC 3390, where cwnd varies with the MSS: if MSS < 1095 then cwnd = 4; if MSS > 2190 then cwnd = 2; otherwise cwnd = 3.

The Congestion Avoidance Algorithm
As mentioned earlier, there is an ssthresh (slow start threshold), an upper bound: when cwnd >= ssthresh, the connection enters the congestion avoidance algorithm. Generally the initial value of ssthresh is 65535 bytes. Once cwnd reaches that value, the algorithm is:
1) When an ACK is received, cwnd = cwnd + 1/cwnd.

2) When each RTT elapses, cwnd = cwnd + 1.
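The additive increase above can be simulated per RTT. Assuming one ACK per in-flight segment is a simplification (real stacks see ACKs for byte ranges, delayed ACKs, and so on), but it shows how the per-ACK rule sums to roughly +1 MSS per round trip:

```python
# A sketch of congestion avoidance: cwnd += 1/cwnd per ACK, which adds
# up to roughly one MSS per RTT once a full window of ACKs has arrived.
def congestion_avoidance_rtt(cwnd):
    """Advance cwnd by one RTT's worth of ACKs (one ACK per segment)."""
    acks = int(cwnd)             # roughly one ACK per in-flight segment
    for _ in range(acks):
        cwnd += 1 / cwnd
    return cwnd

cwnd = 16.0
for _ in range(4):               # four RTT rounds of linear growth
    cwnd = congestion_avoidance_rtt(cwnd)
```

Compare this with the doubling of slow start: from 16 segments, four RTTs of congestion avoidance add only about 4 segments instead of multiplying cwnd by 16.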
In this way cwnd keeps growing without congesting the network, slowly probing upward toward the network's optimal value.

The Algorithm When Congestion Occurs
As we said before, there are two situations in which packet loss is detected:
1) The RTO times out and the packet is retransmitted. TCP considers this situation very bad and reacts strongly: ssthresh = cwnd / 2; cwnd is reset to 1; enter the slow start process.

2) Fast retransmit: on receiving 3 duplicate ACKs, retransmit immediately without waiting for the RTO to expire. The TCP Tahoe implementation reacts the same way as on an RTO timeout. The TCP Reno implementation instead does: cwnd = cwnd / 2; ssthresh = cwnd; enter the fast recovery algorithm (Fast Recovery).
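The two reactions can be sketched side by side (cwnd in MSS units; per the text, Tahoe's reaction to 3 duplicate ACKs is the same as the timeout case):

```python
# A sketch of the two loss reactions described above.
def on_rto_timeout(cwnd):
    """RTO timeout (and Tahoe's dup-ACK case): hard reset to slow start."""
    ssthresh = cwnd / 2
    return ssthresh, 1                  # (new ssthresh, new cwnd)

def on_three_dup_acks_reno(cwnd):
    """Reno fast retransmit: halve cwnd and enter fast recovery."""
    cwnd = cwnd / 2
    ssthresh = cwnd
    return ssthresh, cwnd

ss_timeout, cwnd_timeout = on_rto_timeout(32)
ss_reno, cwnd_reno = on_three_dup_acks_reno(32)
```

The asymmetry is the whole point: a timeout throws away the window entirely, while Reno's dup-ACK path only halves it, betting that the network is still moving packets.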
Above we can see that on an RTO timeout, ssthresh becomes half of cwnd. This means that if loss occurs while cwnd <= ssthresh, TCP's ssthresh is cut in half, and then when cwnd climbs back up exponentially to that point, growth turns into a slow linear increase. We can see how TCP, through this kind of violent oscillation, quickly yet carefully finds the balance point of the traffic.

The Fast Recovery Algorithm
TCP Reno: this algorithm is defined in RFC 5681. Fast retransmit and fast recovery are usually used together. The reasoning behind fast recovery is: if I am still receiving 3 duplicate ACKs, the network cannot be that bad, so there is no need to react as strongly as to an RTO timeout. Note that, as mentioned above, cwnd and ssthresh have already been adjusted before entering fast recovery: cwnd = cwnd / 2; ssthresh = cwnd.
Then the real fast recovery algorithm goes as follows: cwnd = ssthresh + 3 * MSS (the 3 acknowledges that 3 packets have been received and buffered); retransmit the packet indicated by the duplicate ACKs; if another duplicate ACK arrives, cwnd = cwnd + 1; if a new ACK arrives, cwnd = ssthresh, and enter the congestion avoidance algorithm.
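A sketch of those steps, with cwnd counted in segments (MSS = 1 unit). The inflation by one per extra duplicate ACK and the deflation on the new ACK follow the enumeration above:

```python
# A sketch of Reno fast recovery as enumerated in the text.
MSS = 1   # count cwnd in segments for simplicity

def fast_recovery(cwnd_at_loss, extra_dup_acks):
    """Return (inflated cwnd during recovery, cwnd after the new ACK)."""
    cwnd = cwnd_at_loss / 2          # already halved on the 3 dup ACKs
    ssthresh = cwnd
    cwnd = ssthresh + 3 * MSS        # inflate for the 3 buffered segments
    cwnd += extra_dup_acks * MSS     # each further dup ACK inflates by one
    cwnd_after_new_ack = ssthresh    # deflate; back to congestion avoidance
    return cwnd, cwnd_after_new_ack

inflated, resumed = fast_recovery(cwnd_at_loss=20, extra_dup_acks=2)
```

The temporary inflation exists because each duplicate ACK proves a segment has left the network, so the sender may put another one in; the deflation on the new ACK snaps cwnd back to the halved value.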
If you think about the algorithm above, you will spot its problem: it depends on those 3 duplicate ACKs. But note that 3 duplicate ACKs do not mean only one packet was lost; quite possibly many were. Yet this algorithm retransmits only one of them, and the rest have to wait for an RTO timeout. Then we enter nightmare mode: each timeout halves the rate once, and multiple timeouts push TCP's transmission rate down a cascade of drops while never triggering fast recovery again. Generally, as we said earlier, SACK or D-SACK lets fast recovery, i.e. the sender, make smarter decisions, but not all TCP implementations support SACK (it is needed on both sides), so a SACK-free solution is also required. The congestion control algorithm that uses SACK is FACK (covered later).

TCP New Reno

So in 1995 the TCP New Reno algorithm was proposed (see RFC 6582) to improve fast recovery without SACK support. When the sender receives 3 duplicate ACKs, it enters fast retransmit mode and retransmits the packet the repeated ACKs point at. If only that one packet was lost, the ACK returned after retransmitting it will acknowledge all the data the sender has transmitted so far. If not, it means multiple packets were lost; we call such an ACK a partial ACK. Once the sender sees a partial ACK appear, it can deduce that multiple packets were lost, and it continues retransmitting the first unacknowledged packet in the sliding window, not truly ending the fast recovery process until no more partial ACKs arrive.
As you can see, this variant of "fast recovery" is an aggressive play, and it extends both the fast retransmit and the fast recovery process.

Algorithm Diagram
Here, let's look at a simple diagram that shows all of the algorithms above at once:

The FACK Algorithm
FACK's full name is the Forward Acknowledgment algorithm; the paper is "Forward Acknowledgement: Refining TCP Congestion Control" (PDF). This algorithm builds on SACK. We said earlier that SACK uses a TCP extension field to report which data has been received and which has not; its advantage over fast retransmit's 3 duplicate ACKs is that the latter only reveals that some packet was lost, not whether it was one packet or several, whereas SACK lets the sender know exactly which packets are missing. So SACK lets the sender retransmit only the lost packets during recovery rather than resending everything; but done that way, if many packets need retransmitting, an already busy network gets even busier. FACK is therefore used to perform congestion flow control during the retransmission process.

FACK stores the largest sequence number seen in any SACK in a variable called snd.fack, which is updated by arriving ACKs; if the network is entirely well-behaved it equals snd.una (snd.una marks the first not-yet-ACKed byte, i.e. the start of category #2 in the sliding window diagram above). FACK then defines awnd = snd.nxt - snd.fack (snd.nxt points to the position the sender's sliding window is currently sending from, i.e. the start of category #3 in the diagram), so awnd estimates the data in flight on the network (in the paper's words: the "actual quantity of data outstanding in the network"). If retransmission is needed, then awnd = snd.nxt - snd.fack + retran_data; that is, awnd is outstanding data plus retransmitted data. The condition for triggering fast recovery is: ((snd.fack - snd.una) > (3 * MSS)) || (dupacks == 3). This way, there is no need to wait for 3 duplicate ACKs before retransmitting: as soon as the largest SACKed byte runs more than 3 MSS ahead of the cumulatively ACKed byte, retransmission is triggered. Throughout the retransmission process, cwnd is kept unchanged.
This continues until snd.una catches up with the snd.nxt recorded at the first loss (that is, all the retransmitted data has been acknowledged); the sender then enters the congestion avoidance mechanism, and cwnd rises linearly.
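The FACK bookkeeping and trigger condition can be sketched directly from the formulas above (variable names follow the text; MSS is the Ethernet-derived value, and the sequence numbers in the example are made up):

```python
# A sketch of the FACK bookkeeping described in the text.
MSS = 1460

def awnd(snd_nxt, snd_fack, retran_data=0):
    """Estimated bytes actually outstanding in the network."""
    return snd_nxt - snd_fack + retran_data

def fack_triggers_recovery(snd_fack, snd_una, dupacks):
    """Either a large SACK hole or 3 dup ACKs starts fast recovery."""
    return (snd_fack - snd_una) > 3 * MSS or dupacks == 3

# A single large SACK hole can trigger recovery before 3 dup ACKs arrive:
early = fack_triggers_recovery(snd_fack=10000, snd_una=1000, dupacks=1)
```

This is what makes FACK more aggressive than plain Reno: one SACK that jumps far ahead of snd.una is enough evidence of loss, with no need to count duplicates.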
We can see that without FACK, when many packets are lost, the more conservative algorithm underestimates the window size it should use, and recovery takes several RTTs to complete, whereas FACK handles it more aggressively. However, FACK has a big problem on networks where packets get reordered.

Other Congestion Control Algorithms

The TCP Vegas Congestion Control Algorithm
This algorithm was proposed in 1994 and mainly makes some modifications to TCP Reno. By monitoring the RTT very closely it computes a baseline RTT, then uses that baseline to estimate the network's current actual bandwidth; if the actual bandwidth comes out smaller or larger than the expected bandwidth, it begins to decrease or increase cwnd linearly. If the measured RTT exceeds the timeout, it retransmits immediately without waiting for the ACK to time out. (Vegas's core idea is to use RTT values, rather than packet loss, to drive the congestion window.) The paper for this algorithm is "TCP Vegas: End to End Congestion Avoidance on a Global Internet", which also gives a comparison of Vegas and New Reno. For the implementation, see the Linux source: /net/ipv4/tcp_vegas.h, /net/ipv4/tcp_vegas.c.

The HSTCP (High Speed TCP) Algorithm
This algorithm comes from RFC 3649 (Wikipedia entry). It changes the most basic algorithm so that the congestion window rises quickly and falls slowly. Congestion avoidance window growth: cwnd = cwnd + α(cwnd) / cwnd. Window reduction after packet loss: cwnd = (1 - β(cwnd)) * cwnd.
Note: α(cwnd) and β(cwnd) are functions; if you want them to behave like standard TCP, let α(cwnd) = 1 and β(cwnd) = 0.5. The values of α(cwnd) and β(cwnd) change dynamically. For the implementation, see the Linux source: /net/ipv4/tcp_highspeed.c.
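A sketch of the two update rules. With the degenerate parameters a = 1 and b = 0.5 they reduce to standard TCP behaviour; real HSTCP looks α(cwnd) and β(cwnd) up from the RFC 3649 table, which is omitted here:

```python
# A sketch of the HSTCP update rules quoted in the text.
def hstcp_ack(cwnd, a=1.0):
    """Growth on each ACK: cwnd + a(cwnd)/cwnd."""
    return cwnd + a / cwnd

def hstcp_loss(cwnd, b=0.5):
    """Multiplicative decrease on loss: (1 - b(cwnd)) * cwnd."""
    return (1 - b) * cwnd

# One ACK, then one loss, starting from 100 segments:
cwnd = hstcp_loss(hstcp_ack(100.0))
```

The whole point of HSTCP is that at large cwnd the table makes a grow and b shrink, so huge windows climb faster and back off more gently than standard TCP's 1 and 0.5.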
The TCP BIC Algorithm

The BIC algorithm was born in 2004. You can still find the news coverage; Google: "US scientists develop BIC-TCP protocol, 6,000 times the speed of DSL". BIC's full name is Binary Increase Congestion control; it is the default congestion control algorithm in Linux 2.6.8. The creators of BIC observed that all these congestion control algorithms are trying to find a suitable cwnd, and BIC-TCP's advocates saw through to the essence of the matter: this is really a search process, so the BIC algorithm mainly uses binary search to do it. For the implementation, see the Linux source: /net/ipv4/tcp_bic.c.

The TCP Westwood Algorithm
Westwood uses the same slow start and congestion avoidance algorithms as Reno. Westwood's main improvement is bandwidth estimation at the sender: when packet loss is detected, the congestion window and slow start threshold are set according to the estimated bandwidth. So how does this algorithm measure bandwidth? It measures once per RTT, and the formula is simple: how many bytes were successfully ACKed within that RTT. Because this bandwidth, like the RTT used to compute the RTO, must be smoothed from individual samples into a stable value, it also uses a weighted moving average formula. Also, we know that if a network's bandwidth is X bytes per second, and the RTT is the time for one send to be acknowledged, then X * RTT should be our buffer size. So in this algorithm, ssthresh = EstBW * min-RTT (the minimum observed RTT). If the loss is signalled by duplicate ACKs, then cwnd = ssthresh (if cwnd > ssthresh). If it is caused by an RTO, cwnd = 1 and slow start begins. For the implementation, see the Linux source: /net/ipv4/tcp_westwood.c.
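A sketch of the bandwidth-driven reaction described above. The per-RTT measurement is shown raw (the smoothing step is omitted), and all the numbers are illustrative:

```python
# A sketch of Westwood's reaction to loss signalled by duplicate ACKs:
# set ssthresh from the bandwidth estimate instead of blindly halving.
def bandwidth_estimate(bytes_acked, rtt_s):
    """Bytes successfully ACKed in one RTT, as bytes per second."""
    return bytes_acked / rtt_s

def westwood_on_dup_ack_loss(cwnd, est_bw, min_rtt_s, mss=1460):
    ssthresh = (est_bw * min_rtt_s) / mss   # estimated pipe size, in segments
    if cwnd > ssthresh:
        cwnd = ssthresh
    return ssthresh, cwnd

bw = bandwidth_estimate(bytes_acked=1460000, rtt_s=1.0)
ssthresh, cwnd = westwood_on_dup_ack_loss(cwnd=600, est_bw=bw, min_rtt_s=0.5)
```

The design choice here is that a random (e.g. wireless) loss should not halve the rate if the measured bandwidth says the pipe is still wide, which is where Westwood beats Reno on lossy links.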
Others

For more algorithms, you can find leads in Wikipedia's "TCP congestion avoidance algorithm" entry.

Postscript
Well, I think this is where I'll stop. TCP has developed to the point where several books could be written about the topics touched on here. The main purpose of this article is to walk you into these classic, foundational techniques and ideas. I hope it helps you understand TCP, and I also hope it gives you the interest and confidence to start learning this basic, low-level knowledge.
Of course, TCP involves far too much; different people may understand it differently, and this article may well contain some sloppy or even wrong statements. I welcome your feedback and criticism.
(End of full text)