TCP: those things (part 2)

Source: Internet
Author: User
Tags benchmark rfc

This is the second part of the article. If you are not familiar with TCP, please read the previous part first; there we introduced the TCP header, the state machine, and data retransmission. But TCP also has to solve a very big problem: it must dynamically adjust its own sending rate according to conditions in the network. On a small scale this keeps its own connection stable; on a large scale it keeps the whole network stable. Before reading on, be prepared: this part contains a number of algorithms and strategies that may set off all kinds of thinking and make your brain allocate a lot of memory and computing resources, so it is not suitable for reading in the toilet.

RTT Algorithm for TCP

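From the TCP retransmission mechanism covered earlier, we know that the timeout setting is very important for retransmission:

    • If it is set too long, retransmission is slow: a lost packet is resent only after a long wait, so efficiency and performance are poor;
    • If it is set too short, packets may be resent before they are actually lost. Resending too eagerly adds to network congestion, which causes more timeouts, and more timeouts cause more retransmissions.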

Moreover, the right timeout differs from one network environment to another, so it cannot be hard-coded; it can only be set dynamically. To do that, TCP introduces the RTT (Round Trip Time), the time from when a packet is sent until its ACK comes back. Knowing roughly how long a round trip takes, the sender can conveniently set the timeout, the RTO (Retransmission TimeOut), and make the retransmission mechanism more efficient. It sounds simple: when the sender sends a packet it records t0, when the ACK comes back it records t1, and RTT = t1 - t0. But it is not that simple, because this is just one sample and does not represent the general situation.

Classic Algorithms

The classic algorithm defined in RFC 793 works like this:

1) First, sample the RTT and record the values of the most recent samples.

2) Then compute a smoothed value, SRTT (Smoothed RTT). The formula is below, where α is between 0.8 and 0.9. This method is called an exponentially weighted moving average.

    SRTT = (α * SRTT) + ((1 - α) * RTT)

3) Then compute the RTO. The formula is as follows:

    RTO = min[UBOUND, max[LBOUND, (β * SRTT)]]

where:

    • UBOUND is the maximum timeout, the upper bound
    • LBOUND is the minimum timeout, the lower bound
    • β is generally between 1.3 and 2.0.
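
The two RFC 793 formulas above can be sketched in a few lines of C. This is only an illustration: the function names are made up for this article, and the values used in the note afterwards (α = 0.875, β = 2, bounds of 1 second and 60 seconds) are assumptions picked from the ranges mentioned above, not taken from any real stack.

```c
/* Hypothetical helpers for illustration; all times are in milliseconds. */

/* SRTT = (alpha * SRTT) + ((1 - alpha) * RTT) */
double classic_srtt(double srtt, double rtt_sample, double alpha)
{
    return alpha * srtt + (1.0 - alpha) * rtt_sample;
}

/* RTO = min[UBOUND, max[LBOUND, (beta * SRTT)]] */
double classic_rto(double srtt, double beta, double lbound, double ubound)
{
    double rto = beta * srtt;
    if (rto < lbound)
        rto = lbound;   /* never go below the lower bound */
    if (rto > ubound)
        rto = ubound;   /* never go above the upper bound */
    return rto;
}
```

For example, with α = 0.875, an SRTT of 100 ms and a new sample of 180 ms give a new SRTT of 110 ms; the sample moves the average only a little, which is exactly the smoothing effect.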

Karn/Partridge algorithm

But the algorithm above runs into a fundamental problem when a retransmission happens: do you take the RTT sample from the time of the first transmission to the ACK, or from the time of the retransmission to the ACK? Whichever you pick, fixing one problem causes another. There are two cases:

    • Case (a): the ACK never comes back, so the packet is retransmitted. If you measure from the first transmission to the ACK, the sample is clearly too large.
    • Case (b): the ACK comes back slowly, and the packet is retransmitted just before the ACK for the first transmission arrives. If you measure from the retransmission to the ACK, the sample is too small.

So in 1987 the Karn/Partridge algorithm was proposed. Its most important feature is that it ignores retransmissions: no RTT sample is taken from a retransmitted packet (you see, there is no need to solve a problem that no longer exists). But this introduces a serious bug: if at some moment the network jitters and suddenly slows down, the larger delay causes every packet to be retransmitted (because the RTO is still small). Since retransmissions are not sampled, the RTO is never updated, which is a disaster. So the Karn algorithm uses a trick: whenever a retransmission occurs, the current RTO is doubled (this is the so-called exponential backoff).
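
A minimal sketch of the two Karn/Partridge rules (skip samples from retransmitted packets, double the RTO on every retransmission) might look like this. The struct, the function names, the smoothing constants (α = 0.875, β = 2) and the cap passed as max_rto are all assumptions made for the example.

```c
#include <stdbool.h>

/* Hypothetical per-connection timer state; times are in milliseconds. */
struct karn_timer {
    double srtt;   /* smoothed RTT */
    double rto;    /* current retransmission timeout */
};

/* Called when an ACK arrives for a segment. */
void karn_on_ack(struct karn_timer *t, double rtt_sample, bool was_retransmitted)
{
    if (was_retransmitted)
        return;   /* ambiguous sample: do not use it at all */
    t->srtt = 0.875 * t->srtt + 0.125 * rtt_sample;   /* assumed alpha */
    t->rto = 2.0 * t->srtt;                           /* assumed beta */
}

/* Called when the retransmission timer fires. */
void karn_on_timeout(struct karn_timer *t, double max_rto)
{
    t->rto *= 2.0;            /* exponential backoff */
    if (t->rto > max_rto)
        t->rto = max_rto;     /* assumed upper cap */
}
```

The backoff is what keeps the RTO moving while samples are being discarded; once a packet gets through without retransmission, normal sampling takes over again.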

Jacobson/Karels algorithm

Both algorithms above use a weighted moving average. Its biggest problem is that a large fluctuation in the RTT is hard to notice because it gets smoothed away. So in 1988 a new algorithm was proposed, the Jacobson/Karels algorithm (see RFC 6298). It brings the gap between the latest RTT sample and the smoothed SRTT into the calculation. The formulas are as follows (DevRTT means deviation RTT):

    SRTT = SRTT + α * (RTT - SRTT)
    DevRTT = (1 - β) * DevRTT + β * |RTT - SRTT|
    RTO = µ * SRTT + ∂ * DevRTT

Under Linux, α = 0.125, β = 0.25, µ = 1, ∂ = 4. These are the algorithm's "well-tuned parameters"; nobody knows why, it just works. This final algorithm is the one used in today's TCP (in the Linux source code, see tcp_rtt_estimator).
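
Here is a small sketch of one Jacobson/Karels update step with the Linux constants quoted above. The struct and function names are invented for the example, and one detail is an assumption: the deviation is computed against the SRTT from before the update (the ordering RFC 6298 uses), since the formulas as written do not say which SRTT is meant. This is not the kernel's tcp_rtt_estimator, which works in scaled integers.

```c
/* Hypothetical estimator state; all times are in milliseconds. */
struct rtt_estimator {
    double srtt;     /* smoothed RTT */
    double devrtt;   /* smoothed deviation */
    double rto;      /* resulting timeout */
};

void rtt_estimator_update(struct rtt_estimator *e, double rtt_sample)
{
    const double alpha = 0.125, beta = 0.25, mu = 1.0, delta = 4.0;
    double err = rtt_sample - e->srtt;          /* uses the old SRTT (assumption) */
    double abs_err = err < 0.0 ? -err : err;

    e->srtt = e->srtt + alpha * err;                         /* SRTT   */
    e->devrtt = (1.0 - beta) * e->devrtt + beta * abs_err;   /* DevRTT */
    e->rto = mu * e->srtt + delta * e->devrtt;               /* RTO    */
}
```

Starting from SRTT = 100 and DevRTT = 10, a sample of 140 gives SRTT = 105, DevRTT = 17.5 and RTO = 175: a single outlier raises the timeout noticeably through the deviation term even though the average barely moves.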

TCP Sliding window

It has to be said: if you do not understand TCP's sliding window, you do not really understand TCP. As we all know, TCP must solve reliable transmission and out-of-order packets, so TCP has to know the actual processing bandwidth or processing speed of the network in order not to cause congestion and packet loss. TCP therefore introduces several techniques and designs for flow control, and the sliding window is one of them. As mentioned earlier, the TCP header has a field called Window, also known as the Advertised Window, with which the receiver tells the sender how much buffer space it still has for receiving data. The sender can then send according to the receiver's processing capacity, without sending more than the receiver can handle. To explain the sliding window, we first need to look at some of the data structures of the TCP buffers:

    • On the receiving side, LastByteRead points to the position up to which the TCP buffer has been read, NextByteExpected points to the end of the contiguous data received so far, and LastByteRcvd points to the last byte received. Some data in between has not arrived yet, which leaves a gap.
    • On the sending side, LastByteAcked points to the position up to which the receiver has acknowledged data (that is, delivery is confirmed), LastByteSent points to data that has been sent but not yet acknowledged, and LastByteWritten points to where the upper-layer application is currently writing.

    • The receiver reports AdvertisedWindow = MaxRcvBuffer - LastByteRcvd - 1 to the sender in the ACK it returns;
    • The sender limits how much data it sends according to this window, to make sure the receiver can handle it.

Let's take a look at the sender's sliding window:

The data is divided into four parts (the black box in the figure is the sliding window):

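    • #1 Data that has been sent and acknowledged by an ACK.
    • #2 Data that has been sent but not yet acknowledged.
    • #3 Data inside the window that has not been sent yet (the receiver has room for it).
    • #4 Data outside the window (the receiver has no room for it).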
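
The four categories can be expressed as a simple classification of a byte's sequence number. This is a sketch under stated assumptions: the struct and names are hypothetical, the window is taken to start right after the last acknowledged byte, and sequence-number wraparound is ignored.

```c
#include <stdint.h>

enum send_category {
    SENT_AND_ACKED     = 1,   /* #1 */
    SENT_NOT_ACKED     = 2,   /* #2 */
    IN_WINDOW_NOT_SENT = 3,   /* #3 */
    OUTSIDE_WINDOW     = 4    /* #4 */
};

/* Hypothetical sender-side view of the window. */
struct send_window {
    uint32_t last_byte_acked;   /* highest byte acknowledged so far */
    uint32_t last_byte_sent;    /* highest byte sent so far */
    uint32_t window;            /* window advertised by the receiver, in bytes */
};

enum send_category classify_byte(const struct send_window *w, uint32_t seq)
{
    if (seq <= w->last_byte_acked)
        return SENT_AND_ACKED;
    if (seq <= w->last_byte_sent)
        return SENT_NOT_ACKED;
    if (seq <= w->last_byte_acked + w->window)
        return IN_WINDOW_NOT_SENT;   /* the window covers #2 and #3 */
    return OUTSIDE_WINDOW;
}
```

With made-up numbers (bytes up to 31 acknowledged, bytes up to 45 sent, a 20-byte window), byte 46 falls in #3 and byte 52 in #4. When an ACK moves last_byte_acked forward, the same arithmetic slides the window to the right.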

Below is the window after it slides (the ACK for 36 has been received, and bytes 46-51 have been sent):

Now let's look at an illustration of the receiver controlling the sender:

Zero Window

In the figure above, we can see how a slow server reduces the TCP sliding window to 0. At this point you will surely ask: if the window becomes 0, what happens to TCP? Does the sender stop sending data? Yes, the sender stops sending; you can think of it as "window closed". Then you will also ask: if the sender has stopped sending and a moment later the receiver has window space available again, how is the sender notified?

To solve this, TCP uses the Zero Window Probe technique, abbreviated ZWP: the sender sends ZWP packets to the receiver so that the receiver ACKs its current window size. Typically 3 probes are sent, each roughly 30-60 seconds apart (this depends on the implementation). If the window is still 0 after 3 probes, some TCP implementations send an RST to break the connection.

Note: wherever there is waiting, a DDoS attack is possible, and Zero Window is no exception. Some attackers establish an HTTP connection and then set the window to 0, so the server can only wait and send ZWP probes. The attacker then issues a large number of such requests to exhaust the server's resources. (For this attack, see Wikipedia's Sockstress entry.)

In addition, in Wireshark you can use tcp.analysis.zero_window to filter these packets, and then use Follow TCP Stream in the right-click menu to see the ZeroWindowProbe and ZeroWindowProbeAck packets.

Silly Window Syndrome

Silly Window Syndrome is sometimes translated as "confused window syndrome". As you saw above, if the receiver is too busy to take data out of the receive window, the sender's window gets smaller and smaller. In the end, if the receiver frees up a few bytes and tells the sender that a window of a few bytes is now available, the sender will send those few bytes without hesitation. Keep in mind that the TCP+IP headers take 40 bytes; paying that much overhead for a few bytes of data is very uneconomical.

In addition, you need to know that networks have an MTU. For Ethernet the MTU is 1500 bytes; subtracting the 40 bytes of TCP+IP headers leaves 1460 bytes for actual data, which is the so-called MSS (Max Segment Size). Note that the RFCs define the default TCP MSS as 536, because RFC 791 says that every IP device must be able to accept datagrams of at least 576 bytes (576 is in fact the MTU of dial-up networks). If your packets fill the MTU, you use the full bandwidth; if they do not, you waste bandwidth. (A packet larger than the MTU has two possible fates: it is dropped outright, or it is split up, repackaged, and then sent.) You can think of the MTU as the maximum number of passengers a plane can carry: a full plane is the most efficient, and a plane carrying a single passenger obviously costs far more per person. So Silly Window Syndrome is like flying a 200-seat plane with only one or two people on board. Solving the problem is not hard: avoid responding to small window sizes, and respond only once the window is large enough. This idea can be implemented on both the sender and the receiver side.
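
The arithmetic in this paragraph is easy to check with a tiny sketch. The function names are made up for the example, and the 40-byte figure assumes TCP and IP headers without options, as in the text.

```c
#define TCP_IP_HEADER_BYTES 40   /* 20 bytes IP + 20 bytes TCP, no options */

/* MSS that fits into a given MTU. */
int mss_from_mtu(int mtu)
{
    return mtu - TCP_IP_HEADER_BYTES;
}

/* Share of the bytes on the wire that is payload, in percent (rounded down). */
int payload_percent(int payload_bytes)
{
    return payload_bytes * 100 / (payload_bytes + TCP_IP_HEADER_BYTES);
}
```

An MTU of 1500 gives an MSS of 1460, and 576 gives 536. A full 1460-byte segment is about 97% payload, while a 4-byte segment is about 9% payload: that is the plane flying almost empty.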

    • If the problem is caused by the receiver, David D. Clark's scheme is used. On the receiver side, if incoming data leaves the window size below a certain value, the receiver can directly ACK a window of 0 to the sender, which closes the window and stops the sender from sending more data. Once the receiver has processed some data and the window size is at least one MSS, or half of the receiver buffer is empty, it can open the window again and let the sender send.
    • If the problem is caused by the sender, the famous Nagle's algorithm is used. This algorithm is also based on delaying, and it has two main conditions (for more conditions, see the tcp_nagle_check function): 1) wait until Window Size >= MSS or Data Size >= MSS; 2) wait until a timeout of about 200 ms expires. As soon as either condition is met it sends the data; otherwise it keeps accumulating data.
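
Both ideas boil down to a small yes/no decision, sketched below. The function names and parameters are hypothetical, and the sender-side check implements only the two conditions exactly as this article states them; real implementations such as tcp_nagle_check also look at whether previously sent data is still unacknowledged.

```c
#include <stdbool.h>

/* Receiver side (Clark's scheme): which window to advertise, in bytes.
 * Assumption: the window stays closed until at least one MSS is free
 * or at least half of the receive buffer is empty. */
int clark_advertised_window(int free_bytes, int buffer_bytes, int mss)
{
    if (free_bytes >= mss || free_bytes * 2 >= buffer_bytes)
        return free_bytes;
    return 0;   /* keep the window closed */
}

/* Sender side (Nagle, as described in this article): send now or keep
 * accumulating? waited_ms is how long the queued data has been held back. */
bool nagle_should_send(int window_bytes, int queued_bytes, int mss, int waited_ms)
{
    if (window_bytes >= mss || queued_bytes >= mss)
        return true;    /* condition 1 */
    if (waited_ms >= 200)
        return true;    /* condition 2: timeout */
    return false;       /* keep accumulating */
}
```

Either side alone is enough to stop tiny segments: the receiver never advertises a tiny window, or the sender refuses to fill one right away.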

In addition, the Nagle algorithm is turned on by default, so programs that need to send small packets, such as highly interactive programs like telnet or ssh, need to turn it off. You can turn it off by setting the TCP_NODELAY option on the socket:

setsockopt(sock_fd, IPPROTO_TCP, TCP_NODELAY, (char *)&value, sizeof(int));

In addition, some articles on the Internet say that the TCP_CORK socket option also turns off the Nagle algorithm, which is not accurate. TCP_CORK forbids sending small packets altogether, whereas the Nagle algorithm does not forbid small packets; it only forbids sending large numbers of them. It is best not to set both options. Frankly, I think the Nagle algorithm merely adds a delay and nothing else; in my view it is best to turn it off and let the application layer control its own data, rather than depend on kernel algorithms for everything.

TCP congestion handling (Congestion Handling)

As we saw above, TCP uses the sliding window for flow control, but TCP does not consider that enough, because the sliding window depends only on the sender and receiver of a connection and knows nothing about what is happening in the network in between. TCP's designers felt that a great protocol cannot stop at flow control, because flow control concerns only layer 4 and above of the network model; TCP should also be smart enough to sense the state of the whole network. Specifically, we know that TCP samples the RTT with a timer and computes the RTO. But if latency on the network suddenly increases, TCP's only response is to retransmit data, and retransmission puts an even heavier load on the network, which leads to more delay and more packet loss. The situation then enters a vicious circle and keeps being amplified. Imagine thousands of TCP connections in one network all behaving like this: a "network storm" forms immediately, and TCP drags down the entire network. That is a disaster.

Therefore TCP cannot ignore what is happening on the network and keep blindly retransmitting data, doing the network even more harm. The design philosophy is that TCP is not a selfish protocol: when congestion occurs, it must sacrifice itself. It is like a traffic jam, where every car should give way instead of fighting for the road. For the paper on congestion control, see "Congestion Avoidance and Control" (PDF). Congestion control consists mainly of four algorithms: 1) slow start, 2) congestion avoidance, 3) reacting when congestion occurs, 4) fast recovery. These four algorithms were not created in a day; they took a long time to develop and are still being optimized today. Note:

    • 1988: TCP Tahoe proposed 1) slow start, 2) congestion avoidance, 3) fast retransmit when congestion occurs
    • 1990: TCP Reno added 4) fast recovery on top of Tahoe

Slow start algorithm (Slow Start)

First, let's look at TCP's slow start. Slow start means that a connection that has just joined the network should speed up little by little, rather than barging in and hogging the road like a privileged car. A new driver entering the highway should take it slowly and not disrupt the order already established there. The slow start algorithm is as follows (cwnd stands for Congestion Window):

1) When the connection is established, initialize cwnd = 1, meaning one MSS-sized segment can be sent.

2) Each time an ACK is received, cwnd++; this is linear growth per ACK.

3) Each time an RTT passes, cwnd = cwnd * 2; this is exponential growth per round trip.

4) There is also ssthresh (slow start threshold), an upper limit: when cwnd >= ssthresh, TCP enters the congestion avoidance algorithm (described later).
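
Steps 2) and 3) describe the same growth from two angles: adding 1 per ACK doubles the window once per RTT, because a window of cwnd segments produces cwnd ACKs. A minimal sketch, with hypothetical function names and cwnd and ssthresh counted in MSS-sized segments:

```c
/* Step 2: every ACK adds one segment to the window. */
unsigned slow_start_on_ack(unsigned cwnd)
{
    return cwnd + 1;
}

/* Step 3: within one RTT every segment of the current window is acknowledged,
 * so the window doubles, unless ssthresh is reached first (step 4). */
unsigned slow_start_one_rtt(unsigned cwnd, unsigned ssthresh)
{
    unsigned acks = cwnd;   /* one ACK per segment in flight (simplification) */
    for (unsigned i = 0; i < acks && cwnd < ssthresh; i++)
        cwnd = slow_start_on_ack(cwnd);
    return cwnd;
}
```

Starting from cwnd = 1, successive round trips give 2, 4, 8, and so on, until the window reaches ssthresh.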

So we can see that if the network is fast, ACKs come back quickly and the RTT is short, and then this slow start is not slow at all. One thing worth mentioning here is the Google paper "An Argument for Increasing TCP's Initial Congestion Window". Linux 3.0 adopted the paper's recommendation and initializes cwnd to 10 MSS. Before Linux 3.0, for example in 2.6, Linux followed RFC 3390, where cwnd depends on the MSS: if MSS <= 1095, cwnd = 4; if MSS >= 2190, cwnd = 2; otherwise cwnd = 3.
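
For reference, RFC 3390 states the initial window in bytes as min(4 * MSS, max(2 * MSS, 4380)). A small sketch of that formula (the function name is made up for the example):

```c
/* Initial congestion window in bytes per RFC 3390:
 * min(4 * MSS, max(2 * MSS, 4380 bytes)). */
unsigned rfc3390_initial_window_bytes(unsigned mss)
{
    unsigned lower = 2 * mss > 4380 ? 2 * mss : 4380;   /* max(2*MSS, 4380) */
    unsigned upper = 4 * mss;
    return upper < lower ? upper : lower;               /* min(...) */
}
```

For a small MSS such as 536 this gives 4 segments (2144 bytes); for the common Ethernet MSS of 1460 it gives 4380 bytes, which is 3 segments; for a large MSS such as 4000 it gives 2 segments (8000 bytes).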

Congestion avoidance algorithm (Congestion Avoidance)

As mentioned earlier, there is also ssthresh (slow start threshold), an upper limit: when cwnd >= ssthresh, TCP enters the congestion avoidance algorithm. In general ssthresh is 65535 bytes. Once cwnd reaches this value, the algorithm is as follows:

1) Each time an ACK is received, cwnd = cwnd + 1/cwnd

2) Each time an RTT passes, cwnd = cwnd + 1

In this way, TCP avoids growing so fast that it congests the network, and slowly adjusts toward the network's optimal value.
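
The per-ACK rule and the per-RTT rule are again two views of the same thing, as the following sketch suggests. The function names are hypothetical and cwnd is kept as a fractional number of segments purely for illustration.

```c
/* Rule 1: every ACK adds 1/cwnd of a segment. */
double cong_avoid_on_ack(double cwnd)
{
    return cwnd + 1.0 / cwnd;
}

/* Rule 2, derived from rule 1: one RTT delivers about cwnd ACKs,
 * so the window grows by roughly one segment per round trip. */
double cong_avoid_one_rtt(double cwnd)
{
    int acks = (int)cwnd;   /* one ACK per segment in flight (simplification) */
    for (int i = 0; i < acks; i++)
        cwnd = cong_avoid_on_ack(cwnd);
    return cwnd;
}
```

Because cwnd already grows a little within the round, the increase per RTT is slightly less than one full segment, which is close enough to the "cwnd = cwnd + 1" rule.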

Algorithm when congestion occurs

As we said earlier, when packets are lost there are two cases:

1) The RTO expires and the packet is retransmitted. TCP considers this situation very bad and reacts strongly:

    • ssthresh = cwnd / 2
    • cwnd is reset to 1
    • Enter the slow start process

2) Fast retransmit: retransmission starts upon receiving 3 duplicate ACKs, without waiting for the RTO to expire.

    • TCP Tahoe handles this the same way as an RTO timeout.
    • TCP Reno does the following:
      • cwnd = cwnd / 2
      • ssthresh = cwnd
      • Enter the fast recovery algorithm (Fast Recovery)
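
The three reactions listed above fit in a few lines. This is a sketch with hypothetical names, with cwnd and ssthresh counted in MSS-sized segments; real stacks add further safeguards that are left out here.

```c
/* Hypothetical congestion state, in MSS-sized segments. */
struct cc_state {
    unsigned cwnd;
    unsigned ssthresh;
    int in_fast_recovery;
};

/* Case 1: the RTO expired. */
void on_rto_timeout(struct cc_state *s)
{
    s->ssthresh = s->cwnd / 2;
    s->cwnd = 1;                /* back to slow start */
    s->in_fast_recovery = 0;
}

/* Case 2, TCP Tahoe: 3 duplicate ACKs are treated like a timeout. */
void tahoe_on_3_dupacks(struct cc_state *s)
{
    on_rto_timeout(s);
}

/* Case 2, TCP Reno: halve the window and go to fast recovery. */
void reno_on_3_dupacks(struct cc_state *s)
{
    s->cwnd = s->cwnd / 2;
    s->ssthresh = s->cwnd;
    s->in_fast_recovery = 1;
}
```

With cwnd = 20, a timeout leaves cwnd = 1 and ssthresh = 10, while Reno's reaction to 3 duplicate ACKs leaves cwnd = 10 and ssthresh = 10.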

Above we can see that after an RTO timeout, ssthresh becomes half of cwnd. This means that if the loss happens while cwnd <= ssthresh, TCP's ssthresh is halved; then, once cwnd has climbed quickly and exponentially back up to that point, it grows slowly and linearly. We can see how TCP uses these violent oscillations to find the balance point of network traffic both quickly and carefully.

Fast recovery algorithm (Fast Recovery)

TCP Reno

This algorithm is defined in RFC 5681. Fast retransmit and fast recovery are generally used together. The reasoning behind fast recovery is that if you still receive 3 duplicate ACKs, the network is not in such bad shape, so there is no need to react as strongly as for an RTO timeout. Note that, as mentioned earlier, cwnd and ssthresh have already been updated before fast recovery is entered:

    • cwnd = cwnd / 2
    • ssthresh = cwnd

Then the actual fast recovery algorithm is as follows:

    • cwnd = ssthresh + 3 * MSS (3 because 3 packets are confirmed to have been received)
    • Retransmit the packet indicated by the duplicate ACKs
    • If another duplicate ACK arrives, cwnd = cwnd + 1
    • If a new ACK arrives, cwnd = ssthresh, and TCP enters the congestion avoidance algorithm.
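
The window arithmetic of Reno's fast recovery can be sketched as three small event handlers. The names are hypothetical, cwnd and ssthresh are counted in MSS-sized segments, and the actual retransmission of the missing packet is left out.

```c
/* Hypothetical window state, in MSS-sized segments. */
struct fr_state {
    unsigned cwnd;
    unsigned ssthresh;
};

/* 3 duplicate ACKs arrived: update as described, then inflate by 3. */
void fast_recovery_enter(struct fr_state *s)
{
    s->cwnd = s->cwnd / 2;        /* update made before entering */
    s->ssthresh = s->cwnd;
    s->cwnd = s->ssthresh + 3;    /* 3 segments are known to have left the network */
}

/* Another duplicate ACK while recovering. */
void fast_recovery_on_dup_ack(struct fr_state *s)
{
    s->cwnd += 1;
}

/* A new ACK: deflate and continue with congestion avoidance. */
void fast_recovery_on_new_ack(struct fr_state *s)
{
    s->cwnd = s->ssthresh;
}
```

Starting from cwnd = 20, entering fast recovery gives ssthresh = 10 and cwnd = 13; one more duplicate ACK gives 14; the new ACK brings cwnd back to 10.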

If you think about the algorithm above, you will see it has a problem: it relies on 3 duplicate ACKs. Note that 3 duplicate ACKs do not mean that only one packet was lost; quite possibly many were. But this algorithm retransmits only one, and the remaining lost packets can only wait for the RTO to expire. That leads into nightmare mode: each timeout halves the window, several timeouts make TCP's transmission speed drop geometrically, and fast recovery is never triggered. Generally, as we said earlier, SACK or D-SACK lets fast recovery, or rather the sender, make smarter decisions, but not every TCP implementation supports SACK (SACK requires support on both sides), so a solution that works without SACK is needed. The algorithm that does congestion control using SACK is FACK (discussed later).

TCP New Reno

So in 1995 the TCP New Reno algorithm (see RFC 6582) was proposed, mainly to improve the fast recovery algorithm when SACK is not available:

    • When the sender receives 3 duplicate ACKs, it enters fast retransmit mode and retransmits the packet indicated by the duplicate ACKs. If only that one packet was lost, the ACK that comes back after the retransmission acknowledges all the data the sender has transmitted so far. If it does not, several packets were lost. We call such an ACK a partial ACK.
    • Once the sender sees a partial ACK, it can infer that multiple packets were lost, so it continues by retransmitting the first unacknowledged packet in the sliding window. Only when no more partial ACKs arrive does the fast recovery process really end.
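
The partial-ACK test can be sketched as a comparison against the point the sender had reached when recovery began. Everything named here is an assumption for the example: recover_end is taken to be the value of snd.nxt at the moment fast recovery started, ack_seq is the cumulative ACK number (next byte expected), and sequence-number wraparound is ignored.

```c
#include <stdint.h>

enum newreno_action {
    NEWRENO_EXIT_RECOVERY,     /* everything sent before recovery is acknowledged */
    NEWRENO_RETRANSMIT_NEXT    /* partial ACK: resend the first unacknowledged packet */
};

/* Hypothetical recovery state; cwnd and ssthresh in MSS-sized segments. */
struct newreno_state {
    uint32_t recover_end;   /* snd.nxt when fast recovery began (assumed bookkeeping) */
    unsigned cwnd;
    unsigned ssthresh;
    int in_recovery;
};

/* Called for an ACK that acknowledges new data during fast recovery. */
enum newreno_action newreno_on_new_ack(struct newreno_state *s, uint32_t ack_seq)
{
    if (ack_seq >= s->recover_end) {
        s->in_recovery = 0;
        s->cwnd = s->ssthresh;   /* deflate, continue with congestion avoidance */
        return NEWRENO_EXIT_RECOVERY;
    }
    return NEWRENO_RETRANSMIT_NEXT;   /* stay in fast recovery */
}
```

Compared with Reno, the only difference is that a new ACK does not automatically end recovery; it ends only when the ACK covers all the data that was outstanding when the loss was detected.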

As we can see, this change to fast recovery is a very aggressive move, and it lengthens the fast retransmit and fast recovery process.


Let's look at a simple diagram to see the various algorithms above:

FACK algorithm

FACK stands for Forward Acknowledgment. The paper is "Forward Acknowledgement: Refining TCP Congestion Control" (PDF). The algorithm is based on SACK. As we said before, SACK uses a TCP extension field to acknowledge which data has been received and which has not. Its advantage over fast retransmit with 3 duplicate ACKs is that the latter only knows that some packet was lost, not whether it was one or several, while SACK knows exactly which packets were lost. So SACK lets the sender retransmit all the lost packets during retransmission instead of one at a time. But if a lot of data is retransmitted this way, an already busy network gets even busier. FACK is therefore used to do congestion control during retransmission.

    • The algorithm stores the largest sequence number seen in SACK in the variable snd.fack, which is updated as ACKs arrive. If everything in the network is fine, it equals snd.una (snd.una is the first byte not yet acknowledged, that is, the first position of category #2 in the sliding window described earlier).
    • Then define awnd = snd.nxt - snd.fack (snd.nxt points to the position at which the sender is about to send, that is, the first position of category #3 in the sliding window diagram earlier). awnd thus means the data in the network. (awnd stands for: actual quantity of data outstanding in the network.)
    • If data needs to be retransmitted, then awnd = snd.nxt - snd.fack + retran_data; that is, awnd is the data sent out plus the retransmitted data.
    • The condition for triggering fast recovery is then: ((snd.fack - snd.una) > (3 * MSS)) || (dupacks == 3). This way there is no need to wait for 3 duplicate ACKs before retransmitting; retransmission is triggered as soon as the largest sequence number in SACK is far enough (more than 3 MSS) ahead of the acknowledged data. cwnd does not change during the whole retransmission process. Only when the snd.nxt recorded at the first loss is <= snd.una (that is, the retransmitted data has all been acknowledged) does TCP go into congestion avoidance, with cwnd rising linearly.
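
The two FACK quantities described in the list, awnd and the trigger condition, can be written down directly. The struct and function names are hypothetical, all values are in bytes, and sequence-number wraparound is ignored.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical sender state, in bytes. */
struct fack_state {
    uint32_t snd_una;       /* first byte not yet acknowledged */
    uint32_t snd_nxt;       /* next byte to be sent */
    uint32_t snd_fack;      /* largest sequence number reported by SACK */
    uint32_t retran_data;   /* retransmitted data still outstanding */
    unsigned dupacks;       /* duplicate ACKs seen so far */
};

/* awnd = snd.nxt - snd.fack + retran_data */
uint32_t fack_awnd(const struct fack_state *s)
{
    return s->snd_nxt - s->snd_fack + s->retran_data;
}

/* ((snd.fack - snd.una) > (3 * MSS)) || (dupacks == 3) */
bool fack_should_recover(const struct fack_state *s, uint32_t mss)
{
    return (s->snd_fack - s->snd_una) > 3 * mss || s->dupacks == 3;
}
```

With snd.una = 1000, snd.fack = 8000 and an MSS of 1460, the gap of 7000 bytes exceeds 3 * MSS = 4380, so recovery starts after a single duplicate ACK instead of waiting for three.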

We can see that without FACK, when many packets are lost, the original conservative algorithm underestimates the window size that is needed and takes several RTTs to complete recovery, whereas FACK goes about it more aggressively. However, FACK runs into serious trouble if packets get reordered in the network.

Introduction to other congestion control algorithms

TCP Vegas Congestion Control algorithm

This algorithm was proposed in 1994 and mainly modifies TCP Reno. It computes a base RTT by monitoring the RTT very closely, and then uses this base RTT to estimate the actual bandwidth of the current network. If the actual bandwidth is lower or higher than the expected bandwidth, it starts to decrease or increase cwnd linearly. If the computed RTT exceeds the timeout, it retransmits directly without waiting for the ACK to time out. (The core idea of Vegas is to drive the congestion window by the RTT rather than by packet loss.) The paper is "TCP Vegas: End to End Congestion Avoidance on a Global Internet", which also compares Vegas with New Reno. For an implementation, see the Linux source code: /net/ipv4/tcp_vegas.h, /net/ipv4/tcp_vegas.c
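
A rough sketch of the Vegas idea follows. It is based on the way Vegas is usually described (comparing the expected rate cwnd / BaseRTT with the actual rate cwnd / RTT), not on anything spelled out in this article, and it is not the Linux implementation. The function names and the two thresholds alpha and beta are hypothetical parameters supplied by the caller.

```c
/* How many segments' worth of data is sitting in queues:
 * (expected rate - actual rate) * base RTT. */
double vegas_diff(double cwnd, double base_rtt, double rtt)
{
    double expected = cwnd / base_rtt;   /* rate if nothing were queued */
    double actual = cwnd / rtt;          /* rate actually achieved */
    return (expected - actual) * base_rtt;
}

/* Once per RTT: nudge cwnd linearly up or down. */
double vegas_adjust(double cwnd, double base_rtt, double rtt,
                    double alpha, double beta)
{
    double diff = vegas_diff(cwnd, base_rtt, rtt);
    if (diff < alpha)
        return cwnd + 1.0;   /* little queuing: bandwidth to spare */
    if (diff > beta)
        return cwnd - 1.0;   /* too much queuing: back off before loss */
    return cwnd;             /* in the target band: hold */
}
```

The point of the sketch is that the window reacts to a rising RTT, so the sender can slow down before any packet has actually been dropped.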

HSTCP (High Speed TCP) algorithm

This algorithm comes from RFC 3649 (Wikipedia entry). It changes the most basic algorithm so that the congestion window rises faster and falls more slowly. Specifically:

    • Window growth during congestion avoidance: cwnd = cwnd + α(cwnd) / cwnd
    • Window decrease after packet loss: cwnd = (1 - β(cwnd)) * cwnd

Note: α(cwnd) and β(cwnd) are both functions. To behave like standard TCP, set α(cwnd) = 1 and β(cwnd) = 0.5. The values of α(cwnd) and β(cwnd) change dynamically. For an implementation, see the Linux source code: /net/ipv4/tcp_highspeed.c
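
The two update rules can be sketched with α and β passed in as functions. The names are hypothetical, and only the standard-TCP choice (α = 1, β = 0.5) is shown; the actual HSTCP response functions from RFC 3649 are not reproduced here.

```c
/* alpha(cwnd) and beta(cwnd) are supplied as function pointers. */
typedef double (*hstcp_fn)(double cwnd);

/* Standard TCP behaviour: alpha(cwnd) = 1, beta(cwnd) = 0.5. */
double standard_alpha(double cwnd) { (void)cwnd; return 1.0; }
double standard_beta(double cwnd)  { (void)cwnd; return 0.5; }

/* Congestion avoidance: cwnd = cwnd + alpha(cwnd) / cwnd */
double hstcp_on_ack(double cwnd, hstcp_fn alpha)
{
    return cwnd + alpha(cwnd) / cwnd;
}

/* After a loss: cwnd = (1 - beta(cwnd)) * cwnd */
double hstcp_on_loss(double cwnd, hstcp_fn beta)
{
    return (1.0 - beta(cwnd)) * cwnd;
}
```

Plugging in a larger α and a smaller β for large windows is what makes the window rise faster and fall more slowly.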

TCP BIC algorithm

The BIC algorithm was created in 2004. You can still find (via Google) the related news story "US scientists develop BIC-TCP protocol, 6,000 times faster than DSL". BIC stands for Binary Increase Congestion control, and it was the default congestion control algorithm in Linux 2.6.8. BIC's inventors observed that all these congestion control algorithms are trying to find a suitable cwnd (Congestion Window), and they saw through to the essence of the matter: it is really a search process. So BIC mainly uses binary search to do this. For an implementation, see the Linux source code: /net/ipv4/tcp_bic.c
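
Only the binary search idea is sketched below; this is not the real BIC algorithm, which has additional phases that the article does not describe. The struct and function names are made up, and windows are counted in MSS-sized segments.

```c
/* Hypothetical search range: the largest window known to work (w_min)
 * and the smallest window known to cause loss (w_max). */
struct bic_range {
    unsigned w_min;
    unsigned w_max;
};

/* Next window to try: the midpoint of the range. */
unsigned bic_next_cwnd(const struct bic_range *r)
{
    return r->w_min + (r->w_max - r->w_min) / 2;
}

/* The tried window worked: search the upper half next. */
void bic_on_no_loss(struct bic_range *r, unsigned cwnd)
{
    r->w_min = cwnd;
}

/* The tried window caused loss: search the lower half next. */
void bic_on_loss(struct bic_range *r, unsigned cwnd)
{
    r->w_max = cwnd;
}
```

Each round trip halves the range, so the window closes in on the sustainable value in a logarithmic number of steps.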

TCP Westwood algorithm

Westwood uses the same slow start and congestion avoidance algorithms as Reno. Its main improvement is that it estimates bandwidth on the sender side, and when packet loss is detected it sets the congestion window and the slow start threshold according to that bandwidth. So how does the algorithm measure bandwidth? It measures once per RTT, and the formula is simple: how many bytes were successfully acknowledged during that RTT. Just as when the RTO is computed from the RTT, the bandwidth samples need to be smoothed into a single value, again with a weighted moving average.

In addition, we know that if a network's bandwidth allows X bytes to be sent per second, and the RTT is the time needed for sent data to be acknowledged, then X * RTT should be our buffer size. So in this algorithm ssthresh = est_BD * min-RTT (the smallest RTT value). If the loss is signaled by duplicate ACKs and cwnd > ssthresh, then cwnd = ssthresh. If it is caused by an RTO, then cwnd = 1 and TCP enters slow start. For an implementation, see the Linux source code: /net/ipv4/tcp_westwood.c
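
The Westwood bookkeeping described above can be sketched as follows. The names, the units (bytes and milliseconds) and the smoothing weight passed as gain are assumptions for the example; this is not the code in tcp_westwood.c.

```c
/* Hypothetical estimator state. */
struct westwood {
    double bwe;       /* smoothed bandwidth estimate, bytes per millisecond */
    double min_rtt;   /* smallest RTT seen so far, milliseconds */
};

/* Once per RTT: sample = bytes acknowledged during this RTT / RTT,
 * smoothed with a weighted moving average. */
void westwood_sample(struct westwood *w, double acked_bytes, double rtt_ms,
                     double gain)
{
    double sample = acked_bytes / rtt_ms;
    w->bwe = (1.0 - gain) * w->bwe + gain * sample;
    if (rtt_ms < w->min_rtt)
        w->min_rtt = rtt_ms;
}

/* ssthresh = estimated bandwidth * minimum RTT, in bytes. */
double westwood_ssthresh(const struct westwood *w)
{
    return w->bwe * w->min_rtt;
}

/* Loss signaled by duplicate ACKs: cap cwnd (bytes) at ssthresh. */
double westwood_cwnd_after_dupacks(const struct westwood *w, double cwnd)
{
    double ssthresh = westwood_ssthresh(w);
    return cwnd > ssthresh ? ssthresh : cwnd;
}
```

Unlike Reno's blind halving, the window after a loss is tied to what the path was actually delivering, which matters on links where losses are not caused by congestion.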


For more algorithms, you can find pointers in Wikipedia's TCP congestion avoidance algorithm entry.


Well, I think I can stop here. TCP has evolved to this day, and several books could be written about what is in it. The main purpose of this article is to introduce you to these classic basic techniques and this knowledge. I hope it helps you understand TCP, and I also hope it gives you the interest and confidence to learn more of this fundamental, low-level knowledge.

Of course, there is a great deal to TCP, and different people may understand it differently. This article may also contain some careless statements or even mistakes, so I welcome your feedback and criticism.
