The Things About TCP (Part 2)
This is the second article in the series. If you are not familiar with TCP, please read the previous article, "The Things About TCP (Part 1)", where we introduced the TCP protocol header, the state machine, and data retransmission. This article covers a major problem TCP still has to solve: dynamically adjusting the sending rate to suit different network conditions. Done well at the small scale, this makes each connection more stable; done well at the large scale, it makes the whole network more stable. Be prepared before reading on: this article contains algorithms and strategies that may trigger all kinds of thoughts and make your brain allocate a lot of memory and computing resources. It is not suitable for reading in the restroom.
TCP's RTT Algorithm
From the TCP retransmission mechanism in the previous article, we know that the timeout setting is crucial to retransmission:
- If it is set too long, retransmission is slow; a lost packet may sit for ages before being resent, which is inefficient and hurts performance.
- If it is set too short, packets that were not actually lost may be resent. Retransmitting too eagerly increases network congestion, which leads to more timeouts, and more timeouts lead to even more retransmissions.
Moreover, no fixed timeout value works across different networks; it can only be set dynamically. For this, TCP introduces the RTT (Round Trip Time): the time from when a packet is sent until its acknowledgment comes back. Knowing this time, the sender can set a sensible RTO (Retransmission TimeOut) and make the retransmission mechanism more efficient. It sounds simple: record T0 when the sender sends the packet, record T1 when the receiver's ACK comes back, and RTT = T1 - T0. But it is not that simple; that is only one sample and cannot represent the general case.
Classic Algorithms
The classic algorithm, defined in RFC 793, works as follows:
1) First, sample the RTT, keeping track of the values from the last several round trips.
2) Then compute a smoothed RTT, called SRTT. The formula is below (α is between 0.8 and 0.9; this method is called an exponentially weighted moving average in English, or simply a weighted moving average):
SRTT = (α * SRTT) + ((1 - α) * RTT)
3) Then compute the RTO. The formula is as follows:
RTO = min[UBOUND, max[LBOUND, (β * SRTT)]]
Where:
- UBOUND is the maximum timeout, the upper bound;
- LBOUND is the minimum timeout, the lower bound;
- β is generally between 1.3 and 2.0.
Karn/Partridge Algorithm
However, the above algorithm runs into a fundamental problem when a retransmission occurs: do you take the RTT sample from the time the data was first sent to the time the ACK came back, or from the time of the retransmission to the time the ACK came back?
Whichever one you choose, there is a problem. The two cases are:
- Case (a): the ACK never came back, so the packet was retransmitted. If you compute the RTT from the first transmission to the ACK, the sample is far too large.
- Case (b): the ACK was merely slow, which triggered a retransmission, but the original ACK arrived shortly after the retransmission. If you compute the RTT from the retransmission to the ACK, the sample is far too small.
So in 1987 an algorithm called the Karn/Partridge algorithm was devised. Its defining feature is simple: ignore retransmissions, and never take RTT samples from retransmitted packets. (You see, the cleanest fix is to not create the ambiguous sample in the first place.)
However, this leads to another big bug: if at some moment the network flaps and suddenly slows down, producing much larger delays, every outstanding packet gets retransmitted (because the old RTO is too small); and since retransmissions are never sampled, the RTO is never updated. That would be a disaster. So the Karn algorithm adds a clever trick: whenever a retransmission occurs, double the current RTO value (this is the famous exponential backoff). Obviously, such a rigid rule is not good enough for a protocol that needs an accurate RTT estimate.
Jacobson/Karels Algorithm
Both of the previous algorithms use a weighted moving average. The biggest problem with that is: if the RTT fluctuates widely, the fluctuation is smoothed away and hard to detect. So in 1988 another new algorithm appeared, the Jacobson/Karels algorithm (see RFC 6298). It factors in the gap between the latest RTT sample and the smoothed SRTT. The formulas are as follows (DevRTT means Deviation RTT):
SRTT = SRTT + α * (RTT - SRTT) -- compute the smoothed RTT
DevRTT = (1 - β) * DevRTT + β * |RTT - SRTT| -- compute the deviation between the smoothed RTT and the latest sample (also a weighted moving average)
RTO = µ * SRTT + ∂ * DevRTT -- the god-like formula
(In Linux, α = 0.125, β = 0.25, μ = 1, and ∂ = 4. These are the "well-tuned parameters" of the algorithm; nobody knows exactly why, it just works...) This final algorithm is the one used in today's TCP (in the Linux source, see tcp_rtt_estimator).
TCP Sliding Window
If you do not understand the TCP sliding window, you do not understand TCP. We all know that TCP must solve the problems of reliable transmission and packet reordering. Beyond that, TCP must know the network's actual processing bandwidth, or data processing speed, so that it does not trigger congestion and packet loss.
So TCP introduced several techniques and designs for flow control, and the sliding window is one of them. As mentioned in the previous article, the TCP header has a field called Window, also known as the Advertised Window. The receiver uses this field to tell the sender how much buffer space it still has to receive data. The sender can then send no more data than the receiver can handle, instead of overwhelming it. To illustrate the sliding window, we first need to look at some data structures in the TCP buffers:
From the figure, we can see:
- On the receiving side: LastByteRead points to the position last read by the application from the TCP buffer; NextByteExpected points to the end of the contiguous data received so far; LastByteRcvd points to the last byte received. As you can see, some segments have not yet arrived, leaving gaps in the buffer.
- On the sending side: LastByteAcked points to the last byte acknowledged by the receiver (meaning it arrived successfully); LastByteSent points to the last byte sent, for which an ACK may not yet have arrived; LastByteWritten points to where the upper-layer application is writing.
So:
- In the ACKs it sends back, the receiver reports AdvertisedWindow = MaxRcvBuffer - LastByteRcvd - 1 to the sender;
- The sender limits the amount of data it sends according to this window, ensuring the receiver can handle it.
Next let's take a look at the sender's sliding window:
It is divided into four parts (the black box is the sliding window):
- #1: data that has been sent and acknowledged (ACK received).
- #2: data that has been sent but not yet acknowledged.
- #3: data within the window that has not yet been sent (the receiver still has space for it).
- #4: data outside the window (the receiver has no space for it).
Below is a diagram of the window sliding (an ACK for byte 36 is received, and bytes 46-51 are sent):
The following figure shows a receiver controlling the sender:
Zero Window
The figure shows how a slow server (receiver) reduces the client's (sender's) TCP sliding window to 0. At this point you will surely ask: what happens when the window reaches 0? Does the sender stop sending data? Yes, it stops, as if the window were closed. Then you will ask: if the sender stops sending, how does it find out when the receiver has window space available again?
To solve this problem, TCP uses Zero Window Probe (ZWP) technology: after the window drops to 0, the sender sends ZWP packets to the receiver so that the receiver will ACK with its current window size. Generally the probe is retried 3 times, with an initial interval of roughly 30-60 seconds (implementations differ). If the window is still 0 after 3 probes, some TCP implementations send an RST and tear down the connection.
Note: wherever there is waiting, there is potential for DDoS attacks, and the zero window is no exception. An attacker can open an HTTP connection, send a GET request, and then set its window to 0; the server can do nothing but keep probing with ZWPs. By launching a large number of such requests concurrently, the attacker exhausts the server's resources. (See the sockstress entry on Wikipedia.)
In Wireshark, you can filter such packets with tcp.analysis.zero_window, then right-click and choose Follow TCP Stream to see the ZeroWindowProbe and ZeroWindowProbeAck packets.
Silly Window Syndrome
Silly Window Syndrome translates into Chinese as "confused window syndrome". As described above, if the receiver is too busy to drain the data in its receive window, the sender's usable window gets smaller and smaller. Eventually, the receiver frees up just a few bytes and advertises a window of a few bytes, and our sender will send those few bytes without hesitation.
Remember that our TCP/IP headers alone take 40 bytes. Incurring that much overhead to carry a few bytes of payload is terribly uneconomical.
You also need to know about the MTU. For Ethernet the MTU is 1500 bytes; subtracting the 40 bytes of TCP/IP headers leaves 1460 bytes for actual data, the so-called MSS (Max Segment Size). Note that TCP's RFCs define the default MSS as 536: RFC 791 says every IP device must be able to receive packets of at least 576 bytes (576 was, in effect, the MTU of dial-up networks), and 576 minus the 20-byte IP header and the 20-byte TCP header is 536.
If your packets can fill the MTU, you can use the full bandwidth; if not, you waste it. (A packet larger than the MTU meets one of two fates: it is dropped outright, or it is fragmented and resent.) You can think of the MTU as an airplane with a fixed number of seats: if every flight is full, the per-passenger cost is lowest; if a plane carries only one passenger, the cost undoubtedly rises, and carrying the same people takes more flights.
So, Silly Window Syndrome is like flying one or two passengers on a 200-seat plane. The fix is not hard: avoid responding with a small window size, and wait until the window is large enough before advertising it again. This idea can be implemented on either the sender side or the receiver side.
- If the problem is caused by the receiver, David D. Clark's solution is used. On the receiver side, if the incoming data would make the window size smaller than a certain threshold, the receiver simply ACKs with window 0. This closes the window and stops the sender from sending more data. Only after the receiver has processed enough data that the window size is at least one MSS, or the receive buffer is half empty, does it reopen the window and let the sender send again.
- If the problem is caused by the sender, the famous Nagle's algorithm is used. The idea of this algorithm is to delay and batch. It has two main conditions (for the full set, see the kernel's tcp_nagle_check function): 1) wait until window size >= MSS or data size >= MSS; 2) wait for a timeout of about 200 ms. If either condition is met, the data is sent; otherwise, data keeps accumulating.
Also note that Nagle's algorithm is enabled by default, so programs that need to send small packets promptly, such as highly interactive programs like telnet or ssh, must disable it. You can set the TCP_NODELAY socket option to turn the algorithm off (there is no global switch; it is disabled per socket, according to each application's characteristics):
setsockopt(sock_fd, IPPROTO_TCP, TCP_NODELAY, (char *)&value, sizeof(int));
In addition, some articles online claim that the TCP_CORK socket option also disables Nagle's algorithm. That is not accurate. TCP_CORK forbids sending small packets altogether, whereas Nagle's algorithm does not forbid them; it only avoids sending a large number of them. It is best not to set both options. Honestly, I think Nagle's algorithm does little more than add latency. In my view it is best to turn it off and let the application layer manage its own data batching; not everything should depend on kernel algorithms.
TCP Congestion Handling
As we saw above, TCP uses the sliding window for flow control, but TCP's designers thought that was not enough: the sliding window depends only on the sender and receiver, and knows nothing about what happens in the network between them. They believed that a great protocol could not stop at flow control, which is merely the job of layer 4 and above; to be smarter, TCP should also understand the network as a whole.
Specifically, we know that TCP samples the RTT with a timer and computes the RTO. But if network latency suddenly spikes, TCP's only answer is retransmission. Retransmission adds load to the network, which increases latency further and causes more packet loss, so the situation enters a vicious circle that keeps amplifying itself. Imagine tens of thousands of TCP connections in a network all behaving this way: a "network storm" forms immediately, and the TCP protocol itself drags down the whole network. This is a disaster.
So TCP cannot ignore what is happening in the network and blindly retransmit, making things worse. TCP's design philosophy here is: TCP is not a selfish protocol. When congestion occurs, everyone sacrifices a little. Like cars in a traffic jam, each car should yield the road, not fight for it.
For the classic paper on congestion control, see "Congestion Avoidance and Control" (PDF).
Congestion control consists mainly of four algorithms: 1) slow start, 2) congestion avoidance, 3) reaction to congestion, 4) fast recovery. These four algorithms were not invented in a day; their development spanned many years and continues to be optimized today. Note:
- In 1988, TCP-Tahoe proposed 1) slow start, 2) congestion avoidance, and 3) Fast retransmission when congestion occurs.
- In 1990, TCP Reno added 4) fast recovery based on Tahoe.
The Slow Start Algorithm
First, let's look at TCP's slow start. Slow start means that a connection that has just joined the network should accelerate gradually, not floor it like a privileged car hogging the road. A newcomer merging onto the highway should speed up slowly rather than disrupt the existing order.
The slow start algorithm is as follows (cwnd stands for Congestion Window):
1) When the connection starts, initialize cwnd = 1, meaning one MSS-sized segment may be sent.
2) Each time an ACK is received, cwnd++; a linear increase per ACK.
3) Each time an RTT elapses, cwnd = cwnd * 2; an exponential increase per round trip.
4) There is also an ssthresh (slow start threshold), an upper limit: once cwnd >= ssthresh, the connection enters the "congestion avoidance algorithm" (described below).
So you can see: on a fast network, ACKs return quickly and the RTT is short, and slow start is not slow at all. The following figure illustrates the process.
Here I must mention a Google paper, "An Argument for Increasing TCP's Initial Congestion Window". Linux 3.0 and later adopt this paper's recommendation and initialize cwnd to 10 MSS. Before 3.0 (e.g., in 2.6), Linux followed RFC 3390, where cwnd depends on the MSS value: if MSS < 1095, cwnd = 4; if MSS > 2190, cwnd = 2; otherwise cwnd = 3.
Congestion Avoidance
As mentioned above, ssthresh (slow start threshold) is an upper limit: when cwnd >= ssthresh, the connection enters the "congestion avoidance algorithm". Generally, the initial value of ssthresh is 65535 bytes. Once cwnd reaches it, the algorithm is as follows:
1) When an ACK is received, cwnd = cwnd + 1/cwnd
2) For every RTT passed, cwnd = cwnd + 1
In this way, congestion caused by overly rapid growth is avoided, and cwnd creeps up gradually toward the network's optimal value. It is clearly a linear, additive growth algorithm.
The Algorithm When Congestion Occurs
As mentioned earlier, packet loss is detected in two ways:
1) The RTO timer expires and the packet is retransmitted. TCP considers this situation very bad and reacts strongly:
- ssthresh = cwnd / 2
- cwnd is reset to 1
- re-enter the slow start process
2) The Fast Retransmit algorithm: when three duplicate ACKs are received, retransmit immediately instead of waiting for the RTO to expire.
- TCP Tahoe reacts the same way as on an RTO timeout.
- TCP Reno reacts as follows:
- cwnd = cwnd / 2
- ssthresh = cwnd
- enter the Fast Recovery algorithm
As you can see, after an RTO timeout, ssthresh becomes half of cwnd. This means that if loss occurs again while cwnd <= ssthresh, the threshold is halved once more; cwnd then climbs back up exponentially to that point and switches to gradual linear growth. Through these sharp oscillations, TCP quickly yet carefully finds the balance point of the network traffic.
The Fast Recovery Algorithm
TCP Reno
This algorithm is defined in RFC 5681. Fast retransmit and fast recovery are generally used together. Fast recovery assumes that receiving 3 duplicate ACKs means the network is not in terrible shape, so there is no need to react as strongly as after an RTO timeout. Note that, as described above, cwnd and ssthresh have already been adjusted before fast recovery begins:
- cwnd = cwnd / 2
- ssthresh = cwnd
Then, the real fast recovery algorithm is as follows:
- cwnd = ssthresh + 3 * MSS (the 3 accounts for the three packets confirmed received by the duplicate ACKs)
- retransmit the packet indicated by the duplicate ACKs
- if another duplicate ACK arrives, cwnd = cwnd + 1
- if an ACK for new data arrives, cwnd = ssthresh, then enter the congestion avoidance algorithm
If you think about the algorithm above, you will notice it still has a problem: it depends on exactly three duplicate ACKs. But three duplicate ACKs do not necessarily mean only one packet was lost; quite possibly many packets were lost. Yet the algorithm retransmits only one of them, leaving the rest to wait for the RTO to expire. That leads to nightmare mode: each timeout halves the threshold at once, and multiple timeouts make TCP's transmission rate decay geometrically, while fast recovery is never triggered again.
Generally, as mentioned in the previous article, SACK or D-SACK can make fast recovery, and the sender's decisions, smarter. But not all TCP implementations support SACK (it must be supported at both ends), so a SACK-free solution is also needed. An algorithm that does use SACK for congestion control is FACK (covered later).
TCP New Reno
So, in 1995, the TCP New Reno algorithm (see RFC 6582) was proposed to improve fast recovery in the absence of SACK support:
- When the sender receives 3 duplicate ACKs, it enters fast retransmit and retransmits the packet the duplicate ACKs point to. If only one packet was lost, the ACK that comes back after this retransmission will acknowledge all the data the sender has transmitted so far. If it does not, multiple packets were lost; we call such an ACK a partial ACK.
- Once the sender sees a partial ACK, it can infer that multiple packets were lost, and it continues retransmitting the first unacknowledged packet in the sliding window. The fast recovery process is not considered finished until partial ACKs stop arriving.
As you can see, this variant of fast recovery is a very aggressive play: it prolongs both the fast retransmit and the fast recovery phases.
Algorithm Illustration
Let's look at a simple diagram that shows the algorithms above:
Fack Algorithm
FACK stands for Forward Acknowledgment. The paper is "Forward Acknowledgement: Refining TCP Congestion Control" (PDF). This algorithm builds on SACK. As mentioned before, SACK uses a TCP extension field to report exactly which data has and has not been received. Its advantage over fast retransmit's 3 duplicate ACKs is that the latter only tells you that one or a few packets were lost, while SACK lets the sender know precisely which packets were lost. So with SACK, the sender can retransmit all the lost packets during recovery instead of one at a time. However, if the amount of lost data is large, retransmitting it all at once can itself flood the network, so FACK is used to control congestion during retransmission.
- The algorithm stores the largest sequence number reported by SACK in a variable SND.FACK, which is updated by incoming ACKs. If everything in the network is fine, SND.FACK is the same as SND.UNA (SND.UNA is the first byte for which no ACK has been received, i.e. the start of category #2 in the earlier sliding window diagram).
- Then define awnd = SND.NXT - SND.FACK (SND.NXT points to the next byte the sender will send, the start of category #3 in the earlier sliding window diagram). awnd means the amount of data in flight (awnd: actual quantity of data outstanding in the network).
- If data needs to be retransmitted, then awnd = SND.NXT - SND.FACK + retran_data; that is, awnd is the in-flight data plus the retransmitted data.
- The condition for triggering fast recovery becomes: ((SND.FACK - SND.UNA) > (3 * MSS)) || (dupacks == 3). This way there is no need to wait for three duplicate ACKs: as soon as the largest SACKed data is more than 3 MSS ahead of the ACKed data, retransmission is triggered. cwnd stays unchanged throughout the retransmission, until the SND.NXT recorded at the first loss satisfies SND.NXT <= SND.UNA (that is, all the retransmitted data has been acknowledged), and then the connection enters the congestion avoidance mechanism with linear cwnd growth.
We can see that without FACK, the original conservative algorithm underestimates the usable window when many packets are lost and needs several RTTs to recover; FACK does it more aggressively. However, FACK runs into big problems when the network reorders packets.
Introduction to other Congestion Control Algorithms
TCP Vegas Congestion Control Algorithm
This algorithm, proposed in 1994, mainly modifies TCP Reno. It monitors the RTT very heavily and computes a baseline RTT, then uses that baseline to estimate the network's current actual bandwidth. If the estimate is lower or higher than expected, it linearly decreases or increases the cwnd accordingly. If a computed RTT exceeds the timeout, it retransmits without waiting for the ACK timeout. (Vegas's core idea is to use the RTT value, rather than packet loss, to drive the congestion window.) The paper "TCP Vegas: End to End Congestion Avoidance on a Global Internet" compares Vegas with New Reno.
For the implementation of this algorithm, see the Linux source: /net/ipv4/tcp_vegas.h and /net/ipv4/tcp_vegas.c.
Hstcp (High Speed TCP) Algorithm
This algorithm comes from RFC 3649 (see the Wikipedia entry). It modifies the basic algorithm so that the congestion window grows quickly and shrinks slowly. Specifically:
- Window growth during congestion avoidance: cwnd = cwnd + α(cwnd) / cwnd
- Window reduction after packet loss: cwnd = (1 - β(cwnd)) * cwnd
Note that α(cwnd) and β(cwnd) are both functions; to make them behave like standard TCP, set α(cwnd) = 1 and β(cwnd) = 0.5. Their values vary dynamically with cwnd. For the implementation of this algorithm, see the Linux source: /net/ipv4/tcp_highspeed.c.
TCP BIC Algorithm
The BIC algorithm appeared in 2004. You can still find the related news coverage, "Google: US scientists develop BIC-TCP, six thousand times faster than DSL". BIC stands for Binary Increase Congestion control, and it was the default congestion control algorithm in Linux 2.6.8. BIC's inventors observed that all these congestion control algorithms are really trying to find the cwnd that suits the current network, and saw through to the essence of the problem: this is a search process. So the BIC algorithm uses binary search to find that window. For the implementation of this algorithm, see the Linux source: /net/ipv4/tcp_bic.c.
TCP Westwood Algorithm
Westwood uses the same slow start and congestion avoidance algorithms as Reno. Its major improvement: it estimates the bandwidth at the sender, and when packet loss is detected, it sets the congestion window and slow start threshold based on that bandwidth estimate. How does it measure bandwidth? It takes a measurement every RTT, with a very simple formula: the number of bytes acknowledged during that RTT. Just as the RTO is computed from smoothed RTT samples, the per-sample bandwidth must be smoothed into a single value, again with a weighted moving average. Also, we know that if a network's bandwidth is X bytes per second and RTT is the time to get a confirmation back after sending, then X * RTT should be the pipe (buffer) size. So in this algorithm, ssthresh = EstimatedBW * min-RTT (the smallest observed RTT). If the loss was detected via duplicate ACKs and cwnd > ssthresh, then cwnd = ssthresh; if it was caused by an RTO, cwnd = 1 and slow start begins. For the implementation of this algorithm, see the Linux source: /net/ipv4/tcp_westwood.c.
Others
For more algorithms, you can find relevant clues from the TCP congestion avoidance algorithm entry in Wikipedia.
Postscript
Well, I think we can stop here. TCP has evolved to the point where several books could be written about it. The main purpose of this article was to walk you through these basic, classic techniques and ideas. I hope it helps you understand TCP, and I also hope it gives you the interest and confidence to study this kind of fundamental or low-level knowledge.
Of course, there is far more to TCP than this, and different people may understand it differently. This article surely contains careless statements and even errors; I welcome your feedback and criticism.
(The End)