Introduction to Linux Kernel Engineering-network: TCP congestion control


This material was originally part of a longer TCP article; since that article grew too long, this section was polished and split out on its own.

TCP Congestion Control

Congestion control addresses how simultaneous TCP connections should pace their own sending and receiving so that they share bandwidth with one another, and compete fairly for bandwidth with other machines' traffic, rather than each connection considering only itself.
The core of congestion control is AIMD (additive-increase/multiplicative-decrease): the window grows linearly and shrinks multiplicatively. Why not additive-increase/additive-decrease, or multiplicative-increase/multiplicative-decrease? This has been studied formally: only AIMD converges to a fair share of the link. I am not familiar with the proof, only the conclusion.
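As a hedged illustration (not the kernel's implementation), AIMD can be sketched as two update rules, with the window measured in segments and the function names our own:

```python
# Minimal AIMD sketch: grow additively on each congestion-free round,
# shrink multiplicatively when loss signals congestion.

def aimd_on_ack(cwnd, mss=1):
    """Additive increase: grow by one segment per round."""
    return cwnd + mss

def aimd_on_loss(cwnd, beta=0.5):
    """Multiplicative decrease: cut the window by factor beta."""
    return max(1, int(cwnd * beta))

cwnd = 10
for _ in range(4):            # four congestion-free rounds
    cwnd = aimd_on_ack(cwnd)
print(cwnd)                   # 14
cwnd = aimd_on_loss(cwnd)
print(cwnd)                   # 7
```

The asymmetry is the point: gains are cautious, losses are punished sharply, and that asymmetry is what drives competing flows toward a fair split.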

Congestion detection: the window

Because the sequence-number mechanism naturally gives TCP the ability to perceive network congestion, the question becomes how to react once congestion is perceived. This part is somewhat mathematical, so academia likes to study it; naturally, congestion control algorithms are numerous.
The most widely accepted idea is as follows: the probability of a sudden, sharp change in network speed is small; congestion builds gradually (it still looks abrupt to a human, but to a machine it is a process). During that process, data gradually becomes unreachable, so the two sides should have a learning mechanism. TCP reserves a header field to implement this: the window. The window also reflects the size of the receive cache in memory, and the cache itself has a buffering effect, so even a sudden change in the network is smoothed into a process by the cache. In normal communication the cache is not full; but if one side sends a great deal of data (data stays in the send cache until acknowledged) without receiving acknowledgments, its available send cache shrinks and its sending volume begins to contract. Likewise, the receiver knows the size of its own receive cache, that is, its ability to receive, so it keeps informing the sender of the highest rate it can accept. In effect it says: I can currently take n bytes at a time; send me that much at once, no more, and preferably not much less.
Another important function of this cache is out-of-order reassembly. Received packets are placed in the cache, reordered, and then handed up to the upper layer; that is why the receive cache can easily fill up on an unstable network.
To summarize: the receive cache bounds the maximum receive rate and reassembles out-of-order data; the send cache supports retransmission of lost packets and controls the sending rate.
TCP specifies that this window is related to the cache size, and it is sent together with the acknowledgment sequence number, indicating the range of data that can be received next. The window is used only to inform the peer of one's own ability to receive, not to express one's ability to send. The sending ability is adjusted according to the receiver's capacity plus the sender's current estimate of the channel itself. The core idea: a packet should not tell the peer what to do, only suggest what it might do.
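A minimal sketch of how a sender bounds what it may transmit. The usable window is the receiver's advertised window minus the unacknowledged bytes already in flight, further capped by the sender's own congestion estimate (the function and parameter names here are ours, not the kernel's):

```python
# Illustrative: how much new data may the sender put on the wire right now?

def usable_window(advertised_wnd, cwnd, bytes_in_flight):
    """Advertised window caps the receiver side; cwnd caps the sender side."""
    return max(0, min(advertised_wnd, cwnd) - bytes_in_flight)

# Receiver says "I can take 300 bytes"; 100 bytes are already unacknowledged,
# and the sender's own congestion estimate allows 250 bytes in flight.
print(usable_window(300, 250, 100))   # 150
```

Note that the advertised window only ever suggests a ceiling; the sender is free to send less, based on its own view of the channel.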
The main mechanisms for detecting and controlling data flow through the window are discussed below.

RTT

To determine whether congestion has occurred, TCP can not only observe the window but also directly measure the round-trip time (RTT). Window-based detection is observation; round-trip time is measurement. Vegas, one of the earliest delay-based congestion control algorithms, monitors and reacts to RTT. However, RTT-based control keys on round-trip time rather than actual packet loss, and on the Internet, especially over wireless networks, a rising RTT does not necessarily mean unreachability or congestion. So Vegas proactively reduces its rate (believing the network congested) while loss-based algorithms keep their windows open, causing Vegas to hand its share of bandwidth to others. Such self-sacrificing altruism, like Nagle's, is destined to die out.
Nevertheless, RTT still plays a critical role in congestion control (CUBIC being the exception, as it does not depend on RTT at all): most algorithms add one MSS to the window each time an ACK is received. This is the principle behind the slow-start phase (whose pace, despite the name, is quite rapid).

Congestion avoidance

Preventing disorder before it arises is the soundest way to govern: fearing congestion, the first recourse is to avoid it. To avoid congestion we should analyze its causes, which is plainly a matter of network transmission. But things are not so simple: what causes transmission to congest? You may say there is too much data to transmit. Yet for the most part this is not a business problem but a technical one.
The term "congestion avoidance" here carries two senses. One is the general sense of avoiding congestion, without regard to technical details. The other is the specific technical phase that follows the slow-start algorithm. The entire congestion control process consists of three phases (the latter two belong to the same mechanism): slow start (exponential window growth), congestion avoidance (linear window growth), and congestion handling.

Bandwidth Savings

Naturally, the best way to avoid congestion is not to send so much data. That is unrealistic for users; transmitting data is the whole point of the network's existence. But the kernel still tries, where technically possible, to reduce the amount of data sent on the user's behalf. The classic example is the Nagle algorithm.

Nagle

TCP carries two kinds of traffic: commands and data (TCP itself does not know which a given stream is). Commands are short and require an immediate response. Data is long and can accumulate before a response is needed. Data traffic has high throughput and is usually sent immediately, because each commit from the upper layer hands the kernel a lot of content. Commands, however, if each were sent immediately, would split packets that could have been combined into many small ones, so the same data consumes more bandwidth. For this reason TCP stacks, including Linux's, implement the Nagle algorithm.
The Nagle algorithm specifies that after one packet is sent, the second is not sent until the reply to the first arrives; in the meantime, data accumulates in the send cache. This combines as much command data as possible and saves upstream bandwidth. The idea was good, but the effect is sad, and sadder still, it is on by default. The bandwidth Nagle saves comes from delaying one's own commands rather than sending them immediately, and that delay makes applications feel the system responds slowly. The slowness is the price paid to save upstream bandwidth for others (sending commands generally does not consume much of one's own bandwidth). Simply turn off the Nagle algorithm and you will find transmission response noticeably faster. A typical case is Samba: with Nagle disabled, TCP transfer speed generally increases. A self-sacrificing design like Nagle's is unpopular in the market, however good the original intention.
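Latency-sensitive applications usually turn Nagle off themselves. A minimal sketch using the standard socket API (the option is `TCP_NODELAY`):

```python
import socket

# Disabling Nagle on a TCP socket with TCP_NODELAY: with the option set,
# small writes (e.g. interactive commands) go out immediately instead of
# being held back while waiting for the previous segment's ACK.
sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_NODELAY, 1)
nodelay = sock.getsockopt(socket.IPPROTO_TCP, socket.TCP_NODELAY)
sock.close()
print(nodelay != 0)   # True: Nagle is off for this socket
```

The trade-off is exactly the one described above: you spend a little more upstream bandwidth on small packets in exchange for noticeably snappier interactive response.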

Congestion control

We can detect congestion, and we must also avoid it. All modern congestion avoidance algorithms build on four core mechanisms: slow start, congestion avoidance, fast retransmission, and fast recovery. These four basic algorithms were first brought together in the Reno congestion avoidance algorithm; TCP NewReno later improved the fast recovery algorithm; in recent years Selective Acknowledgement (SACK) appeared, along with improvements large and small in other areas, making this a hotspot of network research.

Slow start and congestion avoidance

The receiver always advertises its own window, say 300 bytes. From the advertised window the sender computes the range of data it may send within the current window; for example, sequence numbers 200 to 500, 300 bytes it can send freely without waiting for an ACK. Suppose the receiver's next reply advertises a window of 500 and acknowledges sequence number 400; the sender then computes that it may freely send sequence numbers 400 to 900, a total of 500 bytes. And so the cycle continues.
The slow-start algorithm is: when a new connection is created, cwnd is initialized to one maximum segment size (MSS), or a small multiple of it (modern Linux defaults to 10 MSS). The sender starts sending according to the congestion window, and each time a segment is acknowledged, cwnd grows by 1 MSS. In this way the value of cwnd increases exponentially per network round-trip time (RTT). In truth, slow start is not slow at all; only its starting point is low.
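"One MSS per ACK" means the window doubles every round trip, since a window of n segments produces n ACKs per RTT. A small sketch (assuming every segment is acknowledged and no loss occurs):

```python
# Slow start growth per RTT: each ACK adds one MSS, so a window of n
# segments grows to 2n after one round trip.

def slow_start_rounds(init_cwnd, ssthresh):
    """Count round trips until cwnd (in segments) reaches ssthresh."""
    cwnd, rounds = init_cwnd, 0
    while cwnd < ssthresh:
        cwnd *= 2             # one MSS added per ACK => doubling per RTT
        rounds += 1
    return rounds

print(slow_start_rounds(1, 64))    # 6 round trips: 1 -> 2 -> 4 -> ... -> 64
print(slow_start_rounds(10, 64))   # 3 round trips with a larger initial window
```

This is why the starting point matters so much for short connections: from 1 segment it takes six RTTs to reach 64 segments, from 10 only three.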
The scenario above, with the window continually increasing, typically occurs at the start of every TCP connection. Since two nodes establishing a TCP connection know nothing about the link quality, the sender cannot tell how much data it may safely burst at once; so, as a probe of the channel, the window starts very small when the connection is established and is then increased multiplicatively. This is called slow start. Once a certain value is reached, growth becomes slow (linear), a gradual, adaptive learning process. The threshold at which growth switches from exponential to linear is called the slow-start threshold (ssthresh). Many algorithms operate on this value (for example, by changing it dynamically).
The phase of slow window growth is the congestion avoidance process. Because there is usually no congestion at the outset, and slow start's initial window is too small, the window grows exponentially so as to reach full speed quickly; but once it hits the configured threshold, growth must turn linear to avoid overshooting into congestion (probe slowly as you approach the limit). This linear-growth phase is the congestion avoidance process.
If congestion occurs during the slow-start phase, one response is for the sender to shrink its congestion window back to its initial value, so it stops sending so much data, and to run the slow-start algorithm again; this is TCP Tahoe. In TCP Reno, when congestion is detected during slow start, the slow-start threshold ssthresh is set to half the current window, and the connection is forced immediately into the congestion avoidance phase.
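The two classic loss reactions can be sketched side by side (units are segments; the function names are ours, and this is an illustration rather than either stack's actual code):

```python
# Tahoe vs. Reno reaction to detected loss, returning (new_cwnd, new_ssthresh).

def tahoe_on_loss(cwnd):
    """Tahoe: remember half the window as the new threshold, restart slow start."""
    ssthresh = max(2, cwnd // 2)
    return 1, ssthresh             # cwnd collapses back to 1 MSS

def reno_on_dupacks(cwnd):
    """Reno: halve the window and enter congestion avoidance immediately."""
    ssthresh = max(2, cwnd // 2)
    return ssthresh, ssthresh      # cwnd continues from ssthresh

print(tahoe_on_loss(40))     # (1, 20)
print(reno_on_dupacks(40))   # (20, 20)
```

The difference is dramatic for throughput: Tahoe must climb all the way back from 1 segment, while Reno resumes at half speed and grows linearly.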

The flaws of slow start

Slow start and congestion avoidance rest on a common-sense assumption: an acknowledged segment indicates the network has headroom, and the sender must guess how much, which depends on the RTT and the current window size. Hence the algorithm increments the window on each acknowledgment. In environments with heavy loss, such as WiFi, the bandwidth may in fact be large and packets are merely lost to radio interference; there, ACK-driven growth performs poorly, the window grows slowly, and the link is easily mistaken for one with insufficient bandwidth. In other words, the sender cannot distinguish network congestion from poor link quality, since both manifest as packet loss.
Also, because slow start's growth rate is fast but its base is tiny, it performs very poorly on short connections. Rather than devising a better algorithm for short connections, people have invented workarounds on top of the algorithm everyone already uses: for example, opening multiple TCP connections at once, or reusing the same connection as much as possible. Even so, services such as web servers are not well served; that is why, when you open a website, the browser downloads different resources over several TCP connections at once.

Fast recovery and fast retransmission (how to avoid unnecessary window reductions)

Enlarging the window is the congestion avoidance process; shrinking it is the congestion control process. Congestion control is the response of both communicating sides to detected (or anticipated) congestion, and it is achieved by adjusting the congestion window. What was said earlier belongs to the slow-start algorithm.

Fast retransmission

Besides growing, the window is also shrunk, and the decrease is generally drastic. After congestion is detected in the network (data goes unacknowledged), the sender shrinks its window, but how much to shrink differs by algorithm. This is where AIMD comes into play: the additive increase of AIMD is the linear window growth of the congestion avoidance phase, and the multiplicative decrease is the back-off after congestion (the window keeps growing until congestion appears or physical cache memory runs out). Fast retransmission is the algorithm that avoids the window thrashing that needless reductions would cause.
The network is unreliable, and this unreliability shows up at both ends as duplicated packets and missing packets. In TCP, duplication causes confusion when the sender receives a duplicate ACK: it might be a network artifact, or it might be the receiver re-sending the ACK because some segment has not arrived. TCP defines no extra mechanism to let the receiver mark repeated ACKs differently, which leaves the sender to guess. Traditionally, a sender receiving multiple identical ACKs would attribute them to network duplication, until its retransmission timer fired and revealed a segment that was sent but never acknowledged; only then would it conclude that the duplicate ACKs came from the receiver missing that segment, not from the network duplicating packets. The fast retransmit algorithm draws the conclusion earlier: on receiving 3 duplicate ACKs (or 4, depending on the implementation), it decides that one of its own transmitted packets was lost and that congestion may have occurred. It retransmits at once, but unlike the slow-start reaction it does not collapse the window to its initial value and start over. Most likely this was only an incidental loss, so it simply resends the missing segment and keeps sending at the original rate. This is fast retransmission, and it significantly reduces the probability of shrinking the window by mistake.
Clearly, the TCP mechanism is a model of asynchronous problem-solving. The design follows entirely from the character of the physical network: today's networks are best-effort. When demand exceeds capacity, the network degrades its quality of service rather than limiting the number of users. With such a network design, every protocol built on top must contend with packet loss and retransmission.

Fast Recovery

Fast retransmission and fast recovery are one algorithm, though for historical reasons they carry the names of two. The names describe the algorithm's two stages: once fast retransmission succeeds, recovery is naturally fast.

Other mechanisms that serve fast recovery: SACK

SACK (Selective Acknowledgement) is a TCP option. It lets the receiver tell the sender which segments have been lost, which have been retransmitted, and which have already been received ahead of order.
With this information, TCP can retransmit only the segments that are truly missing. Note that a SACK may be sent only when out-of-order segments are received; otherwise the ACK is a plain cumulative acknowledgment. That is, if a received segment's sequence number matches the one expected next, a cumulative ACK is sent; SACK blocks are attached only for segments that arrive out of order.
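A sketch of how SACK blocks could be derived from the out-of-order data a receiver holds: given the cumulative ACK point and the byte ranges received beyond it, each contiguous run becomes one block (the helper name and representation are ours; real stacks also order blocks by recency and cap their number):

```python
# Illustrative: derive SACK blocks from received out-of-order byte ranges.

def sack_blocks(cum_ack, received_ranges):
    """received_ranges: (start, end) pairs of data held beyond cum_ack."""
    blocks, cur = [], None
    for start, end in sorted(received_ranges):
        if start <= cum_ack:
            continue                             # covered by cumulative ACK
        if cur and start <= cur[1]:
            cur = (cur[0], max(cur[1], end))     # merge contiguous/overlapping
        else:
            if cur:
                blocks.append(cur)
            cur = (start, end)
    if cur:
        blocks.append(cur)
    return blocks

# ACKed up to byte 1000; 2000-3000 and 3000-4000 arrived out of order,
# plus an isolated 5000-6000.
print(sack_blocks(1000, [(2000, 3000), (3000, 4000), (5000, 6000)]))
# [(2000, 4000), (5000, 6000)]
```

From these blocks the sender can infer that 1000-2000 and 4000-5000 are the only candidates for retransmission.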

D-SACK

Duplicate SACK. RFC 2883 extends SACK. The information in a SACK describes received segments, which may have arrived normally or in duplicate; with the extension, D-SACK can report, in the SACK option, the segments it received repeatedly. Note, however, that D-SACK is used only to report the duplicated portion of the most recently received segment among segments already received.

FACK

The FACK (forward acknowledgment) algorithm takes an aggressive stance: it treats every interval left unacknowledged between SACK blocks as lost. While this often yields better network performance, it is too aggressive, because an interval not yet covered by SACK may merely have been reordered in flight rather than lost.

Implementations of real congestion control algorithms

The preceding is the theoretical basis; implementations must weigh different concerns and choose different parameter ratios. And the network keeps presenting new situations that require targeted adjustment. In recent years, with the spread of high bandwidth-delay product networks, many new loss-feedback-based TCP improvements have appeared, among them HSTCP, STCP, BIC-TCP, CUBIC, and H-TCP. CUBIC is now the default congestion control algorithm on most Linux systems, Ubuntu included. Common algorithms are listed below:

Generally speaking, loss-feedback protocols form a passive congestion control mechanism: they judge congestion from packet-loss events in the network. Even when network load is high, as long as no congestion loss occurs, the protocol will not proactively reduce its sending rate. This maximizes use of the available bandwidth and raises throughput. However, because loss-feedback protocols remain aggressive as the network nears saturation, driving its utilization very high, the next congestion loss event is never far off. These protocols therefore raise bandwidth utilization while indirectly raising the loss rate, intensifying jitter across the whole network. Such an algorithm in effect knows the network is near saturation but insists on occupying the remaining margin anyway; a few users profit from it, but if everyone uses it the network saturates too fast. TCP congestion control is a typical game in which individual interest erodes the collective benefit.
BIC-TCP, HSTCP, STCP, and other loss-feedback protocols greatly raise their own throughput while seriously hurting the throughput of other flows. Loss-feedback protocols show such poor TCP friendliness because of their aggressive congestion window management: they generally assume that as long as no loss occurs, the network must have spare bandwidth, and so they keep raising their sending rate. Viewed over time, the sending rate traces a concave curve, climbing ever faster toward the bandwidth peak. This not only produces large numbers of congestion drops but also annexes the bandwidth of other coexisting flows, degrading fairness across the whole network. Yet precisely because no higher authority exists, such algorithms are destined to become the universal ones. In fact, I think congestion avoidance should not be the endpoint's own business at all, but a matter of router QoS. I would rather all TCP skip congestion control entirely, set the window to the largest size memory supports, and let the routers worry about whether a successful send causes congestion; the endpoint need only watch the loss rate and shrink its window when it climbs too high.
Technically, these algorithms change the Reno strategy for when, and by how much, the window decreases. As can be seen from the table, besides loss feedback there are also delay-based and hybrid-signal algorithms (beyond the scope of this discussion). Because loss feedback currently rules the world, only it is discussed here.

HSTCP (HighSpeed TCP)

HSTCP (HighSpeed TCP) is a congestion control algorithm for high-speed, high-latency networks, still based on AIMD (additive increase, multiplicative decrease), that uses network throughput more effectively in such networks. It modifies the standard TCP congestion avoidance algorithm's increase and decrease parameters so that the window grows quickly and shrinks slowly, keeping it in a range large enough to use the bandwidth fully. It obtains far more bandwidth than TCP Reno on high-speed networks, but it suffers serious RTT unfairness. (Fairness here means equal shares of network resources among flows sharing the same bottleneck.)
The TCP sender dynamically adjusts HSTCP's congestion window increment function according to the network's expected packet loss rate.
Window growth during congestion avoidance: cwnd = cwnd + a(cwnd)/cwnd
Window reduction after loss: cwnd = (1 - b(cwnd)) * cwnd
Here a(cwnd) and b(cwnd) are two functions; in standard TCP, a(cwnd) = 1 and b(cwnd) = 0.5. To preserve TCP friendliness at small windows, i.e. in non-high-BDP environments, HSTCP uses the same a and b as standard TCP; when the window is large (beyond the threshold low_window = 38), new a and b are adopted to meet high-throughput requirements. See RFC 3649 for details.
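A hedged sketch of HSTCP's regime switch: below `low_window` the standard-TCP parameters apply; above it, a larger a and smaller b take over. The real high-window values come from a table in RFC 3649 that varies with cwnd; the high-window pair below is only an indicative placeholder, not the RFC's values:

```python
# Illustrative HSTCP update rules with a two-regime parameter function.

LOW_WINDOW = 38

def hstcp_params(cwnd):
    if cwnd <= LOW_WINDOW:
        return 1, 0.5          # standard TCP: a=1, b=0.5 (TCP-friendly regime)
    return 2, 0.33             # placeholder for the RFC 3649 table lookup

def on_ack(cwnd):
    a, _ = hstcp_params(cwnd)
    return cwnd + a / cwnd     # cwnd = cwnd + a(cwnd)/cwnd

def on_loss(cwnd):
    _, b = hstcp_params(cwnd)
    return (1 - b) * cwnd      # cwnd = (1 - b(cwnd)) * cwnd

print(hstcp_params(20))          # (1, 0.5): behaves like standard TCP
print(round(on_loss(100), 2))    # 67.0 with the placeholder b
```

The shape is the point: small flows are indistinguishable from Reno, while large flows climb faster and give back less on each loss.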

Westwood

In wireless networks, extensive research has found TCP Westwood to be a promising algorithm. Its main idea is to estimate bandwidth by continuously measuring the ACK arrival rate at the sender; when congestion occurs, the congestion window and slow-start threshold are set from that bandwidth estimate, an AIAD (additive increase, adaptive decrease) congestion control mechanism. It improves wireless throughput while keeping good fairness and interoperability with existing networks. Its problem is that it cannot distinguish congestion losses from wireless losses during transmission, so the congestion machinery is invoked too frequently.

H-tcp

In high-performance networks the algorithm with the best overall performance is H-TCP, but it suffers from RTT unfairness and from unfriendliness at low bandwidth.

Bic-tcp

BIC-TCP's disadvantages: first, it is too preemptive. On links with small bandwidth-delay products, BIC-TCP's growth function grabs bandwidth from standard TCP, because its probing phase amounts to restarting a slow-start-like search, while standard TCP grows only linearly once stable and never re-runs slow start. Second, BIC-TCP's window control is split into binary search increase, max probing, and the Smax/Smin bounds, which complicates both the algorithm and the analytical models of its performance. In low-RTT networks and low-speed environments BIC can be too aggressive, so it was refined further, into CUBIC. BIC was Linux's default algorithm before CUBIC was adopted.
Over long fat pipes, however, BIC's aggressiveness is just right.

CUBIC

CUBIC simplifies BIC-TCP's window adjustment. BIC-TCP's window growth curve has concave and convex sections (concave/convex in the mathematical sense of concave and convex functions), and CUBIC replaces it with a cubic function, whose curve likewise has concave and convex parts and closely resembles BIC-TCP's. Crucially, CUBIC's window growth function depends only on the time elapsed between successive congestion events, so window growth is entirely independent of the network RTT. Where HSTCP, described earlier, suffers serious RTT unfairness, CUBIC's RTT independence lets multiple TCP connections sharing a bottleneck link maintain good RTT fairness.
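CUBIC's growth function has a simple closed form: W(t) = C*(t - K)^3 + W_max, where t is the time since the last congestion event, W_max the window at that event, and K the time at which the curve regains W_max. A sketch using the commonly cited constants C = 0.4 and a multiplicative-decrease factor of 0.7 (the window is cut to 0.7*W_max on congestion):

```python
# Sketch of CUBIC's window growth function:
#   W(t) = C * (t - K)^3 + W_max,   K = cbrt(W_max * (1 - BETA) / C)
# Concave below W_max (fast approach, gentle plateau), convex above it
# (cautious probing that accelerates). No RTT appears anywhere.

C, BETA = 0.4, 0.7   # BETA: window is cut to BETA * W_max on congestion

def cubic_window(t, w_max):
    k = (w_max * (1 - BETA) / C) ** (1.0 / 3.0)
    return C * (t - k) ** 3 + w_max

w_max = 100.0
k = (w_max * (1 - BETA) / C) ** (1.0 / 3.0)
print(round(cubic_window(0, w_max), 1))   # 70.0: starts at BETA * W_max
print(round(cubic_window(k, w_max), 1))   # 100.0: plateaus at W_max, then probes
```

Because t is wall-clock time rather than a per-ACK counter, two CUBIC flows with very different RTTs trace the same curve, which is exactly where its RTT fairness comes from.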

STCP (Scalable TCP)

The STCP algorithm, proposed by Tom Kelly in 2003, adapts to high-speed network environments by modifying TCP's window increase and decrease parameters to adjust the sending window. It achieves high link utilization and stability, but its window growth is inversely related to RTT, giving it a degree of RTT unfairness; and when coexisting with traditional TCP flows it occupies excessive bandwidth, so its TCP friendliness is also poor.

TCP Proportional Rate Reduction

This is the protagonist of the finale: the recovery algorithm used by default since kernel 3.2. Its RFC, however, is marked experimental.

