Very good optimizations for TCP Zero Window Update awareness

The technical part of this article begins at "However, there is a kind of packet loss ..."; the rambling before that can be skipped.
This weekend is the second weekend since moving into the new place, and I feel much better than I did back at Luohu. Maybe the colors of this house remind me of my home in Shanghai. In any case, this is the first place that has felt refreshing since I arrived in Shenzhen; I have wanted to leave several times before, but this home made me decide to keep at it. To be honest, I do not like Shenzhen. I do like rain, but the kind of continuous drizzle, not the subtropical pattern of ten minutes of downpour followed by ten minutes of scorching sun. The cities I like have depth, the kind where a thirty-kilometer ride home is enough to finish a book. I had planned to sleep in today; the past few weekends I have been getting up at three in the morning to summarize the week and write articles, but today I got up after five, which counts as enough sleep. I did not expect that after two hours the article still would not come out in one go. The environment affects the mood, and the environment affects efficiency, especially for someone with my temperament.
My writing habit is to add a few casual lines at the top after an article is finished, and this one is no exception. It is now 2016/07/09 08:25, and the paragraph above is what I just added.
TCP Retransmission Overview
TCP uses retransmission to deal with packet loss. Loss is generally assumed to happen somewhere between the two TCP endpoints, and its causes fall into two categories: line errors and network congestion.
TCP has two retransmission mechanisms: timeout retransmission and fast retransmission. Timeout retransmission is the fundamental one; fast retransmission can be thought of as an optimization on top of it. Timeout retransmission is the last line of defense after the ACK clock is lost: it is an external clock that compensates for the momentary loss of the ACK clock and tries to restart it.
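As a rough illustration of the fast path, here is a minimal sketch of the classic fast-retransmit trigger: three duplicate ACKs for the same sequence number are taken as a loss signal without waiting for the timeout. The struct and function names are made up for illustration, not taken from any real stack, and sequence-number wraparound is ignored.

#include <stdint.h>

struct conn {
    uint32_t snd_una;   /* oldest unacknowledged sequence number */
    int      dup_acks;  /* consecutive duplicate ACKs seen so far */
};

void retransmit_segment(struct conn *c, uint32_t seq);  /* provided elsewhere */

void on_ack(struct conn *c, uint32_t ack)
{
    if (ack == c->snd_una) {
        /* Duplicate ACK: nothing new was acknowledged. */
        if (++c->dup_acks == 3)
            retransmit_segment(c, c->snd_una);   /* fast retransmit */
    } else {
        /* New data acknowledged: the ACK clock is ticking normally. */
        c->snd_una = ack;
        c->dup_acks = 0;
    }
}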
ACK instead of NAK
When TCP was first designed, it did not adopt a NAK mechanism to explicitly tell the sender which data had been lost. On the one hand, line resources were expensive in the 1980s; even today roughly 30% of the packets on the Internet are pure ACKs, a ratio that would have been outrageous back then, and adding NAK notifications on top of that would have been even less acceptable. On the other hand, the ACK mechanism alone is entirely sufficient for the sender to work out which packets were lost. Along the way a series of optimizations such as Nagle and delayed ACK were invented, and many of them fight each other; the root cause is that throughput and latency are inherently at odds, like time and space, and you have to make trade-offs.
The advent of SACK
As network technology developed, the interconnection topology between nodes became more and more complex, and layered design is the fundamental way to deal with that complexity. All kinds of hierarchical domains appeared, and organizations were divided into autonomous systems and routing domains. For an end-to-end, in-order protocol such as TCP, this is a challenge!
Packet switching based on statistical multiplexing was originally introduced simply to replace strict circuit-switching techniques such as TDM/FDM and to improve line utilization; large-scale adaptive interconnection of nodes under dynamic routing protocols came later. End-to-end protocols do not cope well at that scale, and the challenge is precisely "in-order arrival". The large-scale, essentially random interconnection of network nodes cannot guarantee the arrival order or timing of transmitted data; yet, by the principle of the layered model, an end-to-end transport protocol is supposed to be indifferent to topology changes and to the statelessness of the network layer itself. That creates a contradiction. The solution, once again, was to modify TCP and keep the network layer simple: add a TCP option, set by the receiver in its ACKs, that explicitly tells the sender which data it has received when data arrives out of order. The sender then decides for itself whether the data in the gaps between the reported out-of-order blocks was lost or merely reordered, and either retransmits it immediately or keeps waiting for the retransmission timer to expire.
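To make the option concrete, here is a minimal sketch of the SACK block layout from RFC 2018 (option kind 5) and of the kind of check a sender can do with it. The helper and its name are illustrative, not TCP stack code, and sequence-number wraparound is ignored.

#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* One SACK block carried in the option: the receiver already holds
 * every byte in [left_edge, right_edge). */
struct sack_block {
    uint32_t left_edge;
    uint32_t right_edge;
};

/* True if an unacknowledged segment [seq, end_seq) is covered by a reported
 * block, i.e. it arrived out of order and does not need to be retransmitted. */
static bool segment_is_sacked(uint32_t seq, uint32_t end_seq,
                              const struct sack_block *blocks, size_t n)
{
    for (size_t i = 0; i < n; i++)
        if (seq >= blocks[i].left_edge && end_seq <= blocks[i].right_edge)
            return true;
    return false;
}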
The timeout timer is an essential external clock
TCP needs a timer-based timeout retransmission mechanism because, apart from the receiver's acknowledgement, the sender has no way of knowing whether a transmission succeeded; it is completely blind to what happens to the data after it is sent. The sender therefore has to judge for itself whether the data was lost (in fact the data may simply be sitting in a network buffer, or taking a detour). If some mechanism could tell the sender why data was lost, it would help the sender make better decisions.
For example, if data is discarded by a router because of congestion, the router could in principle send back a small message saying "the line is congested, please stop sending"; that is the legendary source quench message. But since this is an out-of-band control message, there is no guarantee the sender will receive or process it. Likewise, if data is discarded because of a line error, a message telling the sender so would let it retransmit the lost packet immediately. Unfortunately, TCP cannot count on the intermediate network to send such messages, and even when they are sent, there is no guarantee they will not be dropped on the way back... a real dilemma. So the sender uses an adaptive timeout to time each piece of unacknowledged data. The timeout is computed from the RTT; I will not go into the details here. In particular, TCP's timeout retransmission uses backoff: when the first retransmission is still not acknowledged, the next retransmission waits even longer, and as the retransmission count grows while the data remains unacknowledged, the backoff interval keeps growing and can reach several minutes... Pay attention! This is the root of everything discussed in this article.
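For reference, a minimal sketch of the standard estimator from RFC 6298 (SRTT and RTTVAR smoothed from RTT samples, RTO = SRTT + 4*RTTVAR, doubled on every timeout). This is a simplification: real stacks clamp the result and handle Karn's algorithm, and the names here are illustrative.

#include <math.h>

struct rto_state {
    double srtt;     /* smoothed round-trip time, seconds */
    double rttvar;   /* round-trip time variation */
    double rto;      /* current retransmission timeout */
    int    backoff;  /* consecutive timeouts so far */
};

void rto_on_rtt_sample(struct rto_state *s, double r)
{
    if (s->srtt == 0.0) {                  /* first measurement */
        s->srtt   = r;
        s->rttvar = r / 2.0;
    } else {
        s->rttvar = 0.75 * s->rttvar + 0.25 * fabs(s->srtt - r);
        s->srtt   = 0.875 * s->srtt + 0.125 * r;
    }
    s->rto = s->srtt + 4.0 * s->rttvar;    /* fresh estimate ... */
    s->backoff = 0;                        /* ... clears any accumulated backoff */
}

void rto_on_timeout(struct rto_state *s)
{
    s->backoff++;
    s->rto *= 2.0;                         /* exponential backoff; this is what
                                              can grow to several minutes */
}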


However, there is a kind of packet loss ...
However, there is a kind of packet loss that can be made known to the sender, it seems: the loss caused by insufficient buffer space at the data receiver. Let me state the conclusion first and then analyze it in detail.
Conclusion
When the sender receives a window update from the receiver that changes the advertised window from 0 to non-zero, it should check whether there is data that has been sent but not yet acknowledged. If there is, it should retransmit that data immediately and reset the retransmission timeout timer.
Otherwise this situation becomes a performance bottleneck. The essential reason: a packet loss has, in effect, been reported by the receiver to the sender, yet the sender is not aware of it and keeps going about its own business. The fix is equally simple: make that notification something the sender actually senses and acts on.
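In pseudo-C, the rule stated in the conclusion amounts to something like the following sketch. The names are illustrative only; the kernel-level version I actually used appears later in the article.

#include <stdint.h>

struct conn { uint32_t packets_out; };                  /* segments sent but unacked */

void retransmit_head_of_write_queue(struct conn *c);    /* provided elsewhere */
void reset_retransmission_timer(struct conn *c);        /* provided elsewhere */

/* Called whenever an ACK changes the peer's advertised window. */
void on_window_update(struct conn *c, uint32_t old_wnd, uint32_t new_wnd)
{
    if (old_wnd == 0 && new_wnd > 0 && c->packets_out > 0) {
        retransmit_head_of_write_queue(c);   /* resend the oldest unacked segment now */
        reset_retransmission_timer(c);       /* and forget any accumulated RTO backoff */
    }
}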
Cause analysis: packet loss due to insufficient receive-buffer quota can be optimized
1. Why does insufficient receive-side buffering cause packet loss at all? You may ask: isn't the advertised window exactly the amount of data the receiver told the sender it could still accept? How can the sender respect the window and the receiver still run out of space?
Leaving aside senders that ignore the receive window and blast away regardless, let us look at it purely from the receiver's point of view. The receive window only says how much the sender may send; it reflects the buffer space remaining at the moment the receiver emitted the ACK carrying that window. Roughly one RTT later, when the sender has actually sent that much data, whether the receiver can accept it also depends on a global quota. If, in the meantime, other TCP connections have grabbed more of that quota and not enough is left, the packets are dropped.
In the TCP implementation of Linux, there is a kernel parameter:
tcp_mem - vector of 3 integers: min, pressure, max
    min:      below this number of pages TCP is not bothered about its
              memory appetite.
    pressure: when the amount of memory allocated by TCP exceeds this
              number of pages, TCP moderates its memory consumption and
              enters memory pressure mode, which is exited when memory
              consumption falls under "min".
    max:      number of pages allowed for queueing by all TCP sockets.
              Defaults are calculated at boot time from the amount of
              available memory.

That is what this parameter is about. Whether other implementations have an equivalent, I am not sure; if anyone knows the details, please let me know and we can discuss it. The decision logic in the Linux implementation is fairly simple, as the following pseudo-code shows:

The window that the receiver advertises to the sender in its ACKs is calculated as follows:
window = socket.receive_buffer - socket.alloced;

When the receiver gets an skb, it makes the following checks to decide whether to accept or drop it:
# check_local only enforces this connection's own advertised window; as long as
# the sender strictly respects the advertised window, it normally passes.
if (check_local(socket, skb_size) != 0)
        DROP;
# check_global enforces the memory quota shared by all TCP connections, and
# other connections may have grabbed more of that quota.
else if (check_global(socket, skb_size) != 0)
        DROP;
else
        ACCEPT;


Let's take a look at the details of check_local:
bool check_local(socket, skb_size)
{
        if (socket.alloced + skb_size > socket.receive_buffer)
                return 1;
        else
                return 0;
}


When check_local passes, the data does not exceed the window this connection last advertised; the next step is to check whether the global quota allows TCP to accept this skb:
bool check_global(socket, skb_size)
{
        # First look at this socket's pre-allocated quota.
        if (socket.forward_alloc >= skb_size)
        {
                # This skb consumes quota equal to its length.
                socket.forward_alloc -= skb_size;
                return 0;
        }
        # If the pre-allocated quota is not enough, try to carve a new chunk of
        # pre-allocated quota for this socket out of the global quota.
        else
        {
                new_forward = align(skb_size, page_size);
                # If this would exceed the kernel's configured maximum TCP memory
                # quota, the check fails, which means the newly received skb is dropped!
                if (g_mem + new_forward > sysctl_mem)
                {
                        return 1;
                }
                else
                {
                        # The pre-allocation succeeded and is credited to the socket;
                        # from now on skbs can consume this quota.
                        socket.forward_alloc += new_forward;
                        g_mem += new_forward;
                        return 0;
                }
        }
}

Note that none of this involves an actual memory allocation; it only checks quotas. The skb already exists and has reached the TCP layer, so its memory allocation has obviously succeeded. The purpose of the check is to keep TCP connections from eating memory without limit: once the limit is exceeded, the skb is dropped and its memory freed. When the skb was allocated, the lower layers could not yet tell what kind of skb it would turn out to be; only after it reaches the TCP layer can the stack check TCP's quotas globally. Together with check_local above, this forms a layered system of checks:
skb allocation: a global memory check.
check_local (advertised-window check): a check scoped to this TCP connection.
check_global (TCP quota check): a check scoped to the TCP protocol as a whole.

It is also important to note that the pseudo-code above omits the more complex logic. The kernel parameter tcp_mem actually has three values: a safe threshold, a pressure threshold, and a hard limit. Depending on which interval the global quota falls into, the kernel takes different measures to handle incoming skbs more smoothly; where possible, the stack frees some of the memory occupied by out-of-order skbs so that in-order skbs can still be accepted. None of that affects the fundamental optimization described in this article.
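As a hedged sketch of how those three thresholds gate acceptance: the real kernel keeps a sticky pressure flag with hysteresis (exited only when usage falls back under "min"), which is reduced here to a stateless check with illustrative names.

enum mem_verdict { MEM_OK, MEM_MODERATE, MEM_DROP };

/* pages:     pages currently allocated by all TCP sockets
 * mem[0..2]: the tcp_mem values min, pressure, max         */
enum mem_verdict check_tcp_mem(long pages, const long mem[3])
{
    if (pages <= mem[0])
        return MEM_OK;         /* below "min": TCP is not bothered at all */
    if (pages > mem[2])
        return MEM_DROP;       /* above "max": new skbs get dropped */
    if (pages > mem[1])
        return MEM_MODERATE;   /* above "pressure": moderate consumption,
                                  e.g. prune the out-of-order queue */
    return MEM_OK;             /* between "min" and "pressure" */
}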
2. If unacknowledged data exists when the window update arrives, how can we be sure the loss was not caused by network congestion? First, restate my optimization strategy: "when the sender receives a window update from the receiver that changes the advertised window from 0 to non-zero, it should check whether there is data that has been sent but not acknowledged; if there is, it should retransmit that data first and reset the retransmission timeout timer."
What if that judgement is wrong? In other words, what if the loss was not caused by an insufficient receiver buffer quota but by network congestion? If that is the case, what harm is done by retransmitting the old data instead of sending new data?
In fact, the cost is at most one useless retransmission of a single segment. If the network really is congested, new data would very likely be lost as well. But the point I want to make next is this: when a window update arrives and there are still unacknowledged segments, the probability that this is due to network congestion is so low that you can safely assume the following: if the window update changes the window from 0 to non-zero and there is still unacknowledged data in flight, the loss was almost certainly caused by an insufficient receive-side buffer quota.
Suppose the receiver advertises a window of size N and the sender sends N bytes of data. Given today's networks and hosts, where links are at least 100 Mbps or gigabit and host memory is measured in gigabytes, N is typically a fairly large multiple of the MSS; in other words, the number of segments in flight is well above a few MSS. Two cases:
1). If congestion drops hit the first N-3 segments, the segments that do arrive cannot fill the advertised window, and the loss triggers three duplicate ACKs; the window slides slowly or even stalls, so fast retransmission fires before the advertised window is ever filled.
2). If congestion drops hit only the last 3 segments, fast retransmission cannot be triggered, but a timeout retransmission will be, after which things continue as in 1) or 2).
Therefore it is almost certain that if data is still unacknowledged when a zero-window advertisement arrives from the receiver, the cause is that the receiver's quota was exhausted and the packet was dropped there. At that point the sender can only recover through a timeout retransmission. Note that the receive window has shrunk to zero, so even if multiple duplicate ACKs arrive (in fact the ACK elicited by every zero window probe is a duplicate ACK), fast retransmission cannot fire. Can a timeout retransmission even be carried out at this point? Look at the tcp_retransmit_skb function in the Linux TCP implementation: when the retransmission timer expires, tcp_retransmit_skb is called unconditionally on the first skb of the write queue, and the function contains the following logic:
tcp_retransmit_skb(sk, tcp_write_queue_head(sk)):
    ...
    /* If receiver has shrunk his window, and skb is out of
     * new window, do not retransmit it. The exception is the
     * case, when window is shrunk to zero. In this case
     * our retransmit serves as a zero window probe.
     */
    if (!before(TCP_SKB_CB(skb)->seq, tcp_wnd_end(tp)) &&
        TCP_SKB_CB(skb)->seq != tp->snd_una)
            return -EAGAIN;
    ...
    err = tcp_transmit_skb(sk, skb, 1, GFP_ATOMIC);

Note that when the send window is insufficient, the only path that brings the execution flow to tcp_retransmit_skb is the retransmission timer expiring. Why is the retransmission still allowed through when the window has shrunk to zero? As the comment says, the retransmission doubles as a zero window probe. The intent is quite thoughtful: not a single byte of bandwidth is wasted.
3. Is the still-unacknowledged data retransmitted as soon as the window update is received? No!
We know that under normal circumstances TCP data transmission does not depend on any external clock; it is self-clocked, i.e. driven by ACKs. But once the receiver's buffer quota runs out, packets are dropped and the sender is not allowed to send any more data until the receiver has enough buffer quota again. During that time the ACK self-clock is shut down; you can think of it as lost.
TCP relies on an external timer to stand in for the clock while the ACK self-clock is lost, and that is the zero window probe mechanism: the sender periodically transmits a probe packet that induces the receiver to respond with an ACK carrying its current receive window, which in turn reflects the state of its buffer quota. Once the sender receives an ACK advertising a non-zero window, it may continue sending data (whether it sends immediately also depends on algorithms such as Nagle and CORK), and being able to send data again means the ACK self-clock can restart.
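A minimal sketch of that probe loop, with illustrative names rather than kernel code (the real persist timer is more involved):

#include <stdint.h>

struct zwin_state {
    uint32_t snd_wnd;          /* last window advertised by the receiver */
    double   probe_interval;   /* current probe timer, seconds */
};

void send_window_probe(struct zwin_state *c);   /* sends one byte past the window */

void on_persist_timer(struct zwin_state *c, double max_interval)
{
    if (c->snd_wnd != 0)
        return;                               /* window reopened; probing stops */
    send_window_probe(c);                     /* the elicited ACK carries the
                                                 receiver's current window */
    c->probe_interval *= 2.0;                 /* probes back off too ... */
    if (c->probe_interval > max_interval)
        c->probe_interval = max_interval;     /* ... but are capped */
}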
All of this seems reasonable, but there is a performance problem. Suppose a window update is received while already-retransmitted data is still unacknowledged (we know by now that this almost certainly means the receiver's buffer quota ran out). If that data could be retransmitted immediately, it would be acknowledged within one RTT and the TCP flow would pick up again. However, because the ACK clock is gone, the sender relies entirely on the external retransmission timer to decide when to retransmit the unacknowledged data, and we know the retransmission timer uses a backoff strategy: put simply, the retransmission interval keeps growing. When the ACK carrying the window update arrives, the retransmission timer may still be a long way from expiring. There are three possible scenarios, assuming the sender has disabled Nagle:
1). The receiver frees only a very small amount of window space

The following TCP time sequence diagram describes this scenario:
[time sequence diagram not reproduced here]
2). Building on 1), the receiver then frees a bit more window space; see the following timing diagram:
[time sequence diagram not reproduced here]
3). The window freed at the receiver is large enough. This case is not essentially different from 2) above. How much of the red "wasted time" can be recovered depends on the relationship between "how much window the receiver freed" and "how long until SEG 13 times out". If the new data fills the freed window before SEG 13's retransmission timer expires, the red "wasted time" appears; if the timer expires before the freed window has been filled with new data, no time is wasted. But what is the probability of the latter?
Very low! The host's processing delay and the network delay (reflected in the RTT) are simply not of the same order of magnitude. On an ultra-high-speed network the host-side delay may be a significant cost, possibly even larger than the network delay, but on such a network the receiver is very unlikely to advertise a zero window in the first place. Isn't there a principle that says "on ultra-high-speed networks, optimizing the end hosts pays off more than optimizing the network itself"? That means the end host is the bottleneck, and if even this has not been taken care of, what is the point of anything else?

In summary, in either case, retransmitting the already-retransmitted data first and resetting the retransmission timer is a good choice.

How to optimize away this performance loss
So, how to optimize it? It is very simple. I did my testing on top of tcp_probe, modifying the hook function inside it; in this case I hooked the tcp_ack function:
static inline int tcp_may_update_window_ok(const struct tcp_sock *tp,
                                           const u32 ack, const u32 ack_seq,
                                           const u32 nwin)
{
        return (after(ack, tp->snd_una) ||
                after(ack_seq, tp->snd_wl1) ||
                (ack_seq == tp->snd_wl1 && nwin > tp->snd_wnd));
}

static int jtcp_ack(struct sock *sk, struct sk_buff *skb, int flag)
{
        const struct tcp_sock *tp = tcp_sk(sk);
        const struct inet_sock *inet = inet_sk(sk);
        struct tcphdr *th = tcp_hdr(skb);
        struct inet_connection_sock *icsk = inet_csk(sk);
        u32 ack_seq = TCP_SKB_CB(skb)->seq;
        u32 ack = TCP_SKB_CB(skb)->ack_seq;
        u32 rnwin = ntohs(tcp_hdr(skb)->window);
        u32 nwin = rnwin;
        u32 owin = tp->snd_wnd;

        if (likely(!tcp_hdr(skb)->syn))
                nwin <<= tp->rx_opt.snd_wscale;

        if ((port == 0 || ntohs(inet->dport) == port ||
             ntohs(inet->sport) == port) &&
            tcp_may_update_window_ok(tp, ack, ack_seq, nwin) &&
            (owin < nwin && owin <= 2*tp->mss_cache)) {
                printk("hit! owin:%u, nwin:%u\n", owin, nwin);
                icsk->icsk_retransmits++;
                tcp_retransmit_skb(sk, tcp_write_queue_head(sk));
                inet_csk_reset_xmit_timer(sk, ICSK_TIME_RETRANS,
                                          icsk->icsk_rto, TCP_RTO_MAX);
        }
        jprobe_return();
        return 0;
}

In the production version, the logic above goes into tcp_ack itself. The test results matched expectations nicely. The only difficulty is that the phenomenon is not easy to reproduce. To simulate drops caused by insufficient buffer quota, I ran the following script:
id=$(ps -e | grep curl | awk '{print $1}');
echo $id;
while true; do
        kill -STOP $id;
        sleep 10;
        kill -SIGCONT $id;
        # sleep 10 ms, to avoid freeing up too much buffer space at once!
        sleep 0.01;
done
However, TCP automatically reduces its sending rate, so to create bursts and drops I tried to build a burst queue on the sending side, which turned out to be useless. The simplest approach is to use the tcp_probe mechanism on the sender to ignore the receiver's advertised window and directly modify the tp->snd_wnd field. Simulation is painful; the best approach is to capture packets on a real network and compare the captures, and the efficiency statistics, before and after the optimization.
With this kind of simulation you cannot get identical numbers on every run; without statistical significance the test is meaningless. After all, no one, using any technique, can control every detail of a TCP test. It depends on the behavior of people all over the world: someone in Shijiazhuang may suddenly feel down, turn on their computer and start watching movies, and that will affect your results! It is a chaotic system, and the butterfly effect is at work at every moment.
Understanding the nature of the phenomenon is more important and more useful than knowing how to optimize it, because once you understand a phenomenon deeply enough, you find there is more than one way to optimize it. In this case, another option is to retransmit a real data segment as the zero window probe; if the ACK that comes back still advertises a zero window, stop backing off the retransmission timer. I do not know whether Linux actually does this now, but to my mind it is reasonable. Why? Because backoff exists to avoid network congestion; when your probe does elicit a response to the retransmission, you can conclude that the network is not congested and that this loss was caused by insufficient buffer quota at the receiver. Boom!
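A hedged sketch of that alternative, under the assumption that the probe is a retransmission of real data; the names are illustrative and I am not claiming any stack implements it this way:

#include <stdint.h>

struct rto_conn {
    uint32_t snd_una;    /* oldest unacknowledged sequence number */
    int      backoff;    /* RTO backoff exponent */
    double   rto;        /* current retransmission timeout */
    double   base_rto;   /* RTO without backoff */
};

/* Called for an ACK received while we are probing a zero window with
 * retransmitted data. */
void on_ack_during_zero_window(struct rto_conn *c, uint32_t ack, uint32_t win)
{
    if (win == 0 && ack == c->snd_una) {
        /* The peer answered our retransmission but still has no buffer space:
         * the path is clearly not congested, so stop backing off. */
        c->backoff = 0;
        c->rto = c->base_rto;
    }
}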

Congestion window and receive window
This article has not mentioned the congestion window, because it is about end-to-end flow control, which has nothing to do with network congestion; since network congestion is ignored, the congestion window naturally does not come up. But I still want to say a few words at the end. The receive window and the congestion window are different things.
Almost everyone knows that a TCP sender uses the minimum of the congestion window and the receive window as its send window. That understanding is incomplete, because the congestion window imposes fewer constraints. To put it bluntly: the congestion window is not a sliding window, it is a scalar; it only controls how much data may be sent, not the ordering relationship between the data. The receive window is a sliding window, a vector: it controls not only how much data may be sent (determined by the peer's buffer quota) but also the order in which it is sent (determined by TCP's in-order delivery). And that is reasonable, because today's networks are basically stateless, connectionless IP networks: the network does not care about TCP sequence numbers, only about whether TCP's sending rate exceeds its processing capacity, whereas the receiver cares both about whether the data exceeds its own capacity and about the ordering. Consequently, when TCP sends or retransmits data, the quota checks against the two windows differ, as follows:
1). Checking the congestion window quota. The check here is very simple, purely a size comparison:
if (segments_in_flight < congestion_window)
        TRUE;
else
        FALSE;

As long as this is true, another skb may be sent, no matter which skb it is!
2). Checking the receive window quota. For the receive window, not only the size but also the sequence must be checked, so the check is tied to the specific skb being sent:
if (skb.end_sequence - tcp.una < receive_window)
        TRUE;
else
        FALSE;

Only when the quota checks against both windows pass can the data be sent.
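Putting the two checks together, a minimal sketch of the send gate with illustrative names (sequence wraparound ignored):

#include <stdbool.h>
#include <stdint.h>

struct send_state {
    uint32_t packets_in_flight;   /* segments sent but not yet acknowledged */
    uint32_t cwnd;                /* congestion window, in segments */
    uint32_t snd_una;             /* oldest unacknowledged sequence number */
    uint32_t snd_wnd;             /* receive window advertised by the peer */
};

bool can_send(const struct send_state *c, uint32_t skb_end_seq)
{
    /* Congestion window: a scalar, only "how much" matters. */
    if (c->packets_in_flight >= c->cwnd)
        return false;
    /* Receive window: a sliding window, tied to this segment's sequence range. */
    if (skb_end_seq - c->snd_una > c->snd_wnd)
        return false;
    return true;
}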
Postscript

This problem did not come out of thin air, nor did I find it by reading code. It comes from a real problem that had been around for more than a year; just a few days ago a colleague finally caught it in a packet capture. It is hard to simulate, but catching it in the wild is good enough. I include the capture screenshot below:
[packet capture screenshot not reproduced here]
It is very clear that when the window update is received, the retransmission timeout for the data at sequence 1461 has already backed off to the minute level. Although the capture shows no zero-window ACK, that is because those ACKs were dropped in the qdisc queue. At this point, if the data were retransmitted immediately upon receiving the window update, roughly a minute could be saved. Here is another example:
[second packet capture screenshot not reproduced here]
Hence this article.
