Do you want TCP to work in a pipelined way?
Perhaps you have heard of MPTCP; perhaps you have heard how peer-to-peer downloading embodies "everyone for me, and me for everyone."
If I could split one TCP stream into multiple TCP streams, in theory the transfer speed would improve greatly. Every TCP congestion control algorithm has to exhibit fairness convergence (otherwise the paper would never get past review...), and the TCP feedback system hands a ticket to each added stream. The feedback system does not care whether your streams are a gang; it only counts heads and never groups them, so there are a lot of fun things to do here.
I usually like to have informal technical exchanges with peers, casually finding a place to meet or just making a call, mostly after work, which feels more relaxed and comfortable and yields a great harvest. Generally speaking, communication is two-way; if you always listen and never speak, people will slowly feel that you are just there to steal techniques, like the kind of student who stares at others all day, afraid that they know too much. Last weekend I chatted with a colleague for a couple of hours. He is currently developing the concept of a "TCP virtual stream," which interests me a lot and coincides with my own thinking.
It is a super simple idea: achieve speedup by bundling multiple streams. The technique has nothing to do with any company's or individual's copyright or patent, so sharing it is harmless. A week has passed since that chat; there was no time on workdays, but in that refreshing moment when I got up at three o'clock on a weekend morning, I summed it up. Here is a memo of that technical exchange.
Let's start with a brief look at fairness and the fairness index.
1. TCP Fairness

TCP fairness refers to fairness in the allocation of resources; for TCP, this means fairness in bandwidth allocation. The industry has a standard fairness index, as follows:
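This is most likely Jain's fairness index, the measure the congestion control literature usually means by "the fairness index": for n flows with throughputs x_1, ..., x_n,

$$ F(x_1,\dots,x_n)=\frac{\left(\sum_{i=1}^{n}x_i\right)^2}{n\sum_{i=1}^{n}x_i^2}, \qquad \frac{1}{n}\le F\le 1 $$

F equals 1 when every flow gets the same bandwidth and sinks toward 1/n as the allocation grows more skewed.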
If the TCP algorithms deployed in the network all keep this fairness index converging within a range close to 1, then all is well with the world. If some algorithm causes the TCP streams it serves to occupy a little more bandwidth, the fairness index drops.
Therefore a preemptive algorithm cannot be openly recognized; it must operate quietly and in secret.
Theoretically, no algorithm can guarantee more bandwidth than other streams while still satisfying fairness. Like much "innovation," such algorithms play at the edge of the rules on the premise that "the other algorithms are not good enough." Isn't BIC also preemptive in particular scenarios? The industry has focused on this for 30 years without producing a perfect algorithm, so every company and even individual can put forward an algorithm that is "on the right road." Note: only algorithms on that road can be recognized, meaning algorithms that genuinely consider fairness even while playing at its edge. The industry does not demand perfect fairness, only "usable" fairness. As for tricks like sending every piece of data twice, they will never be presentable.
Besides being unacceptable to the community, selfish congestion control algorithms also face the huge cost of algorithmic failure. In our country it is hard to imagine an ordinary company employing mathematicians, and harder still to imagine that such an impetuous environment would let a company or individual stick to one direction for decades. Needless to say, almost all selfish congestion algorithms abandon fairness; they cannot even balance their own trade-off between efficiency and traffic and simply win by volume. That fits our roots, but with operators, traffic costs money, and who pays is another topic.
So rather than struggling to develop an "on-the-road" congestion control algorithm that may ultimately fail, it is easier to design a new protocol on top of an existing algorithm.
2. Fairness between multiple streams

The current default congestion control algorithm on Linux is CUBIC. It works very well and its fairness is excellent. If I trust its behavior, then if I can turn one TCP stream into two or n TCP streams, these n TCP streams will equitably share the bandwidth to the receiving end. To make the discussion more intuitive, I simplify the model.
Assume node A is the sender, node B is the receiver, and the total bandwidth of the intermediate link is W. Currently m TCP streams share the bandwidth W, so by fairness each TCP stream gets W/m, and our A-to-B stream, as one member, naturally also gets W/m. Now, if I split that TCP stream into n streams, the number of TCP flows on the link becomes m+n, each with bandwidth W/(m+n), of which n*W/(m+n) belongs to me. Now we want to prove:

n*W/(m+n) > W/m
Programmers are not mathematicians: if something can be grasped intuitively, we would rather not wade through case-by-case analysis. The proof of the above conclusion goes as follows:
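The algebra is short. Since W > 0 and m, n are positive, the chain of equivalences is:

$$ n\cdot\frac{W}{m+n}>\frac{W}{m} \iff \frac{n}{m+n}>\frac{1}{m} \iff nm>m+n $$

so the whole question collapses to: when is the product of two positive numbers larger than their sum? That is exactly the x+y versus xy comparison below.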
It looks direct to the eye, but is there something more direct? A purely mathematical deduction would spend a lot of energy on case analysis over the real domain (a classical exercise is finding the integer solutions, and the real solutions, of x+y = xy), but for our particular scenario such quantitative deduction is unnecessary: the scenario itself discards many situations that have mathematical significance but no practical significance. We just need to cover more than 90% of the real cases; that is the difference between reality and mathematics. This time I want to try another way and look at it with that marvelous tool gnuplot. Putting the two functions in one picture is easy:
Intuitively, it looks like peeling the protective film off a mobile phone: xy is greater than x+y. But note the colored edge: in places, xy is less than x+y. Let's zoom in:
Unlike a phone film, here one surface wedges into the other and passes beneath it... Does it matter? How much does this region affect our conclusion that "xy is greater than x+y"? Do we need to care? Instead of zooming further, I define the x-axis interval as [1,50] and the y-axis interval as [1,3], and look again:
Because of the symmetry of x and y, such unequal scaling amounts to fixing one variable and observing the long-term trend of the other. Why go to the trouble of comparing two surfaces; wouldn't it be better to plot z = x+y-x*y directly and compare it with 0? The reason is still the same: it is not intuitive. We cannot intuitively judge the concavity of the surface, and what if it waves around near 0? Partial derivatives could settle it, but that takes us back to mathematical deduction. In fact, it may be more intuitive to draw the z = 0 plane and compare it with z = x+y-x*y:
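For the record, the boundary of that region can also be pinned down without plotting, by one factoring step:

$$ xy-(x+y)=(x-1)(y-1)-1 $$

so xy exceeds x+y exactly when (x-1)(y-1) > 1. Over the positive integers this fails only when x = 1, when y = 1, or when x = y = 2, which matches the "abnormal points" discussed next.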
Note the small purple area. We can conclude that these "abnormal points" are not our concern, i.e., we need not worry about the region where xy is smaller than x+y. In our case x is m and y is n. We know n = 1 is meaningless, since that is just standard TCP, and m = 1 is meaningless too, because a lone TCP stream on a link will monopolize it anyway. Mathematically, x = y = 2 has no effect either, because 2+2 = 2*2. Even so, to start from the simplest case we can set n to 2, and in the real world m can be expected to be a very large number (certainly not 2). Now we can calculate the speedup.
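From the quantities above, the speedup of the bundle relative to the single stream follows directly:

$$ S(n,m)=\frac{n\cdot W/(m+n)}{W/m}=\frac{nm}{m+n}\to n \quad (m\gg n) $$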
When n is far less than m, the speedup converges to n. The formula does not look very intuitive on its own, so once again we use gnuplot to look at it:
In fact, MPTCP and Thunder both rely on exactly this speedup!
In an actual implementation, n is constrained by system cost and cannot be too large; around 10 is enough to see the effect. In my two-window unilateral solution, I naturally chose 2.
3. Inter-group fairness between TCP flows

As mentioned above, MPTCP and Thunder both obtain this speedup by bundling flows, but in different ways: MPTCP binds streams at the TCP layer, while Thunder and other P2P software split the application data directly. What better way is there than creating multiple streams directly at the application level? In fact, either way involves modifying the original application or the protocol stack, which cannot be deployed in most scenarios, but their practice directly gives us ideas.
They can do this because TCP stipulates that bandwidth must be shared fairly between flow and flow, but does not specify the details of collaboration between flow and flow. We know that competition and mutual assistance are always two sides of the same thing, constantly transforming into each other.
Some principles are analogous to process scheduling. In recent years, with the emergence of containers and lightweight virtualization, group scheduling strategies have become more and more elaborate. In the early days, scheduling fairness was enforced between individual processes; later, group scheduling was introduced because sessions with fewer processes were being starved by sessions with many. In the end, process scheduling became a fully hierarchical mechanism, and nobody can exploit the loophole anymore. TCP congestion control, by contrast, is not a hierarchical mechanism, at least not at the level of the protocol specification, and that leaves plenty of loopholes.
However, hierarchical, grouped congestion control mechanisms are ubiquitous outside TCP. Router and switch implementations of weighted fair queuing, for example, can group by source/destination IP pair, by destination port, or even by fingerprinting TCP initial sequence numbers. Thus all TCP flows from the same host to the same server may be treated as one flow, which creates real obstacles for multi-stream-bundle TCP acceleration.
But anyway, we never needed a perfect n-times speedup, just a usable one, right?
4. TCP unilateral acceleration with a double (N) congestion window

The preceding sections described, respectively:
1). Why is a new TCP congestion control algorithm hard to develop? - Because efficiency and fairness must be weighed against each other, and the weighing is hard.
2). Why does a multi-stream bundle pay off? - Because of the mathematical deduction above; gnuplot is a very useful and intuitive aid.
3). Why can a multi-stream bundle be implemented at all? - Because TCP does not stipulate that this cannot be done.
4). Why is it difficult to get the full n-times speedup even with a multi-stream bundle? - Because intermediary devices schedule and group traffic.
5). Why can't a ready-made solution be used directly? - Because MPTCP is a bilateral solution, and changing the protocol stack on both ends is too complex; the same goes for the application-layer solutions.
In the end, only one question remains:
6). Given 1-5 above, how do we implement our own multi-stream-bundle TCP acceleration scheme?
We know that TCP sends sequentially: within the congestion window (we ignore the receiver's advertised window here), segments go out in order with the sequence number incrementing by 1. To implement two congestion windows, we introduce the concepts of the "TCP virtual stream" and the "virtual sequence number." A virtual sequence no longer increments by 1 per send; instead it increments by w, where w is controlled by a smoothing parameter.
The so-called smoothing parameter is the amount of data sent before switching to the other window; in our example it is one data segment. In this way, on top of the original sequence number space, we build two interleaved sets of virtual sequence numbers offset by one segment. The two congestion windows transmit in the order of their own virtual sequence sets, so for the receiver the data still arrives in order. Each send sequence maintaining its own set of virtual sequence numbers is one TCP virtual stream. Note that, conceptually, the double (or N) congestion window discussed here is not multiple TCP virtual streams sending concurrently, but multiple TCP virtual streams taking turns, much like a CPU time-slicing across several processes or threads. The parallel version of the double congestion window is discussed later, under speedup.
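A minimal sketch of the bookkeeping this implies, assuming per-segment alternation between two windows; the names (vstream_for_seg, NR_VSTREAMS, SMOOTHING) are mine, purely illustrative:

```c
#include <stdio.h>

#define NR_VSTREAMS 2   /* double congestion window; generalizes to N */
#define SMOOTHING   1   /* segments sent before switching windows    */

/* Map the k-th segment of the connection to the virtual stream
 * (i.e., the congestion window) that sends it. */
static int vstream_for_seg(unsigned long k)
{
    return (k / SMOOTHING) % NR_VSTREAMS;
}

int main(void)
{
    /* Reproduce the interleaving used in the ACK example below:
     * even segments on window 0, odd segments on window 1. */
    for (unsigned long seg = 10; seg <= 20; seg++)
        printf("%lu[CWND%d] ", seg, vstream_for_seg(seg));
    printf("\n");
    return 0;
}
```

A larger SMOOTHING simply makes each window send longer runs before yielding its turn.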
How a TCP virtual stream responds to ACK

When TCP receives an ACK, it clears the acknowledged data from the send queue and routinely does its "bookkeeping": recording the total amount of ACKed data, RTT measurements, and so on, which in turn drives the congestion control algorithm's decisions.
In standard TCP the ACK is cumulative: one ACK can acknowledge multiple segments. But the TCP virtual streams inside a double congestion window must keep two separate ledgers. Consider the following ACK sequence:
last_una: 10
this_ack: 21
Obviously this one ACK acknowledges 11 segments (10 through 20). In a two-window TCP virtual stream, every transmitted data segment has a "virtual sequence number" bound to it, recording which window sent it. On receiving an ACK, while cleaning the send queue and accounting, TCP credits the ledger entry of each acknowledged segment to the TCP virtual stream it belongs to. For the ACK sequence above:
last_una: 10
this_ack: 21
10[CWND0], 11[CWND1], 12[CWND0], 13[CWND1], 14[CWND0], 15[CWND1], 16[CWND0], 17[CWND1], 18[CWND0], 19[CWND1], 20[CWND0]
We see that segment 20 was sent in congestion window 0, so congestion window 0 performs the send-queue cleanup and accounting, but hands the information belonging to congestion window 1 over to TCP virtual stream 1 rather than keeping it. The ledgers come out as follows:
TCP virtual stream 0 "bookkeeper": 6 segments acknowledged, namely 10, 12, 14, 16, 18, 20
TCP virtual stream 1 "bookkeeper": 5 segments acknowledged, namely 11, 13, 15, 17, 19
TCP virtual stream 0 and TCP virtual stream 1 then each adjust their own congestion window based on the ledger above.
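A minimal sketch of that accounting step, under the same per-segment alternation as before (structure and function names are mine, not the kernel's):

```c
#include <stdio.h>

#define NR_VSTREAMS 2

struct vstream_ledger {
    unsigned int acked;   /* segments newly acknowledged for this window */
    unsigned int cwnd;    /* congestion window, in segments              */
};

/* Credit one cumulative ACK to the per-virtual-stream ledgers.
 * Segment k was sent by window k % NR_VSTREAMS (see the sketch above). */
static void account_ack(struct vstream_ledger led[NR_VSTREAMS],
                        unsigned long last_una, unsigned long this_ack)
{
    for (unsigned long seg = last_una; seg < this_ack; seg++)
        led[seg % NR_VSTREAMS].acked++;
}

int main(void)
{
    struct vstream_ledger led[NR_VSTREAMS] = { {0, 10}, {0, 10} };

    account_ack(led, 10, 21);   /* the example: last_una=10, this_ack=21 */
    for (int i = 0; i < NR_VSTREAMS; i++)
        printf("vstream %d: %u segments acked\n", i, led[i].acked);
    /* Each window then runs its own AIMD on its own ledger. */
    return 0;
}
```

Running it reproduces the ledgers above: 6 segments credited to stream 0, 5 to stream 1.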
How a TCP virtual stream responds to congestion

Because the congestion windows of the two virtual streams are controlled independently, we do not want any adhesion between them: one virtual stream's behavior must not affect the other. The strategy I finally find most reasonable is this: when congestion is detected in one window (three duplicate ACKs, say), the other virtual stream's window is simply frozen, that is, its additive increase stops, but the window is not shrunk. However, since the virtual sequence sets must merge back into one fully ordered real sequence, once one window sees duplicate ACKs, the virtual sequence numbers of the other window cannot advance either, much like a pipeline stall. In fact, the pipeline really has stalled! You could also call this the virtual-sequence-number synchronization phenomenon; the name hardly matters.
So, to restart the pipeline, we add a strategy of "congestion mutual assistance": the window that detected the loss must retransmit the lost packet, while the other window, without shrinking, also breaks its own virtual sequence order to assist the retransmission. Generalized congestion mutual assistance means that as soon as any window detects congestion, all virtual streams enter fast retransmit/fast recovery, and the virtual sequence numbers of all virtual streams are renumbered starting from the first retransmitted packet. After the retransmission completes (when NewReno exits fast recovery), virtual sequence numbers are reassigned again, just as on entering fast recovery.
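A sketch of that policy in C, assuming NewReno-style reactions; all names here are illustrative, and "frozen" means suspending additive increase without any multiplicative decrease:

```c
enum vstream_state { VS_OPEN, VS_RECOVERY, VS_FROZEN };

struct vstream {
    enum vstream_state state;
    unsigned int cwnd;
    unsigned int dup_acks;
};

/* Called when virtual stream `i` sees its third duplicate ACK. */
static void on_congestion(struct vstream vs[], int n, int i)
{
    for (int j = 0; j < n; j++) {
        if (j == i) {
            /* The detecting window reacts normally: fast retransmit,
             * multiplicative decrease, fast recovery. */
            vs[j].cwnd = vs[j].cwnd / 2;
            vs[j].state = VS_RECOVERY;
        } else {
            /* The other windows are frozen: additive increase stops,
             * cwnd keeps its value, and their send slots are lent to
             * the retransmission (congestion mutual assistance). */
            vs[j].state = VS_FROZEN;
        }
    }
    /* Virtual sequence numbers are then renumbered from the first
     * retransmitted packet, restarting the pipeline. */
}
```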
At first glance this looks like: I was supposed to send x data and now I send x+a data, so how does it differ from Reno, BIC or CUBIC simply stuffing a few extra segments into one window? The answer is that x and a are independent: each is driven solely by the ACKs of its own data and its own RTT measurements. Remember the "fairness-efficiency" coordinate system? Here is that picture for the third time:
Stuffing more data into one window behaves like the red line "user 1": it does not converge, because it does not do AIMD. The double (N) window approach is different: every window runs AIMD independently. For a two-window scheme, the two independent windows behave like "user 1" and "user 2" in the fairness-efficiency plot, each sharing the link bandwidth equally with all the "outsider" flows.
An analogy between TCP virtual streams and process scheduling

Rotating transmission among TCP virtual streams resembles splitting a sequential program into multiple processes or threads scheduled round-robin by time slice. If the original program never sleeps or waits, this gains nothing on a single core and merely adds scheduling overhead; but if the original program does wait somewhere, then after the split, thread 2 can use the CPU while thread 1 waits. Think of the whole end-to-end TCP send behavior as a sequentially executing program and the intermediary network as a pool of CPUs: when window 1 stops sending due to congestion, window 2 can keep sending. And like a process scheduler, the network itself also allocates its resources fairly. So here is the question; see "added value" below.
The added value of TCP virtual streams

Window 1 has stopped sending; according to congestion control theory it stopped because it detected network congestion. Is it then meaningful for window 2 to keep sending? Isn't that just adding to the jam? Here the beneficial side effect of the double (or N) window appears. The question is how credible "congestion detected in window 1" actually is! Networks produce plenty of spurious congestion signals, especially wireless ones, and if the congestion was spurious, shrinking window 1 was unnecessary. Consider the cases:
1). If window 1 detected true congestion, the send activity in window 2 will also detect congestion and reduce its window accordingly.
2). If window 1 detected false congestion, the send activity in window 2 may well detect none, compensating for window 1's false positive!
The idea resembles a Bloom filter: to avoid the cost of exact matching, adopt the principle that the minority obeys the majority. We can foresee that the more independent congestion windows there are, the more accurate the congestion judgment becomes. With all windows controlled independently, the probability that a majority of them misjudge congestion at the same time is very low!
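That majority vote is trivial to sketch (illustrative only; a real implementation would combine this with each window's own signals):

```c
#include <stdbool.h>

/* Treat the link as congested only if a majority of the independently
 * controlled windows currently see a congestion signal (dup ACKs,
 * ECN marks, RTT inflation, ...). */
static bool link_congested(const bool window_sees_congestion[], int n)
{
    int votes = 0;

    for (int i = 0; i < n; i++)
        if (window_sees_congestion[i])
            votes++;
    return 2 * votes > n;   /* strict majority */
}
```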
Therefore, this "double (N) window TCP side acceleration" in the AI (additive) phase, can improve the bandwidth utilization, in the MD (multiplicative minus) phase can reduce the congestion rate of miscarriage.
Finally, let's look at the speedup. This section is qualitative analysis only; there is no quantitative analysis.
The serial speedup of TCP virtual streams

Strictly speaking, the total send time of serially sent dual TCP virtual streams is the same as that of a standard TCP stream.
However, end-to-end delay consists of two parts, host delay and network delay, and the "send time" here refers to host delay. Relative to network delay, host delay looks insignificant; ignoring it, the network effectively sees the two TCP streams sent simultaneously. If you build n TCP virtual streams, the final speedup will be close to, but not equal to, n, because in the end host delay is never exactly zero.
As can be seen, if TCP sends as one standard flow with a congestion window of 10, it can only emit 10 segments. But after creating 5 TCP virtual streams, each virtual stream computes its congestion window independently; here n is 5 while m may be in the tens of thousands, so each virtual stream's computed window should still be close to 10, and we reach the n-times speedup.
Serial TCP virtual streams have two additional benefits:
1. They can smooth out the impact of shaping by network devices. With a standard TCP stream, even segments sent back to back may be shaped along the way and arrive at the receiver as evenly spaced bursts, something I have mentioned in several articles. Multiple TCP virtual streams can effectively fill the idle gaps the shaper leaves, with one precondition: the shaping device must not shape per five-tuple. Such setups are rare, so few people enjoy this beneficial side effect, but something is better than nothing.
2. They mitigate the slowed ACK clock caused by delayed ACK. If the data receiver enables delayed ACK, that is itself a form of shaping, except that what is shaped is not the data but the ACKs, and the shaped ACK stream runs in the reverse direction of the data flow. With multiple TCP virtual streams, the receiver need not even recognize that there are multiple virtual streams; it simply perceives data arriving faster and more smoothly, so ACKs are triggered more readily.
The parallel speedup of TCP virtual streams

In ultra-high-speed networks one optimization principle rules: reduce latency! Serial TCP virtual streams are not appropriate there; we would like two or more TCP virtual streams to send at the same time. MPTCP does exactly this, but since ours is a unilateral acceleration scheme, no receiver will cooperate in reassembling multiple streams into one. Given network reordering, suppose we sent the following TCP virtual streams in parallel on multiple CPU cores:
Virtual stream 1: 1, 3, 5, 7, 9, 11
Virtual stream 2: 2, 4, 6, 8, 10, 12
Considering that NIC scheduling does not guarantee ordering either, the receiver would very probably see badly reordered packets, which is a great strain on its receive buffer: as long as there is a hole, data cannot be delivered to the upper layer! The solution is therefore synchronized parallel sending: each TCP virtual stream on its CPU core staggers its send times from the previous virtual stream by a fixed interval, one time slot:
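A sketch of the staggering rule (the names and the slot constant are mine; in practice the slot would be tuned to the serialization delay):

```c
#define NR_STREAMS 2    /* number of parallel TCP virtual streams        */
#define SLOT_NS    2000 /* one send time slot, in nanoseconds (made up)  */

/* Stream i delays each of its transmissions by i slots, so segments of
 * different streams interleave on the wire in order instead of racing. */
static unsigned long long send_time_ns(unsigned long long base_ns,
                                       int stream_idx, unsigned long round)
{
    /* round = how many segments this stream has already sent */
    return base_ns
         + (unsigned long long)round * SLOT_NS * NR_STREAMS
         + (unsigned long long)stream_idx * SLOT_NS;
}
```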
So we get a send matrix: horizontally, multiple TCP virtual streams yield a speedup close to n; vertically, parallel sending at a fixed stagger minimizes host send delay. The area of the rectangle can be viewed as the product of time and the number of TCP virtual streams n, a measure of the total amount of data, analogous to the bandwidth-delay product. When we want shorter transmission time (the goal of most TCP acceleration), keeping the area constant means creating more TCP virtual streams to widen the rectangle! The wider the rectangle, the more host time is saved.
To sum up: TCP virtual streams mainly improve bandwidth utilization and avoid congestion misjudgment, and on high-speed networks they can additionally reduce host delay.
About implementation

Honestly, this is much simpler to implement than porting a congestion control algorithm. A conversation with a manager back in the confused days of 2014 also made me understand that people good at assembling things may do better than people good at building components, even though in most cases building components looks cooler.
This TCP virtual stream is plainly assembly work with no deep technical content. For the Linux protocol stack, you only need to turn the congestion-control-related fields of the tcp_sock structure into arrays, e.g., change snd_cwnd into snd_cwnd[2]... that's it. Then drive these variables independently according to the ACK feedback. You may also need to add a virtual sequence number field to the TCP control block, recording which virtual stream each skb currently belongs to. I do not understand why anyone would consider this "messing with the kernel"!
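A sketch of what that grouping could look like; this is not a real kernel patch, and everything beyond tcp_sock, snd_cwnd and snd_ssthresh is my own naming:

```c
#include <linux/types.h>

#define NR_VSTREAMS 2

/* Per-virtual-stream congestion state, split out of struct tcp_sock.
 * Each entry is driven only by ACKs for the segments it sent. */
struct tcp_vstream {
	u32 snd_cwnd;      /* was tp->snd_cwnd     */
	u32 snd_ssthresh;  /* was tp->snd_ssthresh */
	u32 packets_out;   /* in-flight segments of this virtual stream */
};

/* Added to struct tcp_sock:
 *	struct tcp_vstream vs[NR_VSTREAMS];
 *
 * Added to the per-skb control block (TCP_SKB_CB), so the ACK path
 * can credit the right ledger:
 *	u8 vstream_id;
 */
```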
If you don't want to change the kernel, that works too: use a netfilter hook. A hook at the IP layer completely takes over all data handed down by the TCP layer and immediately replies with an ACK to the upper layer, creating the illusion that the data has been delivered, while in fact it is only cached in the hook's buffer. Each CPU core starts a kernel thread that independently runs the AIMD of its own congestion window, and each virtual stream fetches data from the buffer and sends it. A very simple model: the netfilter hook is really just a proxy. Pull the hook out into a standalone device and you have an acceleration gateway that can be sold, and really can sell! But if you don't put it in a box and instead, like a Wenzhou boss switching from selling leather shoes to selling kernel modules, try to sell the module alone, it simply won't sell, or at least it will be hard.
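A minimal skeleton of such a hook on a modern (4.x+) kernel, just to show the shape; the real proxy logic (buffering, fabricated ACKs, per-CPU sender threads with independent AIMD) is elided:

```c
#include <linux/module.h>
#include <linux/netfilter.h>
#include <linux/netfilter_ipv4.h>
#include <linux/ip.h>
#include <net/net_namespace.h>

/* Intercept outgoing TCP segments at the IP layer. A real proxy would
 * queue them for per-virtual-stream sender threads and acknowledge the
 * local stack immediately. */
static unsigned int vstream_hook(void *priv, struct sk_buff *skb,
                                 const struct nf_hook_state *state)
{
	struct iphdr *iph = ip_hdr(skb);

	if (iph->protocol != IPPROTO_TCP)
		return NF_ACCEPT;

	/* ... enqueue skb into the proxy buffer; a kernel thread with
	 * its own AIMD window transmits it later ... */
	return NF_ACCEPT;   /* NF_STOLEN once the proxy takes ownership */
}

static struct nf_hook_ops vstream_ops = {
	.hook     = vstream_hook,
	.pf       = NFPROTO_IPV4,
	.hooknum  = NF_INET_LOCAL_OUT,
	.priority = NF_IP_PRI_FIRST,
};

static int __init vstream_init(void)
{
	return nf_register_net_hook(&init_net, &vstream_ops);
}

static void __exit vstream_exit(void)
{
	nf_unregister_net_hook(&init_net, &vstream_ops);
}

module_init(vstream_init);
module_exit(vstream_exit);
MODULE_LICENSE("GPL");
```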