Fully implementing the Linux TCP pacing send logic: the high-precision hrtimer version


The implementation of this code is simple; it is the thinking behind it that gets complicated.
If you take the plain-timer (timer_list) version of the Linux TCP pacing send logic and simply swap the timer for an hrtimer, it is bound to fail: the hrtimer callback runs in hard-interrupt context, and calling a long-path function such as tcp_write_xmit from there is asking for trouble. So, reluctantly, I fell back on the approach used by TSQ.
In the Linux TCP implementation, TSQ (TCP Small Queues) keeps a single flow from occupying too much of the send buffer, which preserves rough fairness among multiple flows. That mechanism is built on a tasklet, so I figured TCP pacing could be driven by a tasklet as well.
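Before the actual patch below, a minimal sketch of that TSQ-style deferral pattern in isolation may help. Everything in it (the names pacing_work, pacing_work_func, pacing_fire, do_heavy_tx_work) is invented for illustration and is not the author's code:

        /* Illustrative only: the hrtimer callback runs in hard-IRQ context, so it merely
         * schedules a tasklet; the heavy transmit work runs later in softirq context. */
        static struct tasklet_struct pacing_work;

        static void pacing_work_func(unsigned long data)
        {
                /* Softirq context: long code paths (e.g. pushing packets) are acceptable here. */
                do_heavy_tx_work((void *)data);
        }

        static enum hrtimer_restart pacing_fire(struct hrtimer *t)
        {
                /* Hard-IRQ context: do as little as possible. */
                tasklet_schedule(&pacing_work);
                return HRTIMER_NORESTART;
        }

        /* Somewhere during setup:
         *      tasklet_init(&pacing_work, pacing_work_func, (unsigned long)ctx);
         *      hrtimer_init(&timer, CLOCK_MONOTONIC, HRTIMER_MODE_ABS_PINNED);
         *      timer.function = pacing_fire;
         */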
-------------------------------------
TCP destroys the harmony of the entire network world! Why?
Who says a TCP sender must maintain a congestion window? Whoever first said it has simply been taken as the authority! But the concept of a "congestion window" describes only one side of the matter and says nothing about the other. As I have said before, the congestion window is a scalar, not a vector: it is just a number measured along a single dimension, representing "how much data may be sent right now", and nothing more. The congestion window has no idea what is actually happening on the network. What is certain, however, is that the arrival behavior observed at the receiver is real: there is a gap between two arriving packets! Capture and analyze packets separately at the sender and at the receiver and you will see this fact immediately.
Now look at the sending side. In Linux, for example, sending is handled by tcp_write_xmit, which sends everything it is allowed to send in one go; the delay between consecutive packets is merely the host-side delay of pushing each packet down the protocol stack, which is negligible compared with the delay of a long fat network. Do not be fooled by the illusion of Gigabit Ethernet and the data center: most of the content we reach on the Internet comes from thousands of miles away over a rough and bumpy path, and CDNs are largely commercial marketing. Back to the point: the sender bursts its packets out all at once, yet the receiver receives them spaced apart, so something must have happened in between. (For example, on a 10 Mbit/s bottleneck a 1500-byte packet takes about 1.2 ms to serialize, so the receiver sees roughly 1.2 ms gaps even though the sender emitted the packets only microseconds apart.) That spacing is the root of it, but we need not dwell on it here; we can simply go along with TCP. Consider a similar phenomenon: a wedding motorcade. Dozens of German cars (Mercedes-Benz, BMW, Audi) led by a "lead car" (say, a modified Bentley) set off in a single neat line, but as they approach the bride's home the cars have to stop and wait for one another in order to recreate the illusion of a continuous formation arriving together. That is exactly what traffic shaping does. Along the way the road is a statistically multiplexed system not dedicated to anyone, and with traffic lights, queue-jumping, and congestion the convoy gets scattered: cars that started a few meters apart end up a kilometer apart, so near the destination they must regroup before driving into the bride's home in formation. NIC data-transmission technology has analogues for all of this; the final regrouping, for instance, corresponds to LRO (large receive offload) or fragment reassembly.
But that is not quite my point. My point is: do you realize how much a wedding convoy disrupts traffic? Assume nobody is being malicious and nobody resents anyone; can you still picture the scene of such a huge motorcade pulling out of a narrow alley? Unfortunately, almost every host on the Internet that can send data (whether a data-center machine or your own computer) is pulling a convoy like that out of an alley at every moment! Could we not show a little restraint, leave at least a car's length between vehicles, so that at an intersection other traffic has a chance to cross? In practice almost nobody does, because nobody else does: bad money drives out good, and everyone is eager to make their own convoy longer!
On the network, the TCP congestion window says how much data may be sent, but why can't the arrival behavior actually observed at the receiver guide the sending behavior? Why, for decades, has it remained a one-shot burst? At least Linux TCP behaves this way, and I doubt the others are much better; "evil will be cut off" is only a wish. Bad money drives out good; when everyone behaves badly, bad behavior becomes the norm.
This is the sorrow of TCP. Google's engineers saw the same sorrow and created the BBR algorithm, which countless people now kneel before, and that is sadder still. Google's BBR patch says:
The primary control is the pacing rate: BBR applies a gain
multiplier to transmit faster or slower than the observed bottleneck
bandwidth. The conventional congestion window (cwnd) is now the
secondary control; the cwnd is set to a small multiple of the
estimated BDP (bandwidth-delay product) in order to allow full
utilization and bandwidth probing while bounding the potential amount
of queue at the bottleneck.
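To make the quoted relationship concrete, here is a small, self-contained calculation of my own (the numbers are made up, not measurements): with pacing as the primary control, the cwnd is derived from the pacing rate and the RTT, not the other way around.

        #include <stdio.h>
        #include <stdint.h>

        int main(void)
        {
                /* Illustrative numbers only. */
                uint64_t pacing_rate = 50 * 1000 * 1000 / 8; /* 50 Mbit/s, in bytes/s   */
                double rtt = 0.040;                          /* 40 ms round-trip time   */
                double gain = 2.0;                           /* "small multiple" of BDP */
                uint32_t mss = 1448;

                double bdp = pacing_rate * rtt;              /* bandwidth-delay product */
                double cwnd = gain * bdp;

                printf("BDP  = %.0f bytes (~%.0f packets)\n", bdp, bdp / mss);
                printf("cwnd = %.0f bytes (~%.0f packets)\n", cwnd, cwnd / mss);
                return 0;
        }

With these numbers the BDP is 250,000 bytes (about 173 packets) and the cwnd is 500,000 bytes (about 345 packets).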

To summarize the complaint in that paragraph: controlling TCP's send behavior with the congestion window alone is a rubbish approach. So I decided to implement the hrtimer version of Linux TCP pacing.
-------------------------------------
The overall framework is as follows (the original diagram is not reproduced here).
The earlier regular-timer version of TCP pacing was only a primer; the tasklet-based hrtimer version implemented in this article is the real thing. Once the framework is understood, the implementation is easy. It really is simple.
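Since the diagram is missing, here is a rough sketch of the flow as it can be inferred from the code in the sections below:

        tcp_write_xmit()
            |  now < next_to_send: too early to send this skb
            v
        tcp_pacing_reset_timer()  ---- arms ---->  hrtimer (sk->timer)
                                                        |  fires (hard-IRQ context)
                                                        v
                                               tcp_pacing_timer()
                                                        |  queues the socket, schedules the tasklet
                                                        v
                                               tcp_pacing_func()      (softirq context)
                                                        |
                                                        v
                                               tcp_pacing_handler()
                                                        |  socket free: tcp_push_pending_frames()
                                                        |  socket owned by user: defer to tcp_release_cb()
                                                        v
                                               tcp_write_xmit()  ... and the cycle repeats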
-------------------------------------
As with the normal-timer version, I have broken the code into several sections. This is not the complete code: I have omitted some pieces, such as the list_head initialization and some variable initialization.
1. Implementation of the tasklet:
Define pacing_tasklet:

        /* include/net/tcp.h */
        struct pacing_tasklet {
                struct tasklet_struct   tasklet;
                struct list_head        head;   /* queue of TCP sockets */
        };
        extern struct pacing_tasklet pacing_tasklet;

        /* net/ipv4/tcp_output.c */
        /* Define the per-CPU tasklet variable */
        DEFINE_PER_CPU(struct pacing_tasklet, pacing_tasklet);

        /* Separate handler, kept apart from the tasklet action so the action does not get too long */
        static void tcp_pacing_handler(struct sock *sk)
        {
                struct tcp_sock *tp = tcp_sk(sk);

                if (!sysctl_tcp_pacing || !tp->pacing.pacing)
                        return;

                if (sock_owned_by_user(sk)) {
                        if (!test_and_set_bit(TCP_PACING_TIMER_DEFERRED, &tcp_sk(sk)->tsq_flags))
                                sock_hold(sk);
                        goto out;
                }

                if (sk->sk_state == TCP_CLOSE)
                        goto out;

                if (!sk->sk_send_head)
                        goto out;

                tcp_push_pending_frames(sk);

        out:
                if (tcp_memory_pressure)
                        sk_mem_reclaim(sk);
        }

        /* Action function of the pacing tasklet */
        static void tcp_pacing_func(unsigned long data)
        {
                struct pacing_tasklet *pacing = (struct pacing_tasklet *)data;
                LIST_HEAD(list);
                unsigned long flags;
                struct list_head *q, *n;
                struct tcp_sock *tp;
                struct sock *sk;

                local_irq_save(flags);
                list_splice_init(&pacing->head, &list);
                local_irq_restore(flags);

                list_for_each_safe(q, n, &list) {
                        tp = list_entry(q, struct tcp_sock, pacing_node);
                        list_del(&tp->pacing_node);

                        sk = (struct sock *)tp;

                        bh_lock_sock(sk);
                        tcp_pacing_handler(sk);
                        bh_unlock_sock(sk);

                        clear_bit(PACING_QUEUED, &tp->tsq_flags);
                }
        }

        /* Initialize the pacing tasklet (modeled entirely on TSQ's tcp_tasklet_init) */
        void __init tcp_tasklet_init(void)
        {
                int i;

                for_each_possible_cpu(i) {
                        struct pacing_tasklet *pacing = &per_cpu(pacing_tasklet, i);

                        INIT_LIST_HEAD(&pacing->head);
                        tasklet_init(&pacing->tasklet,
                                     tcp_pacing_func,
                                     (unsigned long)pacing);
                }
        }
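The snippets above and below also rely on a few additions that the post omits (the pacing fields in struct tcp_sock, the new tsq_flags bits, and the sysctls). The author does not show them, so the following is only a guess at what they might look like, with the names taken from how they are used in the code:

        /* Hypothetical sketch of the omitted declarations, inferred from their usage;
         * not the author's actual code. */

        /* include/linux/tcp.h: extra state used by pacing */
        struct tcp_pacing_state {
                u8  pacing;          /* pacing enabled on this socket? */
                u64 next_to_send;    /* earliest time (ns) the next skb may leave */
        };
        /* inside struct tcp_sock:
         *      struct tcp_pacing_state pacing;
         *      struct list_head        pacing_node;   // linkage onto pacing_tasklet.head
         */

        /* include/net/tcp.h: new deferral bits alongside the existing TSQ ones */
        enum {
                /* ...existing TSQ_* / TCP_* deferral bits... */
                PACING_QUEUED,                /* socket already queued on a pacing tasklet */
                TCP_PACING_TIMER_DEFERRED,    /* handler ran while the socket was owned by user */
        };

        /* knobs referenced by the code */
        int sysctl_tcp_pacing __read_mostly;  /* global on/off switch */
        int sysctl_tcp_rate __read_mostly;    /* optional fixed rate override (bytes/s) */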

2. The hrtimer-related code:
        /* net/ipv4/tcp_timer.c */
        /* Re-arm the hrtimer */
        void tcp_pacing_reset_timer(struct sock *sk, u64 expires)
        {
                struct tcp_sock *tp = tcp_sk(sk);
                u32 timeout = nsecs_to_jiffies(expires);

                if (!sysctl_tcp_pacing || !tp->pacing.pacing)
                        return;

                hrtimer_start(&sk->timer, ns_to_ktime(expires),
                              HRTIMER_MODE_ABS_PINNED);
        }

        /* hrtimer expiry callback */
        static enum hrtimer_restart tcp_pacing_timer(struct hrtimer *timer)
        {
                struct sock *sk = container_of(timer, struct sock, timer);
                struct tcp_sock *tp = tcp_sk(sk);

                if (!test_and_set_bit(PACING_QUEUED, &tp->tsq_flags)) {
                        unsigned long flags;
                        struct pacing_tasklet *pacing;

                        /* Only schedule the tasklet here; do not run the action! */
                        local_irq_save(flags);
                        pacing = this_cpu_ptr(&pacing_tasklet);
                        list_add(&tp->pacing_node, &pacing->head);
                        tasklet_schedule(&pacing->tasklet);
                        local_irq_restore(flags);
                }

                return HRTIMER_NORESTART;
        }

        /* Initialization */
        void tcp_init_xmit_timers(struct sock *sk)
        {
                inet_csk_init_xmit_timers(sk, &tcp_write_timer, &tcp_delack_timer,
                                          &tcp_keepalive_timer);

                hrtimer_init(&sk->timer, CLOCK_MONOTONIC, HRTIMER_MODE_ABS_PINNED);
                sk->timer.function = &tcp_pacing_timer;
        }
3. The check in tcp_write_xmit:
        /* net/ipv4/tcp_output.c */
        static bool tcp_write_xmit(struct sock *sk, unsigned int mss_now, int nonagle,
                                   int push_one, gfp_t gfp)
        {
                ...
                while ((skb = tcp_send_head(sk))) {
                        unsigned int limit;
                        u64 now = ktime_get_ns();
                        ...
                        cwnd_quota = tcp_cwnd_test(tp, skb);
                        if (!cwnd_quota) {
                                if (push_one == 2)
                                        /* Force out a loss probe pkt. */
                                        cwnd_quota = 1;
                                else if (tp->pacing.pacing == 0)
                                        /* A small innovation: since the pacing rate is derived
                                         * from cwnd, checking the pacing rate makes a separate
                                         * cwnd check unnecessary. Be careful with BBR, though:
                                         * BBR's pacing rate is not derived from cwnd; on the
                                         * contrary, its cwnd is derived from the pacing rate!
                                         */
                                        break;
                        }

                        /* The advertised window has nothing to do with network congestion,
                         * so it still has to be checked. */
                        if (unlikely(!tcp_snd_wnd_test(tp, skb, mss_now)))
                                break;

                        /* The logic here is the same as in the normal-timer version. */
                        if (sysctl_tcp_pacing && tp->pacing.pacing == 1) {
                                u32 plen;
                                u64 rate, len;

                                if (now < tp->pacing.next_to_send) {
                                        tcp_pacing_reset_timer(sk, tp->pacing.next_to_send);
                                        break;
                                }

                                rate = sysctl_tcp_rate ? sysctl_tcp_rate :
                                                         sk->sk_pacing_rate;
                                plen = skb->len + MAX_HEADER;
                                len = (u64)plen * NSEC_PER_SEC;
                                if (rate)
                                        do_div(len, rate);
                                tp->pacing.next_to_send = now + len;

                                if (cwnd_quota == 0)
                                        cwnd_quota = 1;
                        }

                        if (tso_segs == 1) {
                                ...
                        }
                ...
        }
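As a sanity check on the interval arithmetic above, a made-up example of my own: roughly 1500 bytes on the wire (skb->len plus MAX_HEADER) at a pacing rate of 12,500,000 bytes/s (100 Mbit/s) gives a gap of about 120 µs per packet.

        #include <stdio.h>
        #include <stdint.h>

        #define NSEC_PER_SEC 1000000000ULL

        int main(void)
        {
                /* Made-up numbers, for illustration only. */
                uint64_t plen = 1500;                  /* skb->len + MAX_HEADER, roughly  */
                uint64_t rate = 100 * 1000 * 1000 / 8; /* 100 Mbit/s expressed in bytes/s */

                /* Same arithmetic as above: next_to_send = now + plen * NSEC_PER_SEC / rate */
                uint64_t gap_ns = plen * NSEC_PER_SEC / rate;

                printf("inter-packet gap: %llu ns (%.1f us)\n",
                       (unsigned long long)gap_ns, gap_ns / 1e3);
                return 0;
        }

A gap of 120 µs is far below the 1–4 ms resolution of the jiffies-based timer_list, which is exactly why the hrtimer version is needed.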

4. Execution in tcp_release_cb (when tcp_pacing_handler finds the socket owned by user space, it sets TCP_PACING_TIMER_DEFERRED; tcp_release_cb picks that flag up once the socket is released and pushes the pending frames):
        /* net/ipv4/tcp_output.c */
        void tcp_release_cb(struct sock *sk)
        {
                ...
                if (flags & (1UL << TCP_PACING_TIMER_DEFERRED)) {
                        if (sk->sk_send_head)
                                tcp_push_pending_frames(sk);
                        __sock_put(sk);
                }
                ...
        }

Those four parts are essentially all of the logic.
-------------------------------------
Now for the effect. I will not post the netperf results; I will only paste a comparison from downloading a 10 MB file with curl.

First, the curve of the standard CUBIC algorithm (the original figure is not reproduced here):

Rubbish, all of it!

Its throughput curve (figure not reproduced):

Now, my pacing curve (figure not reproduced):

And then the throughput graph (figure not reproduced)! I never went to college; frankly, I disdain universities anyway. Everyone in my circle went straight through to a PhD and has been abroad for years without coming home, while I... I don't even know why I stopped at a bachelor's?! Anyway, let's look at the results:


-------------------------------------

Finally, a look at my original motivation.
My goal was never really TCP pacing, or TCP as a whole; if you try to change something like that, in China you just get flamed, because people here do not understand the real game. My actual goal is the VPN traffic I carry over UDP! A lost VPN packet is far more expensive than a lost ordinary TCP packet: it not only wastes network bandwidth, the retransmissions also drag the CPU into the mire. Last week I wrote a VPN based on DTLS, and in my tests the CPU stayed pegged; it turned out packet loss was severe, and once I implemented pacing for the sends in user space the problem went away.
UDP pacing can of course be done in user space, but TCP pacing cannot, because TCP's sending is not under the user's control. That is why I came up with this scheme and built a simple demo. Some people think my body is in Cao's camp while my heart is with Han; in fact it is the other way around: both my body and my heart are in Cao's camp. I simply grieve for its misfortune and am angry that it does not fight.
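The post does not show the user-space pacing used in that DTLS VPN, so the following is only a minimal sketch of the idea under my own assumptions (plain UDP instead of DTLS, a fixed rate, and clock_nanosleep to enforce the gap); none of it is the author's code:

        /* Minimal user-space UDP pacing sketch (illustrative only). */
        #include <stdint.h>
        #include <time.h>
        #include <sys/socket.h>

        #define NSEC_PER_SEC 1000000000ULL

        /* Send one datagram no earlier than *next_to_send, then push that deadline
         * forward by len / rate, mirroring the kernel patch's next_to_send logic. */
        static void paced_send(int fd, const struct sockaddr *dst, socklen_t dlen,
                               const void *buf, size_t len,
                               uint64_t rate_bytes_per_sec, struct timespec *next_to_send)
        {
                /* Absolute sleep: returns immediately if the deadline is already past. */
                clock_nanosleep(CLOCK_MONOTONIC, TIMER_ABSTIME, next_to_send, NULL);

                sendto(fd, buf, len, 0, dst, dlen);

                /* gap = len / rate, in nanoseconds */
                uint64_t gap_ns = (uint64_t)len * NSEC_PER_SEC / rate_bytes_per_sec;
                next_to_send->tv_sec  += gap_ns / NSEC_PER_SEC;
                next_to_send->tv_nsec += gap_ns % NSEC_PER_SEC;
                if (next_to_send->tv_nsec >= (long)NSEC_PER_SEC) {
                        next_to_send->tv_nsec -= NSEC_PER_SEC;
                        next_to_send->tv_sec  += 1;
                }
        }

Initialize *next_to_send with clock_gettime(CLOCK_MONOTONIC, ...) before the first call; because the sleep uses an absolute deadline, a sender that falls behind catches up automatically instead of drifting.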

