The Linux TCP Packet Receive Flow


   The TCP receiver maintains three queues (sketched just below the list):

1 Backlog queue (sk->sk_backlog)

2 Prequeue (tp->ucopy.prequeue)

3 Receive queue (sk->sk_receive_queue)
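For orientation, here is roughly where these three queues live. This is a trimmed sketch of the 2.6-era structures in include/net/sock.h and include/linux/tcp.h, with most fields and the exact layout elided (the comments are mine):

struct sock {
    /* ... */
    struct {
        struct sk_buff *head;               /* (1) packets that arrived while */
        struct sk_buff *tail;               /*     a process held the sock    */
    } sk_backlog;
    struct sk_buff_head sk_receive_queue;   /* (3) acked, in-order data       */
    /* ... */
};

struct tcp_sock {
    /* ... */
    /* Data for direct copy to user */
    struct {
        struct sk_buff_head prequeue;   /* (2) parked for a pending reader  */
        struct task_struct  *task;      /* the pending reader, if any       */
        struct iovec        *iov;       /* its destination buffer           */
        int                 memory;     /* truesize bytes on the prequeue   */
        int                 len;        /* bytes the reader still wants     */
    } ucopy;
    /* ... */
};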

 

Now let's look at how the three queues differ.

First, the sk_backlog queue: when the sock is currently held by a process context and data arrives at that moment, the packet is appended to sk_backlog.

The prequeue is normally a packet's first stop. If the prequeue overflows, its contents are processed into the receive_queue instead.

Finally, receive_queue is the queue the process context reads buffers from first.

Why have a prequeue at all, rather than putting packets straight onto receive_queue? Because receive_queue processing is fairly involved (see the implementation of tcp_rcv_established, which splits into a slow path and a fast path), while the softirq can only handle one packet at a time (on a given CPU). So, to let the softirq finish as quickly as possible, the packet can simply be parked on the prequeue (tcp_prequeue) and the softirq returns immediately; draining the prequeue is then left to process context, inside the tcp_recvmsg call.

Before analyzing tcp_v4_rcv and tcp_recvmsg, keep in mind that tcp_v4_rcv runs in softirq context while tcp_recvmsg runs in process context; this is why socket_lock_t provides an owned field to mark the sock as held by a process. These three queues are exactly what lets softirq context and process context hand data to each other: once a packet has been copied onto the appropriate queue, the softirq returns. Note also that the same function can behave differently depending on whether it is called from softirq context or from process context, as we will see below (tcp_rcv_established, for example).
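For reference, socket_lock_t itself is tiny; a sketch of the 2.6-era definition in include/net/sock.h (comments are mine):

typedef struct {
    spinlock_t        slock;  /* taken by bh_lock_sock() in the softirq */
    int               owned;  /* set by lock_sock() in process context; */
                              /* while set, the softirq may only append */
                              /* to the backlog                         */
    wait_queue_head_t wq;     /* processes waiting for owned to clear   */
} socket_lock_t;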

A packet first enters the softirq-context entry point, tcp_v4_rcv:

int tcp_v4_rcv(struct sk_buff *skb)
{
    const struct iphdr *iph;
    struct tcphdr *th;
    struct sock *sk;
    int ret;
    struct net *net = dev_net(skb->dev);

    if (skb->pkt_type != PACKET_HOST)
        goto discard_it;

    /* Count it even if it's bad */
    TCP_INC_STATS_BH(net, TCP_MIB_INSEGS);

    if (!pskb_may_pull(skb, sizeof(struct tcphdr)))
        goto discard_it;

    /* ...... header validation and socket lookup omitted ...... */

    bh_lock_sock_nested(sk);
    ret = 0;
    if (!sock_owned_by_user(sk)) {
#ifdef CONFIG_NET_DMA
        struct tcp_sock *tp = tcp_sk(sk);
        if (!tp->ucopy.dma_chan && tp->ucopy.pinned_list)
            tp->ucopy.dma_chan = dma_find_channel(DMA_MEMCPY);
        if (tp->ucopy.dma_chan)
            ret = tcp_v4_do_rcv(sk, skb);
        else
#endif
        {
            if (!tcp_prequeue(sk, skb))
                ret = tcp_v4_do_rcv(sk, skb);
        }
    } else
        sk_add_backlog(sk, skb);
    bh_unlock_sock(sk);

    sock_put(sk);

    return ret;

    /* ...... error paths (discard_it and friends) omitted ...... */
}

 

The function proceeds as follows:

First, bh_lock_sock_nested takes the sock's spinlock.

Then it checks whether the sock is currently held by a user process (sock_owned_by_user). If not, it calls tcp_prequeue to put the packet on the prequeue, falling back to immediate processing in tcp_v4_do_rcv when tcp_prequeue declines the packet; otherwise it calls sk_add_backlog to append the packet to the backlog queue.

 

The flow of tcp_prequeue is as follows:

static inline int tcp_prequeue(struct sock *sk, struct sk_buff *skb)
{
    struct tcp_sock *tp = tcp_sk(sk);

    if (!sysctl_tcp_low_latency && tp->ucopy.task) {
        __skb_queue_tail(&tp->ucopy.prequeue, skb);
        tp->ucopy.memory += skb->truesize;
        if (tp->ucopy.memory > sk->sk_rcvbuf) {
            struct sk_buff *skb1;

            BUG_ON(sock_owned_by_user(sk));

            while ((skb1 = __skb_dequeue(&tp->ucopy.prequeue)) != NULL) {
                sk_backlog_rcv(sk, skb1);
                NET_INC_STATS_BH(sock_net(sk),
                                 LINUX_MIB_TCPPREQUEUEDROPPED);
            }

            tp->ucopy.memory = 0;
        } else if (skb_queue_len(&tp->ucopy.prequeue) == 1) {
            wake_up_interruptible(sk->sk_sleep);
            if (!inet_csk_ack_scheduled(sk))
                inet_csk_reset_xmit_timer(sk, ICSK_TIME_DACK,
                                          (3 * TCP_RTO_MIN) / 4,
                                          TCP_RTO_MAX);
        }
        return 1;
    }
    return 0;
}

 

The function's purpose is to put the packet on the prequeue, to be consumed by a later tcp_recvmsg call (through tcp_prequeue_process); when it does so it returns 1. If ucopy.task is NULL, meaning no process is pending on this sock (or tcp_low_latency is set), it returns 0 and the packet is handled in the softirq itself (by tcp_v4_do_rcv). There is one more case: if the prequeue has overflowed (ucopy.memory exceeds sk_rcvbuf), every packet on it is processed right here in softirq context (sk_backlog_rcv). Finally, if this skb just made the prequeue go from empty to non-empty, wake_up_interruptible(sk->sk_sleep) wakes the process waiting on this sock (it entered the sock's wait queue through the sk_wait_data call in tcp_recvmsg), and the delayed-ACK timer is armed so that an ACK still goes out if the reader fails to drain the prequeue in time.
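tcp_prequeue_process, which the reader uses to drain this queue, is essentially the following (a sketch of the 2.6-era helper; note that each packet still goes through sk_backlog_rcv, i.e. tcp_v4_do_rcv, only now in process context):

static void tcp_prequeue_process(struct sock *sk)
{
    struct sk_buff *skb;
    struct tcp_sock *tp = tcp_sk(sk);

    NET_INC_STATS_USER(sock_net(sk), LINUX_MIB_TCPPREQUEUED);

    /* RX process wants to run with disabled BHs, though it is not
     * necessary */
    local_bh_disable();
    while ((skb = __skb_dequeue(&tp->ucopy.prequeue)) != NULL)
        sk_backlog_rcv(sk, skb);
    local_bh_enable();

    /* Clear memory counter. */
    tp->ucopy.memory = 0;
}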

 

Whether a packet is handled in the softirq or from the system call, tcp_v4_do_rcv is what gets invoked. After the connection is established, its job is to process the packet and put the data on the receive queue.
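tcp_v4_do_rcv itself is short; here is a heavily trimmed sketch of it (checksum handling, the TCP_LISTEN path, and the error labels are omitted):

int tcp_v4_do_rcv(struct sock *sk, struct sk_buff *skb)
{
    /* ... socket filter and checksum checks omitted ... */

    if (sk->sk_state == TCP_ESTABLISHED) {  /* Fast path */
        /* In-order data lands on sk->sk_receive_queue here, or is
         * copied straight into the ucopy.iov of a waiting reader. */
        if (tcp_rcv_established(sk, skb, tcp_hdr(skb), skb->len))
            goto reset;
        return 0;
    }

    /* Every other state goes through the TCP state machine. */
    if (tcp_rcv_state_process(sk, skb, tcp_hdr(skb), skb->len))
        goto reset;
    return 0;

reset:
    tcp_v4_send_reset(sk, skb);
    kfree_skb(skb);
    return 0;
}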

 

Let's first analyze how data reaches the user process: the tcp_recvmsg function.

/*
 *  This routine copies from a sock struct into the user buffer.
 *
 *  Technical note: in 2.3 we work on _locked_ socket, so that
 *  tricks with *seq access order and skb->users are not required.
 *  Probably, code can be easily improved even more.
 */

int tcp_recvmsg(struct kiocb *iocb, struct sock *sk, struct msghdr *msg,
                size_t len, int nonblock, int flags, int *addr_len)
{
    struct tcp_sock *tp = tcp_sk(sk);
    int copied = 0;
    u32 peek_seq;
    u32 *seq;
    unsigned long used;
    int err;
    int target;     /* Read at least this many bytes */
    long timeo;
    struct task_struct *user_recv = NULL;
    int copied_early = 0;
    struct sk_buff *skb;
    u32 urg_hole = 0;

    /* Note: this "lock" only sets sk->sk_lock.owned to 1, marking the
     * sock as held by a user process; the spinlock is not left held.
     * The softirq can therefore still append packets to the backlog,
     * but it must not touch the prequeue or the receive queue. */
    lock_sock(sk);

    TCP_CHECK_TIMER(sk);

    err = -ENOTCONN;
    if (sk->sk_state == TCP_LISTEN)
        goto out;

    timeo = sock_rcvtimeo(sk, nonblock);

    /* Urgent data needs to be handled specially. */
    if (flags & MSG_OOB)
        goto recv_urg;

    /* copied_seq is the next sequence number the application will read.
     * With MSG_PEEK, seq points at a local variable instead, so the
     * next read starts again from the same position. */
    seq = &tp->copied_seq;
    if (flags & MSG_PEEK) {
        peek_seq = tp->copied_seq;
        seq = &peek_seq;
    }

    /* With MSG_WAITALL, keep reading until len bytes have been read. */
    target = sock_rcvlowat(sk, flags & MSG_WAITALL, len);

 

The next step reads packets from the receive queue:

    do {
        u32 offset;

        /* Are we at urgent data? Stop if we have read anything or
         * have SIGURG pending. */
        if (tp->urg_data && tp->urg_seq == *seq) {
            if (copied)
                break;
            if (signal_pending(current)) {
                copied = timeo ? sock_intr_errno(timeo) : -EAGAIN;
                break;
            }
        }

        /* Next get a buffer. */
        /* Look for a usable packet on the receive queue. */
        skb = skb_peek(&sk->sk_receive_queue);
        do {
            if (!skb)
                break;

            /* Now that we have two receive queues this
             * shouldn't happen.
             */
            /* A hole in the sequence space; cannot happen. */
            if (before(*seq, TCP_SKB_CB(skb)->seq)) {
                printk(KERN_INFO "recvmsg bug: copied %X "
                       "seq %X\n", *seq, TCP_SKB_CB(skb)->seq);
                break;
            }

            /* A non-zero offset means this skb overlaps data that
             * has already been read. */
            offset = *seq - TCP_SKB_CB(skb)->seq;

            /* SYN occupies one sequence number. */
            if (tcp_hdr(skb)->syn)
                offset--;

            /* This skb has readable data. */
            if (offset < skb->len)
                goto found_ok_skb;
            if (tcp_hdr(skb)->fin)
                goto found_fin_ok;
            WARN_ON(!(flags & MSG_PEEK));
            skb = skb->next;
        } while (skb != (struct sk_buff *)&sk->sk_receive_queue);

The receive queue's contents have these properties:

(1) already acked
(2) guaranteed in order
(3) contain no holes, but
(4) may contain overlapping data
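Point (4) is why the loop above computes offset before reading anything. A standalone toy program, with made-up sequence numbers, shows the arithmetic:

#include <stdio.h>

/* Hypothetical numbers: the reader has consumed everything up to
 * sequence 1000, but the skb at the head of the queue starts at
 * seq 900 and carries 300 bytes (900..1199), so its first 100 bytes
 * overlap data that was already read. */
int main(void)
{
    unsigned int copied_seq = 1000; /* *seq in tcp_recvmsg       */
    unsigned int skb_seq    = 900;  /* TCP_SKB_CB(skb)->seq      */
    unsigned int skb_len    = 300;  /* payload bytes in this skb */

    unsigned int offset = copied_seq - skb_seq; /* = 100 */

    if (offset < skb_len)   /* the skb still has fresh data */
        printf("%u fresh bytes at payload offset %u\n",
               skb_len - offset, offset);       /* 200 at 100 */
    return 0;
}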

 

When the receive queue has no usable data, or everything in it has been read, the flow continues below:

        /* Well, if we have backlog, try to process it now yet. */

        /* copied is how much has been read so far, target the minimum
         * to read.  If copied >= target and the backlog is empty,
         * receiving is finished. */
        if (copied >= target && !sk->sk_backlog.tail)
            break;

        /* Error and signal handling follows. */
        if (copied) {
            if (sk->sk_err ||
                sk->sk_state == TCP_CLOSE ||
                (sk->sk_shutdown & RCV_SHUTDOWN) ||
                !timeo ||
                signal_pending(current))
                break;
        } else {
            if (sock_flag(sk, SOCK_DONE))
                break;

            if (sk->sk_err) {
                copied = sock_error(sk);
                break;
            }

            if (sk->sk_shutdown & RCV_SHUTDOWN)
                break;

            if (sk->sk_state == TCP_CLOSE) {
                if (!sock_flag(sk, SOCK_DONE)) {
                    /* This occurs when user tries to read
                     * from never connected socket.
                     */
                    copied = -ENOTCONN;
                    break;
                }
                break;
            }

            if (!timeo) {
                copied = -EAGAIN;
                break;
            }

            if (signal_pending(current)) {
                copied = sock_intr_errno(timeo);
                break;
            }
        }

Next the code calls:

tcp_cleanup_rbuf(sk, copied);

This function's main job is to send an ACK advertising an updated window, since the user process has just consumed data from the receive buffer.
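The heuristic inside tcp_cleanup_rbuf is roughly the following (a trimmed sketch of the 2.6-era function; the delayed-ACK bookkeeping that can also force an ACK is elided):

void tcp_cleanup_rbuf(struct sock *sk, int copied)
{
    struct tcp_sock *tp = tcp_sk(sk);
    int time_to_ack = 0;

    /* ... an already-scheduled delayed ACK may set time_to_ack ... */

    /* If this read freed a lot of receive-buffer space, advertise the
     * bigger window now rather than waiting. */
    if (copied > 0 && !time_to_ack && !(sk->sk_shutdown & RCV_SHUTDOWN)) {
        __u32 rcv_window_now = tcp_receive_window(tp);

        /* Optimize, __tcp_select_window() is not cheap. */
        if (2 * rcv_window_now <= tp->window_clamp) {
            __u32 new_window = __tcp_select_window(sk);

            /* Only ACK if the window at least doubles; advertising
             * tiny increases would invite silly-window syndrome. */
            if (new_window && new_window >= 2 * rcv_window_now)
                time_to_ack = 1;
        }
    }
    if (time_to_ack)
        tcp_send_ack(sk);
}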

 

The flow reaches this point only when:

● the receive queue is empty, 
● no serious errors or state changes were noted and
● we haven't consumed sufficient data to return to the caller.

        /* On the first pass both tp->ucopy.task and user_recv are NULL,
         * so this process installs itself as the sock's current reader. */
        if (!sysctl_tcp_low_latency && tp->ucopy.task == user_recv) {
            /* Install new reader */
            if (!user_recv && !(flags & (MSG_TRUNC | MSG_PEEK))) {
                user_recv = current;
                tp->ucopy.task = user_recv;
                tp->ucopy.iov = msg->msg_iov;
            }

            tp->ucopy.len = len;

            WARN_ON(tp->copied_seq != tp->rcv_nxt &&
                    !(flags & (MSG_PEEK | MSG_TRUNC)));

            /* Ugly... If prequeue is not empty, we have to
             * process it before releasing socket, otherwise
             * order will be broken at second iteration.
             * More elegant solution is required!!!
             *
             * Look: we have the following (pseudo)queues:
             *
             * 1. packets in flight
             * 2. backlog
             * 3. prequeue
             * 4. receive_queue
             *
             * Each queue can be processed only if the next ones
             * are empty. At this point we have empty receive_queue.
             * But prequeue _can_ be not empty after 2nd iteration,
             * when we jumped to start of loop because backlog
             * processing added something to receive_queue.
             * We cannot release_sock(), because backlog contains
             * packets arrived _after_ prequeued ones.
             *
             * Shortly, algorithm is clear --- to process all
             * the queues in order. We could make it more directly,
             * requeueing packets from backlog to prequeue, if
             * is not empty. It is more elegant, but eats cycles,
             * unfortunately.
             */

            /* The prequeue is not empty, so process it first. */
            if (!skb_queue_empty(&tp->ucopy.prequeue))
                goto do_prequeue;

            /* __ Set realtime policy in scheduler __ */
        }

        if (copied >= target) {
            /* Enough has been read, but the backlog still holds packets,
             * so release_sock() is called to run each of them through
             * tcp_v4_do_rcv(). */
            /* Do not sleep, just process backlog. */
            release_sock(sk);
            lock_sock(sk);
        } else
            /* The read is not finished, and the backlog may hold nothing
             * either, so we have to wait. */
            sk_wait_data(sk, &timeo);

 

Let's look at sk_wait_data:

int sk_wait_data(struct sock *sk, long *timeo)
{
    int rc;
    DEFINE_WAIT(wait);

    /* Join the sock's wait queue; it is woken whenever a packet enters
     * the prequeue or the receive queue. */
    prepare_to_wait(sk->sk_sleep, &wait, TASK_INTERRUPTIBLE);
    set_bit(SOCK_ASYNC_WAITDATA, &sk->sk_socket->flags);
    rc = sk_wait_event(sk, timeo, !skb_queue_empty(&sk->sk_receive_queue));
    clear_bit(SOCK_ASYNC_WAITDATA, &sk->sk_socket->flags);
    finish_wait(sk->sk_sleep, &wait);
    return rc;
}

#define sk_wait_event(__sk, __timeo, __condition)                       \
({  int __rc;                                                           \
    /* release_sock() may process packets sitting in the backlog; if    \
     * that produced data, __rc is 1 and there is no need to sleep.     \
     * Otherwise owned is now 0, so the softirq can deliver packets     \
     * to the prequeue and thereby wake this process. */                \
    release_sock(__sk);                                                 \
    __rc = __condition;                                                 \
    if (!__rc) {                                                        \
        *(__timeo) = schedule_timeout(*(__timeo));                      \
    }                                                                   \
    lock_sock(__sk);                                                    \
    __rc = __condition;                                                 \
    __rc;                                                               \
})

 

        if (user_recv) {
            int chunk;

            /* tp->ucopy.len starts out as len; tcp_rcv_established
             * decreases it by however much it copied directly into
             * the user buffer. */

            /* __ Restore normal policy in scheduler __ */

            if ((chunk = len - tp->ucopy.len) != 0) {
                NET_ADD_STATS_USER(sock_net(sk),
                                   LINUX_MIB_TCPDIRECTCOPYFROMBACKLOG, chunk);
                len -= chunk;
                copied += chunk;
            }

            /* Only drain the prequeue if the data is in order. */
            if (tp->rcv_nxt == tp->copied_seq &&
                !skb_queue_empty(&tp->ucopy.prequeue)) {
do_prequeue:
                /* Run tcp_v4_do_rcv on every skb in the prequeue. */
                tcp_prequeue_process(sk);

                if ((chunk = len - tp->ucopy.len) != 0) {
                    NET_ADD_STATS_USER(sock_net(sk),
                                       LINUX_MIB_TCPDIRECTCOPYFROMPREQUEUE, chunk);
                    len -= chunk;
                    copied += chunk;
                }
            }
        }
        if ((flags & MSG_PEEK) &&
            (peek_seq - copied - urg_hole != tp->copied_seq)) {
            if (net_ratelimit())
                printk(KERN_DEBUG "TCP(%s:%d): Application bug, race in MSG_PEEK.\n",
                       current->comm, task_pid_nr(current));
            peek_seq = tp->copied_seq;
        }

        /* Start the next loop iteration; what follows next is the
         * receive-queue read path, which is not entered here. */
        continue;

 

Next comes reading the data out of the receive queue's skbs:

found_ok_skb:
        /* Ok so how much can we use? */
        used = skb->len - offset;
        if (len < used)
            used = len;

        /* Do we have urgent data here? */
        if (tp->urg_data) {
            u32 urg_offset = tp->urg_seq - *seq;
            if (urg_offset < used) {
                if (!urg_offset) {
                    if (!sock_flag(sk, SOCK_URGINLINE)) {
                        ++*seq;
                        urg_hole++;
                        offset++;
                        used--;
                        if (!used)
                            goto skip_copy;
                    }
                } else
                    used = urg_offset;
            }
        }

        if (!(flags & MSG_TRUNC)) {
#ifdef CONFIG_NET_DMA
            /* (the NET_DMA copy path was elided in the original post) */
#endif
            {
                err = skb_copy_datagram_iovec(skb, offset,
                                              msg->msg_iov, used);
                if (err) {
                    /* Exception. Bailout! */
                    if (!copied)
                        copied = -EFAULT;
                    break;
                }
            }
        }

        *seq += used;
        copied += used;
        len -= used;

        /* Adjust the TCP receive buffer space. */
        tcp_rcv_space_adjust(sk);

skip_copy:
        if (tp->urg_data && after(tp->copied_seq, tp->urg_seq)) {
            tp->urg_data = 0;
            tcp_fast_path_check(sk);
        }
        if (used + offset < skb->len)
            continue;

        if (tcp_hdr(skb)->fin)
            goto found_fin_ok;
        if (!(flags & MSG_PEEK)) {
            sk_eat_skb(sk, skb, copied_early);
            copied_early = 0;
        }
        continue;

found_fin_ok:
        /* Process the FIN. */
        ++*seq;
        if (!(flags & MSG_PEEK)) {
            sk_eat_skb(sk, skb, copied_early);
            copied_early = 0;
        }
        break;
    } while (len > 0);

 

Finally, after the loop exits, the prequeue is processed one more time (it may still hold data that can be copied to this process):

    if (user_recv) {
        if (!skb_queue_empty(&tp->ucopy.prequeue)) {
            int chunk;

            tp->ucopy.len = copied > 0 ? len : 0;

            tcp_prequeue_process(sk);

            if (copied > 0 && (chunk = len - tp->ucopy.len) != 0) {
                NET_ADD_STATS_USER(sock_net(sk),
                                   LINUX_MIB_TCPDIRECTCOPYFROMPREQUEUE, chunk);
                len -= chunk;
                copied += chunk;
            }
        }

        /* When done, task is reset to NULL: no process holds this sock
         * as its reader any more. */
        tp->ucopy.task = NULL;
        tp->ucopy.len = 0;
    }

 

The packet-processing functions themselves will be analyzed later.
