Kernel version: 2.6.34
UDP Packet Reception
Receiving a UDP message happens in two parts: the protocol stack receives the message and inserts it into a queue, and the user calls recvfrom() or recv() to take the message out of that queue. The queue is sk->sk_receive_queue, which acts as the hand-off point between the two parts; their relationship is shown in the following figure.
Part One: How the protocol stack receives UDP messages
The UDP module is registered in inet_init(). When a UDP message arrives, the handler udp_rcv() in udp_protocol is called.
if (inet_add_protocol(&udp_protocol, IPPROTO_UDP) < 0)
    printk(KERN_CRIT "inet_init: Cannot add UDP protocol\n");
udp_rcv() -> __udp4_lib_rcv() handles reception of the UDP message. It first initializes the UDP checksum state; at this point it does not verify that the checksum is correct.
if (udp4_csum_init(skb, uh, proto))
    goto csum_error;
Next, the sock is looked up in udptable by [saddr, sport, daddr, dport]; the lookup itself was covered in detail in the previous article ("sk lookup"). The message's source port corresponds to the sending host's port, and its destination port corresponds to the local port.
sk = __udp4_lib_lookup_skb(skb, uh->source, uh->dest, udptable);
If a matching sk exists in udptable, i.e. a socket is receiving on that address, the skb is queued through udp_queue_rcv_skb() (analyzed later; in short, it places the message on sk->sk_receive_queue). sock_put() then drops the reference count on sk and the function returns. The rest of the receive work depends on the user's actions.
if (sk != NULL) {
    int ret = udp_queue_rcv_skb(sk, skb);
    sock_put(sk);
    if (ret > 0)
        return -ret;
    return 0;
}
If no sk is found in udptable, no socket on this machine will receive the message, so an ICMP unreachable message is sent. Before that, the checksum is verified with udp_lib_checksum_complete(): if the checksum is wrong, the message is simply discarded; if it is correct, the MIB statistics are incremented, an ICMP port unreachable message is sent, and the message is discarded.
if (udp_lib_checksum_complete(skb))
    goto csum_error;
UDP_INC_STATS_BH(net, UDP_MIB_NOPORTS, proto == IPPROTO_UDPLITE);
icmp_send(skb, ICMP_DEST_UNREACH, ICMP_PORT_UNREACH, 0);
kfree_skb(skb);
udp_queue_rcv_skb() queues the message
sock_owned_by_user() tests the value of sk->sk_lock.owned. If it is 1, sk is currently owned and the skb cannot be added to the receive queue, so the else if branch runs and sk_add_backlog() adds the skb to the sk->sk_backlog queue. If it is 0, sk is not owned, so the if branch runs and __udp_queue_rcv_skb() adds the skb to the sk->sk_receive_queue queue.
bh_lock_sock(sk);
if (!sock_owned_by_user(sk))
    rc = __udp_queue_rcv_skb(sk, skb);
else if (sk_add_backlog(sk, skb)) {
    bh_unlock_sock(sk);
    goto drop;
}
bh_unlock_sock(sk);
So when is sk owned, and when are the skbs on sk->sk_backlog processed?
When a socket is created, sys_socket() -> inet_create() -> sk_alloc() -> sock_lock_init() initializes sk->sk_lock.owned = 0.
When the socket is destroyed, for example, udp_destroy_sock() calls lock_sock() to lock sk and, once the operation is done, calls release_sock() to unlock it.
void udp_destroy_sock(struct sock *sk)
{
    lock_sock(sk);
    udp_flush_pending_frames(sk);
    release_sock(sk);
}
lock_sock() sets sk->sk_lock.owned = 1, and release_sock() sets sk->sk_lock.owned = 0 and processes the messages on the sk_backlog queue: release_sock() -> __release_sock() calls sk_backlog_rcv() -> sk->sk_backlog_rcv() for each message on sk_backlog. When the socket was created, sk->sk_backlog_rcv was set to sk->sk_prot->backlog_rcv, which for UDP is __udp_queue_rcv_skb(), the function mentioned above that adds an skb to sk_receive_queue. In this way all the messages on sk_backlog are moved to sk_receive_queue. In short, the sk_backlog queue temporarily holds messages that arrive while sk is locked; when sk is unlocked, they move to sk_receive_queue.
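The interplay between the owner flag and the two queues can be sketched in user space. This is only a simplified model under stated assumptions: every name in it (model_sock, model_queue_rcv, model_lock, model_release) is hypothetical, packets are plain integers, and the real kernel code also takes a spinlock and enforces receive buffer limits.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical user-space model; none of these names are kernel APIs. */
#define QCAP 8

struct pkt_queue {
    int items[QCAP];
    size_t count;
};

struct model_sock {
    int owned;                      /* models sk->sk_lock.owned */
    struct pkt_queue receive_queue; /* models sk->sk_receive_queue */
    struct pkt_queue backlog;       /* models sk->sk_backlog */
};

static int queue_push(struct pkt_queue *q, int pkt)
{
    if (q->count == QCAP)
        return -1;                  /* queue full: caller drops the packet */
    q->items[q->count++] = pkt;
    return 0;
}

/* Models udp_queue_rcv_skb(): pick the queue based on the owner flag. */
static int model_queue_rcv(struct model_sock *sk, int pkt)
{
    assert(sk != NULL);
    if (!sk->owned)
        return queue_push(&sk->receive_queue, pkt);
    return queue_push(&sk->backlog, pkt);
}

/* Models lock_sock(): mark the sock as owned. */
static void model_lock(struct model_sock *sk)
{
    sk->owned = 1;
}

/* Models release_sock(): drain the backlog in order, then clear the flag. */
static void model_release(struct model_sock *sk)
{
    for (size_t i = 0; i < sk->backlog.count; i++)
        queue_push(&sk->receive_queue, sk->backlog.items[i]);
    sk->backlog.count = 0;
    sk->owned = 0;
}
```

Packets queued between model_lock() and model_release() land on the backlog and reach the receive queue, in arrival order, only at release time.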
Part Two: How the user receives messages
The user can call sys_recvfrom() or sys_recv() to receive a message. The difference is that sys_recvfrom() can return the message's source address through a parameter while sys_recv() cannot; this has no effect on receiving the message itself. Before the user calls recvfrom() or recv(), messages destined for the socket have already been added to sk->sk_receive_queue. What recvfrom() and recv() do is take a message off sk_receive_queue and copy it to user space.
sys_recv() -> sys_recvfrom()
sys_recvfrom() -> sock->ops->recvmsg()
==> sock_common_recvmsg() -> sk->sk_prot->recvmsg()
==> udp_recvmsg()
sys_recvfrom()
It calls sock_recvmsg() to receive the UDP message, which is stored in msg. If a message was received, its source address is copied from the kernel to addr in user space; addr is the parameter passed to recvfrom() that receives the source address of the message. The message content itself is copied from the kernel to user space in udp_recvmsg().
err = sock_recvmsg(sock, &msg, size, flags);
if (err >= 0 && addr != NULL) {
    err2 = move_addr_to_user((struct sockaddr *)&address,
                             msg.msg_namelen, addr, addr_len);
    if (err2 < 0)
        err = err2;
}
udp_recvmsg() receives a UDP message
This function has three key operations:
1. Fetch the packet: __skb_recv_datagram()
2. Copy the data: skb_copy_datagram_iovec() or skb_copy_and_csum_datagram_iovec()
3. Compute the checksum if necessary: skb_copy_and_csum_datagram_iovec()
__skb_recv_datagram() takes one skb from sk->sk_receive_queue. As analyzed above, the kernel places messages destined for the socket on sk->sk_receive_queue.
skb = __skb_recv_datagram(sk, flags | (noblock ? MSG_DONTWAIT : 0), &peeked, &err);
If no message is returned, there are two possible cases: the receive is non-blocking and no message has arrived yet, or the receive is blocking, no message was queued, and none arrived within sk->sk_rcvtimeo. With no message, an error value is returned.
if (!skb)
    goto out;
len is the size of the buffer passed to recvfrom(), and ulen is the length of the message content. If len > ulen, only ulen bytes of the buffer are needed. If len < ulen, the buffer is too small to hold the message, so the message is truncated and only its first len bytes are taken.
ulen = skb->len - sizeof(struct udphdr);
if (len > ulen)
    len = ulen;
else if (len < ulen)
    msg->msg_flags |= MSG_TRUNC;
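The length clamp can be restated as a small pure function. This is a user-space sketch, not kernel code: plan_copy and struct copy_plan are hypothetical names, and the truncated field only mirrors where the kernel would set MSG_TRUNC.

```c
#include <assert.h>
#include <stddef.h>

struct copy_plan {
    size_t copy_len;  /* bytes that will be copied to the user buffer */
    int truncated;    /* nonzero where the kernel would set MSG_TRUNC */
};

/* buf_len: user buffer size; skb_len: datagram length including the header. */
static struct copy_plan plan_copy(size_t buf_len, size_t skb_len, size_t hdr_len)
{
    struct copy_plan plan;
    size_t ulen;

    assert(skb_len >= hdr_len);
    ulen = skb_len - hdr_len;                    /* payload length */
    plan.copy_len = buf_len < ulen ? buf_len : ulen;
    plan.truncated = buf_len < ulen;
    return plan;
}
```

For a 58-byte datagram with an 8-byte header, a 100-byte buffer receives 50 bytes without truncation, while a 20-byte buffer receives 20 bytes and the truncation flag is set.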
If the message is truncated or uses UDP-Lite partial coverage, the checksum must be verified up front. udp_lib_checksum_complete() performs the checksum calculation, and the function is analyzed in detail below.
if (len < ulen || UDP_SKB_CB(skb)->partial_cov) {
    if (udp_lib_checksum_complete(skb))
        goto csum_copy_err;
}
If the message does not need checksum verification, the if branch runs and skb_copy_datagram_iovec() copies the message directly into the buffer. If it does, the else branch runs and skb_copy_and_csum_datagram_iovec() copies the message into the buffer and computes the checksum during the copy. This explains why the kernel does not verify the checksum before queuing a received UDP message: UDP messages can be large and computing the checksum can be expensive, so folding it into the copy saves work. The price is that some messages with bad checksums are added to the socket's receive queue and are not discarded until the user actually receives them.
if (skb_csum_unnecessary(skb))
    err = skb_copy_datagram_iovec(skb, sizeof(struct udphdr), msg->msg_iov, len);
else {
    err = skb_copy_and_csum_datagram_iovec(skb, sizeof(struct udphdr), msg->msg_iov);
    if (err == -EINVAL)
        goto csum_copy_err;
}
The source address is then copied into msg->msg_name. sys_recvfrom() set msg->msg_name = &address, and it later copies address from the kernel to addr in user space.
if (sin) {
    sin->sin_family = AF_INET;
    sin->sin_port = udp_hdr(skb)->source;
    sin->sin_addr.s_addr = ip_hdr(skb)->saddr;
    memset(sin->sin_zero, 0, sizeof(sin->sin_zero));
}
The three functions that carry out the core operations are examined next:
__skb_recv_datagram() takes an skb from sk_receive_queue
The core code is shown below. skb_peek() looks at the first skb on sk->sk_receive_queue. If there is one, it is returned as the message for this receive call and the caller does the subsequent processing; this function only fetches the skb. If there is none, wait_for_packet() waits for a message to arrive. Its timeo parameter is the wait time: for a non-blocking receive, timeo is set to 0 (return immediately if no skb is available, without waiting); otherwise it is set to sk->sk_rcvtimeo.
do {
    ...
    skb = skb_peek(&sk->sk_receive_queue);
    if (skb) {
        *peeked = skb->peeked;
        if (flags & MSG_PEEK) {
            skb->peeked = 1;
            atomic_inc(&skb->users);
        } else
            __skb_unlink(skb, &sk->sk_receive_queue);
    }
    if (skb)
        return skb;
    ...
} while (!wait_for_packet(sk, err, &timeo));
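The difference between a peeking receive and a normal receive can be modelled with a singly linked queue. This is a hedged user-space sketch: model_skb, model_queue, and model_recv are hypothetical names, the wait loop is left out (an empty queue simply returns NULL), and the plain integer counter stands in for the atomic skb->users.

```c
#include <assert.h>
#include <stddef.h>

struct model_skb {
    int id;
    int peeked;             /* models skb->peeked */
    int users;              /* reference count, models skb->users */
    struct model_skb *next;
};

struct model_queue {
    struct model_skb *head; /* models sk->sk_receive_queue */
};

/* Models the non-blocking core of __skb_recv_datagram(). */
static struct model_skb *model_recv(struct model_queue *q, int peek, int *peeked)
{
    struct model_skb *skb;

    assert(q != NULL && peeked != NULL);
    skb = q->head;
    if (skb == NULL)
        return NULL;        /* the kernel would wait here unless timeo == 0 */

    *peeked = skb->peeked;
    if (peek) {
        skb->peeked = 1;    /* stays queued; the caller gets an extra reference */
        skb->users++;
    } else {
        q->head = skb->next; /* unlink from the head of the queue */
        skb->next = NULL;
    }
    return skb;
}
```

A peek leaves the skb at the head of the queue, so the next non-peeking call returns the same skb and reports that it had been peeked before.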
skb_copy_datagram_iovec() copies skb content into msg
The copy can be divided into three parts: the linear address space, the scatter/gather address space, and the non-linear address space. The second part requires hardware support; the other two are discussed here.
The skb's data buffer is the linear address space, and the skb's frag_list is the non-linear address space. Without fragmentation the linear space is sufficient. When a message is too long and has been fragmented, the first fragment uses the linear space and the remaining fragments are chained on the skb's frag_list, i.e. the non-linear space; see the fragmentation section of the "IPv4 module" article.
Copying a message means copying both the linear and the non-linear data. The following code copies the linear part. start is the length of the linear part of the message (skb->len - skb->data_len), copy is the number of linear bytes available from offset onward, and offset is the position within the skb at which copying begins. For a UDP message these values are as shown in the following figure. memcpy_toiovec() copies kernel data into to; note that it modifies the members of to.
int start = skb_headlen(skb);
int i, copy = start - offset;
if (copy > 0) {
    if (copy > len)
        copy = len;
    if (memcpy_toiovec(to, skb->data + offset, copy))
        goto fault;
    if ((len -= copy) == 0)
        return 0;
    offset += copy;
}
The following code copies the non-linear part. It walks the skb's frag_list and copies the content of each fragment into to. The absolute values of start and end do not matter; what matters is their difference end - start, which is the length of the current fragment frag_iter. The current fragment is copied with a recursive call to skb_copy_datagram_iovec(), i.e. each fragment is treated as a separate message. In practice the recursion seems to go only one level deep: after IP-layer reassembly, fragments do not carry further fragments on their own frag_list; they are all chained on the frag_list of the head skb.
skb_walk_frags(skb, frag_iter) {
    int end;
    end = start + frag_iter->len;
    if ((copy = end - offset) > 0) {
        if (copy > len)
            copy = len;
        if (skb_copy_datagram_iovec(frag_iter,
                                    offset - start, to, copy))
            goto fault;
        if ((len -= copy) == 0)
            return 0;
        offset += copy;
    }
    start = end;
}
As an example, suppose the host receives a UDP message with 4000 bytes of content, the MTU is 1500, and the receive buffer passed in is also 4000 bytes. Because of the MTU, the message arrives as three fragments whose IP payload sizes are 1480, 1480, and 1048 bytes (8 bytes of UDP header plus 4000 bytes of content). Each fragment has a 20-byte IP header, and the first fragment also carries the 8-byte UDP header. The data is copied on receive as follows:
Fragment one is the first fragment and includes the UDP header, which is skipped during the copy because a UDP socket only needs the message content. The three figures represent three calls to skb_copy_datagram_iovec(); iov is the buffer that stores the content. The end result is that the 4000 bytes carried by the three fragments are copied into iov.
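The fragment sizes in this example can be checked with a few lines of arithmetic. The sketch below assumes that the 4000 bytes are UDP payload (so the IP payload is 4008 bytes) and that IPv4 headers carry no options; fragment_sizes is a hypothetical helper, not a kernel function.

```c
#include <assert.h>
#include <stddef.h>

#define IP_HDR_LEN  20  /* IPv4 header without options */
#define UDP_HDR_LEN 8

/*
 * Fill sizes[] with the IP payload size of each fragment and return the
 * fragment count (0 if sizes[] is too small).
 */
static size_t fragment_sizes(size_t udp_payload, size_t mtu,
                             size_t *sizes, size_t max_frags)
{
    size_t remaining, per_frag, n = 0;

    assert(sizes != NULL && mtu > IP_HDR_LEN + 8);
    remaining = udp_payload + UDP_HDR_LEN;
    /* every fragment except the last must carry a multiple of 8 bytes */
    per_frag = (mtu - IP_HDR_LEN) & ~(size_t)7;
    if (remaining <= mtu - IP_HDR_LEN)
        per_frag = remaining;      /* fits in one packet, no fragmentation */

    while (remaining > 0) {
        size_t chunk = remaining < per_frag ? remaining : per_frag;

        if (n == max_frags)
            return 0;
        sizes[n++] = chunk;
        remaining -= chunk;
    }
    return n;
}
```

With a 4000-byte payload and an MTU of 1500 this yields three fragments; after skipping the 8-byte UDP header in the first one, the copied data is 1472 + 1480 + 1048 = 4000 bytes.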
memcpy_toiovec() deserves attention, not only because it changes the members of the iovec but also because of the iov++ at the end of its loop. In recvfrom() on a UDP socket, msg.msg_iov = &iov, where iov is declared as struct iovec iov, so the iov passed in holds exactly one element and iov++ moves the pointer to an invalid address past it. This only concerns the UDP case considered here. The statement just before the call to memcpy_toiovec() is the following, where len is the length of the receive buffer:
if (copy > len)
    copy = len;
memcpy_toiovec() itself also has int copy = min_t(unsigned int, iov->iov_len, len), where len is the copy value passed in above and iov_len is the receive buffer length. Together these two guarantee that copy inside the function equals len, so after a single copy, len -= copy leaves len == 0. Although iov++ then points to invalid memory, the loop condition while (len > 0) has already failed, so iov is never used again. Second, the iov++ inside the function does not affect the caller's iov argument: after the function returns, iov still has the value that was passed in. Finally, iov_len and iov_base are modified once the copy completes: iov_len is the remaining available length and iov_base is the position where the next copy starts.
int memcpy_toiovec(struct iovec *iov, unsigned char *kdata, int len)
{
    while (len > 0) {
        if (iov->iov_len) {
            int copy = min_t(unsigned int, iov->iov_len, len);
            if (copy_to_user(iov->iov_base, kdata, copy))
                return -EFAULT;
            kdata += copy;
            len -= copy;
            iov->iov_len -= copy;
            iov->iov_base += copy;
        }
        iov++;
    }
    return 0;
}
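The way the iovec members advance across successive calls can be tried out in user space. The following is a sketch under stated assumptions: copy_to_iovec is a hypothetical analogue of memcpy_toiovec(), memcpy() stands in for copy_to_user(), and, as in the kernel, the caller must guarantee that len does not exceed the space left in the iovec array.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>
#include <sys/uio.h>

/* User-space analogue of memcpy_toiovec(); not the kernel function itself. */
static int copy_to_iovec(struct iovec *iov, const unsigned char *src, size_t len)
{
    assert(iov != NULL && src != NULL);
    while (len > 0) {
        if (iov->iov_len > 0) {
            size_t copy = iov->iov_len < len ? iov->iov_len : len;

            memcpy(iov->iov_base, src, copy);
            src += copy;
            len -= copy;
            iov->iov_len -= copy;                         /* space left */
            iov->iov_base = (char *)iov->iov_base + copy; /* next write position */
        }
        iov++;  /* advances only the local pointer, not the caller's */
    }
    return 0;
}
```

Because iov_base and iov_len are updated in place, a second call continues writing where the first one stopped, which is what lets the three fragment copies in the example above fill one buffer back to back.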
skb_copy_and_csum_datagram_iovec() copies skb content into msg and computes the checksum
This function makes the checksum calculation more efficient because it merges the copy and the computation, so the data is traversed only once. Compared with skb_copy_datagram_iovec(), each time it copies part of the skb content it also computes the checksum of the part just copied.
csum = csum_partial(skb->data, hlen, skb->csum);
if (skb_copy_and_csum_datagram(skb, hlen, iov->iov_base, chunk, &csum))
    goto fault;
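The idea of summing while copying can be illustrated with a plain C loop. This is a portable sketch, not the kernel's optimized csum_partial_copy routines: copy_and_csum and csum_fold16 are hypothetical names, the sum is taken over big-endian 16-bit words, and every call except the last is assumed to use an even length.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Fold a 32-bit running sum into 16 bits using end-around carry. */
static uint16_t csum_fold16(uint32_t sum)
{
    while (sum >> 16)
        sum = (sum & 0xffff) + (sum >> 16);
    return (uint16_t)sum;
}

/*
 * Copy len bytes from src to dst in a single pass and return the updated
 * running sum of big-endian 16-bit words.
 */
static uint32_t copy_and_csum(uint8_t *dst, const uint8_t *src, size_t len,
                              uint32_t sum)
{
    size_t i;

    assert(dst != NULL && src != NULL);
    for (i = 0; i + 1 < len; i += 2) {
        dst[i] = src[i];
        dst[i + 1] = src[i + 1];
        sum += ((uint32_t)src[i] << 8) | src[i + 1];
    }
    if (i < len) {                  /* odd trailing byte, zero-padded */
        dst[i] = src[i];
        sum += (uint32_t)src[i] << 8;
    }
    return sum;
}
```

Passing the running sum from one call to the next mirrors how the kernel threads csum through successive chunks; the final checksum is the one's complement of the folded sum.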
UDP Message Sending
There are two ways to send: sys_send() and sys_sendto(). The difference is that sys_sendto() takes the destination address as an argument, while sys_send() relies on an earlier sys_connect() call to bind the destination address. The subsequent call paths of the two are the same. With sys_sendto(), the address is copied from user space to kernel space in sys_sendto(), and the message content is copied from user space to kernel space in udp_sendmsg().
sys_send() -> sys_sendto()
sys_sendto() -> sock_sendmsg() -> __sock_sendmsg() -> sock->ops->sendmsg()
==> inet_sendmsg() -> sk->sk_prot->sendmsg()
==> udp_sendmsg()
The core flow of udp_sendmsg() is shown in the following figure, which lists only the core function calls and parameter assignments. The approximate steps are: get the information -> get the route entry rt -> append the data -> send the data.
The pending field of struct udp_sock indicates whether data is already waiting to be sent on this udp_sock. If so, the code jumps straight to do_append_data to keep appending data; otherwise initialization must be done before data is appended. pending != 0 means the udp_sock already held data before this call. pending is initially 0 when sendto() is called to send data; once data is appended, up->pending = AF_INET is set, and it stays that way until udp_push_pending_frames() finally sends the data to the IP layer, or skb_queue_empty(&sk->sk_write_queue) finds the send queue empty, at which point up->pending = 0 is set. The changes in pending while a message is sent are therefore:
Usually one sendto() call corresponds to one message, i.e. pending = 0 -> AF_INET -> 0. If sendto() is called with the MSG_MORE flag, however, pending = 0 -> AF_INET and stays there until sendto() is called without MSG_MORE, which indicates that this call carries the last part of the data; then pending = AF_INET -> 0.
if (up->pending) {
    lock_sock(sk);
    if (likely(up->pending)) {
        if (unlikely(up->pending != AF_INET)) {
            release_sock(sk);
            return -EINVAL;
        }
        goto do_append_data;
    }
    release_sock(sk);
}
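The life cycle of pending across several sendto() calls can be modelled as a tiny state machine. This is a simplified sketch: model_udp_sock and model_sendmsg are hypothetical names, only MSG_MORE is modelled as the reason to keep data pending, and error paths are ignored.

```c
#include <assert.h>
#include <stddef.h>

#define MODEL_AF_INET 2   /* stands in for AF_INET */

struct model_udp_sock {
    int pending;   /* models up->pending */
    int len;       /* models up->len; includes the UDP header once */
    int sent;      /* number of datagrams pushed to the IP layer */
};

/* Models one sendto() call; `more` is nonzero when MSG_MORE is set. */
static void model_sendmsg(struct model_udp_sock *up, int payload_len, int more)
{
    assert(up != NULL && payload_len >= 0);
    if (!up->pending) {
        up->len = 8;                  /* first part: count the UDP header */
        up->pending = MODEL_AF_INET;
    }
    up->len += payload_len;           /* append this part */
    if (!more) {                      /* last part: push and reset */
        up->sent++;
        up->len = 0;
        up->pending = 0;
    }
}
```

A call without MSG_MORE goes 0 -> AF_INET -> 0 within one invocation, while a sequence of MSG_MORE calls keeps pending at AF_INET and accumulates len until the final call without the flag.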
If pending == 0, no data is waiting to be sent, so initialization is performed: message length, address information, and route entry.
ulen starts as the data length passed to sendto(). Because this is the first part of the data (or the whole message, if no more data follows), the 8 bytes of the UDP header are added to ulen.
ulen += sizeof(struct udphdr);
The following code obtains the destination address and port for the data being sent. There are two cases. In the first, sendto() is called and the destination is passed in as a parameter and stored in msg->msg_name, so daddr and dport are taken from there. In the second, connect() was called before send(): the destination bound at connect() time is stored in inet, and because connect() was called, sk->sk_state is TCP_ESTABLISHED. A later send() needs no destination parameter, so daddr and dport are taken from inet. connected records whether the socket is bound to a destination.
if (msg->msg_name) {
    struct sockaddr_in *usin = (struct sockaddr_in *)msg->msg_name;
    if (msg->msg_namelen < sizeof(*usin))
        return -EINVAL;
    if (usin->sin_family != AF_INET) {
        if (usin->sin_family != AF_UNSPEC)
            return -EAFNOSUPPORT;
    }
    daddr = usin->sin_addr.s_addr;
    dport = usin->sin_port;
    if (dport == 0)
        return -EINVAL;
} else {
    if (sk->sk_state != TCP_ESTABLISHED)
        return -EDESTADDRREQ;
    daddr = inet->inet_daddr;
    dport = inet->inet_dport;
    connected = 1;
}
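The same decision can be exercised outside the kernel with standard socket types. The sketch below is an assumption-laden model rather than the kernel routine: struct model_inet and pick_destination are hypothetical, and the established flag stands in for the sk->sk_state == TCP_ESTABLISHED test.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <netinet/in.h>
#include <sys/socket.h>

struct model_inet {
    int established;   /* nonzero after connect(), models TCP_ESTABLISHED */
    in_addr_t daddr;   /* models inet->inet_daddr */
    in_port_t dport;   /* models inet->inet_dport */
};

/* Returns 0 and fills daddr/dport/connected, or a negative errno value. */
static int pick_destination(const struct sockaddr_in *name, socklen_t namelen,
                            const struct model_inet *inet,
                            in_addr_t *daddr, in_port_t *dport, int *connected)
{
    assert(inet != NULL && daddr != NULL && dport != NULL && connected != NULL);
    *connected = 0;
    if (name != NULL) {                     /* sendto(): explicit destination */
        if (namelen < sizeof(*name))
            return -EINVAL;
        if (name->sin_family != AF_INET && name->sin_family != AF_UNSPEC)
            return -EAFNOSUPPORT;
        if (name->sin_port == 0)
            return -EINVAL;
        *daddr = name->sin_addr.s_addr;
        *dport = name->sin_port;
        return 0;
    }
    if (!inet->established)                 /* send() without connect() */
        return -EDESTADDRREQ;
    *daddr = inet->daddr;
    *dport = inet->dport;
    *connected = 1;
    return 0;
}
```

An explicit address always wins and leaves connected at 0; only the fallback to the destination stored at connect() time sets connected to 1.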
The next step is to get the route entry rt. If the socket is connected (connect() was called), the route was obtained at connect() time and can be used directly. If the socket is not connected, or the cached route has been invalidated, the route must be looked up again in the routing table with ip_route_output_flow(). For a connected socket whose previous rt had expired, the socket is then updated with the newly found rt.
if (rt == NULL) {
    ...
    err = ip_route_output_flow(net, &rt, &fl, sk, 1);
    ...
    if (connected)
        sk_dst_set(sk, dst_clone(&rt->u.dst));
}
daddr, dport, saddr, and sport are stored in cork.fl; they are used when the UDP header is generated and the UDP checksum is computed. up->pending = AF_INET marks the start of appending data, and the data is appended next.
inet->cork.fl.fl4_dst = daddr;
inet->cork.fl.fl_ip_dport = dport;
inet->cork.fl.fl4_src = saddr;
inet->cork.fl.fl_ip_sport = inet->inet_sport;
up->pending = AF_INET;
If pending != 0, or once the initialization above is done, the data is appended:
up->len is the total length of the data to be sent, including the UDP header, so the length of each part of the data is added to it, and it is reset to 0 after the message is sent. ip_append_data() is then called to add the data to sk->sk_write_queue; it handles fragmentation of the data and was analyzed in detail in the ICMP module article.
up->len += ulen;
getfrag = is_udplite ? udplite_getfrag : ip_generic_getfrag;
err = ip_append_data(sk, getfrag, msg->msg_iov, ulen,
                     sizeof(struct udphdr), &ipc, &rt,
                     corkreq ? msg->msg_flags | MSG_MORE : msg->msg_flags);
ip_append_data() returns 0 when the data is appended successfully; otherwise udp_flush_pending_frames() discards the data that was being appended. If the data was appended successfully and no more data follows (as indicated by MSG_MORE), udp_push_pending_frames() sends the data to the IP layer; that function is analyzed in detail below. The last case requires that more data is expected and yet sk_write_queue is empty. Since sk_write_queue should not be empty after ip_append_data(), this normally does not happen, and it is unclear under what circumstances it would. This is also where pending is reset to 0: all three branches set pending to 0.
if (err)
    udp_flush_pending_frames(sk);
else if (!corkreq)
    err = udp_push_pending_frames(sk);
else if (unlikely(skb_queue_empty(&sk->sk_write_queue)))
    up->pending = 0;
Once the data has been handled, the route entry rt that was taken is released, and any IP options are freed. If the data was sent successfully, the sent length len is returned; otherwise the error value err is handled and returned.
ip_rt_put(rt);
if (free)
    kfree(ipc.opt);
if (!err)
    return len;
if (err == -ENOBUFS || test_bit(SOCK_NOSPACE, &sk->sk_socket->flags)) {
    UDP_INC_STATS_USER(sock_net(sk), UDP_MIB_SNDBUFERRORS, is_udplite);
}
return err;
The ICMP module uses ip_push_pending_frames() to send data to the IP layer. The UDP module uses udp_push_pending_frames(), which is only a wrapper around ip_push_pending_frames() that mainly adds the handling of the UDP header. udp_flush_pending_frames() is likewise a wrapper, only simpler: it just resets up->len and up->pending so that a new message can be started. So what does udp_push_pending_frames() add in its wrapper?
udp_push_pending_frames() sends data to the IP layer
It sets up the UDP header: source port source, destination port dest, and message length len.
uh = udp_hdr(skb);
uh->source = fl->fl_ip_sport;
uh->dest = fl->fl_ip_dport;
uh->len = htons(up->len);
uh->check = 0;
The checksum in the UDP header is then computed over the pseudo-header, the UDP header, and the message content.
if (is_udplite)
    csum = udplite_csum_outgoing(sk, skb);
else if (sk->sk_no_check == UDP_CSUM_NOXMIT) {    /* UDP csum disabled */
    skb->ip_summed = CHECKSUM_NONE;
    goto send;
} else if (skb->ip_summed == CHECKSUM_PARTIAL) {  /* UDP hardware csum */
    udp4_hwcsum_outgoing(sk, skb, fl->fl4_src, fl->fl4_dst, up->len);
    goto send;
} else                                            /* `normal' UDP */
    csum = udp_csum_outgoing(sk, skb);
uh->check = csum_tcpudp_magic(fl->fl4_src, fl->fl4_dst, up->len, sk->sk_protocol, csum);
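What the pseudo-header contributes can be made concrete with a portable checksum routine. This is a sketch of the standard UDP checksum as specified in RFC 768, not the kernel's csum_tcpudp_magic(); udp_checksum and sum_words are hypothetical helpers, and addresses and the result are in host byte order for simplicity.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Sum big-endian 16-bit words of buf; an odd last byte is zero-padded. */
static uint32_t sum_words(const uint8_t *buf, size_t len, uint32_t sum)
{
    size_t i;

    for (i = 0; i + 1 < len; i += 2)
        sum += ((uint32_t)buf[i] << 8) | buf[i + 1];
    if (i < len)
        sum += (uint32_t)buf[i] << 8;
    return sum;
}

/*
 * Checksum over pseudo-header + UDP header + payload. udp points at the
 * UDP header, whose checksum field must be zero; udp_len covers the
 * header and the payload.
 */
static uint16_t udp_checksum(uint32_t saddr, uint32_t daddr,
                             const uint8_t *udp, uint16_t udp_len)
{
    uint32_t sum = 0;
    uint16_t check;

    assert(udp != NULL && udp_len >= 8);
    sum += (saddr >> 16) + (saddr & 0xffff);   /* pseudo-header: source */
    sum += (daddr >> 16) + (daddr & 0xffff);   /* pseudo-header: destination */
    sum += 17;                                 /* protocol number for UDP */
    sum += udp_len;                            /* pseudo-header: UDP length */
    sum = sum_words(udp, udp_len, sum);
    while (sum >> 16)
        sum = (sum & 0xffff) + (sum >> 16);
    check = (uint16_t)~sum;
    return check == 0 ? 0xffff : check;  /* 0 on the wire means "no checksum" */
}
```

A receiver can validate a datagram by summing the same fields with the checksum filled in: the folded result is 0xffff when the datagram is intact.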
The message is then sent to the IP layer, which has already been analyzed.
err = ip_push_pending_frames(sk);
Finally, after the message is sent, len and pending are reset so that the next message can be sent.
up->len = 0;
up->pending = 0;