This series of articles summarizes the Linux network stack, including:
(1) A summary of the Linux network protocol stack
(2) Segmentation offload technologies (GSO/TSO/UFO/LRO/GRO) in a non-virtualized Linux environment
(3) Segmentation offload technologies (GSO/TSO/UFO/LRO/GRO) in a QEMU/KVM virtualized Linux environment
1. Linux Network Path
1.1 Sending side
1.1.1 Application layer
(1) Socket
Network applications at the application layer communicate with the kernel-space network protocol stack almost exclusively through the Linux socket programming interface. The Linux socket evolved from the BSD socket; it is one of the important components of the Linux operating system and the foundation of network applications. From the application's point of view, it is the API that the operating system offers to the application programmer, and through it the application can access the transport-layer protocols.
- The socket sits above the transport-layer protocols and hides the differences between network protocols.
- The socket is the entry point of network programming: it provides a large number of system calls, which form the main body of a network program.
- In a Linux system the socket is part of the file system, so network communication can be treated as reading and writing a file, which makes controlling the network as convenient as controlling a file.
(Figures: UDP socket processing flow; TCP socket processing flow.)
(2) Application-layer processing
- The network application calls the socket API socket(int family, int type, int protocol) to create a socket. This invokes the Linux system call socket(), which in turn calls the kernel's sock_create() method, and the call returns the file descriptor of the newly created socket. For each socket that a userspace network application creates, the kernel holds a corresponding struct socket and struct sock. The struct sock has three queues (RX, TX, and err), which are initialized together with the sock structure. During sending and receiving, each queue holds an instance of the Linux network stack's sk_buff data structure (an skb) for every packet to be sent or received.
- For TCP sockets, the application calls the connect() API so that the client and server establish a virtual connection through the socket. During this step the TCP stack sets up the connection with a three-way handshake. By default, connect() blocks until the handshake completes. An important part of connection setup is determining the Maximum Segment Size (MSS) that both parties will use. Because UDP is connectionless, it does not need this step.
- The application calls the send or write API of the Linux socket to send a message to the receiving end.
- sock_sendmsg is called. It uses the socket descriptor to obtain the sock struct and creates the message header and the socket control message.
- __sock_sendmsg is called. Depending on the protocol type of the socket, it calls the send function of the corresponding protocol.
- For TCP, the tcp_sendmsg function is called.
- For UDP, the userspace application can use any of the three system calls send()/sendto()/sendmsg() to send a UDP message; each eventually calls the udp_sendmsg() function in the kernel.
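The application-side calls in this list can be tried from user space. The following is a minimal sketch using Python's standard socket module over the loopback interface; the function name tcp_roundtrip is made up for illustration, and it assumes that loopback networking is available on the machine.

```python
import socket


def tcp_roundtrip(message: bytes) -> bytes:
    """Send `message` over a loopback TCP connection and return what the peer read."""
    # socket(family, type) creates the socket; the kernel hands back a file
    # descriptor, visible in Python as listener.fileno().
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as listener:
        listener.bind(("127.0.0.1", 0))      # port 0: let the kernel choose a free port
        listener.listen(1)
        port = listener.getsockname()[1]

        with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as client:
            # connect() blocks by default until the three-way handshake is done.
            client.connect(("127.0.0.1", port))
            server_side, _peer = listener.accept()
            with server_side:
                client.sendall(message)          # send path: ends up in tcp_sendmsg
                client.shutdown(socket.SHUT_WR)  # signal end of data to the peer
                chunks = []
                while True:
                    data = server_side.recv(4096)
                    if not data:
                        break
                    chunks.append(data)
                return b"".join(chunks)
```

Calling tcp_roundtrip(b"hello") should return the same bytes, since everything written by the client is read back on the accepted socket.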
1.1.2 Transport Layer
The ultimate goal of the transport layer is to provide its users with efficient, reliable, and cost-effective data transfer. Its tasks include (1) constructing TCP segments, (2) computing checksums, (3) sending acknowledgment (ACK) packets, and (4) maintaining the sliding window and other mechanisms that ensure reliability. The approximate processing flow of the TCP stack is as follows:
Brief process of the TCP stack:
- The tcp_sendmsg function first checks the state of the established TCP connection, then obtains the connection's MSS and starts the segment sending process.
- It constructs the TCP segment's payload: it creates an skb (an instance of the sk_buff data structure) in kernel space and copies the packet data from the userspace buffer into the skb's buffer.
- It constructs the TCP header.
- It computes the TCP checksum and the sequence number.
- The TCP checksum is an end-to-end checksum that is computed by the sender and verified by the receiver. Its purpose is to detect any change to the TCP header or data between the sending end and the receiving end. If the receiver detects a checksum error, the TCP segment is discarded. The TCP checksum covers both the TCP header and the TCP data.
- The TCP checksum is mandatory.
- It hands the segment to the IP layer: the IP-layer handler ip_queue_xmit is called and the skb enters IP processing.
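The checksum step can be illustrated outside the kernel. The sketch below computes the standard Internet checksum (a 16-bit one's-complement sum, as described in RFC 1071) over a TCP segment. As an addition to what the text says, it also includes the IPv4 pseudo-header (source address, destination address, protocol, TCP length) that the TCP specification adds to the checksummed data. The function names are illustrative and are not kernel APIs.

```python
import socket
import struct


def internet_checksum(data: bytes) -> int:
    """One's-complement of the one's-complement sum of all 16-bit words."""
    if len(data) % 2:
        data += b"\x00"                       # pad odd-length input with a zero byte
    total = sum(struct.unpack(f"!{len(data) // 2}H", data))
    while total >> 16:                        # fold carries back into the low 16 bits
        total = (total & 0xFFFF) + (total >> 16)
    return ~total & 0xFFFF


def tcp_checksum(src_ip: str, dst_ip: str, segment: bytes) -> int:
    """Checksum over the IPv4 pseudo-header plus the TCP header and data.

    When computing the value to send, the checksum field of `segment`
    (bytes 16-17 of the TCP header) must be zero.
    """
    pseudo_header = struct.pack(
        "!4s4sBBH",
        socket.inet_aton(src_ip),
        socket.inet_aton(dst_ip),
        0,                                    # reserved byte, always zero
        socket.IPPROTO_TCP,
        len(segment),
    )
    return internet_checksum(pseudo_header + segment)
```

The receiver can run the same function over the segment with the checksum field filled in; a result of 0 means the segment passed verification, and any other value means it should be discarded.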
Brief process of the UDP stack:
- UDP encapsulates the message as a UDP datagram.
- It calls the ip_append_data() method to pass the packet to the IP layer for processing.
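As a rough illustration of the encapsulation step, the sketch below prefixes a payload with the 8-byte UDP header (source port, destination port, length, checksum). The function name is hypothetical, and the checksum is left at zero, which over IPv4 means "no checksum computed"; a real stack would normally fill it in.

```python
import struct

UDP_HEADER_LEN = 8  # four 16-bit fields: source port, destination port, length, checksum


def build_udp_datagram(src_port: int, dst_port: int, payload: bytes) -> bytes:
    """Return a UDP datagram: header followed by payload (checksum left at 0)."""
    length = UDP_HEADER_LEN + len(payload)    # the length field covers header and data
    header = struct.pack("!HHHH", src_port, dst_port, length, 0)
    return header + payload
```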
1.1.3 IP network layer: add header and checksum, route handling, IP fragmentation
The task of the network layer is to select appropriate inter-network routes and switching nodes so that data is delivered in a timely manner. The network layer assembles the frames provided by the data link layer into packets; each packet carries a network-layer header containing logical address information, namely the network addresses of the source and destination sites. Its main tasks are: (1) route processing, that is, selecting the next hop; (2) adding the IP header; (3) computing the IP header checksum, which is used to detect errors in the IP header during transmission; (4) performing IP fragmentation where necessary; and (5) once processing is complete, obtaining the next hop's MAC address, setting the link-layer header, and handing the packet to the link layer.
(Figure: IP header layout.)
The basic processing flow of the IP stack is as follows:
- First, ip_queue_xmit(skb) checks the routing information in skb->dst. If there is none, as for the first packet of a socket, it uses ip_route_output() to select a route.
- Next, it fills in the fields of the IP packet, such as version, header length, and TOS.
- For intermediate steps such as fragmentation, refer to the related documents.
- Finally, ip_finish_output2 sets the link-layer header. If the link-layer header is cached (that is, hh is not NULL), it is copied into the skb. Otherwise, neigh_resolve_output is called and ARP is used to obtain it.
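To make the "fill in the fields, then compute the header checksum" steps concrete, here is a minimal sketch that packs a 20-byte IPv4 header without options. The function name and default values are illustrative assumptions; the field layout follows the IPv4 specification, not the kernel's code.

```python
import socket
import struct


def build_ipv4_header(src: str, dst: str, payload_len: int, proto: int,
                      ttl: int = 64, tos: int = 0, ident: int = 0,
                      flags_frag: int = 0x4000) -> bytes:
    """Pack a 20-byte IPv4 header (no options) with a valid header checksum."""
    version_ihl = (4 << 4) | 5            # version 4, header length 5 x 32-bit words
    total_len = 20 + payload_len          # header plus payload, in bytes

    def pack(checksum: int) -> bytes:
        return struct.pack("!BBHHHBBH4s4s", version_ihl, tos, total_len, ident,
                           flags_frag, ttl, proto, checksum,
                           socket.inet_aton(src), socket.inet_aton(dst))

    # The checksum covers only the header, with the checksum field set to zero.
    total = sum(struct.unpack("!10H", pack(0)))
    while total >> 16:                    # fold carries (one's-complement addition)
        total = (total & 0xFFFF) + (total >> 16)
    return pack(~total & 0xFFFF)
```

The default flags_frag value 0x4000 sets the "don't fragment" bit with a fragment offset of zero; protocol 17 is UDP and 6 is TCP.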
1.1.4 Data Link Layer
Functionally, the data link layer builds on the bit-stream service of the physical layer to establish data links between adjacent nodes. Through error control it delivers data frames over the channel without errors and coordinates the sequence of actions on each circuit. The data link layer thus provides reliable transmission over unreliable physical media. Its functions include physical addressing, data framing, flow control, error checking, and retransmission. At this layer the unit of data is the frame. Data link layer protocols include SDLC, HDLC, PPP, STP, and Frame Relay.
In terms of implementation, Linux provides a network device abstraction layer, implemented in linux/net/core/dev.c. The device driver (driver.c) of each specific physical network device must implement the abstraction layer's virtual functions, and the abstraction layer invokes these device-specific functions.
1.1.5 Physical layer: physical-layer encapsulation and forwarding
- After receiving a send request, the physical layer copies the data from main memory into its internal RAM (buffer) via DMA. During the copy, the header, inter-frame gap (IFG), preamble, and CRC required by the Ethernet protocol are added. On Ethernet, the physical layer transmits using CSMA/CD, that is, it listens for link collisions while sending.
- Once the NIC has finished sending, it raises an interrupt to notify the CPU, and the interrupt handler in the driver layer can then free the saved skb.
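The framing described above can be sketched in software. The example below builds an Ethernet II frame (destination MAC, source MAC, EtherType, padded payload) and appends a frame check sequence computed with zlib.crc32, which uses the same CRC-32 polynomial as Ethernet. It assumes the common convention of storing the FCS least-significant byte first. The preamble and inter-frame gap are generated by the hardware on the wire and are not modeled here. The function names are made up for illustration.

```python
import struct
import zlib

MIN_PAYLOAD = 46  # shorter payloads are padded so the whole frame reaches 64 bytes


def build_ethernet_frame(dst_mac: bytes, src_mac: bytes,
                         ethertype: int, payload: bytes) -> bytes:
    """Return header + padded payload + 4-byte frame check sequence (FCS)."""
    padded = payload.ljust(MIN_PAYLOAD, b"\x00")
    body = dst_mac + src_mac + struct.pack("!H", ethertype) + padded
    fcs = struct.pack("<I", zlib.crc32(body))   # assumption: FCS stored LSB first
    return body + fcs


def fcs_ok(frame: bytes) -> bool:
    """Recompute the CRC over everything except the trailing FCS and compare."""
    return struct.pack("<I", zlib.crc32(frame[:-4])) == frame[-4:]
```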
1.1.6 Simple Summary
(Figure: overview of the sending path.)
1.2 Receiving side
1.2.1 Physical layer and data link layer
Brief process:
- A packet arrives at the machine's physical network adapter. When the adapter receives a data frame, it triggers an interrupt and transfers the frame via DMA to the rx_ring in Linux kernel memory.
- The NIC raises an interrupt to notify the CPU that a packet needs to be processed. The interrupt handler mainly allocates an sk_buff data structure, copies the received frame from the network adapter's I/O port into the sk_buff buffer, extracts some information from the data frame, and sets the corresponding sk_buff parameters. These parameters, for example skb->protocol, are used by the upper-layer network protocols.
- After this brief processing, the interrupt handler raises a soft interrupt (NET_RX_SOFTIRQ) to notify the kernel that a new data frame has been received.
- Kernel 2.5 introduced a new API for handling received data frames, NAPI. The driver therefore has two ways to notify the kernel: (1) through the older function netif_rx, or (2) through the NAPI mechanism. The interrupt handler calls the network device's netif_rx_schedule function, which enters softirq processing, and the net_rx_action function is then called.
- That function disables interrupts and fetches all packets in the rx_ring of each network device. Each packet is eventually removed from the rx_ring and enters the netif_receive_skb flow.
- netif_receive_skb is the last stop for a datagram in the link layer. Based on the network-layer datagram types registered in the global arrays ptype_all and ptype_base, it hands the datagram to the receive function of the matching network-layer protocol (in the INET domain mainly ip_rcv and arp_rcv). In other words, it calls the layer-3 protocol's receive function to process the skb, and processing moves on to the network layer.
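The dispatch performed at the end of this list can be modeled as a table lookup keyed by the frame's EtherType. The following is a toy user-space model, not kernel code: the names ptype_base, dev_add_pack, netif_receive_skb, ip_rcv, and arp_rcv are borrowed from the kernel for readability, but their signatures and return values here are invented for the sketch.

```python
import struct

ETH_P_IP = 0x0800    # EtherType for IPv4
ETH_P_ARP = 0x0806   # EtherType for ARP

# Toy stand-in for the kernel table: EtherType -> list of registered handlers.
ptype_base = {}


def dev_add_pack(ethertype, handler):
    """Register a layer-3 receive handler for one EtherType."""
    ptype_base.setdefault(ethertype, []).append(handler)


def netif_receive_skb(frame: bytes):
    """Strip the 14-byte Ethernet header and call every matching handler."""
    (ethertype,) = struct.unpack("!H", frame[12:14])
    packet = frame[14:]
    return [handler(packet) for handler in ptype_base.get(ethertype, [])]


def ip_rcv(packet: bytes):
    return ("ip_rcv", len(packet))


def arp_rcv(packet: bytes):
    return ("arp_rcv", len(packet))


dev_add_pack(ETH_P_IP, ip_rcv)
dev_add_pack(ETH_P_ARP, arp_rcv)
```

A frame whose EtherType has no registered handler yields an empty list, which corresponds to the frame being dropped.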
1.2.2 Network Layer
- The entry point of the IP layer is the ip_rcv function. It first performs various checks, including the packet checksum, does IP defragmentation (merging multiple fragments) if necessary, then runs the registered pre-routing netfilter hooks on the packet, and finally reaches the ip_rcv_finish function.
- The ip_rcv_finish function calls ip_route_input to enter the routing process: it updates the routing information, looks up the route, and decides whether the packet is delivered to the local host, forwarded, or discarded:
- If the packet is destined for the local host, ip_local_deliver is called. It may defragment (merge multiple IP fragments) and then calls the next layer's interface according to the packet's protocol number: tcp_v4_rcv (TCP), udp_rcv (UDP), icmp_rcv (ICMP), or igmp_rcv (IGMP). For TCP, tcp_v4_rcv is called and processing enters the TCP stack.
- If the packet must be forwarded, it enters the forwarding process, which handles the TTL and then calls dst_input. That function (1) runs the netfilter hooks, (2) performs IP fragmentation, and (3) calls dev_queue_xmit to enter link-layer processing.
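The three-way decision (deliver locally, forward, or discard) can be sketched as follows. This is a simplified model: the local address and the forwarding table are hypothetical values chosen for the example, and a real routing lookup uses longest-prefix matching and many more inputs.

```python
import ipaddress

# Hypothetical configuration for the sketch.
LOCAL_ADDRESSES = {ipaddress.ip_address("192.168.0.1")}
ROUTES = [ipaddress.ip_network("10.0.0.0/8")]


def route_input(dst: str, ttl: int, forwarding_enabled: bool = True):
    """Return ("deliver", ttl), ("forward", ttl - 1) or ("drop", reason)."""
    addr = ipaddress.ip_address(dst)
    if addr in LOCAL_ADDRESSES:
        return ("deliver", ttl)               # hand up to TCP/UDP/ICMP by protocol number
    if not forwarding_enabled:
        return ("drop", "forwarding disabled")
    if not any(addr in network for network in ROUTES):
        return ("drop", "no route")
    if ttl <= 1:
        return ("drop", "ttl expired")        # TTL would reach zero at this hop
    return ("forward", ttl - 1)               # TTL is decremented before sending on
```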
1.2.3 Transport Layer (TCP/UDP)
- The entry point of TCP processing in the transport layer is the tcp_v4_rcv function (in linux/net/ipv4/tcp_ipv4.c), which checks the TCP header, among other things.
- It calls __tcp_v4_lookup to find the open socket that the packet belongs to. If none is found, the packet is discarded. Next, it checks the state of the socket and the connection.
- If the socket and connection are in order, tcp_prequeue is called to move the packet from the kernel toward user space and place it in the socket's receive queue. The socket is then woken up, the application's system call proceeds, and tcp_recvmsg is eventually called to fetch the segment from the socket's receive queue.
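The lookup-then-queue behavior described above can be modeled with a dictionary keyed by the connection's four-tuple. This is a toy model under simplifying assumptions: the function names echo the kernel's but the signatures are invented, and state checks, sequence numbers, and locking are left out.

```python
from collections import deque

# Toy connection table: (src_ip, src_port, dst_ip, dst_port) -> receive queue.
established = {}


def open_connection(four_tuple):
    """Create the receive queue for a connection (stands in for the socket)."""
    established[four_tuple] = deque()


def tcp_v4_rcv(four_tuple, segment: bytes) -> bool:
    """Queue the segment on the matching socket; return False if it is discarded."""
    queue = established.get(four_tuple)
    if queue is None:
        return False                      # no open socket for this four-tuple
    queue.append(segment)
    return True


def tcp_recvmsg(four_tuple, bufsize: int) -> bytes:
    """Copy up to `bufsize` bytes from the receive queue into a user buffer."""
    queue = established[four_tuple]
    out = bytearray()
    while queue and len(out) < bufsize:
        segment = queue.popleft()
        take = bufsize - len(out)
        out += segment[:take]
        if len(segment) > take:
            queue.appendleft(segment[take:])   # keep the unread remainder queued
    return bytes(out)
```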
1.2.4 Receiving side: application layer
- Whenever a user application calls read or recvfrom, the call is mapped to the sys_recv system call in net/socket.c and converted into a sys_recvfrom call, after which the sock_recvmsg function is called.
- For sockets of the INET type, the inet_recvmsg method in net/ipv4/af_inet.c is called, and it invokes the data-receiving method of the associated protocol.
- For TCP, tcp_recvmsg is called. It copies data from the socket buffer to the user buffer.
- For UDP, user space can call any of the three system calls recv()/recvfrom()/recvmsg() to receive a UDP packet; each eventually calls the udp_recvmsg method in the kernel.
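From user space, the UDP receive path looks like this. The sketch sends one datagram over the loopback interface and reads it back with recvfrom(); the function name is made up, and the example assumes that loopback networking is available.

```python
import socket


def udp_receive_once(message: bytes):
    """Send one datagram over loopback; return (data, sender_address_matches)."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as receiver, \
         socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sender:
        receiver.bind(("127.0.0.1", 0))       # port 0: the kernel picks a free port
        receiver.settimeout(2.0)              # avoid blocking forever if nothing arrives
        sender.bind(("127.0.0.1", 0))
        sender.sendto(message, receiver.getsockname())
        # recvfrom() returns the payload and the address of the sender.
        data, peer = receiver.recvfrom(65535)
        return data, peer == sender.getsockname()
```

No connect() call is needed before sending, which reflects the connectionless nature of UDP.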
1.2.5 Brief summary of the message receiving process
2. Linux sk_buff struct and queues
2.1 sk_buff
(This section is excerpted from http://amsekharkernel.blogspot.com/2014/08/what-is-skb-in-linux-kernel-what-are.html)
2.1.1 What is sk_buff?
When the kernel processes a received network packet, data from the lower-layer protocols is passed up to the higher layers; when data is sent, the process is reversed. Data produced by the different protocols, including headers and payload, is passed down layer by layer until it is finally sent. Because the speed of these operations is critical to network performance, the kernel uses a dedicated structure called sk_buff, defined in skbuff.h. The socket buffer lets the layers of the network implementation exchange data without copying packet data back and forth, which is a significant speed gain.
- sk_buff is a core data structure of Linux networking, defined in skbuff.h.
- The socket kernel buffer (skb) is the buffer that the Linux kernel network stack (L2 to L4) uses to handle network packets; its type is sk_buff. In short, one skb represents one packet in the Linux network stack, and the multiple skbs produced by TCP segmentation or IP fragmentation are kept in an skb list.
- The struct sock has three skb queues (sk_buff queues): RX, TX, and err.
2.1.2 Sending data using skb operations
(1) Allocate the buffer: skb = alloc_skb(len, GFP_KERNEL)
(2) Add the payload: skb_put(skb, user_data_len)
(3) Add a protocol header with skb_push, or remove a header with skb_pull
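The effect of these operations on the buffer can be shown with a small user-space model. This is only a sketch: the class and its methods are invented to mirror the kernel functions skb_put, skb_push, and skb_pull, and it also models skb_reserve (not listed in the steps above), which leaves headroom at the front so that headers can be prepended later without moving the payload. Unlike the kernel versions, these methods copy the data themselves.

```python
class SkBuff:
    """Toy model of an skb: one linear buffer with data/tail offsets."""

    def __init__(self, size: int):
        self.buf = bytearray(size)
        self.data = 0    # offset of the first valid byte
        self.tail = 0    # offset just past the last valid byte

    def reserve(self, n: int) -> None:
        """Like skb_reserve: create headroom; only valid while the buffer is empty."""
        if self.data != self.tail:
            raise ValueError("reserve on a non-empty buffer")
        self.data += n
        self.tail += n

    def put(self, payload: bytes) -> None:
        """Like skb_put: append bytes at the tail."""
        if self.tail + len(payload) > len(self.buf):
            raise ValueError("not enough tailroom")
        self.buf[self.tail:self.tail + len(payload)] = payload
        self.tail += len(payload)

    def push(self, header: bytes) -> None:
        """Like skb_push: prepend a header into the headroom."""
        if len(header) > self.data:
            raise ValueError("not enough headroom")
        self.data -= len(header)
        self.buf[self.data:self.data + len(header)] = header

    def pull(self, n: int) -> bytes:
        """Like skb_pull: strip n bytes from the front and return them."""
        if n > self.tail - self.data:
            raise ValueError("pull past end of data")
        removed = bytes(self.buf[self.data:self.data + n])
        self.data += n
        return removed

    def contents(self) -> bytes:
        return bytes(self.buf[self.data:self.tail])
```

On the send path the payload is put first and each layer then pushes its header in front of it; on the receive path each layer pulls its own header before handing the buffer up.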
2.2 Driver queue used by the Linux network stack
(This section is excerpted from Queueing in the Linux Network Stack by Dan Siemon)
2.2.1 Queue
Between the IP stack and the NIC driver there is a driver queue. It is typically implemented as a FIFO ring buffer, which can be thought of simply as a fixed-size buffer. The queue does not contain packet data; it only holds pointers to socket kernel buffers (skbs), which, as described in the previous section, are used throughout the kernel network stack.
The input to the queue is the packets that the IP stack has finished processing. These packets are either generated by local applications or received by the machine and routed out again. Packets that the IP stack adds to the queue are removed by the network device driver (hardware driver) and sent over a data bus to the NIC hardware for transmission.
Without TSO/GSO, the packets that the IP stack places on the queue must not exceed the MTU in length.
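A fixed-size FIFO ring of references, as described above, can be sketched like this. The class name and methods are illustrative only; a real driver ring also involves DMA descriptors and synchronization with the hardware.

```python
class DriverQueue:
    """Fixed-size FIFO ring that stores references to packet buffers, not copies."""

    def __init__(self, size: int):
        self.slots = [None] * size
        self.head = 0     # index of the next entry the driver will dequeue
        self.count = 0    # number of occupied slots

    def enqueue(self, skb) -> bool:
        """Called by the IP stack; returns False when the ring is full."""
        if self.count == len(self.slots):
            return False
        tail = (self.head + self.count) % len(self.slots)
        self.slots[tail] = skb
        self.count += 1
        return True

    def dequeue(self):
        """Called by the driver; returns None when the ring is empty."""
        if self.count == 0:
            return None
        skb = self.slots[self.head]
        self.slots[self.head] = None
        self.head = (self.head + 1) % len(self.slots)
        self.count -= 1
        return skb
```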
2.2.2 SKB size: the default maximum size is the NIC MTU
The vast majority of network cards have a fixed maximum transmission unit (MTU) property, which is the largest frame the network device can transmit. For Ethernet the default is 1,500 bytes, although some Ethernet networks support jumbo frames of up to 9,000 bytes. Within the IP network stack, the MTU is the maximum packet size that can be handed to the NIC. For example, if an application writes 2,000 bytes of data to a TCP socket, the IP stack needs to create two IP packets so that each is no larger than 1,500 bytes. For large data transfers, a relatively small MTU therefore causes a large number of small packets to be passed into the driver queue. Splitting an IP packet that exceeds the MTU into several smaller packets is called IP fragmentation.
(Figure: fragmentation of an IP packet's payload with an MTU of 1000 and with an MTU of 600.)
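The arithmetic behind fragmentation can be worked through in a few lines. The sketch below assumes a 20-byte IPv4 header without options and uses the rule from the IPv4 specification that every fragment except the last must carry a multiple of 8 bytes, because the fragment offset field counts in 8-byte units. The 1,500-byte payload used in the usage note is an assumed value chosen for illustration.

```python
IP_HEADER_LEN = 20  # assumption: IPv4 header without options


def fragment(payload_len: int, mtu: int):
    """Return a list of (offset_in_bytes, data_length, more_fragments) tuples."""
    # Largest data size per fragment, rounded down to a multiple of 8 bytes.
    max_data = (mtu - IP_HEADER_LEN) // 8 * 8
    if max_data <= 0:
        raise ValueError("MTU too small to carry any data")
    fragments = []
    offset = 0
    while offset < payload_len:
        length = min(max_data, payload_len - offset)
        more = offset + length < payload_len   # the MF flag is set on all but the last
        fragments.append((offset, length, more))
        offset += length
    return fragments
```

With the assumed 1,500-byte payload, an MTU of 1000 gives two fragments of 976 and 524 bytes, and an MTU of 600 gives three fragments of 576, 576, and 348 bytes.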
Note:
- The information above is compiled from various materials found online.
- The subject is complex and differs between Linux kernel versions; the content needs further refinement, and errors are unavoidable.
Reference Links:
Linux Network protocol stack (i)--socket Getting Started
Linux Network protocol stack (iv)--Link layer (1)
What is SKB in Linux kernel? What are SKB operations? Memory representation of SKB? How to send packet out using SKB operations?
Queueing in the Linux Network Stack
Transmission Control Protocol
Data receiving and dispatching in TCP/IP protocol stack
http://www.haifux.org/lectures/217/netLec5.pdf