Linux Network Protocol Stack Kernel Analysis

1. Linux Network Path

1.1 Sending End

1.1.1 Application Layer

(1) Socket

Virtually all network applications communicate with the kernel-space network protocol stack through the Linux socket programming interface. Linux sockets evolved from BSD sockets and are one of the most important components of the Linux operating system: they are the foundation of every network application. Architecturally, sockets belong to the application layer; they are an API the operating system provides to application programmers, through which applications access the transport-layer protocols. Sitting on top of the transport layer, the socket interface shields applications from the differences between network protocols. The socket is the gateway to network programming: it provides a set of system calls that form the backbone of network programs on Linux. Sockets are also part of the file abstraction, so network communication can be treated as reading and writing a file, which makes controlling the network as convenient as controlling files.

Figure: UDP socket processing flow (source)

Figure: TCP socket processing flow (source)

(2) Application layer processing flow

A network application calls the socket API socket(int family, int type, int protocol) to create a socket. This ends up in the Linux system call socket() and finally in the kernel's sock_create() method, which returns the file descriptor of the created socket. For every socket a user-space network application creates, there is a corresponding struct socket and struct sock in the kernel. The struct sock holds three queues, rx, tx and err, which are initialized together with the sock structure; during transmission and reception they hold the packets to be sent or received, each as an instance (SKB) of the Linux network stack's sk_buff data structure.

For a TCP socket, the application then calls the connect() API, which establishes a virtual connection between client and server through the socket. During this step the TCP protocol stack performs the three-way handshake; by default the API does not return until the handshake has completed and the connection is established. An important part of connection setup is determining the Maximum Segment Size (MSS) both sides will use. UDP is a connectionless protocol, so it skips this step.

The application calls the send or write API of the Linux socket to transmit a message to the receiving end. sock_sendmsg is invoked; it uses the socket descriptor to obtain the struct sock and creates the message header and socket control message. __sock_sendmsg is then invoked, which dispatches to the send function of the socket's protocol: for TCP it calls tcp_sendmsg; for UDP, user space may use any of the three system calls send()/sendto()/sendmsg(), all of which end up in the kernel's udp_sendmsg() function.
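The user-space side of this path can be exercised directly. Below is a minimal sketch (the helper name and port number are illustrative, not part of any API in the article) that creates a UDP socket and hands one datagram to the kernel, where it eventually reaches udp_sendmsg():

```c
#include <arpa/inet.h>
#include <stdint.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical helper: create a UDP socket and send one datagram to
 * 127.0.0.1:port.  Returns the number of bytes handed to the kernel,
 * or -1 on error. */
static ssize_t send_udp_datagram(uint16_t port, const char *payload)
{
    int fd = socket(AF_INET, SOCK_DGRAM, 0);   /* -> sys_socket -> sock_create() */
    if (fd < 0)
        return -1;

    struct sockaddr_in dst;
    memset(&dst, 0, sizeof dst);
    dst.sin_family      = AF_INET;
    dst.sin_port        = htons(port);
    dst.sin_addr.s_addr = htonl(INADDR_LOOPBACK);

    /* sendto() -> sock_sendmsg() -> __sock_sendmsg() -> udp_sendmsg() */
    ssize_t n = sendto(fd, payload, strlen(payload), 0,
                       (struct sockaddr *)&dst, sizeof dst);
    close(fd);
    return n;
}
```

Because UDP is connectionless, the sendto succeeds even if nobody is listening on the destination port; any error would only surface later as an ICMP message.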

1.1.2 Transport Layer

The ultimate goal of the transport layer is to provide efficient, reliable and cost-effective data transfer services to its users. To ensure reliability this includes (1) constructing TCP segments, (2) computing checksums, (3) sending acknowledgement (ACK) packets, and (4) operating the sliding window. The approximate processing of the TCP protocol stack is shown in the following illustration:

TCP stack brief process: the tcp_sendmsg function first checks the state of the established TCP connection, obtains the MSS for the connection, and begins the segment send process. It constructs the payload of a TCP segment: it creates an instance SKB of the sk_buff data structure in kernel space and copies the user-space data from the packet buffer into the SKB buffer. It then constructs the TCP header and computes the TCP checksum and sequence numbers. The TCP checksum is an end-to-end checksum, computed by the sender and validated by the receiver; its purpose is to detect any change to the TCP header or data on the way from sender to receiver. If the receiver detects a checksum error, the TCP segment is discarded. The TCP checksum covers both the TCP header and the TCP data. Finally the segment is handed to the IP layer: the IP handler ip_queue_xmit is invoked and the SKB enters the IP processing flow.
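The checksum mentioned here is the standard Internet checksum (RFC 1071), shared by IP, TCP and UDP (TCP and UDP additionally cover a pseudo-header, omitted here). A user-space version for illustration:

```c
#include <stddef.h>
#include <stdint.h>

/* Internet checksum (RFC 1071): 16-bit ones'-complement sum of the
 * data, folded and then complemented. */
static uint16_t inet_checksum(const uint8_t *data, size_t len)
{
    uint32_t sum = 0;
    while (len > 1) {
        sum  += (uint32_t)data[0] << 8 | data[1];  /* big-endian 16-bit words */
        data += 2;
        len  -= 2;
    }
    if (len)                        /* odd trailing byte, zero-padded */
        sum += (uint32_t)data[0] << 8;
    while (sum >> 16)               /* fold carries back into low 16 bits */
        sum = (sum & 0xffff) + (sum >> 16);
    return (uint16_t)~sum;
}
```

Running this over the example bytes in RFC 1071 (00 01 f2 03 f4 f5 f6 f7) yields the checksum 0x220d given there.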

UDP stack brief process: UDP encapsulates the message into a UDP datagram and calls the ip_append_data() method to hand the packet to the IP layer for processing.

1.1.3 IP Network Layer – add header and checksum, route processing, IP fragmentation

The task of the network layer is to select appropriate routes and switching nodes so that data is delivered in time. The network layer takes the frames provided by the data link layer and encapsulates them with a network-layer header containing the logical address information: the network addresses of the source and destination hosts. Its main tasks are (1) routing, i.e. selecting the next hop; (2) adding the IP header; (3) computing the IP header checksum, used to detect whether the IP header was corrupted in transit; (4) IP fragmentation, where necessary; (5) once processing is complete, obtaining the next hop's MAC address, setting up the link-layer header, and handing over to link-layer processing.

IP Header:

The IP stack basic processing process is shown in the following illustration:

First, ip_queue_xmit(skb) checks the routing information in skb->dst. If there is none, as for the first packet of a socket, it uses ip_route_output() to select a route. Next it fills in the various fields of the IP header, such as version, header length and TOS. (Some intermediate steps, such as fragment bookkeeping, can be found in the related documentation.) The basic idea is that when the length of the packet is greater than the MTU and GSO/TSO is not in use, ip_fragment is called to fragment it; otherwise ip_finish_output2 is called to send the data out. The ip_fragment function checks the IP_DF flag bit: if fragmentation is prohibited for an IP packet that needs it, it calls icmp_send() to return a destination-unreachable ICMP packet with the reason "fragmentation needed but DF set", and discards the packet; that is, it marks fragmentation as failed, releases the SKB, and returns a "message too long" error code. Next, ip_finish_output2 sets up the link-layer header. If the link-layer header cache (the hh field) is non-empty, it is copied into the SKB; if not, neigh_resolve_output is called and ARP is used to obtain it.
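The fragment-or-reject decision just described can be sketched as a pure function (a simplification for illustration, not kernel code; the names are invented):

```c
#include <stdint.h>

enum frag_verdict {
    SEND_WHOLE,                 /* fits in the MTU: ip_finish_output2 path */
    FRAGMENT,                   /* too big, DF clear: ip_fragment path     */
    REJECT_ICMP_FRAG_NEEDED     /* too big, DF set: icmp_send + drop       */
};

/* Decide what happens to a packet of pkt_len bytes on a link with the
 * given MTU, depending on the IP_DF (don't-fragment) bit. */
static enum frag_verdict ip_frag_decision(uint32_t pkt_len, uint32_t mtu, int df_bit)
{
    if (pkt_len <= mtu)
        return SEND_WHOLE;
    if (df_bit)
        return REJECT_ICMP_FRAG_NEEDED;   /* sender sees "message too long" */
    return FRAGMENT;
}
```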

1.1.4 Data Link Layer

Functionally, the data link layer establishes data links between neighboring nodes on top of the bit-stream service provided by the physical layer, uses error control to provide error-free transmission of data frames over the channel, and carries out the required actions on each link. The data link layer provides reliable transmission over unreliable physical media. Its functions include physical addressing, framing, flow control, error detection and retransmission. At this level the unit of data is called a frame. Representative data link layer protocols include SDLC, HDLC, PPP, STP and Frame Relay.

In implementation terms, Linux provides a network device abstraction layer, found in linux/net/core/dev.c. The concrete physical network device driver (driver.c) implements the virtual functions of this layer, and the abstraction layer calls into the functions of the specific network device.

1.1.5 Physical Layer – physical layer encapsulation and forwarding

After receiving a send request, the physical layer copies the data from main memory into the NIC's internal RAM (buffer) by DMA. During the copy it adds the Ethernet-conformant header, inter-frame gap (IFG), preamble and CRC. For Ethernet networks, the physical layer sends using CSMA/CD, i.e. it listens for link collisions while sending. Once the NIC has finished sending the frame, it raises an interrupt to notify the CPU, and the interrupt handler in the driver layer can then delete the saved SKB.

1.1.6 Simple Summary
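The CRC appended to each frame above is the Ethernet frame check sequence, a CRC-32. Real NICs compute it in hardware; a bitwise software version (function name is illustrative, written for clarity rather than speed) shows what is being calculated:

```c
#include <stddef.h>
#include <stdint.h>

/* Ethernet FCS: reflected CRC-32 with polynomial 0xEDB88320,
 * processed one bit at a time. */
static uint32_t eth_crc32(const uint8_t *data, size_t len)
{
    uint32_t crc = 0xffffffffu;
    for (size_t i = 0; i < len; i++) {
        crc ^= data[i];
        for (int b = 0; b < 8; b++)
            /* if the low bit is set, shift and xor in the polynomial */
            crc = (crc >> 1) ^ (0xedb88320u & (uint32_t)-(int32_t)(crc & 1));
    }
    return ~crc;
}
```

The standard check value for this CRC is eth_crc32("123456789") == 0xcbf43926.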

(source)

1.2 Receiving End

1.2.1 Physical Layer and Data Link Layer

Brief procedure: a frame arrives at the machine's physical network adapter; when the NIC receives it, the frame is transferred by DMA into the rx_ring in Linux kernel memory, and the NIC raises an interrupt to notify the CPU that a packet needs handling. The interrupt handler mainly does the following: it allocates an sk_buff data structure, copies the received frame from the network adapter's I/O buffer into the sk_buff, extracts some information from the data frame, and sets the corresponding sk_buff parameters (such as skb->protocol) that will be used by the upper protocol layers. After this minimal processing, the handler raises a softirq (NET_RX_SOFTIRQ) to notify the kernel of the new data frame. In kernel 2.5 a new set of APIs for handling received frames was introduced: NAPI. A driver can therefore notify the kernel in one of two ways: (1) through the older netif_rx function, or (2) through the NAPI mechanism. The interrupt handler calls the network device's netif_rx_schedule function, entering the softirq processing flow, which then calls the net_rx_action function. That function disables interrupts, fetches all the packets in each network device's rx_ring, and eventually each packet is removed from the rx_ring and enters netif_receive_skb processing. netif_receive_skb is the last stop of the link layer for a received datagram. Based on the network-layer datagram types registered in the global arrays ptype_all and ptype_base, it hands the datagram to the receive function of the corresponding network-layer protocol (in the inet domain, mainly ip_rcv and arp_rcv); that is, it calls the layer-3 protocol to handle the SKB, entering network-layer processing.

1.2.2 Network Layer

The IP layer's entry point is the ip_rcv function. It first performs various checks, including the packet checksum; if necessary it performs IP defragmentation (merging multiple fragments); it then runs the packet through the registered pre-routing netfilter hooks, finally reaching the ip_rcv_finish function. ip_rcv_finish enters the routing process: it calls ip_route_input to update the route and then uses the found route to decide whether the packet is delivered locally, forwarded, or discarded. If it is for this machine, ip_local_deliver is called, which may perform defragmentation (combining multiple IP packets) and then invokes the next layer's interface based on the packet's protocol number: tcp_v4_rcv (TCP), udp_rcv (UDP), icmp_rcv (ICMP) or igmp_rcv (IGMP). For TCP, the tcp_v4_rcv function is invoked and processing enters the TCP stack. If forwarding is required, the forwarding process is entered: it adjusts the TTL and then calls dst_output, which (1) runs the netfilter hooks, (2) performs IP fragmentation if needed, and (3) calls dev_queue_xmit to enter link-layer processing.
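The initial sanity checks ip_rcv performs on an incoming packet can be approximated in user space (a simplified sketch; the helper name is invented): the version must be 4, the header length must be plausible, and the header checksum, the ones'-complement sum over the header, must verify:

```c
#include <stddef.h>
#include <stdint.h>

/* Return 1 if the first len bytes of hdr look like a valid IPv4
 * header: version 4, IHL >= 5 words, checksum verifies.  A header
 * with a correct checksum sums (including the checksum field) to
 * 0xffff before complementing. */
static int ipv4_header_ok(const uint8_t *hdr, size_t len)
{
    if (len < 20)
        return 0;
    unsigned version = hdr[0] >> 4;
    unsigned ihl     = hdr[0] & 0x0f;          /* header length in 32-bit words */
    if (version != 4 || ihl < 5 || (size_t)ihl * 4 > len)
        return 0;

    uint32_t sum = 0;
    for (unsigned i = 0; i < ihl * 4; i += 2)  /* sum 16-bit big-endian words */
        sum += (uint32_t)hdr[i] << 8 | hdr[i + 1];
    while (sum >> 16)                          /* fold carries */
        sum = (sum & 0xffff) + (sum >> 16);
    return sum == 0xffff;
}
```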

1.2.3 Transport Layer (TCP/UDP)

The transport layer's TCP processing entry point is the tcp_v4_rcv function (in the linux/net/ipv4/tcp_ipv4.c file), which checks and processes the TCP header. It calls __tcp_v4_lookup to find the open socket for the packet; if none is found, the packet is discarded. It then checks the state of the socket and the connection. If both are normal, it calls tcp_prequeue, which moves the packet from the kernel toward user space by placing it on the socket's receive queue. The socket is then woken up; the application's system call eventually reaches the tcp_recvmsg function, which fetches segments from the socket's receive queue.

1.2.4 Receiving End – Application Layer

Whenever a user application calls read or recvfrom, the call is mapped to the sys_recv system call in net/socket.c, converted to a sys_recvfrom call, and sock_recvmsg is then invoked. For a socket of the inet type, the inet_recvmsg method in net/ipv4/af_inet.c is invoked, which calls the data-receiving method of the relevant protocol. For TCP, tcp_recvmsg is called; the function copies data from the socket buffer to the user buffer. For UDP, user space can call any of the three system calls recv()/recvfrom()/recvmsg() to receive a UDP packet, all of which end up in the kernel's udp_recvmsg method.

1.2.5 Brief summary of the message receiving process
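The send and receive paths can be exercised together from user space over loopback; a minimal sketch (the helper name and buffer size are illustrative) in which sendto() descends through udp_sendmsg() and recvfrom() ultimately reaches udp_recvmsg() in the kernel:

```c
#include <arpa/inet.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <unistd.h>

/* Bind a receiver on loopback, send it one datagram, read it back
 * with recvfrom().  Returns 1 if the payload arrives intact. */
static int udp_loopback_echo_ok(const char *msg)
{
    char buf[512];
    int rx = socket(AF_INET, SOCK_DGRAM, 0);
    int tx = socket(AF_INET, SOCK_DGRAM, 0);
    if (rx < 0 || tx < 0)
        return 0;

    struct sockaddr_in addr;
    memset(&addr, 0, sizeof addr);
    addr.sin_family      = AF_INET;
    addr.sin_addr.s_addr = htonl(INADDR_LOOPBACK);
    addr.sin_port        = 0;                  /* let the kernel pick a port */
    if (bind(rx, (struct sockaddr *)&addr, sizeof addr) < 0)
        return 0;

    socklen_t alen = sizeof addr;              /* learn the chosen port */
    getsockname(rx, (struct sockaddr *)&addr, &alen);

    sendto(tx, msg, strlen(msg), 0, (struct sockaddr *)&addr, sizeof addr);
    ssize_t n = recvfrom(rx, buf, sizeof buf, 0, NULL, NULL);
    close(tx);
    close(rx);
    return n == (ssize_t)strlen(msg) && memcmp(buf, msg, (size_t)n) == 0;
}
```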

2. Linux sk_buff Struct Data Structure and Queues

2.1 sk_buff

(This section draws on http://amsekharkernel.blogspot.com/2014/08/what-is-skb-in-linux-kernel-what-are.html)

2.1.1 What is sk_buff?

When a network packet is processed by the kernel, data from the lower-layer protocols is passed up to the higher layers, and the reverse happens on transmission: data generated by the different protocols (headers and payload) is passed down through the layers until it is eventually sent. Because the speed of these operations is critical for network-stack performance, the kernel uses a single structure for this, called sk_buff, defined in skbuff.h. The socket buffer is used to exchange data between the network implementation layers without copying packets back and forth, which brings a significant speed gain. sk_buff is a core data structure of Linux networking: the socket kernel buffer (SKB) is the buffer the Linux kernel network stack (L2 to L4) uses to handle network packets. Put simply, an SKB represents one packet of the Linux network stack, such as a TCP segment or an IP packet, and the multiple SKBs so produced are kept on SKB lists. The struct sock holds three SKB queues (sk_buff queues): rx, tx and err.

Its main structural members are:

struct sk_buff {
    /* These two members must be first: a packet can sit on a list or
       queue, and these pointers are used for list handling. */
    struct sk_buff      *next;
    struct sk_buff      *prev;
    struct sk_buff_head *list;       /* the list this packet belongs to */

    struct sock         *sk;         /* the socket associated with this SKB */
    struct timeval       stamp;      /* packet send/receive time, mainly for sniffers */

    /* These three members track the devices related to this packet,
       e.g. the device it was received on. */
    struct net_device   *dev;
    struct net_device   *input_dev;
    struct net_device   *real_dev;

    union {                          /* pointers to the transport-layer header */
        struct tcphdr   *th;
        struct udphdr   *uh;
        struct icmphdr  *icmph;
        struct igmphdr  *igmph;
        struct iphdr    *ipiph;
        struct ipv6hdr  *ipv6h;
        unsigned char   *raw;
    } h;

    union {                          /* pointers to the network-layer header */
        struct iphdr    *iph;
        struct ipv6hdr  *ipv6h;
        struct arphdr   *arph;
        unsigned char   *raw;
    } nh;

    union {                          /* pointer to the link-layer header */
        unsigned char   *raw;
    } mac;

    struct dst_entry    *dst;        /* routing entry: how this packet will be
                                        routed to its destination */
    char                 cb[40];     /* SKB control block: private data for each
                                        protocol layer, e.g. TCP sequence numbers */
    unsigned int         len,        /* length of the packet */
                         data_len,
                         mac_len,    /* length of the MAC header */
                         csum;       /* packet checksum: on send, computed in software
                                        when checksum offloading is not set; on receive,
                                        may be computed by the device */
    unsigned char        local_df,   /* IPv4 fragmentation override, e.g. for IPsec */
                         cloned:1,   /* set when the SKB is cloned: the SKB fields are
                                        private, but the data is shared */
                         nohdr:1,    /* used to support TSO */
                         pkt_type,   /* packet type */
                         ip_summed;  /* checksum types the NIC can compute:
                                        none = unsupported, hw = supported */
    __u32                priority;   /* used for QoS */
    unsigned short       protocol,   /* protocol of the received packet */
                         security;
};
2.1.2 Main sk_buff operations

(1) Allocation: skb = alloc_skb(len, GFP_KERNEL)

(2) Adding payload: skb_put(skb, user_data_len)

(3) Adding a protocol header with skb_push, or removing one with skb_pull
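These operations can be modeled with a toy user-space structure (an illustration of the pointer arithmetic only, not the kernel API): the buffer is allocated once with headroom, payload is appended with put, and headers are added and removed by moving the data pointer with push/pull, without ever copying the payload:

```c
#include <stdlib.h>

struct mini_skb {
    unsigned char *head;   /* start of the allocated buffer    */
    unsigned char *data;   /* start of the current packet data */
    unsigned char *tail;   /* end of the current packet data   */
    size_t         len;    /* tail - data                      */
};

static struct mini_skb *mini_alloc_skb(size_t size, size_t headroom)
{
    struct mini_skb *skb = malloc(sizeof *skb);
    skb->head = malloc(size + headroom);
    skb->data = skb->tail = skb->head + headroom;  /* reserve header space */
    skb->len  = 0;
    return skb;
}

static void mini_skb_put(struct mini_skb *skb, size_t n)   /* extend payload at tail */
{
    skb->tail += n;
    skb->len  += n;
}

static void mini_skb_push(struct mini_skb *skb, size_t n)  /* prepend a header */
{
    skb->data -= n;
    skb->len  += n;
}

static void mini_skb_pull(struct mini_skb *skb, size_t n)  /* strip a header */
{
    skb->data += n;
    skb->len  -= n;
}

/* Scenario: 100 bytes of payload, push an 8-byte header (as a protocol
 * layer would on send), then pull it off (as the receive path would). */
static int mini_skb_demo(void)
{
    struct mini_skb *skb = mini_alloc_skb(256, 64);
    mini_skb_put(skb, 100);
    mini_skb_push(skb, 8);
    if (skb->len != 108 || skb->data != skb->head + 56)
        return 0;
    mini_skb_pull(skb, 8);
    int ok = (skb->len == 100 && skb->data == skb->head + 64);
    free(skb->head);
    free(skb);
    return ok;
}
```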

2.2 The driver queue used by the Linux network stack (driver queue)

(This section draws on Queueing in the Linux Network Stack by Dan Siemon)

2.2.1 Queue

Between the IP stack and the NIC driver there is a driver queue. It is typically implemented as a FIFO ring buffer of fixed size. The queue does not contain packet data; instead it holds only pointers to socket kernel buffers (SKBs), the structure used throughout the kernel network stack as described in the previous section.
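Such a fixed-size FIFO of SKB pointers can be sketched in a few lines (a toy model for illustration: real rings are much larger and the locking and memory-ordering concerns of a real driver are ignored):

```c
#include <stddef.h>

#define RING_SLOTS 8u   /* illustrative size */

/* A fixed-size FIFO ring holding only pointers, never packet data. */
struct ptr_ring {
    void    *slot[RING_SLOTS];
    unsigned head, tail;    /* head: next dequeue, tail: next enqueue */
};

static int ring_enqueue(struct ptr_ring *r, void *skb)
{
    if (r->tail - r->head == RING_SLOTS)
        return -1;                          /* full: the caller must back off */
    r->slot[r->tail++ % RING_SLOTS] = skb;
    return 0;
}

static void *ring_dequeue(struct ptr_ring *r)
{
    if (r->head == r->tail)
        return NULL;                        /* empty */
    return r->slot[r->head++ % RING_SLOTS];
}

/* Scenario: enqueue two "packets", dequeue them, observe FIFO order. */
static int ring_demo(void)
{
    struct ptr_ring r = {{0}, 0, 0};
    int a = 1, b = 2;
    if (ring_enqueue(&r, &a) || ring_enqueue(&r, &b))
        return 0;
    return ring_dequeue(&r) == &a && ring_dequeue(&r) == &b
        && ring_dequeue(&r) == NULL;
}
```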

Packets enter this queue after the IP stack has finished processing them; they were either generated by a local application or routed into this machine for forwarding. Packets placed on the queue by the IP stack are dequeued by the network device driver (hardware driver) and handed over a data channel to the NIC hardware for transmission.

When TSO/GSO is not used, the length of any packet the IP stack hands to the queue must not exceed the MTU.

2.2.2 SKB size – the default maximum is the NIC MTU

Most network adapters have a fixed maximum transmission unit (MTU), the size of the largest frame the device can transmit. For Ethernet the default is 1500 bytes, though some Ethernet networks support jumbo frames of up to 9000 bytes. Inside the IP stack, the MTU bounds the size of the packets that can be sent to the NIC. For example, if an application writes a block of data larger than 1500 bytes to a TCP socket, the IP stack must split it into multiple IP packets so that each packet is at most 1500 bytes. Thus, for large data transfers, a relatively small MTU causes a large number of small packets to be generated and passed into the driver queue. This splitting at the IP layer is known as IP fragmentation.

The following figure shows an IP packet with a 1500-byte payload being fragmented for MTUs of 1000 and 600:
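Assuming a 20-byte IPv4 header (no options) and the 8-byte fragment-offset granularity, the fragment counts in the figure can be reproduced (the function is an illustrative calculation, not kernel code):

```c
#define IPV4_HDR 20u   /* header length without options */

/* Number of fragments for a given payload and MTU: each fragment
 * except the last carries a data length that is a multiple of 8 bytes
 * (the fragment-offset unit), rounded down from MTU minus the header. */
static unsigned ip_fragment_count(unsigned payload, unsigned mtu)
{
    unsigned per_frag = ((mtu - IPV4_HDR) / 8) * 8;
    return (payload + per_frag - 1) / per_frag;   /* ceiling division */
}
```

For a 1500-byte payload this gives 2 fragments at MTU 1000 (976 + 524 bytes of data) and 3 fragments at MTU 600 (576 + 576 + 348).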
