A brief analysis of network performance optimization method under Linux


Overview

Network behavior can be divided along three paths: 1) the send path, 2) the forwarding path, and 3) the receive path, and network performance optimization can be considered along these three paths. Since packet forwarding mainly concerns devices with routing functionality, it is not covered in this article; interested readers can study it on their own (the Linux kernel uses hash-based routing lookup and dynamic trie-based routing lookup, respectively). This article focuses on optimization methods for the send path and the receive path. NAPI is essentially a receive-path optimization, but because it appeared early in the Linux kernel and is the foundation of the later optimizations, it is analyzed on its own.

The most basic NAPI

The core idea of NAPI is that on a busy network there is no need to raise an interrupt for every arriving packet, because high-frequency interrupts can hurt overall system efficiency. Imagine a scenario: with a standard 100M NIC we might actually achieve a receive rate of 80Mbit/s; with an average packet length of 1500 bytes, the number of interrupts per second is:

80M bits/s / (8 bits/byte * 1500 bytes) = 6,667 interrupts/s

6,667 interrupts per second puts enormous pressure on the system. In this situation it pays to switch to polling rather than interrupts. Polling, however, is inefficient when traffic is light, so under low traffic an interrupt-based approach is more appropriate. This is why NAPI exists: use interrupts to receive packets under low traffic, and polling under high traffic.

NIC drivers in the kernel now support NAPI almost universally. As the narrative above shows, NAPI is suited to handling packets at high rates, and its benefits are:

    • Interrupt mitigation. As the example above shows, under high traffic the NIC can generate thousands of interrupts per second, and having the system service each one is a heavy burden. NAPI disables the NIC's receive interrupt while polling, reducing the system's interrupt-handling load.
    • Packet throttling. Pre-NAPI Linux NIC drivers always generated an IRQ on packet reception, then in the interrupt service routine appended the skb to the local softnet queue and raised the local NET_RX_SOFTIRQ softirq for later processing. If packets arrive too fast, then because IRQs have higher priority than softirqs, most system resources go to servicing interrupts; yet the softnet queue is bounded, so the excess packets are dropped anyway, meaning this model spends precious system resources doing useless work. NAPI in such cases drops packets at the NIC itself, before they ever reach the kernel, so packets that must be dropped are discarded as early as possible; the kernel never even sees them, which reduces its load.

Using NAPI generally involves the following steps:

  1. In the interrupt handler, first disable the receive interrupt and tell the network subsystem that packets will soon be fetched by polling. Disabling the receive interrupt is entirely a matter of hardware-specific operations, while telling the kernel to poll is done with the function netif_rx_schedule(), or the pair shown below, where netif_rx_schedule_prep() determines whether polling mode can now be entered: Listing 1. Scheduling a NIC into polling mode
             void netif_rx_schedule(struct net_device *dev);

             /* ... or ... */
             if (netif_rx_schedule_prep(dev))
                     __netif_rx_schedule(dev);
  2. Create a polling function in the driver; its job is to fetch packets from the NIC and feed them to the network subsystem. Its prototype is: Listing 2. The NAPI poll method
        int (*poll)(struct net_device *dev, int *budget);

    After the NIC has been switched to polling mode, this poll() method processes the packets in the receive queue; once the queue is empty, the driver switches back to interrupt mode. Switching back requires turning polling mode off, which is done with netif_rx_complete(), and re-enabling the NIC's receive interrupt.

    Listing 3. Exiting polling mode
             void netif_rx_complete(struct net_device *dev);
  3. The polling function created in the driver needs to be associated with the actual network device struct net_device; this is typically done when the NIC is initialized, with sample code as follows: Listing 4. Setting the NIC up for polling mode
       dev->poll = my_poll;
       dev->weight = 64;

    There is one more field here, the weight. This value has no strict requirement and is really an empirical figure: for a 10Mb NIC it is generally set to 16, and for faster NICs to 64. The sketch below combines these steps.
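Putting the steps together, here is a minimal sketch of an old-style NAPI driver written against the pre-2.6.24 interface described above. The device-specific helpers (my_disable_rx_irq(), my_enable_rx_irq(), my_fetch_packet(), my_rx_pending()) are hypothetical placeholders for register-level operations, not real kernel functions:

    /* Minimal sketch of the pre-2.6.24 NAPI pattern; device helpers
     * are hypothetical placeholders. */
    static irqreturn_t my_interrupt(int irq, void *dev_id)
    {
            struct net_device *dev = dev_id;

            if (netif_rx_schedule_prep(dev)) {
                    my_disable_rx_irq(dev);     /* hardware-specific */
                    __netif_rx_schedule(dev);   /* queue dev for polling */
            }
            return IRQ_HANDLED;
    }

    static int my_poll(struct net_device *dev, int *budget)
    {
            int quota = min(*budget, dev->quota);
            int done = 0;
            struct sk_buff *skb;

            while (done < quota && (skb = my_fetch_packet(dev)) != NULL) {
                    netif_receive_skb(skb);     /* feed the network subsystem */
                    done++;
            }
            *budget -= done;
            dev->quota -= done;

            if (!my_rx_pending(dev)) {          /* queue drained */
                    netif_rx_complete(dev);     /* leave polling mode */
                    my_enable_rx_irq(dev);      /* back to interrupt mode */
                    return 0;
            }
            return 1;                           /* more work: keep polling */
    }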

Some related NAPI interfaces

Here are some of the interfaces of the NAPI feature, most of which appeared above; let us look at them briefly:

netif_rx_schedule(dev)

Called in the NIC's interrupt handler to switch the NIC's receive mode to polling

netif_rx_schedule_prep(dev)

When the NIC is up and running, marks the NIC as ready to be added to the polling list; it can be seen as the first half of netif_rx_schedule(dev)

__netif_rx_schedule(dev)

Adds the device to the polling list, provided that netif_rx_schedule_prep(dev) has returned 1

__netif_rx_schedule_prep(dev)

Similar to netif_rx_schedule_prep(dev), but does not check whether the NIC device is up and running; its use is not recommended

netif_rx_complete(dev)

Removes the NIC interface from the polling list; typically called after the poll function has finished.

__netif_rx_complete(dev)

Similar to netif_rx_complete(dev), but the caller must ensure that local interrupts are disabled

Newer newer NAPI

In fact, the name NAPI (New API) was already something of a joke; clearly the Linux kernel geeks have far less control over names than over code. Hence, after two successive refactorings of NAPI, the result was dubbed newer newer NAPI.

In the original NAPI implementation there are two fields in struct net_device, the poll function poll() and the weight. The so-called newer newer NAPI is the result of several refactorings of that implementation after kernel 2.6.24, whose core idea is to separate the NAPI-related functionality from net_device. This reduces coupling and makes the code more flexible, because the NAPI information is decoupled from any specific network device and is no longer in a one-to-one relationship with it. For example, some network adapters provide multiple ports, but all ports share the same receive interrupt; with the decoupled design, only one copy of the NAPI information needs to be stored and shared by all the ports, so the code framework maps better onto the real hardware capabilities. The central structure of newer newer NAPI is napi_struct:

Listing 5. The napi_struct structure
/*
 * Structure for NAPI scheduling similar to tasklet but with weighting
 */
struct napi_struct {
        /* The poll_list must only be managed by the entity which
         * changes the state of the NAPI_STATE_SCHED bit. This means
         * whoever atomically sets that bit can add this napi_struct
         * to the per-cpu poll_list, and whoever clears that bit
         * can remove from the list right before clearing the bit.
         */
        struct list_head        poll_list;

        unsigned long           state;
        int                     weight;
        int                     (*poll)(struct napi_struct *, int);
#ifdef CONFIG_NETPOLL
        spinlock_t              poll_lock;
        int                     poll_owner;
#endif
        unsigned int            gro_count;

        struct net_device       *dev;
        struct list_head        dev_list;
        struct sk_buff          *gro_list;
        struct sk_buff          *skb;
};

For readers familiar with the old NAPI interface, the fields poll_list, state, weight, poll, and dev need no explanation; gro_count and gro_list will be described below under GRO. What deserves attention is the biggest difference from the previous NAPI implementation: the structure is no longer part of net_device. In fact, the NIC driver is now expected to allocate and manage the NAPI instances itself, usually keeping them in the driver's private data. The main advantage is that, if the driver wishes, it can create multiple napi_struct instances; since more and more hardware supports multiple receive queues, multiple napi_struct instances make multi-queue usage more efficient.
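For example, a driver for a multi-queue NIC might embed one napi_struct per receive queue in its private data. The layout below is a hypothetical illustration, not taken from any particular driver:

    /* Hypothetical private data for a NIC with several receive queues,
     * each queue carrying its own NAPI instance. */
    #define MY_NUM_RX_QUEUES 4

    struct my_rx_queue {
            struct napi_struct napi;
            /* ring buffer, registers, statistics, ... */
    };

    struct my_priv {
            struct net_device *dev;
            struct my_rx_queue rxq[MY_NUM_RX_QUEUES];
    };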

Compared with the initial NAPI, the registration of the polling function has changed; the new interface is:

void netif_napi_add(struct net_device *dev, struct napi_struct *napi,
                    int (*poll)(struct napi_struct *, int), int weight);

For anyone familiar with the old NAPI interface, this function needs no further explanation.

It is worth noting that the prototype of the poll() method has also required some small changes:

    int (*poll)(struct napi_struct *napi, int budget);

Most NAPI-related functions also changed their previous prototypes. Below is the API for enabling polling:

    void netif_rx_schedule(struct net_device *dev,
                           struct napi_struct *napi);
    /* ... or ... */
    int netif_rx_schedule_prep(struct net_device *dev,
                               struct napi_struct *napi);
    void __netif_rx_schedule(struct net_device *dev,
                             struct napi_struct *napi);

Polling is turned off with:

    void netif_rx_complete(struct net_device *dev,
                           struct napi_struct *napi);

Because there may be multiple napi_struct instances, and each must be able to be enabled or disabled independently, the driver author must ensure that all napi_struct instances are disabled when the NIC interface is closed.

The functions netif_poll_enable() and netif_poll_disable() are no longer needed, because polling management is no longer tied to net_device; they are replaced by the following two functions:

    void napi_enable(struct napi_struct *napi);
    void napi_disable(struct napi_struct *napi);
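As a minimal sketch of how these pieces fit together under the 2.6.24-era interface, consider the fragment below. The device-specific helpers (my_fetch_packet(), my_enable_rx_irq()) are hypothetical placeholders, and error handling is omitted:

    /* Sketch of a newer newer NAPI driver; helpers are hypothetical. */
    struct my_priv {
            struct net_device *dev;
            struct napi_struct napi;
    };

    static int my_poll(struct napi_struct *napi, int budget)
    {
            struct my_priv *priv = container_of(napi, struct my_priv, napi);
            int done = 0;
            struct sk_buff *skb;

            while (done < budget && (skb = my_fetch_packet(priv)) != NULL) {
                    netif_receive_skb(skb);
                    done++;
            }

            if (done < budget) {                    /* queue drained */
                    netif_rx_complete(priv->dev, napi);
                    my_enable_rx_irq(priv);         /* back to interrupts */
            }
            return done;                            /* work actually done */
    }

    static void my_init(struct net_device *dev, struct my_priv *priv)
    {
            /* register the instance at probe time, enable it at open */
            netif_napi_add(dev, &priv->napi, my_poll, 64);
            napi_enable(&priv->napi);
    }

    static void my_close(struct my_priv *priv)
    {
            napi_disable(&priv->napi);  /* every instance, before close */
    }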

Optimization on the send path: TSO (TCP segmentation offload)

TSO (TCP segmentation offload) is a technique that has the NIC split large data packets, reducing CPU load. It is also called LSO (Large segment offload); when the packet type is TCP only, it is called TSO. For the hardware to support TSO, it must also support hardware TCP checksum calculation and scatter-gather.

As you can see, implementing TSO requires some basic conditions, which are met by software and hardware working together. For the hardware this means, concretely, that it can split a large packet into fragments and attach the appropriate headers to each fragment. Supporting TSO mainly involves the following steps:

  • If the network adapter supports TSO, the NIC must declare that support, which is done by setting the NETIF_F_TSO flag in the features field of the net_device structure. For example, in the driver for the benet (drivers/net/benet/be_main.c) NIC, the code that sets NETIF_F_TSO looks like this: Listing 6. The benet NIC driver declares TSO support
     static void be_netdev_init(struct net_device *netdev)
     {
             struct be_adapter *adapter = netdev_priv(netdev);

             netdev->features |= NETIF_F_SG | NETIF_F_HW_VLAN_RX |
                     NETIF_F_TSO | NETIF_F_HW_VLAN_TX |
                     NETIF_F_HW_VLAN_FILTER | NETIF_F_HW_CSUM |
                     NETIF_F_GRO | NETIF_F_TSO6;

             netdev->vlan_features |= NETIF_F_SG | NETIF_F_TSO |
                     NETIF_F_HW_CSUM;

             netdev->flags |= IFF_MULTICAST;

             adapter->rx_csum = true;

             /* Default settings for Rx and Tx flow control */
             adapter->rx_fc = true;
             adapter->tx_fc = true;

             netif_set_gso_max_size(netdev, 65535);

             BE_SET_NETDEV_OPS(netdev, &be_netdev_ops);

             SET_ETHTOOL_OPS(netdev, &be_ethtool_ops);

             netif_napi_add(netdev, &adapter->rx_eq.napi, be_poll_rx,
                     BE_NAPI_WEIGHT);
             netif_napi_add(netdev, &adapter->tx_eq.napi, be_poll_tx_mcc,
                     BE_NAPI_WEIGHT);

             netif_carrier_off(netdev);
             netif_stop_queue(netdev);
     }

    In this code the gso_max_size field of net_device is also set, via the netif_set_gso_max_size() function. This field states the largest buffer the network interface can handle at once, typically 64KB; as long as the TCP payload does not exceed 64KB, the kernel does not need to segment it and can push it to the network interface in one go, leaving the segmentation to the interface.

  • When a TCP socket is created, one of its duties is to record the connection's capabilities. At the network layer a socket is represented by struct sock, whose field sk_route_caps marks those capabilities; after TCP's three-way handshake completes, this field is set according to the capabilities of the network interface and of the connection. Listing 7. Network-layer setup for TSO support
    /* This will initiate an outgoing connection. */
    int tcp_v4_connect(struct sock *sk, struct sockaddr *uaddr, int addr_len)
    {
            ......

            /* OK, now commit destination to socket.  */
            sk->sk_gso_type = SKB_GSO_TCPV4;
            sk_setup_caps(sk, &rt->dst);

            ......
    }

    The sk_setup_caps() function here sets the sk_route_caps field mentioned above, and also checks whether the hardware supports scatter-gather and hardware checksum calculation. These two features are needed because the buffer may not sit on a single memory page, which calls for scatter-gather, and every fragment produced by segmentation needs its checksum recalculated, which calls for hardware checksum support.

  • Now all the preparation is done. When data actually has to be transmitted, the gso_max_size set earlier comes into play. We know that when TCP sends data down to the IP layer it takes the MSS into account, so that the IP packets sent fit within the MTU and need no fragmentation. TSO's gso_max_size affects this process, chiefly when the mss_now field is computed. If the kernel does not support TSO, the maximum value of mss_now is "MTU - hlens"; with TSO, the maximum value of mss_now is "gso_max_size - hlens", which opens up the path from the network layer down to the driver, as the sketch after this list illustrates.
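To make the mss_now difference concrete, here is an illustrative simplification (not the kernel's actual tcp_current_mss() logic; the helper name and the direct NETIF_F_TSO test are assumptions made for the sake of the example):

    /* Illustrative only: the largest chunk TCP will hand down at once. */
    static unsigned int my_max_tcp_payload(struct sock *sk,
                                           unsigned int mtu,
                                           unsigned int hlens)
    {
            if (sk->sk_route_caps & NETIF_F_TSO)
                    /* hardware segments: up to gso_max_size at once */
                    return sk->sk_gso_max_size - hlens;

            /* no TSO: every segment must fit within the MTU */
            return mtu - hlens;
    }

With a 1500-byte MTU and roughly 52 bytes of headers, the no-TSO ceiling is about 1448 bytes per push, while a TSO push can carry close to 64KB.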
GSO (Generic segmentation offload)

TSO lets the network stack push a large buffer to the NIC, which then performs the segmentation, reducing the CPU's load. But TSO needs hardware to implement the segmentation, and since the performance gain comes mainly from deferring segmentation and lightening the CPU, the idea can be generalized: delay segmentation as long as possible. In Linux this generalization is called GSO (Generic segmentation offload). It is more generic than TSO because it does not require hardware segmentation support. For hardware that does support TSO, the GSO machinery is used first and the NIC's hardware segmentation then performs the split; for a NIC without TSO support, the segmentation is performed at the last moment before the data is pushed to the NIC, that is, right before the driver's xmit function is called.

Let us look at the points in the kernel at which a packet may be segmented:

    1. In the transport protocol, when the skb is constructed for queueing
    2. In the transport protocol, when the NETIF_F_GSO feature is in use, just before the packet is handed to the NIC driver
    3. In the driver, when the driver supports TSO (the NETIF_F_TSO flag is set)

When GSO is supported, either case 2 alone or cases 2 and 3 together apply: the former when the hardware does not support TSO, the latter when it does.

In the code, dev_gso_segment() is called inside dev_hard_start_xmit() to perform the segmentation, deferring the moment of segmentation to improve performance:

Listing 8. Segmentation in GSO
int dev_hard_start_xmit(struct sk_buff *skb, struct net_device *dev,
                        struct netdev_queue *txq)
{
        ......
        if (netif_needs_gso(dev, skb)) {
                if (unlikely(dev_gso_segment(skb)))
                        goto out_kfree_skb;
                if (skb->next)
                        goto gso;
        } else {
                ......
        }
        ......
}

Optimization on the receive path: LRO (Large Receive Offload)

Linux added LRO (Large Receive Offload) for the IPv4 TCP protocol in 2.6.24. It works by aggregating multiple TCP segments, at an opportune later moment, into a single skb structure that is passed to the upper layers of the network stack as one large packet, reducing the per-skb processing overhead in the upper protocol stack and improving the system's capacity for receiving TCP packets.
All of this, of course, requires NIC driver support. To understand how LRO works, you need to know how the sk_buff structure stores its payload; in the kernel, sk_buff can hold the real payload in three ways:

    1. The data is stored in the kmalloc-allocated memory buffer pointed to by skb->data; this is often called the linear data area, and its length is given by the function skb_headlen()
    2. The data is stored in the memory pages referenced by the frags member of the shared structure skb_shared_info, which immediately follows the skb linear data area; the number of skb_frag_t entries is given by nr_frags, and each skb_frag_t records the data's offset within its memory page and the size of the data area
    3. The data is stored in the skb fragment queue referenced by the frag_list member of skb_shared_info

Merging multiple skbs into a super skb lets the data traverse the network stack once instead of many times, which clearly reduces CPU load.
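As an illustration of these three storage areas, here is a sketch that walks an skb's payload using standard helpers; field names follow the 2.6-era skb_frag_t, and the traversal is simplified:

    /* Illustrative walk over the three payload areas of an sk_buff. */
    static void my_walk_skb(struct sk_buff *skb)
    {
            struct skb_shared_info *shinfo = skb_shinfo(skb);
            struct sk_buff *p;
            int i;

            /* 1. linear data area: skb->data, length skb_headlen(skb) */
            printk(KERN_INFO "linear: %u bytes\n", skb_headlen(skb));

            /* 2. paged fragments described by skb_frag_t entries */
            for (i = 0; i < shinfo->nr_frags; i++)
                    printk(KERN_INFO "frag %d: offset %u, size %u\n", i,
                           shinfo->frags[i].page_offset,
                           shinfo->frags[i].size);

            /* 3. further skbs chained on the frag_list */
            for (p = shinfo->frag_list; p; p = p->next)
                    printk(KERN_INFO "frag_list skb: %u bytes\n", p->len);
    }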

The core structure of LRO is as follows:

Listing 9. The core structure of LRO
/*
 * Large Receive Offload (LRO) Manager
 *
 * Fields must be set by driver
 */
struct net_lro_mgr {
        struct net_device *dev;
        struct net_lro_stats stats;

        /* LRO features */
        unsigned long features;
#define LRO_F_NAPI            1  /* Pass packets to stack via NAPI */
#define LRO_F_EXTRACT_VLAN_ID 2  /* Set flag if VLAN IDs are extracted
                                    from received packets and eth protocol
                                    is still ETH_P_8021Q */

        /*
         * Set for generated SKBs that are not added to
         * the frag list in fragmented mode
         */
        u32 ip_summed;
        u32 ip_summed_aggr;     /* Set in aggregated SKBs: CHECKSUM_UNNECESSARY
                                 * or CHECKSUM_NONE */

        int max_desc;           /* Max number of LRO descriptors  */
        int max_aggr;           /* Max number of LRO packets to be aggregated */

        int frag_align_pad;     /* Padding required to properly align layer 3
                                 * headers in generated skb when using frags */

        struct net_lro_desc *lro_arr;  /* Array of LRO descriptors */

        /*
         * Optimized driver functions
         *
         * get_skb_header: returns tcp and ip header for packet in SKB
         */
        int (*get_skb_header)(struct sk_buff *skb, void **ip_hdr,
                              void **tcpudp_hdr, u64 *hdr_flags, void *priv);

        /* hdr_flags: */
#define LRO_IPV4 1 /* ip_hdr is IPv4 header */
#define LRO_TCP  2 /* tcpudp_hdr is TCP header */

        /*
         * get_frag_header: returns mac, tcp and ip header for packet in SKB
         *
         * @hdr_flags: Indicate what kind of LRO has to be done
         *             (IPv4/IPv6/TCP/UDP)
         */
        int (*get_frag_header)(struct skb_frag_struct *frag, void **mac_hdr,
                               void **ip_hdr, void **tcpudp_hdr,
                               u64 *hdr_flags, void *priv);
};

In the struct:

dev: points to the network device that supports LRO

stats: contains statistics for inspecting how the LRO feature is operating

features: controls how LRO hands packets to the network stack; LRO_F_NAPI says the driver is NAPI-compliant and should use the netif_receive_skb() function, and LRO_F_EXTRACT_VLAN_ID says the driver supports VLANs

ip_summed: indicates whether the network stack needs to verify the checksum

ip_summed_aggr: indicates whether the network stack needs to verify the checksum of the aggregated large packet

max_desc: the maximum number of LRO descriptors; note that each LRO descriptor describes one TCP flow, so this value bounds the number of TCP flows that can be handled simultaneously

max_aggr: the maximum number of packets that will be aggregated into one super packet

lro_arr: the descriptor array; the driver must supply enough memory for it itself, or handle the exception when memory is insufficient

get_skb_header()/get_frag_header(): used to locate the IP or TCP headers quickly; a driver generally provides only one of the two implementations

Ordinarily a driver delivers packets with netif_rx() or netif_receive_skb(), but a driver that supports LRO uses the functions below instead. These two functions classify incoming packets against the LRO descriptors: if a packet can be aggregated, it is merged into a super packet; otherwise it is handed directly to the kernel and takes the normal path. lro_receive_frags() is needed because some drivers place the packet data straight into memory pages and construct the sk_buff afterwards; such drivers should use that interface:

Listing 10. LRO packet receive functions
void lro_receive_skb(struct net_lro_mgr *lro_mgr,
                     struct sk_buff *skb,
                     void *priv);

void lro_receive_frags(struct net_lro_mgr *lro_mgr,
                       struct skb_frag_struct *frags,
                       int len, int true_size,
                       void *priv, __wsum sum);

Because LRO waits to aggregate up to max_aggr packets, in some situations it can introduce unacceptable delay. In those cases a partially aggregated packet can be flushed straight to the network stack with the functions below; they can also pass a special packet directly to the network stack, bypassing LRO:

Listing 11. LRO flush functions
    void lro_flush_all(struct net_lro_mgr *lro_mgr);

    void lro_flush_pkt(struct net_lro_mgr *lro_mgr,
                       struct iphdr *iph,
                       struct tcphdr *tcph);
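A minimal sketch of wiring LRO into a NAPI driver's receive path follows. The descriptor-array size, the my_get_skb_header() callback, and my_fetch_packet() are illustrative assumptions, not prescriptions:

    /* Hypothetical LRO setup; sizes and helpers are illustrative. */
    #define MY_LRO_MAX_DESC 8

    static struct net_lro_desc my_lro_desc[MY_LRO_MAX_DESC];
    static struct net_lro_mgr my_lro_mgr;

    static void my_lro_init(struct net_device *dev)
    {
            my_lro_mgr.dev            = dev;
            my_lro_mgr.features       = LRO_F_NAPI;
            my_lro_mgr.ip_summed      = CHECKSUM_UNNECESSARY;
            my_lro_mgr.ip_summed_aggr = CHECKSUM_UNNECESSARY;
            my_lro_mgr.max_desc       = MY_LRO_MAX_DESC;
            my_lro_mgr.max_aggr       = 32;
            my_lro_mgr.lro_arr        = my_lro_desc;
            my_lro_mgr.get_skb_header = my_get_skb_header; /* finds IP/TCP */
    }

    /* In the poll() routine: feed each skb to the LRO manager, then
     * flush whatever is still aggregated before leaving polling mode. */
    static int my_poll_rx(struct napi_struct *napi, int budget)
    {
            int done = 0;
            struct sk_buff *skb;

            while (done < budget && (skb = my_fetch_packet()) != NULL) {
                    lro_receive_skb(&my_lro_mgr, skb, NULL);
                    done++;
            }
            lro_flush_all(&my_lro_mgr);  /* bound the added latency */
            return done;
    }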
GRO (Generic Receive Offload)

As described above, the core of LRO is to aggregate multiple packets into one large packet on the receive path before passing it to the network stack, but the LRO implementation has some flaws:

    • Merging packets can destroy some state information
    • The merge conditions are too loose, so in some cases packets that ought to be kept distinct are merged as well, which is unacceptable for a router
    • Bridging is needed under virtualization, but LRO makes the bridging function unusable
    • The implementation supports only the IPv4 TCP protocol

The answer to these problems is the newer GRO (Generic Receive Offload). First, GRO's merge criteria are stricter and more flexible, and the design anticipates support for all transport protocols, so new drivers should use the GRO interface rather than LRO; the kernel may remove LRO once all drivers that use it have migrated to GRO. David S. Miller, the maintainer of the Linux network subsystem, has made it clear that a NIC driver should do two things: first, use the NAPI interface for interrupt mitigation and simple mutual exclusion; second, pass packets to the network stack through GRO's NAPI interface.

Each NAPI instance carries a list of GRO packets, gro_list, onto which received packets are accumulated; the GRO layer uses it to distribute the aggregated packets to the protocol layers. In turn, every protocol layer that supports GRO needs to implement its own gro_receive and gro_complete methods.

Listing 12. The protocol layer's GRO/GSO interface
struct packet_type {
        __be16                  type;   /* This is really htons(ether_type). */
        struct net_device       *dev;   /* NULL is wildcarded here           */
        int                     (*func)(struct sk_buff *,
                                        struct net_device *,
                                        struct packet_type *,
                                        struct net_device *);
        struct sk_buff          *(*gso_segment)(struct sk_buff *skb,
                                                int features);
        int                     (*gso_send_check)(struct sk_buff *skb);
        struct sk_buff          **(*gro_receive)(struct sk_buff **head,
                                                 struct sk_buff *skb);
        int                     (*gro_complete)(struct sk_buff *skb);
        void                    *af_packet_priv;
        struct list_head        list;
};
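As a concrete example, IPv4 wires its handlers into this structure in net/ipv4/af_inet.c; in 2.6.29-era kernels the registration looks roughly like this:

static struct packet_type ip_packet_type __read_mostly = {
        .type           = cpu_to_be16(ETH_P_IP),
        .func           = ip_rcv,
        .gso_send_check = inet_gso_send_check,
        .gso_segment    = inet_gso_segment,
        .gro_receive    = inet_gro_receive,
        .gro_complete   = inet_gro_complete,
};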

Here gro_receive tries to match an incoming packet against the packets already queued on gro_list; IP and TCP headers are discarded once a match is made. As soon as a packet must be submitted to the upper protocol, the gro_complete method is called, merging the packets on gro_list into one large packet and updating the checksum as well. The implementation does not require GRO to keep aggregating for long: on each NAPI poll operation, the GRO packet list is forcibly flushed to the upper protocol. The biggest difference between GRO and LRO is that GRO preserves the entropy information of every received packet, which is critical for applications such as routers, and it makes support for various protocols possible. In the case of TCP over IPv4, the match conditions are:

    • Source/destination address match
    • ToS/protocol field match
    • Source/destination port match

Many other events also cause the GRO list to pass its aggregated packets up to the protocol layer, for example a mismatched TCP ACK or an out-of-order TCP sequence number.

The interfaces GRO provides are very similar to LRO's, but more concise. From the driver's point of view, only GRO's receive functions are visible, because most of the work is actually done at the protocol layer:

Listing 13. GRO Packet Interface
gro_result_t napi_gro_receive(struct napi_struct *napi, struct sk_buff *skb);

gro_result_t napi_gro_frags(struct napi_struct *napi);
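For a driver, moving from plain delivery to GRO is usually a one-line change in the poll routine. A minimal sketch, where my_fetch_packet() is a hypothetical device-specific helper:

    /* Hypothetical receive loop delivering packets through GRO. */
    static int my_poll_rx(struct napi_struct *napi, int budget)
    {
            int done = 0;
            struct sk_buff *skb;

            while (done < budget && (skb = my_fetch_packet()) != NULL) {
                    /* napi_gro_receive() replaces netif_receive_skb();
                     * the GRO layer aggregates where the protocol allows. */
                    napi_gro_receive(napi, skb);
                    done++;
            }
            return done;
    }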

Summary

From the above analysis we can see that Linux's network performance optimization methods read like an evolutionary history, with each step solving its problem in a more general, more flexible way. From NAPI to newer newer NAPI, from TSO to GSO, from LRO to GRO, each is an evolution from a special case toward a more general solution, and it is precisely this incremental yet continuous evolution that keeps Linux vigorous.

Source: http://www.ibm.com/developerworks/cn/linux/l-cn-network-pt/
