Optimization on the sending path
TSO (TCP segmentation offload)
TSO (TCP segmentation offload) is a technique that offloads the splitting of large data packets to the network card, reducing CPU load. It is also called LSO (large segment offload); when the packet type is restricted to TCP, it is called TSO. A NIC that supports TSO must also support hardware TCP checksum calculation and the scatter-gather function.
As we can see, TSO has some prerequisites and is implemented by hardware and software working together. Specifically, the hardware must be able to split a large packet into segments and attach the appropriate headers to each segment. Supporting TSO requires the following steps:
1. If the network adapter supports the TSO function, it must declare this capability by setting the NETIF_F_TSO flag in the features field of its net_device structure. For example, in the benet NIC driver (drivers/net/benet/be_main.c), NETIF_F_TSO is set as follows:
The benet NIC driver declares support for the TSO function:
static void be_netdev_init(struct net_device *netdev)
{
	struct be_adapter *adapter = netdev_priv(netdev);

	netdev->features |= NETIF_F_SG | NETIF_F_HW_VLAN_RX | NETIF_F_TSO |
		NETIF_F_HW_VLAN_TX | NETIF_F_HW_VLAN_FILTER | NETIF_F_HW_CSUM |
		NETIF_F_GRO | NETIF_F_TSO6;

	netdev->vlan_features |= NETIF_F_SG | NETIF_F_TSO | NETIF_F_HW_CSUM;

	netdev->flags |= IFF_MULTICAST;

	adapter->rx_csum = true;

	/* Default settings for RX and TX flow control */
	adapter->rx_fc = true;
	adapter->tx_fc = true;

	netif_set_gso_max_size(netdev, 65535);

	BE_SET_NETDEV_OPS(netdev, &be_netdev_ops);

	SET_ETHTOOL_OPS(netdev, &be_ethtool_ops);

	netif_napi_add(netdev, &adapter->rx_eq.napi, be_poll_rx,
		BE_NAPI_WEIGHT);
	netif_napi_add(netdev, &adapter->tx_eq.napi, be_poll_tx_mcc,
		BE_NAPI_WEIGHT);

	netif_carrier_off(netdev);
	netif_stop_queue(netdev);
}
In this code, the netif_set_gso_max_size function sets the gso_max_size field of net_device. This field indicates the maximum buffer size the network interface can process at one time, and is usually 64 KB. As long as the TCP payload does not exceed 64 KB, the kernel need not split it; it simply pushes the whole buffer to the network interface at once, and the interface performs the segmentation.
2. When a TCP socket is created, one of its responsibilities is to record the connection capabilities. The socket at the network layer is represented by struct sock, whose sk_route_caps field holds these capabilities. The field is set after the TCP three-way handshake completes, based on the capabilities of the network interface and of the connection.
How the network layer sets up support for the TSO function:
/* This will initiate an outgoing connection. */
int tcp_v4_connect(struct sock *sk, struct sockaddr *uaddr, int addr_len)
{
	......
	/* OK, now commit destination to socket. */
	sk->sk_gso_type = SKB_GSO_TCPV4;
	sk_setup_caps(sk, &rt->dst);
	......
}
The sk_setup_caps() function in this code sets the sk_route_caps field mentioned above. It also checks whether the hardware supports the scatter-gather and hardware checksum functions. These two features are required because the buffer may not reside in a single memory page, which calls for scatter-gather, and because each segment needs its checksum recalculated after splitting, which calls for hardware checksum support.
3. Now all preparations are ready, and the configured gso_max_size comes into play when data is actually transmitted. We know that when TCP sends data down to the IP layer, it honors the MSS, so that the resulting IP packets fit in the MTU and need no fragmentation. The gso_max_size set for TSO affects this process, mainly in the calculation of mss_now. If the kernel does not support TSO, the maximum value of mss_now is "MTU - hlens"; when TSO is supported, the maximum value is "gso_max_size - hlens". With this, the path from the network layer down to the driver is connected.
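As a rough illustration of this clamping, here is a userspace model (not kernel code; the function name and the 40-byte header assumption are mine, standing in for "hlens"):

```c
#include <assert.h>

/* Toy model of the mss_now ceiling described above.
 * Header sizes are illustrative: 20-byte IPv4 + 20-byte TCP. */
#define HLENS (20 + 20)

/* Without TSO the kernel must emit MTU-sized IP packets; with TSO it
 * may hand the NIC a buffer of up to gso_max_size bytes at once. */
static int mss_now_max(int mtu, int gso_max_size, int tso_enabled)
{
    return (tso_enabled ? gso_max_size : mtu) - HLENS;
}
```

For a standard 1500-byte MTU this gives the familiar 1460-byte MSS without TSO, versus a ceiling of 65495 bytes when the NIC accepts 64 KB buffers.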
GSO (generic segmentation offload)
TSO lets the network protocol stack push a large buffer to the NIC, which then performs the segmentation, reducing CPU load; but TSO requires hardware support for segmentation. The performance gain actually comes from deferring the segmentation and lowering CPU load, so the technique can be generalized: in Linux the generalization is called GSO (generic segmentation offload). GSO is more widely applicable than TSO because it does not require hardware segmentation support. With GSO, segmentation is deferred as long as possible: for hardware that supports TSO, the NIC's hardware segmentation capability is used; for NICs without TSO, segmentation is performed at the last moment before pushing data to the NIC, that is, just before calling the driver's xmit function.
Let us look at the possible points in the kernel where a packet can be segmented:
1. In the transport protocol, when the skb is constructed and queued
2. In the transport protocol, but using the NETIF_F_GSO feature, just before the packet is passed to the NIC driver
3. In the driver, where the driver supports the TSO function (the NETIF_F_TSO flag is set)
When GSO is supported, segmentation happens at point 2 or point 3: point 2 when the hardware does not support TSO, and point 3 when it does.
The kernel calls dev_gso_segment in the dev_hard_start_xmit function to perform the segmentation, thereby deferring the split as long as possible to improve performance:
Segmentation in GSO
int dev_hard_start_xmit(struct sk_buff *skb, struct net_device *dev,
			struct netdev_queue *txq)
{
	......
	if (netif_needs_gso(dev, skb)) {
		if (unlikely(dev_gso_segment(skb)))
			goto out_kfree_skb;
		if (skb->next)
			goto gso;
	} else {
		......
	}
	......
}
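The choice among the three segmentation points above can be sketched as a toy decision function (the flag names F_GSO/F_TSO and the enum are illustrative, not the kernel's NETIF_F_* values):

```c
#include <assert.h>

/* Toy model (not kernel code) of where segmentation happens,
 * chosen from the device's feature flags. */
enum seg_site {
    SEG_IN_TCP,         /* case 1: split when queuing skbs           */
    SEG_BEFORE_DRIVER,  /* case 2: split just before the xmit call   */
    SEG_IN_HARDWARE     /* case 3: NIC splits the packet itself      */
};

#define F_GSO 0x1       /* kernel-side deferred segmentation */
#define F_TSO 0x2       /* NIC segments in hardware          */

static enum seg_site segmentation_site(unsigned features)
{
    if (features & F_TSO)
        return SEG_IN_HARDWARE;
    if (features & F_GSO)
        return SEG_BEFORE_DRIVER;
    return SEG_IN_TCP;
}
```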
Optimization on the receiving path
LRO (large receive offload)
Linux added LRO (large receive offload) support for TCP over IPv4 in kernel 2.6.24. LRO aggregates multiple received TCP segments into one sk_buff structure and later hands this single large packet to the upper network protocol stack, reducing the per-skb processing overhead of the upper layers and improving the system's capacity for receiving TCP packets.
Of course, all of this requires support from the NIC driver. To understand how LRO works, you need to know how the sk_buff structure stores its payload. In the kernel, an sk_buff can store its payload in three ways:
1. Data is stored in a memory buffer allocated by kmalloc and pointed to by skb->data. This area is usually called the linear data area, and its length is given by the function skb_headlen.
2. Data is stored in the memory pages referenced by frags, a member of the struct skb_shared_info located at the end of the skb's linear data area. The number of skb_frag_t entries is given by nr_frags; each skb_frag_t records the offset of the data within its memory page and the size of the data area.
3. Data is stored in the skb fragment queue represented by frag_list, another member of skb_shared_info.
A super skb that merges several skbs can traverse the network protocol stack once instead of several times, which obviously reduces CPU load.
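The three storage areas above can be modelled in user space. The following minimal sketch (the toy_skb type and its field names are hypothetical, not the kernel's sk_buff) sums the payload over the linear area, the page fragments, and the frag_list chain:

```c
#include <assert.h>
#include <stddef.h>

/* Toy model of the three payload areas of an sk_buff:
 * a linear area, page fragments (frags), and a frag_list chain. */
struct toy_skb {
    size_t linear_len;          /* like skb_headlen()            */
    size_t frag_len[4];         /* sizes of up to 4 page frags   */
    int    nr_frags;            /* like skb_shinfo(skb)->nr_frags */
    struct toy_skb *frag_list;  /* chained sub-buffers            */
};

/* Payload held directly by one buffer: linear area plus frags. */
static size_t toy_own_len(const struct toy_skb *skb)
{
    size_t len = skb->linear_len;
    for (int i = 0; i < skb->nr_frags; i++)
        len += skb->frag_len[i];
    return len;
}

/* Total payload including every buffer chained on frag_list. */
static size_t toy_total_len(const struct toy_skb *skb)
{
    size_t len = toy_own_len(skb);
    for (const struct toy_skb *p = skb->frag_list; p; p = p->frag_list)
        len += toy_own_len(p);
    return len;
}
```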
The core structure of LRO is as follows:
Core Structure of LRO
/*
 * Large Receive Offload (LRO) Manager
 *
 * Fields must be set by driver
 */
struct net_lro_mgr {
	struct net_device *dev;
	struct net_lro_stats stats;

	/* LRO features */
	unsigned long features;
#define LRO_F_NAPI            1 /* Pass packets to stack via NAPI */
#define LRO_F_EXTRACT_VLAN_ID 2 /* Set flag if VLAN IDs are extracted
				   from received packets and eth protocol
				   is still ETH_P_8021Q */

	/*
	 * Set for generated SKBs that are not added to
	 * the frag list in fragmented mode
	 */
	u32 ip_summed;
	u32 ip_summed_aggr; /* Set in aggregated SKBs: CHECKSUM_UNNECESSARY
			     * or CHECKSUM_NONE */

	int max_desc; /* Max number of LRO descriptors */
	int max_aggr; /* Max number of LRO packets to be aggregated */

	int frag_align_pad; /* Padding required to properly align layer 3
			     * headers in generated skb when using frags */

	struct net_lro_desc *lro_arr; /* Array of LRO descriptors */

	/*
	 * Optimized driver functions
	 *
	 * get_skb_header: returns tcp and ip header for packet in SKB
	 */
	int (*get_skb_header)(struct sk_buff *skb, void **ip_hdr,
			      void **tcpudp_hdr, u64 *hdr_flags, void *priv);

	/* hdr_flags: */
#define LRO_IPV4 1 /* ip_hdr is IPv4 header */
#define LRO_TCP  2 /* tcpudp_hdr is TCP header */

	/*
	 * get_frag_header: returns mac, tcp and ip header for packet in SKB
	 *
	 * @hdr_flags: Indicate what kind of LRO has to be done
	 *             (IPv4/IPv6/TCP/UDP)
	 */
	int (*get_frag_header)(struct skb_frag_struct *frag, void **mac_hdr,
			       void **ip_hdr, void **tcpudp_hdr,
			       u64 *hdr_flags, void *priv);
};
In this struct:
dev: points to the network device that supports the LRO function.
stats: contains statistics used to inspect the running state of the LRO function.
features: controls how LRO delivers packets to the network protocol stack. LRO_F_NAPI indicates that the driver is NAPI-compatible and should use the netif_receive_skb() function, while LRO_F_EXTRACT_VLAN_ID indicates that the driver supports VLAN.
ip_summed: indicates whether the network protocol stack needs to verify the checksum.
ip_summed_aggr: indicates whether the network protocol stack needs to verify the checksum of the aggregated large packets.
max_desc: the maximum number of LRO descriptors. Note that each descriptor describes one TCP flow, so this value limits the number of TCP flows that can be handled at the same time.
max_aggr: the maximum number of packets aggregated into one super packet.
lro_arr: the descriptor array; the driver must provide enough memory for it, or handle the case where memory is insufficient.
get_skb_header()/get_frag_header(): used to quickly locate the IP or TCP header. A driver normally provides only one of the two implementations.
A driver normally receives packets with netif_rx or netif_receive_skb, but a driver supporting LRO must use the functions below instead. They classify incoming packets according to the LRO descriptors: if a packet can be aggregated, it is merged into a super packet; otherwise it is passed directly to the kernel and follows the normal path. The lro_receive_frags function exists because some drivers place packet data directly into memory pages and only then construct the sk_buff; such drivers should use that second interface:
LRO packet receiving functions
void lro_receive_skb(struct net_lro_mgr *lro_mgr,
		     struct sk_buff *skb,
		     void *priv);

void lro_receive_frags(struct net_lro_mgr *lro_mgr,
		       struct skb_frag_struct *frags,
		       int len, int true_size,
		       void *priv, __wsum sum);
Because LRO waits until max_aggr packets have been aggregated, it may in some situations introduce an unacceptably large delay. In such cases the partially aggregated packets can be flushed to the network protocol stack immediately; alternatively, when a special packet arrives, it can be passed up directly without going through LRO. The following functions serve this purpose:
LRO flush functions
void lro_flush_all(struct net_lro_mgr *lro_mgr);

void lro_flush_pkt(struct net_lro_mgr *lro_mgr,
		   struct iphdr *iph, struct tcphdr *tcph);
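The aggregate-then-flush behaviour described above can be sketched in user space. This toy model (the types and names are mine, not the kernel LRO engine) accumulates packets in a descriptor until max_aggr is reached, then hands one merged length upward:

```c
#include <assert.h>

/* Toy model of descriptor-based aggregation: packets of one TCP flow
 * accumulate until max_aggr is reached, then the merged "super packet"
 * is flushed to the stack. */
struct toy_lro_desc {
    int pkt_aggr;   /* packets merged so far          */
    int total_len;  /* bytes in the pending super packet */
};

static int flushed_len; /* length last handed to the "stack" */

static void toy_lro_flush(struct toy_lro_desc *d)
{
    flushed_len = d->total_len; /* deliver one large packet upward */
    d->pkt_aggr = 0;
    d->total_len = 0;
}

static void toy_lro_receive(struct toy_lro_desc *d, int len, int max_aggr)
{
    d->total_len += len;
    if (++d->pkt_aggr >= max_aggr)
        toy_lro_flush(d);
}
```

A latency-sensitive driver would call the flush step early, which is exactly what the real flush functions allow.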
GRO (generic receive offload)
The core of LRO is that multiple data packets are aggregated into a large data packet on the receiving path and then transmitted to the network protocol stack for processing. However, the implementation of LRO has some flaws:
1. Packet merging may corrupt some state;
2. The merging conditions are too loose, so packets that ought to be kept distinct may also be merged, which is unacceptable for a router;
3. Bridging is required under virtualization, but LRO breaks the bridging function;
4. The implementation supports only TCP over IPv4.
The answer to these problems is the newer GRO (generic receive offload). First, the GRO merge conditions are more rigorous and flexible, and GRO was designed to support all transport protocols. New drivers should therefore use the GRO interfaces rather than LRO; the kernel may remove LRO entirely once all drivers have migrated to GRO. David S. Miller, the maintainer of the Linux network subsystem, has made it clear that a current NIC driver should do two things: first, use the NAPI interface for interrupt mitigation and simple mutual exclusion; second, pass packets to the network protocol stack through GRO's NAPI interface.
Each NAPI instance holds a GRO packet list, gro_list, which accumulates received packets; the GRO layer uses it to distribute aggregated packets to the network protocol layers. Each protocol layer that supports GRO must implement the gro_receive and gro_complete methods.
The interface through which a protocol layer supports GRO/GSO:
struct packet_type {
	__be16			type;	/* This is really htons(ether_type). */
	struct net_device	*dev;	/* NULL is wildcarded here	     */
	int			(*func)(struct sk_buff *,
					struct net_device *,
					struct packet_type *,
					struct net_device *);
	struct sk_buff		*(*gso_segment)(struct sk_buff *skb,
						int features);
	int			(*gso_send_check)(struct sk_buff *skb);
	struct sk_buff		**(*gro_receive)(struct sk_buff **head,
						 struct sk_buff *skb);
	int			(*gro_complete)(struct sk_buff *skb);
	void			*af_packet_priv;
	struct list_head	list;
};
Specifically, gro_receive tries to match an incoming packet against the packets already queued on gro_list; the IP and TCP headers are discarded after a match. When the packet must be delivered to the upper protocol, the gro_complete method merges the packets on gro_list into one large packet and updates the checksum. The implementation does not keep packets aggregated indefinitely: on each NAPI polling round, the GRO packet list is flushed to the upper protocol. The biggest difference between GRO and LRO is that GRO preserves the entropy information of every received packet, which is crucial for applications such as routers, and GRO supports multiple protocols. Taking TCP over IPv4 as an example, the matching conditions are:
1. source/destination address matching;
2. Matching of TOS/protocol fields;
3. source/destination port match.
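The matching conditions above can be modelled with a simple comparison over a flow key (the flow_key type here is hypothetical; the real code compares the headers of the skbs queued on gro_list):

```c
#include <assert.h>
#include <stdint.h>

/* Toy model (not kernel GRO) of the IPv4/TCP match conditions: two
 * packets may be merged only if addresses, TOS/protocol fields and
 * ports all agree. */
struct flow_key {
    uint32_t saddr, daddr;  /* source/destination IP address  */
    uint16_t sport, dport;  /* source/destination TCP port    */
    uint8_t  tos, proto;    /* TOS and protocol fields        */
};

static int gro_can_merge(const struct flow_key *a, const struct flow_key *b)
{
    return a->saddr == b->saddr && a->daddr == b->daddr &&
           a->tos   == b->tos   && a->proto == b->proto &&
           a->sport == b->sport && a->dport == b->dport;
}
```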
Other events also cause the GRO list to flush its aggregated packets to the upper protocol, for example a TCP ACK mismatch or an out-of-order TCP sequence number.
The interfaces GRO provides are very similar to LRO's, but more concise. From the driver's point of view, only the GRO receive functions are visible, because most of the work is actually done at the protocol layer:
GRO packet receiving interface
gro_result_t napi_gro_receive(struct napi_struct *napi, struct sk_buff *skb);
gro_result_t napi_gro_frags(struct napi_struct *napi);