Linux Kernel Network Protocol Stack Learning Notes: Analysis and Testing of the GRO/GSO/LRO/TSO Patches

TSO stands for TCP Segmentation Offload. The MTU of Ethernet is normally 1500 bytes, so after subtracting the TCP/IP headers the TCP MSS (Maximum Segment Size) is 1460 bytes (1500 - 20 bytes of IP header - 20 bytes of TCP header). Normally the protocol stack segments any TCP payload larger than 1460 bytes so that the resulting IP packets do not exceed the MTU. With a NIC that supports TSO/GSO this is unnecessary: up to 64 KB of TCP payload can be pushed straight down the stack, the IP layer does not segment it, and it reaches the network driver as-is. A TSO/GSO-capable NIC then generates the TCP/IP headers and the frame headers itself, offloading a great deal of the work the CPU would otherwise do in the stack, such as memory operations and checksum calculation.

GSO is an enhancement of TSO (http://lwn.net/Articles/188489/). GSO is not limited to TCP but works for any protocol: the idea is to defer segmentation as late as possible, to the moment the packet is about to be handed to the NIC. At that point the kernel checks whether the NIC supports SG and hardware segmentation; if not, segmentation is done in software in the stack, and if so, the large payload is handed to the NIC directly.

# ethtool -k lo
Offload parameters for lo:
rx-checksumming: on
tx-checksumming: on
scatter-gather: on
tcp-segmentation-offload: on
udp-fragmentation-offload: off
generic-segmentation-offload: on
generic-receive-offload: on
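
As a side note, ethtool reads and writes these flags through the SIOCETHTOOL ioctl. Below is a minimal userspace sketch (not from the original article; error handling trimmed) that queries the TSO flag the same way ethtool -k does:

/* Minimal sketch: query the TSO offload flag of "lo" via SIOCETHTOOL. */
#include <stdio.h>
#include <string.h>
#include <unistd.h>
#include <sys/ioctl.h>
#include <sys/socket.h>
#include <net/if.h>
#include <linux/ethtool.h>
#include <linux/sockios.h>

int main(void)
{
	struct ifreq ifr;
	struct ethtool_value eval;
	int fd = socket(AF_INET, SOCK_DGRAM, 0);

	memset(&ifr, 0, sizeof(ifr));
	strncpy(ifr.ifr_name, "lo", IFNAMSIZ - 1);

	eval.cmd = ETHTOOL_GTSO;		/* "get TSO" sub-command */
	ifr.ifr_data = (char *)&eval;

	if (ioctl(fd, SIOCETHTOOL, &ifr) == 0)
		printf("tcp-segmentation-offload: %s\n",
		       eval.data ? "on" : "off");

	close(fd);
	return 0;
}

The matching ETHTOOL_STSO sub-command sets the flag, which is what ethtool -K <dev> tso on|off does.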

At present many NICs support TSO, but very few support UFO. GSO/GRO, on the other hand, are independent of the NIC: they are features of the kernel. GSO delays segmentation all the way until the dev_hard_start_xmit function:

int dev_hard_start_xmit(struct sk_buff *skb, struct net_device *dev,
			struct netdev_queue *txq)
{
	const struct net_device_ops *ops = dev->netdev_ops;
	int rc;
	unsigned int skb_len;

	if (likely(!skb->next)) {
		if (!list_empty(&ptype_all))
			dev_queue_xmit_nit(skb, dev);

		if (netif_needs_gso(dev, skb)) {
			/* The NIC cannot segment this skb itself: segment it
			 * in software; the segments end up on skb->next. */
			if (unlikely(dev_gso_segment(skb)))
				goto out_kfree_skb;
			if (skb->next)
				goto gso;
		}

		......
	}

gso:
	do {
		struct sk_buff *nskb = skb->next;

		/* Unlink the next segment and hand it to the driver. */
		skb->next = nskb->next;
		nskb->next = NULL;
		skb_len = nskb->len;
		rc = ops->ndo_start_xmit(nskb, dev);
		trace_net_dev_xmit(nskb, rc, dev, skb_len);
		if (unlikely(rc != NETDEV_TX_OK)) {
			/* The driver refused it: put the segment back. */
			nskb->next = skb->next;
			skb->next = nskb;
			return rc;
		}
		txq_trans_update(txq);
		if (unlikely(netif_tx_queue_stopped(txq) && skb->next))
			return NETDEV_TX_BUSY;
	} while (skb->next);

	skb->destructor = DEV_GSO_CB(skb)->destructor;

out_kfree_skb:
	kfree_skb(skb);
	return NETDEV_TX_OK;
}


dev_hard_start_xmit uses netif_needs_gso to decide whether the NIC can handle this GSO skb in hardware. If it cannot, dev_gso_segment is called, which in turn calls skb_gso_segment to segment the packet in software; for TCP this ends up in tcp_tso_segment, which finally returns a linked list of sk_buffs whose head is stored in skb->next. If the NIC can handle it itself, the large skb is handed straight to the NIC via netdev_ops->ndo_start_xmit.
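
For reference, in kernels of roughly this vintage netif_needs_gso is a short inline in include/linux/netdevice.h; it looks approximately like this (quoted from memory, so treat it as approximate):

static inline int netif_needs_gso(struct net_device *dev,
				  struct sk_buff *skb)
{
	/* Segment in software if the skb is GSO and either the device
	 * cannot segment this gso_type itself or the checksum is not
	 * being left for the hardware to fill in. */
	return skb_is_gso(skb) &&
	       (!skb_gso_ok(skb, dev->features) ||
		unlikely(skb->ip_summed != CHECKSUM_PARTIAL));
}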

As you can see, netif_needs_gso checks the NIC's netdev->features value. These flags are defined in include/linux/netdevice.h:

#define NETIF_F_SG		1	/* Scatter/gather IO. */
#define NETIF_F_IP_CSUM		2	/* Can checksum TCP/UDP over IPv4. */
#define NETIF_F_NO_CSUM		4	/* Does not require checksum. F.e. loopback. */
#define NETIF_F_HW_CSUM		8	/* Can checksum all the packets. */

#define NETIF_F_FRAGLIST	64	/* Scatter/gather IO. */

#define NETIF_F_GSO		2048	/* Enable software GSO. */

#define NETIF_F_GSO_SHIFT	16
#define NETIF_F_GSO_MASK	0x00ff0000
#define NETIF_F_TSO		(SKB_GSO_TCPV4 << NETIF_F_GSO_SHIFT)
#define NETIF_F_UFO		(SKB_GSO_UDP << NETIF_F_GSO_SHIFT)

For a NIC to support TSO, it needs NETIF_F_SG | NETIF_F_TSO | NETIF_F_IP_CSUM; to support UFO, it should need NETIF_F_SG | NETIF_F_UFO | NETIF_F_IP_CSUM.
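
As a hypothetical illustration (the function name is made up here, not taken from any real driver), a driver would advertise these capabilities in its probe path roughly like so:

/* Hypothetical probe fragment: advertise scatter-gather, IPv4 checksum
 * offload and TSO so the stack may hand the driver unsegmented skbs. */
static void mydrv_setup_features(struct net_device *dev)
{
	dev->features |= NETIF_F_SG | NETIF_F_IP_CSUM | NETIF_F_TSO;
}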


Below is a test of the effect of TSO and GSO on performance. The NIC on my test machine does not support UFO, so I had to test with TCP.

scatter-gather: on
tcp-segmentation-offload: on
udp-fragmentation-offload: off
generic-segmentation-offload: on
generic-receive-offload: on


Recv   Send    Send
Socket Socket  Message  Elapsed
Size   Size    Size     Time     Throughput
bytes  bytes   bytes    secs.    10^6bits/sec

 87380  65536  65536    10.00    26864.51


With TSO turned off (leaving GSO on):

Recv   Send    Send
Socket Socket  Message  Elapsed
Size   Size    Size     Time     Throughput
bytes  bytes   bytes    secs.    10^6bits/sec

 87380  65536  65536    10.00    18626.44


With only GSO, the throughput is not very stable, but on average it is a bit lower than with TSO enabled.

By the way, the benefit of TSO and GSO becomes less apparent as the MTU increases:

# ifconfig lo mtu 65535

Measured again with netperf -t TCP_STREAM, there is not much difference between TSO on and off, maybe around 10%.


GSO's commit is here: http://marc.info/?l=git-commits-head&m=115107854915578&w=2. This patch is very old; newer kernels have changed a lot.

It mainly adds the dev_gso_segment and skb_gso_segment functions and modifies dev_hard_start_xmit and dev_queue_xmit, as mentioned above.


------------------------------------------------------------------------------------

LRO (Large Receive Offload) is a receive-side merging mechanism specific to TCP; GRO (Generic Receive Offload) is an enhanced version of LRO that places stricter restrictions on which skbs may be merged, but is not limited to TCP/IP. This article mainly covers GRO, because LRO misbehaves in IP forwarding and bridging scenarios and is now rarely used.

If a driver has the GRO feature turned on, it calls napi_gro_receive to take in packets instead of the usual netif_receive_skb or netif_rx. As you can see, GRO is tied tightly to NAPI, so let us return to the napi_struct structure we have studied many times before:

struct napi_struct {
	struct list_head	poll_list;

	unsigned long		state;
	int			weight;
	int			(*poll)(struct napi_struct *, int);
#ifdef CONFIG_NETPOLL
	spinlock_t		poll_lock;
	int			poll_owner;
#endif

	unsigned int		gro_count;

	struct net_device	*dev;
	struct list_head	dev_list;
	struct sk_buff		*gro_list;
	struct sk_buff		*skb;
};

napi_struct contains gro_list, a list of skbs; each skb on the list represents one flow, and gro_count is the number of flows.
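
As an illustration, a NAPI poll routine in a GRO-enabled driver looks roughly like the sketch below (the mydrv_* names are placeholders invented here, not from any real driver):

/* Sketch of a GRO-enabled NAPI poll routine. mydrv_fetch_rx_skb() is a
 * hypothetical helper that pulls the next received skb off the RX ring. */
static int mydrv_poll(struct napi_struct *napi, int budget)
{
	int done = 0;
	struct sk_buff *skb;

	while (done < budget && (skb = mydrv_fetch_rx_skb(napi)) != NULL) {
		skb->protocol = eth_type_trans(skb, napi->dev);
		/* Instead of netif_receive_skb(skb): let GRO try to
		 * merge this skb into one of the flows on gro_list. */
		napi_gro_receive(napi, skb);
		done++;
	}

	if (done < budget)
		napi_complete(napi);	/* done, re-enable RX interrupts */

	return done;
}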



napi_gro_receive works its way down to __napi_gro_receive, which traverses napi_struct->gro_list and compares skb->dev and the MAC header of each queued skb against the new one to decide whether they belong to the same flow; the result is recorded in NAPI_GRO_CB->same_flow. The struct napi_gro_cb deserves a mention here: for every skb that GRO processes, a private data structure is kept in skb->cb, and that structure is napi_gro_cb. Note that this private area is just a scratch buffer inside the sk_buff; do not confuse it with skb_shared_info, which lives after the linear data of the sk_buff.

struct napi_gro_cb {
	/* Virtual address of skb_shinfo(skb)->frags[0].page + offset. */
	void *frag0;

	/* Length of frag0. */
	unsigned int frag0_len;

	/* This indicates where we are processing relative to skb->data. */
	int data_offset;

	/* This is non-zero if the packet may be of the same flow. */
	int same_flow;

	/* This is non-zero if the packet cannot be merged with the new skb. */
	int flush;

	/* Number of segments aggregated. */
	int count;

	/* Free the skb? */
	int free;
};
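
The accessor is simply a cast over the skb->cb area; in include/linux/netdevice.h it is defined as:

#define NAPI_GRO_CB(skb) ((struct napi_gro_cb *)(skb)->cb)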

Then __napi_gro_receive calls dev_gro_receive, and dev_gro_receive first calls ptype->gro_receive, which for the IP protocol is inet_gro_receive.

inet_gro_receive mainly does the following:

It first gets the IP header and validates it. If the checks pass, it traverses napi_struct->gro_list and, for each skb that the layer-2 comparison still considers a candidate, compares the IP saddr, daddr, TOS and protocol to judge whether it can be the same flow; on any mismatch same_flow is set to 0. Of course, same_flow alone is not enough to start the merge: a flush judgment is also made, and any flush condition that fires abandons the merge and hands the packet straight up to the protocol stack.
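
Condensed from net/ipv4/af_inet.c (from memory, so take it as approximate), the flow-matching loop looks like:

	for (p = napi->gro_list; p; p = p->next) {
		struct iphdr *iph2;

		if (!NAPI_GRO_CB(p)->same_flow)
			continue;

		iph2 = ip_hdr(p);

		/* XOR-compare protocol, TOS and both addresses in one go;
		 * any difference means this is not the same flow. */
		if ((iph->protocol ^ iph2->protocol) |
		    (iph->tos ^ iph2->tos) |
		    (iph->saddr ^ iph2->saddr) |
		    (iph->daddr ^ iph2->daddr)) {
			NAPI_GRO_CB(p)->same_flow = 0;
			continue;
		}
	}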

After the IP layer is done, gro_receive continues at the TCP layer with tcp_gro_receive. Its core also traverses napi_struct->gro_list, judging same_flow based on the source and destination ports, and computes the conditions that require a flush. The requirement on the ACK field is worth mentioning: a matching ack_seq indicates that the packets are segments of the same TCP payload produced by TSO/GSO, so it is a necessary condition for merging.
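	
The port comparison in tcp_gro_receive uses a nice trick: source and destination are adjacent 16-bit fields, so both are compared with one 32-bit XOR. Approximately (again a condensed excerpt from memory of net/ipv4/tcp.c):

	for (; (p = *head); head = &p->next) {
		if (!NAPI_GRO_CB(p)->same_flow)
			continue;

		th2 = tcp_hdr(p);

		/* One 32-bit XOR covers both 16-bit port fields. */
		if (*(u32 *)&th->source ^ *(u32 *)&th2->source) {
			NAPI_GRO_CB(p)->same_flow = 0;
			continue;
		}

		goto found;
	}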

If TCP decides no flush is needed, it goes into skb_gro_receive, the function that actually performs the merge. The first parameter is the skb already on gro_list and the second is the new skb. I will not go into much detail here; the blog post recommended below explains it very clearly. Briefly, there are two cases: if the new skb's data is scatter-gathered, its frags entries are merged into the frags of the gro_list skb; otherwise the data sits in the skb's linear area, so a new skb is allocated as a header, the two skbs are chained onto its frag_list, and it takes the original skb's place on gro_list. If the gro_list skb already has a frag_list, the new skb is simply appended to it.

Now we are back in dev_gro_receive: if a flush is needed or same_flow is 0, the skb has to be passed to the upper protocol stack, and napi_gro_complete is called.

That leaves the last case, where this skb is a new flow: it is simply added to the gro_list.
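
That tail of dev_gro_receive is roughly just a list insert (approximate, from memory):

	napi->gro_count++;
	NAPI_GRO_CB(skb)->count = 1;
	skb_shinfo(skb)->gso_size = skb_gro_len(skb);
	skb->next = napi->gro_list;
	napi->gro_list = skb;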


Finally, note that the so-called flush means flushing the skbs already sitting on gro_list up to the upper protocol layers; do not get this backwards.


For a more detailed description of GRO, please refer to http://marc.info/?l=git-commits-head&m=123050372721308&w=2

This blog post explains GRO much more clearly: http://www.pagefault.info/?p=159


In actual testing, the performance gain from TSO is very obvious, but GRO's is not; I do not know what would happen in a test that pushes performance to the limit.



