Overview
In computer networking, large segment offload (LSO) is a technique for increasing the outbound
throughput of high-bandwidth network connections by reducing CPU overhead. It works by queuing
up large buffers and letting the network interface card (NIC) split them into separate packets.
The technique is also called TCP segmentation offload (TSO) when applied to TCP, or generic
segmentation offload (GSO).
The inbound counterpart of large segment offload is large receive offload (LRO).
When large chunks of data are to be sent over a computer network, they first need to be broken
down into smaller segments that can pass through all the network elements, such as routers and
switches, between the source and destination computers. This process is referred to as
segmentation. Segmentation is often done by the TCP protocol in the host computer. Offloading
this work to the NIC is called TCP segmentation offload (TSO).
For example, a unit of 64KB (65,536 bytes) of data is usually split into 46 segments of at most
1448 bytes each (45 full segments plus a final 376-byte one) before it is sent over the network
through the NIC. With some intelligence in the NIC, the host CPU can hand over the 64KB of data
to the NIC in a single transmit request. The NIC can then break that data down into smaller
segments of 1448 bytes, add the TCP, IP, and data link layer protocol headers to each segment
(according to a template provided by the host's TCP/IP stack), and send the resulting frames
over the network. This significantly reduces the work done by the CPU. Many new NICs on the
market today support TSO. [1]
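The arithmetic in the 64KB example can be checked with a small user-space sketch. The helper names segment_count and last_segment_len are made up for this illustration and are not kernel functions; the 1448-byte figure is simply the payload size used in the example above.

```c
#include <stddef.h>

/* Number of segments needed to carry `len` payload bytes when each
 * segment holds at most `mss` bytes (rounded up). Returns 0 if mss is 0.
 * Hypothetical helper, for illustration only. */
static size_t segment_count(size_t len, size_t mss)
{
    if (mss == 0)
        return 0;
    return (len + mss - 1) / mss;
}

/* Payload size of the final segment; equals mss when len divides evenly.
 * Hypothetical helper, for illustration only. */
static size_t last_segment_len(size_t len, size_t mss)
{
    size_t rem;

    if (mss == 0 || len == 0)
        return 0;
    rem = len % mss;
    return rem ? rem : mss;
}
```

Calling segment_count(65536, 1448) gives 46, and last_segment_len(65536, 1448) gives 376, which matches the 45 full segments plus one short segment.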
Details
TSO reduces the CPU cost of cutting data into packets of up to 1500 bytes by asking the
hardware to perform that work instead.
1. The TSO feature relies on hardware support. The hardware should be able to segment the data
into packets of at most 1500 bytes and attach the headers to every packet.
2. Every network device is represented in the kernel by a net_device structure. If the hardware
supports TSO, the driver enables the segmentation offload features in that structure, mainly
the NETIF_F_TSO flag and related fields. [2]
TCP Segmentation Offload is supported in Linux by the network device layer. A driver that wants
to offer TSO needs to set the NETIF_F_TSO bit in the network device structure. In order for a
device to support TSO, it also needs to support TCP checksum offloading and scatter-gather I/O.
The driver will then receive super-sized skbs. These are indicated to the driver by
skb_shinfo(skb)->gso_size being non-zero. The gso_size is the size at which the hardware should
fragment the TCP data. TSO may change how and when TCP decides to send data. [3]
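The dependency between TSO, checksum offloading, and scatter-gather can be pictured as a bitmask check. The following is only a sketch: the DEMO_F_* constants and demo_fix_features are hypothetical stand-ins, and the real NETIF_F_* values and the kernel's own feature validation differ between kernel versions.

```c
#include <stdint.h>

/* Illustrative bit positions only; not the real NETIF_F_* values. */
#define DEMO_F_SG      (1u << 0)  /* scatter-gather I/O */
#define DEMO_F_IP_CSUM (1u << 1)  /* TCP checksum offload */
#define DEMO_F_TSO     (1u << 2)  /* TCP segmentation offload */

/* Drop TSO from a feature mask when one of its prerequisites is missing.
 * Hypothetical helper, for illustration only. */
static uint32_t demo_fix_features(uint32_t features)
{
    const uint32_t needed = DEMO_F_SG | DEMO_F_IP_CSUM;

    if ((features & DEMO_F_TSO) && (features & needed) != needed)
        features &= ~DEMO_F_TSO;
    return features;
}
```

A mask that advertises only DEMO_F_TSO comes back empty, while a mask with all three bits set is returned unchanged.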
Implementation
[C]
/* This data is invariant across clones and lives at the end of the
 * header data, i.e. at skb->end.
 */
struct skb_shared_info {
    ...
    unsigned short gso_size;   /* size of each data segment */
    unsigned short gso_segs;   /* number of segments the skb is divided into */
    unsigned short gso_type;
    struct sk_buff *frag_list; /* list of split data packets */
    ...
};
[C]
/* Initialize TSO state of skb.
 * This must be invoked the first time we consider transmitting
 * SKB onto the wire.
 */
static int tcp_init_tso_segs(struct sock *sk, struct sk_buff *skb,
                             unsigned int mss_now)
{
    int tso_segs = tcp_skb_pcount(skb);

    /* Recompute if the segment count is still unset, or if there are
     * multiple segments whose size differs from the current MSS.
     */
    if (!tso_segs || (tso_segs > 1 && tcp_skb_mss(skb) != mss_now)) {
        tcp_set_skb_tso_segs(sk, skb, mss_now);
        tso_segs = tcp_skb_pcount(skb); /* re-read the segment count */
    }
    return tso_segs;
}
/* Initialize TSO segments for a packet. */
static void tcp_set_skb_tso_segs(struct sock *sk, struct sk_buff *skb,
                                 unsigned int mss_now)
{
    /* No segmentation is needed when:
     * 1. the data length does not exceed the MSS, or
     * 2. the NIC does not support GSO, or
     * 3. the NIC does not support checksum recalculation.
     */
    if (skb->len <= mss_now || !sk_can_gso(sk) ||
        skb->ip_summed == CHECKSUM_NONE) {
        /* Avoid the costly divide in the normal non-TSO case. */
        skb_shinfo(skb)->gso_segs = 1;
        skb_shinfo(skb)->gso_size = 0;
        skb_shinfo(skb)->gso_type = 0;
    } else {
        /* Work out how many segments the data must be split into. */
        skb_shinfo(skb)->gso_segs = DIV_ROUND_UP(skb->len, mss_now); /* rounded up */
        skb_shinfo(skb)->gso_size = mss_now; /* size of each data segment */
        skb_shinfo(skb)->gso_type = sk->sk_gso_type;
    }
}
/* Due to TSO, an SKB can be composed of multiple actual packets.
 * To keep these tracked properly, we use this.
 */
static inline int tcp_skb_pcount(const struct sk_buff *skb)
{
    return skb_shinfo(skb)->gso_segs;
}

/* This is valid if tcp_skb_pcount() > 1 */
static inline int tcp_skb_mss(const struct sk_buff *skb)
{
    return skb_shinfo(skb)->gso_size;
}
static inline int sk_can_gso(const struct sock *sk)
{
    /* sk_route_caps holds the features of the NIC driver. sk_gso_type is
     * the GSO type, which is set to SKB_GSO_TCPV4.
     */
    return net_gso_ok(sk->sk_route_caps, sk->sk_gso_type);
}

static inline int net_gso_ok(int features, int gso_type)
{
    int feature = gso_type << NETIF_F_GSO_SHIFT;

    return (features & feature) == feature;
}
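The check in net_gso_ok() works because each GSO type bit, shifted left by a fixed amount, lines up with the matching device feature bit. The user-space sketch below mimics that mapping. The DEMO_* constants, including the shift of 16, are assumptions chosen for illustration and should not be read as the kernel's actual values.

```c
#include <stdbool.h>

#define DEMO_GSO_SHIFT  16                 /* stands in for NETIF_F_GSO_SHIFT */
#define DEMO_GSO_TCPV4  (1u << 0)          /* stands in for SKB_GSO_TCPV4 */
#define DEMO_GSO_UDP    (1u << 1)          /* a second, unrelated GSO type */
#define DEMO_F_TSO      (DEMO_GSO_TCPV4 << DEMO_GSO_SHIFT)

/* True when every GSO type requested in gso_type has its corresponding
 * feature bit set in features. Hypothetical helper, for illustration only. */
static bool demo_gso_ok(unsigned int features, unsigned int gso_type)
{
    unsigned int feature = gso_type << DEMO_GSO_SHIFT;

    return (features & feature) == feature;
}
```

Note that the comparison uses == rather than a plain non-zero test, so a request for two GSO types fails if the device supports only one of them.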
sk_gso_max_size
The NIC also specifies the maximum amount of data it can handle in one request, in the
sk_gso_max_size field. It is mostly set to 64 KB. This means that if TCP has more than 64 KB of
data to send, TCP itself must first split it into chunks of at most 64 KB and then push them to
the interface.
Related variable in struct sock: unsigned int sk_gso_max_size.
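The two levels of splitting (TCP cutting data into chunks no larger than the device limit, then each chunk being cut into MSS-sized segments) can be sketched as follows. This is a simplified model: struct demo_split and demo_split_write are hypothetical, and the sketch ignores header space and any rounding of the chunk size that a real stack may apply.

```c
#include <stddef.h>

/* Hypothetical summary of how one large write is carved up. */
struct demo_split {
    size_t skbs;          /* super-packets handed to the device */
    size_t wire_segments; /* MSS-sized segments produced in total */
};

/* Split len bytes into chunks of at most gso_max_size bytes, and count
 * how many segments of at most mss bytes those chunks turn into. */
static struct demo_split demo_split_write(size_t len, size_t gso_max_size,
                                          size_t mss)
{
    struct demo_split out = { 0, 0 };

    if (gso_max_size == 0 || mss == 0)
        return out;
    while (len > 0) {
        size_t chunk = len < gso_max_size ? len : gso_max_size;

        out.skbs++;
        out.wire_segments += (chunk + mss - 1) / mss; /* round up */
        len -= chunk;
    }
    return out;
}
```

Under these assumptions, a 200,000-byte write with a 65,536-byte limit and a 1448-byte MSS yields 4 super-packets and 141 segments on the wire.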
[C]
/* RFC2861 Check whether we are limited by application or congestion window.
 * This is the inverse of cwnd check in tcp_tso_should_defer.
 * Returns 1: limited by the congestion window, so the window should grow.
 * Returns 0: limited by the application, so the window need not grow.
 */
int tcp_is_cwnd_limited(const struct sock *sk, u32 in_flight)
{
    const struct tcp_sock *tp = tcp_sk(sk);
    u32 left;

    if (in_flight >= tp->snd_cwnd)
        return 1;

    /* left is the number of packets that can still be sent. */
    left = tp->snd_cwnd - in_flight;

    /* With GSO in use, the connection counts as cwnd-limited, and the
     * window may grow, when both conditions below hold.
     */
    if (sk_can_gso(sk) &&
        left * sysctl_tcp_tso_win_divisor < tp->snd_cwnd &&
        left * tp->mss_cache < sk->sk_gso_max_size)
        return 1;

    /* If left exceeds the allowed burst, the congestion window would
     * grow too quickly, so it must not be increased.
     */
    return left <= tcp_max_burst(tp);
}
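To experiment with the decision made in tcp_is_cwnd_limited() outside the kernel, the same comparisons can be reproduced on a plain struct. Everything named demo_* here is hypothetical; the fields merely stand in for the socket state and sysctl that the kernel function reads.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-in for the socket state used by the check. */
struct demo_tcp {
    uint32_t snd_cwnd;     /* congestion window, in packets */
    uint32_t mss;          /* bytes per packet */
    uint32_t gso_max_size; /* largest super-packet, in bytes */
    uint32_t win_divisor;  /* stands in for sysctl_tcp_tso_win_divisor */
    uint32_t max_burst;    /* stands in for tcp_max_burst() */
    bool can_gso;          /* stands in for sk_can_gso() */
};

/* Mirrors the three tests of tcp_is_cwnd_limited() on the model above. */
static bool demo_is_cwnd_limited(const struct demo_tcp *tp, uint32_t in_flight)
{
    uint32_t left;

    if (in_flight >= tp->snd_cwnd)
        return true;

    left = tp->snd_cwnd - in_flight; /* packets that can still be sent */

    if (tp->can_gso &&
        left * tp->win_divisor < tp->snd_cwnd &&
        left * tp->mss < tp->gso_max_size)
        return true;

    return left <= tp->max_burst;
}
```

With a window of 30 packets, a divisor of 3, and a burst of 3, the model reports the connection as cwnd-limited when 25 packets are in flight, but as application-limited when only 10 are.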
TSO Nagle
GSO (Generic Segmentation Offload) is an optimization in the protocol stack to improve efficiency.
It postpones segmentation as late as possible. Ideally, segmentation happens in the NIC driver,
where the super-packet is either broken up into an SG list or rebuilt as segments in
pre-allocated memory, and then handed to the NIC.
The idea behind GSO seems to be that many of the performance benefits of LSO (TSO/UFO)
can be obtained in a hardware-independent way, by passing large "superpackets" around for
as long as possible, and deferring segmentation to the last possible moment. For devices
without hardware segmentation/fragmentation support, this would be when data is actually
handed to the device driver; for devices with hardware support, it could even be done in hardware.
Try to defer sending, if possible, in order to minimize the amount of TSO splitting we do.
View it as a kind of TSO Nagle test.
Deferring transmission reduces the number of TSO splits and therefore the CPU load.
[C]
struct tcp_sock {
    ...
    u32 tso_deferred; /* timestamp of the last TSO deferral */
    ...
};
[C]
/* This algorithm is from John Heffner.
 * Returns 0: send now; 1: defer.
 */
static int tcp_tso_should_defer(struct sock *sk, struct sk_buff *skb)
{
    struct tcp_sock *tp = tcp_sk(sk);
    const struct inet_connection_sock *icsk = inet_csk(sk);
    u32 in_flight, send_win, cong_win, limit;
    int win_divisor;

    /* If this skb carries the FIN flag, send it immediately. */
    if (TCP_SKB_CB(skb)->flags & TCPHDR_FIN)
        goto send_now;

    /* If congestion control is not in the Open state, send immediately. */
    if (icsk->icsk_ca_state != TCP_CA_Open)
        goto send_now;

    /* Defer for less than two clock ticks.
     * If the previous deferral started more than one tick ago, stop deferring.
     * With 1 ms ticks (HZ = 1000), the TSO delay therefore cannot exceed 2 ms.
     */
    if (tp->tso_deferred &&
        (((u32)jiffies << 1) >> 1) - (tp->tso_deferred >> 1) > 1)
        goto send_now;

    in_flight = tcp_packets_in_flight(tp);

    /* The skb must need splitting and the congestion window must have
     * room for it; anything else is treated as a kernel bug.
     */
    BUG_ON(tcp_skb_pcount(skb) <= 1 || (tp->snd_cwnd <= in_flight));

    /* Remaining room in the advertised window, in bytes. */
    send_win = tcp_wnd_end(tp) - TCP_SKB_CB(skb)->seq;

    /* Remaining room in the congestion window, in bytes. */
    cong_win = (tp->snd_cwnd - in_flight) * tp->mss_cache;

    /* The smaller of the two is the final send limit. */
    limit = min(send_win, cong_win);

    /* If a full-sized TSO skb can be sent, do it.
     * That is generally 64 KB.
     */
    if (limit >= sk->sk_gso_max_size)
        goto send_now;

    /* Middle in queue won't get any more data, full sendable already? */
    if ((skb != tcp_write_queue_tail(sk)) && (limit >= skb->len))
        goto send_now;

    win_divisor = ACCESS_ONCE(sysctl_tcp_tso_win_divisor);
    if (win_divisor) {
        /* Maximum number of bytes that may be sent in one RTT. */
        u32 chunk = min(tp->snd_wnd, tp->snd_cwnd * tp->mss_cache);

        chunk /= win_divisor; /* bytes a single TSO segment may consume */

        /* If at least some fraction of a window is available, just use it. */
        if (limit >= chunk)
            goto send_now;
    } else {
        /* Different approach, try not to defer past a single ACK.
         * Receiver should ACK every other full sized frame, so if we
         * have space for more than 3 frames then send now.
         */
        if (limit > tcp_max_burst(tp) * tp->mss_cache)
            goto send_now;
    }

    /* OK, it looks like it is advisable to defer. */
    tp->tso_deferred = 1 | (jiffies << 1); /* record the deferral timestamp */
    return 1;

send_now:
    tp->tso_deferred = 0;
    return 0;
}

/* Returns end sequence number of the receiver's advertised window */
static inline u32 tcp_wnd_end(const struct tcp_sock *tp)
{
    /* snd_wnd is measured in bytes. */
    return tp->snd_una + tp->snd_wnd;
}
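The tso_deferred field packs two pieces of information into one u32: bit 0 marks that a deferral is in progress (so the value is never 0 while deferring), and the remaining 31 bits hold the tick counter. The sketch below reproduces that encoding in user space; demo_defer_stamp and demo_defer_expired are hypothetical names, and the tick counter is passed in as a plain argument instead of being read from jiffies.

```c
#include <stdbool.h>
#include <stdint.h>

/* Encode a tick counter so that the stored value is never 0:
 * bit 0 is the "deferral active" marker, bits 1..31 hold the low ticks. */
static uint32_t demo_defer_stamp(uint32_t ticks)
{
    return 1u | (ticks << 1);
}

/* True when a deferral is active and more than one tick has passed
 * since the stamp was taken. */
static bool demo_defer_expired(uint32_t stamp, uint32_t now_ticks)
{
    uint32_t now;

    if (stamp == 0)
        return false;              /* no deferral in progress */
    now = (now_ticks << 1) >> 1;   /* keep only the low 31 bits */
    return now - (stamp >> 1) > 1;
}
```

A stamp taken at tick 100 is not yet expired at tick 101, but it is expired at tick 102, which corresponds to the "less than two clock ticks" rule in the comment above.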
tcp_tso_win_divisor: controls the fraction of the congestion window that a single TSO segment may consume. The default value is 3.
If any of the following conditions is met, TSO is not deferred and the data is sent immediately:
(1) The packet carries the FIN flag. Transmission is about to end, so deferring is not advisable.
(2) The sender is not in the Open congestion state. Deferring is not advisable in an abnormal state.
(3) The previous skb was deferred and at least two clock ticks (2 ms when HZ is 1000) have passed since then. The delay cannot exceed 2 ms.
(4) min(send_win, cong_win) >= the size of a full-sized TSO skb. The amount of data allowed to be sent reaches the maximum that TSO can process at one time, so there is no need to defer.
(5) The skb is in the middle of the send queue and can be sent in its entirety. An skb in the middle of the queue will not receive any more data, so there is no need to defer it.
(6) When tcp_tso_win_divisor is set, limit >= min(snd_wnd, snd_cwnd * mss_cache) / tcp_tso_win_divisor, which is the amount of data a single TSO segment may consume.
(7) When tcp_tso_win_divisor is not set, limit > tcp_max_burst(tp) * mss_cache, which is generally three packets.
Conditions 4, 5, and 6/7 all have the form "limit reaches some threshold", in which case the data can be sent immediately. When one of them holds, we can conclude that sending is currently limited by the application rather than by the advertised window or the congestion window. TSO Nagle should not be applied when the application sends only small amounts of data, because that would hurt such applications.
Note that the comment in tcp_is_cwnd_limited() says:
"This is the inverse of cwnd check in tcp_tso_should_defer", so tcp_tso_should_defer() can be regarded as containing the check
for tcp_is_not_cwnd_limited (or tcp_is_application_limited).
TSO is deferred only when all of the following conditions are met:
(1) The packet does not carry the FIN flag.
(2) The sender is in the Open congestion state.
(3) The previous deferral started less than 2 ms ago.
(4) The amount of data allowed to be sent is smaller than sk_gso_max_size.
(5) The skb is at the tail of the send queue, or the skb cannot be sent in its entirety.
(6) When tcp_tso_win_divisor is set, the amount of data allowed to be sent is smaller than what a single TSO segment may consume.
(7) When tcp_tso_win_divisor is not set, no more than three packets can be sent.
We can see that the conditions for triggering TSO deferral are not hard to meet, so the call is not wrapped in unlikely().
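The window-based part of these conditions (4 to 7, plus the FIN and congestion-state checks) can be condensed into a small user-space model. This is a sketch under simplifying assumptions: struct demo_defer_in and demo_should_defer are hypothetical, all sizes are given in bytes, and the two-tick timer of condition 3 is deliberately left out.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical inputs to the deferral decision; all sizes in bytes. */
struct demo_defer_in {
    bool fin;              /* skb carries the FIN flag */
    bool ca_open;          /* congestion state is Open */
    bool is_queue_tail;    /* skb is the last one in the send queue */
    uint32_t skb_len;
    uint32_t send_win;     /* room left in the advertised window */
    uint32_t cong_win;     /* room left in the congestion window */
    uint32_t gso_max_size;
    uint32_t snd_wnd;      /* full advertised window */
    uint32_t cwnd_bytes;   /* snd_cwnd * mss_cache */
    uint32_t win_divisor;  /* 0 means "not set" */
    uint32_t burst_bytes;  /* tcp_max_burst() * mss_cache */
};

static uint32_t demo_min_u32(uint32_t a, uint32_t b)
{
    return a < b ? a : b;
}

/* Returns true to defer, false to send now (timer check omitted). */
static bool demo_should_defer(const struct demo_defer_in *in)
{
    uint32_t limit;

    if (in->fin || !in->ca_open)
        return false;

    limit = demo_min_u32(in->send_win, in->cong_win);
    if (limit >= in->gso_max_size)
        return false;
    if (!in->is_queue_tail && limit >= in->skb_len)
        return false;

    if (in->win_divisor) {
        uint32_t chunk = demo_min_u32(in->snd_wnd, in->cwnd_bytes);

        chunk /= in->win_divisor;
        if (limit >= chunk)
            return false;
    } else if (limit > in->burst_bytes) {
        return false;
    }
    return true;
}
```

For instance, with 5000 bytes of congestion window left, a 43,440-byte window, and a divisor of 3, the threshold is 14,480 bytes, so the model chooses to defer; raising the remaining room to 20,000 bytes makes it send immediately.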
Application
(1) Disable TSO
ethtool -K ethX tso off
(2) Enable TSO
TSO is enabled by default.
ethtool -K ethX tso on
Author
Zhangskd @ csdn
Reference
[1] http://en.wikipedia.org/wiki/Large_segment_offload
[2] http://tejparkash.wordpress.com/2010/03/06/tso-explained/
[3] http://www.linuxfoundation.org/collaborate/workgroups/networking/tso