When optimizing the performance of network applications on Linux, TCP-related kernel parameters are usually adjusted, especially those related to buffers and queues. Articles on the Internet will tell you which parameters to modify, but we often know what to change without knowing why, and after copying them a few times we may soon forget or confuse their meanings. This article attempts to summarize the TCP queue and buffer related kernel parameters, organizing them from the point of view of the protocol stack, in the hope that this makes them easier to understand and remember. Note that this article is derived from the reference documents; I have not verified it against the relevant kernel source, so I cannot guarantee that the content is rigorous and correct. For a Java programmer, never having read the kernel source is a regret.
Below I take the server side as the viewpoint and organize the parameters along three paths: connection establishment, packet reception, and packet transmission.
First, connection establishment
Let's look briefly at the connection setup process: the client sends a SYN packet to the server, the server replies with SYN+ACK, and the connection, now in the SYN_RECV state, is saved into the half-connection queue. The client then returns an ACK packet to complete the three-way handshake, and the server moves the connection, now in the ESTABLISHED state, into the accept queue, where it waits for the application to call accept().
You can see that establishing a connection involves two queues:
- The half-connection queue, which holds connections in the SYN_RECV state. Its length is set by net.ipv4.tcp_max_syn_backlog.
- The accept queue, which holds connections in the ESTABLISHED state. Its length is min(net.core.somaxconn, backlog), where backlog is the parameter we specify when creating the ServerSocket(int port, int backlog) and is eventually passed to the listen system call:
#include <sys/socket.h>
int listen(int sockfd, int backlog);
If we set backlog greater than net.core.somaxconn, the length of the accept queue will be capped at net.core.somaxconn.
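As a quick check (a minimal sketch; the commands only assume a standard Linux userland, and the numbers they print are examples), you can inspect the current limits with sysctl and observe the accept queue of a listening socket with ss, where for LISTEN sockets Send-Q shows the queue's maximum length and Recv-Q the number of connections currently waiting to be accepted:
# Current limits for the two queues
sysctl net.ipv4.tcp_max_syn_backlog
sysctl net.core.somaxconn
# For LISTEN sockets: Recv-Q = connections waiting for accept(), Send-Q = accept queue limit
ss -lnt
# Accept-queue overflows reported by the kernel
netstat -s | grep -i listen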
In addition, in order to defend against SYN flooding (that is, a client only sends SYN packets to initiate handshakes but never answers with the ACK that completes connection establishment, filling up the server's half-connection queue so that normal handshake requests cannot be handled), Linux implements a mechanism called SYN cookie, controlled by net.ipv4.tcp_syncookies; set it to 1 to enable it. Simply put, with SYN cookie the connection information is encoded into the ISN (initial sequence number) returned to the client, so the server does not need to keep the half-connection in the queue; instead it reconstructs the connection information from the ISN echoed in the ACK that the client sends back later, and completes the connection that way, preventing the half-connection queue from being filled up by a flood of SYN packets. Handshakes that the client never completes are simply ignored.
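A sketch of enabling it (the persistence step assumes the conventional /etc/sysctl.conf location):
# Enable SYN cookies immediately
sysctl -w net.ipv4.tcp_syncookies=1
# Persist the setting and reload
echo "net.ipv4.tcp_syncookies = 1" >> /etc/sysctl.conf
sysctl -p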
Second, the receipt of data packets
Let's look at the path along which packets are received:
Packets are received from the bottom up through three layers: the network card driver, the kernel space, and finally the user-space application. The Linux kernel uses the sk_buff (socket kernel buffers) data structure to describe a packet. When a new packet arrives, the NIC (network interface controller) places it into a kernel memory region called the Ring Buffer via the DMA engine. The Ring Buffer is fixed in size; it does not contain the actual packet data, but rather descriptors pointing to sk_buff. When the Ring Buffer is full, new packets are discarded. Once a packet is successfully received, the NIC raises an interrupt, and the kernel's interrupt handler passes the packet on to the IP layer. After IP-layer processing, the packet is placed in a queue waiting for the TCP layer. Each packet goes through a series of complex steps in the TCP layer that update the TCP state machine, and it finally arrives in the recv Buffer, waiting for the application to read it. It is important to note that when a packet arrives in the recv Buffer, TCP sends an ACK; that ACK only means the operating system kernel has received the packet, and does not guarantee that the application layer has received the data (the system could crash at that very moment, for example). Therefore, it is generally recommended that the application protocol also design its own acknowledgment mechanism.
The above is a fairly simplified view of packet reception; next let's look at the queue and buffer related parameters along this path.
Network card bonding mode
When a host has more than one network card, Linux can bind multiple network cards into one virtual bonded network interface, so that from the TCP/IP point of view only the single bonded NIC exists. Multi-NIC bonding can improve network throughput on one hand and enhance network high availability on the other. Linux supports 7 bonding modes:
- mode 0 (balance-rr): round-robin policy, with load balancing and fault tolerance
- mode 1 (active-backup): active-backup policy, only one NIC in the bond is active and the others are on standby
- mode 2 (balance-xor): XOR policy, selects the slave NIC based on the source MAC address and destination MAC address
- mode 3 (broadcast): broadcast policy, transmits every frame on all NICs
- mode 4 (802.3ad): IEEE 802.3ad dynamic link aggregation, creates aggregation groups that share the same speed and duplex settings
- mode 5 (balance-tlb): adaptive transmit load balancing
- mode 6 (balance-alb): adaptive load balancing
For detailed instructions refer to the kernel documentation, Linux Ethernet Bonding Driver HOWTO. We can view the bonding mode of the machine through cat /proc/net/bonding/bond0. Developers rarely need to set the NIC bonding mode themselves; you can refer to that document to experiment on your own.
NIC multi-queue and interrupt binding
As network bandwidth keeps increasing, a single CPU core can no longer keep up with the network card. With the support of multi-queue NIC drivers, each queue delivers its interrupts to a different CPU core, taking full advantage of multiple cores to increase packet processing capacity.
First, check whether the NIC supports multiple queues by using the lspci -vvv command and locating the Ethernet controller entry:
If there is an MSI-X entry with Enable+ and Count > 1, the NIC is a multi-queue network card.
Then check whether NIC multi-queue is turned on, using the cat /proc/interrupts command; if you see entries such as eth0-TxRx-0, multi-queue support is already enabled:
Finally, verify that each queue is bound to a different CPU. cat /proc/interrupts shows the interrupt number of each queue, and the file /proc/irq/${IRQ_NUM}/smp_affinity shows which CPU cores the interrupt number IRQ_NUM is bound to. The value is a hexadecimal bitmask in which each bit represents a CPU core:
(00000001) represents CPU0
(00000010) represents CPU1
(00000011) represents CPU0 and CPU1
If the bindings are not balanced, you can set them manually, for example:
echo "1" >/proc/irq/99/smp_affinity echo "2" >/proc/irq/100/smp_affinity echo "4" &G T /proc/irq/101/smp_affinity echo "8" >/proc/irq/102/smp_affinity echo "ten" >/proc/irq/103/smp_affinity echo "20" >/proc/irq/104/smp_affinity echo "/proc/irq/105/smp_affinity" > Echo "/proc/irq/106/smp_affinity" >
Ring Buffer
The Ring Buffer sits between the NIC and the IP layer and is a typical FIFO (first in, first out) ring queue. It does not contain the packet data itself, but rather descriptors pointing to sk_buff (socket kernel buffers).
You can use ethtool -g eth0 to view the current Ring Buffer settings:
In the example above, the receive queue is 4096 and the transmit queue is 256. You can observe the health of the receive and transmit queues with ifconfig:
- RX errors: the total number of receive errors
- RX dropped: the packet made it into the Ring Buffer, but was dropped while being copied into memory for system reasons such as insufficient memory.
- RX overruns: overruns means the packet was discarded by the NIC's physical layer before it ever reached the Ring Buffer; the CPU failing to handle interrupts in time is one cause of the Ring Buffer filling up, for example when interrupts are unevenly distributed.
When the dropped count keeps increasing, it is recommended to enlarge the Ring Buffer, which can be done with ethtool -G.
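For example (a sketch only; the interface name and sizes are illustrative and must not exceed the maximums reported in the "Pre-set maximums" section of ethtool -g):
# Show current and maximum ring sizes
ethtool -g eth0
# Enlarge the receive and transmit rings
ethtool -G eth0 rx 4096 tx 4096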
Input Packet Queue (packet receive queues)
When the rate at which packets are received is greater than the rate at which the kernel TCP stack processes them, packets are buffered in a queue in front of the TCP layer. The length of this receive queue is set by the parameter net.core.netdev_max_backlog.
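A sketch for inspecting and adjusting this queue (the value 2000 is only an example; as far as I know, the second column of /proc/net/softnet_stat counts packets dropped because this backlog was full):
# Current backlog length
sysctl net.core.netdev_max_backlog
# Increase it, e.g. to 2000
sysctl -w net.core.netdev_max_backlog=2000
# Per-CPU counters; the 2nd column is the drop count for this queue
cat /proc/net/softnet_stat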
Recv Buffer
The recv buffer is the key parameter for tuning TCP performance. BDP (bandwidth-delay product) is the product of the network bandwidth and the RTT (round trip time); it represents the maximum amount of data that can be in transit but not yet acknowledged at any moment. The RTT is easy to obtain with the ping command. In order to achieve maximum throughput, the recv Buffer should be set larger than the BDP, that is, recv Buffer >= bandwidth * RTT. Assuming the bandwidth is 100Mbps and the RTT is 100ms, the BDP is calculated as follows:
BDP = 100Mbps * 100ms = (100 / 8) * (100 / 1000) = 1.25MB
Linux added a recv Buffer auto-tuning mechanism after 2.6.17: the actual size of the recv buffer automatically floats between a minimum and a maximum value in order to balance performance and resource usage, so in most cases it is not recommended to set it to a fixed value manually.
When net.ipv4.tcp_moderate_rcvbuf is set to 1, the auto-tuning mechanism takes effect, and the recv buffer of each TCP connection is specified by the following three-element array:
net.ipv4.tcp_rmem = <MIN> <DEFAULT> <MAX>
The recv buffer is initially set to <DEFAULT>, and this default value overrides net.core.rmem_default. The recv buffer is then dynamically adjusted between the minimum and maximum values according to the actual situation. With the dynamic buffer tuning mechanism turned on, we set the maximum value of net.ipv4.tcp_rmem to the BDP.
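For instance, using the 1.25MB BDP computed above, such a setting might look like the sketch below (the minimum and default values here are arbitrary examples, not recommendations):
# min, default, max of the per-socket receive buffer, in bytes; max set to roughly the BDP
sysctl -w net.ipv4.tcp_rmem="4096 87380 1310720"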
When net.ipv4.tcp_moderate_rcvbuf is set to 0, or the socket option SO_RCVBUF is set, the dynamic buffer tuning mechanism is turned off. In that case the default size of the recv buffer is net.core.rmem_default, but if net.ipv4.tcp_rmem is set, its default value overrides it. The largest recv buffer that can be set through the setsockopt() system call is limited by net.core.rmem_max. With the dynamic tuning mechanism turned off, it is recommended to set the default value of the buffer to the BDP.
Note one more detail: besides storing the received data itself, part of the buffer space is used to store the socket data structure and other additional information. So the optimal recv buffer value discussed above, equal to just the BDP, is not enough; you also need to account for the overhead of storing this extra information. Linux calculates the size of the extra overhead according to the net.ipv4.tcp_adv_win_scale parameter:
If net.ipv4.tcp_adv_win_scale is 1, half of the buffer space is used for the additional overhead; if it is 2, a quarter of the buffer space is used for the additional overhead. Therefore the optimal recv buffer value should be set to:
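Based on the overhead fractions above, the relationship works out as follows (my own restatement, reusing the 1.25MB BDP example from earlier):
recv buffer best value = BDP / (1 - 1/2^tcp_adv_win_scale)
For example, with tcp_adv_win_scale = 2: 1.25MB / (1 - 1/4) ≈ 1.67MB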
Third, the transmission of data packets
Let's look at the path along which packets are sent:
In contrast to the receive path, packets are sent from top to bottom through three layers: the user-space application, the kernel space, and finally the network card driver. The application first writes data into the TCP send buffer; the TCP layer builds packets from the data in the send buffer and hands them to the IP layer. The IP layer places the packets to be sent into the QDisc (queueing discipline) queue. After a packet is successfully placed into the QDisc, the sk_buff descriptor pointing to it is placed into the Ring Buffer output queue, and the NIC driver then invokes the DMA engine to send it onto the network link.
Again, let's go through the queue and buffer related parameters layer by layer.
Send Buffer
Similar to the recv Buffer, the parameters related to the send Buffer are as follows:
net.ipv4.tcp_wmem = <MIN> <DEFAULT> <MAX>
net.core.wmem_default
net.core.wmem_max
The auto-tuning mechanism of the send-side buffer was implemented very early and is unconditionally turned on; there is no parameter to toggle it. If tcp_wmem is specified, net.core.wmem_default is overridden by tcp_wmem, and the send Buffer automatically adjusts between the minimum and maximum values of tcp_wmem. If setsockopt() is called to set the socket option SO_SNDBUF, the auto-tuning mechanism of the send-side buffer is turned off, tcp_wmem is ignored, and SO_SNDBUF is limited by the maximum value net.core.wmem_max.
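A sketch mirroring the receive side (the values are illustrative only):
# min, default, max of the per-socket send buffer, in bytes
sysctl -w net.ipv4.tcp_wmem="4096 16384 1310720"
# Upper bound for SO_SNDBUF set via setsockopt()
sysctl net.core.wmem_max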
Qdisc
The QDisc (queueing discipline) sits between the IP layer and the network card's ring buffer. As we already know, the ring buffer is a simple FIFO queue; this design keeps the NIC driver layer simple and fast. The QDisc implements the advanced functions of traffic management, including traffic classification, priority and traffic shaping (rate shaping). The QDisc can be configured with the tc command.
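As an illustration (a sketch only; the qdisc type, rate and device name are arbitrary examples, not tuning advice):
# Show the qdisc currently attached to eth0
tc qdisc show dev eth0
# Traffic shaping example: limit eth0 egress to 100mbit with a token bucket filter
tc qdisc add dev eth0 root tbf rate 100mbit burst 32kbit latency 400ms
# Remove it again
tc qdisc del dev eth0 root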
The QDisc queue length is set by txqueuelen. Unlike the receive packet queue, whose length is controlled by the kernel parameter net.core.netdev_max_backlog, txqueuelen is associated with the NIC, and you can view its current size with ifconfig:
Adjust the txqueuelen size with ifconfig:
ifconfig eth0 txqueuelen 2000
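On systems where the ip tool has replaced ifconfig, an equivalent sketch is:
# Show the current queue length (qlen) of eth0
ip link show eth0
# Set the transmit queue length to 2000
ip link set eth0 txqueuelen 2000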
Ring Buffer
As with packet reception, sending packets also goes through a Ring Buffer; use ethtool -g eth0 to view it:
The TX item is the Ring Buffer's transmit queue, that is, the length of the send queue. It is likewise set with the ethtool -G command.
TCP Segmentation and Checksum offloading
The operating system can offload some TCP/IP functions to the NIC, especially segmentation and checksum calculation. Having the hardware perform these operations instead of the OS saves CPU resources and improves performance.
The typical Ethernet MTU (Maximum Transmission Unit) is 1500 bytes. Suppose the application wants to send 7300 bytes of data: MTU 1500 bytes - IP header 20 bytes - TCP header 20 bytes leaves a payload of 1460 bytes, so the 7300 bytes need to be split into 5 segments:
The segmentation operation can be handed over from the operating system to the NIC; although 5 packets are still transmitted on the wire in the end, this saves CPU resources and yields a performance gain:
You can use ethtool -k eth0 to view the NIC's current offloading status:
In the example above, checksum and TCP segmentation offloading are both enabled. If you want to toggle the NIC's offloading switches, use the ethtool -K command (note the uppercase K); for example, the following command turns off TCP segmentation offload:
sudo ethtool -K eth0 tso off
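To check only the features discussed here, something like the following works (the exact feature names may vary slightly between driver versions):
# Filter the offload list down to segmentation and checksum features
ethtool -k eth0 | grep -E 'segmentation|checksum'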
NIC multi-queue and NIC bonding mode have already been introduced in the packet reception section.
That finally concludes the walkthrough. The reason for sorting out these TCP queue related parameters was a recent network timeout problem whose root cause I have not yet found; this document is a "side effect" of that investigation. Getting to the bottom of that problem may require profiling the TCP protocol implementation itself, which I still need to learn; I hope to be able to write it up and share it with you in the near future.
Reference documents
Queueing in the Linux Network Stack
TCP Implementation in Linux: A Brief Tutorial
Impact of Bandwidth Delay Product on TCP throughput
The System Knowledge Series that Java Programmers Should Also Know: the NIC
Talking About NIC Interrupt Handling