Summary of parameters related to Linux TCP queue


When optimizing the performance of network applications on Linux, we usually tune TCP-related kernel parameters, especially those related to buffers and queues. Articles on the Internet will tell you which parameters to modify, but we often know the "what" without the "why", and after copying them a few times we soon forget or confuse their meanings. This article tries to summarize the kernel parameters related to TCP queues and buffers, organizing them from the point of view of the protocol stack, in the hope that this makes them easier to understand and remember. Note that this article is compiled from the reference documents; I have not verified it against the relevant kernel source, so I cannot guarantee that everything is rigorous and correct. Not having read the kernel source is a gap in my knowledge as a Java programmer.

Below I take the server side as the point of view and group the parameters along three paths: connection establishment, packet reception, and packet transmission.

First, connection establishment

Let's look briefly at the connection-establishment process: the client sends a SYN packet to the server, and the server replies with SYN+ACK and saves the connection, now in the SYN_RECV state, into the half-connection (SYN) queue. The client then returns an ACK packet to complete the three-way handshake, and the server moves the connection, now in the ESTABLISHED state, into the accept queue, where it waits for the application to call accept().


You can see that establishing a connection involves two queues:

    • The half-connection (SYN) queue, which holds connections in the SYN_RECV state. Its length is set by net.ipv4.tcp_max_syn_backlog.

    • The accept queue, which holds connections in the ESTABLISHED state. Its length is min(net.core.somaxconn, backlog). The backlog is the parameter we specify when creating a ServerSocket(int port, int backlog) in Java, and it is eventually passed to the listen() system call:

      #include <sys/socket.h>
      int listen(int sockfd, int backlog);

      If the backlog we pass in is larger than net.core.somaxconn, the accept queue length is capped at net.core.somaxconn (see the commands after this list for a way to inspect both queues).
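
A minimal way to inspect these two queues, assuming a reasonably recent iproute2; for LISTEN sockets, ss reports the accept-queue limit in Send-Q and its current occupancy in Recv-Q:

      sysctl net.ipv4.tcp_max_syn_backlog net.core.somaxconn   # queue-length limits
      ss -lnt                                                  # Recv-Q = accept queue in use, Send-Q = its limit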

In addition, to deal with SYN flooding (where a client only sends SYN packets to initiate handshakes without replying with the ACK that completes connection establishment, filling up the server's half-connection queue so that it cannot handle normal handshake requests), Linux implements a mechanism called SYN cookies, controlled by net.ipv4.tcp_syncookies; setting it to 1 enables it. Simply put, a SYN cookie encodes the connection information into the ISN (initial sequence number) returned to the client, so the server does not need to keep the half-connection in its queue; when the client later sends back an ACK, the connection information can be recovered from the ISN and the connection completed, which prevents the half-connection queue from being filled up by SYN packets. Handshakes from clients that never reply are simply ignored.
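
Enabling SYN cookies is a one-line sysctl; this only changes the running system, and persisting it in /etc/sysctl.conf is a separate step:

    sysctl net.ipv4.tcp_syncookies              # check the current setting
    sudo sysctl -w net.ipv4.tcp_syncookies=1    # enable SYN cookies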

Second, packet reception

Let's look at the path a received packet takes:

A packet travels upward through three layers as it is received: the NIC driver, kernel space, and finally the application in user space. The Linux kernel uses the sk_buff (socket kernel buffer) data structure to describe a packet. When a new packet arrives, the NIC (Network Interface Controller) uses the DMA engine to place it into a kernel memory area via the ring buffer. The ring buffer has a fixed size and does not contain the packet data itself, only descriptors pointing to sk_buffs. When the ring buffer is full, new packets are discarded. Once a packet has been received successfully, the NIC raises an interrupt, and the kernel's interrupt handler passes the packet to the IP layer. After the IP layer has processed it, the packet is placed in a queue waiting for the TCP layer. In the TCP layer each packet goes through a series of complex steps that update the TCP state machine, and it finally arrives in the recv buffer, waiting for the application to read it.

It is important to note that when a packet reaches the recv buffer, TCP returns an ACK. This ACK only means the packet has been received by the operating system kernel; it does not guarantee that the application layer has received the data (for example, the system might crash at that moment), so it is generally recommended that the application protocol also design its own acknowledgement mechanism.

The above is a fairly simplified description of the receive path; now let's look at the queue- and buffer-related parameters.

  1. NIC bonding mode
    When the host has more than one NIC, Linux can bind multiple NICs into one virtual bonded network interface, so that TCP/IP sees only a single bonded NIC. Bonding multiple NICs can increase network throughput on the one hand and improve network availability on the other. Linux supports 7 bonding modes:

    For detailed instructions refer to the kernel document Linux Ethernet Bonding Driver HOWTO. We can check the bonding mode of the current machine with cat /proc/net/bonding/bond0 (see the commands after the mode list below):

    Developers rarely need to set the NIC bonding mode themselves; if you want to experiment with it, refer to that document.

      • Mode 0 (balance-rr): round-robin policy, providing load balancing and fault tolerance

      • Mode 1 (active-backup): active-backup policy; only one NIC in the bond is active, the other stays in standby

      • Mode 2 (balance-xor): XOR policy; selects the slave NIC based on the source MAC address and destination MAC address

      • Mode 3 (broadcast): broadcast policy; transmits every packet on all NICs

      • Mode 4 (802.3ad): IEEE 802.3ad dynamic link aggregation; creates aggregation groups that share the same speed and duplex settings

      • Mode 5 (balance-tlb): adaptive transmit load balancing

      • Mode 6 (balance-alb): adaptive load balancing
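
    To check how an existing bond is configured, assuming the bonded interface is named bond0, the bonding driver exposes its state through both procfs and sysfs:

    cat /proc/net/bonding/bond0             # mode, slave list and link status
    cat /sys/class/net/bond0/bonding/mode   # just the mode, e.g. "balance-rr 0"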

  2. NIC multi-queue and interrupt binding
    As network bandwidth keeps increasing, a single CPU core can no longer keep up with the NIC, so multi-queue NIC drivers distribute the queues across interrupts bound to different CPU cores, making full use of multiple cores to improve packet-processing throughput.
    First check whether the NIC supports multiple queues: use the lspci -vvv command and locate the Ethernet controller entry:

    If you see MSI-X: Enable+ with Count > 1, the NIC is a multi-queue NIC.
    Then check whether multi-queue support is enabled: run cat /proc/interrupts; if you see entries such as eth0-TxRx-0, multi-queue support is already on:

    Finally, confirm that each queue is bound to a different CPU. cat /proc/interrupts shows the interrupt number of each queue, and the corresponding file /proc/irq/${irq_num}/smp_affinity holds the CPU cores that interrupt irq_num is bound to, as a hexadecimal bitmask in which each bit represents one CPU core:

    00000001 represents CPU0
    00000010 represents CPU1
    00000011 represents CPU0 and CPU1

     

    If the bindings are unbalanced, they can be set manually, for example:

    echo "1"  > /proc/irq/99/smp_affinity
    echo "2"  > /proc/irq/100/smp_affinity
    echo "4"  > /proc/irq/101/smp_affinity
    echo "8"  > /proc/irq/102/smp_affinity
    echo "10" > /proc/irq/103/smp_affinity
    echo "20" > /proc/irq/104/smp_affinity
    echo "40" > /proc/irq/105/smp_affinity
    echo "80" > /proc/irq/106/smp_affinity
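
    A small sketch for verifying the result, assuming the interface is named eth0; it prints the affinity mask currently set for each of eth0's queue interrupts. Note that the irqbalance daemon, if running, may overwrite manual settings:

    # print "IRQ <num> -> <cpu mask>" for every eth0 interrupt listed in /proc/interrupts
    grep eth0 /proc/interrupts | awk '{print $1}' | tr -d ':' | while read irq; do
        echo "IRQ $irq -> $(cat /proc/irq/$irq/smp_affinity)"
    done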

     

  3. Ring buffer
    The ring buffer sits between the NIC driver and the IP layer and is a typical FIFO (first in, first out) circular queue. The ring buffer does not contain the packet data itself, only descriptors pointing to sk_buffs (socket kernel buffers).
    You can use ethtool -g eth0 to view the current ring buffer settings:

    In the example above, the receive queue is 4096 and the transmit queue is 256. You can observe the health of the receive and transmit queues through ifconfig:

      • RX errors: the total number of receive errors

      • RX dropped: the packet made it into the ring buffer, but was dropped while being copied into memory, for system reasons such as insufficient memory.

      • RX overruns: overruns means the packet was discarded by the NIC's physical layer without ever entering the ring buffer, which happens when the ring buffer is full; a CPU that cannot handle interrupts in time is one reason the ring buffer fills up, for example when interrupts are distributed unevenly.
        When the dropped counters keep increasing, it is recommended to enlarge the ring buffer with ethtool -G, as shown below.
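
    A minimal sketch of viewing and enlarging the receive ring, assuming the interface is named eth0; the value you can set is bounded by the hardware maximum reported by ethtool -g:

    ethtool -g eth0                 # show current and maximum ring sizes
    sudo ethtool -G eth0 rx 4096    # enlarge the receive ring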

  4. Input packet queue (packet receive queue)
    When packets are received faster than the kernel's TCP layer can process them, they are buffered in a queue before the TCP layer. The length of this receive queue is set by the parameter net.core.netdev_max_backlog.
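
    To inspect or adjust this at runtime, something along these lines works; the value 2000 below is only an illustrative example, not a recommendation:

    sysctl net.core.netdev_max_backlog                 # view the current queue length
    sudo sysctl -w net.core.netdev_max_backlog=2000    # set it at runtime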

  5. Recv buffer
    The recv buffer is a key parameter for tuning TCP performance. The BDP (bandwidth-delay product) is the product of the network bandwidth and the RTT (round-trip time), and it represents the maximum amount of unacknowledged data in flight at any moment. The RTT can easily be obtained with the ping command. To achieve maximum throughput, the recv buffer should be set larger than the BDP, i.e. recv buffer >= bandwidth * RTT. Assuming the bandwidth is 100 Mbps and the RTT is 100 ms, the BDP is calculated as follows:

    BDP = 100 Mbps * 100 ms = (100 / 8) MB/s * (100 / 1000) s = 1.25 MB

    Since 2.6.17, Linux has had a recv buffer auto-tuning mechanism: the actual size of the recv buffer floats automatically between a minimum and a maximum value, looking for the balance between performance and resource usage, so in most cases it is not recommended to set the recv buffer to a fixed value manually.
    When net.ipv4.tcp_moderate_rcvbuf is set to 1, the auto-tuning mechanism is in effect, and the recv buffer of each TCP connection is specified by the following 3-tuple:

    net.ipv4.tcp_rmem = <MIN> <DEFAULT> <MAX>

    The initial recv buffer is set to <DEFAULT>, and this value overrides the net.core.rmem_default setting. The recv buffer is then adjusted dynamically between the minimum and the maximum according to the actual situation. With the dynamic buffer tuning mechanism on, we set the maximum of net.ipv4.tcp_rmem to the BDP.
    When net.ipv4.tcp_moderate_rcvbuf is set to 0, or the socket option SO_RCVBUF is set, the dynamic buffer tuning mechanism is turned off. In that case the default recv buffer size is set by net.core.rmem_default, but if net.ipv4.tcp_rmem is set, its <DEFAULT> overrides that value. The maximum recv buffer size that can be requested through setsockopt() is limited by net.core.rmem_max. With the dynamic tuning mechanism off, it is recommended to set the default buffer size to the BDP.

    Note one more detail: besides holding the received data itself, part of the buffer space is used for the socket data structures and other bookkeeping. So the ideal recv buffer value discussed above should not just equal the BDP; the overhead of this extra socket information must also be taken into account. Linux computes the size of this extra overhead from the parameter net.ipv4.tcp_adv_win_scale:

    If net.ipv4.tcp_adv_win_scale is 1, one half of the buffer space is used for the extra overhead; if it is 2, one quarter of the buffer space is used for the extra overhead. So the best value for the recv buffer should be set to:

    recv buffer = BDP / (1 - 1 / 2^tcp_adv_win_scale)
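
    A minimal sketch of checking and setting these values at runtime; the min/default numbers below are common defaults used purely for illustration, and the maximum follows the example above (a BDP of 1.25 MB with tcp_adv_win_scale = 1 gives roughly 2.5 MB = 2621440 bytes):

    sysctl net.ipv4.tcp_moderate_rcvbuf net.ipv4.tcp_adv_win_scale net.ipv4.tcp_rmem
    sudo sysctl -w net.ipv4.tcp_rmem="4096 87380 2621440"   # <MIN> <DEFAULT> <MAX> in bytes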

Third, packet transmission

The path a packet takes when it is sent:

In contrast to the receive path, a packet travels downward through three layers as it is sent: the application in user space, kernel space, and finally the NIC driver. The application first writes data into the TCP send buffer; the TCP layer builds the data in the send buffer into packets and passes them to the IP layer. The IP layer puts the packets to be sent into the queue of the qdisc (queueing discipline). After a packet has been put into the qdisc successfully, the descriptor (sk_buff) pointing to it is placed into the ring buffer's transmit queue, and the NIC driver then uses the DMA engine to send the data onto the network link.

Again, let's go through the queue- and buffer-related parameters layer by layer.

  1. Send buffer
    Similar to the recv buffer, the parameters related to the send buffer are as follows:

    net.ipv4.tcp_wmem = <MIN> <DEFAULT> <MAX>
    net.core.wmem_default
    net.core.wmem_max

    The auto-tuning mechanism for the send buffer was implemented very early and is enabled unconditionally, with no parameter to switch it on or off. If tcp_wmem is specified, net.core.wmem_default is overridden by tcp_wmem's <DEFAULT>. The send buffer is adjusted automatically between the minimum and maximum of tcp_wmem. If setsockopt() is called with the socket option SO_SNDBUF, the send buffer's auto-tuning mechanism is turned off, tcp_wmem is ignored, and the SO_SNDBUF maximum is limited by net.core.wmem_max.
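
    As on the receive side, a runtime sketch with purely illustrative numbers:

    sysctl net.ipv4.tcp_wmem net.core.wmem_default net.core.wmem_max
    sudo sysctl -w net.ipv4.tcp_wmem="4096 65536 2621440"   # <MIN> <DEFAULT> <MAX> in bytes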

  2. Qdisc
    The qdisc (queueing discipline) sits between the IP layer and the NIC's ring buffer. As we already know, the ring buffer is a simple FIFO queue, which keeps the NIC driver layer simple and fast. The qdisc implements the advanced traffic-management features: traffic classification, prioritization and traffic shaping (rate limiting). The qdisc can be configured with the tc command.
    The qdisc queue length is set by txqueuelen; unlike the receive-side packet queue, whose length is controlled by the kernel parameter net.core.netdev_max_backlog, txqueuelen is associated with the NIC. Its current value can be viewed with the ifconfig command:

    And adjusted with ifconfig as well:

    ifconfig eth0 txqueuelen 2000
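
    The iproute2 equivalents, assuming the interface is named eth0, do the same thing and also let you see which qdisc is attached:

    tc qdisc show dev eth0             # show the qdisc currently attached to eth0
    ip link set eth0 txqueuelen 2000   # same effect as the ifconfig command above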

  3. Ring buffer
    As with packet reception, outgoing packets also pass through the ring buffer; use ethtool -g eth0 to view it:

    The TX entries describe the ring buffer's transmit queue, i.e. the length of the send queue. It is likewise set with ethtool -G.
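
    Again assuming eth0, and subject to the hardware maximum reported by ethtool -g:

    sudo ethtool -G eth0 tx 4096    # enlarge the transmit ring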

  4. TCP segmentation and checksum offloading
    The operating system can offload some TCP/IP work to the NIC, in particular segmentation and checksum calculation. Having the hardware perform these operations instead of the OS saves CPU resources and improves performance.
    The MTU (maximum transmission unit) of ordinary Ethernet is 1500 bytes. Suppose the application wants to send 7300 bytes of data: with a 1500-byte MTU, minus 20 bytes of IP header and 20 bytes of TCP header, the payload per packet is 1460 bytes, so the 7300 bytes have to be split into 5 segments:

    This segmentation work can be handed over from the operating system to the NIC; although 5 packets are still sent on the wire in the end, it saves CPU resources and improves performance:

    You can use ethtool -k eth0 to view the NIC's current offloading status:

    In the example above, both checksum and TCP segmentation offloading are enabled. To change a NIC's offloading switches, use the ethtool -K command (note the uppercase K); for example, the following command turns TCP segmentation offload off:

    sudo ethtool -K eth0 tso off
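
    To check just these two features, assuming eth0 (the exact feature names printed can vary slightly with the driver and ethtool version):

    ethtool -k eth0 | grep -E 'tcp-segmentation-offload|checksumming'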

  5. NIC multi-queue and NIC bonding mode
    These were already covered in the section on packet reception.

With that, the walkthrough is finally complete. The reason for sorting out the TCP queue-related parameters is a recent network-timeout problem whose root cause I still have not found; this document is a "side effect" of that troubleshooting. Getting to the bottom of it will probably require profiling the TCP protocol code, which means more studying; I hope to write that up and share it in the near future.

Reference documents
Queueing in the Linux Network Stack
TCP Implementation in Linux: A Brief Tutorial
Impact of Bandwidth Delay Product on TCP Throughput
System Knowledge that Java Programmers Should Also Know: NICs
On NIC Interrupt Handling
