When optimizing the performance of network applications on Linux, TCP-related kernel parameters are usually adjusted, especially those related to buffers and queues. Articles on the Internet will tell you which parameters to modify, but we often know what to change without knowing why, and after copying them a few times we may soon forget or confuse their meanings. This article attempts to summarize the TCP queue and buffer related kernel parameters, organizing them from the point of view of the protocol stack, in the hope that this makes them easier to understand and remember. Note that this article is derived from the reference documents; I have not verified it against the relevant kernel source, so I cannot guarantee that the content is rigorous and correct. For a Java programmer, never having read the kernel source is a regret.
Below I take the server side as the viewpoint and organize the parameters along three paths: connection establishment, packet reception, and packet transmission.
First, connection establishment
Let's look briefly at the connection setup process: the client sends a SYN packet to the server, the server replies with SYN+ACK, and the connection, now in the SYN_RECV state, is saved into the half-connection queue. The client then returns an ACK packet to complete the three-way handshake, and the server moves the connection, now in the ESTABLISHED state, into the accept queue, where it waits for the application to call accept().
You can see that establishing a connection involves two queues:
- The half-connection queue, which holds connections in the SYN_RECV state. Its length is set by net.ipv4.tcp_max_syn_backlog.
- The accept queue, which holds connections in the ESTABLISHED state. Its length is min(net.core.somaxconn, backlog), where backlog is the parameter we specify when creating the ServerSocket(int port, int backlog) and is eventually passed to the listen system call:
#include <sys/socket.h>
int listen(int sockfd, int backlog);
If we set backlog greater than net.core.somaxconn, the length of the accept queue will be capped at net.core.somaxconn.
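As a quick check (a minimal sketch; the commands only assume a standard Linux userland, and the numbers they print are examples), you can inspect the current limits with sysctl and observe the accept queue of a listening socket with ss, where for LISTEN sockets Send-Q shows the queue's maximum length and Recv-Q the number of connections currently waiting to be accepted:
# Current limits for the two queues
sysctl net.ipv4.tcp_max_syn_backlog
sysctl net.core.somaxconn
# For LISTEN sockets: Recv-Q = connections waiting for accept(), Send-Q = accept queue limit
ss -lnt
# Accept-queue overflows reported by the kernel
netstat -s | grep -i listen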
In addition, in order to defend against SYN flooding (that is, a client only sends SYN packets to initiate handshakes but never answers with the ACK that completes connection establishment, filling up the server's half-connection queue so that normal handshake requests cannot be handled), Linux implements a mechanism called SYN cookie, controlled by net.ipv4.tcp_syncookies; set it to 1 to enable it. Simply put, with SYN cookie the connection information is encoded into the ISN (initial sequence number) returned to the client, so the server does not need to keep the half-connection in the queue; instead it reconstructs the connection information from the ISN echoed in the ACK that the client sends back later, and completes the connection that way, preventing the half-connection queue from being filled up by a flood of SYN packets. Handshakes that the client never completes are simply ignored.
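A sketch of enabling it (the persistence step assumes the conventional /etc/sysctl.conf location):
# Enable SYN cookies immediately
sysctl -w net.ipv4.tcp_syncookies=1
# Persist the setting and reload
echo "net.ipv4.tcp_syncookies = 1" >> /etc/sysctl.conf
sysctl -p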
Second, the receipt of data packets
Let's look at the path along which packets are received:
Packets are received from the bottom up through three layers: the network card driver, the kernel space, and finally the user-space application. The Linux kernel uses the sk_buff (socket kernel buffers) data structure to describe a packet. When a new packet arrives, the NIC (network interface controller) places it into a kernel memory region called the Ring Buffer via the DMA engine. The Ring Buffer is fixed in size; it does not contain the actual packet data, but rather descriptors pointing to sk_buff. When the Ring Buffer is full, new packets are discarded. Once a packet is successfully received, the NIC raises an interrupt, and the kernel's interrupt handler passes the packet on to the IP layer. After IP-layer processing, the packet is placed in a queue waiting for the TCP layer. Each packet goes through a series of complex steps in the TCP layer that update the TCP state machine, and it finally arrives in the recv Buffer, waiting for the application to read it. It is important to note that when a packet arrives in the recv Buffer, TCP sends an ACK; that ACK only means the operating system kernel has received the packet, and does not guarantee that the application layer has received the data (the system could crash at that very moment, for example). Therefore, it is generally recommended that the application protocol also design its own acknowledgment mechanism.
The above is a fairly simplified view of packet reception; next let's look at the queue and buffer related parameters along this path.
Network card bonding mode
When a host has more than one network card, Linux can bind multiple network cards into one virtual bonded network interface, so that from the TCP/IP point of view only the single bonded NIC exists. Multi-NIC bonding can improve network throughput on one hand and enhance network high availability on the other. Linux supports 7 bonding modes:
- mode 0 (balance-rr): round-robin policy, with load balancing and fault tolerance
- mode 1 (active-backup): active-backup policy, only one NIC in the bond is active and the others are on standby
- mode 2 (balance-xor): XOR policy, selects the slave NIC based on the source MAC address and destination MAC address
- mode 3 (broadcast): broadcast policy, transmits every frame on all NICs
- mode 4 (802.3ad): IEEE 802.3ad dynamic link aggregation, creates aggregation groups that share the same speed and duplex settings
- mode 5 (balance-tlb): adaptive transmit load balancing
- mode 6 (balance-alb): adaptive load balancing
For detailed instructions refer to the kernel documentation, Linux Ethernet Bonding Driver HOWTO. We can view the bonding mode of the machine through cat /proc/net/bonding/bond0. Developers rarely need to set the NIC bonding mode themselves; you can refer to that document to experiment on your own.
NIC multi-queue and interrupt binding
As network bandwidth keeps increasing, a single CPU core can no longer keep up with the network card. With the support of multi-queue NIC drivers, each queue delivers its interrupts to a different CPU core, taking full advantage of multiple cores to increase packet processing capacity.
First, check whether the NIC supports multiple queues by using the lspci -vvv command and locating the Ethernet controller entry:
If there is an MSI-X entry with Enable+ and Count > 1, the NIC is a multi-queue network card.
Then check whether NIC multi-queue is turned on, using the cat /proc/interrupts command; if you see entries such as eth0-TxRx-0, multi-queue support is already enabled:
Finally, verify that each queue is bound to a different CPU. cat /proc/interrupts shows the interrupt number of each queue, and the file /proc/irq/${IRQ_NUM}/smp_affinity shows which CPU cores the interrupt number IRQ_NUM is bound to. The value is a hexadecimal bitmask in which each bit represents a CPU core:
(00000001) represents CPU0
(00000010) represents CPU1
(00000011) represents CPU0 and CPU1
If the bindings are not balanced, you can set them manually, for example:
echo "1" >/proc/irq/99/smp_affinity echo "2" >/proc/irq/100/smp_affinity echo "4" &G T /proc/irq/101/smp_affinity echo "8" >/proc/irq/102/smp_affinity echo "ten" >/proc/irq/103/smp_affinity echo "20" >/proc/irq/104/smp_affinity echo "/proc/irq/105/smp_affinity" > Echo "/proc/irq/106/smp_affinity" >
Ring Buffer
The Ring Buffer sits between the NIC and the IP layer and is a typical FIFO (first in, first out) ring queue. It does not contain the packet data itself, but rather descriptors pointing to sk_buff (socket kernel buffers).
You can use ethtool -g eth0 to view the current Ring Buffer settings:
In the example above, the receive queue is 4096 and the transmit queue is 256. You can observe the health of the receive and transmit queues with ifconfig:
- RX errors: the total number of receive errors
- RX dropped: the packet made it into the Ring Buffer, but was dropped while being copied into memory for system reasons such as insufficient memory.
- RX overruns: overruns means the packet was discarded by the NIC's physical layer before it ever reached the Ring Buffer; the CPU failing to handle interrupts in time is one cause of the Ring Buffer filling up, for example when interrupts are unevenly distributed.
When the dropped count keeps increasing, it is recommended to enlarge the Ring Buffer, which can be done with ethtool -G.
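For example (a sketch only; the interface name and sizes are illustrative and must not exceed the maximums reported in the "Pre-set maximums" section of ethtool -g):
# Show current and maximum ring sizes
ethtool -g eth0
# Enlarge the receive and transmit rings
ethtool -G eth0 rx 4096 tx 4096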
Input Packet Queue (packet receive queues)
When the rate at which packets are received is greater than the rate at which the kernel TCP stack processes them, packets are buffered in a queue in front of the TCP layer. The length of this receive queue is set by the parameter net.core.netdev_max_backlog.
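A sketch for inspecting and adjusting this queue (the value 2000 is only an example; as far as I know, the second column of /proc/net/softnet_stat counts packets dropped because this backlog was full):
# Current backlog length
sysctl net.core.netdev_max_backlog
# Increase it, e.g. to 2000
sysctl -w net.core.netdev_max_backlog=2000
# Per-CPU counters; the 2nd column is the drop count for this queue
cat /proc/net/softnet_stat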
Recv Buffer
The recv buffer is the key parameter for tuning TCP performance. BDP (bandwidth-delay product) is the product of the network bandwidth and the RTT (round trip time); it represents the maximum amount of data that can be in transit but not yet acknowledged at any moment. The RTT is easy to obtain with the ping command. In order to achieve maximum throughput, the recv Buffer should be set larger than the BDP, that is, recv Buffer >= bandwidth * RTT. Assuming the bandwidth is 100Mbps and the RTT is 100ms, the BDP is calculated as follows:
BDP = 100Mbps * 100ms = (100 / 8) * (100 / 1000) = 1.25MB
Linux added a recv Buffer auto-tuning mechanism after 2.6.17: the actual size of the recv buffer automatically floats between a minimum and a maximum value in order to balance performance and resource usage, so in most cases it is not recommended to set it to a fixed value manually.
When net.ipv4.tcp_moderate_rcvbuf is set to 1, the auto-tuning mechanism takes effect, and the recv buffer of each TCP connection is specified by the following three-element array:
net.ipv4.tcp_rmem = <MIN> <DEFAULT> <MAX>
The recv buffer is initially set to <DEFAULT>, and this default value overrides net.core.rmem_default. The recv buffer is then dynamically adjusted between the minimum and maximum values according to the actual situation. With the dynamic buffer tuning mechanism turned on, we set the maximum value of net.ipv4.tcp_rmem to the BDP.
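For instance, using the 1.25MB BDP computed above, such a setting might look like the sketch below (the minimum and default values here are arbitrary examples, not recommendations):
# min, default, max of the per-socket receive buffer, in bytes; max set to roughly the BDP
sysctl -w net.ipv4.tcp_rmem="4096 87380 1310720"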
When net.ipv4.tcp_moderate_rcvbuf is set to 0, or the socket option SO_RCVBUF is set, the dynamic buffer tuning mechanism is turned off. In that case the default size of the recv buffer is net.core.rmem_default, but if net.ipv4.tcp_rmem is set, its default value overrides it. The largest recv buffer that can be set through the setsockopt() system call is limited by net.core.rmem_max. With the dynamic tuning mechanism turned off, it is recommended to set the default value of the buffer to the BDP.
Note one more detail: besides storing the received data itself, part of the buffer space is used to store the socket data structure and other additional information. So the optimal recv buffer value discussed above, equal to just the BDP, is not enough; you also need to account for the overhead of storing this extra information. Linux calculates the size of the extra overhead according to the net.ipv4.tcp_adv_win_scale parameter:
If net.ipv4.tcp_adv_win_scale is 1, half of the buffer space is used for the additional overhead; if it is 2, a quarter of the buffer space is used for the additional overhead. Therefore the optimal recv buffer value should be set to:
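Based on the overhead fractions above, the relationship works out as follows (my own restatement, reusing the 1.25MB BDP example from earlier):
recv buffer best value = BDP / (1 - 1/2^tcp_adv_win_scale)
For example, with tcp_adv_win_scale = 2: 1.25MB / (1 - 1/4) ≈ 1.67MB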
Third, the transmission of data packets
Let's look at the path along which packets are sent:
In contrast to the receive path, packets are sent from top to bottom through three layers: the user-space application, the kernel space, and finally the network card driver. The application first writes data into the TCP send buffer; the TCP layer builds packets from the data in the send buffer and hands them to the IP layer. The IP layer places the packets to be sent into the QDisc (queueing discipline) queue. After a packet is successfully placed into the QDisc, the sk_buff descriptor pointing to it is placed into the Ring Buffer output queue, and the NIC driver then invokes the DMA engine to send it onto the network link.
Again, let's go through the queue and buffer related parameters layer by layer.
Send Buffer
Similar to the recv Buffer, the parameters related to the send Buffer are as follows:
net.ipv4.tcp_wmem = <MIN> <DEFAULT> <MAX>
net.core.wmem_default
net.core.wmem_max
The auto-tuning mechanism of the send-side buffer was implemented very early and is unconditionally turned on; there is no parameter to toggle it. If tcp_wmem is specified, net.core.wmem_default is overridden by tcp_wmem, and the send Buffer automatically adjusts between the minimum and maximum values of tcp_wmem. If setsockopt() is called to set the socket option SO_SNDBUF, the auto-tuning mechanism of the send-side buffer is turned off, tcp_wmem is ignored, and SO_SNDBUF is limited by the maximum value net.core.wmem_max.
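A sketch mirroring the receive side (the values are illustrative only):
# min, default, max of the per-socket send buffer, in bytes
sysctl -w net.ipv4.tcp_wmem="4096 16384 1310720"
# Upper bound for SO_SNDBUF set via setsockopt()
sysctl net.core.wmem_max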
Qdisc
The QDisc (queueing discipline) sits between the IP layer and the network card's ring buffer. As we already know, the ring buffer is a simple FIFO queue; this design keeps the NIC driver layer simple and fast. The QDisc implements the advanced functions of traffic management, including traffic classification, priority and traffic shaping (rate shaping). The QDisc can be configured with the tc command.
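As an illustration (a sketch only; the qdisc type, rate and device name are arbitrary examples, not tuning advice):
# Show the qdisc currently attached to eth0
tc qdisc show dev eth0
# Traffic shaping example: limit eth0 egress to 100mbit with a token bucket filter
tc qdisc add dev eth0 root tbf rate 100mbit burst 32kbit latency 400ms
# Remove it again
tc qdisc del dev eth0 root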
The QDisc queue length is set by txqueuelen. Unlike the receive packet queue, whose length is controlled by the kernel parameter net.core.netdev_max_backlog, txqueuelen is associated with the NIC, and you can view its current size with ifconfig:
Adjust the txqueuelen size with ifconfig:
ifconfig eth0 txqueuelen 2000
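On systems where the ip tool has replaced ifconfig, an equivalent sketch is:
# Show the current queue length (qlen) of eth0
ip link show eth0
# Set the transmit queue length to 2000
ip link set eth0 txqueuelen 2000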
Ring Buffer
As with packet reception, sending packets also goes through a Ring Buffer; use ethtool -g eth0 to view it:
The TX item is the Ring Buffer's transmit queue, that is, the length of the send queue. It is likewise set with the ethtool -G command.
TCP Segmentation and Checksum offloading
The operating system can offload some TCP/IP functions to the NIC, especially segmentation and checksum calculation. Having the hardware perform these operations instead of the OS saves CPU resources and improves performance.
The typical Ethernet MTU (Maximum Transmission Unit) is 1500 bytes. Suppose the application wants to send 7300 bytes of data: MTU 1500 bytes - IP header 20 bytes - TCP header 20 bytes leaves a payload of 1460 bytes, so the 7300 bytes need to be split into 5 segments:
The segmentation operation can be handed over from the operating system to the NIC; although 5 packets are still transmitted on the wire in the end, this saves CPU resources and yields a performance gain:
You can use ethtool -k eth0 to view the NIC's current offloading status:
In the example above, checksum and TCP segmentation offloading are both enabled. If you want to toggle the NIC's offloading switches, use the ethtool -K command (note the uppercase K); for example, the following command turns off TCP segmentation offload:
sudo ethtool -K eth0 tso off
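To check only the features discussed here, something like the following works (the exact feature names may vary slightly between driver versions):
# Filter the offload list down to segmentation and checksum features
ethtool -k eth0 | grep -E 'segmentation|checksum'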
NIC multi-queue and NIC bonding mode have already been introduced in the packet reception section.
That finally concludes the walkthrough. The reason for sorting out these TCP queue related parameters was a recent network timeout problem whose root cause I have not yet found; this document is a "side effect" of that investigation. Getting to the bottom of that problem may require profiling the TCP protocol implementation itself, which I still need to learn; I hope to be able to write it up and share it with you in the near future.
Reference documents
Queueing in the Linux Network Stack
TCP Implementation in Linux: A Brief Tutorial
Impact of Bandwidth Delay Product on TCP throughput
The System Knowledge Series that Java Programmers Should Also Know: the NIC
Talking About NIC Interrupt Handling