net.ipv4.tcp_tw_recycle

Last Update:2016-07-09 Source: Internet

Author: User

Originally published 2016-03-07 by cfc4n (Operation and Maintenance Help)

This article is a translation of the English blog post "Coping with the TCP TIME-WAIT state on busy Linux servers" (http://vincent.bernat.im/en/blog/2014-tcp-time-wait-state-linux.html). It is not a complete translation: the translator, cfc4n, adapted it according to their own understanding of the original and added supporting arguments, so it differs slightly from the original text. The purpose of the translation is to reinforce the translator's own knowledge and to share it with others who may find it helpful. The article is fairly long; impatient readers may want to stop here.

Do not enable net.ipv4.tcp_tw_recycle

The description of net.ipv4.tcp_tw_recycle in the Linux kernel documentation is not very clear.

tcp_tw_recycle (Boolean; default: disabled; since Linux 2.4) [Translator's note: description taken from the Linux tcp man page]
Enable fast recycling of TIME-WAIT sockets. Enabling this option is not recommended since this causes
problems when working with NAT (Network Address Translation).
In other words: the option enables fast recycling of sockets in the TIME-WAIT state, and it is not recommended because it causes problems with NAT (Network Address Translation).

Its sibling, net.ipv4.tcp_tw_reuse, is documented only slightly better in the manual:

tcp_tw_reuse (Boolean; default: disabled; since Linux 2.4.19/2.6)
Allow to reuse TIME-WAIT sockets for new connections when it is safe from protocol viewpoint. It
should not be changed without advice/request of technical experts.
In other words: sockets in the TIME-WAIT state may be reused for new TCP connections when doing so is safe from the protocol's point of view. (This setting is relevant on the client side.)

The documentation is sparse, and many Linux tuning guides on the web recommend setting net.ipv4.tcp_tw_recycle to 1 (enabled) to quickly reduce the number of TCP connections in the TIME-WAIT state.

However, as the tcp(7) manual page states, net.ipv4.tcp_tw_recycle is quite problematic for public-facing servers: it cannot handle connections from two different computers behind the same NAT device. That is the situation in an ordinary home with more than one device, in an Internet cafe, or in a company where many machines share one NAT device. The resulting problem is hard to detect.

Enable fast recycling of TIME-WAIT sockets. Enabling this option is not recommended since this causes problems when working with NAT (Network Address Translation).
In short: the option is not recommended, and it should not be changed without the advice of technical experts.

The rest of this article gives a more detailed explanation, in the hope of correcting the mistaken advice found online. Widely reposted pages often rank near the top of search results, so users are frequently exposed to imprecise or simply wrong information.

As a side note, although the name net.ipv4.tcp_tw_recycle contains ipv4, the setting applies to IPv6 as well. Also keep in mind that this article is about the Linux TCP stack itself; Netfilter connection tracking on Linux is a separate matter and behaves somewhat differently.

What is the TIME-WAIT state of a TCP connection, and why does it exist?

Let's recall what the TCP TIME-WAIT state is.

(Figure: TCP state diagram)

The process in this diagram is not easy to follow. A clearer picture of the process:

(Figure: TCP state flowchart)

When a TCP connection is closed, the side that initiates the close enters the TIME-WAIT state, while the other side can release the connection quickly.
You can use ss -tan to view the current state of TCP connections.
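As a rough illustration of reading that output, the following Python sketch tallies connections by state from text in the column layout printed by ss -tan. The function name count_states and the sample lines are invented for this example; the addresses are documentation addresses, not real measurements.

```python
from collections import Counter

def count_states(ss_output: str) -> Counter:
    """Count TCP connections per state from `ss -tan` style text."""
    counts = Counter()
    for line in ss_output.splitlines():
        fields = line.split()
        # Skip blank lines and the header row, which starts with "State".
        if not fields or fields[0] == "State":
            continue
        counts[fields[0]] += 1
    return counts

# Hypothetical sample in the column layout of `ss -tan`.
SAMPLE = """\
State      Recv-Q Send-Q Local Address:Port  Peer Address:Port
LISTEN     0      128    0.0.0.0:80          0.0.0.0:*
ESTAB      0      0      192.0.2.10:80       198.51.100.7:51432
TIME-WAIT  0      0      192.0.2.10:80       198.51.100.7:51433
TIME-WAIT  0      0      192.0.2.10:80       198.51.100.8:40112
"""

if __name__ == "__main__":
    for state, number in count_states(SAMPLE).items():
        print(state, number)
```

On a Linux host, the same function could be fed the real output of ss -tan captured with the subprocess module.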

The role of the TIME-WAIT state

The TIME-WAIT state serves two purposes.

The first, and best known, is to prevent delayed segments from a previous connection from being accepted by a later connection that uses the same four-tuple (source address, source port, destination address, destination port). The sequence number must also fall within a certain range for a segment to be accepted, which reduces the probability of the problem but cannot eliminate it, especially on fast connections with large receive windows. RFC 1337 explains in detail what happens when the TIME-WAIT state is deficient. What could go wrong if TIME-WAIT connections were recycled too quickly? Consider the following example:

After the TIME-WAIT period is shortened, a delayed TCP segment is accepted by a newly established, unrelated connection.

The second purpose is to make sure the remote end has closed the connection. When the last ACK is lost, the remote end stays in the LAST-ACK state. Without the TIME-WAIT state, a connection could be reopened while the remote end still considers the previous connection valid. When the remote end then receives a SYN segment, it replies with an RST because it is not expecting such a segment, and the new connection is aborted with an error.

If the remote end is stuck in the LAST-ACK state because the last ACK was lost, opening a new TCP connection with the same four-tuple will not work.

RFC 793 requires the TIME-WAIT state to last twice the MSL (maximum segment lifetime). On Linux this duration is not tunable: it is hard-coded to 1 minute in include/net/tcp.h.

There have been proposals to make the TIME-WAIT duration a tunable parameter, but they were rejected on the grounds that the TIME-WAIT state, as specified for TCP, does more good than harm.

So what is the problem?

Let's look at why this state can be a burden on a server that handles a large number of connections. There are three aspects:

The slot taken in the TCP connection table, which prevents a new connection with the same four-tuple
The memory occupied by the socket structure in the kernel
The additional CPU overhead

The output of ss -tan state time-wait | wc -l is not, in itself, evidence of any of these problems.

Connection table slot

A TCP connection in the TIME-WAIT state is kept in the connection table for 1 minute. During that time, no other connection with the same four-tuple (source address, source port, destination address, destination port) can exist, so a new TCP connection with that four-tuple cannot be established.

For a web server, the destination address and the destination port are fixed. If the web server sits behind an L7 load balancer, the source address is fixed as well. On Linux, a client by default allocates its ports from a range of about 30,000 (adjustable through net.ipv4.ip_local_port_range).

This means that only about 30,000 connections can be established between the load balancer and the web server every minute, or roughly 500 connections per second.
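The arithmetic behind that estimate can be sketched in Python. The helper names are made up for this example, and the range 32768 to 61000 is only an assumed setting used for illustration; on a real host the two numbers would come from /proc/sys/net/ipv4/ip_local_port_range.

```python
TIME_WAIT_SECONDS = 60  # TIME-WAIT duration on Linux, as stated in the text

def parse_port_range(text: str) -> tuple:
    """Parse the two whitespace-separated numbers of ip_local_port_range."""
    low, high = (int(part) for part in text.split())
    return low, high

def max_new_connections_per_second(low: int, high: int,
                                   time_wait_s: int = TIME_WAIT_SECONDS) -> float:
    """Upper bound on new connections per second for ONE fixed
    (source address, destination address, destination port) triple,
    assuming every closed connection holds its port for time_wait_s."""
    ports = high - low + 1
    return ports / time_wait_s

if __name__ == "__main__":
    low, high = parse_port_range("32768\t61000")  # assumed example range
    print(max_new_connections_per_second(low, high))
```

With exactly 30,000 usable ports the bound is 500 connections per second, which matches the figure in the text.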

If the TIME-WAIT sockets are on the client side, the problem is easy to detect: the call to connect() returns EADDRNOTAVAIL, and the application logs an error.

If the TIME-WAIT sockets are on the server side, the problem is harder to diagnose because there is no log entry and no counter to consult. You can, however, list the four-tuples currently in use on the server to confirm it.
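One possible way to produce such a listing is sketched below in Python: it groups the connections on a given server port by (local address, peer address), so a pair that approaches the size of the client port range stands out. The function name, the sample text and the choice of grouping key are assumptions made for this example.

```python
from collections import Counter

def slots_per_address_pair(ss_output: str, server_port: int) -> Counter:
    """Count connections on server_port per (local IP, peer IP) pair,
    reading text in the column layout of `ss -tan`."""
    counts = Counter()
    for line in ss_output.splitlines():
        fields = line.split()
        # Need at least: State Recv-Q Send-Q Local Peer. Skip header and listeners.
        if len(fields) < 5 or fields[0] in ("State", "LISTEN"):
            continue
        local_ip, local_port = fields[3].rsplit(":", 1)
        peer_ip, _peer_port = fields[4].rsplit(":", 1)
        if local_port == str(server_port):
            counts[(local_ip, peer_ip)] += 1
    return counts

# Hypothetical sample; addresses are documentation addresses.
SAMPLE = """\
State      Recv-Q Send-Q Local Address:Port  Peer Address:Port
LISTEN     0      128    0.0.0.0:80          0.0.0.0:*
TIME-WAIT  0      0      192.0.2.10:80       198.51.100.7:51433
TIME-WAIT  0      0      192.0.2.10:80       198.51.100.7:51434
ESTAB      0      0      192.0.2.10:80       198.51.100.8:40112
ESTAB      0      0      192.0.2.10:22       198.51.100.9:50000
"""

if __name__ == "__main__":
    for pair, number in slots_per_address_pair(SAMPLE, 80).most_common():
        print(pair, number)
```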

The solution is to increase the number of possible four-tuples. There are several ways to do this, listed here from the easiest to the hardest to implement:

Modify the net.ipv4.ip_local_port_range parameter to widen the range of client ports.
Use more server ports by having the web server listen on additional ports such as 81, 82 and 83. Because the web servers sit behind the load balancer, this is invisible to users.
Use more client IP addresses, especially on the load balancer, so that it talks to the backend web servers from several addresses.
Add more server IP addresses.
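Each item in this list multiplies one factor of the four-tuple. A minimal Python sketch of that multiplication follows; the function name and the example numbers are illustrative only.

```python
def quadruplet_capacity(client_ips: int, client_ports: int,
                        server_ips: int, server_ports: int,
                        time_wait_s: int = 60) -> tuple:
    """Return (distinct four-tuples, sustainable new connections per second),
    assuming each closed connection occupies its four-tuple for time_wait_s."""
    total = client_ips * client_ports * server_ips * server_ports
    return total, total / time_wait_s

if __name__ == "__main__":
    # One load balancer address, one backend address, one backend port.
    print(quadruplet_capacity(1, 30000, 1, 1))
    # Same setup with the backend listening on four ports (80, 81, 82, 83).
    print(quadruplet_capacity(1, 30000, 1, 4))
```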

Of course, the last resort is to adjust net.ipv4.tcp_tw_reuse and net.ipv4.tcp_tw_recycle. Do not do so unless you have to; these options are discussed later.

Memory

When a server handles many connections, keeping each closed connection around for an extra minute consumes some memory. For example, if the server processes 10,000 new TCP connections per second, it will hold 10,000/s * 60 s = 600,000 TCP connections in the TIME-WAIT state. How much memory does that take? Not that much, as it turns out.

First, from the application's point of view, a socket in the TIME-WAIT state consumes no memory at all: the socket has been closed. In the kernel, a TIME-WAIT socket is present in three different structures, for three different purposes.

First, a hash table of connections, called the "TCP established hash table" (even though it also holds connections in other states), is used to locate an existing connection, for example when a new segment arrives.
Each bucket of this hash table contains both a list of connections in the TIME-WAIT state and a list of regular active connections (in the output of netstat -antp, connections in the TIME_WAIT state have no PID, while active connections do).

The size of the hash table depends on the amount of system memory. It is printed at boot and can be seen in the dmesg log.

The value can be overridden with the kernel boot parameter thash_entries, which sets the number of entries in the TCP connection hash table.

In the list of TIME-WAIT connections, each element is a struct tcp_timewait_sock, whereas connections in other states use struct tcp_sock.

Second, a set of connection lists called the "death row" is used to expire connections in the TIME-WAIT state. The lists are ordered by the time remaining before expiration. They use the same memory as the entries in the connection hash table: this is the struct hlist_node tw_death_node member of struct inet_timewait_sock.

Third, a hash table of bound ports stores the locally bound ports and their associated parameters. It is used to check that a port is free, either when listening on a specific port or when the system picks a port dynamically for an outgoing connection. This hash table has the same size as the connection hash table.

Each element is a struct inet_bind_socket, and there is one element per locally bound port. On a web server, TIME-WAIT connections are bound locally to port 80 and all share the same entry. Connections from the local machine to a remote service, on the other hand, are bound to randomly assigned ports and do not share an entry.

Therefore, we only need to consider the space taken by struct tcp_timewait_sock and struct inet_bind_socket. Every connection in the TIME-WAIT state, inbound or outbound, has one struct tcp_timewait_sock. Each outbound connection additionally has its own struct inet_bind_socket; inbound connections do not.

A struct tcp_timewait_sock is only 168 bytes, and a struct inet_bind_socket is 48 bytes.

So with 40,000 inbound connections in the TIME-WAIT state, less than 10 MB of memory is used. With 40,000 outbound connections in the TIME-WAIT state, about 2.5 MB of additional memory is needed. This can be checked against the output of slabtop; the test data here comes from a server with 50,000 connections in the TIME-WAIT state, 45,000 of which are outbound.
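The estimate can be reproduced with a few lines of Python. This sketch uses only the two structure sizes quoted above and ignores allocator overhead, so it gives a lower bound; the rounded figures in the text are somewhat higher. The function and constant names are made up for the example.

```python
TCP_TIMEWAIT_SOCK_BYTES = 168  # size quoted in the text
INET_BIND_SOCKET_BYTES = 48    # size quoted in the text

def time_wait_memory_bytes(inbound: int, outbound: int) -> int:
    """Raw structure bytes used by TIME-WAIT connections.

    Every TIME-WAIT connection has one tcp_timewait_sock; only outbound
    connections also have a dedicated inet_bind_socket.
    """
    per_connection = (inbound + outbound) * TCP_TIMEWAIT_SOCK_BYTES
    bind_entries = outbound * INET_BIND_SOCKET_BYTES
    return per_connection + bind_entries

if __name__ == "__main__":
    for inbound, outbound in [(40000, 0), (0, 40000), (5000, 45000)]:
        megabytes = time_wait_memory_bytes(inbound, outbound) / 1e6
        print(inbound, outbound, round(megabytes, 2), "MB")
```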

The memory used by connections in the TIME-WAIT state is very small. If your server handles thousands of new TCP connections per second, it needs far more memory to exchange data with its clients. The memory footprint of TIME-WAIT connections can simply be ignored.

CPU

On the CPU side, searching for a free local port can be somewhat expensive. The work is done by the inet_csk_get_port() function, which takes a lock and iterates over the locally bound ports until it finds a free one. A large number of entries in this hash table is usually not a problem: if the server has many outbound connections in the TIME-WAIT state (for example to FPM or memcache), those connections usually share the same profile, so the function quickly finds a free port as it iterates over them in order.

Other Solutions

If, after reading the sections above, you still think TIME-WAIT connections are a problem for you, there are three more options:

Disable socket lingering [Translator's note 1: taking Ubuntu 12.04 as an example, the linger structure is defined in /usr/src/linux-headers-3.2.0-23/include/linux/socket.h]
Enable net.ipv4.tcp_tw_reuse
Enable net.ipv4.tcp_tw_recycle

When close() is called, any data remaining in the kernel buffers is sent in the background (the socket lingers), and the socket eventually switches to the TIME-WAIT state. If lingering is disabled, the connection is torn down as soon as close() is called, and data still waiting in the buffers is not sent.

An application can choose to change this lingering behavior. There are two cases.

In the first case, close() discards any remaining data and does not go through the normal four-segment termination sequence. The connection is closed with an RST, so the peer sees an error, and it is destroyed immediately. No TIME-WAIT socket is left behind.

In the second case, if the socket send buffer still holds data when close() is called, the process sleeps until either all the data has been sent and acknowledged or the configured linger timer expires. A non-blocking socket does not sleep; the same steps then happen in the background. This mechanism allows the remaining data to be sent within the configured timeout. If the data is sent successfully, the normal close sequence runs and the socket enters the TIME-WAIT state. Otherwise, the connection is closed with an RST and the remaining data is discarded.
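The two cases correspond to the two fields of the SO_LINGER socket option. A minimal Python sketch, assuming a Linux host where struct linger is two C ints; the helper names are invented for this example and this is not taken from HAProxy or Nginx.

```python
import socket
import struct

def linger_value(on: bool, seconds: int) -> bytes:
    """Pack the C `struct linger { int l_onoff; int l_linger; }`."""
    return struct.pack("ii", 1 if on else 0, seconds)

def close_with_reset(sock: socket.socket) -> None:
    """First case: l_onoff=1, l_linger=0.
    close() discards unsent data and the connection is reset,
    so no TIME-WAIT socket is left on this side."""
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_LINGER, linger_value(True, 0))
    sock.close()

def close_with_timeout(sock: socket.socket, seconds: int) -> None:
    """Second case: l_onoff=1, l_linger>0.
    On a blocking socket, close() waits up to `seconds` for unsent data
    to be delivered before giving up."""
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_LINGER, linger_value(True, seconds))
    sock.close()
```

A caller would apply close_with_reset only to connections where the upper-layer protocol has already confirmed that nothing of value is still in flight.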

In both cases, disabling socket lingering is not a cure-all. It can be appropriate in applications such as HAProxy or Nginx (acting as a reverse proxy) when the upper-layer protocol (such as HTTP) makes it safe. Still, there are good reasons not to disable it unconditionally.

net.ipv4.tcp_tw_reuse

The TIME-WAIT state prevents delayed segments from being accepted by an unrelated connection. Under certain conditions, however, it is possible to assume that a segment of a new connection cannot be confused with a segment of an old connection that used the same four-tuple. RFC 1323 defines a set of TCP extensions to improve performance over high-bandwidth paths. Among other things, it defines a new TCP option carrying two four-byte timestamp fields: the first is the current value of the sender's timestamp clock, and the second is the most recent timestamp received from the remote host.

When net.ipv4.tcp_tw_reuse is enabled, Linux reuses an existing connection in the TIME-WAIT state for a new outgoing connection if the new timestamp is strictly greater than the most recent timestamp recorded for the previous connection.

An outgoing connection in the TIME-WAIT state can therefore be reused after just 1 second.
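The rule can be modelled in a few lines of Python. This is a simplified model of the condition described above, not kernel code: the function name is invented, and timestamps are assumed to be whole seconds so that "strictly greater" means at least one second later.

```python
from typing import Optional

def can_reuse_time_wait(last_timestamp_s: Optional[int],
                        now_s: int,
                        tw_reuse_enabled: bool) -> bool:
    """Decide whether a TIME-WAIT entry may be handed to a new outgoing
    connection, following the rule described in the text.

    last_timestamp_s is the most recent timestamp recorded for the old
    connection, or None if the peer never sent the timestamp option.
    """
    if not tw_reuse_enabled:
        return False
    if last_timestamp_s is None:
        # Without timestamps there is nothing to tell old segments from new.
        return False
    return now_s > last_timestamp_s

if __name__ == "__main__":
    print(can_reuse_time_wait(100, 100, True))   # same second
    print(can_reuse_time_wait(100, 101, True))   # one second later
```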

How is this safe?

The first purpose of TIME-WAIT is to keep duplicate segments from being accepted by an unrelated new connection. With timestamps, such duplicates carry an outdated timestamp and are discarded.
The second purpose is to make sure the remote end (not necessarily the server; from the server's point of view the remote end is the client, so "remote end" is used throughout) is not stuck in the LAST-ACK state because the last ACK was lost. The remote end retransmits the FIN segment until one of the following happens:

It gives up (and tears down the connection)
It receives the ACK it is waiting for
It receives an RST

If the FIN segment arrives in time, the local socket is still in the TIME-WAIT state and the expected ACK is sent.

Once a new connection has replaced the TIME-WAIT entry, the SYN segment of the new connection is ignored by the remote end (thanks to the timestamps). It is not answered with an RST; instead the remote end retransmits its FIN segment. That FIN is answered with an RST (because the local connection is in the SYN-SENT state), which lets the remote end leave the LAST-ACK state. The initial SYN is resent after 1 second, and the connection is then established. There is no apparent error, only a short delay.

Note also that the TWRecycled counter is incremented when a connection is reused. [Translator's note: see the value of TWRecycled in /proc/net/netstat]
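Reading that counter can be sketched in Python. /proc/net/netstat consists of pairs of lines, a line of field names followed by a line of values, both starting with the same prefix. The sample below is shortened and its numbers are invented; only the layout is meant to be representative.

```python
def parse_netstat(text: str) -> dict:
    """Parse /proc/net/netstat style text into {prefix: {name: value}}."""
    stats = {}
    lines = [line for line in text.splitlines() if line.strip()]
    # Lines come in pairs: names first, then the matching values.
    for header, values in zip(lines[0::2], lines[1::2]):
        prefix, names = header.split(":", 1)
        _same_prefix, numbers = values.split(":", 1)
        stats[prefix] = dict(zip(names.split(), map(int, numbers.split())))
    return stats

# Shortened, hypothetical sample of the file's layout.
SAMPLE = """\
TcpExt: SyncookiesSent TW TWRecycled TWKilled
TcpExt: 0 1200 35 0
IpExt: InNoRoutes InOctets
IpExt: 0 987654
"""

if __name__ == "__main__":
    print(parse_netstat(SAMPLE)["TcpExt"]["TWRecycled"])
    # On a Linux host:
    # with open("/proc/net/netstat") as handle:
    #     print(parse_netstat(handle.read())["TcpExt"].get("TWRecycled"))
```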

net.ipv4.tcp_tw_recycle

This mechanism also relies on the timestamp option, but it affects both incoming and outgoing connections. [Translator's note: tcp_timestamps is enabled by default on Linux]
The TIME-WAIT state is scheduled to expire sooner: it is removed after one retransmission timeout (RTO) interval, which the kernel computes from the round-trip time (RTT) of the connection (see also "Some thoughts on the changes to the backlog parameter in PHP-FPM"). You can run the ss command to inspect these values for a live TCP connection. [Translator's note: the rto and rtt values reported by ss on Linux are in milliseconds]
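As an illustration of pulling those values out of the ss output, the Python sketch below extracts the rto and rtt fields from one line of per-connection detail, as printed when ss is run with its -i option. The sample line and its numbers are invented, and the function name is an assumption of this example.

```python
import re
from typing import Optional

RTO_PATTERN = re.compile(r"\brto:([\d.]+)")
RTT_PATTERN = re.compile(r"\brtt:([\d.]+)/([\d.]+)")

def parse_rto_rtt(info_line: str) -> Optional[dict]:
    """Extract rto, rtt and rtt variance (all in milliseconds) from one
    line of `ss -i` style detail, or return None if they are absent."""
    rto = RTO_PATTERN.search(info_line)
    rtt = RTT_PATTERN.search(info_line)
    if rto is None or rtt is None:
        return None
    return {
        "rto_ms": float(rto.group(1)),
        "rtt_ms": float(rtt.group(1)),
        "rttvar_ms": float(rtt.group(2)),
    }

# Hypothetical detail line in the style printed by `ss -ti`.
SAMPLE_INFO = "\t cubic wscale:7,7 rto:204 rtt:4.5/3 ato:40 mss:1448 cwnd:10"

if __name__ == "__main__":
    print(parse_rto_rtt(SAMPLE_INFO))
```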

Linux then drops any segment from the remote host whose timestamp is not greater than the latest timestamp recorded for that host, unless the TIME-WAIT state would already have expired.

When the remote host is in fact a NAT device, this timestamp condition prevents every host behind the NAT except one from connecting for one minute (the MSL interval), because each host has its own clock and therefore its own timestamps. This leads to problems that are hard to detect and hard to diagnose, so disabling this option is recommended. The LAST-ACK state on the remote side is handled in the same way as described for net.ipv4.tcp_tw_reuse.
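The NAT failure can be reproduced with a toy model in Python. This is a deliberately simplified sketch of the rule described above, not the kernel implementation: the class name is invented, timestamps are plain integers, and one record is kept per source IP address, which is exactly why hosts sharing a NAT address interfere with each other.

```python
class RecycleModel:
    """Toy model of the per-remote-host timestamp check."""

    def __init__(self, window_s: int = 60):
        self.window_s = window_s
        self.seen = {}  # source IP -> (last timestamp value, time recorded)

    def record_close(self, src_ip: str, tsval: int, now: int) -> None:
        """Remember the latest timestamp when a connection enters TIME-WAIT."""
        self.seen[src_ip] = (tsval, now)

    def accepts(self, src_ip: str, tsval: int, now: int) -> bool:
        """Return True if a new segment from src_ip would be accepted."""
        entry = self.seen.get(src_ip)
        if entry is None:
            return True
        last_tsval, recorded_at = entry
        if now - recorded_at >= self.window_s:
            return True  # the record has expired
        return tsval > last_tsval

if __name__ == "__main__":
    nat_ip = "203.0.113.1"  # documentation address standing in for the NAT
    model = RecycleModel()
    # Host A, whose clock has a large timestamp value, closes a connection.
    model.record_close(nat_ip, tsval=900000, now=0)
    print("host A again:", model.accepts(nat_ip, tsval=900500, now=1))
    # Host B sits behind the same NAT but has a much smaller timestamp value.
    print("host B:", model.accepts(nat_ip, tsval=1200, now=1))
    print("host B after a minute:", model.accepts(nat_ip, tsval=1200, now=61))
```

In this model host B is rejected until the record for the shared address expires, even though host B has done nothing wrong.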

Summary

The most appropriate solution is to increase the number of possible four-tuples, for example by using more server ports or server IP addresses, so that the server can accommodate enough connections in the TIME-WAIT state. In common Internet architectures (Nginx in front of Nginx, Nginx with FPM, FPM with Redis, MySQL, memcache and so on), the most effective way to reduce the number of TIME-WAIT connections is to use persistent (long-lived) connections instead of short ones, especially between the load balancer and the web servers. A case in point is the Lianjia ("link family") incident, in which PHP connected to Redis without persistent connections.

On the server side, do not enable net.ipv4.tcp_tw_recycle unless you are sure that no NAT devices are involved. Enabling net.ipv4.tcp_tw_reuse on a server has no effect on incoming TCP connections.

On the client side (in particular, a service on a server that itself acts as a client, such as the Nginx reverse proxy mentioned above, or anything that connects to Redis, MySQL or FPM), enabling net.ipv4.tcp_tw_reuse is a reasonably safe way to deal with TIME-WAIT. Enabling net.ipv4.tcp_tw_recycle on top of it is useless for the client role and can cause many strange problems (especially on an FPM server, which is a server relative to Nginx and a client relative to Redis).
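These recommendations can be turned into a small check. The Python sketch below reads the two settings from /proc/sys/net/ipv4 and applies the rules from this summary; the function names, the role labels and the wording of the messages are all invented for this example, and a setting whose file cannot be read is treated as unknown.

```python
from pathlib import Path
from typing import List, Optional

def read_flag(name: str, root: str = "/proc/sys/net/ipv4") -> Optional[int]:
    """Return the integer value of a sysctl file, or None if unreadable."""
    try:
        return int(Path(root, name).read_text().split()[0])
    except (OSError, ValueError, IndexError):
        return None

def review(role: str, tw_reuse: Optional[int], tw_recycle: Optional[int]) -> List[str]:
    """Apply the summary's advice. role is "server" or "client"."""
    notes = []
    if tw_recycle:
        notes.append("tcp_tw_recycle is enabled: clients behind NAT may fail to connect")
    if role == "server" and tw_reuse:
        notes.append("tcp_tw_reuse has no effect on incoming connections")
    if role == "client" and tw_reuse == 0:
        notes.append("tcp_tw_reuse could be enabled for outgoing connections")
    return notes

if __name__ == "__main__":
    reuse = read_flag("tcp_tw_reuse")
    recycle = read_flag("tcp_tw_recycle")
    for note in review("server", reuse, recycle):
        print(note)
```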

Finally, a quote from W. Richard Stevens in UNIX Network Programming:

The TIME_WAIT state is our friend and is there to help us (i.e., to let old duplicate segments expire in the network). Instead of trying to avoid the state, we should understand it.
Translator: It exists for a reason; face it bravely instead of running away.

Translator Note:

The translator worked from the original text and added supporting arguments, so the result differs slightly from the original.
On Linux, netstat and ss print state names differently: netstat shows TIME_WAIT and the like, with underscores, while ss shows TIME-WAIT, with hyphens.

A brutal case:

Clients behind NAT or a stateful firewall will get dropped; 99.99999999% of the time it should never be enabled. (From page 30 of "Tuning TCP and Nginx on EC2".)

References:

Tuning TCP and Nginx on EC2, p. 30
TIME_WAIT and its design implications for protocols and scalable client server systems
tcp_tw_recycle and tcp_timestamps cause connect failure issues
Notes on a TIME_WAIT network failure
A problem caused by enabling tcp_tw_recycle
Transmission Control Protocol

This article is an English version of an article which is originally in the Chinese language on aliyun.com and is provided for information purposes only. This website makes no representation or warranty of any kind, either expressed or implied, as to the accuracy, completeness ownership or reliability of the article or any translations thereof. If you have any concerns or complaints relating to the article, please send an email, providing a detailed description of the concern or complaint, to info-contact@alibabacloud.com. A staff member will contact you within 5 working days. Once verified, infringing content will be removed immediately.
