High-performance network programming 7 -- memory usage of TCP connections

Source: Internet
Author: User
Tags: ack, data structures, memory usage
When a server handles on the order of 100,000 concurrent TCP connections, we naturally want to know how much memory each TCP connection consumes in the operating system kernel. The socket API provides the SO_SNDBUF and SO_RCVBUF options for setting the read and write caches of an individual connection, and Linux also provides the following system-level configuration items for controlling TCP memory usage on the server as a whole. The names of these settings look related yet partly contradictory, and their meanings are easy to confuse (the sysctl -a command lists them):
net.ipv4.tcp_rmem = 8192 87380 16777216
net.ipv4.tcp_wmem = 8192 65536 16777216
net.ipv4.tcp_mem = 8388608 12582912 16777216
net.core.rmem_default = 262144
net.core.wmem_default = 262144
net.core.rmem_max = 16777216
net.core.wmem_max = 16777216

There are also some less-mentioned configurations that are related to TCP memory:
net.ipv4.tcp_moderate_rcvbuf = 1
net.ipv4.tcp_adv_win_scale = 2

(Note: for brevity, the prefixes of the configuration names above are omitted in the rest of this article, and a configuration value made up of several numbers is treated as an array; for example, tcp_rmem[2] refers to the last number in the first line above, 16777216.)
Explanations of these configuration items can be found all over the web, but they are often vague. For example, tcp_rmem[2] and rmem_max both seem to describe the maximum receive cache, yet their values can differ, so what is the difference? Likewise, tcp_wmem[1] and wmem_default both seem to give the default size of the send cache, so which one wins when they conflict? And in the SYN packet of a handshake captured with a sniffer, why does the advertised TCP receive window seem to have nothing to do with any of these settings?
The memory used by a TCP connection changes constantly. In a complex program the code may not use the socket API directly; platform-level components may encapsulate the user-space memory a TCP connection uses, and that differs across platforms, components, middleware, and network libraries. The kernel's algorithm for allocating memory to TCP connections, however, is essentially unchanged across versions. This article tries to explain how much memory a TCP connection uses in the kernel, and what strategy the operating system applies to balance overall throughput against the transfer speed of an individual connection. As always, the article is aimed at application developers rather than kernel developers, so it does not enumerate exactly how many bytes the kernel allocates for a TCP connection or a TCP segment, and kernel-level data structures are not its focus either. The subject is how the Linux kernel manages the read and write caches for data transmitted over TCP connections.

First, what is the cache limit?
(1) Let us start with SO_SNDBUF and SO_RCVBUF, which applications can set directly.
Whatever the language, TCP connections expose a setsockopt method for setting SO_SNDBUF and SO_RCVBUF. How should these two options be understood? SO_SNDBUF and SO_RCVBUF are per-connection settings: they affect only the connection on which they are set, not any other connection. SO_SNDBUF sets the upper limit of the kernel write cache on this connection. Strictly speaking, the value the process sets is not the real limit; the kernel doubles it and uses the doubled value as the write cache limit. We do not need to dwell on this detail; just remember that setting SO_SNDBUF effectively sets the maximum amount of memory the write cache of this TCP connection may use. The value cannot be set arbitrarily, either: it is subject to system-level upper and lower bounds. When it is larger than the system configuration wmem_max (net.core.wmem_max), it is replaced by wmem_max (which is likewise doubled); and when it is very small, for example below the minimum write cache of 2K bytes in the 2.6.18 kernel, it is simply replaced by 2K.
SO_RCVBUF sets the upper limit of the read cache on a connection. It behaves like SO_SNDBUF: it is capped by the rmem_max configuration item, and the kernel actually uses twice the value as the read cache limit. SO_RCVBUF also has a lower bound; in the 2.6.18 kernel, a value below 256 bytes is replaced by 256.
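As a concrete illustration, here is a minimal C sketch (not taken from the original article; the buffer sizes are arbitrary) that sets both options with setsockopt and then reads the effective values back with getsockopt. On Linux the values read back are roughly twice the requested sizes, clamped by wmem_max/rmem_max and the small lower bounds described above:

  /* Minimal sketch: set and read back the per-connection cache limits. */
  #include <stdio.h>
  #include <sys/socket.h>
  #include <unistd.h>

  int main(void)
  {
      int fd = socket(AF_INET, SOCK_STREAM, 0);
      int sndbuf = 256 * 1024;   /* requested write-cache limit (arbitrary) */
      int rcvbuf = 256 * 1024;   /* requested read-cache limit (arbitrary)  */
      socklen_t len = sizeof(int);

      setsockopt(fd, SOL_SOCKET, SO_SNDBUF, &sndbuf, sizeof(sndbuf));
      setsockopt(fd, SOL_SOCKET, SO_RCVBUF, &rcvbuf, sizeof(rcvbuf));

      /* The kernel stores roughly double the requested value, subject to
         the wmem_max/rmem_max ceilings and the minimum sizes. */
      getsockopt(fd, SOL_SOCKET, SO_SNDBUF, &sndbuf, &len);
      getsockopt(fd, SOL_SOCKET, SO_RCVBUF, &rcvbuf, &len);
      printf("effective SO_SNDBUF=%d, SO_RCVBUF=%d\n", sndbuf, rcvbuf);

      close(fd);
      return 0;
  }

Remember (as the third section explains) that once these options are set explicitly, the kernel stops auto-tuning the caches of that connection.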

(2) How do the SO_SNDBUF and SO_RCVBUF cache limits that we can set relate to the memory actually used?
The memory a TCP connection uses is determined mainly by its read and write caches, and the size of those caches depends only on actual usage; SO_SNDBUF and SO_RCVBUF play no role until the actual usage reaches them. For the read cache, when a TCP segment arrives from the peer, the read cache grows; of course, if the segment would push the read cache past its limit, the segment is dropped and the read cache size stays unchanged. When does the read cache shrink? It shrinks when the process calls read, recv, or a similar method to consume the TCP stream. The read cache is therefore dynamic: only as much buffer memory is allocated as is actually needed, and on a very idle connection whose received data the user process has fully consumed, the read cache uses 0 bytes of memory.
The same is true of the write cache. It grows when the user process calls send or write to send data on the TCP stream; of course, if the write cache has already reached its limit, it stays unchanged and the failure is reported back to the user process. And whenever the peer acknowledges a segment with an ACK, confirming successful delivery, the write cache shrinks. This follows from TCP's reliability: a segment cannot be destroyed right after it is sent, because it may be lost and the retransmission timer may have to resend it. The write cache is therefore dynamic too, and on an idle, healthy connection the memory used for it is usually 0.
So the read cache reaches its limit only when network segments arrive faster than the application reads them, and only then does the limit take effect: newly received segments are dropped, preventing the TCP connection from consuming too much server memory. Similarly, when the application sends faster than the peer acknowledges with ACKs, the write cache may hit its limit, so that send and similar calls fail and the kernel allocates no more memory for them.
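For example (a hedged sketch, not from the original article; fd is assumed to be a connected, non-blocking TCP socket): once the write cache is full, send stops copying data into the kernel and fails with EAGAIN/EWOULDBLOCK, and the application has to wait, typically until poll or epoll reports the socket writable again:

  #include <errno.h>
  #include <sys/types.h>
  #include <sys/socket.h>

  /* Try to hand 'len' bytes to the kernel; returns bytes accepted, 0 if the
     write cache is currently full, or -1 on a real error. */
  ssize_t send_some(int fd, const char *buf, size_t len)
  {
      ssize_t n = send(fd, buf, len, 0);
      if (n < 0 && (errno == EAGAIN || errno == EWOULDBLOCK)) {
          /* Write cache at its limit: the peer has not ACKed enough data
             yet, so the kernel refuses to buffer more. Retry later. */
          return 0;
      }
      return n;
  }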

Second, what exactly is the relationship between cache size and the TCP sliding window?
(1) The size of the sliding window is certainly related to the size of the cache, but it is not a one-to-one mapping, and even less a one-to-one mapping to the cache limit. So it is no surprise that many claims found online, such as rmem_max setting the maximum sliding window, do not match the win value we see when capturing packets with tcpdump. Let us find out where the difference lies.
The read cache serves two purposes: 1, it caches out-of-order TCP segments that fall within the receive sliding window; 2, it holds in-order segments that the application could already read but has not yet read, because the application reads with some delay. The read cache is therefore split into two parts, one caching out-of-order segments and the other caching in-order segments awaiting a delayed read. The two parts share the same upper limit, so they affect each other: when the application reads too slowly, the large application cache squeezes the socket cache, the receive sliding window shrinks, and the peer is told to slow down, avoiding pointless network transmission. When the application does not read for a long time and its unread data squeezes the socket cache out of space entirely, the peer receives a window-size-0 notification, meaning: I cannot digest any more segments right now.
Conversely, the receive sliding window itself keeps changing. Let us capture the three-way handshake with tcpdump:
14:49:52.421674 IP houyi-vm02.dev.sd.aliyun.com.6400 > R14a02001.dg.tbsite.net.54073: S 2736789705:2736789705(0) ack 1609024383 win 5792 <mss 1460,sackOK,timestamp 2925954240 2940689794,nop,wscale 9>

You can see that the initial receive window is 5792, which is of course far smaller than the maximum receive cache (the tcp_rmem[1] described later). There is a reason for this: the TCP protocol has to cope with complex network conditions, so it uses slow start and a congestion window (see High-Performance Network Programming 2 -- sending TCP messages), and the initial window when a connection is established is not initialized to the maximum receive cache. Viewed at the macro level, an overly large initial window can overload the whole network and create a vicious circle, because the routers and switches along the many links may not bear the load and drop packets (especially on wide-area networks); at the micro level, each TCP connection only treats its own read cache limit as the receive window, and the larger both sides' send windows (that is, the peers' receive windows), the worse the impact on the network. Slow start keeps the initial window as small as possible and only begins to grow the receive window after valid segments from the peer have been acknowledged and the network's real transmission capacity has been confirmed.
Different Linux kernels use different initial windows. Take the widely deployed Linux 2.6.18 kernel as an example: on Ethernet the MSS is 1460, and the initial window is 4 times the MSS, as the simplified code below shows (*rcv_wnd is the initial receive window):
  int init_cwnd = 4;
  if (mss > 1460*3)
   init_cwnd = 2;
  else if (mss > 1460)
   init_cwnd = 3;
  if (*rcv_wnd > init_cwnd*mss)
   *rcv_wnd = init_cwnd*mss;

You may wonder why the window shown in the capture above is 5792 rather than 1460*4 = 5840. The reason is what 1460 is meant to express: after removing the 20-byte IP header and the 20-byte TCP header from the 1500-byte MTU, it is the maximum amount of payload a single segment can carry. In some networks, however, 12 bytes of TCP options are used for timestamps, so the payload per segment is the MSS minus 12, and the initial window is (1460-12)*4 = 5792, which matches what the window is meant to convey: the amount of payload I can handle.
In Linux 3 and later, the initial window was raised to 10 MSS, mainly on Google's recommendation. The reasoning: although the receive window often grows exponentially (below the congestion threshold it grows exponentially; above the threshold, in the congestion-avoidance phase, it grows linearly; and the congestion threshold itself can also jump up quickly once more than 128 data packets have been received), when transferring large payloads such as video the window grows to (nearly) the maximum read cache and the data flows "at full throttle". But a typical web page is only tens of KB, and the connection ends before the small initial window has grown to a suitable size. Compared with a larger initial window, this costs the user more round trips (RTTs) to transfer the same data and makes the experience worse.
You may now have another doubt: when the window grows from the initial window toward the maximum receive window, is the maximum receive window simply the maximum read cache? No, because part of the cache must be reserved for data whose reading the application has deferred. How is the split decided? It is a configurable system option:
net.ipv4.tcp_adv_win_scale = 2

Here tcp_adv_win_scale means that 1/(2^tcp_adv_win_scale) of the cache is set aside as the application cache. In other words, with the default tcp_adv_win_scale of 2, at least 1/4 of the memory is reserved as the application's read cache, so the largest possible receive sliding window is only 3/4 of the read cache.
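As a quick check of the arithmetic (a sketch of the formula as described above, not the kernel's actual code), the snippet below computes the application share and the largest possible receive window for a given read-cache size and tcp_adv_win_scale:

  #include <stdio.h>

  int main(void)
  {
      long rcvbuf = 87380;   /* e.g. tcp_rmem[1], the initial read-cache limit */
      int scale = 2;         /* default tcp_adv_win_scale */

      long app_share = rcvbuf / (1L << scale);   /* 1/(2^scale) kept for the application */
      long max_window = rcvbuf - app_share;      /* 3/4 of the cache when scale == 2 */

      printf("application share: %ld bytes, max receive window: %ld bytes\n",
             app_share, max_window);
      return 0;
  }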

(2) How large should the maximum read cache be set?
Once the application cache share has been fixed through tcp_adv_win_scale, the read cache limit should be derived from the largest TCP receive window. The initial window may be only 4 or 10 MSS, but without packet loss the window grows with each exchange of segments. When is a window "too big"? Too big not for the memory of the two communicating machines, but for the load on the network as a whole: it creates a vicious circle in which busy network devices keep dropping packets. Too small a window, on the other hand, fails to make full use of network resources. The maximum receive window is therefore usually set from the BDP (from which the maximum read cache can then be computed). BDP stands for bandwidth-delay product, the product of the bandwidth and the network delay. For example, if our bandwidth is 2 Gbps and the delay is 10 ms, the bandwidth-delay product BDP is 2G/8*0.01 = 2.5 MB, so on such a network the maximum receive window can be set to 2.5 MB, and the maximum read cache to 4/3*2.5 MB = 3.3 MB.
Why? Because the BDP describes the network's carrying capacity, and the maximum receive window describes how much data may be in flight, unacknowledged, within that carrying capacity. (The original article illustrates this with a figure at this point.)
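To make the arithmetic above concrete, here is a small sketch (using the numbers from the text, with tcp_adv_win_scale assumed to be 2 so the window is 3/4 of the cache):

  #include <stdio.h>

  int main(void)
  {
      double bandwidth_bps = 2e9;  /* 2 Gbps link */
      double rtt_s = 0.01;         /* 10 ms delay */

      double bdp = bandwidth_bps / 8.0 * rtt_s;         /* bytes in flight: 2.5 MB */
      double max_window = bdp;                          /* receive window sized to the BDP */
      double max_read_cache = max_window * 4.0 / 3.0;   /* ~3.3 MB */

      printf("BDP = %.1f MB, max read cache = %.1f MB\n",
             bdp / 1e6, max_read_cache / 1e6);
      return 0;
  }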
This is the often-mentioned "long fat network": "long" means a long delay, "fat" means a large bandwidth. Either one being large makes the BDP large, which should drive the maximum window up and, with it, the read cache limit. Servers on long fat networks therefore tend to have relatively large cache limits. (TCP's original 16-bit window field does impose an upper bound, but the window scaling defined in RFC 1323 allows the sliding window to grow large enough.)
The send window is simply the peer's receive window on the TCP connection, so its behavior can be inferred from the receive window; there is no need to repeat it.

Third, Linux's automatic adjustment strategy for the TCP cache limits
So, can we set the maximum cache limits and rest easy? For a single TCP connection, yes: to make full use of the network it should keep a large window and a large cache and transfer at high speed. On a long fat network, for example, the cache limit may be set to tens of megabytes. But the system's total memory is finite, and when every connection runs at full speed with the maximum window, 10,000 connections would occupy hundreds of GB of memory, which rules out high-concurrency scenarios and provides no fairness. What we actually want is this: when there are few concurrent connections, raise the cache limits so each TCP connection can run at full throttle; when there are many concurrent connections and memory is scarce, lower the cache limits so each connection's cache is as small as possible and more connections can be accommodated.
To achieve this, Linux introduces automatic tuning of memory allocation, controlled by the tcp_moderate_rcvbuf configuration: net.ipv4.tcp_moderate_rcvbuf = 1. By default tcp_moderate_rcvbuf is 1, which means the automatic TCP memory tuning feature is enabled. If it is set to 0, the feature does not take effect (use with caution).
Note also that when we set SO_SNDBUF or SO_RCVBUF on a connection in our program, the Linux kernel no longer performs automatic tuning on that connection.
So how does this feature actually work? Look at the following configuration items:
net.ipv4.tcp_rmem = 8192 87380 16777216
net.ipv4.tcp_wmem = 8192 65536 16777216
net.ipv4.tcp_mem = 8388608 12582912 16777216

The three-element tcp_rmem array controls the read cache limit of any TCP connection: tcp_rmem[0] is the smallest possible limit, tcp_rmem[1] is the initial limit (note that it overrides rmem_default, which applies to all protocols), and tcp_rmem[2] is the largest possible limit. The three-element tcp_wmem array plays the same role for the write cache and needs no repetition.
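For reference, here is a short sketch (assuming the standard /proc location of this sysctl on Linux) that reads the three elements of tcp_rmem and labels them as described above:

  #include <stdio.h>

  int main(void)
  {
      long rmem[3];
      FILE *f = fopen("/proc/sys/net/ipv4/tcp_rmem", "r");
      if (!f) {
          perror("tcp_rmem");
          return 1;
      }
      if (fscanf(f, "%ld %ld %ld", &rmem[0], &rmem[1], &rmem[2]) == 3)
          printf("tcp_rmem: min=%ld initial=%ld max=%ld\n",
                 rmem[0], rmem[1], rmem[2]);
      fclose(f);
      return 0;
  }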
The three-element tcp_mem array sets the memory usage of TCP as a whole, so its values are large (its unit is not bytes but pages, 4K or 8K, etc.). The three values define the no-pressure level of total TCP memory, the threshold at which pressure mode is switched on, and the maximum usage. With these three values as marks, total memory falls into four cases:
1. When total TCP memory is below tcp_mem[0], the system is under no memory pressure at all. If memory previously exceeded tcp_mem[1] and put the system into memory pressure mode, pressure mode is now switched off. In this case, as long as the cache used by the TCP connection has not reached its limit (note: although the initial limit is tcp_rmem[1], this value is variable, as explained below), new memory allocations are guaranteed to succeed.
2. When total TCP memory is between tcp_mem[0] and tcp_mem[1], the system may be in memory pressure mode, for example if total memory has just dropped below tcp_mem[1], or it may not be, for example if total memory has just risen above tcp_mem[0]. In this range, whether in pressure mode or not, new memory allocations are guaranteed to succeed as long as the connection's cache does not exceed tcp_rmem[0] or tcp_wmem[0]; otherwise they will, for the most part, fail. (Note: there are additional exceptional cases in which allocation can still succeed, but they are skipped here because they add nothing to understanding these configuration items.)
3. When total TCP memory is between tcp_mem[1] and tcp_mem[2], the system is definitely in memory pressure mode. The other behavior is the same as above.
4. When total TCP memory exceeds tcp_mem[2], the system is without question in pressure mode, and at this point every new TCP cache allocation fails.
The original article gives a figure showing the kernel's simplified logic when new cache is needed; the sketch below captures the same decision in code-like form:
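(The function and variable names below are stand-ins for the quantities discussed in cases 1-4; this is a hedged sketch, not the literal kernel source.)

  #include <stdbool.h>

  /* Stand-in values matching the tcp_mem configuration above (unit: pages). */
  static long tcp_mem[3] = { 8388608, 12582912, 16777216 };
  static bool pressure_mode = false;

  /* total_pages: pages used by TCP as a whole; conn_cache: this connection's
     current cache; cache_limit: its (variable) upper limit; min_cache: the
     tcp_rmem[0]/tcp_wmem[0] floor. Returns true if a new allocation succeeds. */
  static bool may_allocate(long total_pages, long conn_cache,
                           long cache_limit, long min_cache)
  {
      if (total_pages < tcp_mem[0]) {      /* case 1: no pressure at all */
          pressure_mode = false;
          return conn_cache < cache_limit;
      }
      if (total_pages > tcp_mem[1])        /* crossing the threshold: pressure on */
          pressure_mode = true;
      if (total_pages > tcp_mem[2])        /* case 4: every new allocation fails */
          return false;
      if (conn_cache < min_cache)          /* cases 2 and 3: small connections still succeed */
          return true;
      return false;                        /* otherwise allocation essentially fails */
  }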
When the system is not in pressure mode, the per-connection read and write cache limits mentioned above may be raised, though never beyond tcp_rmem[2] or tcp_wmem[2]. Conversely, in pressure mode the read and write cache limits may be lowered, possibly even below tcp_rmem[0] or tcp_wmem[0].
A rough summary of these three arrays:
1. As soon as total TCP memory exceeds tcp_mem[2], new memory allocations fail.
2. tcp_rmem[0] and tcp_wmem[0] have high priority: as long as rule 1 is not violated, a connection whose cache is below these values is guaranteed a successful allocation.
3. As long as total memory does not exceed tcp_mem[0], a connection whose cache has not reached its limit is also guaranteed a successful allocation.
4. tcp_mem[1] and tcp_mem[0] form the switch that turns memory pressure mode on and off. In pressure mode, per-connection cache limits may shrink; out of pressure mode, they may grow, up to at most tcp_rmem[2] or tcp_wmem[2].






