High-Performance Network Programming 7 -- Memory Usage of TCP Connections

Source: Internet
Author: User
Tags: ack, memory usage, socket
When a server handles on the order of 100,000 concurrent TCP connections, we naturally ask how much memory a single TCP connection consumes in the operating system kernel. The socket API provides options such as SO_SNDBUF and SO_RCVBUF to set the read/write buffers of an individual connection, and Linux additionally provides the following system-wide settings that govern overall TCP memory usage on the server. Their names look similar and their meanings partly overlap, which makes them easy to confuse (all of them can be listed with sysctl -a):

    net.ipv4.tcp_rmem = 8192 87380 16777216
    net.ipv4.tcp_wmem = 8192 65536 16777216
    net.ipv4.tcp_mem = 8388608 12582912 16777216
    net.core.rmem_default = 262144
    net.core.wmem_default = 262144
    net.core.rmem_max = 16777216
    net.core.wmem_max = 16777216
There are also some less frequently mentioned settings related to TCP memory:

    net.ipv4.tcp_moderate_rcvbuf = 1
    net.ipv4.tcp_adv_win_scale = 2
(Note: for brevity, the prefixes of the settings above are omitted below, and the values of a multi-value setting are addressed like an array; for example, tcp_rmem[2] refers to the last column of the first row above, i.e. 16777216.)
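These values can of course be inspected with sysctl, but it is sometimes handy to read them from inside a program. Below is a minimal C sketch (illustrative only, not part of the original article) that dumps the same settings by reading the matching files under /proc/sys, whose paths mirror the sysctl names with dots replaced by slashes:

    #include <stdio.h>

    /* Print one sysctl value by reading its /proc/sys file. */
    static void dump_sysctl(const char *path)
    {
        char buf[256];
        FILE *f = fopen(path, "r");
        if (!f) {
            perror(path);
            return;
        }
        if (fgets(buf, sizeof(buf), f))
            printf("%-40s %s", path, buf);
        fclose(f);
    }

    int main(void)
    {
        dump_sysctl("/proc/sys/net/ipv4/tcp_rmem");
        dump_sysctl("/proc/sys/net/ipv4/tcp_wmem");
        dump_sysctl("/proc/sys/net/ipv4/tcp_mem");
        dump_sysctl("/proc/sys/net/core/rmem_default");
        dump_sysctl("/proc/sys/net/core/wmem_default");
        dump_sysctl("/proc/sys/net/core/rmem_max");
        dump_sysctl("/proc/sys/net/core/wmem_max");
        return 0;
    }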
Explanations of these settings can be found all over the web, but they are often confusing. For example, tcp_rmem[2] and rmem_max both appear to describe the maximum receive buffer, yet they can be set to different values; what is the difference? Or tcp_wmem[1] and wmem_default both appear to describe the default send buffer; what happens if they conflict? And in a SYN handshake packet captured with a sniffer, why does the advertised TCP receive window seem to have nothing to do with any of these settings?
The memory a TCP connection uses inside a process varies constantly. Programs are usually more complex than plain socket programming: platform-level components may wrap the TCP connection and add their own user-space buffering, and this differs across platforms, components, middleware, and network libraries. In contrast, the way the kernel allocates memory for a TCP connection follows essentially the same algorithm everywhere. This article therefore tries to explain how much kernel memory a TCP connection uses and what strategy the operating system applies to balance overall (macro) throughput against the transfer speed of an individual (micro) connection. As before, the article targets application developers rather than kernel developers, so it does not account for every byte the kernel allocates per connection or per TCP segment; kernel-level data structures are not its focus, nor are they what an application programmer cares about. The focus is on how the Linux kernel manages the read and write buffers for data transmitted over a TCP connection.

1. What is the buffer upper limit?
(1) Start with SO_SNDBUF and SO_RCVBUF, which an application can set when programming with sockets.
Whatever the language, it exposes some equivalent of setsockopt for setting SO_SNDBUF and SO_RCVBUF on a TCP connection. How should these two options be understood? SO_SNDBUF and SO_RCVBUF are per-connection settings: they affect only the connection on which they are set and have no effect on other connections. SO_SNDBUF sets the upper limit of the kernel write buffer for this connection. Strictly speaking, the value the process passes in is not the real limit: the kernel doubles it and uses the doubled value as the write buffer limit. We need not dwell on this detail; it is enough to know that setting SO_SNDBUF effectively caps the memory the write buffer of this TCP connection may use. The value cannot be set arbitrarily, however: it is bounded by system-wide limits. If it exceeds wmem_max (net.core.wmem_max), it is replaced by wmem_max (and then likewise doubled). It also has a lower bound; for example, in the 2.6.18 kernel the minimum write buffer is 2 KB, and smaller values are simply raised to 2 KB.
SO_RCVBUF sets the upper limit of the read buffer on a connection. Like SO_SNDBUF it is bounded by rmem_max (net.core.rmem_max), and the kernel likewise uses twice the value passed in as the actual read buffer limit. SO_RCVBUF also has a lower bound: in the 2.6.18 kernel, values below 256 bytes are replaced by 256.
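To make the doubling concrete, here is a minimal C sketch (illustrative only, not from the original article) that sets SO_SNDBUF and SO_RCVBUF and then reads them back with getsockopt; on Linux the values reported back are normally twice the requested values, clamped by wmem_max and rmem_max as described above:

    #include <stdio.h>
    #include <sys/socket.h>
    #include <netinet/in.h>
    #include <unistd.h>

    int main(void)
    {
        int fd = socket(AF_INET, SOCK_STREAM, 0);
        int snd = 64 * 1024, rcv = 128 * 1024;
        socklen_t len;

        /* Request per-connection buffer limits. */
        setsockopt(fd, SOL_SOCKET, SO_SNDBUF, &snd, sizeof(snd));
        setsockopt(fd, SOL_SOCKET, SO_RCVBUF, &rcv, sizeof(rcv));

        /* Read them back: on Linux the kernel reports the doubled value,
         * bounded by net.core.wmem_max / net.core.rmem_max. */
        len = sizeof(snd);
        getsockopt(fd, SOL_SOCKET, SO_SNDBUF, &snd, &len);
        len = sizeof(rcv);
        getsockopt(fd, SOL_SOCKET, SO_RCVBUF, &rcv, &len);
        printf("SO_SNDBUF = %d, SO_RCVBUF = %d\n", snd, rcv);

        close(fd);
        return 0;
    }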

(2) How do the SO_SNDBUF and SO_RCVBUF upper limits relate to the memory actually used?
The memory used by a TCP connection is determined mainly by its read and write buffers, and their size depends only on actual usage; as long as actual usage stays below the limits, SO_SNDBUF and SO_RCVBUF have no effect. For the read buffer, receiving a TCP segment from the peer causes the read buffer to grow; if accepting the segment would push the read buffer past its limit, the segment is dropped and the read buffer size stays unchanged. When does the read buffer shrink? It shrinks when the process reads from the TCP stream with read, recv, and similar calls. The read buffer is therefore dynamic: memory is allocated only as it is actually used, and on an idle connection whose received data has all been consumed by the user process, the read buffer occupies 0 bytes.
The same is true for the write buffer. When a user process calls send or write to send data on the TCP stream, the write buffer grows; if the write buffer has already reached its limit, it stays unchanged and the call reports failure to the user process. The write buffer shrinks whenever an ACK from the peer confirms that a segment has been received successfully, because TCP reliability requires keeping a copy of unacknowledged data so that the retransmission timer can resend it if it is lost. So the write buffer is dynamic too, and on an idle, healthy connection it also occupies 0 bytes.
The buffer limits therefore only matter in two situations. If segments arrive from the network faster than the application reads them, the read buffer may hit its limit; the effect is to drop newly received segments so that this one TCP connection does not consume an excessive share of server resources. Similarly, if the application sends faster than ACKs come back from the peer, the write buffer may hit its limit, causing send and similar calls to fail because the kernel will not allocate more memory for them.
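The write-buffer-full case is the one applications most often have to handle explicitly. The sketch below (illustrative only, not from the original article) sends on a non-blocking socket and treats EAGAIN/EWOULDBLOCK as "the kernel write buffer for this connection is full", which is exactly the condition described above; a real program would then wait for writability with poll or epoll before retrying.

    #include <errno.h>
    #include <stdio.h>
    #include <sys/socket.h>

    /* Try to queue 'len' bytes into the kernel write buffer of a TCP socket.
     * Returns bytes queued, 0 if the write buffer is currently full,
     * or -1 on a real error. */
    ssize_t send_some(int fd, const char *buf, size_t len)
    {
        ssize_t n = send(fd, buf, len, MSG_DONTWAIT);
        if (n >= 0)
            return n;                     /* queued into the write buffer */
        if (errno == EAGAIN || errno == EWOULDBLOCK) {
            /* Write buffer is at its limit (SO_SNDBUF / tcp_wmem):
             * back off and retry once the socket becomes writable. */
            return 0;
        }
        perror("send");
        return -1;
    }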

2. What is the relationship between buffer size and the TCP sliding window?
(1) The size of the sliding window is certainly related to the size of the buffer, but it is not a one-to-one mapping, and it certainly does not map one-to-one onto the buffer upper limit. So it makes sense that claims found on the web, such as "rmem_max sets the maximum sliding window", do not match the win value we see when capturing packets with tcpdump. Let us look at where the difference comes from.
The read buffer serves two purposes: 1) it caches out-of-order TCP segments that fall within the receive sliding window; and 2) once in-order data becomes readable by the application, it holds that data until the application gets around to reading it, since applications read with some delay. The read buffer is therefore split into a part that caches out-of-order segments and a part that caches in-order data awaiting a delayed read. The two parts share the same upper limit and thus influence each other: when the application reads too slowly, the oversized "application" part squeezes the "socket" part, the receive sliding window shrinks, and the peer is told to slow down, avoiding pointless network transmission. If the application does not read for so long that the application part squeezes the socket part down to nothing, the peer receives a receive window of 0, which says: I cannot digest any more data right now.
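There is no portable way to watch how the kernel is splitting the read buffer at a given moment, but on Linux a rough view of the window-related state of a connection can be obtained with the TCP_INFO socket option. The sketch below assumes a Linux/glibc environment and is not part of the original article; tcpi_rcv_space is the value the kernel uses for receive-buffer autotuning, and tcpi_snd_cwnd is the congestion window in segments.

    #include <stdio.h>
    #include <netinet/in.h>
    #include <netinet/tcp.h>
    #include <sys/socket.h>

    /* Print a few window-related counters for a connected TCP socket.
     * Linux-specific: uses the TCP_INFO socket option. */
    void print_tcp_windows(int fd)
    {
        struct tcp_info info;
        socklen_t len = sizeof(info);

        if (getsockopt(fd, IPPROTO_TCP, TCP_INFO, &info, &len) < 0) {
            perror("getsockopt(TCP_INFO)");
            return;
        }
        printf("snd_cwnd  = %u segments\n", info.tcpi_snd_cwnd);
        printf("snd_mss   = %u bytes\n",    info.tcpi_snd_mss);
        printf("rcv_mss   = %u bytes\n",    info.tcpi_rcv_mss);
        printf("rcv_space = %u bytes (used by receive-buffer autotuning)\n",
               info.tcpi_rcv_space);
    }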
On the other hand, the receive sliding window is always changing. Let us capture the three-way handshake with tcpdump:

    14:49:52.421674 IP houyi-vm02.dev.sd.aliyun.com.6400 > r14a02001.dg.tbsite.net.54073: S 2736789705:2736789705(0) ack 1609024383 win 5792 <mss 1460,sackOK,timestamp 2925954240 2940689794,nop,wscale 9>
You can see that the initial receive window is 5792, far smaller than the maximum receive buffer (tcp_rmem[1], described later). There is a reason for this: TCP has to cope with complex network conditions, so it uses slow start and a congestion window (see High-Performance Network Programming 2 -- TCP message delivery), and the initial window when a connection is established is not initialized to the maximum allowed by the receive buffer. Viewed at the macro level, a large initial window can push the whole network into an overloaded vicious circle: the routers and switches along many paths (especially on WANs) may be unable to carry the load and start dropping packets, while at the micro level each endpoint sets its receive window purely according to its own read buffer limit, and the larger both sides' send windows (i.e. the peer's receive window), the worse the network behaves. Slow start keeps the initial window as small as possible and enlarges the receive window only as valid data is acknowledged by the peer and the network's actual carrying capacity is confirmed.
Different Linux kernels use different initial windows. Take the widely deployed 2.6.18 kernel as an example: on Ethernet, with an MSS of 1460, the initial window is 4 times the MSS. Simplified code (*rcv_wnd is the initial receive window):

    int init_cwnd = 4;
    if (mss > 1460 * 3)
        init_cwnd = 2;
    else if (mss > 1460)
        init_cwnd = 3;
    if (*rcv_wnd > init_cwnd * mss)
        *rcv_wnd = init_cwnd * mss;
You may ask why the capture above shows a window of 5792 rather than 1460 * 4 = 5840. The value 1460 means: take the 1500-byte MTU, subtract the 20-byte IP header and the 20-byte TCP header, and you get the maximum payload one segment can carry. On some networks, however, the TCP options carry a 12-byte timestamp, so the usable payload becomes the MSS minus 12, and the initial window is (1460 - 12) * 4 = 5792. That is consistent with what the window is meant to express: the amount of payload data I can currently handle.
In Linux 3 and later, the initial window was enlarged to 10 MSS, mainly following Google's recommendation. The reasoning: the window usually grows quickly (below the congestion threshold it grows exponentially, above it, in the congestion-avoidance phase, it grows linearly, and the threshold itself also has a chance to rise quickly once more than 128 data packets have been received), so for bulk transfers such as video the window eventually approaches the maximum read buffer and the connection runs at "full throttle". But a typical web page of a few dozen KB is finished before a small initial window has had time to grow to a suitable size, so the user needs more round trips (RTTs) to transfer the same data than with a larger initial window, and the experience suffers.
A further question: as the window grows from the initial window toward the maximum receive window, is the maximum receive window simply the maximum read buffer? No, because part of the buffer must be reserved for data that the application has not yet read. How the buffer is divided between the two is controlled by a system setting, tcp_adv_win_scale, which was listed above.
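For reference, the split controlled by tcp_adv_win_scale works roughly as follows: with a positive value n, the kernel reserves buffer/2^n of the receive buffer for application data and advertises at most the remainder as the receive window (with the default of 2, the window is at most 3/4 of the buffer). The C sketch below mirrors that calculation; it is an illustration of the kernel's tcp_win_from_space() logic, not code copied from the kernel.

    #include <stdio.h>

    /* How the advertisable window is derived from the receive buffer space,
     * given tcp_adv_win_scale (illustrative only).
     *   positive scale n:  window = space - space / 2^n
     *   non-positive n:    window = space / 2^(-n)                     */
    static int win_from_space(int space, int adv_win_scale)
    {
        return adv_win_scale <= 0 ? space >> (-adv_win_scale)
                                  : space - (space >> adv_win_scale);
    }

    int main(void)
    {
        int space = 87380;   /* e.g. tcp_rmem[1], the default read buffer */
        printf("scale=2: window <= %d of %d bytes\n", win_from_space(space, 2), space);
        printf("scale=1: window <= %d of %d bytes\n", win_from_space(space, 1), space);
        return 0;
    }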
