Improve socket performance on Linux
Four methods for accelerating network applications
M. Tim Jones (mtj@mtjones.com), senior software engineer, Emulex

Tim Jones is an embedded software engineer and the author of GNU/Linux Application Programming, AI Application Programming, BSD Sockets Programming from a Multilanguage Perspective, and other books. His engineering background is broad, ranging from kernel development for geosynchronous spacecraft to embedded architecture design and network protocol development. Tim is a senior software engineer at Emulex Corp. and an IBM developerWorks contributing author.
Introduction: With the Sockets API, you can develop client and server applications that communicate over a local network or across the Internet. Like any API, the Sockets API can be used in ways that either promote or limit performance. This article explores four ways to use the Sockets API to squeeze the greatest performance out of your application, and techniques for tuning the GNU/Linux environment to achieve the best results.

When developing a socket application, your first job is usually to ensure reliability and meet the specific requirements. With the four tips in this article, you can design and develop your socket application for optimal performance from the start. This article covers use of the Sockets API, two socket options that improve performance, and GNU/Linux tuning.

To develop applications with superior performance, follow these tips:
- Minimize packet transmission latency.
- Minimize system call overhead.
- Adjust the TCP window for the Bandwidth Delay Product.
- Dynamically optimize the GNU/Linux TCP/IP stack.
Tip 1. Minimize packet transmission latency
When you communicate over a TCP socket, data is split into chunks that fit into the TCP payload of the given connection's packets. The size of the TCP payload depends on several factors (such as the maximum packet size along the path), but these factors are known when the connection is initiated. For best performance, the goal is to fill each packet with as much available data as possible. When there is not enough data to fill the payload (also known as the maximum segment size, or MSS), TCP employs the Nagle algorithm to automatically concatenate small buffers into a single segment. Doing so improves application efficiency by minimizing the number of packets sent, and it reduces overall network congestion.
Although John Nagle's algorithm minimizes the number of packets sent by coalescing data into larger segments, sometimes you need the freedom to send smaller packets. A simple example is the telnet program, which lets a user interact with a remote system, typically through a shell. If the user were required to fill a segment with typed characters before a packet was sent, this approach would clearly not meet our needs.

Another example is the HTTP protocol. Generally, the client browser makes a small request (an HTTP request message), and the Web server returns a larger response (the Web page).
Solution
The first thing to consider is whether the Nagle algorithm fills a need. Because the algorithm coalesces data to try to build full TCP segments, it introduces some latency. But it does so while minimizing the number of packets sent on the wire, and therefore minimizing network congestion.

In cases where transmission latency must be minimized, however, the Sockets API provides a solution. To disable the Nagle algorithm, set the TCP_NODELAY socket option, as shown in Listing 1.
Listing 1. Disable the Nagle Algorithm for TCP socket
```c
int sock, flag, ret;

/* Create new stream socket */
sock = socket( AF_INET, SOCK_STREAM, 0 );

/* Disable the Nagle (TCP No Delay) algorithm */
flag = 1;
ret = setsockopt( sock, IPPROTO_TCP, TCP_NODELAY, (char *)&flag, sizeof(flag) );

if (ret == -1) {
  printf("Couldn't setsockopt(TCP_NODELAY)\n");
  exit(-1);
}
```
Tip: Samba experimentation demonstrates that disabling the Nagle algorithm almost doubles read performance when reading from a Samba drive on a Microsoft Windows server.
Tip 2. Minimize system call overhead

Whenever you read or write data through a socket, you are using a system call. This call (such as read or write) crosses the boundary between the user-space application and the kernel. In addition, before reaching the kernel, your call goes through the C library to a common function in the kernel (system_call()). From system_call(), the call reaches the file system layer, where the kernel determines what type of device it is dealing with. Eventually, the call reaches the socket layer, where data is read or queued for transmission over the socket (involving a data copy).
This process demonstrates that a system call operates not just in the application and kernel, but through many layers within each. The process is expensive, so the more calls you make, the longer you work through this call chain and the lower your application's performance.

Since we cannot avoid these system calls, the only choice is to minimize the number of times we use them. Fortunately, that process is under our control.
Solution
When writing data to a socket, write all the data at once instead of performing multiple write operations. For read operations, pass in the largest buffer you can support, because the kernel will try to fill the entire buffer if enough data is present (in addition to keeping TCP's advertised window open). In this way, you minimize the number of calls and achieve better overall performance.
Tip 3. Adjust the TCP window for the Bandwidth Delay Product

TCP performance depends on several factors. The two most important are the link bandwidth (the rate at which packets can be transmitted on the network) and the round-trip time, or RTT (the delay between sending a segment and receiving an acknowledgment from the peer). These two values determine what is called the Bandwidth Delay Product (BDP).
Given the link bandwidth and RTT, you can calculate the BDP. What does this give you? The BDP provides a simple way to calculate the theoretical optimal TCP socket buffer size (which holds both the data queued for transmission and the data waiting to be received by the application). If the buffer is too small, the TCP window cannot fully open, which limits performance. If the buffer is too large, precious memory resources are wasted. If you set the buffer just right, you can fully utilize the available bandwidth. Here is an example:
BDP = link_bandwidth * RTT
If an application communicates over a 100 Mbps local area network with a 50 ms RTT, the BDP is:

100Mbps * 0.050 sec / 8 = 0.625MB = 625KB

Note: Dividing by 8 converts bits into the bytes of communication.
Therefore, you could set your TCP window to the BDP, or 625 KB. But the default TCP window size in Linux 2.6 is 110 KB, which limits the connection's bandwidth to 2.2 MBps, as calculated here:
throughput = window_size / RTT
110KB / 0.050 = 2.2MBps
If instead you use the window size calculated above, you get a bandwidth of 12.5 MBps, as shown here:
625KB / 0.050 = 12.5MBps
That is quite a difference, and it provides greater throughput for the socket. So now you know how to calculate the optimal socket buffer size. But how do you change it?
Solution
The Sockets API provides several socket options, two of which exist to change the sizes of the socket send and receive buffers. Listing 2 shows how to use the SO_SNDBUF and SO_RCVBUF options to adjust the sizes of the send and receive buffers.
Note: Although the socket buffer size determines the size of the advertised TCP window, TCP also maintains a congestion window within the advertised window. Therefore, because of this congestion window, a given socket may never utilize the maximum advertised window.
Listing 2. Manually setting the send and receive socket buffer sizes
```c
int ret, sock, sock_buf_size;

sock = socket( AF_INET, SOCK_STREAM, 0 );

sock_buf_size = BDP;

ret = setsockopt( sock, SOL_SOCKET, SO_SNDBUF,
                   (char *)&sock_buf_size, sizeof(sock_buf_size) );

ret = setsockopt( sock, SOL_SOCKET, SO_RCVBUF,
                   (char *)&sock_buf_size, sizeof(sock_buf_size) );
```
In the Linux 2.6 kernel, the send buffer size is taken as defined by the caller, but the receive buffer is automatically doubled. You can use getsockopt to verify the size of each buffer.
Jumbo frames

Also consider increasing the packet size from 1,500 to 9,000 bytes (known as a jumbo frame). This can be done on a local network by setting the maximum transmission unit (MTU), and it can greatly boost performance.
As for window scaling, TCP originally supported windows of at most 64 KB (a 16-bit value defined the window size). With the window scaling extension (RFC 1323), a 32-bit value can be used to represent the window size. The TCP/IP stack provided in GNU/Linux supports this option (along with others).
Tip: The Linux kernel also includes the ability to auto-tune these socket buffers (see tcp_rmem and tcp_wmem in Table 1 below), but those options affect the entire stack. If you need to adjust the window size only for one connection or one class of connections, this mechanism may not meet your needs.
Tip 4. Dynamically optimize the GNU/Linux TCP/IP stack

A standard GNU/Linux distribution tries to optimize for a wide variety of deployment conditions. This means that a standard distribution may not be specially optimized for your environment.
Solution
GNU/Linux provides many tunable kernel parameters that you can use to dynamically configure the operating system for your own purposes. Let's look at some of the more important options that affect socket performance.
Tunable kernel parameters exist within the /proc virtual file system. Each file in this file system represents one or more parameters, which can be read with the cat tool or modified with the echo command. Listing 3 shows how to query and enable a tunable parameter (in this case, enabling IP forwarding in the TCP/IP stack).
Listing 3. Optimization: Enable IP Forwarding in the TCP/IP stack
```
[root@camus]# cat /proc/sys/net/ipv4/ip_forward
0
[root@camus]# echo "1" > /proc/sys/net/ipv4/ip_forward
[root@camus]# cat /proc/sys/net/ipv4/ip_forward
1
[root@camus]#
```
Table 1 provides several adjustable parameters that can help you improve the performance of the Linux TCP/IP stack.
Table 1. Adjustable Kernel Parameters for TCP/IP stack Performance
| Tunable parameter | Default value | Description |
| --- | --- | --- |
| /proc/sys/net/core/rmem_default | "110592" | Defines the default receive window size; for a larger BDP, the size should be larger. |
| /proc/sys/net/core/rmem_max | "110592" | Defines the maximum receive window size; for a larger BDP, the size should be larger. |
| /proc/sys/net/core/wmem_default | "110592" | Defines the default send window size; for a larger BDP, the size should be larger. |
| /proc/sys/net/core/wmem_max | "110592" | Defines the maximum send window size; for a larger BDP, the size should be larger. |
| /proc/sys/net/ipv4/tcp_window_scaling | "1" | Enables window scaling as defined by RFC 1323; must be enabled to support windows larger than 64 KB. |
| /proc/sys/net/ipv4/tcp_sack | "1" | Enables selective acknowledgment, which improves performance by selectively acknowledging packets received out of order (allowing the sender to retransmit only the missing segments); should be enabled (for WAN communication), but note that it increases CPU utilization. |
| /proc/sys/net/ipv4/tcp_fack | "1" | Enables forward acknowledgment, which operates with selective acknowledgment (SACK) to reduce congestion; should be enabled. |
| /proc/sys/net/ipv4/tcp_timestamps | "1" | Enables a more accurate method of RTT calculation (see RFC 1323); should be enabled for better performance. |
| /proc/sys/net/ipv4/tcp_mem | "24576 32768 49152" | Determines how the TCP stack should behave with respect to memory usage; each count is in memory pages (typically 4 KB). The first value is the low threshold for memory usage. The second value is the threshold at which a memory-pressure mode begins to apply pressure on buffer usage. The third value is the maximum threshold, at which packets can be dropped to reduce memory usage. For a larger BDP, you can increase these values (but remember they are in memory pages, not bytes). |
| /proc/sys/net/ipv4/tcp_wmem | "4096 16384 131072" | Defines per-socket memory usage for auto-tuning. The first value is the minimum number of bytes allocated for the socket's send buffer. The second value is the default (overridden by wmem_default), to which the buffer can grow under non-heavy system load. The third value is the maximum send buffer space (overridden by wmem_max). |
| /proc/sys/net/ipv4/tcp_rmem | "4096 87380 174760" | Same as tcp_wmem, except that it refers to the receive buffers used for auto-tuning. |
| /proc/sys/net/ipv4/tcp_low_latency | "0" | Allows the TCP/IP stack to give preference to low latency over higher throughput; should be disabled. |
| /proc/sys/net/ipv4/tcp_westwood | "0" | Enables a sender-side congestion control algorithm that maintains estimates of throughput and tries to optimize overall bandwidth utilization; should be enabled for WAN communication. |
| /proc/sys/net/ipv4/tcp_bic | "1" | Enables Binary Increase Congestion for fast long-distance networks, permitting better utilization of links operating at gigabit speeds; should be enabled for WAN communication. |
As with any tuning effort, the best approach is experimentation. Your application's behavior, processor speed, and amount of available memory all affect how these parameters alter performance. In some cases, what you think should be beneficial can be detrimental (and vice versa). So test each option one at a time and then check the result. In other words, trust your own experience, but validate every modification.
Tip: A note on persistent configuration: if you reboot a GNU/Linux system, any tunable kernel parameters you changed revert to their defaults. To make the values you chose take effect at every boot, set the parameters in /etc/sysctl.conf.
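For example (the values below are illustrative placeholders, not recommendations), BDP-sized limits could be made persistent with entries such as:

```
# /etc/sysctl.conf -- applied at boot (or immediately with `sysctl -p`)
net.core.rmem_max = 1048576
net.core.wmem_max = 1048576
net.ipv4.tcp_rmem = 4096 87380 1048576
net.ipv4.tcp_wmem = 4096 16384 1048576
```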
GNU/Linux tools

GNU/Linux is attractive to me because of the number of tools that exist for it. Although most are command-line tools, they are both very useful and intuitive. GNU/Linux provides several tools, some native to GNU/Linux and some open source, for debugging network applications, measuring bandwidth and throughput, and checking link utilization.

Table 2 lists the most useful GNU/Linux tools and their purposes. Table 3 lists several useful tools that standard GNU/Linux distributions do not provide. For more information about the tools in Table 3, see References.
Table 2. tools available in any GNU/Linux release
| GNU/Linux tool | Purpose |
| --- | --- |
| ping | The most commonly used tool for checking host availability, but it can also identify the RTT for Bandwidth Delay Product calculations. |
| traceroute | Prints the path (route) of a series of routers and gateways to a network host, identifying the latency of each hop. |
| netstat | Identifies statistics about the networking subsystem, protocols, and connections. |
| tcpdump | Shows protocol-level packet traces for one or more connections, including timing information that you can use to study the packet timing of various protocol services. |
Table 3. Useful performance tools not provided in the GNU/Linux release
| GNU/Linux tool | Purpose |
| --- | --- |
| netlog | Provides applications with network performance information. |
| nettimer | Generates a metric for bottleneck link bandwidth; can be used for automatic protocol optimization. |
| Ethereal | Provides the features of tcpdump (packet tracing) in an easy-to-use graphical interface. |
| iperf | Measures TCP and UDP network performance; measures maximum bandwidth and reports latency and datagram loss. |
Try the tips and techniques described in this article to boost the performance of your socket applications: disable the Nagle algorithm to reduce transmission latency, set buffer sizes to improve socket bandwidth utilization, reduce system call overhead by minimizing the number of system calls, and tune the Linux TCP/IP stack with tunable kernel parameters.

Consider the characteristics of your application when tuning. For example, will your application communicate over a LAN or over the Internet? If it operates only within a LAN, increasing the socket buffer sizes may not yield much improvement, but enabling jumbo frames certainly will!

Finally, check the results of your tuning with tcpdump or Ethereal. The changes seen at the packet level help demonstrate the success of these techniques.
References
Learning
- Refer to the original English text of this article.
- The two-part series "Linux socket programming" (developerWorks, October 2003 and January 2004) can help you write socket applications.
- See the Pittsburgh Supercomputing Center's other articles on TCP-friendly congestion control algorithms.
- Increasing the MTU can greatly affect performance. Read more about jumbo frames and their advantages.
- For more information on selective acknowledgment, see the related articles.
- Visit the TCP Westwood home page to learn more about the TCP Westwood algorithm.
- Study North Carolina State University's research on Binary Increase Congestion TCP (BIC-TCP).
- Read the author's book BSD Sockets Programming from a Multilanguage Perspective (Charles River Media, September 2003), which covers techniques for writing sockets programs in six different languages.
- In the developerWorks Linux zone, find more resources for Linux developers.
- Stay current with developerWorks technical events and webcasts.
Obtain products and technologies
- Link the netlog library into your application to facilitate performance analysis.
- Ethereal is a graphical network protocol analyzer that includes a plug-in architecture for protocol analysis.
- See the National Laboratory for Applied Network Research for more information on the iperf tool.
- Use IBM trial software, available for download directly from developerWorks.