"Zero copy" in Linux: sendfile

Source: Internet
Author: User

By now almost everyone has heard of the so-called "zero copy" feature in Linux, but I often run into people who don't fully understand it. I therefore decided to write a few articles that go into a bit more detail, in the hope of explaining this useful feature clearly. This article looks at zero copy from the point of view of a user-space application, so the complex kernel implementation is deliberately left out.

What is "zero copy"?

To better understand the solution to a problem, we first need to understand the problem itself. Take a Web server daemon as an example and consider what is involved in the simple task of sending data stored in a file over the network to a client. Here is some sample code:

read(file, tmp_buf, len);
write(socket, tmp_buf, len);

It could hardly look simpler. You might think that these two system calls carry little overhead. In fact, that is far from the truth. While these two calls execute, the data is copied at least four times, and four user/kernel context switches take place. (The real process is considerably more complex than described here, but I want to keep it simple so that the subject of this article is easier to follow.)


To get a better idea of what these two lines of code involve, look at Figure 1. The top half of the figure shows the context switches, and the bottom half shows the copy operations.
Figure 1. Copying in Two Sample System Calls
Step one: The read system call causes a context switch from user space to kernel space. The DMA engine reads the file contents from disk and stores them in a kernel-space buffer. This is the first copy.

Step two: The data is copied from the kernel buffer into the user-space buffer (the second copy), and the read system call returns, which causes a context switch from kernel space back to user space. The data is now stored in the user-space buffer passed to the call (the tmp_buf parameter), and the program can carry on.

Step three: The write system call causes a context switch from user space to kernel space. The data is copied again, this time from the user-space buffer into a kernel-space buffer. This is the third copy. This time, however, the data lands in a different kernel buffer, one associated with the socket being used, not the buffer from step one.

Step four: The write system call returns, causing the fourth context switch. The fourth copy happens when the DMA engine passes the data from the kernel buffer to the protocol engine, independently of and asynchronously to our code. You may ask, "What do you mean by independently and asynchronously? Isn't the data transmitted before the write call returns?" In fact, the return of the write call does not mean that the transmission succeeded; it does not even guarantee that the transmission has started. It simply means that the Ethernet driver had free slots in its transmit queue and has accepted our data for transmission. There may be a lot of data queued ahead of ours. Unless the driver or hardware implements priority queues, data is transmitted in first-in, first-out order. (The forked DMA copy in Figure 1 illustrates that the last copy can be delayed.)
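The two-line sample is shorthand. As a rough sketch of the same read/write path written out as a complete C function, assuming blocking descriptors: the helper name copy_read_write and the 4 KiB buffer size are illustrative choices and do not come from the article's program.

```c
#include <errno.h>
#include <unistd.h>

/* Copy everything from in_fd to out_fd through a user-space buffer.
 * Returns the number of bytes copied, or -1 on error (errno is set).
 * Hypothetical helper: the name and buffer size are illustrative. */
ssize_t copy_read_write(int in_fd, int out_fd)
{
    char tmp_buf[4096];
    ssize_t total = 0;

    for (;;) {
        /* kernel buffer -> user buffer (the second copy in Figure 1) */
        ssize_t n = read(in_fd, tmp_buf, sizeof tmp_buf);
        if (n == 0)
            return total;               /* end of file */
        if (n < 0) {
            if (errno == EINTR)
                continue;               /* interrupted before any data */
            return -1;
        }

        /* user buffer -> socket buffer (the third copy); write may
         * accept fewer bytes than requested, so keep going */
        ssize_t done = 0;
        while (done < n) {
            ssize_t w = write(out_fd, tmp_buf + done, (size_t)(n - done));
            if (w < 0) {
                if (errno == EINTR)
                    continue;
                return -1;
            }
            done += w;
        }
        total += n;
    }
}
```

Every pass through the loop costs two system calls and two CPU copies, which is exactly the overhead the rest of the article sets out to remove.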

As you can see, a lot of data duplication takes place in this process. Some of it can be eliminated to reduce overhead and improve performance. As a driver developer, I work with hardware that has some fairly advanced features. Some hardware can bypass main memory altogether and transmit data directly to another device. This removes a copy in system memory and is a good thing to have, but not all hardware supports it. There is also the issue that data from the disk has to be repackaged (made contiguous in memory) for the network, which adds some complexity. To reduce overhead, we can start by eliminating the copying between the kernel buffer and the user buffer.

One way to eliminate a copy is to replace the read system call with the mmap system call. In the same shorthand as before (the real mmap call takes more arguments):

tmp_buf = mmap(file, len);
write(socket, tmp_buf, len);

To get a better idea of what this approach involves, look at Figure 2. The context switches are the same as in Figure 1.

Figure 2. Calling mmap
Step one: The mmap system call causes the file contents to be copied into a kernel buffer by the DMA engine. The buffer is then shared with the user process, so no copy between kernel memory and user memory takes place.

Step two: The write system call causes the kernel to copy the data from the original kernel buffer into the kernel buffer associated with the socket.

Step three: The third copy happens when the DMA engine passes the data from the socket buffer to the protocol engine.

By using mmap instead of read, we have cut in half the amount of data the kernel has to copy. This pays off when a lot of data is being transmitted. The improvement comes at a price, however: the combination of mmap and write has hidden pitfalls. For example, suppose you have mapped a file and call write while another process truncates the same file. The write system call is interrupted by a SIGBUS signal, because the process has performed an invalid memory access. The default action for SIGBUS is to kill the process and dump core, which is hardly desirable behavior for a Web server.
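The mmap shorthand shown earlier leaves out most of the real call's arguments. The following is a minimal sketch of the mmap-plus-write path as a complete function. The name send_mapped is hypothetical, the whole file is mapped in one piece, and the SIGBUS problem just described is deliberately left unhandled.

```c
#include <errno.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

/* Send the whole file behind in_fd to out_fd by mapping it and writing
 * straight from the mapping. Returns bytes written, or -1 on error.
 * Hypothetical helper; SIGBUS on concurrent truncation is NOT handled. */
ssize_t send_mapped(int in_fd, int out_fd)
{
    struct stat st;
    if (fstat(in_fd, &st) < 0)
        return -1;
    if (st.st_size == 0)
        return 0;                   /* mmap rejects zero-length mappings */

    size_t len = (size_t)st.st_size;
    char *tmp_buf = mmap(NULL, len, PROT_READ, MAP_PRIVATE, in_fd, 0);
    if (tmp_buf == MAP_FAILED)
        return -1;

    size_t done = 0;
    while (done < len) {
        /* the kernel copies from the shared page-cache pages into the
         * socket buffer; no user-space copy is made */
        ssize_t w = write(out_fd, tmp_buf + done, len - done);
        if (w < 0) {
            if (errno == EINTR)
                continue;
            munmap(tmp_buf, len);
            return -1;
        }
        done += (size_t)w;
    }
    munmap(tmp_buf, len);
    return (ssize_t)done;
}
```

Mapping with PROT_READ and MAP_PRIVATE is enough here because the data is only read; for very large files a real program would probably map and send one window at a time.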

There are two ways to resolve this problem:

The first way is to install a signal handler for SIGBUS and simply return from the handler. The write system call then returns the number of bytes it wrote before it was interrupted, with errno set to success. I must point out that this is a bad solution: it treats the symptom, not the root cause of the problem. I do not encourage it, because receiving SIGBUS means that something has gone seriously wrong in the process.

The second way uses file leasing (known as an "opportunistic lock" on Microsoft Windows). This is the correct way to fix the problem described above. By taking a lease on the file descriptor, you enter into a lease with the kernel on a particular file. You can request either a read lease or a write lease. When another process tries to truncate the file you are transmitting, the kernel sends your process a real-time signal, RT_SIGNAL_LEASE. The signal tells your process that the kernel is about to break the lease you hold on the file. The write system call is therefore interrupted by RT_SIGNAL_LEASE before it touches an invalid address and gets killed by the SIGBUS that would follow. The return value of write is the number of bytes written before the interruption, and errno is set to success. Here is some sample code that shows how to obtain a lease from the kernel.

/* RT_SIGNAL_LEASE is assumed to be a real-time signal number
   chosen by the application */
if (fcntl(fd, F_SETSIG, RT_SIGNAL_LEASE) == -1) {
    perror("kernel lease set signal");
    return -1;
}

/* l_type can be F_RDLCK or F_WRLCK */
if (fcntl(fd, F_SETLEASE, l_type)) {
    perror("kernel lease set type");
    return -1;
}

Sendfile

The sendfile system call was introduced in kernel version 2.1 to simplify sending data over the network and between two local files. Introducing sendfile not only reduces data copying, it also reduces the number of context switches. It is used like this:

sendfile(socket, file, len);

To get a better idea of what is involved, look at Figure 3.

Figure 3. Replacing Read and Write with Sendfile
Step one: The sendfile system call causes the file contents to be copied into a kernel buffer by the DMA engine. The kernel then copies the data into the kernel buffer associated with the socket.

Step two: The third copy happens when the DMA engine passes the data from the socket buffer to the protocol engine.
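The one-line form above is again shorthand: the Linux call also takes a pointer to an offset, and a single call may send less than requested. A minimal sketch of a loop that sends a whole file follows. The helper name send_whole_file is hypothetical, and the sketch assumes a kernel that accepts the given output descriptor (older kernels require it to be a socket).

```c
#include <errno.h>
#include <sys/sendfile.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>

/* Send the whole file behind in_fd to out_fd with sendfile.
 * Returns the number of bytes sent, or -1 on error.
 * Hypothetical helper name, not part of the article's program. */
ssize_t send_whole_file(int out_fd, int in_fd)
{
    struct stat st;
    if (fstat(in_fd, &st) < 0)
        return -1;

    off_t offset = 0;               /* sendfile advances this for us */
    while (offset < st.st_size) {
        ssize_t n = sendfile(out_fd, in_fd, &offset,
                             (size_t)(st.st_size - offset));
        if (n < 0) {
            if (errno == EINTR)
                continue;
            return -1;
        }
        if (n == 0)
            break;                  /* file shrank underneath us */
    }
    return (ssize_t)offset;
}
```

Compared with the read/write version, there is no user-space buffer at all and only one system call per chunk.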

You might wonder what happens if another process truncates the file while we are sending it with sendfile. If the process has not registered a signal handler, the sendfile system call simply returns the number of bytes it sent before it was interrupted, with errno set to success.

If, however, we obtain a lease on the file from the kernel before calling sendfile, the behavior and the return status are the same, and in addition we receive the RT_SIGNAL_LEASE signal before the sendfile call returns.

So far, we have been able to avoid several copies inside the kernel, but one duplicate copy remains. Can that one be avoided as well? Yes, with a little help from the hardware. To eliminate all the data duplication done by the kernel, the network interface must support gather operations. This means that the data waiting to be sent does not have to sit in contiguous memory; it can be scattered across several memory locations. In kernel version 2.4, the socket buffer descriptor was modified to accommodate gather operations, and this is what Linux calls "zero copy". This approach not only reduces the number of context switches, it also eliminates data duplication done by the processor. For user-level applications nothing has changed, and the code still looks like this:

sendfile(socket, file, len);

To get a better idea of what is involved, look at Figure 4.

Figure 4. Hardware that supports gather can assemble data from multiple memory locations, eliminating another copy.
Step one: The sendfile system call causes the file contents to be copied into a kernel buffer by the DMA engine.

Step two: No data is copied into the socket buffer. Instead, only descriptors that record the location and length of the data are appended to the socket buffer. The DMA engine passes the data directly from the kernel buffer to the protocol engine, which eliminates the last remaining copy.
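Gather DMA happens inside the kernel and the network card, so it cannot be shown directly from user space. As a loose analogy only, the writev system call lets an application hand the kernel a list of (address, length) descriptors instead of one contiguous buffer, much as the socket buffer holds descriptors rather than data. The helper name write_gathered is hypothetical.

```c
#include <string.h>
#include <sys/uio.h>
#include <unistd.h>

/* Write a header and a body that live in separate buffers with a single
 * call, without first joining them into one contiguous buffer.
 * Returns bytes written or -1. Hypothetical helper for illustration. */
ssize_t write_gathered(int fd, const char *header, const char *body)
{
    struct iovec iov[2];

    /* each entry describes where a piece lives and how long it is */
    iov[0].iov_base = (void *)header;   /* writev only reads from these */
    iov[0].iov_len  = strlen(header);
    iov[1].iov_base = (void *)body;
    iov[1].iov_len  = strlen(body);

    return writev(fd, iov, 2);
}
```

The point of the analogy is the descriptor list: the pieces stay where they are, and whoever consumes the list collects them in order.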

Because the data is still copied from disk to memory and from memory to the wire, some might argue that this is not true zero copy. From the operating system's point of view, however, it is zero copy, because no data is duplicated between kernel buffers. Besides avoiding copies, zero copy brings other performance benefits, such as fewer context switches, less CPU cache pollution and no checksum calculations on the CPU.

Now that we know what zero copy is, let's put the theory into practice and write some code. You can download the complete source code from www.xalien.org/articles/source/sfl-src.tgz. Run "tar -zxvf sfl-src.tgz" to unpack it, then run make to compile the code and create the random data file data.bin.

The code starts with the header files:

/* sfl.c sendfile example program
Dragan Stancevic <
header name                 function / variable
-------------------------------------------------*/
#include <stdio.h>          /* printf, perror */
#include <stdlib.h>         /* exit */
#include <fcntl.h>          /* open */
#include <unistd.h>         /* close */
#include <errno.h>          /* errno */
#include <string.h>         /* memset */
#include <sys/socket.h>     /* socket */
#include <netinet/in.h>     /* sockaddr_in */
#include <sys/sendfile.h>   /* sendfile */
#include <arpa/inet.h>      /* inet_addr */
#define BUFF_SIZE (10*1024) /* size of the tmp buffer */

Besides the regular <sys/socket.h> and <netinet/in.h> headers required for basic socket operations, we also need the prototype of the sendfile system call, which is found in the <sys/sendfile.h> header file.

Server flag:

/* are we sending or receiving */
if (argv[1][0] == 's') is_server++;
/* open descriptors */
sd = socket(PF_INET, SOCK_STREAM, 0);
if (is_server) fd = open("data.bin", O_RDONLY);

The program can run either as a server/sender or as a client/receiver. We check one of the command-line arguments and set the is_server flag accordingly. The program opens a stream socket of the PF_INET address family. When it runs as the server it has to send data to the client, so it also opens the data file. Because the program sends the data with the sendfile system call, there is no need to read the file contents into a buffer in the program.

Next is the server address:

/* clear the memory */
memset(&sa, 0, sizeof(struct sockaddr_in));
/* initialize structure */
sa.sin_family = PF_INET;
sa.sin_port = htons(1033);
sa.sin_addr.s_addr = inet_addr(argv[2]);

After clearing the server address structure, we set the protocol family, port and IP address. The server's IP address is passed to the program as a command-line argument. The port number is hard-coded to 1033, chosen because it lies above the range of ports that require root privileges.
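The address setup can also be packaged as a small function that validates its input. This is only a sketch with a hypothetical name, make_server_addr, and it swaps in inet_pton where the article's program uses inet_addr, because inet_addr reports errors with a value that is also a valid address (255.255.255.255).

```c
#include <arpa/inet.h>
#include <netinet/in.h>
#include <stdint.h>
#include <string.h>

/* Fill *sa for the given dotted-quad address and port.
 * Returns 0 on success, -1 if ip is not a valid IPv4 address.
 * Hypothetical helper; the article's program uses inet_addr instead. */
int make_server_addr(struct sockaddr_in *sa, const char *ip, uint16_t port)
{
    memset(sa, 0, sizeof *sa);          /* clear the memory */
    sa->sin_family = AF_INET;
    sa->sin_port = htons(port);         /* network byte order */

    /* inet_pton returns 1 on success and 0 for a malformed address */
    return inet_pton(AF_INET, ip, &sa->sin_addr) == 1 ? 0 : -1;
}
```

A caller would use it as make_server_addr(&sa, argv[2], 1033) and stop with an error message if it returns -1.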

The following is the server branch of the code:

if (is_server) {
    int client; /* new client socket */
    printf("Server binding to [%s]\n", argv[2]);
    if (bind(sd, (struct sockaddr *)&sa, sizeof(sa)) < 0) {
        perror("bind");
        exit(errno);
    }

As a server, we need to assign an address to the socket descriptor. This is done with the bind system call, which assigns the server address (sa) to the socket descriptor (sd).

    if (listen(sd, 1) < 0) {
        perror("listen");
        exit(errno);
    }

Because we are using a stream socket, we have to tell the kernel that we are willing to accept incoming connection requests and set the size of the connection queue. The queue length is set to 1 here, but it is usually set higher to hold established connections that are waiting to be accepted. In older kernel versions, the backlog was used to guard against SYN flood attacks. Since the listen system call changed to count only established connections, that use of the backlog has been deprecated. The kernel parameter tcp_max_syn_backlog has taken over the job of protecting the system from SYN flood attacks.

    /* the extra parentheses matter: without them client would be
       assigned the result of the comparison */
    if ((client = accept(sd, NULL, NULL)) < 0) {
        perror("accept");
        exit(errno);
    }

The accept system call takes the first connection request from the queue of pending connections and creates a new socket for it. The return value of accept is the descriptor of the newly created connection, and the new socket can be used with the read, write and poll/select system calls.

    if ((cnt = sendfile(client, fd, &off, BUFF_SIZE)) < 0) {
        perror("sendfile");
        exit(errno);
    }

    printf("Server sent %d bytes.\n", cnt);
    close(client);

The connection is now established on the client socket descriptor, so we can start sending data to the remote system by calling the sendfile system call. Its prototype on Linux is as follows:

extern ssize_t
sendfile(int __out_fd, int __in_fd, off_t *offset, size_t __count) __THROW;

The first two parameters are file descriptors. The third parameter points to the offset from which sendfile should start sending data, and the fourth is the number of bytes we want to transmit. For sendfile to use zero copy, the network card needs to support gather operations as well as checksum calculation. If your NIC lacks these features, you can still use sendfile to send data; the difference is that the kernel merges the buffers before transmitting them.
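The offset parameter is easy to misread, so here is a small sketch of its behavior, assuming the Linux semantics described in the sendfile manual page: when offset is not NULL, the kernel reads from that position, updates the variable to point just past the last byte sent, and leaves the file position of the input descriptor alone. The helper name send_slice is hypothetical.

```c
#include <sys/sendfile.h>
#include <sys/types.h>
#include <unistd.h>

/* Send up to count bytes of in_fd, starting at byte `start`.
 * On success *next holds the offset just past the last byte sent.
 * Hypothetical helper; one sendfile call may send fewer bytes than
 * requested, so callers should compare the result with count. */
ssize_t send_slice(int out_fd, int in_fd, off_t start, size_t count,
                   off_t *next)
{
    off_t offset = start;
    ssize_t n = sendfile(out_fd, in_fd, &offset, count);

    if (n >= 0)
        *next = offset;     /* the kernel advanced offset by n bytes */
    return n;
}
```

Because the position is carried in the offset variable rather than in the descriptor, several threads can send different ranges of the same open file without interfering with each other.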

Portability issues

One of the problems with the sendfile system call in general is the lack of a standard implementation of the kind that exists for, say, the open system call. The sendfile implementations in Linux, Solaris and HP-UX are quite different. This poses a problem for developers who want to use zero copy in their network data transmission code.

One of the differences is that Linux provides a sendfile that defines an interface for transmitting data between two file descriptors (file to file) as well as from a file to a socket. In HP-UX and Solaris, on the other hand, sendfile can be used only for file-to-socket transmissions.

The second difference is that Linux does not implement vectored transfers. The sendfile calls in Solaris and HP-UX take extra parameters that eliminate the overhead of prepending headers to the data being transmitted.

Looking ahead

The implementation of zero copy under Linux is far from finished and is likely to change in the near future. More functionality should be added. For example, the current sendfile does not support vectored transfers, so servers such as Samba and Apache have to issue multiple sendfile calls with the TCP_CORK flag set. This flag tells the system that more data is coming in the next sendfile calls. TCP_CORK is also incompatible with TCP_NODELAY and is used when we want to prepend or append headers to the data. This is a perfect example of where a vectored sendfile would eliminate the multiple sendfile calls and the delays imposed by the current implementation.

One rather unpleasant limitation of the current sendfile is that it cannot be used to transfer files larger than 2GB. Files of that size are not all that uncommon today, and it is rather disappointing to have to duplicate all that data on its way out. Because neither sendfile nor mmap can be used in this case, a sendfile64 in a future kernel version would be a great help.

Conclusion

Despite some drawbacks, zero-copy sendfile is a useful feature. I hope this article has given you enough information to start using sendfile in your own programs. If you want to dig deeper into the subject, look out for my second article, "Zero Copy II: Kernel Perspective", in which I will go further into the kernel internals of zero copy.


Http://t.matong.me/2011/03/29/zero-copy-linux.html
