Zero Copy I: User-Mode Perspective

Explaining what zero-copy functionality is for Linux, why it's useful and where it needs work.

By now almost everyone has heard of so-called zero-copy functionality under Linux, but I often run into people who don't have a full understanding of the subject. Because of this, I decided to write a few articles that dig into the matter a bit deeper, in the hope of unraveling this useful feature. In this article, we take a look at zero copy from a user-mode application point of view, so gory kernel-level details are omitted intentionally.

What Is Zero-Copy?

To better understand the solution to a problem, we first need to understand the problem itself. Let's look at what is involved in the simple procedure of a network server daemon serving data stored in a file to a client over the network. Here's some sample code:

read(file, tmp_buf, len);
write(socket, tmp_buf, len);

Looks simple enough; you would think there is not much overhead with only those two system calls. In reality, this couldn't be further from the truth. Behind those two calls, the data has been copied at least four times, and almost as many user/kernel context switches have been performed. (Actually this process is much more complicated, but I wanted to keep it simple.) To get a better idea of the process involved, take a look at Figure 1. The top side shows context switches, and the bottom side shows copy operations.

Figure 1. Copying in Two Sample System Calls

Step one: the read system call causes a context switch from user mode to kernel mode. The first copy is performed by the DMA engine, which reads file contents from the disk and stores them into a kernel address space buffer.

Step two: data is copied from the kernel buffer into the user buffer, and the read system call returns. The return from the call causes a context switch from kernel back to user mode. Now the data is stored in the user address space buffer, and it can begin its way down again.

Step three: the write system call causes a context switch from user mode to kernel mode. A third copy is performed to put the data into a kernel address space buffer again. This time, though, the data is put into a different buffer, one that is associated with sockets specifically.

Step four: the write system call returns, creating our fourth context switch. Independently and asynchronously, a fourth copy happens as the DMA engine passes the data from the kernel buffer to the protocol engine. You are probably asking yourself, "What do you mean independently and asynchronously? Wasn't the data transmitted before the call returned?" Call return, in fact, doesn't guarantee transmission; it doesn't even guarantee the start of the transmission. It simply means the Ethernet driver had free descriptors in its queue and has accepted our data for transmission. There could be numerous packets queued before ours. Unless the driver/hardware implements priority rings or queues, data is transmitted on a first-in-first-out basis. (The forked DMA copy in Figure 1 illustrates the fact that the last copy can be delayed.)
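
Written out in full, the two-call pattern could look like the sketch below. The function name copy_read_write and the 4KB buffer size are illustrative choices, not part of the original sample; the comments map each call back to the steps above.

```c
#include <errno.h>
#include <unistd.h>

#define TMP_BUF_SIZE 4096   /* hypothetical size of the user-space buffer */

/* Copy everything from in_fd to out_fd through a user-space buffer.
 * Returns the number of bytes copied, or -1 on error. */
ssize_t copy_read_write(int in_fd, int out_fd)
{
    char tmp_buf[TMP_BUF_SIZE];
    ssize_t total = 0;

    for (;;) {
        /* steps one and two: disk -> kernel buffer -> tmp_buf */
        ssize_t n = read(in_fd, tmp_buf, sizeof(tmp_buf));
        if (n == 0)
            break;                      /* end of file */
        if (n < 0) {
            if (errno == EINTR)
                continue;               /* interrupted, try again */
            return -1;
        }

        /* steps three and four: tmp_buf -> socket buffer -> wire;
         * write may accept fewer bytes than requested */
        ssize_t done = 0;
        while (done < n) {
            ssize_t w = write(out_fd, tmp_buf + done, (size_t)(n - done));
            if (w < 0) {
                if (errno == EINTR)
                    continue;
                return -1;
            }
            done += w;
        }
        total += n;
    }
    return total;
}
```

Every pass through the loop moves the same bytes across the user/kernel boundary twice, which is the overhead the rest of the article tries to remove.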

As you can see, a lot of data duplication is not really necessary to hold things up. Some of the duplication could be eliminated to decrease overhead and increase performance. As a driver developer, I work with hardware that has some pretty advanced features. Some hardware can bypass the main memory altogether and transmit data directly to another device. This feature eliminates a copy in the system memory and is a nice thing to have, but not all hardware supports it. There is also the issue of the data from the disk having to be repackaged for the network, which introduces some complications. To eliminate overhead, we could start by eliminating some of the copying between the kernel and user buffers.

One way to eliminate a copy is to skip calling read and instead call mmap. For example:

tmp_buf = mmap(file, len);
write(socket, tmp_buf, len);

To get a better idea of the process involved, take a look at Figure 2. Context switches remain the same.

Figure 2. Calling mmap

Step one: the mmap system call causes the file contents to be copied into a kernel buffer by the DMA engine. The buffer is then shared with the user process, without any copy being performed between the kernel and user memory spaces.

Step two: the write system call causes the kernel to copy the data from the original kernel buffers into the kernel buffers associated with sockets.

Step three: the third copy happens as the DMA engine passes the data from the kernel socket buffers to the protocol engine.
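
The two-line mmap example is shorthand; the real mmap call takes more arguments. A minimal sketch of the same idea follows, assuming a hypothetical helper named send_mmap_write that takes a path so it can size the mapping with fstat. It does nothing to guard against the file changing while the write is in progress.

```c
#include <errno.h>
#include <fcntl.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

/* Map the file at path and write its contents to out_fd.
 * Returns the number of bytes written, or -1 on error. */
ssize_t send_mmap_write(int out_fd, const char *path)
{
    int fd = open(path, O_RDONLY);
    if (fd < 0)
        return -1;

    struct stat st;
    if (fstat(fd, &st) < 0) {
        close(fd);
        return -1;
    }
    size_t len = (size_t)st.st_size;
    if (len == 0) {                     /* mmap rejects a zero length */
        close(fd);
        return 0;
    }

    /* share the kernel's pages with the process: no copy to user space */
    char *tmp_buf = mmap(NULL, len, PROT_READ, MAP_SHARED, fd, 0);
    close(fd);                          /* the mapping stays valid */
    if (tmp_buf == MAP_FAILED)
        return -1;

    size_t done = 0;
    while (done < len) {
        /* the kernel copies from the mapped pages into the socket buffer */
        ssize_t w = write(out_fd, tmp_buf + done, len - done);
        if (w < 0) {
            if (errno == EINTR)
                continue;
            munmap(tmp_buf, len);
            return -1;
        }
        done += (size_t)w;
    }
    munmap(tmp_buf, len);
    return (ssize_t)done;
}
```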

By using mmap instead of read, we've cut in half the amount of data the kernel has to copy. This yields reasonably good results when a lot of data is being transmitted. However, this improvement doesn't come without a price; there are hidden pitfalls when using the mmap+write method. You will fall into one of them when you memory map a file and then call write while another process truncates the same file. Your write system call will be interrupted by the bus error signal SIGBUS, because you performed a bad memory access. The default behavior for that signal is to kill the process and dump core, which is not the most desirable operation for a network server. There are two ways to get around this problem.

The first way is to install a signal handler for the SIGBUS signal, and then simply call return in the handler. By doing this the write system call returns with the number of bytes it wrote before it got interrupted and the errno set to success. Let me point out that this would be a bad solution, one that treats the symptoms and not the cause of the problem. Because SIGBUS signals that something has gone seriously wrong with the process, I would discourage using this as a solution.

The second solution involves file leasing (which is called "opportunistic locking" in Microsoft Windows) from the kernel. This is the correct way to fix this problem. By using leasing on the file descriptor, you take a lease with the kernel on a particular file. You then can request a read/write lease from the kernel. When another process tries to truncate the file you are transmitting, the kernel sends you a real-time signal, the RT_SIGNAL_LEASE signal. It tells you the kernel is breaking your write or read lease on that file. Your write call is interrupted before your program accesses an invalid address and gets killed by the SIGBUS signal. The return value of the write call is the number of bytes written before the interruption, and the errno will be set to success. Here is some sample code that shows how to get a lease from the kernel:

if (fcntl(fd, F_SETSIG, RT_SIGNAL_LEASE) == -1) {
    perror("kernel lease set signal");
    return -1;
}
/* l_type can be F_RDLCK or F_WRLCK */
if (fcntl(fd, F_SETLEASE, l_type)) {
    perror("kernel lease set type");
    return -1;
}
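
For reference, the two calls above can be folded into a pair of helpers, with the release step included. This is only a sketch under stated assumptions: RT_SIGNAL_LEASE is not a system constant, so it is assumed here to be a real-time signal the program picks for itself (SIGRTMIN + 1), and lease_acquire and lease_release are hypothetical names.

```c
#define _GNU_SOURCE         /* F_SETSIG and F_SETLEASE are Linux extensions */
#include <fcntl.h>
#include <signal.h>
#include <stdio.h>

/* Fallbacks (Linux ABI values) in case _GNU_SOURCE took effect too late. */
#ifndef F_SETSIG
#define F_SETSIG 10
#endif
#ifndef F_SETLEASE
#define F_SETLEASE 1024
#endif

/* Assumption: the program chooses its own real-time signal for lease breaks. */
#define RT_SIGNAL_LEASE (SIGRTMIN + 1)

/* Take a lease of l_type (F_RDLCK or F_WRLCK) on fd.
 * Returns 0 on success, -1 on error. */
int lease_acquire(int fd, int l_type)
{
    /* ask for RT_SIGNAL_LEASE instead of the default SIGIO */
    if (fcntl(fd, F_SETSIG, RT_SIGNAL_LEASE) == -1) {
        perror("kernel lease set signal");
        return -1;
    }
    if (fcntl(fd, F_SETLEASE, l_type) == -1) {
        perror("kernel lease set type");
        return -1;
    }
    return 0;
}

/* Give the lease back once the transfer is finished. */
int lease_release(int fd)
{
    if (fcntl(fd, F_SETLEASE, F_UNLCK) == -1) {
        perror("kernel lease release");
        return -1;
    }
    return 0;
}
```

A program using these helpers would also install a handler for RT_SIGNAL_LEASE with sigaction before taking the lease, so that the lease break does not terminate the process.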

You should get your lease before mmaping the file, and break your lease after you are done. This is achieved by calling fcntl F_SETLEASE with the lease type of F_UNLCK.

Sendfile

In kernel version 2.1, the sendfile system call was introduced to simplify the transmission of data over the network and between two local files. The introduction of sendfile not only reduces data copying, it also reduces context switches. Use it like this:

sendfile(socket, file, len);

To get a better idea of the process involved, take a look at Figure 3.

Figure 3. Replacing Read and Write with Sendfile

Step one: the sendfile system call causes the file contents to be copied into a kernel buffer by the DMA engine. Then the data is copied by the kernel into the kernel buffer associated with sockets.

Step two: the third copy happens as the DMA engine passes the data from the kernel socket buffers to the protocol engine.

You are probably wondering what happens if another process truncates the file we are transmitting with the sendfile system call. If we don't register any signal handlers, the sendfile call simply returns with the number of bytes it transferred before it got interrupted, and the errno will be set to success.

If we get a lease from the kernel on the file before we call sendfile, however, the behavior and the return status are exactly the same. We also get the RT_SIGNAL_LEASE signal before the sendfile call returns.
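
Because sendfile reports how many bytes it actually moved, a caller has to check that count rather than assume the whole file went out. The following is a minimal sketch of such a check; send_whole_file is a hypothetical helper, and it uses the four-argument Linux form of the call.

```c
#include <errno.h>
#include <fcntl.h>
#include <sys/sendfile.h>
#include <sys/stat.h>
#include <unistd.h>

/* Send the file at path to out_fd with sendfile.
 * Returns the number of bytes actually sent, or -1 on error. The result
 * is smaller than the original size if the file shrank in the meantime. */
ssize_t send_whole_file(int out_fd, const char *path)
{
    int fd = open(path, O_RDONLY);
    if (fd < 0)
        return -1;

    struct stat st;
    if (fstat(fd, &st) < 0) {
        close(fd);
        return -1;
    }

    off_t off = 0;                      /* advanced by sendfile itself */
    while (off < st.st_size) {
        ssize_t n = sendfile(out_fd, fd, &off, (size_t)(st.st_size - off));
        if (n == 0)
            break;                      /* nothing left: file got shorter */
        if (n < 0) {
            if (errno == EINTR)
                continue;               /* interrupted, try again */
            close(fd);
            return -1;
        }
    }
    close(fd);
    return (ssize_t)off;
}
```

Comparing the return value with the size the caller expected is enough to notice a truncated transfer.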

So far, we have been able to avoid having the kernel make several copies, but we are still left with one copy. Can that be avoided too? Absolutely, with a little help from the hardware. To eliminate all the data duplication done by the kernel, we need a network interface that supports gather operations. This simply means that data awaiting transmission doesn't need to be in consecutive memory; it can be scattered through various memory locations. In kernel version 2.4, the socket buffer descriptor was modified to accommodate those requirements, which is what is known as zero copy under Linux. This approach not only reduces multiple context switches, it also eliminates data duplication done by the processor. For user-level applications nothing has changed, so the code still looks like this:

sendfile(socket, file, len);

To get a better idea of the process involved, take a look at Figure 4.

Figure 4. Hardware that supports gather can assemble data from multiple memory locations, eliminating another copy.

Step one: the sendfile system call causes the file contents to be copied into a kernel buffer by the DMA engine.

Step two: no data is copied into the socket buffer. Instead, only descriptors with information about the whereabouts and length of the data are appended to the socket buffer. The DMA engine passes data directly from the kernel buffer to the protocol engine, thus eliminating the remaining final copy.

Because data still is actually copied from the disk to the memory and from the memory to the wire, some might argue this is not a true zero copy. This is zero copy from the operating system standpoint, though, because the data is not duplicated between kernel buffers. When using zero copy, other performance benefits can be had besides copy avoidance, such as fewer context switches, less CPU data cache pollution and no CPU checksum calculations.

Now that we know what zero copy is, let's put theory into practice and write some code. You can download the full source code from www.xalien.org/articles/source/sfl-src.tgz. To unpack the source code, type tar -zxvf sfl-src.tgz at the prompt. To compile the code and create the random data file data.bin, run make.

Looking at the code starting with header files:

/* sfl.c sendfile example program
Dragan Stancevic <
header name                 function / variable
-------------------------------------------------*/
#include <stdio.h>          /* printf, perror */
#include <stdlib.h>         /* exit */
#include <fcntl.h>          /* open */
#include <unistd.h>         /* close */
#include <errno.h>          /* errno */
#include <string.h>         /* memset */
#include <sys/socket.h>     /* socket */
#include <netinet/in.h>     /* sockaddr_in */
#include <sys/sendfile.h>   /* sendfile */
#include <arpa/inet.h>      /* inet_addr */
#define BUFF_SIZE (10*1024) /* size of the tmp
                               buffer */

Besides the regular <sys/socket.h> and <netinet/in.h> required for basic socket operation, we need a prototype definition of the sendfile system call. This can be found in <sys/sendfile.h>. Next comes the server flag:

/* are we sending or receiving */
if (argv[1][0] == 's') is_server++;
/* open descriptors */
sd = socket(PF_INET, SOCK_STREAM, 0);
if (is_server) fd = open("data.bin", O_RDONLY);

The same program can act as either a server/sender or a client/receiver. We have to check one of the command-prompt parameters, and then set the flag is_server to run in sender mode. We also open a stream socket of the INET protocol family. As part of running in server mode we need some type of data to transmit to a client, so we open our data file. We are using the system call sendfile to transmit data, so we don't have to read the actual contents of the file and store it in our program memory buffer. Here's the server address:

/* clear the memory */
memset(&sa, 0, sizeof(struct sockaddr_in));
/* initialize structure */
sa.sin_family = PF_INET;
sa.sin_port = htons(1033);
sa.sin_addr.s_addr = inet_addr(argv[2]);

We clear the server address structure and assign the protocol family, port and IP address of the server. The address of the server is passed as a command-line parameter. The port number is hard coded to unassigned port 1033. This port number was chosen because it is above the port range requiring root access to the system.
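
The listing takes argv[2] on trust. As a hedged variation, not the article's code, the same initialization can be wrapped in a helper that rejects a malformed address; this sketch swaps inet_addr for inet_pton, which reports bad input through its return value, and the name fill_server_addr is hypothetical.

```c
#include <arpa/inet.h>
#include <netinet/in.h>
#include <stdint.h>
#include <string.h>
#include <sys/socket.h>

/* Fill sa for the dotted-quad string ip and the host-order port.
 * Returns 0 on success, -1 if ip is not a valid IPv4 address. */
int fill_server_addr(struct sockaddr_in *sa, const char *ip, uint16_t port)
{
    memset(sa, 0, sizeof(*sa));         /* clear the memory */
    sa->sin_family = AF_INET;
    sa->sin_port = htons(port);         /* network byte order */

    /* inet_pton returns 1 on success and 0 for a malformed address */
    if (inet_pton(AF_INET, ip, &sa->sin_addr) != 1)
        return -1;
    return 0;
}
```

With this helper the server would call fill_server_addr(&sa, argv[2], 1033) and exit with a message when it returns -1.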

This is the server execution branch:

if (is_server) {
    int client; /* new client socket */
    printf("Server binding to [%s]\n", argv[2]);
    if (bind(sd, (struct sockaddr *)&sa,
                      sizeof(sa)) < 0) {
        perror("bind");
        exit(errno);
    }

As a server, we need to assign an address to our socket descriptor. This is achieved by the system call bind, which assigns the socket descriptor (sd) a server address (sa):

if (listen(sd, 1) < 0) {
    perror("listen");
    exit(errno);
}

Because we are using a stream socket, we have to advertise our willingness to accept incoming connections and set the connection queue size. I've set the backlog queue to 1, but it is common to set the backlog a bit higher for established connections waiting to be accepted. In older versions of the kernel, the backlog queue was used to prevent SYN flood attacks. Because the system call listen changed to set parameters for only established connections, the backlog queue feature has been deprecated for this call. The kernel parameter tcp_max_syn_backlog has taken over the role of protecting the system from SYN flood attacks:

if ((client = accept(sd, NULL, NULL)) < 0) {
    perror("accept");
    exit(errno);
}

The system call accept creates a new connected socket from the first connection request on the pending connections queue. The return value from the call is a descriptor for a newly created connection; the socket is now ready for read, write or poll/select system calls:

if ((cnt = sendfile(client, fd, &off,
                          BUFF_SIZE)) < 0) {
    perror("sendfile");
    exit(errno);
}
printf("Server sent %d bytes.\n", cnt);
close(client);

A connection is established on the client socket descriptor, so we can start transmitting data to the remote system. We do this by calling the sendfile system call, which is prototyped under Linux in the following manner:

extern ssize_t
sendfile (int __out_fd, int __in_fd, off_t *offset,
          size_t __count) __THROW;
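
The offset pointer in this prototype is both an input and an output: sendfile starts reading at *offset and, on return, leaves it pointing just past the last byte it read. A small sketch of that behavior follows; send_slice and its end parameter are hypothetical names chosen for illustration.

```c
#include <fcntl.h>
#include <sys/sendfile.h>
#include <unistd.h>

/* Send up to count bytes of the file at path, starting at byte start.
 * Returns the bytes sent and stores the updated offset in *end,
 * or returns -1 on error. */
ssize_t send_slice(int out_fd, const char *path, off_t start,
                   size_t count, off_t *end)
{
    int fd = open(path, O_RDONLY);
    if (fd < 0)
        return -1;

    off_t offset = start;               /* third parameter: where to begin */
    ssize_t n = sendfile(out_fd, fd, &offset, count);
    close(fd);
    if (n < 0)
        return -1;

    *end = offset;                      /* sendfile advanced it by n bytes */
    return n;
}
```

A server can feed the value stored in *end back in as the next start to resume a transfer that stopped short.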

The first two parameters are file descriptors. The third parameter points to an offset from which sendfile should start sending data. The fourth parameter is the number of bytes we want to transmit. In order for the sendfile transmit to use zero-copy functionality, you need memory gather operation support from your networking card. You also need checksum capabilities for protocols that implement checksums, such as TCP or UDP. If your NIC is outdated and doesn't support those features, you still can use sendfile to transmit files. The difference is the kernel will merge the buffers before transmitting them.

Portability Issues

One of the problems with the sendfile system call, in general, is the lack of a standard implementation, as there is for the open system call. Sendfile implementations in Linux, Solaris or HP-UX are quite different. This poses a problem for developers who wish to use zero copy in their network data transmission code.

One of the implementation differences is Linux provides a sendfile that defines an interface for transmitting data between two file descriptors (file-to-file) and between a file descriptor and a socket (file-to-socket). HP-UX and Solaris, on the other hand, can be used only for file-to-socket transmissions.

The second difference is Linux doesn't implement vectored transfers. Solaris sendfile and HP-UX sendfile have extra parameters that eliminate overhead associated with prepending headers to the data being transmitted.

Looking Ahead

The implementation of zero copy under Linux is far from finished and is likely to change in the near future. More functionality should be added. For example, the sendfile call doesn't support vectored transfers, and servers such as Samba and Apache have to use multiple sendfile calls with the TCP_CORK flag set. This flag tells the system more data is coming through in the next sendfile calls. TCP_CORK also is incompatible with TCP_NODELAY and is used when we want to prepend or append headers to the data. This is a perfect example of where a vectored call would eliminate the need for multiple sendfile calls and delays mandated by the current implementation.
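
To make the TCP_CORK pattern concrete, here is a rough sketch of sending a header followed by a file body on a connected TCP socket. It is an illustration of the general technique, not the code Samba or Apache use; send_header_and_file is a hypothetical name, and error handling is kept to the minimum (an interrupted call is simply treated as a failure).

```c
#include <fcntl.h>
#include <netinet/in.h>
#include <netinet/tcp.h>
#include <sys/sendfile.h>
#include <sys/socket.h>
#include <sys/stat.h>
#include <unistd.h>

/* Send hdr followed by the file at path over the TCP socket sock.
 * Returns the total number of bytes queued, or -1 on error. */
ssize_t send_header_and_file(int sock, const void *hdr, size_t hdr_len,
                             const char *path)
{
    int cork_on = 1, cork_off = 0;
    ssize_t sent = -1;
    off_t pos = 0;
    struct stat st;

    int fd = open(path, O_RDONLY);
    if (fd < 0)
        return -1;
    if (fstat(fd, &st) < 0)
        goto out;

    /* cork: hold back partial frames until the body has been queued */
    if (setsockopt(sock, IPPROTO_TCP, TCP_CORK, &cork_on, sizeof(cork_on)) < 0)
        goto out;

    /* the header still goes through an ordinary write */
    if (write(sock, hdr, hdr_len) != (ssize_t)hdr_len)
        goto uncork;

    /* the body goes through sendfile, possibly in several calls */
    while (pos < st.st_size) {
        ssize_t n = sendfile(sock, fd, &pos, (size_t)(st.st_size - pos));
        if (n <= 0)
            goto uncork;
    }
    sent = (ssize_t)hdr_len + (ssize_t)pos;

uncork:
    /* uncork: let the kernel flush whatever is queued */
    setsockopt(sock, IPPROTO_TCP, TCP_CORK, &cork_off, sizeof(cork_off));
out:
    close(fd);
    return sent;
}
```

The extra setsockopt calls around the transfer are exactly the kind of overhead a vectored sendfile would make unnecessary.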

One rather unpleasant limitation in the current sendfile is it cannot be used when transferring files greater than 2GB. Files of such size are not all that uncommon today, and it's rather disappointing having to duplicate all that data on its way out. Because both sendfile and mmap methods are unusable in this case, a sendfile64 would be really handy in a future kernel version.

Conclusion

Despite some drawbacks, zero-copy sendfile is a useful feature, and I hope you have found this article informative enough to start using it in your programs. If you have a more in-depth interest in the subject, keep an eye out for my second article, titled "Zero Copy II: Kernel Perspective", where I will dig a bit more into the kernel internals of zero copy.
