Zero-copy technology in Linux, Part 2

Technology implementation

This series consists of two articles describing several zero-copy technologies currently used on the Linux operating system. It briefly covers how each one is implemented, what its characteristics are, and where it applies. Part 1 introduced background on zero-copy technology: why Linux needs it and which zero-copy techniques Linux offers. This article, Part 2, describes the techniques mentioned in Part 1 in more detail and analyzes their advantages and disadvantages.

Huang, software engineer, IBM

Feng Rui, software engineer, IBM

January 27, 2011

Direct I/O in Linux

If an application can access the network interface's memory directly, the data does not have to cross the memory bus before the application touches it, and the overhead of the transfer is minimal. The application, or library functions running in user mode, accesses the hardware device's memory directly, and the operating system kernel takes no part in the transfer beyond the necessary virtual memory configuration. Direct I/O lets data move directly between the application and the peripheral device, without support from the kernel page cache. For implementation details, see another developerWorks article, "Introduction to the direct I/O mechanism in Linux"; they are not repeated here.

Figure 1. Data transfer using direct I/O

Zero-copy technology for data transfers that do not need to pass through the application address space

Using mmap()

One way to reduce the number of copies in Linux is to call mmap() instead of read(), for example:

tmp_buf = mmap(file, len);
write(socket, tmp_buf, len);

After the application calls mmap(), the data is first copied into an operating system kernel buffer by DMA. The application then shares this buffer with the kernel, so no further copy between kernel memory and application memory is needed. When the application calls write(), the kernel copies the data from the original kernel buffer into the kernel buffer associated with the socket. Finally, the data is copied from the kernel socket buffer to the protocol engine, which is the third copy operation.
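
The mmap()/write() sequence above can be sketched in C roughly as follows. This is a minimal illustration rather than the authors' code: the helper name send_with_mmap and its error handling are assumptions made for the example, and out_fd stands for any writable descriptor, such as a connected socket.

```c
#include <assert.h>
#include <fcntl.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical helper: send the whole file at `path` to `out_fd` using
 * mmap() + write(). Returns the number of bytes written, or -1 on error. */
ssize_t send_with_mmap(int out_fd, const char *path)
{
    assert(path != NULL);

    int fd = open(path, O_RDONLY);
    if (fd < 0)
        return -1;

    struct stat st;
    if (fstat(fd, &st) < 0) {
        close(fd);
        return -1;
    }
    if (st.st_size == 0) {          /* mmap() rejects a zero-length mapping */
        close(fd);
        return 0;
    }
    size_t len = (size_t)st.st_size;

    /* Map the file's pages into our address space; no CPU copy happens here. */
    char *buf = mmap(NULL, len, PROT_READ, MAP_PRIVATE, fd, 0);
    close(fd);                      /* the mapping stays valid after close() */
    if (buf == MAP_FAILED)
        return -1;

    /* write() copies from the mapped pages into the kernel's socket buffer. */
    size_t done = 0;
    while (done < len) {
        ssize_t n = write(out_fd, buf + done, len - done);
        if (n < 0) {
            munmap(buf, len);
            return -1;
        }
        done += (size_t)n;
    }

    munmap(buf, len);
    return (ssize_t)done;
}
```

A caller would typically pass a connected socket as out_fd. The sketch does nothing about the file being truncated by another process while it is mapped.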

Figure 2. Using mmap() instead of read()

By using mmap() instead of read(), you can halve the number of times the operating system has to copy data, which pays off when large amounts of data are transferred. The improvement has a cost, however: mmap() carries a hidden pitfall. If a file is memory-mapped and then passed to write(), and another process truncates the file at that moment, the write() system call is interrupted by the bus error signal SIGBUS, because the process is now performing an invalid memory access. By default this signal kills the process. The problem can be handled in either of the following ways:

    1. Install a signal handler for SIGBUS, so that the write() system call returns the number of bytes written before it was interrupted, with errno set to success. The drawback is that this does not address the root cause, because SIGBUS simply indicates that something has gone seriously wrong in the process.
    2. Use a file lease, which is the better approach. The application asks the kernel for a read or write lease on the file. When another process tries to truncate the file being transmitted, the kernel sends the application a real-time signal, RT_SIGNAL_LEASE, telling it that the kernel is breaking the read or write lease it holds on the file. The write() system call is then interrupted before the process can be killed by SIGBUS; it returns the number of bytes written before the interruption, with errno set to success. The lease must be set before the file is memory-mapped.
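
The second approach might look like the following sketch. It assumes the Linux-specific fcntl() lease commands F_SETLEASE and F_GETLEASE as documented in fcntl(2); the helper names take_read_lease and drop_lease are hypothetical. Note that fcntl(2) documents SIGIO as the default lease-break notification, with F_SETSIG available to select a different signal, so the signal an application actually receives may differ from the name used above.

```c
#define _GNU_SOURCE             /* F_SETLEASE / F_GETLEASE are Linux-specific */
#include <assert.h>
#include <fcntl.h>
#include <unistd.h>

/* Hypothetical helper: take a read lease on `fd` before mapping the file.
 * A read lease can only be placed on a descriptor opened read-only.
 * Returns 0 on success, or -1 with errno set by fcntl(). */
int take_read_lease(int fd)
{
    assert(fd >= 0);
    return fcntl(fd, F_SETLEASE, F_RDLCK);
}

/* Hypothetical helper: give the lease up once the transfer is finished,
 * or promptly after the kernel signals that the lease is being broken. */
int drop_lease(int fd)
{
    assert(fd >= 0);
    return fcntl(fd, F_SETLEASE, F_UNLCK);
}
```

Whether a lease can be granted depends on the file system and on the caller owning the file (or holding CAP_LEASE), so real code should check the return value and fall back to another strategy when the call fails.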

Using mmap() is POSIX compliant, but it does not necessarily deliver the desired transfer performance. The transfer still requires one CPU copy, and mapping is itself a costly virtual memory operation: consistency has to be maintained by changing page tables and flushing (invalidating) the TLB. However, because a mapping usually covers a large range, for the same amount of data the cost of mapping is much lower than the cost of a CPU copy.

sendfile()

To simplify the user interface while keeping the benefit of the mmap()/write() technique, namely fewer CPU copies, Linux introduced the sendfile() system call in version 2.1.

sendfile() not only reduces data copying, it also reduces context switches. First, the sendfile() system call uses the DMA engine to copy the file data into a kernel buffer; the data is then copied into the kernel buffer associated with the socket. Next, the DMA engine copies the data from the kernel socket buffer to the protocol engine. If another process truncates the file while sendfile() is transferring it, sendfile() simply returns the number of bytes transferred before the interruption, with errno set to success. If a lease is taken on the file before sendfile() is called, the behavior and return status of sendfile() are the same as for mmap()/write().
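
As a rough illustration of the call described above, the sketch below sends a whole file with sendfile(2). The helper name send_with_sendfile is an assumption made for the example; the loop reflects the fact that sendfile() may transfer fewer bytes than requested.

```c
#include <assert.h>
#include <fcntl.h>
#include <sys/sendfile.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical helper: send the whole file at `path` to `out_fd` with
 * sendfile(). Returns the number of bytes sent, or -1 on error. */
ssize_t send_with_sendfile(int out_fd, const char *path)
{
    assert(path != NULL);

    int fd = open(path, O_RDONLY);
    if (fd < 0)
        return -1;

    struct stat st;
    if (fstat(fd, &st) < 0) {
        close(fd);
        return -1;
    }

    /* The kernel advances `offset` as it sends; the data never enters
     * the application's address space. */
    off_t offset = 0;
    while (offset < st.st_size) {
        ssize_t n = sendfile(out_fd, fd, &offset,
                             (size_t)(st.st_size - offset));
        if (n < 0) {
            close(fd);
            return -1;
        }
        if (n == 0)     /* nothing left to send, e.g. the file was truncated */
            break;
    }

    close(fd);
    return (ssize_t)offset;
}
```

In a file server, out_fd would be the connected client socket. According to sendfile(2), kernels before 2.6.33 require out_fd to be a socket, while later kernels accept any file.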

Figure 3. Data transfer with sendfile()

The sendfile() system call does not copy or map data into the application address space, so it is suitable only when the application does not need to process the data it transfers. Compared with mmap(), sendfile() greatly reduces memory management overhead, because the transferred data never crosses the boundary between the user application and the operating system kernel. However, sendfile() also has a number of limitations:

    • sendfile() is limited to network applications built around serving files, such as web servers. It is said that sendfile() was implemented in the Linux kernel only so that Apache, which already used sendfile() on other platforms, could use it on Linux.
    • Because network transmission is asynchronous, it is hard to implement a counterpart to sendfile() on the receiving side, so the receiving end of a transfer does not use this technique.
    • From a performance standpoint, sendfile() still needs one CPU copy from the file to the socket buffer, which pollutes the page cache with the data being transmitted.

sendfile() with DMA gather copy

The sendfile() technique described in the previous section still makes one extra copy of the data being transferred. With a little help from the hardware, even this last copy can be avoided. To eliminate the copy performed by the kernel, the network interface must support gather operations, meaning that the data to be transmitted can be scattered across different memory locations rather than stored contiguously. The data read from the file then does not need to be copied into the socket buffer at all. Only a buffer descriptor is passed to the network protocol stack, which builds the packet structure in the buffer and uses the DMA gather copy function to assemble all the pieces into one network packet. The NIC's DMA engine reads the headers and the data from multiple locations in a single operation. The socket buffers in Linux 2.4 meet this requirement, and this is the well-known Linux zero-copy technique. It reduces the overhead of multiple context switches as well as the number of data copies made by the processor, and it requires no change to user application code. First, the sendfile() system call uses the DMA engine to copy the file contents into a kernel buffer. Then a buffer descriptor carrying the file location and length is appended to the socket buffer, so the data itself is not copied from the kernel buffer into the socket buffer. Finally, the DMA engine copies the data directly from the kernel buffer to the protocol engine, avoiding the last remaining copy.

Figure 4. sendfile() with DMA gather copy

In this way the CPU not only avoids copying the data during the transfer; in theory it never touches the transmitted data at all, which benefits CPU performance in two ways. First, the CPU cache is not polluted. Second, cache consistency does not have to be maintained, because the cache need not be flushed before or after the DMA transfer. In practice, however, the second benefit is very hard to achieve. The source buffer is likely to be part of the page cache, which means an ordinary read operation can access it in the traditional way. As long as the CPU can access the memory region, cache consistency must be maintained by flushing the cache before the DMA transfer. Furthermore, the gather copy function requires support from both the hardware and the device driver.

splice()

splice() is a Linux facility similar to mmap() and sendfile(). It too can be used for data transfer between the user application address space and the operating system address space. splice() suits user applications that can determine the data transfer path, and it does not require explicit transfer operations through buffers in the user address space. It is a good choice when data simply moves from one place to another and does not need to be processed by the user application along the way. splice() can move data entirely within the operating system address space, eliminating most data copies. Moreover, a splice() transfer can be carried out asynchronously: the user application can return from the system call while the kernel continues to drive the transfer. splice() can be thought of as a stream-based pipe that connects two file descriptors to each other, with the caller controlling how the two devices (or protocol stacks) are connected inside the kernel.

The splice() system call is very similar to sendfile(): the user application must have two open file descriptors, one for the input device and one for the output device. Unlike sendfile(), splice() allows any two files to be connected to each other, not just a file to a socket. For the special case of sending data from a file descriptor to a socket, sendfile() has always been the system call to use, whereas splice has always been a general mechanism that is not limited to what sendfile() does. In other words, sendfile() is only a subset of splice(). As of Linux 2.6.23 the separate implementation of sendfile() is gone; the API and its functionality remain, but they are now implemented on top of the splice() mechanism.

During a transfer, the splice() mechanism alternately issues read and write operations on the relevant file descriptors, and the read buffer can be reused for the write. It also applies a simple form of flow control, blocking write requests by means of predefined watermarks. Experiments have shown that transferring data from one disk to another in this way increases throughput by 30% to 70%, and that CPU load during the transfer is halved.

The splice() system call was introduced in the Linux 2.6.17 kernel, but the concept is much older. Larry McVoy proposed it in 1988 as a technique for improving the I/O performance of server-side systems. Although it was mentioned often in the years that followed, splice was never implemented in the mainline Linux kernel until the 2.6.17 release. As presented here, the splice system call takes four parameters: two file descriptors, a length, and a flags value that controls how the data is copied. splice can operate synchronously or asynchronously. In asynchronous mode, the user application learns that the transfer has finished through the SIGIO signal. The interface of the splice() system call is as follows:

long splice(int fdin, int fdout, size_t len, unsigned int flags);

Calling splice() causes the kernel to move up to len bytes from the data source fdin to fdout. The data moves only through kernel space, with a minimal number of copies. One of the two file descriptors must refer to a pipe. This is clearly a limitation of the design, which later versions of Linux are expected to improve. The flags parameter specifies how the copy is performed; it currently accepts the following values:

    • SPLICE_F_NONBLOCK: the splice operation does not block. However, if the file descriptors have not been set to non-blocking I/O, the call to splice may still block.
    • SPLICE_F_MORE: tells the kernel that more data will follow in a subsequent splice call.
    • SPLICE_F_MOVE: if the output is a file, this value asks the kernel to try to move the data directly from the input pipe buffer into the output address space, without any data copy taking place.
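
The following sketch shows one way the pipe requirement and the flags above might be used together. It is written against the six-argument splice(2) interface declared by current glibc, which adds two offset pointers to the four-parameter form shown earlier. The helper name splice_copy is hypothetical, and error handling is kept to a minimum.

```c
#define _GNU_SOURCE             /* splice() is Linux-specific */
#include <assert.h>
#include <fcntl.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical helper: move up to `len` bytes from in_fd to out_fd through
 * an intermediate pipe, so that each splice() call has a pipe on one side.
 * Returns the number of bytes moved, or -1 on error. */
ssize_t splice_copy(int in_fd, int out_fd, size_t len)
{
    assert(in_fd >= 0 && out_fd >= 0);

    int p[2];
    if (pipe(p) < 0)
        return -1;

    ssize_t total = 0;
    while (len > 0) {
        /* Step 1: source descriptor -> pipe (uses in_fd's file offset). */
        ssize_t n = splice(in_fd, NULL, p[1], NULL, len,
                           SPLICE_F_MOVE | SPLICE_F_MORE);
        if (n == 0)             /* end of input */
            break;
        if (n < 0) {
            total = -1;
            break;
        }

        /* Step 2: drain the pipe into the destination descriptor. */
        ssize_t left = n;
        while (left > 0) {
            ssize_t m = splice(p[0], NULL, out_fd, NULL, (size_t)left,
                               SPLICE_F_MOVE | SPLICE_F_MORE);
            if (m <= 0)
                break;
            left -= m;
        }
        if (left > 0) {         /* the drain step failed part-way */
            total = -1;
            break;
        }

        total += n;
        len -= (size_t)n;
    }

    close(p[0]);
    close(p[1]);
    return total;
}
```

Draining the pipe completely before refilling it keeps the sketch simple; a production loop would also handle EINTR and EAGAIN.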

The splice() system call builds on the pipe buffer mechanism introduced in Linux, which is why at least one of its two file descriptor parameters must refer to a pipe. To support the splice mechanism, Linux adds the following two definitions to the file_operations structure used by devices and file systems:

ssize_t (*splice_write)(struct inode *pipe, struct file *out,
                        size_t len, unsigned int flags);
ssize_t (*splice_read)(struct inode *in, struct file *pipe,
                       size_t len, unsigned int flags);

These two new operations move len bytes between pipe and in or out, as directed by flags. Linux file systems already implement these operations and they are ready for use; a generic_splice_sendpage() function is also provided for interfacing with sockets.

Zero-copy technology optimized for data transfer between the application address space and the kernel

The zero-copy techniques described above all work by keeping data from being copied between user application buffers and kernel buffers as far as possible, and applications that use them are usually confined to special situations: either the data cannot be processed in the kernel, or it cannot be processed in the user address space. The technique presented in this section keeps the traditional model of passing data between the user address space and the kernel address space, but optimizes the transfer itself. Data transfer between system software and hardware can be made more efficient through DMA, but no comparable tool exists for transfers between user applications and the operating system. The techniques in this section were proposed to address that situation.

Using copy-on-write

In some cases, the page cache in the Linux kernel may be shared by several applications, and the operating system may map pages of a user application's buffer into the kernel address space. If an application calls write() on such shared data, it can corrupt the shared data in the kernel buffer, and the traditional write() system call offers no explicit locking. Linux therefore uses copy-on-write to protect the data.

What is copy-on-write

Copy-on-write is an optimization strategy in computer programming. The basic idea is this: when several applications need access to the same piece of data at the same time, each is given a pointer to that data, and from each application's point of view it holds its own copy. Only when one application needs to modify its copy is the data actually copied into that application's address space, giving it a genuinely private copy so that its changes are not seen by the others. The process is transparent to the application. If an application never modifies the data it accesses, the data never has to be copied into its address space. This is the main advantage of copy-on-write.

Implementing copy-on-write requires MMU support. The MMU has to know which pages in a process's address space are read-only. When a write to one of these pages is attempted, the MMU raises an exception to the kernel, which allocates new physical memory and maps the page being written to the new physical location.
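
One place where this behavior can be observed from user space is a private file mapping created with mmap() and MAP_PRIVATE, which the kernel handles with copy-on-write. The sketch below is only a demonstration of that effect, not the network transfer technique discussed in this section; the function name cow_demo is hypothetical.

```c
#include <assert.h>
#include <fcntl.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical demo: modify a MAP_PRIVATE mapping of the file at `path`
 * (which must contain at least one byte) and return the file's first byte
 * as read() still sees it. Returns -1 on error. */
int cow_demo(const char *path)
{
    assert(path != NULL);

    /* Read-only descriptor: private writes never reach the file. */
    int fd = open(path, O_RDONLY);
    if (fd < 0)
        return -1;

    size_t page = (size_t)sysconf(_SC_PAGESIZE);
    char *p = mmap(NULL, page, PROT_READ | PROT_WRITE, MAP_PRIVATE, fd, 0);
    if (p == MAP_FAILED) {
        close(fd);
        return -1;
    }

    /* The first write faults; the kernel gives this process its own copy
     * of the page and applies the change to that copy only. */
    p[0] = 'X';
    int private_byte = p[0];

    char in_file = 0;
    ssize_t n = read(fd, &in_file, 1);   /* reads the unmodified file data */

    munmap(p, page);
    close(fd);

    if (n != 1 || private_byte != 'X')
        return -1;
    return (unsigned char)in_file;
}
```

If the process had only read through the mapping, no page would have been copied at all, which is the saving described above.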

The biggest benefit of copy-on-write is that it saves memory. For the kernel, however, copy-on-write makes processing more complex.

Implementation of data transfer and its limitations

Data sending end

For the sending side, the implementation is relatively simple: lock the physical pages backing the application buffer, map them into the kernel address space, and mark them copy-on-write (read-only to the application). When the system call returns, both the user application and the network stack can read the data in the buffer. Once the operating system has sent all the data, the application can write to it again. If the application tries to write to the data before the transfer is complete, an exception is raised; the operating system then copies the data into a buffer of the application's own and resets the application-side mapping. After the transfer completes, the locked pages are unlocked and the COW flag is reset.

Data receiving end

For the receiving side, the implementation has to deal with much more complicated situations. If the read() system call is issued before the packet arrives and the application is blocked, read() has already told the operating system where the received data should be stored. In that case no page remapping is needed at all, and the network interface card can provide enough support for the data to be stored directly in the user application's buffer. If data reception is asynchronous, the operating system does not know where to write the data until read() is called, because it does not know the location of the user application buffer, so the kernel must first store the data in its own buffer.


Copy-on-write can be expensive for the operating system. All the buffers involved must be page-aligned and must span a whole number of MMU pages. For the sending side this causes no problems, but the receiving side has more complex situations to handle. First, the packet size must be right: it has to cover exactly a full page of data, which limits the technique to networks whose MTU is larger than a system memory page, such as FDDI and ATM. Second, in order to remap pages onto a stream of packets without interruption, the data portion of each packet must occupy a whole number of pages.

For data received asynchronously, one way to move the data into the user address space efficiently is this: with support from the network interface card, each incoming packet is split into header and data, the data is stored in a separate buffer, and the virtual memory system then maps that data to the user address space buffer. This method has two prerequisites, both mentioned above: the application buffer must be page-aligned and contiguous in virtual memory, and incoming data can be split out only when it arrives in page-sized units. In practice these two prerequisites are hard to satisfy. If the application buffer is not page-aligned, or the packet size exceeds one page, the data has to be copied.

For the data sender, even though the data is write-protected against the application during transmission, the application still needs to avoid using these busy buffers, because a copy-on-write operation is costly. Without end-to-end notification, it is hard for the application to know whether a buffer has been released or is still in use.

This zero-copy technique is best suited to situations in which copy-on-write events are rare, because a copy-on-write event costs far more than a single CPU copy. In practice, most applications reuse the same buffers many times, so it is more efficient not to remove the page mapping from the kernel address space after the data has been used once. Since the same page may well be accessed again, keeping the mapping saves management overhead. Keeping it does not, however, reduce the cost of page table updates and TLB flushing, because the page's read-only flag has to be changed every time the page is locked or unlocked for copy-on-write.

Buffer sharing

Another approach uses pre-mapped shared buffers to move data quickly between the application address space and the kernel. An architecture based on buffer sharing was first implemented on Solaris, using the concept of "fbufs" (fast buffers). This approach requires changing the API. Data transfer between the application address space and the kernel address space must strictly follow the fbufs architecture, as must communication between kernel components. Each application has a buffer pool that is mapped into both the user address space and the kernel address space and can be created when needed. Because a buffer is created with a single virtual memory operation, fbufs effectively avoids most of the performance problems caused by maintaining memory consistency. On Linux the technique is still experimental.

Why extend the Linux I/O API

Traditional Linux I/O interfaces such as the read and write system calls are copy-based: data has to be copied between the kernel and buffers defined by the application. For read, the user application passes the kernel a pre-allocated buffer, and the kernel must place the data it reads into that buffer. For write, the user application is free to reuse the data buffer as soon as the system call returns.

To support this mechanism, Linux has to set up and tear down virtual memory mappings for every operation. The cost of this page remapping depends on factors such as the machine configuration, the cache architecture, the cost of handling TLB misses, and whether the machine is a uniprocessor or a multiprocessor. I/O performance improves considerably if the overhead of virtual memory and TLB operations can be avoided when handling I/O requests. fbufs is such a mechanism: with the fbufs architecture, virtual memory operations can be avoided. Reported data show that on the DECstation™ 5000/200, a single-processor workstation, fbufs performs much better than the page remapping method described above. Using the fbufs architecture requires extending the Linux API in order to achieve efficient and comprehensive zero-copy.

Introduction to how fast buffers (fbufs) work

I/O data is stored in buffers called fbufs, each of which consists of one or more contiguous virtual memory pages. Access to an fbuf is granted per protection domain, in one of the following two ways:

    • If the application allocated the fbuf, it has access to that fbuf.
    • If the application received the fbuf through IPC, it likewise has access to that fbuf.

In the first case, the protection domain is called the "originator" of the fbuf; in the second, it is called the "receiver" of the fbuf.

The traditional Linux I/O interface exchanges data between the application address space and the kernel, which means all the data is copied. With fbufs, what is exchanged is the buffer that contains the data, which eliminates the extra copy. The application passes an fbuf to the kernel, which removes the copy overhead of a traditional write system call. Likewise, the application receives data in an fbuf, which removes the copy overhead of a traditional read system call, as shown in the following figure:

Figure 5. Linux I/O API

The I/O subsystem or an application can allocate fbufs through the fbufs manager. Once allocated, fbufs can be passed from the program to the I/O subsystem or from the I/O subsystem to the program. After use, they are released back to the fbufs buffer pool.

The fbufs implementation has the following characteristics, as shown in Figure 6:

    • An fbuf is allocated from an fbufs buffer pool. Each fbuf has an owner, either an application or the operating system kernel. fbufs can be passed between the application and the operating system. After use, an fbuf must be released back to the specific pool it came from, so it carries information about that pool with it as it is passed around.
    • Each fbufs buffer pool is associated with one application, and an application can be associated with at most one pool. An application may access only its own pool.
    • fbufs does not require virtual address remapping, because each application reuses the same set of buffers. The virtual memory translation information can therefore be cached, and the overhead in the virtual memory subsystem is eliminated.
    • The I/O subsystem (device drivers, file systems, and so on) can allocate fbufs and place incoming data directly into them, which avoids copying between buffers.

Figure 6. fbufs architecture
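
fbufs is not part of the standard Linux API, so the following is only a user-space sketch of the ownership rules listed above: a buffer is allocated from a pool, handed between an "application" side and a "kernel" side without its data being copied, and released back to the pool it came from. Every name here (struct fbuf, fbuf_alloc, fbuf_pass, and so on) and both size constants are hypothetical and chosen purely for illustration.

```c
#include <assert.h>
#include <stddef.h>

#define POOL_BUFS  4        /* hypothetical number of buffers per pool */
#define FBUF_BYTES 4096     /* hypothetical buffer size */

enum fbuf_owner { FBUF_FREE, FBUF_APP, FBUF_KERNEL };

struct fbuf_pool;

struct fbuf {
    struct fbuf_pool *pool;     /* pool this fbuf must be released back to */
    enum fbuf_owner   owner;    /* side that currently owns the buffer */
    size_t            len;      /* number of valid data bytes */
    unsigned char     data[FBUF_BYTES];
};

struct fbuf_pool {
    struct fbuf bufs[POOL_BUFS];
};

/* Mark every buffer free and record which pool it belongs to. */
void fbuf_pool_init(struct fbuf_pool *p)
{
    assert(p != NULL);
    for (size_t i = 0; i < POOL_BUFS; i++) {
        p->bufs[i].pool = p;
        p->bufs[i].owner = FBUF_FREE;
        p->bufs[i].len = 0;
    }
}

/* Allocate a free fbuf on behalf of `who`; NULL if the pool is exhausted. */
struct fbuf *fbuf_alloc(struct fbuf_pool *p, enum fbuf_owner who)
{
    assert(p != NULL && who != FBUF_FREE);
    for (size_t i = 0; i < POOL_BUFS; i++) {
        if (p->bufs[i].owner == FBUF_FREE) {
            p->bufs[i].owner = who;
            return &p->bufs[i];
        }
    }
    return NULL;
}

/* Hand the buffer to the other side; only the ownership field changes.
 * Returns -1 if `from` is not the current owner. */
int fbuf_pass(struct fbuf *b, enum fbuf_owner from, enum fbuf_owner to)
{
    assert(b != NULL);
    if (from == FBUF_FREE || to == FBUF_FREE || b->owner != from)
        return -1;
    b->owner = to;
    return 0;
}

/* Return the buffer to the pool recorded in b->pool. */
void fbuf_release(struct fbuf *b)
{
    assert(b != NULL);
    b->owner = FBUF_FREE;
    b->len = 0;
}
```

The ownership check in fbuf_pass is a stand-in for the rule that a side may use a buffer only while it owns it; a real implementation would enforce this through page protections rather than a field in the structure.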

As mentioned earlier, this method requires changes to the API: to use the fbufs architecture, both applications and the drivers in the Linux kernel must use the new API. To send data, an application takes an fbuf from the buffer pool, fills it with data, and sends it through a file descriptor. A received fbuf can be kept by the application for a while, after which the application can reuse it to send other data or return it to the buffer pool. In some cases, however, the data within packets has to be reassembled, and an application that received the data in an fbuf then has to copy it into another buffer. Furthermore, an application must not modify data that the kernel is currently processing, so the fbufs architecture introduces mandatory locking to enforce this. Once an application has sent fbufs to the kernel, it no longer touches them.

Problems with fbufs

Managing a shared buffer pool requires close cooperation among applications, network software, and device drivers. On the receiving side, the network hardware must be able to DMA incoming packets into the correct buffer pool allocated by the receiver. An application can also, through a moment's carelessness, modify data it has already sent into shared memory and so corrupt it, and such problems are hard to debug on the application side. In addition, shared memory is difficult to associate with other kinds of memory objects, yet the close cooperation among applications, network software, and device drivers needs the support of other memory managers. Buffer sharing looks promising, but it requires changes not only to the API but also to drivers, and the technique itself has unresolved problems, so for now it remains experimental. In test systems the technique delivers large performance gains, but installing the new architecture across an entire system does not yet seem feasible. Because of granularity issues, the pre-allocated shared buffer mechanism also sometimes requires data to be copied into another buffer.


This series describes zero-copy technology in Linux, and this article is its second part. It has covered in more detail the zero-copy techniques on Linux that Part 1 introduced, concentrating on their respective advantages, drawbacks, and applicable scenarios. For network data transfer, the use of zero-copy technology is hindered by many architectural factors, including the virtual memory architecture and the network protocol architecture. Zero-copy technology is therefore still applicable only in rather special situations, such as file serving or high-bandwidth communication over a special protocol. For disk operations, zero-copy is much more feasible, probably because disk operations are synchronous and data is transferred in page-sized units.

Many zero-copy techniques have been proposed and implemented for the Linux platform, but not all of them are widely used in real-world operating systems. The fbufs architecture, for example, looks attractive in many respects, but using it requires changes to the API and to drivers, and it has other implementation difficulties as well, so it has remained experimental. Dynamic address remapping needs only small changes to the operating system and none to user software, but current virtual memory architectures do not cope well with frequent virtual address remapping. To preserve memory consistency, the TLB and the first-level cache must also be flushed after remapping. In fact, zero-copy based on address remapping has a very narrow range of application, because the overhead of the virtual memory operations is often greater than that of a CPU copy. In addition, completely eliminating CPU access to the data often requires additional hardware support, which is neither widespread nor cheap.

The purpose of this series is to help readers understand how these zero-copy techniques in the Linux operating system help with the performance problems that arise during data transfer. The series does not describe the implementation of each technique in detail. Zero-copy technology also continues to develop and improve, and the series does not cover every zero-copy technique that has appeared on Linux.
