Zero Copy Technology in Linux, part 2


Direct I/O in Linux
If an application can access the network interface's storage directly, the data does not have to traverse the storage bus before the application touches it, so the overhead of the transfer is minimal. Applications or library functions (for example, database engines) running in user mode can access the hardware device's storage directly; apart from the necessary virtual memory setup, the operating system kernel takes no part in the transfer. Direct I/O lets data move directly between the application and the peripheral device without going through the kernel page cache. The implementation of direct I/O is covered in another developerWorks article, "Introduction to direct I/O mechanism in Linux", and is not discussed further here.


Figure 1. Data transmission using direct I/O


Zero-copy techniques where data does not pass through the application address space
Using mmap()

In Linux, one way to reduce the number of copies is to call mmap() instead of read(). For example:


tmp_buf = mmap(file, len);
write(socket, tmp_buf, len);

First, after the application calls mmap(), the data is copied into an operating system kernel buffer by DMA. The application then shares that buffer with the kernel, so no copy is needed between the kernel and the application's address space. When the application calls write(), the kernel copies the data from the original kernel buffer into the socket-related kernel buffer. Finally, the data is copied from the kernel socket buffer to the protocol engine; this is the third copy operation.
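The two-line pseudocode above is schematic; the real mmap() call takes six arguments. Below is a minimal sketch of the same mmap()/write() path in C, assuming an already-connected socket descriptor and an illustrative helper name; error handling and short-write handling are reduced to a minimum.

/* Sketch of the mmap()/write() transfer path described above.
 * "sockfd" is assumed to be a connected socket; a real program
 * would also loop on short writes. */
#include <fcntl.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

ssize_t send_file_mmap(int sockfd, const char *path)
{
    int fd = open(path, O_RDONLY);
    if (fd < 0)
        return -1;

    struct stat st;
    if (fstat(fd, &st) < 0 || st.st_size == 0) {
        close(fd);
        return -1;
    }

    /* DMA fills the page cache; the pages are then shared with us. */
    void *buf = mmap(NULL, st.st_size, PROT_READ, MAP_SHARED, fd, 0);
    if (buf == MAP_FAILED) {
        close(fd);
        return -1;
    }

    /* One CPU copy: page cache -> socket buffer. */
    ssize_t sent = write(sockfd, buf, st.st_size);

    munmap(buf, st.st_size);
    close(fd);
    return sent;
}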


Figure 2. Using mmap() instead of read()

By using mmap() instead of read(), the number of data copies performed by the operating system can be halved, which improves efficiency when large amounts of data are transferred. This improvement comes at a cost, however: mmap() carries a hidden risk. While the file is memory-mapped and the write() system call is in progress, another process may truncate the file; write() is then interrupted by the SIGBUS signal, because the process is performing an invalid memory access, and this signal kills the process by default. There are two ways to deal with the problem:

  1. Install a signal handler for SIGBUS, so that the write() system call returns the number of bytes written before the interruption, with errno set to success. However, this approach has a drawback: it treats the symptom rather than the cause, because SIGBUS only indicates that the process has run into a serious error.
  2. The second, and better, approach is to use a file lease. The application asks the kernel for a read or write lease on the file. When another process attempts to truncate the file being transferred, the kernel sends the application a real-time signal, RT_SIGNAL_LEASE, indicating that the kernel is about to break the lease held on the file. The write() system call is then interrupted before the process can access an invalid address and be killed by SIGBUS; it returns the number of bytes written before the interruption, and errno is set to success. The lease must be taken out before the file is memory-mapped. A sketch of this approach appears after this list.
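Below is a minimal sketch of the lease-based approach on Linux, using fcntl() with F_SETLEASE. By default the kernel notifies the lease holder with SIGIO; F_SETSIG selects a different signal, and the real-time signal chosen here is an illustrative assumption that corresponds to the RT_SIGNAL_LEASE idea described above. The handler body is likewise only a sketch.

/* Sketch: take a read lease on the file before mapping it, so a
 * truncation by another process triggers a lease-break signal
 * instead of an unexpected SIGBUS during write(). "fd" is assumed
 * to be an open descriptor for the file being transferred. */
#define _GNU_SOURCE           /* F_SETLEASE, F_SETSIG */
#include <fcntl.h>
#include <signal.h>

#define LEASE_SIGNAL (SIGRTMIN + 1)   /* illustrative choice */

static volatile sig_atomic_t lease_broken;

static void on_lease_break(int sig)
{
    (void)sig;
    lease_broken = 1;         /* stop using the mapping as soon as possible */
}

int guard_file_with_lease(int fd)
{
    struct sigaction sa = { .sa_handler = on_lease_break };
    if (sigaction(LEASE_SIGNAL, &sa, NULL) < 0)
        return -1;

    /* Deliver lease-break notifications on our chosen signal
     * instead of the default SIGIO. */
    if (fcntl(fd, F_SETSIG, LEASE_SIGNAL) < 0)
        return -1;

    /* Take a read lease; it is broken if another process opens the
     * file for writing or truncates it. */
    if (fcntl(fd, F_SETLEASE, F_RDLCK) < 0)
        return -1;

    return 0;
}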

mmap() is POSIX compliant, but it does not necessarily deliver ideal transfer performance. One CPU copy is still required, and the mapping itself is an expensive virtual memory operation: memory consistency must be maintained by changing page tables and flushing the TLB (invalidating its contents). Because a mapping usually covers a large region, however, for the same amount of data the cost of the mapping is much lower than the cost of a CPU copy.

sendfile()

To simplify the user interface while keeping the advantage of the mmap()/write() approach, namely fewer CPU copies, Linux introduced sendfile() in version 2.1.

sendfile() not only reduces data copying, it also reduces context switching. First, the sendfile() system call uses the DMA engine to copy the file data into a kernel buffer, and then the data is copied into the socket-related kernel buffer. Next, the DMA engine copies the data from the kernel socket buffer to the protocol engine. If another process truncates the file while sendfile() is transferring it, sendfile() simply returns the number of bytes transferred before the interruption, with errno set to success. If the operating system takes a lease on the file before sendfile() is called, the behavior and return status are the same as in the mmap()/write() case.
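A minimal sketch of this sendfile() path is shown below; the descriptor names are assumptions, and the loop simply retries until the whole file has been handed to the kernel.

/* Sketch of a file-to-socket transfer with sendfile(2).
 * "sockfd" is assumed to be a connected socket and "filefd" an open file. */
#include <sys/sendfile.h>
#include <sys/stat.h>

ssize_t send_whole_file(int sockfd, int filefd)
{
    struct stat st;
    if (fstat(filefd, &st) < 0)
        return -1;

    off_t offset = 0;
    while (offset < st.st_size) {
        /* The kernel copies page cache -> socket buffer (one CPU copy,
         * or none on scatter/gather NICs) and advances "offset" for us. */
        ssize_t n = sendfile(sockfd, filefd, &offset, st.st_size - offset);
        if (n <= 0)
            return -1;
    }
    return offset;
}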


Figure 3. Data transmission using sendfile()


The sendfile() system call neither copies nor maps data into the application address space, so it is applicable only when the application does not need to inspect or modify the data it transfers. Compared with mmap(), sendfile() greatly reduces memory-management overhead, because the transferred data never crosses the boundary between the user application and the operating system kernel. However, sendfile() also has several limitations:

  • sendfile() is limited to file-serving network applications, such as web servers. It is said that sendfile() was implemented in the Linux kernel mainly so that Apache, which already used sendfile() on other platforms, could use it on Linux as well.
  • Because network transmission is asynchronous, it is difficult to implement a matching operation on the receiving side of sendfile(), so this technique is not used at the receiving end of a transfer.
  • For performance reasons, sendfile() still requires one CPU copy, from the file to the socket buffer, which also means the page cache can be polluted by the transferred data.

sendfile() with DMA gather copy
The sendfile() technique described in the previous section still performs one redundant data copy. With a little help from hardware, even that single copy can be avoided. To eliminate the copy done by the kernel, the network interface must support gather operations, that is, the data to be transmitted may be scattered across different memory locations rather than stored contiguously. With such hardware, the data read from the file does not have to be copied into the socket buffer; only a buffer descriptor is passed to the network protocol stack, which sets up the packet structures in the buffer and lets the NIC's DMA gather engine assemble all of the pieces into a single network packet, reading the headers and the data from multiple locations in one operation. The socket buffer in Linux 2.4 meets this requirement; this is the well-known Linux zero-copy technique. The method reduces the overhead of multiple context switches as well as the number of copies made by the processor, and the application code does not change. First, the sendfile() system call copies the file contents into a kernel buffer using the DMA engine. Then a buffer descriptor carrying the file location and length is appended to the socket buffer; no data is copied from the kernel buffer into the socket buffer. Finally, the DMA engine copies the data directly from the kernel buffer to the protocol engine, avoiding the last remaining copy.


Figure 4. sendfile() with DMA gather copy


In this way the CPU not only avoids copying the data during the transfer; in theory it never touches the transferred data at all, which benefits CPU performance twice. First, the caches are not polluted; second, cache consistency does not have to be maintained, that is, the caches do not need to be flushed before or after the DMA transfer. In practice, though, the second benefit is hard to realize. The source buffer may be part of the page cache, which means an ordinary read can access it, and that access can happen in the traditional way. As long as the memory region can be accessed by the CPU, cache consistency still has to be maintained.

splice()

splice() is another Linux mechanism, similar to mmap() and sendfile(), that avoids copying data between the user address space and the operating system address space. splice() suits applications that can determine the data path in advance and do not need a user-space buffer for explicit transfer operations; when data merely has to move from one place to another and is never processed by the application, splice() is a good choice. splice() moves data in whole blocks within the operating system address space, eliminating most copy operations. It can also transfer data asynchronously: the application can return from the system call while a kernel thread continues to drive the transfer. splice() can be viewed as a stream-based pipe: the pipe connects two file descriptors, and the splice() caller can have two devices (or protocol stacks) connected to each other inside the operating system kernel.

The splice() system call is very similar to sendfile(). The application must have two open file descriptors, one for the input device and one for the output device. Unlike sendfile(), splice() allows any two files to be connected to each other, not just a file to a socket. For the special case of sending data from a file descriptor to a socket, sendfile() has always been the interface used, whereas splice() has always been a more general mechanism that is not limited to sendfile()'s semantics; in other words, sendfile() is a subset of splice(). Since Linux 2.6.23 the dedicated implementation of sendfile() is gone, but the API and its semantics remain; they are now implemented on top of the splice() mechanism.

During data transmission, the splice() mechanism alternates read and write operations on the two file descriptors and can reuse the read buffer for the write. It also uses simple flow control, blocking write requests via predefined watermarks. Experiments have shown that moving data from one disk to another with this method raises throughput by 30% to 70% while halving the CPU load during the transfer.

The splice() system call was introduced in Linux 2.6.17, but the idea is much older. Larry McVoy proposed it in 1998 as a way to improve the I/O performance of server systems, and although it was mentioned frequently in the years that followed, splice was never implemented in the mainline Linux kernel until Linux 2.6.17. The splice system call, as proposed, takes four parameters: two file descriptors, a length, and a flags argument that controls how the data is copied. splice() can be used synchronously or asynchronously; in asynchronous mode, the application learns that the transfer has finished via the SIGIO signal. The splice() interface looks like this:


long splice(int fdin, int fdout, size_t len, unsigned int flags);

When splice() is called, the kernel moves up to len bytes from the data source fdin to fdout. The data moves only inside kernel space, with the minimum number of copies. One limitation of the splice() system call is that one of the two file descriptors must refer to a pipe device; this design restriction was expected to be relaxed in later Linux versions. The flags parameter controls how the copy is performed and currently takes the following values:

  • SPLICE_F_NONBLOCK: the splice operation will not block. However, if the file descriptors themselves have not been set to non-blocking I/O, the splice call may still block.
  • SPLICE_F_MORE: informs the operating system kernel that more data will be sent in a subsequent splice call.
  • SPLICE_F_MOVE: if the output is a file, this flag lets the kernel try to move pages directly from the input pipe buffer into the output address space, so the transfer involves no data copy at all.
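The prototype shown earlier reflects the four-argument interface described in this article; the splice() wrapper that eventually shipped in glibc takes six arguments, adding optional offset pointers for the input and output descriptors. The sketch below uses that six-argument form, with illustrative descriptor names, to relay a file to a socket through a pipe, which satisfies the requirement that one side of each splice() call is a pipe.

/* Sketch: relay a file to a socket with splice(2), using a pipe as
 * the required in-kernel intermediary. "filefd" and "sockfd" are
 * assumed to be open; error handling is reduced to early exits. */
#define _GNU_SOURCE
#include <fcntl.h>
#include <unistd.h>

int splice_file_to_socket(int filefd, int sockfd, size_t len)
{
    int pipefd[2];
    if (pipe(pipefd) < 0)
        return -1;

    while (len > 0) {
        /* file -> pipe: pages are referenced or moved, never copied
         * through user space */
        ssize_t n = splice(filefd, NULL, pipefd[1], NULL, len,
                           SPLICE_F_MOVE | SPLICE_F_MORE);
        if (n <= 0)
            break;

        /* pipe -> socket */
        ssize_t left = n;
        while (left > 0) {
            ssize_t m = splice(pipefd[0], NULL, sockfd, NULL, left,
                               SPLICE_F_MOVE | SPLICE_F_MORE);
            if (m <= 0)
                goto out;
            left -= m;
        }
        len -= n;
    }
out:
    close(pipefd[0]);
    close(pipefd[1]);
    return len == 0 ? 0 : -1;
}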

The splice() system call builds on the pipe buffer mechanism in Linux, which is why at least one of its two file descriptor arguments must refer to a pipe device. To support the splice mechanism, Linux adds the following two operations to the file_operations structure used by devices and file systems:

ssize_t (*splice_write)(struct inode *pipe, struct file *out,
                        size_t len, unsigned int flags);
ssize_t (*splice_read)(struct inode *in, struct file *pipe,
                       size_t len, unsigned int flags);

These two new operations move up to len bytes between the pipe and in or out, according to the flags. The Linux file system already implements the operations that back these functions, and it also provides a generic_splice_sendpage() function for splicing to a socket.


Zero-copy techniques that optimize data transfer between the application address space and the kernel
The zero-copy techniques described so far work by avoiding, as far as possible, data copies between the user application and operating system kernel buffers. Applications that use them are generally limited to special cases: either the data does not need to be processed in the operating system kernel, or it does not need to be processed in the user address space. The techniques in this section keep the traditional model of transferring data between the user application address space and the kernel address space, but optimize the transfer itself. Data transfer between system software and hardware can be made efficient with DMA, but no similar tool exists for transfers between user applications and the operating system; the techniques introduced here were designed to address exactly that situation.
Using copy-on-write
In some cases the page cache in the Linux kernel may be shared by several applications, and the operating system may map pages of a user application's buffer into the kernel address space. If an application calls write() on such shared data, it could corrupt the shared data in the kernel buffer, because write() provides no explicit locking. Linux introduced copy-on-write to protect the data in this situation.
What is copy-on-write? Copy-on-write is an optimization strategy in computer programming. The basic idea is this: if multiple applications need to access the same piece of data at the same time, each of them can be given a pointer to that data, so each application behaves as though it has its own copy. Only when an application needs to modify the data is it actually copied into that application's address space, giving the application a truly private copy so that its changes are not visible to others. The whole process is transparent to the application. If an application never modifies the data it accesses, the copy into its own address space never happens; this is the main benefit of copy-on-write.
Copy-on-write requires MMU support. The MMU must know which pages in a process address space are read-only; when a write to such a page is attempted, the MMU raises an exception to the operating system kernel, which allocates new physical memory, and the page being written is then backed by the new physical location.
The biggest benefit of copy-on-write is that it saves memory. For the operating system kernel, however, it adds complexity to the processing path.
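As a small user-space illustration of copy-on-write, the sketch below maps a file privately: the first write to the mapping raises an MMU fault, the kernel silently gives the process its own copy of the page, and the file on disk is never changed. The file name is an assumption made for the example.

/* Minimal user-space illustration of copy-on-write semantics,
 * assuming a non-empty readable file named "data.txt" exists.
 * A MAP_PRIVATE mapping is shared with the page cache until the
 * first write, at which point the kernel transparently gives this
 * process a private copy of the page. */
#include <fcntl.h>
#include <stdio.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

int main(void)
{
    int fd = open("data.txt", O_RDONLY);
    if (fd < 0) { perror("open"); return 1; }

    struct stat st;
    if (fstat(fd, &st) < 0 || st.st_size == 0) { perror("fstat"); return 1; }

    char *p = mmap(NULL, st.st_size, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE, fd, 0);          /* copy-on-write mapping */
    if (p == MAP_FAILED) { perror("mmap"); return 1; }

    p[0] = 'X';   /* first write: MMU fault, kernel copies the page */
    printf("private view now starts with: %c\n", p[0]);

    munmap(p, st.st_size);
    close(fd);
    return 0;     /* the file itself is unchanged */
}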


Implementation and limitations of Data Transmission
Data sender

For the sender, the implementation is relatively simple: the physical pages backing the application buffer are pinned and mapped into the operating system kernel address space, and marked copy-on-write. When the system call returns, both the user application and the network stack can read the data in the buffer. After the operating system has finished transferring all the data, the application may write to the buffer again. If the application tries to write before the transfer is complete, a fault occurs; the operating system then copies the data into a private buffer for the application and remaps the application side. Once the transfer is complete, the pinned pages are unlocked and the COW marking is cleared.

Data receiving end
For the receiving side, the implementation has to handle a much more complicated situation. If the read() system call is issued before the packet arrives and the application blocks, read() tells the operating system where the data of the arriving packets should be placed; in that case no page remapping is needed, and the network interface card can store incoming data directly into the application's buffer. If the data arrives asynchronously, the operating system does not know where to write it before read() is called, because it does not yet know the location of the application buffer, so the kernel has to store the data in its own buffer first.

Limitations
Copy-on-write can impose significant overhead on the operating system. All buffers involved must be page aligned, and their size must be a whole number of MMU pages. This causes no problems for the sender, but the receiver has to handle far more complex situations. First, the packet size has to fit: the data portion must fill whole pages, which restricts the technique to networks whose MTU is larger than the system page size, such as FDDI and ATM. Second, so that pages can be remapped onto packets without any gaps, the data portion of each packet must occupy an integral number of pages. When data is received asynchronously, the network interface card can still move it to the user address space efficiently: incoming packets are split into two parts, header and data, with the data placed in a separate buffer; the virtual memory system then maps that data into the application's buffer. Two preconditions must hold for this to work: the application buffer must be page aligned and contiguous in virtual memory, and the packet can be split in this way only when it carries exactly one page of data. In practice these preconditions are hard to satisfy: if the application buffer is not page aligned, or the packet is larger than one page, the data has to be copied anyway. On the sending side, even though data written during transmission is protected by copy-on-write, the application should still avoid touching buffers that are in flight, because the cost of a copy-on-write fault is high, and without an end-to-end notification it is hard for the application to know whether a buffer has been released or is still in use.

This zero-copy technique works best when copy-on-write faults are rare, because a copy-on-write fault costs far more than a single CPU copy. In practice most applications reuse the same buffers many times, so it is more efficient not to unmap the pages from the operating system address space after a single use: since the same pages are likely to be accessed again, keeping the mapping saves management overhead. However, keeping the mapping does not reduce the cost of page-table updates and TLB flushes, because the read-only protection of the pages is changed every time they are locked or unlocked for copy-on-write.


Buffer sharing
Another approach, which uses a pre-mapping mechanism to share buffers, can also move data quickly between the application address space and the operating system kernel. An architecture built on this idea was first implemented on Solaris under the name "fbufs" (fast buffers). This approach requires changes to the API: data exchange between the application address space and the kernel address space must strictly follow the fbufs architecture, and so must communication inside the operating system kernel. Each application has a buffer pool that is mapped into both the user address space and the kernel address space and is created only when needed. By creating a buffer once with a single virtual memory operation, fbufs can eliminate most of the performance problems caused by maintaining memory consistency. In Linux this technique has remained experimental.

Why extend the Linux I/O API?
Traditional Linux input and output interfaces, such as the read and write system calls, are copy based: data is copied between the operating system kernel and a buffer defined by the application. For a read system call, the application presents a pre-allocated buffer to the kernel, and the kernel must place the data it reads into that buffer. For a write system call, the application is free to reuse the data buffer as soon as the call returns.
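For contrast with the zero-copy techniques above, the sketch below shows the traditional copy-based path that read and write imply: every byte crosses the user/kernel boundary twice, once on read() into the application buffer and once on write() back into the kernel. The descriptor names and buffer size are illustrative.

/* Sketch of the classic copy-based transfer that zero-copy techniques
 * try to avoid: kernel -> user buffer on read(), user buffer -> kernel
 * on write(). "infd" and "outfd" are assumed to be open descriptors. */
#include <unistd.h>

int copy_loop(int infd, int outfd)
{
    char buf[64 * 1024];              /* application-defined buffer */
    ssize_t n;

    while ((n = read(infd, buf, sizeof buf)) > 0) {
        ssize_t off = 0;
        while (off < n) {             /* handle short writes */
            ssize_t m = write(outfd, buf + off, n - off);
            if (m < 0)
                return -1;
            off += m;
        }
    }
    return n < 0 ? -1 : 0;
}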

To support this mechanism, Linux must be able to create and remove virtual memory mappings for each such operation. The cost of this page remapping depends on the machine configuration, the cache architecture, the cost of handling TLB misses, and whether the machine is a uniprocessor or a multiprocessor. If virtual memory and TLB operations can be avoided while handling I/O requests, I/O performance improves substantially. fbufs is such a mechanism: using the fbufs architecture avoids those virtual memory operations. Published measurements show fbufs performing much better than the page-remapping approach described above on a uniprocessor DECstation 5000/200 workstation. To build an effective and comprehensive zero-copy facility on top of fbufs, however, the Linux API must be extended.


Principles of fast buffers
I/O data is placed in buffers called fbufs; each buffer consists of one or more consecutive virtual memory pages. An application gains access to an fbuf through a protection domain, in one of two ways:

  • If an application allocated the fbuf, it has the right to access that fbuf.
  • If an application received the fbuf via IPC, it also has the right to access that fbuf.

In the first case the protection domain is called the "originator" of the fbuf; in the second case it is called the "receiver" of the fbuf.

The traditional Linux I/O interface exchanges data between the application address space and the operating system kernel, and that exchange copies all of the data. With fbufs, it is the buffers containing the data that are exchanged, which eliminates the redundant copies. The application passes an fbuf to the kernel, avoiding the data copy incurred by a traditional write system call; likewise, the application receives data in an fbuf, avoiding the data copy incurred by a traditional read system call. The result is shown in Figure 5:

Figure 5. Linux I/O API



I/O subsystems and applications allocate fbufs through the fbufs manager. Once allocated, an fbuf can be passed from the application to the I/O subsystem, or from the I/O subsystem to the application. After use, it is released back into the fbufs buffer pool.

The fbufs implementation has the following characteristics, as shown in Figure 6:

  • Each fbuf is allocated from an fbufs buffer pool and has an owner, either an application or the operating system kernel. An fbuf can be passed between applications and the operating system; after use it must be released back into the specific pool it came from, so an fbuf carries information about its pool while it is being passed around.
  • Each fbufs buffer pool is associated with one application, and an application is associated with at most one fbufs buffer pool. An application is only entitled to access its own pool.
  • fbufs do not require virtual address remapping, because each application can reuse the same set of buffers. The cached virtual-to-physical translation information therefore remains valid, which eliminates the overhead in the virtual memory subsystem.
  • I/O subsystems (device drivers, file systems, and so on) can allocate fbufs and place incoming data into them directly, avoiding copies between buffers.

Figure 6. fbufs Architecture



As mentioned above, this approach requires API changes. To use the fbufs architecture, both applications and the Linux kernel drivers must use the new API. To send data, an application obtains an fbuf from the buffer pool, fills it, and sends it through a file descriptor. A received fbuf can be retained by the application for a while, reused to send other data, or returned to the buffer pool. In some cases, however, the data inside a packet has to be reassembled, and then an application that received the data in an fbuf has to copy it into another buffer. Moreover, an application cannot modify data that the kernel is still processing; to enforce this, the fbufs architecture introduces the notion of a mandatory lock. For the application, this means that once an fbuf has been handed to the operating system kernel, the application must not touch it.

Problems with fbufs

Managing shared buffer pools requires tight cooperation between applications, network software, and device drivers. On the receiving side, the network hardware must be able to DMA arriving packets into the correct buffer pool, the one allocated by the receiver. Also, if an application alters data it has already handed over through shared memory, the data is corrupted, and such corruption is hard to debug from the application side. At the same time, the shared-memory model is hard to integrate with other kinds of memory objects, yet the close cooperation between applications, network software, and device drivers requires support from other memory managers. Although the shared-buffer idea looks promising, it requires API changes as well as driver changes, and it still has unresolved problems, which is why it has never left the experimental stage. In test systems the technique delivered large performance gains, but installing the new architecture as a whole has not proved practical. Because of granularity issues, this pre-allocated shared buffer mechanism sometimes still has to copy data into another buffer.


Summary
This series of articles introduces zero-copy technology in Linux; this article is the second part. It takes a closer look at the zero-copy techniques introduced in the first part, focusing on their respective advantages, drawbacks, and applicable scenarios. For network data transmission, zero-copy techniques are held back by many architectural factors, including the virtual memory architecture and the network protocol architecture, so they can only be applied in special cases, such as file serving or high-bandwidth communication over certain special protocols. Applying zero-copy to disk operations is much more feasible, probably because disk operations are synchronous and the unit of transfer is a page.

Many zero-copy techniques have been proposed and implemented for Linux, but not all of them are widely used in real-world operating systems. The fbufs architecture, for example, looks attractive in many respects, but it requires API and driver changes and has other implementation difficulties, so it has remained experimental. Dynamic address remapping requires only minor changes to the operating system and none to user software, but current virtual memory architectures do not cope well with frequent remapping of virtual addresses, and to preserve memory consistency the TLB and first-level caches must be flushed after each remapping. In practice, zero-copy techniques based on address remapping are applicable only in a narrow range of cases, because the cost of the virtual memory operations is often greater than the cost of the CPU copy they replace. Furthermore, eliminating CPU access to the data entirely usually requires additional hardware support, which is neither common nor cheap.

The purpose of this series is to help readers weigh the performance implications of the zero-copy techniques available on Linux operating systems. The implementation details of the individual techniques are not covered in depth. Zero-copy technology also keeps evolving and improving, and this series does not cover every zero-copy technique that has appeared on Linux.


References


Learning

  • Refer to the first part of this series: Zero Copy Technology in Linux, Part 1: Overview.


  • "Zero Copy I: User-Mode Perspective", Linux Journal (2003), http://www.linuxjournal.com/node/6345, introduces several zero-copy techniques in the Linux operating system.


  • At http://www.ibm.com/developerworks/cn/aix/library/au-tcpsystemcalls/ you can find an introduction to the TCP system call sequence.


  • For the design and implementation of direct I/O in Linux, see the developerWorks article "Introduction to direct I/O mechanism in Linux".


  • At http://lwn.net/articles/28548/ you can find an introduction to the user-address-space access techniques used by zero copy on Linux.


  • The article at http://articles.techrepublic.com.com/5100-10878_11-1044112.html describes how to use sendfile() to optimize data transmission.


  • For more information about the splice() system call, see http://lwn.net/articles/178199/.


  • An article on I/O performance and CPU availability describes in detail how the splice system call performs asynchronous data transfers without involving the user application.


  • Understanding the Linux Kernel (3rd Edition) covers the corresponding Linux kernel implementation.


  • The slides at http://pages.cs.wisc.edu/~cao/cs736/slides/cs736-class18.ps propose and describe in detail the fbufs I/O buffer management mechanism and a data transfer mechanism based on shared memory.


  • Another reference introduces zero-copy technology on Linux and an fbufs implementation in a test system.




Original article: http://www.ibm.com/developerworks/cn/linux/l-cn-zerocopy2/index.html
