If an application can access the storage on the network interface directly, the storage bus does not have to be traversed before the application touches the data, so the overhead caused by data transmission is minimal. Database functions run by applications or users in user mode can access the storage of hardware devices directly; apart from the necessary virtual memory configuration, the operating system kernel takes no part in the data transmission. Direct I/O lets data be transferred directly between applications and peripheral devices without going through the operating system kernel's page cache. For details about how direct I/O is implemented, refer to another developerWorks article, "Introduction to direct I/O mechanism in Linux"; it is not covered further here.
Figure 1. Data transmission using direct I/O
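As a rough illustration of direct I/O (not taken from the original article), the sketch below opens a file with O_DIRECT so that read() transfers data straight into an aligned user buffer, bypassing the page cache. The file name and the 4096-byte alignment are assumptions; real code should query the device's logical block size.

    /* Minimal direct I/O sketch: O_DIRECT requires the buffer, the file offset
     * and the transfer length to be suitably aligned (4096 bytes assumed here). */
    #define _GNU_SOURCE
    #include <fcntl.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <unistd.h>

    int main(void)
    {
        const size_t len = 4096;
        void *buf;

        int fd = open("data.bin", O_RDONLY | O_DIRECT);   /* bypass the page cache */
        if (fd < 0) { perror("open"); return 1; }

        if (posix_memalign(&buf, 4096, len) != 0) {       /* aligned user buffer */
            fprintf(stderr, "posix_memalign failed\n");
            close(fd);
            return 1;
        }

        ssize_t n = read(fd, buf, len);                   /* DMA straight into buf */
        if (n < 0) perror("read");
        else printf("read %zd bytes without touching the page cache\n", n);

        free(buf);
        close(fd);
        return 0;
    }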
Zero-copy techniques in which data transmission does not pass through the application address space
Using mmap()
In Linux, one way to reduce the number of copies is to call mmap() instead of read(). For example:
tmp_buf = mmap(NULL, len, PROT_READ, MAP_SHARED, file_fd, 0);
write(socket_fd, tmp_buf, len);
First, after the application calls mmap(), the data is copied into an operating system kernel buffer by DMA. The application then shares this buffer with the kernel, so no copy between the kernel and the application's address space is needed. When the application calls write(), the operating system kernel copies the data from the original kernel buffer into the kernel buffer associated with the socket. Finally, the data is copied from the kernel socket buffer to the protocol engine; this is the third copy operation.
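To make these steps concrete, here is a hedged sketch of the mmap()/write() approach (not from the original article); file_fd and sock_fd are assumed to be an open file and a connected socket.

    #include <sys/mman.h>
    #include <sys/stat.h>
    #include <stdio.h>
    #include <unistd.h>

    int mmap_send(int file_fd, int sock_fd)
    {
        struct stat st;
        if (fstat(file_fd, &st) < 0) { perror("fstat"); return -1; }

        /* Map the file; the page cache pages are shared with the kernel,
         * so no copy into a separate user-space read buffer is needed. */
        void *buf = mmap(NULL, st.st_size, PROT_READ, MAP_SHARED, file_fd, 0);
        if (buf == MAP_FAILED) { perror("mmap"); return -1; }

        /* write() still copies from the page cache to the socket buffer in the kernel. */
        ssize_t n = write(sock_fd, buf, st.st_size);
        if (n < 0) perror("write");

        munmap(buf, st.st_size);
        return n < 0 ? -1 : 0;
    }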
Figure 2. Using mmap() instead of read()
By using mmap() instead of read(), the number of data copies performed by the operating system can be halved, which gives better efficiency when large amounts of data are transmitted. This improvement has a cost, however: using mmap() carries a hidden pitfall. If another process truncates the file while it is memory-mapped and write() is executing, the write() system call will be interrupted by the bus error signal SIGBUS, because the process is then performing an invalid memory access. By default this signal kills the process. There are two ways to deal with this problem:
- Install a signal handler for SIGBUS, so that the write() system call returns the number of bytes written before the interruption and errno is set to success. The drawback is that this treats the symptom rather than the cause: the SIGBUS signal only indicates that the process has run into a serious error.
- The second, and better, method is to use a file lease lock. The kernel can be asked to place a read or write lease on the file. When another process attempts to truncate the file being transmitted, the kernel sends the user a real-time signal, RT_SIGNAL_LEASE, indicating that the kernel is breaking the read or write lease held on the file. The write() system call is then interrupted before the process can access an invalid address and be killed by SIGBUS; write() returns the number of bytes written before the interruption, and errno is set to success. The lease lock must be set before the file is memory-mapped.
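A hedged sketch of the lease approach follows (not from the original article). With default settings the lease-break notification arrives as SIGIO (a different real-time signal can be selected with F_SETSIG); error handling and the actual mmap()/write() loop are omitted.

    #define _GNU_SOURCE
    #include <fcntl.h>
    #include <signal.h>
    #include <stdio.h>

    static volatile sig_atomic_t lease_broken;

    static void on_lease_break(int sig) { (void)sig; lease_broken = 1; }

    /* Take a read lease on fd before mapping it, so that a truncation by
     * another process produces a catchable signal instead of a fatal SIGBUS. */
    int protect_with_lease(int fd)
    {
        struct sigaction sa = { .sa_handler = on_lease_break };
        sigaction(SIGIO, &sa, NULL);              /* lease-break notifications land here */

        if (fcntl(fd, F_SETLEASE, F_RDLCK) < 0) { /* request the read lease */
            perror("F_SETLEASE");
            return -1;
        }
        return 0;
        /* ... mmap() the file and write() it to the socket; if lease_broken
         * becomes set, stop using the mapping and unmap it. */
    }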
mmap() is POSIX-compliant, but it does not necessarily achieve ideal transfer performance. A CPU copy is still required during the transfer, and the mapping itself is an expensive virtual memory operation: storage consistency must be maintained by modifying page tables and flushing the TLB (invalidating its contents). Because a mapping usually covers a large range, however, mapping a given amount of data costs much less than copying it with the CPU.
sendfile()
To simplify the user interface while keeping the advantage of the mmap()/write() technique, namely the reduced number of CPU copies, Linux introduced sendfile() in version 2.1.
sendfile() not only reduces data copy operations but also reduces context switches. First, the sendfile() system call uses the DMA engine to copy the data in the file into an operating system kernel buffer; the data is then copied into the kernel buffer associated with the socket. Next, the DMA engine copies the data from the kernel socket buffer to the protocol engine. If another process truncates the file while sendfile() is transmitting it, the sendfile() call simply returns the number of bytes transmitted before the interruption, with errno set to success. If the operating system has placed a lease lock on the file before sendfile() is called, the behaviour and return status of sendfile() are the same as in the mmap()/write() case.
Figure 3. Data transmission using sendfile()
The sendfile() system call neither copies nor maps data into the application address space, so it is only applicable when the application does not need to process the data it transfers. Compared with mmap(), sendfile() greatly reduces memory-management overhead, because the transferred data never crosses the boundary between the user application and the operating system kernel. However, sendfile() also has a number of limitations, listed below:
- sendfile() is limited to network applications that serve files, such as web servers. It is said that sendfile() was implemented in the Linux kernel precisely so that Apache, which already used sendfile() on other platforms, could use it on Linux as well.
- Because network transmission is asynchronous, it is difficult to implement a matching counterpart of sendfile() on the receiving side, so this technique is not used at the receiving end of a transfer.
- On the performance side, sendfile() still requires one CPU copy operation, from the file (page cache) to the socket buffer, which can pollute the page cache with transmitted data.
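For reference, here is a minimal sketch of a file-to-socket transfer with sendfile() (not from the original article); sock_fd is assumed to be an already-connected socket and path an existing file.

    #include <sys/sendfile.h>
    #include <sys/stat.h>
    #include <fcntl.h>
    #include <stdio.h>
    #include <unistd.h>

    int send_file(int sock_fd, const char *path)
    {
        int file_fd = open(path, O_RDONLY);
        if (file_fd < 0) { perror("open"); return -1; }

        struct stat st;
        if (fstat(file_fd, &st) < 0) { perror("fstat"); close(file_fd); return -1; }

        off_t offset = 0;
        while (offset < st.st_size) {
            /* The copy happens entirely inside the kernel: page cache -> socket. */
            ssize_t sent = sendfile(sock_fd, file_fd, &offset, st.st_size - offset);
            if (sent < 0) { perror("sendfile"); close(file_fd); return -1; }
        }
        close(file_fd);
        return 0;
    }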
The sendfile() technique described in the previous section still involves one redundant data copy. With some help from the hardware, this last copy can be avoided as well. Avoiding the copy performed by the operating system kernel requires a network interface that supports gather operations, that is, one that can pick up the data to be transmitted from several memory locations rather than requiring it to sit in contiguous storage. With such hardware, the data read from the file no longer needs to be copied into the socket buffer; only a buffer descriptor is passed to the network protocol stack, which sets up the packet structures in the buffer, and the NIC's gather-capable DMA engine then assembles all the pieces into one network packet, reading the headers and the data from multiple locations in a single operation. The socket buffer in Linux 2.4 satisfies this requirement; this is the well-known Linux zero-copy technique. The approach not only reduces the overhead of multiple context switches but also eliminates the data copies performed by the processor, and the user application's code does not change at all. First, the sendfile() system call uses the DMA engine to copy the file contents into a kernel buffer; then a buffer descriptor carrying the file location and length information is appended to the socket buffer, so no data has to be copied from the operating system kernel buffer into the socket buffer. Finally, the DMA engine copies the data directly from the kernel buffer to the protocol engine, avoiding the last data copy.
Figure 4. sendfile() with the DMA gather-copy function
In this way, the CPU not only avoids copying data during the transfer but, in theory, never touches the transferred data at all, which benefits CPU performance in two ways: first, the CPU cache is not polluted; second, cache consistency does not have to be maintained, that is, the cache does not need to be flushed before or after the DMA transfer. In practice the second point is hard to achieve. The source buffer may be part of the page cache, which means it can be reached by ordinary read operations and accessed in the traditional way. As long as the memory region can be accessed by the CPU, cache consistency must still be maintained by flushing the cache before the DMA transfer. Moreover, this gather-copy scheme works only with support from the hardware and the device driver.
splice()
splice() is a Linux technique in the same family as mmap() and sendfile(). It, too, can be used for data transfer between the user application address space and the operating system address space. splice() suits user applications that can determine the data transfer path in advance and that do not need explicit transfer operations through a user-space buffer. When data simply moves from one place to another and does not need to be processed by the user application along the way, splice() becomes a good choice. splice() can move data in whole blocks within the operating system address space, thereby eliminating most copy operations. It can also transfer data asynchronously: the user application can return from the system call immediately while a kernel thread carries on with the transfer. splice() can be viewed as a stream-based pipe: the pipe connects two file descriptors, and the caller of splice() can have two devices (or protocol stacks) connected to each other inside the operating system kernel.
The splice() system call is very similar to sendfile(): the user application must have two open file descriptors, one for the input device and one for the output device. Unlike sendfile(), splice() allows any two files to be connected to each other, not just a file to a socket. For the special case of sending data from a file descriptor to a socket, sendfile() has always been the interface of choice, whereas splice() has always been a more general mechanism that is not restricted to sendfile()'s function; in other words, sendfile() is only a subset of splice(). In Linux 2.6.23 the standalone implementation of the sendfile() mechanism was removed: the API and its functionality still exist, but they are now implemented on top of the splice() mechanism.
During a transfer, the splice() mechanism alternates read and write operations on the relevant file descriptors and can reuse the read buffer for the write operation. It also uses a simple form of flow control, blocking write requests by means of predefined watermarks. Experiments have shown that transferring data from one disk to another in this way increases throughput by 30% to 70% while halving the CPU load during the transfer.
The splice() system call was introduced in Linux 2.6.17, but the idea is much older: Larry McVoy proposed it in 1988 as a technique for improving the I/O performance of server systems. Although it was mentioned often in the following years, splice() was never implemented in the mainline Linux kernel until Linux 2.6.17 appeared. The splice() system call takes four parameters: two file descriptors, a length, and a set of flags that controls how the data is copied. splice() can be used synchronously or asynchronously; in asynchronous mode, the user application learns that the transfer has finished via the SIGIO signal. The splice() system call has the following interface:
long splice(int fdin, int fdout, size_t len, unsigned int flags);
When splice() is called, the operating system kernel moves up to len bytes of data from the source fdin to fdout. The data moves only through operating system kernel space, so the number of copies required is minimal. Using splice() requires that one of the two file descriptors refer to a pipe device; this is clearly a limitation of the design, one that later Linux versions were expected to address. The flags parameter controls how the copy operation is performed and currently takes the following values (a usage sketch follows the list):
- SPLICE_F_NONBLOCK: the splice operation does not block. However, if the file descriptors themselves have not been set up for non-blocking I/O, the call to splice() may still block.
- SPLICE_F_MORE: hints to the operating system kernel that more data will be sent in a subsequent splice() system call.
- SPLICE_F_MOVE: if the output is a file, this flag lets the operating system kernel try to move pages directly from the input pipe buffer into the output address space, so that the transfer involves no data copy.
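As a rough usage sketch (not from the original article), the following moves a file to a socket through an intermediate pipe. Note that the splice() exported by glibc takes six parameters, adding optional input and output offset pointers to the four described above; partial writes on the second stage are not handled here, for brevity.

    #define _GNU_SOURCE
    #include <fcntl.h>
    #include <stdio.h>
    #include <unistd.h>

    /* Hedged sketch: file -> pipe -> socket with splice(); file_fd is assumed
     * to be an open regular file and sock_fd a connected socket. */
    int splice_file_to_socket(int file_fd, int sock_fd, size_t count)
    {
        int pipefd[2];
        if (pipe(pipefd) < 0) { perror("pipe"); return -1; }

        while (count > 0) {
            /* Stage 1: file -> pipe buffer, entirely inside the kernel. */
            ssize_t n = splice(file_fd, NULL, pipefd[1], NULL, count,
                               SPLICE_F_MOVE | SPLICE_F_MORE);
            if (n <= 0) break;

            /* Stage 2: pipe buffer -> socket, still without touching user space. */
            ssize_t m = splice(pipefd[0], NULL, sock_fd, NULL, (size_t)n,
                               SPLICE_F_MOVE | SPLICE_F_MORE);
            if (m <= 0) break;

            count -= (size_t)n;
        }
        close(pipefd[0]);
        close(pipefd[1]);
        return count == 0 ? 0 : -1;
    }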
The splice() system call relies on the pipe buffer mechanism of Linux, which is why at least one of the two file descriptor arguments must refer to a pipe device. To support the splice mechanism, Linux adds the following two operations to the file_operations structure used by devices and file systems:
ssize_t (*splice_write)(struct inode *pipe, struct file *out,
                        size_t len, unsigned int flags);
ssize_t (*splice_read)(struct inode *in, struct file *pipe,
                       size_t len, unsigned int flags);
These two new operations move len bytes between the pipe and in or out, according to the flags setting. The Linux file systems already provide implementations that can be used with these operations, and a generic_splice_sendpage() function is provided for splicing to a socket.
Zero-copy techniques that optimize data transmission between the application address space and the kernel
The zero-copy techniques described above work by avoiding, as far as possible, data copies between the user application and the operating system kernel buffers. Applications that use them are therefore usually restricted to special situations: either they cannot process the data inside the operating system kernel, or they cannot process it in the user address space. The zero-copy techniques presented in this section keep the traditional approach of passing data between the user application address space and the operating system kernel address space, but optimize the transfer itself. We know that data transfer between system software and hardware can be made efficient with DMA, but no comparable tool exists for transfers between user applications and the operating system. The techniques introduced in this section were designed for exactly this situation.
Using copy-on-write
In some cases, pages in the Linux kernel page cache may be shared by several applications, and the operating system may map pages of a user application's buffer into the kernel address space. If an application calls write() on such shared data, it may corrupt the shared data in the kernel buffer, because write() provides no explicit locking. To protect the data, Linux introduces the copy-on-write technique.
Buffer sharing
Another approach, sharing buffers through a pre-mapping mechanism, can also transfer data quickly between the application address space and the operating system kernel. An architecture built on the idea of buffer sharing was first implemented on Solaris, using the concept of "fbufs". This approach requires changing the API: data transfer between the application address space and the operating system kernel address space must strictly follow the fbufs architecture, and communication inside the operating system kernel follows it strictly as well. Each application has a buffer pool that is mapped into both the user address space and the kernel address space and is created only when needed. Because a buffer is created with a single virtual memory operation, fbufs can eliminate most of the performance problems caused by maintaining storage consistency. In Linux this technique is still at the experimental stage.
An I/O subsystem or an application can use the fbufs manager to allocate fbufs. Once allocated, these fbufs can be passed from the program to the I/O subsystem or from the I/O subsystem to the program. When they are no longer needed, they are released back into the fbufs buffer pool.
In its implementation, fbufs has the following characteristics, as shown in Figure 6:
- Each fbuf is allocated from an fbufs buffer pool and has an owner, either an application or the operating system kernel. An fbuf can be passed back and forth between applications and the operating system; after use it must be released back into the particular fbufs buffer pool it came from, so it carries information about that pool while it is being passed around.
- Each fbufs buffer pool is associated with one application, and an application is associated with at most one fbufs buffer pool. An application is only entitled to access its own buffer pool.
- fbufs do not require virtual address remapping, because every application can reuse the same set of buffers. This allows the virtual memory translation information to be cached, removing the overhead of the virtual memory subsystem.
- I/O subsystems (device drivers, file systems, and so on) can allocate fbufs and place data into them directly, avoiding copy operations between buffers.
Figure 6. fbufs Architecture
As mentioned above, this approach requires a change to the API: to use the fbufs architecture, both the application and the Linux kernel driver have to use the new interface. To send data, an application obtains an fbuf from the buffer pool, fills it with data, and sends the data through a file descriptor. A received fbuf can be kept by the application for a while; later, the application can reuse it to send other data or return it to the buffer pool. In some cases, however, the data within a packet has to be reassembled, and then the application that received the data through an fbuf must copy it into another buffer. Furthermore, an application may not modify data that the kernel is currently processing; to enforce this, the fbufs architecture introduces the concept of a mandatory lock. For applications this means that once an fbuf has been handed to the operating system kernel, the application no longer touches it.
This series of articles introduces zero-copy techniques in Linux. This article is the second part; it takes the zero-copy techniques surveyed briefly in the first part and describes them in more detail, concentrating on their respective advantages, disadvantages, and applicable scenarios. For network data transmission, the application of zero-copy techniques is hindered by many architectural factors, including the virtual memory architecture and the network protocol architecture, so zero copy can only be used in special cases such as file serving or high-bandwidth communication over a special protocol. Applying zero copy to disk operations is much more feasible, probably because disk operations are synchronous and data is transferred at page granularity.
Many zero-copy techniques have been proposed and implemented for the Linux platform, but not all of them are widely used in real-world operating systems. The fbufs architecture, for example, looks attractive in many respects, but using it requires changes to the API and to drivers, and it has other implementation difficulties, which is why it has never left the experimental stage. Dynamic address remapping requires only minor changes to the operating system and none to user software, but the current virtual memory architecture does not cope well with frequent virtual address remapping: to preserve storage consistency, the TLB and the first-level cache must be flushed after every remapping. In practice, zero-copy techniques implemented through address remapping are applicable only in a narrow range of cases, because the overhead of the virtual memory operations is often greater than that of the CPU copy itself. In addition, completely eliminating CPU access to the transferred data usually requires extra hardware support, which is neither widespread nor cheap.
The purpose of this series is to help readers weigh the performance issues involved in using zero-copy techniques on Linux. The implementation details of the individual techniques are not described in depth. Zero-copy technology also keeps evolving, and this series does not cover every zero-copy technique that has appeared on Linux.