Device I/O, Part One: mmap, direct I/O, and asynchronous I/O

In Linux it is now common to see drivers written in user space, for example the X server and various vendors' proprietary drivers. This means that user space needs access to the hardware, which is usually obtained by using mmap to map device memory into the user process's address space; the user program then gains access to the hardware by reading and writing that memory.
The kernel typically buffers I/O operations for better performance, but also provides direct I/O and asynchronous I/O capabilities.
When exchanging data with hardware, some devices support DMA, which reduces the load on the processor, while some device memory cannot be read directly and requires special instructions.
1. Virtual memory areas (VMA)

When using mmap, a block of kernel address space has to be mapped into the user's address space, and this involves a very important data structure, the VMA. A VMA represents a homogeneous region in a process's virtual address space: a contiguous range of virtual addresses that have the same permissions and are backed by the same object (a file, or swap space).
You can view the memory areas of a process by looking at /proc/${pid}/maps, whose lines have the following format:
start-end perm offset major:minor inode image
Below is a maps fragment of an init process:
00000000-00000000 r-xp 00000000 01:00 149505 /sbin/init.sysvinit
00000000-00000000 rw-p 00000000 01:00 149505 /sbin/init.sysvinit
00000000-00000000 ---p 00000000 00:00 0
00000000-00000000 rw-p 00000000 00:00 0 [heap]
00000000-00000000 rw-p 00000000 00:00 0
00000000-00000000 rw-p 00000000 00:00 0 [stack]
The meaning of each field is as follows:

start-end: the beginning and ending virtual addresses of this memory area.
perm: a bit mask of read, write, and execute permissions for the memory area; the last character is p for a private mapping or s for a shared one.
offset: the offset of the memory area within the mapped file (when the area maps a file).
major:minor: the major and minor numbers of the device holding the mapped file. For device mappings, these refer to the disk partition holding the device special file that was opened, not to the device itself.
inode: the inode number of the mapped file.
image: the name of the mapped file.

The data structure corresponding to a VMA is vm_area_struct, which contains several function pointers, described below.
1.1 Open

Its prototype is: void (*open) (struct vm_area_struct *vma)

The kernel calls it whenever a new reference to the VMA is created (for instance, when a process forks), giving the code that implements the VMA a chance to initialize itself. It is not called, however, when the VMA is first created by mmap; in that case the mmap method provided by the kernel component is invoked instead.

1.2 Close

Its prototype is: void (*close)(struct vm_area_struct *vma). It is called when the memory area is destroyed. A VMA has no reference count, so each process that uses the area opens and closes it exactly once.
1.3 Nopage

Its prototype is: struct page * (*nopage)(struct vm_area_struct *vma, unsigned long address, int *type). When a process tries to access a page that belongs to a valid VMA but is not currently in memory, the kernel calls the VMA's nopage method, which returns a pointer to the struct page for the physical page. If the VMA does not define its own nopage method, the kernel maps in an empty (zero-filled) page instead.
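For orientation, here is a minimal sketch of how a driver might declare these operations (the mydev_* names are hypothetical, and the interface shown is the older one described here; recent kernels have replaced nopage with a fault method):

#include <linux/mm.h>

/* Called whenever a new reference to the VMA is created. */
static void mydev_vma_open(struct vm_area_struct *vma)
{
    /* e.g. increment a driver-private usage counter */
}

/* Called when the memory area is destroyed. */
static void mydev_vma_close(struct vm_area_struct *vma)
{
    /* e.g. decrement the usage counter */
}

static struct vm_operations_struct mydev_vm_ops = {
    .open  = mydev_vma_open,
    .close = mydev_vma_close,
    /* .nopage = mydev_vma_nopage,  only needed for on-demand mapping */
};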
2. mmap

mmap allows device memory to be mapped into user space, giving user programs access to the hardware; the mmap operation itself has to be implemented by the driver in the kernel. Once the mapping is in place, the user program's reads and writes within the given memory range go to the device memory, that is, they access the device.
Not all hardware supports mmap; serial devices, for example, do not. mmap also has a limitation: it maps at PAGE_SIZE granularity. The kernel can only manage virtual addresses at the page-table level, so mapping device memory into the virtual address space of a user process must be done in whole pages, and the physical address being mapped must also start at an integer multiple of PAGE_SIZE, that is, the start of the mapped physical range must be page-aligned.
Most PCI peripherals map their control registers to memory addresses; for such devices, mapping that memory into user space is all it takes to give a user program control of the hardware, which is attractive compared with the conventional ioctl approach.
mmap is part of the file_operations structure. Because in *nix everything is a file, it is natural for a kernel component to implement its own mmap through this structure.
User-space programs call the system call:
mmap(caddr_t addr, size_t len, int prot, int flags, int fd, off_t offset)
to invoke the mmap method on fd. When the mmap system call is used, the kernel does some preparatory work before invoking the mmap method on fd. That method is prototyped in the kernel as follows:
int (*mmap)(struct file *filp, struct vm_area_struct *vma);
filp is the file being mapped, and vma contains information about the virtual address range used for the device. What fd's mmap method needs to do is create the appropriate page tables for the virtual address range described by vma and initialize the function pointers in vma so that the appropriate operations can be invoked later.
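For reference, a minimal user-space sketch of the other side of this interaction, assuming a hypothetical device node /dev/mydev whose driver implements mmap and exposes one page of registers:

#include <fcntl.h>
#include <stdio.h>
#include <sys/mman.h>
#include <unistd.h>

int main(void)
{
    int fd = open("/dev/mydev", O_RDWR);
    if (fd < 0)
        return 1;

    size_t len = getpagesize();        /* mappings are page-granular */
    volatile unsigned int *regs =
        mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    if (regs == MAP_FAILED)
        return 1;

    unsigned int status = regs[0];     /* read a device register */
    regs[1] = 0x1;                     /* write a device register */
    printf("status = %#x\n", status);

    munmap((void *)regs, len);
    close(fd);
    return 0;
}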
2.1 Creating page tables

Creating page tables is the most important work that mmap must accomplish. There are two ways to do it: call remap_pfn_range once to build everything up front, or build the table one page at a time through the nopage method.

2.1.1 Using remap_pfn_range

remap_pfn_range and ioremap_page_range are responsible for creating new page tables for a physical address range. Their prototypes are as follows:
int remap_pfn_range(struct vm_area_struct *vma, unsigned long addr,
                    unsigned long pfn, unsigned long size, pgprot_t prot);
It maps the physical address range starting at pfn and covering size bytes (size is rounded up to a multiple of PAGE_SIZE) into vma starting at addr. Because size can vary, the function can be used to map the entire area or only a part of it.
vma: the user VMA into which the physical range is mapped.
addr: the user-space virtual address at which the remapping begins (usually vma->vm_start, though it may point elsewhere within the VMA when only part of the area is remapped).
pfn: the page frame number of the kernel physical address being mapped.
size: the size of the region.
prot: the protection applied to the pages of this new mapping.

int ioremap_page_range(unsigned long addr, unsigned long end,
                       phys_addr_t phys_addr, pgprot_t prot);
It maps the I/O memory starting at phys_addr (of size end - addr, rounded up to whole pages) to the virtual addresses starting at addr.
addr: the starting virtual address.
end: the ending virtual address.
phys_addr: the starting physical address.
prot: the protection of the region.

The difference between the two is that remap_pfn_range is used when the address to be mapped to user space is real RAM, while ioremap_page_range is used when the address to be mapped is I/O memory. Note that for I/O memory the kernel usually does not cache it.
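A minimal sketch of a driver mmap method that builds the whole mapping up front with remap_pfn_range (MYDEV_PHYS_BASE and MYDEV_WINDOW_SIZE are hypothetical device-specific constants):

#include <linux/fs.h>
#include <linux/mm.h>

#define MYDEV_PHYS_BASE   0xfe000000UL     /* hypothetical device address */
#define MYDEV_WINDOW_SIZE (64 * 1024)      /* hypothetical window size */

static int mydev_mmap(struct file *filp, struct vm_area_struct *vma)
{
    unsigned long size = vma->vm_end - vma->vm_start;

    /* Refuse mappings larger than the device window. */
    if (size > MYDEV_WINDOW_SIZE)
        return -EINVAL;

    /* Device registers are I/O memory, so keep the mapping uncached. */
    vma->vm_page_prot = pgprot_noncached(vma->vm_page_prot);

    if (remap_pfn_range(vma, vma->vm_start,
                        MYDEV_PHYS_BASE >> PAGE_SHIFT,
                        size, vma->vm_page_prot))
        return -EAGAIN;

    return 0;
}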
2.1.2 Using nopage to map memory

Building the entire page table in one go is a good choice in most cases, but in some situations nopage is more appropriate, because it is more flexible. Two typical scenarios for using nopage are as follows:
The application calls the mremap system call to change the mapping. When this call shrinks the VMA, the kernel does not notify the driver, it simply flushes the pages that are no longer needed; but when the call grows the VMA, the kernel invokes the nopage method to obtain the new pages, so in this sense a driver that wants to support mremap must implement nopage. The nopage method is also invoked when the user accesses a page that lies within the VMA but is not currently in memory. The nopage method returns a pointer to the page it obtained and increments its reference count to indicate that someone is now using the page.
If nopage's type parameter is not NULL, it can be used to return the type of the fault in addition to the return value, typically VM_FAULT_MINOR. Because nopage must return a pointer to the struct page of the memory it obtains, and PCI memory has no associated struct page, the nopage method cannot be used for PCI address space.
On success nopage returns a pointer to a struct page; otherwise it returns an error. If the nopage method is NULL, the kernel code responsible for handling the page fault maps the zero page to the faulting virtual address. The zero page is a special page: reading it returns zeros, and writing it modifies a private copy belonging to the process.
2.2 Adding VMA operations

The other important task of mmap is to update the VMA's function pointers, that is, the nopage, open, close and other methods.
2.3 Remapping RAM

remap_pfn_range can only be used with reserved pages and with physical addresses above the top of physical memory, in other words with memory that is not managed by the memory-management system. Conventional memory, including memory obtained with __get_free_page, cannot be mapped with it. So if you want to use it to map a piece of RAM, you need to reserve that memory when the system boots, because once remap_pfn_range has mapped it the process can read and write it directly, while memory managed by the kernel's memory-management system may be allocated for other purposes, so there is a potential conflict.
Although remap_pfn_range cannot be used to map conventional RAM to user space, there is a workaround: use the VMA's nopage method to map RAM into the user address space, one page at a time. A kernel component that wants to map RAM to user space implements the nopage interface and, in that function, hands the pages back one page per call.
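A minimal sketch of such a nopage method, assuming an LDD3-era kernel and a hypothetical driver buffer mydev_buf obtained with vmalloc at initialization (the virt_to_page/vmalloc_to_page distinction is explained in the note that follows):

#include <linux/mm.h>
#include <linux/vmalloc.h>

#define MYDEV_BUF_SIZE (1024 * 1024)   /* hypothetical buffer size */

static void *mydev_buf;                /* allocated with vmalloc() at init */

static struct page *mydev_vma_nopage(struct vm_area_struct *vma,
                                     unsigned long address, int *type)
{
    unsigned long offset;
    struct page *page;

    /* Translate the faulting user address into an offset in the buffer. */
    offset = (address - vma->vm_start) + (vma->vm_pgoff << PAGE_SHIFT);
    if (offset >= MYDEV_BUF_SIZE)
        return NOPAGE_SIGBUS;          /* address outside the buffer */

    /* vmalloc'ed memory must be translated with vmalloc_to_page(),
       not virt_to_page(). */
    page = vmalloc_to_page(mydev_buf + offset);

    get_page(page);                    /* take a reference on the page */
    if (type)
        *type = VM_FAULT_MINOR;
    return page;
}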

Note that when nopage returns a page, it must be a real page, so you need to find the actual struct page pointer. For regular kernel memory, virt_to_page yields the page; but for an address returned by vmalloc, you must obtain the page with vmalloc_to_page.

3. Direct I/O

Most I/O operations are buffered by the kernel to improve I/O efficiency, but in some scenarios buffering does not deliver good performance, so the kernel also provides an API for cases where buffering is unwanted. If a peripheral driver does not want to use the kernel's buffering mechanism, it can use the following API:

long get_user_pages(struct task_struct *tsk, struct mm_struct *mm,
                    unsigned long start, unsigned long nr_pages, int write,
                    int force, struct page **pages, struct vm_area_struct **vmas);
This function maps pages of the user process into the kernel's address space, after which kernel code can access those pages directly. The meaning of its parameters:

tsk: a pointer to the task performing the I/O; it tells the kernel who should be charged for any page faults, and may be NULL if no one is to be charged.
mm: a pointer to the memory-management structure describing the address space to be mapped.
start: the user-space start address.
nr_pages: the number of pages.
write: whether the caller intends to write to these pages.
force: if set, write access is forced even through a read-only user mapping; this is usually not the effect you want.
pages: an array of pointers to the struct page entries obtained; it should hold at least nr_pages entries, and may be NULL if the caller does not want this information.
vmas: an array of pointers to the VMA corresponding to each page; may be NULL if the caller does not want this information.
Because the function needs to set up page tables for the mapping, it is time-consuming. Direct I/O also bypasses the kernel's buffering, and for that reason it is usually combined with asynchronous I/O: otherwise the user of direct I/O would have to wait for the I/O to complete before knowing when the operation has finished and when the buffers it handed to the kernel can be reused, which is usually not what the user wants (I/O itself is slow, and waiting on it wastes valuable CPU time). In practice, for block device drivers and network drivers the framework code already uses direct I/O at the appropriate times, so driver writers do not need to arrange it themselves; and for character drivers direct I/O is clearly unattractive, since a character stream is not organized in pages.

It must be emphasized that the function has to be called with mmap_sem held. After the direct I/O operation completes, the pages must be released; and if they were modified, SetPageDirty must be called to mark them dirty, otherwise the kernel assumes their contents have not changed and will not synchronize them back to the corresponding device or file, which is usually wrong. Releasing a page is done with page_cache_release.
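A minimal sketch of this pattern inside a driver, assuming an LDD3-era kernel (mmap_sem, page_cache_release) and a hypothetical helper mydev_do_io that performs the actual transfer on the pinned pages:

#include <linux/mm.h>
#include <linux/pagemap.h>
#include <linux/sched.h>

#define MYDEV_NR_PAGES 16              /* hypothetical transfer size, in pages */

/* Hypothetical helper performing the actual transfer, e.g. by DMA. */
static void mydev_do_io(struct page **pages, int nr, int write);

static int mydev_direct_io(unsigned long user_addr, int write)
{
    struct page *pages[MYDEV_NR_PAGES];
    int i, got;

    /* get_user_pages() must be called with mmap_sem held. */
    down_read(&current->mm->mmap_sem);
    got = get_user_pages(current, current->mm, user_addr,
                         MYDEV_NR_PAGES, write, 0, pages, NULL);
    up_read(&current->mm->mmap_sem);
    if (got <= 0)
        return got ? got : -EFAULT;

    mydev_do_io(pages, got, write);

    for (i = 0; i < got; i++) {
        if (write)
            SetPageDirty(pages[i]);    /* the page contents were changed */
        page_cache_release(pages[i]);  /* drop our reference */
    }
    return 0;
}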
4. Asynchronous I/O (AIO)

In addition to direct I/O, the kernel provides another I/O feature: asynchronous I/O. Asynchronous I/O allows a user program to initiate one or more I/O operations without waiting for them to complete, and the kernel provides a set of APIs through which user programs can issue AIO.

4.1 User interface

The API the kernel provides to user space is as follows:

io_setup: creates an asynchronous I/O context for the current process; one of its parameters specifies how many asynchronous I/O requests can be outstanding in that context.
io_submit: submits one or more asynchronous I/O requests.
io_getevents: obtains the completion status of submitted asynchronous I/O requests.
io_cancel: cancels a submitted asynchronous I/O request.
io_destroy: tears down the asynchronous I/O contexts created for this process.

These interfaces are declared in aio.h, implemented in aio.c, and exposed as system calls. Their implication is that an application that wants asynchronous I/O first creates an asynchronous I/O context and then submits requests on that context. A process can create multiple asynchronous I/O contexts, which are kept in task_struct->mm->ioctx_list. The process then submits asynchronous I/O requests on a context it created, obtains their status with io_getevents, may optionally cancel a submitted request with io_cancel, and, when it is done, clears the context with io_destroy.

4.2 Kernel implementation

4.2.1 Asynchronous I/O context

The kernel uses kioctx to represent an asynchronous I/O context; the information supplied when the user creates the context is stored there. After a context has been created successfully, the kernel returns an id to the user process, which the process then uses to refer to that context. When a context is created, the kernel also creates an AIO ring. The AIO ring corresponds to a memory buffer in the user process's address space that can be accessed both by the user process and by the kernel; the kernel reaches it by calling get_user_pages to obtain the user pages. The AIO ring is a circular buffer through which the kernel reports the completion of asynchronous I/O; the user process can also check completion status directly in it, avoiding the overhead of a system call.

4.2.2 Asynchronous I/O requests

The kernel uses kiocb to represent an asynchronous I/O request, while the user process describes one with the data structure iocb; the kernel converts between the two. In io_submit, the user can submit several asynchronous I/O requests at once. The kernel processes each request in the mode that was requested for it (which can be set per request), eventually invoking the asynchronous I/O function in the file's file_operations. If that function returns a value other than -EIOCBQUEUED, the AIO framework calls aio_complete and returns immediately; otherwise it is a genuinely asynchronous I/O, and the file_operations code that returned -EIOCBQUEUED is responsible for processing the request and for calling aio_complete afterwards to finish the asynchronous I/O.
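For orientation, a minimal user-space sketch of the call sequence from section 4.1, invoking the raw system calls directly (no libaio wrapper); testfile is a hypothetical input file and error handling is kept to a minimum:

#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>
#include <sys/syscall.h>
#include <linux/aio_abi.h>

int main(void)
{
    aio_context_t ctx = 0;
    struct iocb cb, *cbs[1] = { &cb };
    struct io_event events[1];
    char buf[4096];
    int fd = open("testfile", O_RDONLY);

    if (fd < 0)
        return 1;

    /* 1. Create an AIO context that can hold up to 8 requests. */
    if (syscall(__NR_io_setup, 8, &ctx) < 0)
        return 1;

    /* 2. Describe and submit one asynchronous read. */
    memset(&cb, 0, sizeof(cb));
    cb.aio_fildes = fd;
    cb.aio_lio_opcode = IOCB_CMD_PREAD;
    cb.aio_buf = (unsigned long)buf;
    cb.aio_nbytes = sizeof(buf);
    cb.aio_offset = 0;
    if (syscall(__NR_io_submit, ctx, 1, cbs) != 1)
        return 1;

    /* 3. Wait for the completion event. */
    if (syscall(__NR_io_getevents, ctx, 1, 1, events, NULL) == 1)
        printf("read returned %lld bytes\n", (long long)events[0].res);

    /* 4. Tear down the context. */
    syscall(__NR_io_destroy, ctx);
    close(fd);
    return 0;
}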
The kernel implementation shows that a component wishing to support asynchronous I/O needs to implement the asynchronous I/O interface in file_operations (it can carry out the asynchronous work itself through mechanisms such as a workqueue) and call aio_complete once the asynchronous I/O has finished (a rough sketch of this pattern appears at the end of this section).

4.2.3 Collecting asynchronous I/O status

When a user process collects asynchronous I/O status through the system call, the kernel handles the request in read_events: it waits on the wait queue of the corresponding context until it is woken by aio_complete, interrupted, or timed out.
4.2.4 Cancelling an asynchronous I/O request

If cancellation of asynchronous I/O requests is to be supported, the implementation of the I/O operation needs to call kiocb_set_cancel_fn to register its cancel function, so that when the user asks to cancel an I/O, the AIO framework can call that function to cancel the specified request.

4.2.5 Clearing the asynchronous I/O context

When an asynchronous I/O context is cleared, the AIO framework actively wakes up, through kill_ioctx, all processes waiting on that context, and then releases the related data structures.
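To tie the kernel-side description together, here is a rough sketch of a driver's asynchronous read method that defers the work to a workqueue and calls aio_complete when done. It assumes an LDD3-era kernel (the aio_read signature and the workqueue interface have both changed in later kernels), and the mydev_* names are hypothetical:

#include <linux/aio.h>
#include <linux/fs.h>
#include <linux/slab.h>
#include <linux/workqueue.h>

/* Per-request bookkeeping for the deferred work. */
struct mydev_aio_req {
    struct work_struct work;
    struct kiocb *iocb;
    char __user *buf;
    size_t count;
};

static void mydev_aio_worker(struct work_struct *w)
{
    struct mydev_aio_req *req = container_of(w, struct mydev_aio_req, work);
    long result;

    /* ... perform the actual transfer into req->buf ... */
    result = req->count;               /* pretend everything was transferred */

    /* Tell the AIO framework that this request is finished. */
    aio_complete(req->iocb, result, 0);
    kfree(req);
}

static ssize_t mydev_aio_read(struct kiocb *iocb, char __user *buf,
                              size_t count, loff_t pos)
{
    struct mydev_aio_req *req;

    /* A real driver should also handle synchronous callers,
       detected with is_sync_kiocb(iocb). */
    req = kmalloc(sizeof(*req), GFP_KERNEL);
    if (!req)
        return -ENOMEM;
    req->iocb = iocb;
    req->buf = buf;
    req->count = count;
    INIT_WORK(&req->work, mydev_aio_worker);
    schedule_work(&req->work);

    /* Queued: the worker will call aio_complete() later. */
    return -EIOCBQUEUED;
}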
