Memory mapping is one of the most interesting features of modern Unix systems. As far as drivers are concerned, memory mapping can be implemented to provide user programs with direct access to device memory.
A definitive example of mmap usage can be seen by looking at a subset of the virtual memory areas for the X Window System server:
cat /proc/731/maps
000a0000-000c0000 rwxs 000a0000 03:01 282652      /dev/mem
000f0000-00100000 r-xs 000f0000 03:01 282652      /dev/mem
00400000-005c0000 r-xp 00000000 03:01 1366927     /usr/X11R6/bin/Xorg
006bf000-006f7000 rw-p 001bf000 03:01 1366927     /usr/X11R6/bin/Xorg
2a95828000-2a958a8000 rw-s fcc00000 03:01 282652  /dev/mem
2a958a8000-2a9d8a8000 rw-s e8000000 03:01 282652  /dev/mem
...
The full list of the X server's VMAs is lengthy, but most of the entries are not of interest here. We do see, however, four separate mappings of /dev/mem, which give some insight into how the X server works with the video card. The first mapping is at a0000, which is the standard location for video RAM in the 640-KB ISA hole. Further down, we see a large mapping at e8000000, an address which is above the highest RAM address on the system. This is a direct mapping of the video memory on the adapter.
These regions can also be seen in /proc/iomem:
000a0000-000bffff : Video RAM area
000c0000-000ccfff : Video ROM
000d1000-000d1fff : Adapter ROM
000f0000-000fffff : System ROM
d7f00000-f7efffff : PCI Bus #01
  e8000000-efffffff : 0000:01:00.0
fc700000-fccfffff : PCI Bus #01
  fcc00000-fcc0ffff : 0000:01:00.0
Mapping a device means associating a range of user-space addresses to device memory. Whenever the program reads or writes in the assigned address range, it is actually accessing the device. In the X server example, using mmap allows quick and easy access to the video card's memory. For a performance-critical application like this, direct access makes a large difference.
As you might suspect, not every device lends itself to the mmap abstraction; it makes no sense, for instance, for serial ports and other stream-oriented devices. Another limitation of mmap is that mapping granularity is PAGE_SIZE. The kernel can manage virtual addresses only at the level of page tables; therefore, the mapped area must be a multiple of PAGE_SIZE and must live in physical memory starting at an address that is a multiple of PAGE_SIZE. The kernel forces size granularity by making a region slightly bigger if its size isn't a multiple of the page size.
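The rounding just described is plain bitwise arithmetic; the following user-space sketch (the helper name is ours) shows how a requested length is rounded up to a page multiple, assuming a power-of-two page size:

```c
#include <assert.h>

/* Round a requested mapping length up to a multiple of the page size,
 * mirroring the "slightly bigger area" the kernel creates. The mask
 * trick works because page sizes are powers of two. */
static unsigned long round_to_pages(unsigned long len, unsigned long page_size)
{
    return (len + page_size - 1) & ~(page_size - 1);
}
```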
These limits are not a big constraint for drivers, because the program accessing the device is device dependent anyway. Since the program must know about how the device works, the programmer is not unduly bothered by the need to see to details like page alignment. A bigger constraint exists when ISA devices are used on some non-x86 platforms, because their hardware view of ISA may not be contiguous. For example, some Alpha computers see ISA memory as a scattered set of 8-bit, 16-bit, or 32-bit items, with no direct mapping. In such cases, you can't use mmap at all. The inability to perform direct mapping of ISA addresses to Alpha addresses is due to the incompatible data transfer specifications of the two systems: whereas early Alpha processors could issue only 32-bit and 64-bit memory accesses, ISA can do only 8-bit and 16-bit transfers, and there's no way to transparently map one protocol onto the other.
There are sound advantages to using mmap when it's feasible to do so. For instance, we have already looked at the X server, which transfers a lot of data to and from video memory; mapping the graphic display to user space dramatically improves the throughput, as opposed to an lseek/write implementation. Another typical example is a program controlling a PCI device. Most PCI peripherals map their control registers to a memory address, and a high-performance application might prefer to have direct access to the registers instead of repeatedly having to call ioctl to get its work done.
The mmap method is part of the file_operations structure and is invoked when the mmap system call is issued. With mmap, the kernel performs a good deal of work before the actual method is invoked, and, therefore, the prototype of the method is quite different from that of the system call. This is unlike calls such as ioctl and poll, where the kernel does not do much before calling the method.
The system call is declared as follows (as described in the mmap(2) manual page):
mmap (caddr_t addr, size_t len, int prot, int flags, int fd, off_t offset)
On the other hand, the file operation declaration is as follows:
int (*mmap) (struct file *filp, struct vm_area_struct *vma);
The filp argument in the method is the same as that introduced in Chapter 3, while vma contains the information about the virtual address range that is used to access the device. Therefore, much of the work has been done by the kernel; to implement mmap, the driver only has to build suitable page tables for the address range and, if necessary, replace vma->vm_ops with a new set of operations.
There are two ways of building the page tables: doing it all at once with a function called remap_pfn_range or doing it a page at a time via the nopage VMA method. Each method has its advantages and limitations. We start with the "all at once" approach, which is simpler. From there, we add the complications needed for a real-world implementation.

15.2.1. Using remap_pfn_range
The job of building new page tables to map a range of physical addresses is handled by remap_pfn_range and io_remap_page_range, which have the following prototypes:

int remap_pfn_range (struct vm_area_struct *vma, unsigned long virt_addr, unsigned long pfn, unsigned long size, pgprot_t prot);
int io_remap_page_range (struct vm_area_struct *vma, unsigned long virt_addr, unsigned long phys_addr, unsigned long size, pgprot_t prot);
The value returned by the function is the usual 0 or a negative error code. Let's look at the exact meaning of the function's arguments:

vma
The virtual memory area into which the page range is being mapped.

virt_addr
The user virtual address where remapping should begin. The function builds page tables for the virtual address range between virt_addr and virt_addr+size.

pfn
The page frame number corresponding to the physical address to which the virtual address should be mapped. The page frame number is simply the physical address right-shifted by PAGE_SHIFT bits. For most uses, the vm_pgoff member of the VMA structure contains exactly the value you need. The function affects physical addresses from (pfn<<PAGE_SHIFT) to (pfn<<PAGE_SHIFT)+size.

size
The dimension, in bytes, of the area being remapped.

prot
The "protection" requested for the new VMA. The driver can (and should) use the value found in vma->vm_page_prot.
The arguments to remap_pfn_range are fairly straightforward, and most of them are already provided to you in the VMA when your mmap method is called. You may be wondering why there are two functions, however. The first (remap_pfn_range) is intended for situations where pfn refers to actual system RAM, while io_remap_page_range should be used when phys_addr points to I/O memory. In practice, the two functions are identical on every architecture except the SPARC, and you see remap_pfn_range used in most situations. In the interest of writing portable drivers, however, you should use the variant of remap_pfn_range that is suited to your particular situation.
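The relationship between a physical address and the pfn argument is just a shift; this user-space sketch makes the conversion explicit (DEMO_PAGE_SHIFT of 12, i.e. 4-KB pages, is an assumption for the example):

```c
#include <assert.h>

#define DEMO_PAGE_SHIFT 12 /* assume 4-KB pages for the example */

/* Convert between a physical address and the page frame number
 * that remap_pfn_range expects. */
static unsigned long phys_to_pfn(unsigned long phys)
{
    return phys >> DEMO_PAGE_SHIFT;
}

static unsigned long pfn_to_phys(unsigned long pfn)
{
    return pfn << DEMO_PAGE_SHIFT;
}
```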
One other complication has to do with caching: usually, references to device memory should not be cached by the processor. Often the system BIOS sets things up properly, but it is also possible to disable caching of specific VMAs via the protection field. Unfortunately, disabling caching at this level is highly processor dependent. The curious reader may wish to look at the pgprot_noncached function from drivers/char/mem.c to see what's involved. We won't discuss the topic further here.

15.2.2. A Simple Implementation
If your driver needs to do a simple, linear mapping of device memory into a user address space, remap_pfn_range is almost all you really need to do the job. The following code is derived from drivers/char/mem.c and shows how this task is performed in a typical module called simple (Simple Implementation Mapping Pages with Little Enthusiasm).
static int simple_remap_mmap(struct file *filp, struct vm_area_struct *vma)
{
    if (remap_pfn_range(vma, vma->vm_start, vma->vm_pgoff,
                        vma->vm_end - vma->vm_start,
                        vma->vm_page_prot))
        return -EAGAIN;

    vma->vm_ops = &simple_remap_vm_ops;
    simple_vma_open(vma);
    return 0;
}
As you can see, remapping memory just requires a call to remap_pfn_range to create the necessary page tables.

15.2.3. Adding VMA Operations
As we have seen, the vm_area_struct structure contains a set of operations that may be applied to the VMA. Now we look at providing those operations in a simple way. In particular, we provide open and close operations for our VMA. These operations are called whenever a process opens or closes the VMA; in particular, the open method is invoked anytime a process forks and creates a new reference to the VMA. The open and close VMA methods are called in addition to the processing performed by the kernel, so they need not reimplement any of the work done there. They exist as a way for drivers to do any additional processing that they may require.
As it turns out, a simple driver such as simple need not do any extra processing in particular. We have created open and close methods, which print a message to the system log informing the world that they have been called. Not particularly useful, but it does allow us to show how these methods can be provided, and to see when they are invoked.
To this end, we override the default vma->vm_ops with operations that call printk:
void simple_vma_open(struct vm_area_struct *vma)
{
    printk(KERN_NOTICE "Simple VMA open, virt %lx, phys %lx\n",
            vma->vm_start, vma->vm_pgoff << PAGE_SHIFT);
}

void simple_vma_close(struct vm_area_struct *vma)
{
    printk(KERN_NOTICE "Simple VMA close.\n");
}

static struct vm_operations_struct simple_remap_vm_ops = {
    .open =  simple_vma_open,
    .close = simple_vma_close,
};
To make these operations active for a specific mapping, it is necessary to store a pointer to simple_remap_vm_ops in the vm_ops member of the relevant VMA. This is usually done in the mmap method. If you turn back to the simple_remap_mmap example, you see these lines of code:
vma->vm_ops = &simple_remap_vm_ops;
simple_vma_open(vma);
Note the explicit call to simple_vma_open. Since the open method is not invoked on the initial mmap, we must call it explicitly if we want it to run.

15.2.4. Mapping Memory with nopage
Although remap_pfn_range works well for many, if not most, driver mmap implementations, sometimes it is necessary to be a little more flexible. In such situations, an implementation using the nopage VMA method may be called for.
One situation in which the nopage approach is useful can be brought about by the mremap system call, which is used by applications to change the bounding addresses of a mapped region. As it happens, the kernel does not notify drivers directly when a mapped VMA is changed by mremap. If the VMA is reduced in size, the kernel can quietly flush out the unwanted pages without telling the driver. If, instead, the VMA is expanded, the driver eventually finds out by way of calls to nopage when mappings must be set up for the new pages, so there is no need to perform a separate notification. The nopage method, therefore, must be implemented if you want to support the mremap system call. Here, we show a simple implementation of nopage for the simple device.
The nopage method, remember, has the following prototype:
struct page * (*nopage) (struct vm_area_struct *vma, unsigned long address, int *type);
When a user process attempts to access a page in a VMA that is not present in memory, the associated nopage function is called. The address parameter contains the virtual address that caused the fault, rounded down to the beginning of the page. The nopage function must locate and return the struct page pointer referring to the page the user wanted. This function must also take care of incrementing the usage count for the page it returns by calling the get_page macro:
get_page(struct page *pageptr);
This step is necessary to keep the reference counts correct on the mapped pages. The kernel maintains this count for every page; when the count goes to 0, the kernel knows that the page may be placed on the free list. When a VMA is unmapped, the kernel decrements the usage count for every page in the area. If your driver does not increment the count when adding a page to the area, the usage count becomes 0 prematurely, and the integrity of the system is compromised.
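The balance between get_page and the kernel's decrement at unmap time can be modeled in a few lines of user-space C. This is a toy illustration only; the names echo the kernel API, but the struct and functions here are ours:

```c
#include <assert.h>

/* A toy model of the per-page reference count: get_page() increments
 * it, and the page may return to the free list only when a decrement
 * brings the count to zero. */
struct demo_page { int count; };

static void demo_get_page(struct demo_page *p)
{
    p->count++;
}

/* Decrement the count; return 1 if the page could now be freed. */
static int demo_put_page(struct demo_page *p)
{
    return --p->count == 0;
}
```

If the driver skipped its demo_get_page call, the single decrement at unmap time would free the page while it was still mapped, which is exactly the premature-zero problem described above.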
The nopage method should also store the type of fault in the location pointed to by the type argument, but only if that argument is not NULL. In device drivers, the proper value for type will invariably be VM_FAULT_MINOR.
If you are using nopage, there is usually very little work to be done when mmap is called; our version looks like this:
static int simple_nopage_mmap(struct file *filp, struct vm_area_struct *vma)
{
    unsigned long offset = vma->vm_pgoff << PAGE_SHIFT;

    if (offset >= __pa(high_memory) || (filp->f_flags & O_SYNC))
        vma->vm_flags |= VM_IO;
    vma->vm_flags |= VM_RESERVED;

    vma->vm_ops = &simple_nopage_vm_ops;
    simple_vma_open(vma);
    return 0;
}
The main thing mmap has to do is to replace the default (NULL) vm_ops pointer with our own operations. The nopage method then takes care of "remapping" one page at a time and returning the address of its struct page structure. Because we are just implementing a window onto physical memory here, the remapping step is simple: we only need to locate and return a pointer to the struct page for the desired address. Our nopage method looks like the following:
struct page *simple_vma_nopage(struct vm_area_struct *vma,
                unsigned long address, int *type)
{
    struct page *pageptr;
    unsigned long offset = vma->vm_pgoff << PAGE_SHIFT;
    unsigned long physaddr = address - vma->vm_start + offset;
    unsigned long pageframe = physaddr >> PAGE_SHIFT;

    if (!pfn_valid(pageframe))
        return NOPAGE_SIGBUS;
    pageptr = pfn_to_page(pageframe);
    get_page(pageptr);
    if (type)
        *type = VM_FAULT_MINOR;
    return pageptr;
}
Since, once again, we are simply mapping main memory here, the nopage function need only find the correct struct page for the faulting address and increment its reference count. The required sequence of events is, therefore, to calculate the desired physical address and turn it into a page frame number by right-shifting it PAGE_SHIFT bits. Since user space can give us any address it likes, we must ensure that we have a valid page frame; the pfn_valid function does that for us. If the address is out of range, we return NOPAGE_SIGBUS, which causes a bus signal to be delivered to the calling process.
Otherwise, pfn_to_page gets the necessary struct page pointer; we can increment its reference count (with a call to get_page) and return it.
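The address arithmetic in the first half of that sequence can be sketched in user space (all parameter values below are made up for illustration, and DEMO_PAGE_SHIFT assumes 4-KB pages):

```c
#include <assert.h>

#define DEMO_PAGE_SHIFT 12 /* assume 4-KB pages */

/* The arithmetic performed by simple_vma_nopage: turn a faulting user
 * virtual address into the page frame number of the physical page
 * backing it, given the VMA's start address and page offset. */
static unsigned long fault_to_pfn(unsigned long address,
                                  unsigned long vm_start,
                                  unsigned long vm_pgoff)
{
    unsigned long offset = vm_pgoff << DEMO_PAGE_SHIFT;
    unsigned long physaddr = address - vm_start + offset;
    return physaddr >> DEMO_PAGE_SHIFT;
}
```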
The nopage method normally returns a pointer to a struct page. If, for some reason, a normal page cannot be returned (e.g., the requested address is beyond the device's memory region), NOPAGE_SIGBUS can be returned to signal the error; that is what the simple code above does. nopage can also return NOPAGE_OOM to indicate failures caused by resource limitations.
Note that this implementation works for ISA memory regions but not for those on the PCI bus. PCI memory is mapped above the highest system memory, and there are no entries in the system memory map for those addresses. Because there is, therefore, no struct page to return a pointer to, nopage cannot be used in these situations; you must use remap_pfn_range instead.
If the nopage method is left NULL, kernel code that handles page faults maps the zero page to the faulting virtual address. The zero page is a copy-on-write page that reads as 0 and that is used, for example, to map the BSS segment. Any process referencing the zero page sees exactly that: a page filled with zeroes. If the process writes to the page, it ends up modifying a private copy. Therefore, if a process extends a mapped region by calling mremap, and the driver hasn't implemented nopage, the process ends up with zero-filled memory instead of a segmentation fault.

15.2.5. Remapping Specific I/O Regions
All the examples we've seen so far are reimplementations of /dev/mem; they remap physical addresses into user space. The typical driver, however, wants to map only the small address range that applies to its peripheral device, not all memory. In order to map to user space only a subset of the whole memory range, the driver needs only to play with the offsets. The following does the trick for a driver mapping a region of simple_region_size bytes, beginning at physical address simple_region_start (which should be page-aligned):
unsigned long off = vma->vm_pgoff << PAGE_SHIFT;
unsigned long physical = simple_region_start + off;
unsigned long vsize = vma->vm_end - vma->vm_start;
unsigned long psize = simple_region_size - off;

if (vsize > psize)
    return -EINVAL; /* spans too high */
remap_pfn_range(vma, vma->vm_start, physical, vsize, vma->vm_page_prot);
In addition to calculating the offsets, this code introduces a check that reports an error when the program tries to map more memory than is available in the I/O region of the target device. In this code, psize is the physical I/O size that is left after the offset has been specified, and vsize is the requested size of virtual memory; the function refuses to map addresses that extend beyond the allowed memory range.
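The bounds check can be written as a standalone helper; this is an illustrative user-space version (the function name and the 0/1 return are ours, where the driver would return -EINVAL on failure):

```c
#include <assert.h>

/* Given the device region size, the mapping offset, and the requested
 * virtual size (all in bytes), decide whether the request fits inside
 * the region. Returns 1 if it fits, 0 if it spans too high. */
static int fits_in_region(unsigned long region_size, unsigned long off,
                          unsigned long vsize)
{
    unsigned long psize = region_size - off;
    return vsize <= psize;
}
```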
Note that the user process can always use mremap to extend its mapping, possibly past the end of the physical device area. If your driver fails to define a nopage method, it is never notified of this extension, and the additional area maps to the zero page. As a driver writer, you may well want to prevent this sort of behavior; mapping the zero page onto the end of your region is not an overtly bad thing to do, but it is highly unlikely that the programmer wanted that to happen.
The simplest way to prevent extension of the mapping is to implement a simple nopage method that always causes a bus signal to be sent to the faulting process. Such a method would look like this:

struct page *simple_nopage(struct vm_area_struct *vma,
                           unsigned long address, int *type)
{
    return NOPAGE_SIGBUS; /* send a SIGBUS */
}
As we have seen, the nopage method is called only when the process dereferences an address that is within a known VMA but for which there is currently no valid page table entry. If we have used remap_pfn_range to map the entire device region, the nopage method shown here is called only for references outside of that region. Thus, it can safely return NOPAGE_SIGBUS to signal an error. Of course, a more thorough implementation of nopage could check to see whether the faulting address is within the device area, and perform the remapping if that is the case. Once again, however, nopage does not work with PCI memory areas, so extension of PCI mappings is not possible.

15.2.6. Remapping RAM
An interesting limitation of remap_pfn_range is that it gives access only to reserved pages and physical addresses above the top of physical memory. In Linux, a page of physical addresses is marked as "reserved" in the memory map to indicate that it is not available for memory management. On the PC, for example, the range between 640 KB and 1 MB is marked as reserved, as are the pages hosting the kernel code itself. Reserved pages are locked in memory and are the only ones that can be safely mapped to user space; this limitation is a basic requirement for system stability.
Therefore, remap_pfn_range won't allow you to remap conventional addresses, which include the ones you obtain by calling get_free_page. Instead, it maps in the zero page. Everything appears to work, with the exception that the process sees private, zero-filled pages rather than the remapped RAM that it was hoping for. Nonetheless, the function does everything that most hardware drivers need it to do, because it can remap high PCI buffers and ISA memory.
The limitations of remap_pfn_range can be seen by running mapper, one of the sample programs in misc-progs available from O'Reilly's FTP site. mapper is a simple tool that can be used to quickly test the mmap system call; it maps read-only parts of a file specified by command-line options and dumps the mapped region to standard output. The following session, for instance, shows that /dev/mem doesn't map the physical page located at address 64 KB; instead, we see a page full of zeros (the host computer in this example is a PC, but the result would be the same on other platforms):
morgana.root# ./mapper /dev/mem 0x10000 0x1000 | od -Ax -t x1
mapped "/dev/mem" from 65536 to 69632
000000 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
*
001000
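The heart of a tool like mapper is just open, mmap, and a copy; the following user-space sketch (the function name and the minimal error handling are ours) reads a region of a file through a mapping rather than with read:

```c
#include <assert.h>
#include <fcntl.h>
#include <stdlib.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

/* Core of a mapper-like tool: map `len` bytes of `path`, starting at
 * the page-aligned `offset`, and copy them into `buf`. A real tool
 * would dump the region to stdout and handle errors more carefully. */
static int read_via_mmap(const char *path, off_t offset, size_t len, char *buf)
{
    int fd = open(path, O_RDONLY);
    char *p;

    if (fd < 0)
        return -1;
    p = mmap(NULL, len, PROT_READ, MAP_SHARED, fd, offset);
    close(fd); /* the mapping stays valid after close */
    if (p == MAP_FAILED)
        return -1;
    memcpy(buf, p, len);
    munmap(p, len);
    return 0;
}
```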
The inability of remap_pfn_range to deal with RAM suggests that memory-based devices like scullp can't easily implement mmap, because their device memory is conventional RAM, not I/O memory. Fortunately, a relatively easy workaround is available to any driver that needs to map RAM into user space; it uses the nopage method that we have seen earlier.

15.2.6.1. Remapping RAM with the nopage method
The way to map real RAM to user space is to use vm_ops->nopage to deal with page faults one at a time. A sample implementation is part of the scullp module, introduced in Chapter 8.
scullp is a page-oriented char device. Because it is page oriented, it can implement mmap on its memory. The code implementing memory mapping uses some of the concepts introduced in the section "Memory Management in Linux."
Before examining the code, let's look at the design choices that affect the mmap implementation in scullp.
scullp doesn't release device memory as long as the device is mapped. This is a matter of policy rather than a requirement, and it is different from the behavior of scull and similar devices, which are truncated to a length of 0 when opened for writing. Refusing to free a mapped scullp device allows a process to overwrite regions actively mapped by another process, so you can test and see how processes and device memory interact. To avoid releasing a mapped device, the driver must keep a count of active mappings; the vmas member in the device structure is used for this purpose.
Memory mapping is performed only when the scullp order parameter (set at module load time) is 0. The parameter controls how __get_free_pages is invoked (see the section "get_free_page and Friends" in Chapter 8). The zero-order limitation (which forces pages to be allocated one at a time, rather than in larger groups) is dictated by the internals of __get_free_pages, the allocation function used by scullp. To maximize allocation performance, the Linux kernel maintains a list of free pages for each allocation order, and only the reference count of the first page of a cluster is incremented by get_free_pages and decremented by free_pages. The mmap method is disabled for a scullp device if the allocation order is greater than zero, because nopage deals with single pages rather than clusters of pages. scullp simply does not know how to properly manage reference counts for pages that are part of higher-order allocations. (Return to the section "A scull Using Whole Pages: scullp" in Chapter 8 if you need a refresher on scullp and the memory allocation order value.)
The zero-order limitation is mostly intended to keep the code simple. It is possible to correctly implement mmap for multipage allocations by playing with the usage count of the pages, but it would only add to the complexity of the example without introducing any interesting information.
Code that is intended to map RAM according to the rules just outlined needs to implement the open, close, and nopage VMA methods; it also needs to access the memory map to adjust the page usage counts.
This implementation of scullp_mmap is very short, because it relies on the nopage function to do all the interesting work:
int scullp_mmap(struct file *filp, struct vm_area_struct *vma)
{
    struct inode *inode = filp->f_dentry->d_inode;

    /* refuse to map if order is not 0 */
    if (scullp_devices[iminor(inode)].order)
        return -ENODEV;

    /* don't do anything here: "nopage" will fill the holes */
    vma->vm_ops = &scullp_vm_ops;
    vma->vm_flags |= VM_RESERVED;
    vma->vm_private_data = filp->private_data;
    scullp_vma_open(vma);
    return 0;
}
The purpose of the if statement is to avoid mapping devices whose allocation order is not 0. scullp's operations are stored in the vm_ops member, and a pointer to the device structure is stashed in the vm_private_data member. At the end, vm_ops->open is called to update the count of active mappings for the device.
open and close simply keep track of the mapping count and are defined as follows:
void scullp_vma_open(struct vm_area_struct *vma)
{
    struct scullp_dev *dev = vma->vm_private_data;
    dev->vmas++;
}

void scullp_vma_close(struct vm_area_struct *vma)
{
    struct scullp_dev *dev = vma->vm_private_data;
    dev->vmas--;
}
Most of the work is then performed by nopage. In the scullp implementation, the address parameter to nopage is used to calculate an offset into the device; the offset is then used to look up the correct page in the scullp memory tree:
struct page *scullp_vma_nopage(struct vm_area_struct *vma,
                                unsigned long address, int *type)
{
    unsigned long offset;
    struct scullp_dev *ptr, *dev = vma->vm_private_data;
    struct page *page = NOPAGE_SIGBUS;
    void *pageptr = NULL; /* default to "missing" */

    down(&dev->sem);
    offset = (address - vma->vm_start) + (vma->vm_pgoff << PAGE_SHIFT);
    if (offset >= dev->size) goto out; /* out of range */

    /*
     * Now retrieve the scullp device from the list, then the page.
     * If the device has holes, the process receives a SIGBUS when
     * accessing the hole.
     */
    offset >>= PAGE_SHIFT; /* offset is a number of pages */
    for (ptr = dev; ptr && offset >= dev->qset;) {
        ptr = ptr->next;
        offset -= dev->qset;
    }
    if (ptr && ptr->data) pageptr = ptr->data[offset];
    if (!pageptr) goto out; /* hole or end-of-file */
    page = virt_to_page(pageptr);

    /* got it, now increment the count */
    get_page(page);
    if (type)
        *type = VM_FAULT_MINOR;
  out:
    up(&dev->sem);
    return page;
}
scullp uses memory obtained with get_free_pages. That memory is addressed using logical addresses, so all scullp_nopage has to do to get a struct page pointer is to call virt_to_page.
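The list walk at the heart of scullp_vma_nopage can be modeled in plain user-space C. The types and names below are simplified stand-ins for scullp's quantum-set structures, and qset is passed as a parameter instead of being read from the device:

```c
#include <stddef.h>

/* A toy model of scullp's lookup: the device is a linked list of
 * quantum sets, each holding `qset` page pointers. Given a page
 * offset, walk to the right list item and index into its data array. */
struct demo_qset {
    void **data;
    struct demo_qset *next;
};

static void *lookup_page(struct demo_qset *dev, unsigned long qset,
                         unsigned long offset)
{
    struct demo_qset *ptr = dev;

    while (ptr && offset >= qset) {
        ptr = ptr->next;
        offset -= qset;
    }
    if (ptr && ptr->data)
        return ptr->data[offset]; /* may be NULL: a hole */
    return NULL;                  /* past end-of-list */
}
```

A NULL return corresponds to the "hole or end-of-file" case in the driver, where nopage answers with NOPAGE_SIGBUS.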
The scullp device now works as expected, as you can see in this sample output from the mapper utility. Here, we send a directory listing of /dev (which is long) to the scullp device and then use the mapper utility to look at pieces of that listing with mmap:
morgana% ls -l /dev > /dev/scullp
morgana% ./mapper /dev/scullp 0 140
mapped "/dev/scullp" from 0 (0x00000000) to 140 (0x0000008c)
total 232
crw-------    1 root     root      10,  10 Sep 11 07:40 adbmouse
crw-r--r--    1 root     root      10, 175 Sep 11 07:40 agpgart

morgana% ./mapper /dev/scullp 8192 200
mapped "/dev/scullp" from 8192 (0x00002000) to 8392 (0x000020c8)
d0h1494
brw-rw----    1 root     floppy     2,  92 Sep 11 07:40 fd0h1660
brw-rw----    1 root     floppy     2,  20 Sep 11 07:40 fd0h360
brw-rw----    1 root     floppy     2,  12 Sep 11 07:40 fd0h360
15.2.7. Remapping Kernel Virtual Addresses
Although it's rarely necessary, it's interesting to see how a driver can map a kernel virtual address to user space using mmap. A true kernel virtual address, remember, is an address returned by a function such as vmalloc, that is, a virtual address mapped in the kernel page tables. The code in this section is taken from scullv, which is the module that works like scullp but allocates its storage through vmalloc.
Most of the scullv implementation is like the one we've just seen for scullp, except that there is no need to check the order parameter that controls memory allocation. The reason for this is that vmalloc allocates its pages one at a time, because single-page allocations are far more likely to succeed than multipage allocations. Therefore, the allocation order problem doesn't apply to vmalloc'd space.
Beyond that, there is only one difference between the nopage implementations used by scullp and scullv. Remember that scullp, once it found the page of interest, would obtain the corresponding struct page pointer with virt_to_page. That function does not work with kernel virtual addresses, however. Instead, you must use vmalloc_to_page. So the final part of the scullv version of nopage looks like this:
    /*
     * After scullv lookup, "page" is now the address of the page
     * needed by the current process. Since it's a vmalloc address,
     * turn it into a struct page.
     */
    page = vmalloc_to_page(pageptr);

    /* got it, now increment the count */
    get_page(page);
    if (type)
        *type = VM_FAULT_MINOR;
  out:
    up(&dev->sem);
    return page;
Based on this discussion, you might also want to map the addresses returned by ioremap to user space. That would be a mistake, however; addresses from ioremap are special and cannot be treated like normal kernel virtual addresses. Instead, you should use remap_pfn_range to remap I/O memory areas into user space.