Linux Kernel Source Code Scenario Analysis: The mmap() System Call

Source: Internet
Author: User

A process can invoke the system call mmap() to map the contents of an open file into its user address space. The user-level interface is:

void *mmap(void *start, size_t length, int prot, int flags, int fd, off_t offset);

The parameter fd identifies an open file, offset is the starting point of the mapping within that file, start is the desired start address of the mapping in user space, and length is the length of the mapping. Of the remaining two parameters, prot specifies the access mode of the mapped interval (writable, executable, and so on), while flags is used for other control purposes. From the point of view of application design, accessing a file as if it were memory is clearly far more convenient than using regular file operations such as read(), write(), and lseek().


The system call corresponding to mmap() is sys_mmap2(), and the code is:

asmlinkage long sys_mmap2(unsigned long addr, unsigned long len,
    unsigned long prot, unsigned long flags,
    unsigned long fd, unsigned long pgoff)
{
    return do_mmap2(addr, len, prot, flags, fd, pgoff);
}


The code of do_mmap2() is as follows:

static inline long do_mmap2(unsigned long addr, unsigned long len,
    unsigned long prot, unsigned long flags,
    unsigned long fd, unsigned long pgoff)
{
    int error = -EBADF;
    struct file * file = NULL;

    flags &= ~(MAP_EXECUTABLE | MAP_DENYWRITE);
    if (!(flags & MAP_ANONYMOUS)) {
        /* MAP_ANONYMOUS set would mean no file backs the mapping;
         * here it is clear, so get the file structure for fd */
        file = fget(fd);
        if (!file)
            goto out;
    }

    down(&current->mm->mmap_sem);
    error = do_mmap_pgoff(file, addr, len, prot, flags, pgoff);
    up(&current->mm->mmap_sem);

    if (file)
        fput(file);
out:
    return error;
}


The inline function do_mmap(), used by the kernel itself, likewise maps an open file into the current process's address space. The code is:

static inline unsigned long do_mmap(struct file *file, unsigned long addr,
    unsigned long len, unsigned long prot,
    unsigned long flag, unsigned long offset)
{
    unsigned long ret = -EINVAL;
    if ((offset + PAGE_ALIGN(len)) < offset)
        goto out;
    if (!(offset & ~PAGE_MASK))
        ret = do_mmap_pgoff(file, addr, len, prot, flag, offset >> PAGE_SHIFT);
out:
    return ret;
}

Both paths end up calling do_mmap_pgoff(), and the code is as follows:

unsigned long do_mmap_pgoff(struct file * file, unsigned long addr, unsigned long len,
    unsigned long prot, unsigned long flags, unsigned long pgoff)
{
    struct mm_struct * mm = current->mm;
    struct vm_area_struct * vma;
    int correct_wcount = 0;
    int error;

    ...    /* various sanity checks, ignored here */

    if (flags & MAP_FIXED) {
        if (addr & ~PAGE_MASK)
            return -EINVAL;
    } else {
        /* MAP_FIXED is 0, so the requested address is only a hint that
         * the kernel need not honor */
        addr = get_unmapped_area(addr, len);    /* pick a start address in the
                                                 * current process's user space */
        if (!addr)
            return -ENOMEM;
    }

    /* Determine the object being mapped and call the appropriate
     * specific mapper. The address has already been validated, but
     * not unmapped, but the maps are removed from the list.
     */
    vma = kmem_cache_alloc(vm_area_cachep, SLAB_KERNEL);
    /* mapping a given file with given access modes is itself an attribute of a
     * virtual interval; intervals with different attributes cannot coexist as one
     * logical interval, so a new vm_area_struct is always created */
    if (!vma)
        return -ENOMEM;

    vma->vm_mm = mm;
    vma->vm_start = addr;          /* start address */
    vma->vm_end = addr + len;      /* end address */
    vma->vm_flags = vm_flags(prot,flags) | mm->def_flags;

    if (file) {
        /* set up vma->vm_flags from the file mode */
        VM_ClearReadHint(vma);
        vma->vm_raend = 0;

        if (file->f_mode & FMODE_READ)
            vma->vm_flags |= VM_MAYREAD | VM_MAYWRITE | VM_MAYEXEC;
        if (flags & MAP_SHARED) {
            vma->vm_flags |= VM_SHARED | VM_MAYSHARE;

            /* This looks strange, but when we don't have the file open
             * for writing, we can demote the shared mapping to a simpler
             * private mapping. That also takes care of a security hole
             * with ptrace() writing to a shared mapping without write
             * permissions.
             *
             * We leave the VM_MAYSHARE bit on, just to get correct output
             * from /proc/xxx/maps.
             */
            if (!(file->f_mode & FMODE_WRITE))
                vma->vm_flags &= ~(VM_MAYWRITE | VM_SHARED);
        }
    } else {
        vma->vm_flags |= VM_MAYREAD | VM_MAYWRITE | VM_MAYEXEC;
        if (flags & MAP_SHARED)
            vma->vm_flags |= VM_SHARED | VM_MAYSHARE;
    }
    vma->vm_page_prot = protection_map[vma->vm_flags & 0x0f];
    vma->vm_ops = NULL;
    vma->vm_pgoff = pgoff;    /* the starting point of the mapping within the
                               * file; with it, when a page fault occurs, the
                               * position of the corresponding page in the file
                               * can be computed from the faulting virtual
                               * address */
    vma->vm_file = NULL;
    vma->vm_private_data = NULL;

    /* Clear old maps */
    error = -ENOMEM;
    if (do_munmap(mm, addr, len))    /* if the target range is already in use in
                                      * the current process's virtual space, the
                                      * old mapping must be torn down first; if
                                      * that operation fails, jump to free_vma */
        goto free_vma;

    /* Check against address space limit. */
    if ((mm->total_vm << PAGE_SHIFT) + len
        > current->rlim[RLIMIT_AS].rlim_cur)    /* would exceed the limit set
                                                 * for this address space */
        goto free_vma;

    /* Private writable mapping? Check memory availability. */
    if ((vma->vm_flags & (VM_SHARED | VM_WRITE)) == VM_WRITE &&
        !(flags & MAP_NORESERVE) &&
        !vm_enough_memory(len >> PAGE_SHIFT))    /* are there enough physical pages? */
        goto free_vma;

    if (file) {
        if (vma->vm_flags & VM_DENYWRITE) {
            error = deny_write_access(file);    /* lock out regular write access
                                                 * to the file */
            if (error)
                goto free_vma;
            correct_wcount = 1;
        }
        vma->vm_file = file;    /* the key assignment */
        get_file(file);
        error = file->f_op->mmap(file, vma);    /* points to generic_file_mmap */
        if (error)
            goto unmap_and_free_vma;
    } else if (flags & MAP_SHARED) {
        error = shmem_zero_setup(vma);
        if (error)
            goto free_vma;
    }

    /* Can addr have changed??
     *
     * Answer: Yes, several device drivers can do it in their
     *         f_op->mmap method. -DaveM
     */
    flags = vma->vm_flags;
    addr = vma->vm_start;

    insert_vm_struct(mm, vma);    /* insert into the corresponding queue */
    if (correct_wcount)
        atomic_inc(&file->f_dentry->d_inode->i_writecount);

    mm->total_vm += len >> PAGE_SHIFT;
    if (flags & VM_LOCKED) {
        mm->locked_vm += len >> PAGE_SHIFT;
        make_pages_present(addr, addr + len);    /* only called when the pages
                                                  * are to be locked in memory */
    }
    return addr;    /* the starting virtual address is returned; its low 12 bits
                     * are normally 0 */

unmap_and_free_vma:
    if (correct_wcount)
        atomic_inc(&file->f_dentry->d_inode->i_writecount);
    vma->vm_file = NULL;
    fput(file);
    /* Undo any partial mapping done by a device driver. */
    flush_cache_range(mm, vma->vm_start, vma->vm_end);
    zap_page_range(mm, vma->vm_start, vma->vm_end - vma->vm_start);
    flush_tlb_range(mm, vma->vm_start, vma->vm_end);
free_vma:
    kmem_cache_free(vm_area_cachep, vma);
    return error;
}


The generic_file_mmap() function's code is as follows:

int generic_file_mmap(struct file * file, struct vm_area_struct * vma)
{
    struct vm_operations_struct * ops;
    struct inode *inode = file->f_dentry->d_inode;

    ops = &file_private_mmap;
    if ((vma->vm_flags & VM_SHARED) && (vma->vm_flags & VM_MAYWRITE)) {
        if (!inode->i_mapping->a_ops->writepage)
            return -EINVAL;
        ops = &file_shared_mmap;
    }
    if (!inode->i_sb || !S_ISREG(inode->i_mode))
        return -EACCES;
    if (!inode->i_mapping->a_ops->readpage)
        return -ENOEXEC;
    UPDATE_ATIME(inode);
    vma->vm_ops = ops;    /* the key assignment */
    return 0;
}


where file_private_mmap is defined as:

static struct vm_operations_struct file_private_mmap = {
    nopage:    filemap_nopage,
};

For an ext2 file, inode->i_mapping->a_ops->writepage and inode->i_mapping->a_ops->readpage point into ext2_aops:

struct address_space_operations ext2_aops = {
    readpage:      ext2_readpage,
    writepage:     ext2_writepage,
    sync_page:     block_sync_page,
    prepare_write: ext2_prepare_write,
    commit_write:  generic_commit_write,
    bmap:          ext2_bmap
};


At this point the vm_area_struct is fully set up: the important fields (vm_ops, vm_file, vm_pgoff, vm_start, vm_end) are all in place, and the starting virtual address is returned.

Readers may be puzzled that establishing the mapping between a file and a virtual interval is this simple. We have not even seen the page tables being set up!

So when is the page mapping actually built?

When a page in this interval is accessed for the first time, a page fault occurs because no page-table entry exists yet. Since the page has never been mapped (as opposed to swapped out), the handler invoked is do_no_page() rather than do_swap_page(). The code is as follows:

static int do_no_page(struct mm_struct * mm, struct vm_area_struct * vma,
    unsigned long address, int write_access, pte_t *page_table)
{
    struct page * new_page;
    pte_t entry;

    if (!vma->vm_ops || !vma->vm_ops->nopage)
        return do_anonymous_page(mm, vma, page_table, write_access, address);

    /*
     * The third argument is "no_share", which tells the low-level code
     * to copy, not share the page even if sharing is possible. It's
     * essentially an early COW detection.
     */
    new_page = vma->vm_ops->nopage(vma, address & PAGE_MASK,
        (vma->vm_flags & VM_SHARED) ? 0 : write_access);    /* calls filemap_nopage */
    if (new_page == NULL)    /* no page was available -- SIGBUS */
        return 0;
    if (new_page == NOPAGE_OOM)
        return -1;
    ++mm->rss;
    /*
     * This silly early PAGE_DIRTY setting removes a race
     * due to the bad i386 page protection. But it's valid
     * for other architectures too.
     *
     * Note that if write_access is true, we either now have
     * an exclusive copy of the page, or this is a shared mapping,
     * so we can make it writable and dirty to avoid having to
     * handle that later.
     */
    flush_page_to_ram(new_page);
    flush_icache_page(vma, new_page);
    entry = mk_pte(new_page, vma->vm_page_prot);
    if (write_access) {
        entry = pte_mkwrite(pte_mkdirty(entry));
    } else if (page_count(new_page) > 1 &&
               !(vma->vm_flags & VM_SHARED))
        entry = pte_wrprotect(entry);
    set_pte(page_table, entry);

    /* no need to invalidate: a not-present page shouldn't be cached */
    update_mmu_cache(vma, address, entry);
    return 2;    /* major fault */
}
do_no_page() calls filemap_nopage() (through vma->vm_ops->nopage), which allocates a free memory page, reads the corresponding page in from the file, and thereby establishes the mapping.

