Linux Kernel Source Scenario Analysis: Memory Management, Extending the User Stack

Last Update:2015-03-01 Source: Internet

Author: User



A page fault exception occurs in any of the following cases:

1. The corresponding page directory entry or page table entry is empty; that is, the mapping from the linear address to a physical address has not been established, or has been revoked. This is the case examined in this article.

2. The corresponding physical page is not in memory.

3. The access requested by the instruction conflicts with the permissions of the page, for example an attempt to write to a read-only page.

First look at the process address space:

Suppose the program now calls a subroutine. The CPU must push the return address onto the stack, that is, write it to the virtual address (%esp-4). In our scenario, however, (%esp-4) falls into the hole below the stack interval, an address that has not been mapped yet, so the write triggers a page fault exception.

Assume the CPU has now reached do_page_fault(), the main routine of the page fault handler. The code is as follows:

arch/i386/mm/fault.c

asmlinkage void do_page_fault(struct pt_regs *regs, unsigned long error_code)
{
    struct task_struct *tsk;
    struct mm_struct *mm;
    struct vm_area_struct * vma;
    unsigned long address;
    unsigned long page;
    unsigned long fixup;
    int write;
    siginfo_t info;

    /* get the address */
    __asm__("movl %%cr2,%0":"=r" (address)); // CR2 holds the address whose mapping failed; here it is %esp-4

    tsk = current; // task_struct of the current process

    /*
     * We fault-in kernel-space virtual memory on-demand. The
     * 'reference' page table is init_mm.pgd.
     *
     * NOTE! We MUST NOT take any locks for this case. We may
     * be in an interrupt or a critical region, and should
     * only copy the information from the master page table,
     * nothing more.
     */
    if (address >= TASK_SIZE)
        goto vmalloc_fault;

    mm = tsk->mm; // mm_struct of the current process
    info.si_code = SEGV_MAPERR;

    /*
     * If we're in an interrupt or have no user
     * context, we must not take the fault..
     */
    if (in_interrupt() || !mm)
        goto no_context;

    down(&mm->mmap_sem);

    vma = find_vma(mm, address); // find the first interval whose end address is greater than the given address
    if (!vma) // not found: no interval ends above the given address, so the address lies above the stack, toward the 3 GB boundary
        goto bad_area;
    if (vma->vm_start <= address) // the start address is not above address, so a mapping exists; go to good_area to check the cause of the failure further
        goto good_area;
    if (!(vma->vm_flags & VM_GROWSDOWN)) // vm_start is above address, so the address falls in a hole; if VM_GROWSDOWN is set, the hole is just below the stack interval and we do not go to bad_area
        goto bad_area;
    if (error_code & 4) { // the fault occurred in user mode
        /*
         * accessing the stack below %esp is always a bug.
         * The "+ 32" is there due to some instructions (like
         * pusha) doing post-decrement on the stack and that
         * doesn't show up until later..
         */
        if (address + 32 < regs->esp) // make sure this is a push: a push normally stores 4 bytes, and at most 32 bytes (pusha)
            goto bad_area;
    }
    if (expand_stack(vma, address)) // see the annotated code of expand_stack below
        goto bad_area;
/*
 * Ok, we have a good vm_area for this memory access, so
 * we can handle it..
 */
good_area:
    info.si_code = SEGV_ACCERR;
    write = 0;
    switch (error_code & 3) { // in this scenario: binary 110 & 011 = 2
        default:    /* 3: write, present */
#ifdef TEST_VERIFY_AREA
            if (regs->cs == KERNEL_CS)
                printk("WP fault at %08lx\n", regs->eip);
#endif
            /* fall through */
        case 2:     /* write, not present */
            if (!(vma->vm_flags & VM_WRITE))
                goto bad_area;
            write++; // this scenario takes this path
            break;
        case 1:     /* read, present */
            goto bad_area;
        case 0:     /* read, not present */
            if (!(vma->vm_flags & (VM_READ | VM_EXEC)))
                goto bad_area;
    }

    /*
     * If for any reason at all we couldn't handle the fault,
     * make sure we exit gracefully rather than endlessly redo
     * the fault.
     */
    switch (handle_mm_fault(mm, vma, address, write)) {
    case 1:
        tsk->min_flt++;
        break;
    case 2:
        tsk->maj_flt++;
        break;
    case 0:
        goto do_sigbus;
    default:
        goto out_of_memory;
    }

    /*
     * Did it hit the DOS screen memory VA from vm86 mode?
     */
    if (regs->eflags & VM_MASK) {
        unsigned long bit = (address - 0xA0000) >> PAGE_SHIFT;
        if (bit < 32)
            tsk->thread.screen_bitmap |= 1 << bit;
    }
    up(&mm->mmap_sem);
    return;
    .......
}

The kernel's interrupt/exception entry mechanism passes two parameters to do_page_fault(). The first, regs, is a pointer to a pt_regs structure holding a copy of the CPU registers as they were just before the exception. The second, error_code, further identifies the specific reason the mapping failed.

error_code:
bit 0 == 0 means no page found, 1 means protection fault
bit 1 == 0 means read, 1 means write
bit 2 == 0 means kernel, 1 means user-mode

In this scenario, error_code indicates user mode, page not mapped, write; that is, binary 110.

The expand_stack function is as follows:

static inline int expand_stack(struct vm_area_struct * vma, unsigned long address)
{
    unsigned long grow;

    address &= PAGE_MASK; // align the address to a page boundary
    grow = (vma->vm_start - address) >> PAGE_SHIFT; // in this example grow is 1 page
    if (vma->vm_end - address > current->rlim[RLIMIT_STACK].rlim_cur ||
        ((vma->vm_mm->total_vm + grow) << PAGE_SHIFT) > current->rlim[RLIMIT_AS].rlim_cur)
        return -ENOMEM;
    vma->vm_start = address; // the start address moves down by one page, toward lower addresses
    vma->vm_pgoff -= grow;
    vma->vm_mm->total_vm += grow;
    if (vma->vm_flags & VM_LOCKED)
        vma->vm_mm->locked_vm += grow;
    return 0;
}

The handle_mm_fault function is as follows:

int handle_mm_fault(struct mm_struct *mm, struct vm_area_struct * vma,
    unsigned long address, int write_access)
{
    int ret = -1;
    pgd_t *pgd;
    pmd_t *pmd;

    pgd = pgd_offset(mm, address); // returns a pointer to the page directory entry
    pmd = pmd_alloc(pgd, address); // middle directory; on i386 this is still the pointer to the page directory entry

    if (pmd) {
        pte_t * pte = pte_alloc(pmd, address); // returns a pointer to the page table entry
        if (pte)
            ret = handle_pte_fault(mm, vma, address, write_access, pte);
    }
    return ret;
}

The pgd_offset macro is as follows:

#define pgd_offset(mm, address) ((mm)->pgd+pgd_index(address))

The pmd_alloc function is as follows:

extern inline pmd_t * pmd_alloc(pgd_t *pgd, unsigned long address)
{
    if (!pgd)
        BUG();
    return (pmd_t *) pgd;
}

The pte_alloc function is as follows:

extern inline pte_t * pte_alloc(pmd_t * pmd, unsigned long address)
{
    address = (address >> PAGE_SHIFT) & (PTRS_PER_PTE - 1); // index within the page table

    if (pmd_none(*pmd)) // is the page directory entry empty?
        goto getnew; // if so, go and create a page table
    if (pmd_bad(*pmd))
        goto fix;
    return (pte_t *)pmd_page(*pmd) + address; // the page table exists: return a pointer to the page table entry
getnew:
{
    unsigned long page = (unsigned long) get_pte_fast(); // allocate a page table

    if (!page)
        return get_pte_slow(pmd, address);
    set_pmd(pmd, __pmd(_PAGE_TABLE + __pa(page))); // make the page directory entry point to the page table
    return (pte_t *)page + address; // return a pointer to the page table entry
}
fix:
    __handle_bad_pmd(pmd);
    return NULL;
}

The handle_pte_fault function is as follows:

static inline int handle_pte_fault(struct mm_struct *mm,
    struct vm_area_struct * vma, unsigned long address,
    int write_access, pte_t * pte)
{
    pte_t entry;

    /*
     * We need the page table lock to synchronize with kswapd
     * and the SMP-safe atomic PTE updates.
     */
    spin_lock(&mm->page_table_lock);
    entry = *pte; // contents of the page table entry
    if (!pte_present(entry)) { // the page is not present in memory
        /*
         * If it truly wasn't present, we know that kswapd
         * and the PTE updates will not touch it later. So
         * drop the lock.
         */
        spin_unlock(&mm->page_table_lock);
        if (pte_none(entry)) // the page table entry is empty
            return do_no_page(mm, vma, address, write_access, pte);
        return do_swap_page(mm, vma, address, pte, pte_to_swp_entry(entry), write_access);
    }

    if (write_access) {
        if (!pte_write(entry))
            return do_wp_page(mm, vma, address, pte, entry);

        entry = pte_mkdirty(entry);
    }
    entry = pte_mkyoung(entry);
    establish_pte(vma, address, pte, entry);
    spin_unlock(&mm->page_table_lock);
    return 1;
}

The do_no_page function is as follows:

static int do_no_page(struct mm_struct * mm, struct vm_area_struct * vma,
    unsigned long address, int write_access, pte_t *page_table)
{
    struct page * new_page;
    pte_t entry;

    if (!vma->vm_ops || !vma->vm_ops->nopage) // both are NULL for the stack interval
        return do_anonymous_page(mm, vma, page_table, write_access, address);
    ..
    return 2;   /* Major fault */
}

The do_anonymous_page function is as follows:

static int do_anonymous_page(struct mm_struct * mm, struct vm_area_struct * vma, pte_t *page_table, int write_access, unsigned long addr)
{
    struct page *page = NULL;
    pte_t entry = pte_wrprotect(mk_pte(ZERO_PAGE(addr), vma->vm_page_prot));
    if (write_access) { // write_access == 1 in this scenario
        page = alloc_page(GFP_HIGHUSER); // allocate a physical page
        if (!page)
            return -1;
        clear_user_highpage(page, addr);
        entry = pte_mkwrite(pte_mkdirty(mk_pte(page, vma->vm_page_prot))); // the entry refers to the new physical page, which can be read, written, or executed
        mm->rss++;
        flush_page_to_ram(page);
    }
    set_pte(page_table, entry); // install the entry (with the attributes just set) so it points to the page
    /* No need to invalidate - it was non-present before */
    update_mmu_cache(vma, addr, entry);
    return 1;   /* Minor fault */
}

When do_page_fault() returns and the CPU returns from exception handling, the stack interval has been extended. The CPU then re-executes the push instruction that was aborted earlier, and execution continues from there. To the user program the whole process is transparent: it is as if nothing had happened and the stack interval had been large enough from the very beginning.


This article is an English version of an article originally published in Chinese on aliyun.com.
