Linux kernel analysis: the process address space (III)


This section covers the page fault handler and heap management.

The page fault handler

Two scenarios trigger the page fault handler:

1. The exception is caused by a programming error (such as an out-of-bounds access: the address does not belong to the process address space).

2. The address belongs to the process address space, but the kernel has not yet allocated a corresponding physical page frame, so the access raises a page fault.
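As a minimal user-space illustration of the first scenario (not part of the original article), the following sketch forks a child that dereferences an address outside its address space; the parent then observes that the child was killed by SIGSEGV, the signal the page fault handler delivers for such accesses:

#include <signal.h>
#include <stdio.h>
#include <sys/wait.h>
#include <unistd.h>

int main(void)
{
    pid_t pid = fork();
    if (pid == 0) {
        /* Scenario 1: this address belongs to no memory region of the
         * process, so the kernel sends us SIGSEGV (bad_area path). */
        volatile int *bad = (int *)0x1;
        *bad = 42;          /* raises a page fault */
        _exit(0);           /* never reached */
    }
    int status;
    waitpid(pid, &status, 0);
    if (WIFSIGNALED(status) && WTERMSIG(status) == SIGSEGV)
        printf("child killed by SIGSEGV, as expected\n");
    return 0;
}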

Overall approach of the page fault handler:

The memory region descriptors allow the page fault handler to carry out its job efficiently.

The do_page_fault() function is the page fault interrupt service routine on the 80x86. It compares the linear address that caused the fault against the memory regions of the current process and, based on the outcome, selects the appropriate way to handle the exception.

The code at the labels vmalloc_fault, good_area, do_sigbus, bad_area, no_context, survive, out_of_memory, and bad_area_nosemaphore performs the processing for the different kinds of page faults.

Parameters:

fastcall void do_page_fault(struct pt_regs *regs, unsigned long error_code)
regs: the address of a pt_regs structure, which contains the values of the microprocessor registers at the time of the exception.

The three low-order bits of error_code:

==>> error_code:

* If bit 0 is clear, the fault was caused by an access to a page that is not present; if bit 0 is set, the access violated the page's access permissions.

* If bit 1 is clear, the fault was caused by a read or instruction-fetch access; if bit 1 is set, by a write access.

* If bit 2 is clear, the fault occurred while the processor was in Kernel Mode; if bit 2 is set, in User Mode.
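To make the bit layout concrete, here is a small illustrative decoder (user-space, written for this article; the kernel itself tests the bits inline, e.g. error_code & 4 for the User Mode check):

#include <stdio.h>

/* Decode the three low-order bits of the x86 page fault error code
 * described above. Illustrative only, not kernel code. */
static void decode_error_code(unsigned long error_code)
{
    printf("%s, %s access, %s Mode\n",
           (error_code & 1) ? "protection violation" : "page not present",
           (error_code & 2) ? "write" : "read/execute",
           (error_code & 4) ? "User" : "Kernel");
}

int main(void)
{
    decode_error_code(6);  /* write to a not-present page from user space */
    return 0;
}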

Procedure:

* Read the linear address that caused the fault. When the exception occurs, the CPU control unit stores this linear address in the cr2 control register.

    __asm__("movl %%cr2,%0":"=r" (address));if (regs->eflags & (X86_EFLAGS_IF|VM_MASK))    /*X86_EFLAGS_IF|VM_MASK)=0x00020200*/local_irq_enable();tsk = current;
The pt_regs structure pointer regs points to a copy of the CPU register contents saved by the kernel's interrupt-response mechanism before the exception occurred, and error_code further qualifies the cause of the fault. If local interrupts were enabled before the page fault, or the CPU was running in virtual-8086 mode, local_irq_enable() is invoked; then the pointer to the current process descriptor is saved in the tsk local variable.

* Next, do_page_fault() dispatches on the faulting address. The original article shows this as a flow diagram; its logic is as follows:

do_page_fault() first checks whether the linear address that caused the fault lies in the kernel address space:

If it does, then the kernel tried to access a nonexistent page, and execution is redirected to the code that handles noncontiguous memory area addresses, that is, the code after the vmalloc_fault label. Otherwise, the code after the bad_area_nosemaphore label runs.

If it does not, the faulting linear address lies in the user address space. In this case, determine whether the fault occurred in an interrupt handler, a deferrable function, a critical region, or a kernel thread:

If it did, the bad_area_nosemaphore code is executed, because an interrupt handler never uses addresses below TASK_SIZE.

if (in_atomic() || !mm)
    goto bad_area_nosemaphore;
If it did not, that is, the fault did not occur in an interrupt handler, a deferrable function, a critical region, or a kernel thread, the function checks the process's memory regions to determine whether the faulting linear address is included in the process address space. To do this it must acquire the process's mmap_sem read/write semaphore.
if (!down_read_trylock(&mm->mmap_sem)) {
    if ((error_code & 4) == 0 &&
        !search_exception_tables(regs->eip))
        goto bad_area_nosemaphore;
    down_read(&mm->mmap_sem);
}
Once it holds the mmap_sem semaphore, do_page_fault() searches for the memory region containing the faulting linear address and jumps to the label selected by the vma checks.
vma = find_vma(mm, address);
if (!vma)
    goto bad_area;
if (vma->vm_start <= address)
    goto good_area;
if (!(vma->vm_flags & VM_GROWSDOWN))
    goto bad_area;
if (error_code & 4) {
    /*
     * Accessing the stack below %esp is always a bug.
     * The "+ 32" is there due to some instructions (like
     * pusha) doing post-decrement on the stack and that
     * doesn't show up until later..
     */
    if (address + 32 < regs->esp)
        goto bad_area;
}
if (expand_stack(vma, address))
    goto bad_area;
Handling erroneous addresses outside the address space

If address (the linear address that caused the fault) does not belong to the process address space, do_page_fault() executes the statements at the bad_area label.

/*
 * Something tried to access memory that isn't in our memory map..
 * Fix it, but check if it's kernel or user first..
 */
bad_area:
    up_read(&mm->mmap_sem);        /* exit the critical region */

bad_area_nosemaphore:
    /* User mode accesses just cause a SIGSEGV */
    if (error_code & 4) {          /* User Mode */
        /*
         * Valid to do another page fault here because this one came
         * from user space.
         */
        if (is_prefetch(regs, address, error_code))
            return;
        tsk->thread.cr2 = address;
        /* Kernel addresses are always protection faults */
        tsk->thread.error_code = error_code | (address >= TASK_SIZE);
        tsk->thread.trap_no = 14;
        info.si_signo = SIGSEGV;
        info.si_errno = 0;
        /* info.si_code has been set above */
        info.si_addr = (void __user *)address;
        force_sig_info(SIGSEGV, &info, tsk);
        return;
    }

no_context:
    /* Kernel Mode */
    if (fixup_exception(regs))
        return;
    if (is_prefetch(regs, address, error_code))
        return;
If the exception occurred in User Mode, a SIGSEGV signal is sent to the current process and the function terminates.

Here force_sig_info() makes sure that the process does not ignore or block the SIGSEGV signal, and conveys additional information through the info local variable while delivering the signal to the user-mode process.

If the exception occurred in Kernel Mode (bit 2 of error_code is clear), there are two possible cases (handled by the no_context code):

* The exception was caused by a linear address passed to the kernel as a parameter of a system call;

* The exception was caused by a real kernel bug.

In the first case, the code jumps to a piece of "fixup code". Typically, this code sends a SIGSEGV signal to the current process or terminates the system call handler with an appropriate error code.

In the second case, the function dumps all CPU registers and the Kernel Mode stack to the console, writes them to the system message buffer, and then calls do_exit() to kill the current process: this is a kernel bug, the so-called "Kernel Oops".

Handling erroneous addresses inside the address space

If the address belongs to the process address space, do_page_fault() jumps to the good_area label:

/*
 * OK, we have a good vm_area for this memory access, so
 * we can handle it..
 */
good_area:
    info.si_code = SEGV_ACCERR;
    write = 0;
    switch (error_code & 3) {
    default:    /* write access, page present (fall through) */
    case 2:     /* write access, page not present */
        if (!(vma->vm_flags & VM_WRITE))    /* the region is not writable */
            goto bad_area;
        write++;                            /* the region is writable */
        break;
    case 1:     /* read or execution access, page present */
        goto bad_area;
    case 0:     /* read or execution access, page not present */
        if (!(vma->vm_flags & (VM_READ | VM_EXEC)))
            goto bad_area;
    }

survive:
    switch (handle_mm_fault(mm, vma, address, write)) {
    case VM_FAULT_MINOR:        /* minor fault */
        tsk->min_flt++;
        break;
    case VM_FAULT_MAJOR:        /* major fault */
        tsk->maj_flt++;
        break;
    case VM_FAULT_SIGBUS:       /* any other error */
        goto do_sigbus;
    case VM_FAULT_OOM:          /* not enough memory */
        goto out_of_memory;
    default:
        BUG();
    }
    if (regs->eflags & VM_MASK) {
        unsigned long bit = (address - 0xA0000) >> PAGE_SHIFT;
        if (bit < 32)
            tsk->thread.screen_bitmap |= 1 << bit;
    }
    up_read(&mm->mmap_sem);
    return;
Dispatching on error_code & 3 ==>>>

Case 2: the exception was caused by a write access; the function checks whether the memory region is writable. If it is not writable (!(vma->vm_flags & VM_WRITE)), it jumps to the bad_area code; if it is writable, it sets the write local variable to 1.

Case 1 and case 0: the exception was caused by a read or execution access, and the function checks whether the page is already present in RAM. If it is present (case 1), the exception occurred because the process tried to access a privileged page frame in User Mode, so the function jumps to the bad_area code. If it is not present (case 0), the function also checks whether the memory region is readable or executable.

Default and case 2 (write = 1): if the access rights of the memory region match the access type that caused the exception, the handle_mm_fault() function is called to allocate a new page frame (the survive code segment):

Key: the handle_mm_fault() function

/*
 * By the time we get here, we already hold the mm semaphore
 */
int handle_mm_fault(struct mm_struct *mm, struct vm_area_struct *vma,
                    unsigned long address, int write_access)
{
    pgd_t *pgd;
    pud_t *pud;
    pmd_t *pmd;
    pte_t *pte;

    __set_current_state(TASK_RUNNING);
    inc_page_state(pgfault);

    if (is_vm_hugetlb_page(vma))
        return VM_FAULT_SIGBUS;    /* mapping truncation does this. */

    /*
     * We need the page table lock to synchronize with kswapd
     * and the SMP-safe atomic PTE updates.
     */
    pgd = pgd_offset(mm, address);
    spin_lock(&mm->page_table_lock);
    pud = pud_alloc(mm, pgd, address);
    if (!pud)
        goto oom;
    pmd = pmd_alloc(mm, pud, address);
    if (!pmd)
        goto oom;
    pte = pte_alloc_map(mm, pmd, address);
    if (!pte)
        goto oom;
    return handle_pte_fault(mm, vma, address, write_access, pte, pmd);

oom:
    spin_unlock(&mm->page_table_lock);
    return VM_FAULT_OOM;
}
Parameters: the memory descriptor mm of the process that was running on the CPU when the exception occurred; the descriptor vma of the memory region containing the faulting linear address; the faulting linear address itself; and write_access (set if tsk attempted to write to address, clear if tsk attempted to read or execute it).

Steps:

* The function first checks whether the page directories and page table used to map address exist, and allocates them if they do not;

* The handle_pte_fault() function then examines the page table entry corresponding to address and determines how to assign a new page frame to the process:

static inline int handle_pte_fault(struct mm_struct *mm,
        struct vm_area_struct *vma, unsigned long address,
        int write_access, pte_t *pte, pmd_t *pmd)
{
    pte_t entry;

    entry = *pte;
    if (!pte_present(entry)) {    /* demand paging */
        /*
         * If it truly wasn't present, we know that kswapd
         * and the PTE updates will not touch it later. So
         * drop the lock.
         */
        if (pte_none(entry))
            return do_no_page(mm, vma, address, write_access, pte, pmd);
        if (pte_file(entry))
            return do_file_page(mm, vma, address, write_access, pte, pmd);
        return do_swap_page(mm, vma, address, pte, pmd, entry, write_access);
    }

    /* copy on write */
    if (write_access) {
        if (!pte_write(entry))
            return do_wp_page(mm, vma, address, pte, pmd, entry);
        entry = pte_mkdirty(entry);
    }
    entry = pte_mkyoung(entry);
    ptep_set_access_flags(vma, address, pte, entry, write_access);
    update_mmu_cache(vma, address, entry);
    pte_unmap(pte);
    spin_unlock(&mm->page_table_lock);
    return VM_FAULT_MINOR;
}
==>>>

# If the accessed page is not present in memory, that is, it is not stored in any page frame, the kernel allocates a new page frame and initializes it appropriately. This technique is called demand paging;

# If the accessed page is present but marked read-only, that is, it is already stored in a page frame, the kernel allocates a new page frame and initializes its contents by copying the old page frame's data. This technique is called Copy On Write (COW).

* If the access rights of the memory region match the access type that caused the exception, the handle_mm_fault() function assigns a new page frame:

If handle_mm_fault() succeeds in assigning a page frame to the process, it returns VM_FAULT_MINOR or VM_FAULT_MAJOR. The value VM_FAULT_MINOR indicates that the fault was handled without blocking the current process (a minor fault). The value VM_FAULT_MAJOR indicates that handling the fault forced the current process to sleep; a fault that blocks the current process is called a major fault.
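Minor and major fault counts can be observed from user space; below is a small sketch assuming a Linux system with getrusage(2). Touching freshly allocated anonymous memory produces minor faults, since the frames can be assigned without I/O:

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/resource.h>

int main(void)
{
    struct rusage before, after;
    size_t len = 16UL << 20;    /* 16 MB */

    getrusage(RUSAGE_SELF, &before);

    /* The first touch of each anonymous page raises a page fault that
     * is handled without blocking: a minor fault (VM_FAULT_MINOR). */
    char *buf = malloc(len);
    memset(buf, 1, len);

    getrusage(RUSAGE_SELF, &after);
    printf("minor faults: %ld, major faults: %ld\n",
           after.ru_minflt - before.ru_minflt,
           after.ru_majflt - before.ru_majflt);
    free(buf);
    return 0;
}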

When there is not enough memory, the function returns VM_FAULT_OOM; in this case the function does not assign a new page frame, and the kernel usually kills the current process. However, if the current process is init, it is placed at the end of the run queue and the scheduler is invoked; once init resumes execution, handle_mm_fault() runs again:

case VM_FAULT_OOM:
    goto out_of_memory;
Code at the out_of_memory label (the flow just described):
/*
 * We ran out of memory, or some other thing happened to us that made
 * us unable to handle the page fault gracefully.
 */
out_of_memory:
    up_read(&mm->mmap_sem);
    if (tsk->pid == 1) {
        yield();
        down_read(&mm->mmap_sem);
        goto survive;
    }
    printk("VM: killing process %s\n", tsk->comm);
    if (error_code & 4)
        do_exit(SIGKILL);
    goto no_context;
Demand paging

Demand paging is a dynamic memory allocation technique that defers page frame allocation until the last possible moment, that is, until the process attempts to access a page that is not present in memory, thereby raising a page fault.

The rationale for demand paging: it increases the average number of free page frames in the system and thus makes better use of free memory, giving the system greater overall throughput.

The price paid is extra system overhead: every page fault raised by demand paging must be handled by the kernel.

Demand paging code:

entry = *pte;
if (!pte_present(entry)) {    /* the page is not present in main memory */
    /* the page table entry is 0: the process never accessed the page */
    if (pte_none(entry))
        return do_no_page(mm, vma, address, write_access, pte, pmd);
    /* a nonlinear file mapping that has been swapped out */
    if (pte_file(entry))
        return do_file_page(mm, vma, address, write_access, pte, pmd);
    /*
     * The page is not present, but the page table entry holds the
     * relevant information: the page was swapped out by the kernel,
     * so swap it back in.
     */
    return do_swap_page(mm, vma, address, pte, pmd, entry, write_access);
}
The pte_present() macro tells whether the page referenced by entry is present in main memory. If it is not, the reason is either that the process never accessed the page or that the kernel has already reclaimed the corresponding page frame.

In both cases, the page fault handler must assign a new page frame to the process. How that page frame is initialized, however, falls into three special cases:

* The page was never accessed by the process and does not map a disk file:

pte_none() macro ==> do_no_page() function;

* The page is a mapping of a nonlinear disk file:

pte_file() macro ==> do_file_page() function;

* The page was already accessed by the process, but its contents are temporarily saved on disk (Present = Dirty = 0):

==> do_swap_page() function.

The handle_pte_fault() function distinguishes these three cases by examining the flags of the page table entry corresponding to address, and calls a different function for each.

==>>> Anonymous pages and mapped pages:

In Linux virtual memory, if the vma containing a page maps a file, the page is called a mapped page; if it does not map a file, it is called an anonymous page. The biggest difference between the two lies in how the page and its vmas are linked together, because during page reclaim the kernel must find, starting from the page, every vma that maps it (reverse mapping). For the reverse mapping of anonymous pages, the vmas are organized through the anon_vma_node field (a list node) and the anon_vma field (the list head) in the vma structure, and the list head is recorded in the page descriptor. Mapped pages and their vmas are organized through a priority tree node in the vma and the mapping->i_mmap priority tree root reachable from the page descriptor.
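The relationships just described can be summarized in a deliberately simplified sketch; the field names follow the 2.6-era kernel, but the definitions here are stripped down for illustration and are not the kernel's real layouts:

/* Simplified sketch of the 2.6-era reverse-mapping links; compiles
 * stand-alone, but only the shape of the linkage is meaningful. */
struct list_head { struct list_head *next, *prev; };
struct prio_tree_node_sketch { void *left, *right; };

struct anon_vma_sketch {
    struct list_head head;          /* heads the list of sharing vmas */
};

struct vma_sketch {
    /* anonymous pages: list node plus pointer to the shared head */
    struct list_head anon_vma_node;
    struct anon_vma_sketch *anon_vma;
    /* mapped pages: node in the file's i_mmap priority tree */
    struct prio_tree_node_sketch shared;
};

struct page_sketch {
    /* For an anonymous page, mapping refers (with a tag bit) to the
     * anon_vma; for a mapped page, to the file's address_space, whose
     * i_mmap field is the root of the priority tree of vmas. */
    void *mapping;
};

int main(void) { return 0; }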

Copy on write

How process creation originally worked:

When a fork() system call was issued, the kernel copied the entire parent address space into the child. This approach is time consuming, as it must:

1. Allocate page frames for the child's address space

2. Allocate page frames for the child's page tables

3. Initialize the child's page tables

4. Copy each page of the parent into the corresponding page of the child

Disadvantage: it consumes many CPU cycles.

Linux now adopts the copy-on-write technique instead.

Principle: the parent and child processes share page frames rather than copying them. Shared pages cannot be modified; whenever the parent or child attempts to write to a shared page, an exception occurs, and at that point the kernel copies the physical page into a new page frame and marks it writable.

The _count field of the page descriptor tracks the number of processes sharing the corresponding page frame. A process's release of a page frame, or a copy-on-write performed on it, decreases _count; the page frame is freed only when _count reaches -1.
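The effect is easy to observe from user space. In the sketch below, a write performed by the child after fork() triggers a copy-on-write fault behind the scenes, so the parent's copy of the page stays untouched:

#include <stdio.h>
#include <stdlib.h>
#include <sys/wait.h>
#include <unistd.h>

int main(void)
{
    int *value = malloc(sizeof(int));
    *value = 1;                 /* the page is shared COW after fork() */

    pid_t pid = fork();
    if (pid == 0) {
        /* Write fault: the kernel copies the frame (do_wp_page()),
         * so only the child's copy changes. */
        *value = 2;
        _exit(0);
    }
    waitpid(pid, NULL, 0);
    printf("parent still sees %d\n", *value);   /* prints 1 */
    free(value);
    return 0;
}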

Copy-on-write code:

if (pte_present(entry)) {
    if (write_access) {
        if (!pte_write(entry))
            return do_wp_page(mm, vma, address, pte, pmd, entry);
        entry = pte_mkdirty(entry);
    }
    entry = pte_mkyoung(entry);
    ptep_set_access_flags(vma, address, pte, entry, write_access);
    update_mmu_cache(vma, address, entry);
    pte_unmap(pte);
    spin_unlock(&mm->page_table_lock);
    return VM_FAULT_MINOR;
}
Core function: do_wp_page()

This function first obtains the page frame descriptor associated with the fault. Next, it determines whether the page really has to be copied: it reads the _count field of the page descriptor, and if it equals 0 (a single owner), no copy-on-write is needed. If several processes share the page frame, the function copies the contents of the old page frame into the newly allocated one (the copy_page() macro). The physical address of the new page frame is then written into the page table and the corresponding TLB entries are invalidated. At the same time, the lru_cache_add_active() function inserts the new page frame into the swap-related data structures. Finally, do_wp_page() decreases the usage counter of old_page twice: the first decrement undoes the safety increment made before the page contents were copied; the second reflects the fact that the current process no longer owns the page frame.

Handling noncontiguous memory area accesses

Here the exception occurred in Kernel Mode and the faulting linear address is greater than TASK_SIZE. In this case, do_page_fault() checks the corresponding master kernel page table entries:

vmalloc_fault:
    {
        int index = pgd_index(address);
        unsigned long pgd_paddr;
        pgd_t *pgd, *pgd_k;
        pud_t *pud, *pud_k;
        pmd_t *pmd, *pmd_k;
        pte_t *pte_k;

        asm("movl %%cr3,%0" : "=r" (pgd_paddr));
        pgd = index + (pgd_t *)__va(pgd_paddr);
        pgd_k = init_mm.pgd + index;
        if (!pgd_present(*pgd_k))
            goto no_context;
        pud = pud_offset(pgd, address);
        pud_k = pud_offset(pgd_k, address);
        if (!pud_present(*pud_k))
            goto no_context;
        pmd = pmd_offset(pud, address);
        pmd_k = pmd_offset(pud_k, address);
        if (!pmd_present(*pmd_k))
            goto no_context;
        set_pmd(pmd, *pmd_k);
        pte_k = pte_offset_kernel(pmd_k, address);
        if (!pte_present(*pte_k))
            goto no_context;
        return;
    }
do_page_fault() loads the physical address of the current process's Page Global Directory, stored in the cr3 register, into the local variable pgd_paddr; assigns the linear address corresponding to pgd_paddr to the local variable pgd; and assigns the linear address of the master kernel Page Global Directory to the local variable pgd_k.

If the master kernel Page Global Directory entry corresponding to the faulting linear address is null, the function jumps to the code at the no_context label. Otherwise, the function checks the master kernel Page Upper Directory entry and the master kernel Page Middle Directory entry corresponding to the faulting address; if either is null, it again jumps to no_context. Otherwise, it copies the master Page Middle Directory entry into the corresponding entry of the process's Page Middle Directory, and then repeats the check on the master page table entry.

Creating and deleting a process address space

Six typical situations in which a process obtains a new memory region:

Program execution

The exec() functions

The page fault handler

Memory mapping

IPC shared memory

The malloc() function

The fork() system call, by contrast, requires creating a complete new address space for the child process. Conversely, when a process terminates, the kernel destroys its address space.
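Of the situations above, memory mapping is the easiest to demonstrate from user space; the sketch below creates a new anonymous memory region with mmap(2) and lets demand paging populate it on first touch:

#include <stdio.h>
#include <string.h>
#include <sys/mman.h>

int main(void)
{
    size_t len = 1UL << 20;     /* 1 MB */

    /* Adds a new memory region to the process address space; no page
     * frames are assigned until the pages are actually touched. */
    char *region = mmap(NULL, len, PROT_READ | PROT_WRITE,
                        MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (region == MAP_FAILED) {
        perror("mmap");
        return 1;
    }
    memset(region, 0xAA, len);  /* each first touch raises a minor fault */
    printf("new region at %p\n", (void *)region);
    munmap(region, len);
    return 0;
}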

Creating a process address space

The vfork()/fork()/clone() system calls ====>>>

The copy_mm() function:

static int copy_mm(unsigned long clone_flags, struct task_struct *tsk)
{
    struct mm_struct *mm, *oldmm;
    int retval;

    tsk->min_flt = tsk->maj_flt = 0;
    tsk->nvcsw = tsk->nivcsw = 0;

    tsk->mm = NULL;
    tsk->active_mm = NULL;

    oldmm = current->mm;
    if (!oldmm)
        return 0;

    if (clone_flags & CLONE_VM) {
        atomic_inc(&oldmm->mm_users);
        mm = oldmm;
        /*
         * There are cases where the PTL is held to ensure no
         * new threads start up in user mode using an mm, which
         * allows optimizing out ipis; the tlb_gather_mmu code
         * is an example.
         */
        spin_unlock_wait(&oldmm->page_table_lock);
        goto good_mm;
    }

    retval = -ENOMEM;
    mm = allocate_mm();
    if (!mm)
        goto fail_nomem;

    /* Copy the current MM stuff.. */
    memcpy(mm, oldmm, sizeof(*mm));
    if (!mm_init(mm))
        goto fail_nomem;

    if (init_new_context(tsk, mm))
        goto fail_nocontext;

    retval = dup_mmap(mm, oldmm);
    if (retval)
        goto free_pt;

    mm->hiwater_rss = mm->rss;
    mm->hiwater_vm = mm->total_vm;

good_mm:
    tsk->mm = mm;
    tsk->active_mm = mm;
    return 0;

free_pt:
    mmput(mm);
fail_nomem:
    return retval;
fail_nocontext:
    /*
     * If init_new_context() failed, we cannot use mmput() to free the mm
     * because it calls destroy_context()
     */
    mm_free_pgd(mm);
    free_mm(mm);
    return retval;
}
If the CLONE_VM flag is set in the clone_flags parameter, copy_mm() hands the address space of the parent process (current) to the child.

If CLONE_VM is not set, copy_mm() creates a new address space: it allocates a new memory descriptor and copies the contents of the parent's mm into it.

It then calls mm_init() and init_new_context() for initialization;

Finally, it calls dup_mmap() to duplicate the memory regions and page tables of the parent process.

Deleting a process address space

When a process terminates, the kernel calls exit_mm() to release its address space:

void exit_mm(struct task_struct *tsk)
{
    struct mm_struct *mm = tsk->mm;

    mm_release(tsk, mm);
    if (!mm)    /* kernel thread? */
        return;
    /*
     * Serialize with any possible pending coredump.
     * We must hold mmap_sem around checking core_waiters
     * and clearing tsk->mm. The core-inducing thread
     * will increment core_waiters for each thread in the
     * group with ->mm != NULL.
     */
    down_read(&mm->mmap_sem);
    if (mm->core_waiters) {
        up_read(&mm->mmap_sem);
        down_write(&mm->mmap_sem);
        if (!--mm->core_waiters)
            complete(mm->core_startup_done);
        up_write(&mm->mmap_sem);

        wait_for_completion(&mm->core_done);
        down_read(&mm->mmap_sem);
    }
    atomic_inc(&mm->mm_count);
    if (mm != tsk->active_mm)
        BUG();
    /* more a memory barrier than a real lock */
    task_lock(tsk);
    tsk->mm = NULL;
    up_read(&mm->mmap_sem);
    enter_lazy_tlb(mm, current);
    task_unlock(tsk);
    mmput(mm);
}
The mm_release() function wakes up any process sleeping on the tsk->vfork_done completion.

If the process being terminated is not a kernel thread, the exit_mm() function must release the memory descriptor and all related data structures. It first checks whether the mm->core_waiters flag is set: if it is, the process is dumping all of its memory contents to a core file. To avoid mixups in the dump file, the function uses the mm->core_done and mm->core_startup_done completions to serialize the execution of the lightweight processes that share the same memory descriptor mm.

Next, the function increments the memory descriptor's main usage counter, resets the mm field of the process descriptor, and puts the processor into lazy TLB mode.

Finally, mmput() is called to release the Local Descriptor Table, the memory region descriptors, and the page tables. The memory descriptor itself is not released, because exit_mm() incremented its main usage counter; the finish_task_switch() function releases it when the terminating process is removed from the local CPU.

Heap management

Each process owns a special memory region, the heap, which is used to satisfy the process's dynamic memory requests. The start_brk and brk fields of the memory descriptor delimit the starting and ending addresses of the heap, respectively.

Operation function    Description

malloc(size)    Requests size bytes of dynamic memory; on success, returns the linear address of the first byte.

calloc(n, size)    Requests memory for an array of n elements of size bytes each; on success, returns the linear address of the first element.

realloc(ptr, size)    Changes the size of a memory area previously allocated by malloc()/calloc().

free(addr)    Releases the memory region allocated by malloc() or calloc() whose starting address is addr.

brk(addr)    Modifies the heap size directly; the addr parameter specifies the new value of current->mm->brk, and the return value is the new ending address of the memory region.

sbrk(incr)    Similar to brk(), except that the incr parameter specifies the increment or decrement of the heap size in bytes.
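As a short usage sketch of the last two entries (on modern glibc, sbrk()/brk() are mostly of historical interest, since malloc() manages the heap itself):

#include <stdio.h>
#include <unistd.h>

int main(void)
{
    void *start = sbrk(0);          /* current program break (mm->brk) */
    printf("break at          %p\n", start);

    if (sbrk(4096) == (void *)-1) { /* grow the heap by one page */
        perror("sbrk");
        return 1;
    }
    printf("break after +4096 %p\n", sbrk(0));

    sbrk(-4096);                    /* shrink it back */
    printf("break restored    %p\n", sbrk(0));
    return 0;
}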


This concludes the chapter on the process address space.

Open questions:

1. During copy-on-write, what exactly is the mechanism by which page frames are allocated to the parent and child processes?

2. Does each memory region subdivide the linear address space into code segment, data segment, stack, and so on?

3. Is the layout of memory regions allocated dynamically or statically? That is, are additional regions allocated as memory usage grows, or does the system allocate the regions in advance when the program is executed?

4. As mentioned above, do_wp_page() decreases the usage counter of old_page twice during copy-on-write: the first decrement undoes the safety increment made before the page contents were copied, and the second reflects the fact that the current process no longer owns the page frame. Why is the safety increment needed here?
