Part 3: Linux Page Fault Handling
Reposted from: http://blog.csdn.net/cxylaf/article/details/1626534
1. Demand paging faults: The pages in a process's linear address space do not have to reside in memory. For example, an allocation request made by a process is not satisfied immediately: only space in the vm_area_struct is reserved. A page may also have been swapped out to backing storage, or a write may hit a read-only page (COW). Linux uses demand paging to resolve hardware page fault exceptions, together with a prefetching page-in policy. Page faults are divided into major and minor faults; a fault that requires the time-consuming step of reading data from disk is a major fault.
Each CPU architecture provides a do_page_fault(struct pt_regs *regs, error_code) function to handle page faults. The function is given a good deal of information: the address at which the exception occurred, whether the page was not found or a protection error occurred, whether the access was a read or a write, and whether it came from user space or kernel space. It is responsible for determining the type of the exception and how the architecture-independent code should handle it. The figure below shows the Linux page fault handling flow:
Figure: Linux page fault handling
Once the exception handler determines that the exception is a valid fault in a valid memory region, the architecture-independent function handle_mm_fault() is called. If the required page table entries do not exist, it allocates them, and then it calls handle_pte_fault().
The first step calls pte_present() to check the PTE flag bits and determine whether the page is in memory, and then calls pte_none() to check whether the PTE has been allocated. If the PTE has not been allocated, do_no_page() is called to handle demand allocation; otherwise the page has been swapped out to disk and do_swap_page() is called to handle demand paging. As a rare exception, a swapped-out page that belongs to a virtual file is handled by do_no_page().
The second step determines whether the page is being written. If the PTE is write-protected, do_wp_page() is called, because the page is a copy-on-write (COW) page. A COW page is recognized as follows: the VMA that contains the page is flagged writable, but the corresponding PTE is not. If it is not a COW page, it is usually just marked dirty because it has been written.
The third step covers the case where the page has been read and is in memory, but a fault still occurred. This can happen on architectures that do not have a three-level page table; in that case the PTE is simply established and marked young.
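The three checks described above can be sketched as a small decision function. This is a user-space model for illustration only, not kernel code: struct pte_model, dispatch_fault() and the "mark_young" label are hypothetical names, and the real handle_pte_fault() works on hardware PTE bits under locking.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

/* Hypothetical, simplified view of a page table entry. */
struct pte_model {
    bool present;   /* page is in memory */
    bool none;      /* entry was never filled in */
    bool writable;  /* write access permitted by the PTE */
};

/* Returns the name of the handler that the text says would run. */
static const char *dispatch_fault(struct pte_model pte, bool write_access)
{
    if (!pte.present) {
        if (pte.none)
            return "do_no_page";    /* first touch: allocate on demand */
        return "do_swap_page";      /* entry refers to a swapped-out page */
    }
    if (write_access && !pte.writable)
        return "do_wp_page";        /* write to a write-protected (COW) page */
    return "mark_young";            /* present page: just update the PTE */
}
```

The model omits the rare virtual-file case mentioned in the text, which would route a non-present, allocated entry to do_no_page() instead.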
2. Demand allocation: The first time a page is accessed, the page must first be allocated, and its data is generally filled in by do_no_page(). If the parent VMA's vm_area_struct->vm_ops provides a nopage() function, that function is used to fill in the data; otherwise do_anonymous_page() is called to fill in an anonymous page. When a file or device is mapped, the nopage() function depends on the mapping: for a regular file it is filemap_nopage(), and for a virtual file it is shmem_nopage(). Each device driver provides its own nopage() function, which returns a struct page.
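The choice between a nopage() callback and the anonymous path can be illustrated with a function-pointer table. The sketch below is a user-space model under stated assumptions: every type and function name ending in _model is hypothetical, and the real callbacks take more arguments and return a struct page.

```c
#include <assert.h>
#include <stddef.h>

struct page_model { int filled_by; };  /* 1 = nopage callback, 2 = anonymous */

struct vm_ops_model {
    /* Optional callback supplied by a file system or device driver. */
    void (*nopage)(struct page_model *page);
};

struct vma_model { const struct vm_ops_model *vm_ops; };

static void file_nopage_model(struct page_model *page) { page->filled_by = 1; }

static void anonymous_page_model(struct page_model *page) { page->filled_by = 2; }

/* Fill a freshly allocated page the way the text describes for do_no_page(). */
static void no_page_model(const struct vma_model *vma, struct page_model *page)
{
    assert(vma != NULL && page != NULL);
    if (vma->vm_ops != NULL && vma->vm_ops->nopage != NULL)
        vma->vm_ops->nopage(page);      /* mapping provides its own filler */
    else
        anonymous_page_model(page);     /* no callback: anonymous memory */
}
```

Keeping the callback in a per-VMA operations table is what lets files, shared memory and drivers each plug in a different filler without the fault path knowing about them.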
3. Demand paging (swap-in): After a page has been swapped out to backing storage, the function do_swap_page() is responsible for reading it back into memory, as described later. The information in the PTE is sufficient to find the swapped-out page. When a page is swapped out, it is typically placed in the swap cache. If the page is still in the swap cache when the fault occurs, the kernel simply increments the page count, places the page in the process page table, and records a minor page fault. If the page exists only on disk, Linux calls swapin_readahead() to read it together with several of the pages that follow it.
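The difference between the two cases can be sketched with a toy swap cache. This is a user-space model, not the kernel implementation: NSLOTS, READAHEAD and struct swapin_model are hypothetical, and the read-ahead window of 4 slots is an arbitrary value chosen for the example.

```c
#include <assert.h>
#include <stdbool.h>

#define NSLOTS 16        /* hypothetical swap area size, in page slots */
#define READAHEAD 4      /* hypothetical number of slots read per disk access */

struct swapin_model {
    bool in_cache[NSLOTS];  /* slot contents currently held in the swap cache */
    int minor_faults;       /* faults served without disk I/O */
    int major_faults;       /* faults that had to read from disk */
};

static void swap_in_model(struct swapin_model *m, int slot)
{
    assert(slot >= 0 && slot < NSLOTS);
    if (m->in_cache[slot]) {
        m->minor_faults++;          /* page found in the swap cache */
        return;
    }
    /* Read the requested slot and the slots that follow it. */
    for (int i = slot; i < slot + READAHEAD && i < NSLOTS; i++)
        m->in_cache[i] = true;
    m->major_faults++;
}
```

In this model a fault on slot 2 is a major fault that also brings in slots 3 to 5, so a later fault on slot 3 is served from the cache as a minor fault.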
4. Page frame reclaiming: With the exception of the slab allocator, all pages in use by the system are stored in the page cache and linked through page->lru. Slab pages are not stored in the page cache because it is difficult to age a page based on the objects the slab keeps in it. There is no way to map a struct page to its PTEs other than searching the page tables of every process, and walking page tables is expensive. If the page cache contains a large number of pages mapped by processes, the system walks the process page tables and swaps pages out with swap_out() until enough pages are free. Shared pages cause problems for swap_out(): if a page is shared and a swap entry has already been assigned, the PTE is filled with the information needed to locate the page in the swap area again and the reference count is decremented by 1. Only when the reference count reaches 0 can the page be swapped out.
The memory and disk caches request more and more pages but do not know when to release them. Demand paging gives a process a new page when it faults, but it has no way of forcing a process to release pages it no longer uses. The page frame reclaiming algorithm (PFRA) reclaims pages from user processes and kernel caches and returns them to the free block lists of the buddy system. PFRA must reclaim pages when the system's free memory reaches a minimum, and the pages it reclaims must be non-free pages.
The pages in the system can be divided into four types:
1) Unreclaimable: free pages, reserved pages with the PG_reserved flag set, pages dynamically allocated by the kernel, process kernel stack pages, temporarily locked pages with the PG_locked flag set, and memory-locked pages with the VM_LOCKED flag set.
2) Swappable: anonymous pages in the user address space (such as the user stack) and mapped pages of the tmpfs file system (such as IPC shared memory pages). These pages are saved in the swap space.
3) Syncable: mapped pages in the user address space, page cache pages that hold disk file data, block device buffer pages, and pages of some disk caches (such as the inode cache). If necessary, they are synchronized with their image on disk.
4) Discardable: unused pages in the memory caches (such as pages in the slab allocator) and unused pages of the dentry cache.
PFRA is an empirical rather than a theoretical algorithm. Its design principles are:
1) Free harmless pages first. Pages in the disk and memory caches that no process references should be released before pages that belong to user address spaces.
2) Make all pages of a user-mode process reclaimable.
3) To reclaim a page shared by several processes, first clear every process page table entry that references the page, then reclaim it.
4) Reclaim only pages that are "not in use". PFRA uses the LRU lists to divide pages into in-use and unused, and reclaims only pages in the unused state. Linux uses the Accessed bit in the PTE to implement a non-strict LRU algorithm.
Page reclaiming is typically performed in three cases:
1) Low-on-memory reclaiming, when available memory is low (this usually happens when a memory request fails).
2) Reclaiming when the kernel enters the suspend-to-disk state.
3) Periodic reclaiming: kernel threads are activated periodically and reclaim pages when necessary.
Low-on-memory reclaiming is triggered in the following situations:
1) __getblk() calls grow_buffers() and fails to allocate a new buffer page;
2) create_empty_buffers() calls alloc_page_buffers() and fails to allocate temporary buffer heads for a page;
3) __alloc_pages() fails to allocate a group of contiguous page frames in the given memory zones.
Two kinds of kernel threads are involved in periodic reclaiming:
1) The kswapd kernel threads check whether the number of free pages in a memory zone has fallen below the pages_high threshold;
2) The events kernel threads serve the predefined work queue, in which PFRA periodically schedules a task that reclaims all free slabs in the slab allocator.
All pages that belong to user-space processes and to the page cache are kept on either the active list or the inactive list, collectively known as the LRU lists. Each zone descriptor includes two lists, active_list and inactive_list, which link these pages. nr_active and nr_inactive hold the number of pages on each list, and lru_lock is used for synchronization. The PG_lru flag in the page descriptor marks whether a page belongs to an LRU list, PG_active marks whether the page is on the active list, and the lru field links the page into its list. Pages move between the active and inactive lists dynamically according to how recently they were accessed; the PG_referenced flag exists for this purpose.
The functions that handle the LRU lists include add_page_to_active_list(), add_page_to_inactive_list(), activate_page(), lru_cache_add(), and lru_cache_add_active(); these functions are relatively simple. shrink_active_list() moves pages from the active list to the inactive list. It is executed when shrink_zone() reclaims pages of the user address space.
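How PG_active and PG_referenced could cooperate to move a page between the two lists can be sketched with two flags. This is a simplified user-space model under my own assumptions: the "two accesses to promote" rule and the names mark_accessed_model() and scan_model() are illustrative, and the source text does not spell out the exact transitions.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

struct lru_page_model {
    bool active;      /* models PG_active: page is on the active list */
    bool referenced;  /* models PG_referenced: page was accessed recently */
};

/* Called on an access: a second access to an inactive page promotes it. */
static void mark_accessed_model(struct lru_page_model *p)
{
    assert(p != NULL);
    if (!p->active && p->referenced) {
        p->active = true;       /* second access: move to the active list */
        p->referenced = false;
    } else {
        p->referenced = true;   /* remember that the page was accessed */
    }
}

/* Called by the reclaim scan: a page not accessed since the last scan
 * is demoted (or stays) on the inactive list. */
static void scan_model(struct lru_page_model *p)
{
    assert(p != NULL);
    if (p->referenced)
        p->referenced = false;  /* give the page one more chance */
    else
        p->active = false;      /* not accessed recently: inactive list */
}
```

Requiring two signals before a page changes list is what makes the scheme an approximation of LRU rather than a strict ordering by access time.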
5. Swap areas: The system can have up to MAX_SWAPFILES swap areas, and each one can be placed on a disk partition or in an ordinary file. Each swap area consists of a sequence of page slots and has a swap_header structure that describes the swap area version and other information. Each swap area is made up of one or more swap_extent structures, each of which describes a physically contiguous region. A swap area on a disk partition has only one swap_extent, whereas a swap area in a file consists of several, because a file is not necessarily stored in contiguous disk blocks. The mkswap command creates a swap area.
Figure: Swap area structure
Figure: Swapped-out page identifier structure
The swp_type() and swp_offset() functions extract the swap area number and the page slot index from a swapped-out page identifier, and swp_entry(type, offset) builds the identifier from those two values. The lowest bit is always cleared to 0, which indicates that the page is not in RAM. A swap area holds at most 2^24 slots (64 GB). The first usable slot index is 1, and the identifier can never be all zeros.
A page may be shared by several processes, and it may be swapped out of one process's address space while still being present in physical memory, so a page may be swapped out several times. Physically, however, it is written to the swap area only the first time; each later swap-out only increments the swap_map reference count. swap_duplicate(swp_entry_t entry) is the function used when an attempt is made to swap out a page that has already been swapped out.
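The packing and unpacking of a swapped-out page identifier can be sketched as bit operations. The layout used here is an assumption for illustration: bit 0 always 0, bits 1 to 7 for the swap area number, and bits 8 to 31 for the 24-bit slot index, which matches the 2^24 slot limit in the text (2^24 slots of 4 KB each is 64 GB). The type swp_entry_model and the three helper names are hypothetical, not the kernel's own macros.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed layout: bit 0 = 0, bits 1..7 = area number, bits 8..31 = slot. */
#define SWP_TYPE_BITS   7u
#define SWP_OFFSET_BITS 24u

typedef struct { uint32_t val; } swp_entry_model;

/* Build an identifier from a swap area number and a page slot index. */
static swp_entry_model swp_entry_make(uint32_t type, uint32_t offset)
{
    assert(type < (1u << SWP_TYPE_BITS));
    assert(offset < (1u << SWP_OFFSET_BITS));
    swp_entry_model e = { (offset << 8) | (type << 1) };
    return e;
}

/* Extract the swap area number. */
static uint32_t swp_type_of(swp_entry_model e)
{
    return (e.val >> 1) & 0x7fu;
}

/* Extract the page slot index. */
static uint32_t swp_offset_of(swp_entry_model e)
{
    return e.val >> 8;
}
```

Because slot 0 is never handed out, an identifier for area 0 still has a nonzero slot field, so a valid identifier is never all zeros and can be told apart from an empty PTE.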
6. Swap cache: A race condition arises when several processes swap in the same shared anonymous page, or when a process swaps in a page that PFRA is in the middle of swapping out. The swap cache was introduced to solve this synchronization problem. The PG_locked flag ensures that concurrent swap operations on a page all act on a single page frame, which avoids the race.
7. Page reclaim algorithm: The figure below is the function call graph for page reclaiming in the various situations. The functions at the bottom of the call graph are cache_reap(), shrink_slab(), and shrink_list(). cache_reap() periodically reclaims unused slabs from the slab allocator. shrink_slab() reclaims pages from the disk caches. shrink_list() is the core function of page reclaiming; in the latest code it has been renamed shrink_page_list(), and it is the focus of the explanation that follows. Also in the latest code, shrink_caches() in the figure is named shrink_zones(), and shrink_cache() is named shrink_inactive_list(). The other functions are unchanged.
Figure: PFRA function call relationships
Low-memory page reclamation: As the figure shows, when a memory allocation fails the kernel calls free_more_memory(). This function first calls wakeup_bdflush() to wake the pdflush kernel threads and trigger write operations: 1024 dirty pages are written from the disk page cache to disk, which releases the page frames occupied by buffers, buffer heads, and other VFS data structures. It then invokes the sched_yield() system call so that the pdflush threads get a chance to run. Finally it loops over all memory nodes in the system and calls try_to_free_pages() on the low-memory zones (ZONE_DMA and ZONE_NORMAL) of each node.
try_to_free_pages(struct zone **zones, gfp_t gfp_mask): The goal of this function is to free at least 32 page frames by calling shrink_slab() and shrink_zones() in a loop, raising the priority on each iteration. The priority value starts at 12 (the lowest priority) and ends at 0 (the highest). If 32 pages have still not been freed after 13 iterations, PFRA applies out-of-memory protection: it calls out_of_memory() to select a process and reclaim all of its pages.
shrink_zones(int priority, struct zone **zones, struct scan_control *sc): This function invokes shrink_zone() on every zone in the zones list. Before calling shrink_zone(), it updates prev_priority in the zone descriptor with the value of sc->priority. If the zone->all_unreclaimable field is not 0 and the priority is not 12, no pages are reclaimed from that zone.
shrink_zone(int priority, struct zone *zone, struct scan_control *sc): This function attempts to reclaim 32 pages. It reaches the target by calling shrink_active_list() and shrink_inactive_list() in a loop. The flow is as follows:
1) atomic_inc(&zone->reclaim_in_progress) increments the reclaim count of the zone;
2) Increase zone->nr_scan_active. Depending on the priority, the increment ranges from zone->nr_active/2^12 to zone->nr_active/2^0. If zone->nr_scan_active >= sc->swap_cluster_max, the nr_active variable is assigned that value and zone->nr_scan_active is reset to 0; otherwise nr_active = 0;
3) zone->nr_scan_inactive and nr_inactive are handled in the same way;
4) While nr_active and nr_inactive are not both 0, repeat steps 5 and 6:
5) If nr_active is not 0, move some pages from the active list to the inactive list: nr_to_scan = min(nr_active, (unsigned long)sc->swap_cluster_max); nr_active -= nr_to_scan; shrink_active_list(nr_to_scan, zone, sc, priority);
6) If nr_inactive is not 0, reclaim pages from the inactive list: nr_to_scan = min(nr_inactive, (unsigned long)sc->swap_cluster_max); nr_inactive -= nr_to_scan; nr_reclaimed += shrink_inactive_list(nr_to_scan, zone, sc);
7) atomic_dec(&zone->reclaim_in_progress) decrements the reclaim count, and the function returns the number of reclaimed pages, nr_reclaimed.
shrink_inactive_list(unsigned long max_scan, struct zone *zone, struct scan_control *sc): This function takes pages from the zone's inactive list, puts them on a temporary list, and calls shrink_page_list() to reclaim each page on that list. The main steps of shrink_inactive_list() are:
1) Call lru_add_drain() to move the pages held in the current CPU's lru_add_pvecs and lru_add_active_pvecs pagevec structures onto the inactive and active lists respectively;
2) Acquire the LRU lock: spin_lock_irq(&zone->lru_lock);
3) Scan up to max_scan pages: increment the usage count of each page, check whether the page is being released to the buddy system, and move the page onto a temporary list;
4) Subtract the number of pages moved to the temporary list from zone->nr_inactive;
5) Increase the zone->pages_scanned count;
6) Release the LRU lock: spin_unlock_irq(&zone->lru_lock);
7) Call shrink_page_list(&page_list, sc) to reclaim the pages on the temporary list;
8) Increase the nr_reclaimed count;
9) Acquire the LRU lock: spin_lock(&zone->lru_lock);
10) Put the pages that shrink_page_list(&page_list, sc) did not reclaim back on the active and inactive lists. shrink_page_list() may set the PG_active flag during reclaiming, so the active list has to be considered as well;
11) If the number of pages scanned, nr_scanned, is less than max_scan, repeat steps 3 to 10;
12) Return the number of pages reclaimed.
shrink_page_list(struct list_head *page_list, struct scan_control *sc): This function does the real page reclaiming work. Its flow is as follows:
Figure: shrink_page_list() page reclaim logic
1) Call cond_resched() for conditional rescheduling;
2) Iterate over each page in page_list: remove the page descriptor from the list and try to reclaim the page; if reclaiming fails, insert the page into a local list. The per-page processing, as described in the flowchart, is:
- Call cond_resched() for conditional rescheduling;
- Take the first page from the LRU list and remove it from that list;
- If the page is locked, add it to the temporary list;
- If the page cannot be reclaimed in this pass and it is mapped in a process page table, skip the page;
- If the page is under writeback, skip it;
- If the page has been referenced and its mapping is in use, skip the page and activate it so that it is placed on the active list;
- If the page is anonymous and not in the swap cache, call add_to_swap() to allocate swap space for the page and add it to the swap cache;
- If the page is mapped in a process address space and its mapping address is not null, call try_to_unmap() to remove the page table mappings of the page;
- If the page is dirty and not referenced, writing is permitted, and file system operations are allowed, call pageout() to write the page out.
3) When the loop ends, move the pages on the local list back to page_list;
4) Return the number of reclaimed pages.
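The scan-count bookkeeping that shrink_zone() performs in steps 1 to 3 of its flow can be sketched as follows. This is a user-space model with assumptions labeled: struct zone_model and scan_target_model() are hypothetical names, the batch parameter stands in for sc->swap_cluster_max, and the "+ 1" (which keeps a small zone from never accumulating work) comes from my reading of the 2.6 source rather than from the text above.

```c
#include <assert.h>

struct zone_model {
    unsigned long nr_active;       /* pages on the active list */
    unsigned long nr_scan_active;  /* scan work accumulated so far */
};

/*
 * Returns how many active pages to scan in this pass. Work accumulates in
 * nr_scan_active and is released only once it reaches the batch size.
 */
static unsigned long scan_target_model(struct zone_model *z, int priority,
                                       unsigned long batch)
{
    assert(priority >= 0 && priority <= 12);
    /* Priority 12 adds nr_active/2^12, priority 0 adds all of nr_active. */
    z->nr_scan_active += (z->nr_active >> priority) + 1;
    if (z->nr_scan_active < batch)
        return 0;                   /* too little work: defer it */
    unsigned long n = z->nr_scan_active;
    z->nr_scan_active = 0;
    return n;
}
```

With 40960 active pages and a batch of 32, each pass at priority 12 adds 11 to the accumulator, so the first two passes scan nothing and the third releases 33 pages of work; at priority 0 a single pass asks for the whole list.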
Processing a page frame can end in only three ways:
1) The page is released to the buddy system by calling free_cold_page(), so it has been effectively reclaimed;
2) The page is not reclaimed and is reinserted into the page_list list. The function considers that the page may be reclaimed in the near future, so it clears the PG_active flag so that the page is later added to the inactive list;
3) The page is not reclaimed and is reinserted into the page_list list. The function considers that the page will not be reclaimable in the foreseeable future, so it sets the PG_active flag so that the page is later added to the active list.
When an anonymous page is reclaimed, the page must be added to the swap cache and a new page slot must be reserved for it in the swap area. If the page is in the user-mode address space of some processes, shrink_page_list() calls try_to_unmap() to locate all the process PTEs that refer to the page frame, and proceeds only if that call succeeds. If the page is dirty, it must be written to disk before it can be reclaimed. This requires a call to pageout(), and reclaiming continues only if pageout() completes the write quickly or does not have to write at all. If the page holds VFS buffers, try_to_release_page() is called to release the buffer heads.
Finally, if all goes well, shrink_page_list() checks the page's reference count. If the value is exactly 2, one reference belongs to the page cache or the swap cache and the other to PFRA itself (the count was incremented in shrink_inactive_list()). In that case the page can be reclaimed, provided it is not dirty. Depending on the page's PG_swapcache flag, the page is removed from the page cache or the swap cache, and then free_cold_page() is called.
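The three possible outcomes and the final reference-count check can be sketched as a small decision function. This is a deliberately reduced user-space model: enum reclaim_result_model, struct reclaim_page_model and final_check_model() are hypothetical names, and treating "recently referenced" as the only reason to activate a page is a simplifying assumption.

```c
#include <assert.h>
#include <stdbool.h>

enum reclaim_result_model {
    RECLAIM_FREED,          /* outcome 1: released to the buddy system */
    RECLAIM_KEEP_INACTIVE,  /* outcome 2: retry on a later pass */
    RECLAIM_KEEP_ACTIVE     /* outcome 3: put back on the active list */
};

struct reclaim_page_model {
    int  count;       /* usage counter of the page frame */
    bool dirty;       /* page still has to be written to disk */
    bool referenced;  /* page was accessed recently */
};

static enum reclaim_result_model final_check_model(struct reclaim_page_model p)
{
    assert(p.count >= 0);
    if (p.referenced)
        return RECLAIM_KEEP_ACTIVE;     /* unlikely to be reclaimable soon */
    if (p.count == 2 && !p.dirty)
        return RECLAIM_FREED;           /* only the cache and PFRA hold it */
    return RECLAIM_KEEP_INACTIVE;       /* still in use or still dirty */
}
```

The check for exactly 2 matters because any higher count means some other user still holds the page frame, so freeing it would be unsafe.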
Swapping out pages:
add_to_swap(struct page *page, gfp_t gfp_mask): The first step of a swap-out is to allocate a page slot for the page and insert the page into the swap cache. The steps are as follows:
1) get_swap_page() reserves a page slot for the page being swapped out;
2) Call __add_to_swap_cache(), passing the slot index, the page descriptor, and the GFP flags, to add the page to the swap cache and mark it dirty;
3) Set the page's PG_uptodate and PG_dirty flags so that shrink_page_list() is forced to write the page to disk;
4) Return.
try_to_unmap(struct page *page, int migration): This is the second step of a swap-out and is called after add_to_swap(). The function finds every entry in the user page tables that points to the anonymous page frame and writes the swapped-out page identifier into the PTE.
pageout(): The third step of a swap-out writes the dirty page to disk:
1) Check that the page is in the page cache or the swap cache and that it is held only by the page cache or swap cache; if the check fails, return PAGE_KEEP;
2) Check whether the writepage method of the address_space object is defined; if not, return PAGE_ACTIVATE;
3) Check whether the current process can send write requests to the request queue of the block device that corresponds to the mapped address_space object;
4) SetPageReclaim(page) sets the page's reclaim flag;
5) Call mapping->a_ops->writepage(page, &wbc) to perform the write; if it fails, clear the reclaim flag;
6) If PageWriteback(page) is false, the page is not under writeback, so clear the reclaim flag with ClearPageReclaim(page);
7) Return success.
For swap areas, the writepage method is implemented by swap_writepage(). Its flow is as follows:
1) Check whether any other process references the page; if not, remove the page from the swap cache and return 0;
2) get_swap_bio() allocates and initializes a bio descriptor. It finds the swap area from the swapped-out page identifier and then walks the swap extent list to find the starting disk sector of the page slot. The bio descriptor contains a request for a single page, and its completion method is set to end_swap_bio_write();
3) set_page_writeback(page) sets the page's writeback flag, and unlock_page() unlocks the page;
4) submit_bio(rw, bio) submits the bio descriptor to the block device for the write operation;
5) Return.
Once the write operation is complete, end_swap_bio_write() is executed. The function wakes up the processes that are waiting for the page's PG_writeback flag to be cleared, clears the PG_writeback flag, and releases the bio descriptor.
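The early exits of pageout() can be summarized as a decision function over three conditions. This is a user-space sketch, not the kernel routine: the enum values, struct pageout_input and pageout_decide() are hypothetical, and returning "keep" when the device queue cannot accept writes is an assumption, since the text does not name the result for that case.

```c
#include <assert.h>
#include <stdbool.h>

enum pageout_model { PO_KEEP, PO_ACTIVATE, PO_SUCCESS };

struct pageout_input {
    bool only_cache_owner;  /* page is held only by the page or swap cache */
    bool has_writepage;     /* address space defines a writepage method */
    bool queue_writable;    /* block device queue accepts write requests */
};

static enum pageout_model pageout_decide(struct pageout_input in)
{
    if (!in.only_cache_owner)
        return PO_KEEP;     /* step 1: someone else still uses the page */
    if (!in.has_writepage)
        return PO_ACTIVATE; /* step 2: nothing can write this page out */
    if (!in.queue_writable)
        return PO_KEEP;     /* step 3: assumed result, not named in the text */
    return PO_SUCCESS;      /* steps 4 to 7: the write request is issued */
}
```

The model leaves out the reclaim-flag handling of steps 4 to 6, which only matters once the write has actually been submitted.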
Swapping in pages: A swap-in occurs when a process accesses a page that has been swapped out to disk. The page fault handler swaps the page in when all of the following conditions hold:
1) The page that contains the faulting address is a valid page in a memory region of the current process;
2) The page is not in memory, that is, the Present flag in the PTE is cleared to 0;
3) The PTE associated with the page is not null, but its Dirty bit is cleared to 0, which means that the PTE contains a swapped-out page identifier.
When these conditions are met, handle_pte_fault() invokes the do_swap_page() function to bring in the requested page.
do_swap_page(struct mm_struct *mm, struct vm_area_struct *vma, unsigned long address, pte_t *page_table, pmd_t *pmd, int write_access, pte_t orig_pte): The function's processing flow is as follows:
1) entry = pte_to_swp_entry(orig_pte) obtains the swap slot information;
2) page = lookup_swap_cache(entry) checks whether the page that corresponds to the swap slot is already in the swap cache; if so, skip to step 6;
3) Call swapin_readahead(entry, address, vma) to read a group of pages from the swap area; it calls read_swap_cache_async() to read each page;
4) Call read_swap_cache_async() again for the page that caused the fault, because the swapin_readahead() call may have failed. If it succeeded, read_swap_cache_async() finds the page in the swap cache and returns quickly;
5) If the page is still not in the swap cache, another kernel control path may already have swapped it in. Compare the contents of page_table with orig_pte; if they differ, the page has already been swapped in and the function returns;
6) If the page is in the swap cache, call mark_page_accessed() and lock the page;
7) pte_offset_map_lock(mm, pmd, address, &ptl) obtains the PTE contents for page_table, which are compared with orig_pte to determine whether another kernel control path is swapping the page in;
8) Test the PG_uptodate flag; if it is not set, return an error;
9) Increment the mm->anon_rss count;
10) mk_pte(page, vma->vm_page_prot) creates the PTE and sets its flags, and the PTE is inserted into the process page table;
11) page_add_anon_rmap() inserts the anonymous page into the reverse-mapping data structures;
12) swap_free(entry) releases the page slot;
13) Check whether the swap cache is at least 50% full; if it is, and the page is held only by the process that triggered the fault, remove the page from the swap cache;
14) If the write_access flag is 1, the access is a copy-on-write, so call do_wp_page() to make a copy of the page;
15) Release the page lock and the page cache reference, and return the result.
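The comparison against orig_pte in the swap-in flow is a "recheck before install" pattern: the faulting path remembers the PTE value it saw, may sleep while the page is read, and installs the new mapping only if the entry is still unchanged. The sketch below is a single-threaded user-space model with hypothetical names (pte_model_t, install_if_unchanged()); in the kernel the comparison is done while holding the page table lock.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

typedef uint32_t pte_model_t;

/* Returns true when the caller installed new_pte, false if it lost the race. */
static bool install_if_unchanged(pte_model_t *slot, pte_model_t orig,
                                 pte_model_t new_pte)
{
    assert(slot != NULL);
    if (*slot != orig)
        return false;   /* another path already swapped the page in */
    *slot = new_pte;    /* entry unchanged: map the page that was read */
    return true;
}
```

If two paths fault on the same swapped-out page, the first one to install its PTE wins; the second sees a changed entry and simply returns, so the page is mapped once.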
Linux Learning Summary: Page Faults and Swapping (repost)