Linux page reclaim

Uses of pages
In some previous articles, we saw that the Linux kernel allocates pages in many situations:
1. Kernel code may call functions such as alloc_pages to allocate pages directly from the buddy system, which manages physical pages through the free_area free lists of each zone (see "Linux kernel memory management analysis"). For example, a driver may allocate its buffers this way. When a process is created, the kernel also allocates two consecutive pages to serve as the process's thread_info structure and kernel stack. Allocating pages from the buddy system is the most basic allocation method; every other form of memory allocation is built on top of it;
2. Many kernel objects are managed with the slab mechanism (see "Linux slub allocator analysis"). A slab is essentially an object pool that "formats" pages into objects and keeps them in the pool for its users. When the objects in a slab run out, the slab mechanism automatically allocates pages from the buddy system and "formats" them into new objects;
3. Disk cache (see "Linux kernel file read/write analysis"). When a file is read or written, pages are allocated from the buddy system for the disk cache, and the file data on disk is loaded into the corresponding cache pages;
4. Memory mapping. So-called memory mapping means mapping memory pages into the address space of a user process. Each vma in a process's task_struct->mm structure represents one such mapping; the mapping actually takes effect only when the user program accesses the mapped address, at which point a page fault causes a page to be allocated and the page tables to be updated (see "Linux kernel memory management analysis");

Overview of page reclaim
Where there is page allocation, there is also page reclaim. Pages are reclaimed in two ways:
The first is active release. Just as a user program releases memory that it obtained with malloc by calling free, the user of a page knows exactly when the page is needed and when it no longer is.
Pages obtained through the first two allocation methods above are generally released this way by kernel code. Pages allocated directly from the buddy system are released by their user with functions such as free_pages, and go straight back to the buddy system; objects allocated from a slab (with kmem_cache_alloc) are likewise released by their user (with kmem_cache_free).
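
As a rough illustration, the two "actively released" styles pair up with their allocation calls as follows. This is a sketch of kernel code, not taken from any particular driver; the object type and cache name are made up for the example:

    #include <linux/gfp.h>
    #include <linux/slab.h>

    struct my_obj { int data; };            /* hypothetical object type */

    static void alloc_release_demo(void)
    {
            /* 1. Directly from the buddy system: order 1, i.e. 2^1 = 2
                  contiguous pages, as for a thread_info + kernel stack. */
            struct page *pages = alloc_pages(GFP_KERNEL, 1);
            if (pages)
                    __free_pages(pages, 1);   /* back to the buddy system */

            /* 2. From a slab object pool. */
            struct kmem_cache *cache =
                    kmem_cache_create("my_obj", sizeof(struct my_obj),
                                      0, 0, NULL);
            struct my_obj *obj = kmem_cache_alloc(cache, GFP_KERNEL);
            if (obj)
                    kmem_cache_free(cache, obj);
            kmem_cache_destroy(cache);
    }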

The other way is the page frame reclaiming algorithm (PFRA) provided by the Linux kernel. Here the page's user treats the page as a cache that improves the system's efficiency. A cache is nice to have, but an empty cache causes no errors; only efficiency suffers. The user of such a page does not know when a cached page is best kept and when it is best reclaimed; the PFRA takes care of that.
Put simply, the PFRA's job is to reclaim the pages that can be reclaimed. To keep the system from running short of pages, the PFRA runs periodically in a kernel thread; or, when the system is already short of pages, it is called synchronously from a kernel allocation path that cannot obtain the pages it needs.
Pages obtained through the last two allocation methods above are generally reclaimed by the PFRA (or reclaimed synchronously by operations such as deleting a file or exiting a process).

PFRA reclaim of general pages
For the first two page allocation methods above (direct page allocation and object allocation through slab), the PFRA may take part in the reclaim as well.
Users of such pages can register a callback with the PFRA (using the register_shrinker function). The PFRA then calls these callbacks at appropriate times to trigger the reclaim of the corresponding pages or objects.
A typical example is dentry reclaim. A dentry, allocated from a slab, is the object representing a node in the directory structure of the virtual file system. When a dentry's reference count drops to 0, it is not freed immediately but is cached on an LRU list for possible later use. (See "Linux kernel virtual file system analysis".)
The dentries on this LRU list eventually need to be reclaimed, so during initialization the virtual file system calls register_shrinker to register the reclaim function shrink_dcache_memory.
The super block objects of all file systems in the system are kept on a linked list. The shrink_dcache_memory function scans this list, obtains each super block's LRU of unused dentries, and reclaims some of the oldest ones. As a dentry is released, the reference it holds on its inode is dropped, which may cause the inode to be freed as well.
A freed inode is not released immediately either; it is kept on an unused list. During initialization the virtual file system also calls register_shrinker to register the callback shrink_icache_memory, which reclaims these unused inodes; when an inode is reclaimed, the disk cache pages associated with it are released too.
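
For reference, here is a minimal sketch of registering a shrinker, written against the count/scan form of the API used by kernels since around 3.12 (the exact interface has changed across kernel versions). The two callbacks and the bookkeeping they rely on are hypothetical:

    #include <linux/shrinker.h>

    static unsigned long my_cached;         /* hypothetical object count */
    unsigned long my_free_oldest(unsigned long nr);  /* hypothetical helper */

    static unsigned long my_count_objects(struct shrinker *s,
                                          struct shrink_control *sc)
    {
            return my_cached;       /* how many objects could be freed */
    }

    static unsigned long my_scan_objects(struct shrinker *s,
                                         struct shrink_control *sc)
    {
            /* free up to sc->nr_to_scan of the oldest cached objects,
               returning the number actually freed */
            return my_free_oldest(sc->nr_to_scan);
    }

    static struct shrinker my_shrinker = {
            .count_objects = my_count_objects,
            .scan_objects  = my_scan_objects,
            .seeks         = DEFAULT_SEEKS,
    };

    static int __init my_init(void)
    {
            /* as the VFS does for its dentry and inode caches */
            return register_shrinker(&my_shrinker);
    }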

In addition, as the system runs, many free objects may pile up in a slab (for example, after the peak usage of some object type has passed). The cache_reap function of the PFRA reclaims these unneeded free objects; if the free objects making up a page can all be returned, the page itself can be released back to the buddy system.
What cache_reap does is easy to describe: the kmem_cache structures of all the object pools in the system are linked on a list; cache_reap scans each pool, looks for pages that can be freed, and reclaims them. (The actual process is, of course, more complicated.)

About memory mapping
As mentioned above, disk cache pages and memory-mapped pages are generally reclaimed by the PFRA, and the PFRA treats the two very similarly. In fact, a disk cache page may well be mapped into user space. What follows is a brief introduction to memory mapping:

Memory mappings come in two kinds, file mappings and anonymous mappings.
A file mapping means the mapped vma corresponds to a region of a file. This form of mapping is rarely used explicitly by user-space programs, which normally just open a file and call read/write to access it.
In fact, a user program can also map part of a file into memory (as a vma) with the mmap system call and then read and write the file through ordinary memory accesses. Although programs rarely do this explicitly, every user process is full of such mappings: the executable code the process runs (both the executable file and the lib library files) is mapped exactly this way.
In "Linux kernel file read/write analysis" we did not discuss the implementation of file mappings. In fact, a file mapping maps the pages of the file's disk cache directly into user space (so file-mapped pages are a subset of the disk cache pages), and data can be read and written with zero copies; with read/write, a copy takes place between user-space memory and the disk cache.
An anonymous mapping, as opposed to a file mapping, means the mapped vma does not correspond to any file. Ordinary user-space memory allocation (heap space and stack space) consists of anonymous mappings.
Obviously, several processes may map the same file through their file mappings (for example, most processes map the .so file of the libc library). What about anonymous mappings? In fact, several processes may map the same physical memory through their anonymous mappings too, because parent and child processes after a fork share the original physical pages (copy-on-write).

File mappings are further divided into shared mappings and private mappings. With a private mapping, if the process writes to the mapped address space, the mapped disk cache is not written directly; instead the original content is copied, the copy is written, and the process's page mapping is switched over to the copy (copy-on-write). In other words, a write is visible only to the writer. With a shared mapping, a write goes through to the disk cache and is visible to everyone.
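
The following user-space snippet shows the three flavors of mapping discussed above. The file name is arbitrary, error handling is omitted for brevity, and the file is assumed to be at least one page long:

    #include <fcntl.h>
    #include <sys/mman.h>
    #include <unistd.h>

    int main(void)
    {
            int fd = open("data.bin", O_RDWR);  /* arbitrary example file */
            size_t len = 4096;

            /* Shared file mapping: writes go through to the disk cache
               and are visible to every process mapping the file. */
            char *shared = mmap(NULL, len, PROT_READ | PROT_WRITE,
                                MAP_SHARED, fd, 0);

            /* Private file mapping: the first write triggers copy-on-write;
               the change is visible only to this process. */
            char *priv = mmap(NULL, len, PROT_READ | PROT_WRITE,
                              MAP_PRIVATE, fd, 0);

            /* Anonymous mapping: no file behind it; heap and stack memory
               is obtained this way. */
            char *anon = mmap(NULL, len, PROT_READ | PROT_WRITE,
                              MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

            shared[0] = 1;  /* dirties a disk cache page */
            priv[0]   = 2;  /* copy-on-write: private copy created */
            anon[0]   = 3;

            munmap(shared, len);
            munmap(priv, len);
            munmap(anon, len);
            close(fd);
            return 0;
    }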

Which pages can be reclaimed
Disk cache pages (including file-mapped pages) can simply be discarded and reclaimed. If a page is dirty, however, it must be written back to disk before being discarded.
Anonymously mapped pages cannot be discarded, because they hold data the user program is still using, and discarded data could not be recovered. The data of a disk cache page, by contrast, is also stored on disk and can be reproduced.
So to reclaim an anonymously mapped page, its data must first be dumped to disk; this is page swapping (swap). Swapping a page is obviously even more expensive than reclaiming a disk cache page.
Anonymously mapped pages can be swapped out to a swap file or a swap partition on disk (a partition is a device, and a device is also a file, so below we simply say "swap file").

Thus, unless a page is reserved or locked (the page flags PG_reserved/PG_locked are set; in some cases the kernel must hold on to a page temporarily to keep it from being reclaimed), every disk cache page can be reclaimed and every anonymously mapped page can be swapped.

Although there are many pages that could be reclaimed, the PFRA should clearly reclaim/swap as few as possible (restoring these pages from disk later is expensive). So the PFRA reclaims/swaps only a portion of the rarely used pages, and only when necessary; the number of pages reclaimed in each pass is an empirical value: 32.

To this end, all the disk cache pages and anonymously mapped pages are kept on a set of LRU lists. (Actually each zone has such a set of LRUs, and a page is placed on the LRUs of the zone it belongs to.)
Such a set consists of several pairs of lists: lists for disk cache pages (including file-mapped pages), lists for anonymously mapped pages, and so on. Each pair is really two lists, an active one and an inactive one; the former holds recently used pages, the latter recently unused ones.
During page reclaim, the PFRA does two things: it moves the least recently used pages of the active lists onto the inactive lists, and it tries to reclaim the least recently used pages of the inactive lists.
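
The kernel encodes this set of lists roughly as follows (paraphrased from <linux/mmzone.h>; names vary a little between kernel versions):

    /* Paraphrased from <linux/mmzone.h>: one list per enum entry, i.e.
       an inactive/active pair for anonymous pages and another pair for
       file (disk cache) pages. */
    enum lru_list {
            LRU_INACTIVE_ANON,   /* anonymously mapped, not recently used */
            LRU_ACTIVE_ANON,     /* anonymously mapped, recently used */
            LRU_INACTIVE_FILE,   /* disk cache/file-mapped, not recently used */
            LRU_ACTIVE_FILE,     /* disk cache/file-mapped, recently used */
            LRU_UNEVICTABLE,     /* pages that must not be reclaimed */
            NR_LRU_LISTS
    };

    struct lruvec {
            struct list_head lists[NR_LRU_LISTS];   /* one list head each */
    };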

Determining the least recently used pages
Now there is a question: how do we decide which pages on the active/inactive lists are the least recently used?
One approach is ordering: whenever a page is accessed, move it to the tail of the list (assuming reclaim starts from the head). But this would mean pages moving around the lists very frequently, and a list would have to be locked before each move (several CPUs may access it at once), which hurts performance badly.
The Linux kernel instead combines marking with ordering: pages are placed in order only when they move between the active and inactive lists, always at the tail of the destination list (as above, assuming reclaim starts from the head).
While pages stay on one list, their order is not adjusted; instead an access flag records whether a page has just been accessed. If a page on the inactive list already has its access flag set and is accessed again, it is moved to the active list and the flag is cleared. (In practice, to avoid contention, a page is not moved from the inactive list to the active list directly; a pagevec is used as an intermediate buffer, so the lists themselves are locked less often.)
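
This "promote on the second access" rule is what the kernel's mark_page_accessed function implements; the following is simplified from mm/swap.c of older kernels (the pagevec buffering is ignored here):

    /* Simplified from mark_page_accessed() in mm/swap.c (older kernels):
       a page moves inactive -> active only on its second access. */
    void mark_page_accessed(struct page *page)
    {
            if (!PageActive(page) && PageReferenced(page) && PageLRU(page)) {
                    activate_page(page);        /* move to the active list */
                    ClearPageReferenced(page);  /* restart the access count */
            } else if (!PageReferenced(page)) {
                    SetPageReferenced(page);    /* first access: just mark */
            }
    }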

A page has two kinds of access flags. The first is the PG_referenced flag in page->flags, which is set when the page is accessed. Pages in the disk cache that are not mapped are accessed by user processes through system calls such as read and write, and the system call code sets PG_referenced on the pages it touches.
Pages that are memory-mapped, on the other hand, can be accessed by user processes directly, without going through the kernel. In that case the access flag is set not by the kernel but by the MMU: after translating a virtual address to a physical address, the MMU sets an accessed bit in the corresponding page table entry to record that the page was accessed. (Similarly, the MMU sets a dirty bit in the page table entry of a page that is written, marking the page as dirty.)
A page's access flags (both kinds) are cleared when the PFRA examines the page for reclaim, because an access flag must naturally have a validity period, and one PFRA pass is that period. The PG_referenced flag in page->flags can be cleared directly, while the accessed bit in a page table entry can only be cleared once the entry has been found via the page (see "Reverse mapping" below).

So how does the reclaim process scan the LRU lists?
Since there are several sets of LRUs (the system has several zones, and each zone has several pairs of LRUs), if the PFRA had to scan every LRU in full on every pass just to find a few pages worth reclaiming, the algorithm would clearly be inefficient.
The Linux kernel's PFRA therefore scans with a scan priority, which is converted into the number of pages to scan on each LRU. A whole reclaim pass starts at the lowest priority, scans the least recently used pages of each LRU, and tries to reclaim them. If enough pages have been reclaimed after one scan, the pass ends; otherwise the priority is raised and the scan is repeated, until enough pages have been reclaimed. If enough pages still cannot be reclaimed, the priority is raised to the maximum, meaning all pages are scanned; at that point the pass ends even if the number of reclaimed pages is still insufficient.
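
In code, the idea looks roughly like this; the sketch is modeled on try_to_free_pages() and shrink_zone() in mm/vmscan.c of older kernels, DEF_PRIORITY is the real kernel constant, and shrink_all_lrus() is a hypothetical stand-in for one scan of every LRU at a given priority:

    #define DEF_PRIORITY 12

    unsigned long shrink_all_lrus(int priority);  /* hypothetical helper:
                    each LRU contributes (its length >> priority) pages */

    unsigned long reclaim_pass(unsigned long nr_wanted)
    {
            unsigned long nr_reclaimed = 0;
            int priority;

            /* start with the smallest scan; widen it until enough pages
               are freed or everything (priority 0) has been scanned */
            for (priority = DEF_PRIORITY; priority >= 0; priority--) {
                    nr_reclaimed += shrink_all_lrus(priority);
                    if (nr_reclaimed >= nr_wanted)
                            break;
            }
            return nr_reclaimed;  /* may still be short after priority 0 */
    }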

Each time an LRU is scanned, the number of pages corresponding to the current priority is taken from its active and inactive lists and then processed: if a page cannot be reclaimed (because it is reserved or locked), it is put back at the head of its list (as above, assuming reclaim starts from the head); otherwise, if the page's access flag is set, the flag is cleared and the page is put back at the tail of its list; otherwise, the page is moved from the active list to the inactive list, or reclaimed off the inactive list.
The PFRA prefers not to reclaim anonymously mapped pages from the active list, because the memory used directly by user processes is usually comparatively small, while reclaiming it requires swapping, which is costly. So when plenty of memory remains and the share of anonymous mappings is small, the pages on the active lists of the anonymous mappings are not reclaimed at all. (A page that has already landed on the inactive list no longer gets this special treatment.)

Reverse mapping
In this way, during a PFRA reclaim pass, some pages on the inactive LRU lists may be about to be reclaimed.
If a page is not mapped anywhere, it can be returned to the buddy system directly (a dirty page is written back first, then reclaimed). Otherwise there is one more troublesome thing to do: some page table entry of some user process is referencing this page, so before the page can be reclaimed, every page table entry that references it must be found and dealt with.
And here the problem arises: how does the kernel know which page table entries reference a given page? To make this possible, the kernel maintains a reverse mapping from a page to the page table entries that reference it.
Through the reverse mapping, the vmas into which a mapped page is mapped can be found; vma->vm_mm->pgd then locates the page table, page->index yields the page's virtual address, and with that virtual address the corresponding page table entry can be found in the page table. (The accessed bit in the page table entry, mentioned earlier, is also reached through the reverse mapping.)

In a page's struct page, if the lowest bit of page->mapping is set, this is an anonymously mapped page and page->mapping points to an anon_vma structure; otherwise it is a file-mapped page and page->mapping points to the file's address_space structure. (Obviously, anon_vma structures and address_space structures must be allocated at aligned addresses, so that at least the lowest bit is guaranteed to be 0.)
For an anonymously mapped page, the anon_vma structure serves as a list head, linking together, through the vma->anon_vma_node pointers, all the vmas that map this page. Whenever a page is (anonymously) mapped into a user space, the corresponding vma is added to this list.
For a file-mapped page, the address_space structure maintains not only the radix tree that holds the file's disk cache pages, but also a priority search tree over all the vmas that map the file. A vma that maps a file does not necessarily map the whole file; it may map only part of it. So besides indexing all the mapping vmas, this priority search tree records which regions of the file are mapped by which vmas. When a page is file-mapped into a user space, the corresponding vma is added to the priority search tree. Thus, given a page in the disk cache, page->index gives the page's position in the file, and the priority search tree yields every vma that maps the page.

In the two steps above, the magical page->index does two jobs: it yields the page's virtual address, and it yields the page's position in the file's disk cache.
vma->vm_start records the starting virtual address of the vma, vma->vm_pgoff records the offset of the vma within the mapped file (or shared memory), and page->index records the offset of the page within the file (or shared memory).
From vma->vm_pgoff and page->index the offset of the page within the vma follows, and adding vma->vm_start gives the page's virtual address; and page->index by itself gives the page's position in the file's disk cache.
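
Expressed as code, this is essentially what the kernel's vma_address() helper in mm/rmap.c computes (paraphrased; the bounds check against the vma is omitted):

    /* Paraphrased from vma_address() in mm/rmap.c: given a page and a vma
       that maps it, recover the page's virtual address in that vma. */
    static unsigned long vma_address(struct page *page,
                                     struct vm_area_struct *vma)
    {
            /* offset of the page within the vma, counted in pages... */
            unsigned long pgoff = page->index - vma->vm_pgoff;

            /* ...converted to bytes and added to the vma's start address */
            return vma->vm_start + (pgoff << PAGE_SHIFT);
    }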

Page swapping
Once the page table entries referencing the page to be reclaimed have been found, for a file-mapped page the entries can simply be cleared. If the user accesses that address again later, a page fault occurs, and the fault handler allocates a new page and reads the corresponding data back from disk (perhaps the page is already in the corresponding disk cache, because another process accessed it first). This is exactly what happens on the first access after a mapping is set up.
For an anonymous mapping, the page is first written back to the swap file, and then the page's index within the swap file must be recorded in the page table entry.
A page table entry has a present bit; if that bit is cleared, the MMU considers the entry invalid. When an entry is invalid, the MMU pays no attention to its other bits, so those bits can be used to store other information; here they store the page's index in the swap file (actually, a swap file number plus an index within that swap file).
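
The kernel wraps this encoding in a swp_entry_t. The following is paraphrased from <linux/swapops.h>; the real bit layout is architecture-dependent, but the idea is a type field (which swap file) in the top bits and the offset (which slot) below it:

    typedef struct { unsigned long val; } swp_entry_t;

    #define MAX_SWAPFILES_SHIFT 5
    #define SWP_TYPE_SHIFT  (sizeof(unsigned long) * 8 - MAX_SWAPFILES_SHIFT)
    #define SWP_OFFSET_MASK ((1UL << SWP_TYPE_SHIFT) - 1)

    /* build an entry: swap file number in the top bits, slot index below */
    static swp_entry_t swp_entry(unsigned long type, unsigned long offset)
    {
            swp_entry_t e;
            e.val = (type << SWP_TYPE_SHIFT) | (offset & SWP_OFFSET_MASK);
            return e;
    }

    static unsigned long swp_type(swp_entry_t e)    /* which swap file */
    {
            return e.val >> SWP_TYPE_SHIFT;
    }

    static unsigned long swp_offset(swp_entry_t e)  /* index in the file */
    {
            return e.val & SWP_OFFSET_MASK;
    }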

The process of moving an anonymously mapped page out to the swap file (swap-out) is very similar to writing a dirty disk cache page back to its file.
The swap file, too, has its own address_space structure. During swap-out, the anonymously mapped page is first placed in the disk cache belonging to this address_space, then written back to the swap file just like a dirty page. Once the write-back completes, the page is freed (remember, freeing this page is what we set out to do).
Why not write the page straight to the swap file instead of going through the disk cache? Because the page may be mapped more than once, and it is impossible to update the matching page table entries in all the user processes' page tables in one stroke (rewriting each to hold the page's index in the swap file). So, while the page is being released, it is parked temporarily in the disk cache.
And not every page table entry update necessarily succeeds (for example, the page may be accessed again before its entry is updated, in which case it no longer needs to be reclaimed), so the page may stay in the disk cache for quite a while.
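
The per-page swap-out steps can be summarized as follows. This is heavily simplified from the shrink_page_list() path in mm/vmscan.c of older (roughly 2.6.3x) kernels, with locking and error handling omitted:

    static void swap_out_one_page(struct page *page,
                                  struct address_space *mapping)
    {
            if (PageAnon(page) && !PageSwapCache(page))
                    add_to_swap(page);  /* allocate a swap slot and enter the
                                           swap file's disk cache */

            /* walk the reverse mapping and rewrite every referencing page
               table entry to hold the page's swap entry */
            try_to_unmap(page, TTU_UNMAP);

            if (PageDirty(page))
                    pageout(page, mapping); /* write back to the swap file */

            /* once write-back completes and nothing else holds the page,
               it is finally freed back to the buddy system */
    }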

Likewise, the process of reading an anonymously mapped page back in from the swap file (swap-in) is very similar to reading file data.
First the corresponding disk cache is checked to see whether the page is already there; if not, it is read from the swap file, and the data, like file data, is read into the disk cache. Then the corresponding page table entry in the user process's page table is rewritten to point directly at this page.
The page may not be removed from the disk cache right away, because other user processes may map it too (their page table entries still hold the swap file index) and will also look for it here. The page cannot be taken out of the disk cache until no page table entry references that swap file index any more.
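
The swap-in side, likewise simplified, is modeled on do_swap_page() in mm/memory.c of older kernels (locking, reference counting and error handling omitted):

    static void swap_in_sketch(struct mm_struct *mm,
                               struct vm_area_struct *vma,
                               unsigned long address, pte_t *page_table,
                               pte_t orig_pte)
    {
            /* decode the swap entry stored in the non-present PTE */
            swp_entry_t entry = pte_to_swp_entry(orig_pte);

            /* is the page already in the swap file's disk cache? */
            struct page *page = lookup_swap_cache(entry);
            if (!page)
                    /* no: read it from the swap file into the disk cache */
                    page = read_swap_cache_async(entry, GFP_HIGHUSER,
                                                 vma, address);

            /* point this process's page table entry at the page again */
            set_pte_at(mm, address, page_table,
                       mk_pte(page, vma->vm_page_prot));
            swap_free(entry);   /* drop one reference on the swap slot */
    }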

The last resort
As described earlier, the PFRA may scan all the LRUs and still fail to reclaim the pages it needs; likewise, pages may prove unreclaimable in slab, the dentry cache, the inode cache, and elsewhere.
What if, at that moment, some piece of kernel code absolutely must have a page (without which the system might crash)? The PFRA has no choice but to pull out its last resort: OOM (out of memory) killing. So-called OOM handling means picking the least important process and killing it, releasing the memory pages that process occupied to relieve the pressure on the system.
