The memory management subsystem is one of the most important parts of the operating system. Since the early days of computing, programs have needed more memory than physically exists in the machine, and many strategies have been devised to overcome this limitation. The most successful of these is virtual memory, which makes the system appear to have more memory than it actually has by sharing the limited physical memory among the competing processes as they need it.
Virtual memory does more than make the computer's memory appear larger. The memory management subsystem also provides the following features:
Large address spaces
The operating system makes the system appear to have far more memory than it actually has. Virtual memory can be many times larger than the physical memory in the system.
Protection
Each process runs in its own virtual address space. These address spaces are completely separate from one another, so a process running one application cannot affect another. In addition, the hardware virtual memory mechanisms allow areas of memory to be protected against writing, which protects code and data from being overwritten by rogue programs.
Memory mapping
Memory mapping maps image files and data files directly into the address space of a process. The contents of the file are linked directly into the process's virtual address space.
Fair physical memory allocation
The memory management subsystem allows every running process in the system to share the physical memory fairly.
Shared virtual memory
Although virtual memory gives each process its own virtual address space, processes sometimes need to share memory. For example, several processes in the system may be running the bash command shell at the same time. Rather than keep a copy of bash in the virtual address space of each process, it is better to keep a single copy in physical memory and share it among all of them. Dynamic libraries are another common example of executable code shared between processes. Shared memory can also be used as an inter-process communication (IPC) mechanism, with two or more processes exchanging information through memory common to all of them. Linux supports the System V shared memory IPC mechanism.
3.1 An abstract model of virtual memory
Figure 3.1: Abstract model of virtual-to-physical address mapping
Before discussing how Linux supports virtual memory, it is necessary to look at a simpler abstract model.
When the processor executes a program, it reads an instruction from memory and decodes it. In decoding the instruction, it may need to fetch or store the contents of a location in memory. It then executes the instruction and moves on to the next instruction in the program. In this way the processor is constantly accessing memory, either to fetch instructions or to fetch and store data.
In a virtual memory system all of these addresses are virtual addresses, not physical addresses. The processor converts virtual addresses into physical addresses using information held in a set of tables maintained by the operating system.
To make this translation easier, virtual and physical memory are divided into fixed-size chunks called pages. The pages are all the same size; they need not be, but the system would be very hard to manage if they were not. Linux uses 8 KB pages on Alpha AXP systems and 4 KB pages on Intel x86 systems. Each page is given a unique number, the page frame number (PFN).
In this paged model, a virtual address is composed of two parts: a virtual page frame number and an offset within the page. If the page size is 4 KB, bits 0 to 11 of the virtual address contain the offset and bits 12 and above contain the virtual page frame number. Each time the processor encounters a virtual address, it must extract the offset and the virtual page frame number. Using the page table, it translates the virtual page frame number into a physical page frame number and then accesses the location at the given offset within that physical page.
Figure 3.1 shows the virtual address spaces of two processes, X and Y, each with its own page table. These page tables map each process's virtual pages to physical pages in memory. In the figure, virtual page frame number 0 of process X is mapped to physical page frame number 1, and virtual page frame number 1 of process Y is mapped to physical page frame number 4. In theory, each page table entry contains the following information:
A valid flag, indicating whether this page table entry is valid
The physical page frame number that this entry describes
Access control information, describing how the page may be used: can it be written to? Does it contain executable code?
The virtual page frame number is used as an offset into the page table. Virtual page frame number 5 corresponds to the 6th element of the table (0 is the first element).
To translate a virtual address into a physical one, the processor must first work out the virtual page frame number and the offset within that page. Making the page size a power of 2 allows this to be done with simple masking and shifting. Suppose the page size in Figure 3.1 is 0x2000 bytes (8192 in decimal) and process Y accesses virtual address 0x2194; the processor translates this into virtual page frame number 1 and offset 0x194.
The processor uses the virtual page frame number as an index into the process's page table to retrieve the page table entry. If the entry at that position is valid, the processor takes the physical page frame number from it. If the entry is invalid, the process has accessed a non-existent area of its virtual memory. In this case the processor cannot resolve the address and must pass control to the operating system so that it can fix things up.
Exactly how the processor notifies the operating system that a process has tried to access a virtual address for which there is no valid translation depends on the processor. In general, the processor raises a page fault and traps into the operating system kernel, which is told the faulting virtual address and the reason for the fault.
If the page table entry is valid, the processor multiplies the physical page frame number by the page size to get the base address of the page in physical memory, then adds the offset. Continuing the example, virtual page frame number 1 of process Y is mapped to physical page frame number 4, which starts at 0x8000 (4 * 0x2000). Adding the 0x194 byte offset gives the final physical address 0x8194.
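The translation steps described above can be sketched in a few lines of Python. This is only an illustration of the abstract model, not kernel code: the dictionary page table and the names translate and PageFault are assumptions made for the example.

```python
PAGE_SIZE = 0x2000  # 8 KB pages, as in the worked example


class PageFault(Exception):
    """Raised when no valid page table entry exists for an address."""


def translate(vaddr, page_table, page_size=PAGE_SIZE):
    """Translate a virtual address using a {virtual PFN: physical PFN} table."""
    # Split the address into virtual page frame number and offset.
    vpfn, offset = divmod(vaddr, page_size)
    if vpfn not in page_table:
        # No valid entry: hand control to the "operating system".
        raise PageFault(hex(vaddr))
    # Base of the physical page plus the offset within the page.
    return page_table[vpfn] * page_size + offset


# Process Y maps virtual page frame 1 to physical page frame 4.
page_table_y = {1: 4}
print(hex(translate(0x2194, page_table_y)))  # 0x8194
```

Because the page size is a power of 2, the divmod call is equivalent to a shift and a mask, which is how hardware performs the split.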
By mapping virtual addresses to physical addresses in this way, virtual memory can be mapped onto the system's physical pages in any order. For example, in Figure 3.1, virtual page frame number 0 of process X is mapped to physical page frame number 1, whereas virtual page frame number 7 is mapped to physical page frame number 0, even though it is higher in virtual memory than virtual page frame number 0. This demonstrates an interesting by-product of virtual memory: the pages of virtual memory do not have to be present in physical memory in any particular order.
3.1.1 Demand paging
Because there is much less physical memory than virtual memory, the operating system must use physical memory efficiently. One way to save physical memory is to load only the virtual pages that the running program is actually using. For example, a database program may be run to query a database. In this case not all of the database needs to be loaded into memory, only the records being examined. If the query is a search, there is no point in loading the code that adds new records. This technique of loading virtual pages into memory only as they are accessed is known as demand paging.
When a process attempts to access a virtual address that is not currently in memory, the processor cannot find a page table entry for the referenced page. For example, in Figure 3.1 there is no entry in process X's page table for virtual page frame number 2, so if process X tries to read from an address within that page, the processor cannot translate the address into a physical one. At this point the processor notifies the operating system that a page fault has occurred.
If the faulting virtual address is invalid, the process has attempted to access a virtual address that it should not have. This may be caused by an application error, for example a write to a random address in memory. In this case the operating system terminates the process, protecting the other processes in the system from the faulty one.
If the faulting virtual address is valid but the page it refers to is not currently in memory, the operating system must bring the page into memory from its image on disk. Disk access takes a long time, relatively speaking, so the process must wait until the page has been fetched. If other processes could run, the operating system selects one of them to run in the meantime. The fetched page is written into a free physical page frame, and an entry for the virtual page frame number is added to the process's page table. The process is then restarted at the instruction where the page fault occurred. This time the virtual memory access succeeds: the processor can translate the virtual address into a physical address, and the process continues to run.
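As a rough illustration of the two page fault outcomes just described, the following sketch simulates a process touching its pages. All names here (access, valid_pages, free_frames, disk, memory) are hypothetical, and the data structures are plain dictionaries and lists rather than anything a real kernel uses.

```python
class SegmentationFault(Exception):
    """The process touched a virtual page it has no right to access."""


def access(vpfn, page_table, valid_pages, free_frames, disk, memory):
    """Return the contents of virtual page vpfn, paging it in on demand."""
    if vpfn not in page_table:  # page fault
        if vpfn not in valid_pages:
            # Invalid address: the OS would terminate the process.
            raise SegmentationFault(vpfn)
        # Valid address, page not resident: load it into a free frame
        # and add a page table entry for it.
        frame = free_frames.pop(0)
        memory[frame] = disk[vpfn]
        page_table[vpfn] = frame
    # The access is restarted and now succeeds.
    return memory[page_table[vpfn]]


disk = {0: "code", 2: "data"}          # image of the process on disk
page_table, memory = {}, {}
free_frames = [5, 6]
print(access(2, page_table, set(disk), free_frames, disk, memory))  # data
print(page_table)                                                   # {2: 5}
```

The sketch assumes a free frame is always available; what happens when none is left is a separate problem.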
Linux uses demand paging to load executable images into a process's virtual memory. Whenever a command is executed, the file containing it is opened and its contents are mapped into the process's virtual memory. This is done by modifying the data structures that describe the process's memory map and is known as memory mapping. However, only the first part of the image is actually brought into physical memory; the rest is left on disk. As the image executes, it generates page faults, and Linux uses the process's memory map to determine which parts of the image to bring into memory for execution.
3.1.2 Swapping
If a process needs to bring a virtual page into physical memory and no free physical pages are available, the operating system must make room by discarding another page from physical memory.
If the page to be discarded came from an executable image or data file and has not been written to, it does not need to be saved. It can simply be discarded, and if the process needs it again, it can be read back from the image or data file.
However, if the page has been modified, the operating system must preserve its contents so that it can be accessed later. Such a page is known as a dirty page, and when it is removed from memory it is saved in a special kind of file called the swap file. Access to the swap file is very slow compared with the speed of the processor and physical memory, so the operating system must weigh the cost of writing dirty pages to disk against the benefit of keeping them in memory for reuse.
If the algorithm used to decide which pages to discard or swap out is inefficient, a condition known as thrashing occurs. Pages are constantly written to disk and then read back, and the operating system is too busy to do any real work. For example, if physical page frame number 1 in Figure 3.1 is accessed regularly, it is not a good candidate for swapping to disk. The set of pages that a process is currently using is called its working set. An efficient swapping scheme ensures that the working sets of all processes stay in physical memory.
Linux uses a least recently used (LRU) page aging technique to choose fairly which pages to remove from the system. Every page in the system has an age that changes as the page is accessed. The more often a page is accessed, the younger it is; the less often it is accessed, the older it becomes. Old pages are good candidates for swapping.
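One way to picture page aging is with a per-page counter that rises when the page is accessed and decays over time. The sketch below is a toy model under that assumption; the constants ADVANCE and MAX_YOUTH and the function names are invented for the example and are not taken from the text.

```python
ADVANCE = 3      # assumed: how much younger a page gets per access
MAX_YOUTH = 20   # assumed: upper bound on the counter


def touch(youth, page):
    """Record an access: the page becomes younger (higher counter)."""
    youth[page] = min(youth[page] + ADVANCE, MAX_YOUTH)


def age_all(youth):
    """Periodic pass: every page grows a little older."""
    for page in youth:
        youth[page] = max(youth[page] - 1, 0)


def oldest(youth):
    """The oldest page (lowest counter) is the best swap candidate."""
    return min(youth, key=youth.get)


youth = {"A": 0, "B": 0, "C": 0}
touch(youth, "A")
touch(youth, "A")
touch(youth, "B")
age_all(youth)
print(youth)          # {'A': 5, 'B': 2, 'C': 0}
print(oldest(youth))  # C
```

Page C was never accessed, so it is the oldest and would be chosen first.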
3.1.3 Shared virtual memory
Virtual memory makes it easy for several processes to share memory. All memory accesses go through page tables, and each process has its own page table. For two processes to share a physical page, the page's physical page frame number must appear in a page table entry in both of their page tables.
Figure 3.1 shows two processes that share physical page frame number 4. For process X this is virtual page frame number 4, whereas for process Y it is virtual page frame number 6. This illustrates an interesting point about sharing pages: a shared physical page does not have to sit at the same place in the virtual memory of every process that shares it.
3.1.4 Physical and virtual addressing modes
It makes little sense for the operating system itself to run in virtual memory; having to maintain page tables for itself would be a nightmare. Most general-purpose processors support a physical addressing mode as well as a virtual addressing mode. Physical addressing mode requires no page tables, and the processor performs no address translation in this mode. The Linux kernel runs directly in the physical address space.
The Alpha AXP processor does not have a special physical addressing mode. Instead, it divides the memory space into several areas and designates two of them as physically mapped addresses. This kernel address space is known as the KSEG address space and covers all addresses upwards from 0xfffffc0000000000. To execute code located in KSEG or to access data there, the code must be running in kernel mode. The Linux kernel on Alpha starts at address 0xfffffc0000310000.
3.1.5 Access control
Page table entries also contain access control information. Because the processor is already using the page table entry to map a virtual address to a physical one, it can easily use the access control information to check that the process is not accessing memory in a way that it should not.
There are many reasons to restrict access to areas of memory. Some memory, such as that containing executable code, is naturally read-only, and the operating system must not allow a process to write to it. In contrast, pages containing data can be written to, but attempts to execute that data as instructions should fail. Most processors have at least two modes of execution: kernel mode and user mode. Kernel code must not be executed in user mode, and kernel data structures must not be modified from user mode.
Figure 3.2: Alpha AXP page table entry
The access control information in a page table entry is processor-specific; Figure 3.2 shows the PTE (Page Table Entry) for the Alpha AXP. The bit fields have the following meanings:
V: Valid. If this bit is set, the PTE is valid.
FOE: Fault on execute. Whenever an attempt is made to execute instructions in this page, the processor reports a page fault and passes control to the operating system.
FOW: Fault on write. As above, but the page fault is raised on an attempt to write to this page.
FOR: Fault on read. As above, but the page fault is raised on an attempt to read from this page.
ASM: Address space match. Used when the operating system wants to clear only some of the entries in the translation buffer.
KRE: Code running in kernel mode can read this page.
URE: Code running in user mode can read this page.
GH: Granularity hint, used when mapping an entire block with a single translation buffer entry rather than many.
KWE: Code running in kernel mode can write to this page.
UWE: Code running in user mode can write to this page.
Page frame number: For a PTE with the V bit set, this field contains the physical page frame number for this PTE. For an invalid PTE, a non-zero value in this field contains information about where the page is in the swap file.
The following two bits are defined and used by Linux:
_PAGE_DIRTY: If set, the page needs to be written out to the swap file.
_PAGE_ACCESSED: Used by Linux to mark a page as having been accessed.
3.2 Caches
A system implemented using the theoretical model above would work, but not particularly efficiently. Both operating system and processor designers try hard to extract more performance from the system. Apart from making processors and memory faster, the best approach is to maintain caches of useful information and data that make some operations faster. Linux uses a number of memory-management-related caches:
Buffer cache
The buffer cache contains data buffers used by the block device drivers.
These buffers are of fixed size (for example, 512 bytes) and contain blocks of information that have either been read from a block device or are being written to it. A block device is one that can only be accessed by reading and writing fixed-size blocks of data. All hard disks are block devices.
The buffer cache is indexed by the device identifier and the desired block number so that a block of data can be found quickly. Block devices are only ever accessed through the buffer cache. If data can be found in the buffer cache, it does not need to be read from the physical block device (for example, a hard disk), and access is much faster.
Page cache
The page cache is used to speed up access to executable images and data files on disk.
It caches the logical contents of a file one page at a time. As pages are read into memory from disk, they are cached in the page cache.
Swap cache
Only modified (dirty) pages are saved in the swap file.
As long as such a page is not modified again after it has been written to the swap file, the next time it is swapped out there is no need to write it again, because the page is already in the swap file; it can simply be discarded. In a heavily swapping system, the swap cache saves many unnecessary and costly disk operations.
Hardware caches
One commonly implemented hardware cache is a cache of page table entries in the processor. The processor does not always read the page table directly but caches translations for pages as it needs them. These caches are the Translation Look-aside Buffers (TLBs), which contain cached copies of page table entries from one or more processes in the system.
When a virtual address is referenced, the processor tries to find a matching TLB entry. If it finds one, it translates the virtual address directly into a physical address and performs the operation on the data. If it cannot find one, it must ask the operating system for help. It does this by signalling a TLB miss to the operating system, using a system-specific mechanism to deliver the exception. The operating system generates a new TLB entry for the address mapping. When the exception has been cleared, the processor attempts the translation again; this time it succeeds, because there is now a valid entry in the TLB for that address.
The drawback of using caches, hardware or otherwise, is that Linux must spend more time and space maintaining them, and if the caches become corrupted, the system will crash.
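The TLB lookup and miss handling described in this section can be modelled as a small cache sitting in front of the page table. The following is a simplified sketch: the function name translate_with_tlb, the dictionary-based TLB, and the 4 KB page size are all assumptions for illustration, and a real TLB is a fixed-size hardware structure with its own replacement policy.

```python
def translate_with_tlb(vaddr, tlb, page_table, page_size=0x1000):
    """Translate vaddr; return (physical address, whether the TLB hit)."""
    vpfn, offset = divmod(vaddr, page_size)
    hit = vpfn in tlb
    if not hit:
        # TLB miss: the "operating system" builds a new TLB entry
        # from the page table, then the translation is retried.
        tlb[vpfn] = page_table[vpfn]
    return tlb[vpfn] * page_size + offset, hit


tlb = {}
page_table = {2: 7}
first = translate_with_tlb(0x2010, tlb, page_table)
second = translate_with_tlb(0x2010, tlb, page_table)
print(hex(first[0]), first[1])    # 0x7010 False
print(hex(second[0]), second[1])  # 0x7010 True
```

The second lookup hits because the first one left a valid entry in the TLB.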
3.3 Linux page tables
Figure 3.3: Three-level page tables
Linux assumes that there are three levels of page tables. Each page table that is accessed contains the page frame number of the next level of page table. Figure 3.3 shows how a virtual address is broken into a number of fields, each providing an offset into a particular page table. To translate a virtual address into a physical one, the processor takes the contents of each field in turn, converts it into an offset into the physical page containing the current page table, and reads the page frame number of the next level of page table. This is repeated three times until the page frame number of the physical page containing the virtual address is found. Finally, the last field in the virtual address, the byte offset, is used to find the data within the page.
Each platform that Linux runs on must provide translation macros that allow the kernel to traverse the page tables for a particular process. In this way the kernel does not need to know the format of the page table entries or how they are arranged.
This approach is so successful that Linux uses the same page table manipulation code for the Alpha AXP, which has three levels of page tables, and for Intel x86 processors, which have two.
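To make the three-level walk concrete, here is a minimal sketch that splits a virtual address into three table indices and a byte offset, then follows nested tables to the physical page. The field widths (a 13-bit offset and 10 bits per level) and the use of nested dictionaries are assumptions chosen for the example; the text does not specify them.

```python
OFFSET_BITS = 13   # assumed: 8 KB pages
LEVEL_BITS = 10    # assumed: 10 index bits per page table level


def split(vaddr):
    """Break a virtual address into (level 1, level 2, level 3, offset)."""
    mask = (1 << LEVEL_BITS) - 1
    offset = vaddr & ((1 << OFFSET_BITS) - 1)
    index = vaddr >> OFFSET_BITS
    level3 = index & mask
    level2 = (index >> LEVEL_BITS) & mask
    level1 = (index >> (2 * LEVEL_BITS)) & mask
    return level1, level2, level3, offset


def walk(level1_table, vaddr):
    """Follow the three levels and return the physical address."""
    l1, l2, l3, offset = split(vaddr)
    level2_table = level1_table[l1]   # first lookup
    level3_table = level2_table[l2]   # second lookup
    pfn = level3_table[l3]            # third lookup yields the page frame
    return (pfn << OFFSET_BITS) | offset


tables = {1: {2: {3: 4}}}
vaddr = (1 << 33) | (2 << 23) | (3 << 13) | 0x194
print(split(vaddr))             # (1, 2, 3, 404)
print(hex(walk(tables, vaddr))) # 0x8194
```

A processor with only two levels can be handled by the same walk if one level is treated as a table with a single entry, which is the idea behind hiding the layout behind per-platform macros.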
3.4 Page allocation and deallocation
There are many demands on the physical pages in the system. For example, when an image is loaded into memory, the operating system must allocate pages for it. These pages are freed when the image has finished executing and is unloaded. Physical pages are also used to hold kernel data structures such as the page tables themselves. The mechanisms and data structures used for page allocation and deallocation are perhaps the most critical in keeping the virtual memory subsystem efficient.
All physical pages in the system are described by the mem_map data structure, a list of mem_map_t structures that is initialized at boot time. Each mem_map_t describes a single physical page. The important fields, as far as memory management is concerned, are:
count: The number of users of this page. The count is greater than 1 when the page is shared between several processes.
age: The age of the page, used to decide whether the page is a good candidate for discarding or swapping.
map_nr: The physical page frame number that this mem_map_t describes.
The page allocation code uses the free_area vector to find and free pages. The whole buffer management scheme is supported by this mechanism, and as far as the code is concerned, the page size and the physical paging mechanisms used by the processor are irrelevant.
Each element of free_area contains information about blocks of pages. The first element in the array describes single pages, the next blocks of 2 pages, the next blocks of 4 pages, and so on upwards in powers of two. The list field is used as a queue head and holds pointers to the page data structures in the mem_map array; free blocks of pages are queued here. The map field is a pointer to a bitmap that keeps track of allocated groups of pages of this size. Bit N of the bitmap is set if the Nth block of pages is free.
Figure 3.4 shows the free_area structure. Element 0 has one free page (page frame number 0), and element 2 has two free blocks of 4 pages, the first starting at page frame number 4 and the second at page frame number 56.
3.4.1 Page allocation
Linux uses the buddy algorithm to allocate and deallocate blocks of pages efficiently. The page allocation code attempts to allocate a block of one or more physical pages. Pages are allocated in blocks whose size is a power of 2, so it can allocate blocks of 1 page, 2 pages, 4 pages, and so on. As long as there are enough free pages in the system to grant the request (nr_free_pages > min_free_page), the allocation code searches free_area for a block of pages of the requested size. Each element of free_area has a map of the allocated and free blocks of pages for that block size. For example, element 2 of the array has a memory map that describes free and allocated blocks, each 4 pages long.
The allocation algorithm first searches for blocks of pages of the requested size. It follows the chain of free pages queued on the list field of the free_area data structure. If no block of the requested size is free, it looks for blocks of the next size up, which is twice the requested size. This continues until all of free_area has been searched or a suitable block has been found. If the block found is larger than requested, it is broken down until there is a block of the right size. Because the block sizes are all powers of 2, splitting is easy: each block is simply broken in half. The free blocks are queued on the appropriate queue, and the allocated block is returned to the caller.
Figure 3.4: The free_area data structure
In Figure 3.4, if a block of 2 pages is requested, the first free block of 4 pages (starting at page frame number 4) is split into two blocks of 2 pages. The first, starting at page frame number 4, is returned to the caller as the allocated pages, and the second, starting at page frame number 6, is queued as a free block of 2 pages onto element 1 of the free_area array.
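The search-and-split behaviour described above can be sketched with free_area modelled as a list of lists, where the list at index n holds the starting page frame numbers of free blocks of 2**n pages. This is a deliberate simplification for illustration: the function name alloc_pages and the list representation are assumptions, and the bitmaps are left out entirely.

```python
def alloc_pages(free_area, order):
    """Allocate a block of 2**order pages; return its first page frame
    number, or None if no block of sufficient size is free."""
    for current in range(order, len(free_area)):
        if free_area[current]:
            start = free_area[current].pop(0)
            # Split the block in half until it is the requested size,
            # queueing each unused upper half on the smaller free list.
            while current > order:
                current -= 1
                free_area[current].append(start + (1 << current))
            return start
    return None


# State from the worked example: one free page at frame 0 and two free
# 4-page blocks starting at frames 4 and 56.
free_area = [[0], [], [4, 56]]
print(alloc_pages(free_area, 1))  # 4
print(free_area)                  # [[0], [6], [56]]
```

The request for 2 pages is served from the 4-page block at frame 4, and the leftover half starting at frame 6 lands on the 2-page list, matching the example in the text.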
3.4.2 Page deallocation
Allocating blocks of pages tends to fragment memory, because large free blocks are broken up into smaller ones. The page deallocation code recombines pages into larger free blocks whenever it can. In fact, the power-of-2 block size is what makes it easy to combine blocks into larger ones.
Whenever a block of pages is freed, the code checks whether the adjacent, or buddy, block of the same size is also free. If it is, the two are combined to form a new free block twice the size. Each time two blocks are recombined, the code attempts to merge the result into a still larger block. In the best case, the free blocks become as large as memory usage allows.
In Figure 3.4, if page frame number 1 were freed, it would be combined with the already free page frame number 0 and queued onto element 1 of free_area as a free block of 2 pages.
3.5 Memory mapping
When an image is executed, the contents of the executable image must be brought into the process's virtual address space. The same is true of any shared libraries that the image has been linked to use. The executable file is not actually brought into physical memory; instead it is merely linked into the process's virtual memory. Then, as parts of the program are referenced by the running application, they are brought into memory from disk. This linking of an image into a process's virtual address space is known as memory mapping.
Figure 3.5: Areas of virtual memory
Every process's virtual memory is represented by an mm_struct data structure. It contains information about the image currently executing (for example, bash) and pointers to a number of vm_area_struct data structures. Each vm_area_struct describes the start and end of an area of virtual memory, the process's access rights to that memory, and a set of operations for that memory. These operations are the routines that Linux must use when manipulating this area of virtual memory. One of them is used when a process attempts to access virtual memory that is not currently in physical memory (via a page fault). This operation is called nopage, and Linux uses it when demand paging the pages of an executable image into memory.
When an executable image is mapped into a process's virtual address space, a set of vm_area_struct data structures is generated. Each vm_area_struct represents a part of the executable image: the executable code, initialized data (variables), uninitialized data, and so on. Linux supports a number of standard virtual memory operations, and as the vm_area_struct data structures are created, the correct set of virtual memory operations is associated with them.
3.6 Demand paging
Once an executable image has been memory mapped into a process's virtual address space, it can start to execute. Because only the very start of the image is physically brought into memory, it will soon access an area of virtual memory that is not yet in physical memory. When a process accesses a virtual address that does not have a valid page table entry, the processor reports a page fault to Linux.
A page fault report includes the faulting virtual address and the type of memory access that caused it. Linux must find the vm_area_struct that represents the area of memory in which the page fault occurred. Because the speed of this search is critical to the efficient handling of page faults, the vm_area_struct data structures are linked together in an AVL (Adelson-Velskii and Landis) tree. If no vm_area_struct covers the faulting virtual address, the process has accessed an illegal virtual address. Linux signals the process with SIGSEGV, and if the process has no handler for that signal, it is terminated.
If a matching vm_area_struct is found, Linux next checks the type of access that caused the page fault. If the process is accessing the memory in an illegal way, for example writing to an area that it is only allowed to read, it is also signalled with a memory error.
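The two checks described above (find the area covering the faulting address, then check the access type) can be sketched as follows. Note the substitution: instead of an AVL tree, this example uses a sorted list searched with Python's bisect module, which gives the same logarithmic lookup for a static set of areas. The tuple layout and the function names find_area and classify_fault are assumptions for illustration.

```python
import bisect


def find_area(areas, addr):
    """areas: list of (start, end, writable) tuples sorted by start,
    non-overlapping, with end exclusive. Return the covering area or None."""
    starts = [area[0] for area in areas]
    i = bisect.bisect_right(starts, addr) - 1
    if i >= 0 and addr < areas[i][1]:
        return areas[i]
    return None


def classify_fault(areas, addr, is_write):
    """Decide what to do with a page fault at addr."""
    area = find_area(areas, addr)
    if area is None:
        return "SIGSEGV"          # no area covers the address
    if is_write and not area[2]:
        return "memory error"     # write to a read-only area
    return "handle fault"         # legitimate fault: page it in


areas = [(0x1000, 0x3000, False), (0x8000, 0x9000, True)]
print(classify_fault(areas, 0x2000, is_write=False))  # handle fault
print(classify_fault(areas, 0x2000, is_write=True))   # memory error
print(classify_fault(areas, 0x5000, is_write=False))  # SIGSEGV
```

A balanced tree becomes preferable to a sorted list when areas are added and removed frequently, since insertion into a list is linear.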
If Linux decides that the page fault is legitimate, it must deal with it.
Linux must first distinguish between pages that are in the swap file and those that are part of an executable image on disk. In an Alpha AXP page table, an entry may have the valid bit clear but a non-zero value in its PFN field. In this case the PFN field holds information about where the page is in the swap file. How pages in the swap file are handled is discussed later.
Not all vm_area_struct data structures have a set of virtual memory operations, and even those that do may not have a nopage operation. This is because, by default, Linux fixes up the access by allocating a new physical page and creating a valid page table entry for it. If there is a nopage operation for this area of virtual memory, Linux calls it.
The generic Linux nopage operation is used for memory-mapped executable images, and it uses the page cache to bring the required image page into physical memory.
However the required page is brought into physical memory, the process's page tables must be updated. Updating these entries may require hardware-specific actions, particularly if the processor uses a TLB. Once the page fault has been handled, the process is restarted at the instruction that made the faulting virtual memory access.
3.7 The Linux page cache
Figure 3.6: The Linux page cache
The role of the Linux page cache is to speed up access to files on disk. Memory-mapped files are read a page at a time, and these pages are stored in the page cache. Figure 3.6 shows that the page cache consists of page_hash_table, an array of pointers to mem_map_t data structures.
Each file in Linux is identified by a VFS inode data structure (described in the file system chapter). Each VFS inode is unique and fully describes one and only one file. The index into the page hash table is derived from the file's VFS inode and the offset into the file.
Whenever a page is read from a memory-mapped file, for example when it needs to be brought back into memory during demand paging, the page is read through the page cache. If the page is present in the cache, a pointer to the mem_map_t data structure representing it is returned to the page fault handling code. Otherwise, a physical page is allocated and the page is read into memory from the file system that holds the image.
As images are read and executed, the page cache grows. Pages that are no longer needed, that is, no longer used by any process, are removed from the cache.
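The lookup path through the page cache can be illustrated with a dictionary keyed by (inode, offset), standing in for the hash table described above. The class name PageCache, its methods, and the read_page callback are all hypothetical; the callback plays the role of the file system read.

```python
class PageCache:
    """Toy page cache keyed by (inode, offset)."""

    def __init__(self, read_page):
        self._pages = {}             # (inode, offset) -> page contents
        self._read_page = read_page  # fallback that reads from "disk"
        self.disk_reads = 0

    def get(self, inode, offset):
        """Return the page, reading it from disk only on a cache miss."""
        key = (inode, offset)
        if key not in self._pages:
            self._pages[key] = self._read_page(inode, offset)
            self.disk_reads += 1
        return self._pages[key]

    def drop(self, inode, offset):
        """Remove a page that no process is using any more."""
        self._pages.pop((inode, offset), None)


cache = PageCache(lambda inode, offset: f"inode {inode} page at {offset}")
cache.get(12, 0)
cache.get(12, 0)
print(cache.disk_reads)  # 1
```

The second get is served from the cache, so only one disk read takes place.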
3.8 swap out and discard page
When the physical memory in the system is reduced, the Linux memory management subsystem must release the physical page. This task is completed by the core switch backend process (kswapd.
The core exchange background process is a special Core Thread. It is a process without virtual memory and runs in the core state in the physical address space. The name of the core switch background process is easy to misunderstand. In fact, it performs much more work than simply exchanging pages to system files. The goal is to ensure that the system has enough free pages to maintain the memory management system running efficiency.
This process runs when the core init process is started and is periodically called by the core switch timer.
When the timer arrives, the switch background process checks whether there are too few idle pages in the system. It uses two variables: free_pages_high and free_page_low to determine whether to release some pages. As long as the number of free pages in the system is greater than free_pages_high, the core switch background process does not do any work; it will sleep until the next timer. During the check, the core switch background process also counts the number of pages currently written to the switch file. It uses nr_async_pages to record this value; when a page is added to the queue to be written to the swap file, the page increments once and then decreases once after the write operation is complete. If the number of free pages in the system is below free_pages_high or even free_pages_low, the core switch background process will reduce the number of physical pages used in the system in three ways:
Reduce the size of the buffer and page cache,
Swap out the V-type Memory Page,
Swap out or discard the page.
If the number of free pages has fallen below free_pages_low, the kernel swap daemon tries to free 6 pages before it next runs; otherwise it tries to free 3. The three methods above are tried in turn until enough pages have been freed. The daemon remembers which method it used the last time it freed physical pages, and on its next run it starts with that last successful method.
Once enough pages have been freed, the kernel swap daemon sleeps until its timer next expires. If it freed pages because the number of free pages had fallen below free_pages_low, it sleeps for only half the usual time. Once the number of free pages is above free_pages_low again, it goes back to sleeping longer between checks.
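The wake-up logic described above can be sketched in a few lines of Python. This is a minimal illustration, not kernel code: the function names, the normal_sleep interval, and the modelling of each reclaim method as a callable that frees at most one page per call are assumptions made for the example. It also assumes that pages queued for the swap file (nr_async_pages) are added to the free count before it is compared with the thresholds.

```python
def pages_to_free(nr_free_pages, nr_async_pages, free_pages_low, free_pages_high):
    """Decide how many pages the daemon should try to free on this tick."""
    # Assumption: pages already queued for the swap file count as nearly free.
    available = nr_free_pages + nr_async_pages
    if available > free_pages_high:
        return 0   # plenty of free pages: do nothing
    if available < free_pages_low:
        return 6   # critically low: free more pages
    return 3       # at or below the high mark: free a few pages


def next_sleep(nr_free_pages, free_pages_low, normal_sleep=1.0):
    """Sleep half as long while free pages are below free_pages_low."""
    return normal_sleep / 2 if nr_free_pages < free_pages_low else normal_sleep


def try_to_free(methods, last_method, wanted):
    """Try each reclaim method in turn, starting with the last successful one.

    methods: hypothetical callables that free one page and return True,
             or return False when they cannot free anything.
    Returns (pages_freed, index_of_method_to_start_with_next_time).
    """
    freed = 0
    index = last_method
    for _ in range(len(methods)):
        while freed < wanted and methods[index]():
            freed += 1
        if freed >= wanted:
            return freed, index
        index = (index + 1) % len(methods)
    return freed, index
```

With free_pages_low = 20 and free_pages_high = 40, for instance, a system with 10 free pages and none in flight asks for 6 pages and then sleeps for half the normal interval.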
3.8.1 Reducing the Size of the Page Cache and Buffer Cache
Pages held in the page cache and buffer cache are good candidates for being freed into the free_area array. The page cache, which holds pages of memory-mapped files, may contain unnecessary pages that are filling up system memory. Likewise the buffer cache, which holds buffers read from or being written to physical devices, may contain buffers that are no longer needed. When physical pages start to run out, discarding pages from these caches is relatively easy because, unlike swapping pages out of memory, it requires no writes to physical devices. Discarding these pages has few harmful side effects other than making access to physical devices and memory-mapped files slower. If pages are discarded fairly, all processes suffer equally.
Each time the kernel swap daemon tries to shrink these caches, it examines a block of pages in the mem_map page array to see whether any can be discarded from physical memory. The block examined is larger when the daemon is swapping intensively, that is, when the number of free pages has fallen dangerously low. Blocks are examined cyclically: each attempt to shrink the memory map examines a different block of pages. This is known as the clock algorithm because, like the minute hand of a clock, it sweeps the whole mem_map page array a few pages at a time.
Each page examined is checked to see whether it is cached in either the page cache or the buffer cache. Note that shared pages are not considered for discarding at this stage, and that a page cannot be in both caches at the same time. If the page is in neither cache, the next page in the mem_map page array is examined.
Pages are cached in the buffer cache (or rather, the buffers within the pages are cached) to make buffer allocation and deallocation more efficient. The memory map shrinking code tries to free the buffers contained in the page being examined.
If all the buffers in a page are freed, the page itself is freed as well. If the page being examined is in the Linux page cache, it is removed from the page cache and freed.
If enough pages have been freed on this attempt, the kernel swap daemon waits until its next periodic wake-up. Because none of the freed pages were part of any process's virtual memory (they were cached pages), no page tables need updating. If not enough cached pages were discarded, the swap daemon tries to swap out some shared pages.
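The clock-style scan of mem_map can be illustrated with a short Python sketch. It is a simplified model under stated assumptions: each page is a dictionary whose keys ("cache", "shared", "busy_buffers") are hypothetical, and freeing a page is reduced to clearing its cache field and recording its number.

```python
def shrink_mmap(mem_map, clock_hand, block_size):
    """Examine one block of pages, starting where the previous scan stopped.

    mem_map: list of dicts with hypothetical keys
        "cache"        -> None, "page" or "buffer"
        "shared"       -> True if the page is shared
        "busy_buffers" -> number of buffers in the page that cannot be freed
    Returns (list_of_freed_page_numbers, new_clock_hand).
    """
    freed = []
    total = len(mem_map)
    for _ in range(min(block_size, total)):
        page_nr = clock_hand
        page = mem_map[page_nr]
        clock_hand = (clock_hand + 1) % total   # advance like a clock hand
        if page["shared"] or page["cache"] is None:
            continue                            # not a candidate at this stage
        if page["cache"] == "buffer" and page["busy_buffers"] > 0:
            continue                            # some buffers are still in use
        page["cache"] = None                    # drop from its cache and free
        freed.append(page_nr)
    return freed, clock_hand
```

Passing the returned clock hand back in on the next call makes successive scans cover different blocks, so the whole array is eventually visited.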
3.8.2 Swapping Out System V Shared Memory Pages
System V shared memory is an inter-process communication mechanism that lets two or more processes share virtual memory in order to pass information between them. How processes share memory in this way is described in detail in the IPC chapter. For now it is enough to know that each area of System V shared memory is described by a shmid_ds data structure. This structure contains a pointer to a list of vm_area_struct data structures, one for each process sharing the area, linked together through their vm_next_shared and vm_prev_shared pointers. Each shmid_ds data structure also contains a list of page table entries, each describing the physical page that a shared virtual page maps to.
The kernel swap daemon also uses a clock algorithm when swapping out System V shared memory pages.
Each time it runs, it remembers which page of which shared virtual memory area it last swapped out. It does this with two indices: one into the set of shmid_ds data structures, and one into the list of page table entries for that area of System V shared memory. This ensures that the areas of System V shared memory are chosen fairly.
Because the physical page frame number for a given page of System V shared memory is held in the page tables of every process sharing the area, the kernel swap daemon must modify all of those page tables to show that the page is no longer in memory but in the swap file. For each shared page being swapped out, the daemon finds the page table entry in each sharing process's page tables (by following a pointer from each vm_area_struct data structure). If a process's page table entry for this page is valid, the daemon converts it into an invalid, swapped-out entry and decrements the shared page's user count by one. A swapped-out System V shared page table entry contains an index into the set of shmid_ds data structures and an index into the page table entries for that area of System V shared memory.
If the page's count is zero after the page tables of all sharing processes have been modified, the shared page can be written to the swap file. The page table entry in the list pointed to by the shmid_ds data structure for this area is also replaced by a swapped-out page table entry. A swapped-out page table entry is invalid, but it contains an index into the set of open swap files and the offset within that file at which the page can be found. This information is used when the page has to be brought back into physical memory.
3.8.3 Swapping Out and Discarding Pages
The swap daemon looks at each process in the system in turn to see whether it is a good candidate for swapping.
Good candidates are processes that can be swapped (some cannot) and that have one or more pages that can be swapped out of or discarded from memory. Pages are written from physical memory to the system's swap files only if the data in them cannot be retrieved any other way.
Much of the content of an executable image comes from the image file and can easily be re-read from it. For example, the executable instructions of an image are never modified by the image, so they are never written to the swap file. These pages can simply be discarded; when the process references them again, they are read back into memory from the executable image file.
Once the process to swap has been chosen, the swap daemon looks through all of its virtual memory areas for areas that are not shared or locked.
Linux does not swap out all the swappable pages of the selected process; it removes only a small number of pages.
Pages that are locked in memory cannot be swapped out or discarded.
The Linux swap algorithm uses page aging. Each page has a counter (held in the mem_map_t data structure) that tells the kernel swap daemon whether the page is worth swapping out. Pages age when they are unused and are rejuvenated when they are accessed; the swap daemon swaps out only old pages. By default, a page is given an initial age of 3 when it is first allocated. Each time it is referenced, its age is increased by 3, up to a maximum of 20. Each time the kernel swap daemon runs, it ages pages by decrementing their age by 1. These defaults can be changed, and for that reason they are stored in the swap_control data structure.
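The default aging rules can be written down directly. The sketch below uses the values given in the text (initial age 3, plus 3 per reference up to 20, minus 1 per daemon run); the constant and function names are hypothetical and do not correspond to fields of swap_control.

```python
INITIAL_AGE = 3    # age given to a newly allocated page
AGE_ADVANCE = 3    # added each time the page is referenced
MAX_AGE = 20       # upper limit on a page's age
AGE_DECLINE = 1    # subtracted on each run of the swap daemon


def touch_page(age):
    """Rejuvenate a page that has just been referenced."""
    return min(age + AGE_ADVANCE, MAX_AGE)


def age_page(age):
    """Age a page by one run of the swap daemon; the age never goes below 0."""
    return max(age - AGE_DECLINE, 0)


def is_old(age):
    """Only pages whose age has reached 0 are considered for swapping."""
    return age == 0
```

Under these defaults, a freshly allocated page that is never referenced becomes old after three runs of the daemon, while a page at the maximum age needs twenty unreferenced runs.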
If a page is old (age = 0), the swap daemon processes it further. Dirty pages are pages that can be swapped out. Linux uses an architecture-specific bit in the PTE to mark a page as dirty (see. 2). However, not every dirty page is necessarily written to the swap file. Each virtual memory area of a process may have its own swap operation (pointed to by the vm_ops pointer in the vm_area_struct data structure), and if it does, that method is used. Otherwise, the swap daemon allocates a page in the swap file and writes the page out to that device.
The page's page table entry is replaced by one that is marked invalid but that records where the page is in the swap file: an offset into the swap file and an indication of which swap file is used. Whichever swap method is used, the original physical page is freed by putting it back into free_area. Clean (that is, not dirty) pages can simply be discarded and put back into free_area for reuse.
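To make the idea of an invalid page table entry that still carries a swap location concrete, here is a small Python sketch. The bit layout is an assumption chosen for illustration (bit 0 as the valid bit, seven bits for the swap file index, the remaining bits for the offset); the real format is processor specific, and the function names are hypothetical.

```python
VALID_BIT = 0x1          # assumed: bit 0 set means the entry maps a physical page
SWAP_FILE_SHIFT = 1      # assumed: swap file index stored in bits 1 to 7
SWAP_FILE_MASK = 0x7F
OFFSET_SHIFT = 8         # assumed: page offset within the swap file from bit 8 up


def make_swap_entry(swap_file, offset):
    """Build an invalid entry recording which swap file and which offset."""
    if not 0 <= swap_file <= SWAP_FILE_MASK:
        raise ValueError("swap file index out of range")
    if offset < 0:
        raise ValueError("offset must be non-negative")
    return (swap_file << SWAP_FILE_SHIFT) | (offset << OFFSET_SHIFT)


def is_valid(entry):
    """A swapped-out entry always has the valid bit clear."""
    return bool(entry & VALID_BIT)


def swap_file_of(entry):
    return (entry >> SWAP_FILE_SHIFT) & SWAP_FILE_MASK


def swap_offset_of(entry):
    return entry >> OFFSET_SHIFT
```

Because the valid bit is clear, any access through such an entry faults, and the fault handler can decode the entry to find the page in the swap file.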
If enough of the swappable process's pages have been swapped out or discarded, the swap daemon sleeps again. The next time it wakes, it considers the next process in the system. In this way the swap daemon reclaims physical pages from each process a little at a time until the system is in balance again. This is much fairer than swapping out whole processes.
3.9 The Swap Cache
When swapping pages out to the swap files, Linux avoids writing pages unless it has to. Sometimes a page is both in a swap file and in physical memory. This happens when a page that was swapped out is brought back into memory because a process accessed it again. As long as the page in memory has not been written to, the copy in the swap file remains valid.
Linux uses the swap cache to track these pages. The swap cache is a list of page table entries, one for each physical page in the system. Each is the page table entry for a swapped-out page and describes which swap file holds the page and where in that swap file it is located. A nonzero swap cache entry represents a page held in a swap file that has not been modified. If the page is subsequently modified (by being written to), its entry is removed from the swap cache.
When Linux needs to swap a physical page out to a swap file, it consults the swap cache. If there is a valid entry for the page, the page does not need to be written to the swap file, because the page in memory has not been modified since it was last read from the swap file.
The entries in the swap cache are page table entries for swapped-out pages. Although marked invalid, they contain the information Linux needs to find the right swap file and the right page within it.
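The swap cache behaviour described in this section can be modelled as one slot per physical page, where 0 means that no valid copy exists in a swap file. The class and method names below are hypothetical, and the swap entry is treated as an opaque nonzero integer.

```python
class SwapCache:
    """One slot per physical page; 0 means no valid copy in a swap file."""

    def __init__(self, nr_physical_pages):
        self.entries = [0] * nr_physical_pages

    def add(self, page_nr, swap_entry):
        """Record that page_nr holds the data stored at swap_entry."""
        if swap_entry == 0:
            raise ValueError("a swap cache entry must be nonzero")
        self.entries[page_nr] = swap_entry

    def page_written(self, page_nr):
        """The in-memory page was modified, so the swap copy is stale."""
        self.entries[page_nr] = 0

    def needs_write(self, page_nr):
        """True if swapping this page out requires a write to the swap file."""
        return self.entries[page_nr] == 0
```

A page that was read back from swap and never modified reports needs_write as False, which is the write that Linux avoids.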
3.10 Swapping Pages In
The dirty pages saved in the swap files may be needed again, for example when an application writes to an area of virtual memory whose contents are held in a swapped-out physical page. Accessing a page of virtual memory that is not in physical memory causes a page fault. The page fault is the processor's way of telling the operating system that it cannot translate a virtual address into a physical one, in this case because the page table entry for the page was marked invalid when the page was swapped out. The processor therefore hands control to the operating system, reporting the virtual address that faulted and the reason for the fault. The format of this information and the way the processor passes control to the operating system are processor specific.
The processor-specific page fault handling code must locate the vm_area_struct data structure that describes the area of virtual memory containing the faulting virtual address. It does so by searching the process's vm_area_struct data structures until it finds the one containing that address. This code is very time critical, so a process's vm_area_struct data structures are arranged to make the search take as little time as possible.
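The text does not say how the vm_area_struct data structures are arranged, so the following Python sketch simply assumes one possible arrangement: areas kept sorted by start address and searched with a binary search. The class name, the (start, end) tuples, and the exclusive end address are all assumptions made for the illustration.

```python
import bisect


class AddressSpace:
    """Hypothetical stand-in for a process's list of virtual memory areas."""

    def __init__(self, areas):
        # areas: non-overlapping (start, end) pairs, end address exclusive
        self.areas = sorted(areas)
        self.starts = [start for start, _ in self.areas]

    def find_area(self, address):
        """Return the area containing address, or None if it is unmapped."""
        # Rightmost area whose start is <= address
        index = bisect.bisect_right(self.starts, address) - 1
        if index >= 0 and address < self.areas[index][1]:
            return self.areas[index]
        return None
```

A lookup that returns None corresponds to a fault on an address outside every valid area, which the fault handler would treat as an error rather than as a page to swap in.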
Once the processor-specific actions have been carried out and the faulting virtual address has been found to lie in a valid area of virtual memory, the rest of the page fault handling is generic and the same for all processors that Linux runs on.
The generic page fault handling code looks for the page table entry for the faulting virtual address. If the entry is for a swapped-out page, Linux must swap the page back into physical memory. The format of the page table entry for a swapped-out page is processor specific, but all processors mark these pages as invalid and store in the page table entry the information needed to locate the page within the swap file. Linux uses this information to bring the page back into physical memory.
At this point Linux knows the faulting virtual address and has a page table entry that records where the page has been swapped to. The vm_area_struct data structure may contain a pointer to a swapin routine that swaps any page of the virtual memory area it describes back into physical memory. If a swapin operation exists for this virtual memory area, Linux uses it. This is how swapped-out System V shared memory pages are handled, because the format of a swapped-out System V shared page differs slightly from that of an ordinary swapped-out page. If there is no swapin operation, Linux assumes that this is an ordinary page that needs no special handling.
Linux allocates a free physical page and reads the swapped-out page back from the swap file. The information about where the page is in the swap file (and which swap file) is taken from the invalid page table entry.
If the access that caused the page fault was not a write, the page is left in the swap cache and its page table entry is not marked writable. If the page is later written to, another page fault occurs; at that point the page is marked dirty and its entry is removed from the swap cache. If the page is not written to and needs to be swapped out again, Linux can skip the write because the page is already in the swap file.
If the access that caused the page to be brought in from the swap file was a write, the page is removed from the swap cache and its page table entry is marked both dirty and writable.
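The difference between a read fault and a write fault on a swapped-out page can be summarised in a short Python sketch. It is a simplified model under stated assumptions: the page table entry is a dictionary, the swap cache is a dictionary keyed by physical page number, and all names are hypothetical. Allocating the page and reading it from the swap file are left out.

```python
def swap_in(page_nr, swap_entry, write_access, swap_cache):
    """Build the new page table entry after the page has been read back in.

    swap_cache: dict mapping physical page number -> swap entry.
    """
    pte = {"page": page_nr, "valid": True, "writable": False, "dirty": False}
    if write_access:
        # The page is about to change, so the swap copy cannot be reused.
        swap_cache.pop(page_nr, None)
        pte["writable"] = True
        pte["dirty"] = True
    else:
        # Keep the swap copy usable and leave the entry read-only so that a
        # later write causes another fault.
        swap_cache[page_nr] = swap_entry
    return pte


def write_fault(pte, swap_cache):
    """Handle a later write to a page that was swapped in by a read."""
    swap_cache.pop(pte["page"], None)
    pte["dirty"] = True
    pte["writable"] = True


def must_write_on_swap_out(pte, swap_cache):
    """A page still in the swap cache already has a valid copy on disk."""
    return pte["page"] not in swap_cache
```

A page swapped in by a read and never written can be swapped out again without any disk write, while a write at any point forces the page to be written the next time it is swapped out.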
This article is an English version of an article originally published in Chinese on aliyun.com and is provided for information purposes only.