Reposted from: melody_lu123's CSDN blog. A great technical article.
Author: Gustavo Duarte
From: http://duartes.org/gustavo/blog/post/how-the-kernel-manages-your-memory
How the kernel manages your memory
How does the kernel manage your memory?
After examining the virtual address layout of a process, we turn to the kernel and its mechanisms for managing user memory. Here is Gonzo again:
Following the previous article on the virtual address layout of a process, it is time to look at the kernel and its mechanisms for managing user memory. Gonzo, from the previous article, serves as the example again:
Linux processes are implemented in the kernel as instances of task_struct, the process descriptor. The mm field in task_struct points to the memory descriptor, mm_struct, which is an executive summary of a program's memory. It stores the start and end of memory segments as shown above, the number of physical memory pages used by the process (RSS stands for Resident Set Size), the amount of virtual address space used, and other tidbits. Within the memory descriptor we also find the two workhorses for managing program memory: the set of virtual memory areas and the page tables. Gonzo's memory areas are shown below:
A Linux process is represented in the kernel by a task_struct. Its mm field points to the memory descriptor used by the process, mm_struct. The memory descriptor holds the start and end addresses of each segment shown above (stored as unsigned long), the number of physical memory pages used by the process, the amount of virtual address space used (total_vm, shared_vm, ...), and other details. Two members of this structure do most of the work of managing program memory: the set of vm_area_struct instances and the page tables (pgd, an architecture-dependent structure). Gonzo's virtual memory layout is shown below:
Each virtual memory area (VMA) is a contiguous range of virtual addresses; these areas never overlap. An instance of vm_area_struct fully describes a memory area, including its start and end addresses, flags to determine access rights and behaviors, and the vm_file field to specify which file is being mapped by the area, if any. A VMA that does not map a file is anonymous. Each memory segment above (e.g., heap, stack) corresponds to a single VMA, with the exception of the memory mapping segment. This is not a requirement, though it is usual in x86 machines. VMAs do not care which segment they are in.
Each virtual memory area is a contiguous range of virtual addresses, and these areas never overlap. A vm_area_struct describes one such area: its start and end addresses, flags that determine access permissions and behavior, and vm_file, which indicates the file mapped by the area, if any. A VMA that does not map a file is called anonymous. Each of the memory segments above (heap, stack, ...) corresponds to a single VMA. The only exception is the memory mapping segment, which is described by several VMAs. This is not required, but it is usually the case on x86. A VMA does not care which segment it belongs to.
A program's VMAs are stored in its memory descriptor both as a linked list in the mmap field, ordered by starting virtual address, and as a red-black tree rooted at the mm_rb field. The red-black tree allows the kernel to search quickly for the memory area covering a given virtual address. When you read the file /proc/pid_of_process/maps, the kernel is simply going through the linked list of VMAs for the process and printing each one.
All of a program's VMAs are stored in its memory descriptor (mm_struct) in two forms: a linked list headed by the mmap field, and a red-black tree keyed by each VMA's virtual address. The red-black tree makes it fast to find the VMA covering a given virtual address. The linked list walks the VMAs of the process in address order. For example, when you read /proc/pid_of_process/maps, the kernel traverses the linked list to print every VMA.
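The two views of the VMA set can be imitated in user space. The sketch below parses text in the /proc/pid_of_process/maps format into an address-ordered list and then looks up the area covering an address. It is only an illustration: a binary search over a sorted list stands in for the kernel's red-black tree, the lookup returns None when no area covers the address (a simplification), and the sample lines, including the /bin/gonzo path and the addresses, are made up for the example.

```python
import bisect
from typing import List, NamedTuple, Optional


class Vma(NamedTuple):
    start: int   # first address of the area (inclusive)
    end: int     # first address past the area (exclusive)
    perms: str   # permission string as printed in maps, e.g. "r-xp"
    path: str    # backing file or label; "" for an unnamed anonymous area


def parse_maps(text: str) -> List[Vma]:
    """Parse maps-style text into a list of areas sorted by start address."""
    vmas = []
    for line in text.splitlines():
        fields = line.split(None, 5)
        if len(fields) < 5:
            continue  # skip blank or malformed lines
        start, end = (int(x, 16) for x in fields[0].split("-"))
        path = fields[5].strip() if len(fields) > 5 else ""
        vmas.append(Vma(start, end, fields[1], path))
    return sorted(vmas)


def find_vma(vmas: List[Vma], addr: int) -> Optional[Vma]:
    """Return the area covering addr, or None if no area covers it."""
    starts = [v.start for v in vmas]
    i = bisect.bisect_right(starts, addr) - 1  # last area starting at or before addr
    if i >= 0 and vmas[i].start <= addr < vmas[i].end:
        return vmas[i]
    return None


# Hypothetical sample in the maps format (not taken from a real process).
SAMPLE = """\
08048000-0804c000 r-xp 00000000 08:01 1234 /bin/gonzo
0804c000-0804d000 rw-p 00003000 08:01 1234 /bin/gonzo
0804d000-0806e000 rw-p 00000000 00:00 0 [heap]
bffdf000-c0000000 rw-p 00000000 00:00 0 [stack]
"""

if __name__ == "__main__":
    areas = parse_maps(SAMPLE)
    print(find_vma(areas, 0x0804D123))
```

On a Linux machine the same parser can be fed the contents of /proc/self/maps instead of SAMPLE.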
In Windows, the EPROCESS block is roughly a mix of task_struct and mm_struct. The Windows analog to a VMA is the Virtual Address Descriptor, or VAD; they are stored in an AVL tree. You know what the funniest thing about Windows and Linux is? It's the little differences.
In Windows, the EPROCESS block plays roughly the combined role of task_struct and mm_struct in Linux. The Windows counterpart of a VMA is the Virtual Address Descriptor, or VAD; VADs are kept in an AVL tree. Do you know the funniest thing about Windows and Linux? It is the little differences.
The 4 GB virtual address space is divided into pages. x86 processors in 32-bit mode support page sizes of 4 KB, 2 MB, and 4 MB. Both Linux and Windows map the user portion of the virtual address space using 4 KB pages. Bytes 0-4095 fall in page 0, bytes 4096-8191 fall in page 1, and so on. The size of a VMA must be a multiple of page size. Here's 3 GB of user space in 4 KB pages:
The whole 4 GB virtual address space is divided into pages. In 32-bit mode the x86 processor supports page sizes of 4 KB, 2 MB, and 4 MB. Linux and Windows both use 4 KB pages to map the user portion of the virtual address space. Bytes 0-4095 belong to page 0, bytes 4096-8191 belong to page 1, and so on. The size of every VMA must be an integer multiple of the page size. The following shows the 3 GB user space with a 4 KB page size:
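The page arithmetic described above comes down to shifts and masks. Here is a small sketch assuming 4 KB pages; the helper names are chosen for this example and are not kernel identifiers.

```python
PAGE_SHIFT = 12                    # 2**12 = 4096 bytes per page
PAGE_SIZE = 1 << PAGE_SHIFT
OFFSET_MASK = PAGE_SIZE - 1        # low 12 bits select a byte within the page


def page_number(addr: int) -> int:
    """Index of the page that contains addr."""
    return addr >> PAGE_SHIFT


def page_offset(addr: int) -> int:
    """Offset of addr inside its page."""
    return addr & OFFSET_MASK


def page_align_up(length: int) -> int:
    """Round a length up to a whole number of pages, as a VMA size must be."""
    return (length + OFFSET_MASK) & ~OFFSET_MASK


USER_SPACE = 3 * 1024 ** 3         # 3 GB of user address space

if __name__ == "__main__":
    print(page_number(4095), page_number(4096), page_number(8191))  # 0 1 1
    print(USER_SPACE // PAGE_SIZE)                                   # 786432 pages
```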
The processor consults page tables to translate a virtual address into a physical memory address. Each process has its own set of page tables; whenever a process switch occurs, page tables for user space are switched as well. Linux stores a pointer to a process' page tables in the pgd field of the memory descriptor. To each virtual page there corresponds one page table entry (PTE) in the page tables, which in regular x86 paging is a simple 4-byte record shown below:
The processor translates a virtual address into a physical memory address through the page tables. Each process has its own set of page tables. Whenever a process switch occurs, the user-space page tables are switched as well. The pgd field of mm_struct points to the page tables used by the process. Each virtual page has a corresponding page table entry (PTE) in the page tables. Under regular x86 paging a PTE is a simple 4-byte record, shown below:
Linux has functions to read and set each flag in a PTE. Bit P tells the processor whether the virtual page is present in physical memory. If clear (equal to 0), accessing the page triggers a page fault. Keep in mind that when this bit is zero, the kernel can do whatever it pleases with the remaining fields. The R/W flag stands for read/write; if clear, the page is read-only. Flag U/S stands for user/supervisor; if clear, then the page can only be accessed by the kernel. These flags are used to implement the read-only memory and protected kernel space we saw before.
Linux provides a set of functions for reading and setting the flags in a PTE. The P bit tells the processor whether the virtual page is present in physical memory; if it is 0, accessing the page triggers a page fault. Remember that when the P bit is 0, the hardware does not interpret the remaining bits, so the kernel can use them for other purposes. R/W stands for read/write; 0 means the page is read-only. U/S stands for user/supervisor; 0 means the page can only be accessed by the kernel.
Bits D and A are for dirty and accessed. A dirty page has had a write, while an accessed page has had a write or read. Both flags are sticky: the processor only sets them, they must be cleared by the kernel. Finally, the PTE stores the starting physical address that corresponds to this page, aligned to 4 KB. This naive-looking field is the source of some pain, for it limits addressable physical memory to 4 GB. The other PTE fields are for another day, as is Physical Address Extension.
The D and A bits stand for dirty and accessed. A dirty page has been written to, while an accessed page has been read or written. These two flags are special: the processor only ever sets them, and it is up to the kernel to clear them. Finally, the PTE stores the starting physical address of the corresponding page, aligned to 4 KB. This plain-looking field causes some trouble, because it limits addressable physical memory to 4 GB. The other PTE fields, and Physical Address Extension (PAE), are left for another time.
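To make the PTE fields concrete, here is a sketch that packs and unpacks a 4-byte entry under regular (non-PAE) 32-bit x86 paging. The bit positions follow the hardware layout (P is bit 0, R/W bit 1, U/S bit 2, A bit 5, D bit 6, frame address in bits 12-31); the function and constant names are invented for this example and are not the kernel's accessors.

```python
PTE_PRESENT = 1 << 0     # P: page is present in physical memory
PTE_RW = 1 << 1          # R/W: clear means read-only
PTE_USER = 1 << 2        # U/S: clear means kernel-only
PTE_ACCESSED = 1 << 5    # A: set by the processor on any read or write
PTE_DIRTY = 1 << 6       # D: set by the processor on a write
PTE_FRAME_MASK = 0xFFFFF000  # bits 12-31: physical address of the page frame


def make_pte(frame_addr: int, flags: int) -> int:
    """Build a PTE from a 4 KB-aligned physical address below 4 GB."""
    if frame_addr & 0xFFF or not 0 <= frame_addr < 1 << 32:
        raise ValueError("frame address must be 4 KB aligned and below 4 GB")
    return frame_addr | flags


def decode_pte(pte: int) -> dict:
    """Split a PTE into its frame address and the flags discussed above."""
    return {
        "frame": pte & PTE_FRAME_MASK,
        "present": bool(pte & PTE_PRESENT),
        "writable": bool(pte & PTE_RW),
        "user": bool(pte & PTE_USER),
        "accessed": bool(pte & PTE_ACCESSED),
        "dirty": bool(pte & PTE_DIRTY),
    }


if __name__ == "__main__":
    pte = make_pte(0x12345000, PTE_PRESENT | PTE_RW | PTE_USER)
    print(hex(pte), decode_pte(pte))
```

Because only 20 bits are left for the frame address, make_pte cannot describe a frame at or above 4 GB, which is the limit mentioned in the text.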
A virtual page is the unit of memory protection because all of its bytes share the U/S and R/W flags. However, the same physical memory could be mapped by different pages, possibly with different protection flags. Notice that execute permissions are nowhere to be seen in the PTE. This is why classic x86 paging allows code on the stack to be executed, making it easier to exploit stack buffer overflows (it's still possible to exploit non-executable stacks using return-to-libc and other techniques). This lack of a PTE no-execute flag illustrates a broader fact: permission flags in a VMA may or may not translate cleanly into hardware protection. The kernel does what it can, but ultimately the architecture limits what is possible.
A virtual page is the unit of memory protection, because all of its bytes share the U/S and R/W flags. Even so, the same physical memory may be mapped by different pages, which may carry different protection flags. Note that the PTE has no execute permission flag (newer 64-bit CPUs with a matching Linux kernel, and kernels using PAE, do support a no-execute feature; for details, see http://en.wikipedia.org/wiki/nx_bit). This is why classic x86 paging allows code on the stack to be executed, which makes stack buffer overflows easier to exploit. (Even with a non-executable stack, return-to-libc and other techniques remain possible.) The missing no-execute flag illustrates a broader fact: the permission flags in a VMA may or may not translate cleanly into hardware protection. The kernel does what it can, but the architecture limits what is possible.
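The gap between VMA permissions and what a classic PTE can express can be shown with a toy mapping. This is a sketch under simplifying assumptions, not the kernel's actual translation: it takes a permission string in the style of the maps file and produces only the P, R/W, and U/S bits, since the classic 4-byte PTE has nowhere to record "no execute".

```python
PTE_PRESENT = 1 << 0
PTE_RW = 1 << 1
PTE_USER = 1 << 2


def pte_flags_for(perms: str) -> int:
    """Map a VMA permission string such as "r-xp" to classic x86 PTE flags.

    Assumption for this toy model: a user area with no permissions at all
    is simply left not present.
    """
    readable, writable, executable = (c != "-" for c in perms[:3])
    if not (readable or writable or executable):
        return 0
    flags = PTE_PRESENT | PTE_USER
    if writable:
        flags |= PTE_RW
    return flags


if __name__ == "__main__":
    # Read-only data and read-only code end up with identical hardware flags:
    # without a no-execute bit, any present user page can be executed.
    print(pte_flags_for("r--p") == pte_flags_for("r-xp"))
```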
Virtual memory doesn't store anything, it simply maps a program's address space onto the underlying physical memory, which is accessed by the processor as a large block called the physical address space. While memory operations on the bus are somewhat involved, we can ignore that here and assume that physical addresses range from zero to the top of available memory in one-byte increments. This physical address space is broken down by the kernel into page frames. The processor doesn't know or care about frames, yet they are crucial to the kernel because the page frame is the unit of physical memory management. Both Linux and Windows use 4 KB page frames in 32-bit mode; here is an example of a machine with 2 GB of RAM:
Virtual memory does not store anything. It simply maps a program's address space onto the underlying physical memory, which the processor accesses as one large block called the physical address space. Memory operations on the bus are somewhat involved, but we can ignore that here and assume that physical addresses run from 0 to the top of available memory in one-byte increments. The kernel divides this physical address space into page frames. The processor does not know or care about frames, but they matter to the kernel because the page frame is the unit of physical memory management. Linux and Windows both use 4 KB page frames in 32-bit mode. The following is an example of a machine with 2 GB of physical memory:
In Linux each page frame is tracked by a descriptor and several flags. Together these descriptors track the entire physical memory in the computer; the precise state of each page frame is always known. Physical memory is managed with the buddy memory allocation technique, hence a page frame is free if it's available for allocation via the buddy system. An allocated page frame might be anonymous, holding program data, or it might be in the page cache, holding data stored in a file or block device. There are other exotic page frame uses, but leave them alone for now. Windows has an analogous Page Frame Number (PFN) database to track physical memory.
In Linux, each page frame is tracked by a descriptor (struct page) and some flags. Taken together, these descriptors track all the physical memory in the computer, so the precise state of each page frame is always known. Physical memory is managed with buddy memory allocation, so a page frame is free if it is available for allocation through the buddy system. An allocated page frame may be anonymous, holding program data, or it may be in the page cache, holding data stored in a file or block device. There are other, more exotic uses of page frames, but we will leave them aside for now. Windows has an analogous Page Frame Number (PFN) database to track physical memory.
Let's put together virtual memory areas, page table entries and page frames to understand how this all works. Below is an example of a user heap:
Let us combine virtual memory areas, page table entries, and page frames to see how memory management works as a whole. Below is an example of a user heap:
Blue rectangles represent pages in the VMA range, while arrows represent page table entries mapping pages onto page frames. Some virtual pages lack arrows; this means their corresponding PTEs have the Present flag clear. This could be because the pages have never been touched or because their contents have been swapped out. In either case access to these pages will lead to page faults, even though they are within the VMA. It may seem strange for the VMA and the page tables to disagree, yet this often happens.
The blue rectangles are the pages within the VMA range, and the arrows are PTEs mapping pages onto page frames. Some virtual pages have no arrow; this means the P flag of their PTE is 0. The page may never have been accessed, or its contents may have been swapped out. In either case, any attempt to access the page causes a page fault, even though the page lies inside the VMA. It may seem strange for the VMA and the page tables to disagree, but it happens often.
A VMA is like a contract between your program and the kernel. You ask for something to be done (memory allocated, a file mapped, etc.), the kernel says "Sure", and it creates or updates the appropriate VMA. But it does not actually honor the request right away, it waits until a page fault happens to do real work. The kernel is a lazy, deceitful sack of scum; this is the fundamental principle of virtual memory. It applies in most situations, some familiar and some surprising, but the rule is that VMAs record what has been agreed, while PTEs reflect what has actually been done by the lazy kernel. These two data structures together manage a program's memory; both play a role in resolving page faults, freeing memory, swapping memory out, and so on. Let's take the simple case of memory allocation:
A VMA is like a contract between your program and the kernel. When you ask for something to be done (memory allocated, a file mapped, etc.), the kernel says OK and creates or updates the appropriate VMA. However, it does not honor your request right away. It waits until a page fault occurs before doing the real work. The kernel is lazy and a bit deceitful :) and this is the fundamental principle of virtual memory. It applies in most situations, some familiar and some surprising, but the rule is that VMAs record what the kernel has agreed to, while PTEs reflect what the lazy kernel has actually done. These two data structures together manage a program's memory, and both take part in resolving page faults, freeing memory, swapping memory out, and so on. Let's look at the simple case of memory allocation:
When the program asks for more memory via the brk() system call, the kernel simply updates the heap VMA and calls it good. No page frames are actually allocated at this point and the new pages are not present in physical memory. Once the program tries to access the pages, the processor page faults and do_page_fault() is called. It searches for the VMA covering the faulted virtual address using find_vma(). If found, the permissions on the VMA are also checked against the attempted access (read or write). If there's no suitable VMA, no contract covers the attempted memory access and the process is punished by Segmentation Fault.
When a program asks for more memory through the brk() system call, the kernel simply updates the heap VMA. No page frames are actually allocated, and the new pages are not present in physical memory. Once the program tries to access these pages, the processor raises a page fault and do_page_fault() runs. It uses find_vma() to look for the VMA covering the faulting virtual address. If one is found, its permissions are checked against the attempted access. If there is no suitable VMA, no contract covers the access, so the access is illegal and the kernel delivers a segmentation fault to the process.
When a VMA is found the kernel must handle the fault by looking at the PTE contents and the type of VMA. In our case, the PTE shows the page is not present. In fact, our PTE is completely blank (all zeros), which in Linux means the virtual page has never been mapped. Since this is an anonymous VMA, we have a purely RAM affair that must be handled by do_anonymous_page(), which allocates a page frame and makes a PTE to map the faulted virtual page onto the freshly allocated frame.
Once a VMA is found and the permission checks pass, the kernel looks at the PTE and the type of VMA. Here the PTE shows that the page is not present. In our example the PTE is completely blank, which in Linux means the virtual page has never been mapped. Because this is an anonymous VMA, the fault is handled by do_anonymous_page(), which allocates a page frame and creates a PTE mapping the faulting virtual page onto the newly allocated frame.
Things could have been different. The PTE for a swapped out page, for example, has 0 in the Present flag but is not blank. Instead, it stores the swap location holding the page contents, which must be read from disk and loaded into a page frame by do_swap_page() in what is called a major fault.
Of course, things may also be different. For example, the PTE of a swapped-out page has 0 in the Present flag, but the PTE is not blank. Instead, it stores the location in the swap area that holds the page contents, which do_swap_page() must read from disk into a page frame. This is called a major fault.
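The fault-handling decisions described above can be summarized in a toy model. This is a sketch, not kernel code: the Vma and Mm classes, the frame counter, and the swap-entry encoding are all invented for illustration, and handle_page_fault only returns the name of the kernel routine that the text says would handle each case.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional

PAGE_SHIFT = 12
PRESENT = 1  # bit 0 of a PTE


class SegmentationFault(Exception):
    """Raised when no VMA (no contract) covers the attempted access."""


@dataclass
class Vma:
    start: int
    end: int
    writable: bool


@dataclass
class Mm:
    vmas: List[Vma]
    # page number -> PTE value; a missing key models a blank (all-zero) PTE
    ptes: Dict[int, int] = field(default_factory=dict)
    next_frame: int = 0x100  # hypothetical stand-in for the frame allocator


def find_vma(mm: Mm, addr: int) -> Optional[Vma]:
    """Linear scan for the area covering addr (the kernel uses a tree)."""
    for vma in mm.vmas:
        if vma.start <= addr < vma.end:
            return vma
    return None


def handle_page_fault(mm: Mm, addr: int, write: bool) -> str:
    vma = find_vma(mm, addr)
    if vma is None or (write and not vma.writable):
        raise SegmentationFault(hex(addr))
    page = addr >> PAGE_SHIFT
    pte = mm.ptes.get(page, 0)
    if pte & PRESENT:
        return "present"  # other fault types are outside this sketch
    # Blank PTE: never mapped. Non-blank PTE: contents live in swap.
    handler = "do_anonymous_page" if pte == 0 else "do_swap_page"
    frame = mm.next_frame
    mm.next_frame += 1
    mm.ptes[page] = (frame << PAGE_SHIFT) | PRESENT
    return handler


if __name__ == "__main__":
    heap = Vma(0x0804D000, 0x0804E000, writable=True)
    mm = Mm([heap])
    heap.end += 0x2000   # what brk() amounts to: only the VMA changes
    print(mm.ptes)       # still empty, no frame has been allocated
    print(handle_page_fault(mm, 0x0804E010, write=True))
```

Growing the heap touches only the VMA; a PTE appears only after the first access faults, which is the laziness the text describes.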
This concludes the first half of our tour through the kernel's user memory management. In the next post, we'll throw files into the mix to build a complete picture of memory fundamentals, including consequences for performance.
Note: the author follows this with another article that brings files into the picture to complete the memory model, including a discussion of performance.