Analysis of Linux memory management

Last Update:2014-05-17 Source: Internet

Author: User

[Address Mapping] (figure: Left middle)
The Linux kernel uses paged memory management. The memory address an application issues is a virtual address, which must be translated through several levels of page tables before it becomes a real physical address.
Come to think of it, address mapping is a rather scary thing. To access memory at a virtual address, the hardware has to read one page table entry at each level of the page table (the page tables themselves live in memory) before the mapping is complete. In other words, one memory access really costs n+1 memory accesses (n = number of page table levels), plus n additions.
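The n+1 count can be illustrated with a toy walk over a three-level table. This is only a sketch: the field widths, the nested-dictionary tables, and the names split and translate are illustrative assumptions, not how the MMU or the kernel actually represents page tables.

```python
# Toy 3-level page table walk that counts memory accesses.
# Assumed layout: three 4-bit index fields followed by a 12-bit page offset.
LEVELS = 3
INDEX_BITS = 4
OFFSET_BITS = 12


def split(vaddr):
    """Split a virtual address into per-level indexes and a page offset."""
    offset = vaddr & ((1 << OFFSET_BITS) - 1)
    indexes = []
    for level in range(LEVELS):
        shift = OFFSET_BITS + INDEX_BITS * (LEVELS - 1 - level)
        indexes.append((vaddr >> shift) & ((1 << INDEX_BITS) - 1))
    return indexes, offset


def translate(root, vaddr):
    """Walk the tables; return (physical address, number of memory accesses)."""
    indexes, offset = split(vaddr)
    entry = root
    accesses = 0
    for idx in indexes:
        entry = entry[idx]   # one memory read per page table level
        accesses += 1
    accesses += 1            # the access to the data itself
    return entry + offset, accesses  # the last-level entry is the frame base


# Map the virtual page with indexes (1, 2, 3) to the physical frame at 0x5000.
root = {1: {2: {3: 0x5000}}}
vaddr = (1 << 20) | (2 << 16) | (3 << 12) | 0x2A
paddr, accesses = translate(root, vaddr)
print(hex(paddr), accesses)
```

With three levels the walk performs three table reads plus the final data access, which is the n+1 from the text.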
Address mapping therefore needs hardware support, and the MMU (memory management unit) is that hardware. A cache for page table entries is also needed, which is the TLB (translation lookaside buffer).
Even so, address mapping carries a significant overhead. Assuming the cache is 10 times faster than memory, the hit rate is 40%, and the page table has three levels, an average virtual address access costs the equivalent of roughly two physical memory accesses.
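One reading of these figures that reproduces the estimate is sketched below. It assumes the 40% hit rate applies independently to each of the three page table lookups and that a cache hit costs one tenth of a memory access; the function name and parameters are hypothetical.

```python
def translation_overhead(levels, hit_rate, cache_cost, memory_cost=1.0):
    """Expected translation cost, in units of one physical memory access."""
    # Each level's lookup is served either by the cache or by memory.
    per_level = hit_rate * cache_cost + (1 - hit_rate) * memory_cost
    return levels * per_level


overhead = translation_overhead(levels=3, hit_rate=0.4, cache_cost=0.1)
print(round(overhead, 2))
```

Under these assumptions each level costs 0.4 × 0.1 + 0.6 × 1 = 0.64 memory accesses, so three levels cost about 1.92, close to two.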
For this reason some embedded hardware does without an MMU altogether. Such hardware can run VxWorks (a very efficient embedded real-time operating system), Linux (which has a build option to disable the MMU), and other systems.
But the benefits of using an MMU are also large, the most important being safety. Each process has its own independent virtual address space, so processes do not interfere with one another. Without address mapping, all programs run in the same address space, so on a machine without an MMU, one process writing out of bounds can make other processes fail inexplicably, or even crash the kernel.
For address mapping, the kernel only supplies the page tables; the actual translation is done by hardware. So how does the kernel generate these page tables? There are two sides to this: virtual address space management and physical memory management. (In fact, only user-mode address mappings need to be managed; kernel-mode address mappings are hard-coded.)

[Virtual Address Management] (figure: lower left)
Each process has a task structure, which points to an mm structure, the memory manager of the process. (Each thread also has its own task structure, but the threads of one process all point to the same mm, so they share the address space.)
mm->pgd points to the memory holding the page table. Each process has its own mm, and each mm has its own page table. When a process is scheduled, the page table is switched (typically a CPU register holds the page table address, such as CR3 on x86, and switching page tables means changing the value of that register). Process address spaces therefore do not affect one another: since the page tables differ, a process simply cannot reach another's address space. Shared memory is the exception, where different page tables are deliberately made to reach the same physical addresses.
A user program's memory operations (allocation, freeing, mapping, and so on) are operations on the mm, specifically on the VMAs (virtual memory areas) in the mm. These VMAs represent the regions of the process's address space: heap, stack, code, data, the various mapped regions, and so on.
A user program's memory operations do not directly affect the page table, nor the allocation of physical memory. For example, a successful malloc merely changes some VMA; the page table does not change, and neither does the allocation of physical memory.
Suppose the user has allocated memory and then accesses it. Since the page table has no mapping for it, the CPU raises a page fault. The kernel catches the exception and checks whether the faulting address lies within a valid VMA. If not, it delivers a segmentation fault to the process and lets it crash; if so, it allocates a physical page and establishes the mapping.
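The sequence above (allocation touches only a VMA, the first access faults, the fault handler checks the VMA and then maps a page) can be sketched as a small simulation. The class, its methods, and the trivial frame counter are hypothetical illustrations, not kernel interfaces.

```python
class SegmentationFault(Exception):
    """Raised when a faulting address is not inside any VMA."""


class AddressSpace:
    PAGE_SIZE = 4096

    def __init__(self):
        self.vmas = []         # list of (start, end) half-open ranges
        self.page_table = {}   # virtual page number -> physical frame number
        self.next_frame = 0    # stand-in for the physical page allocator

    def allocate(self, start, length):
        """Like a successful malloc/mmap: only a VMA is recorded."""
        self.vmas.append((start, start + length))

    def access(self, vaddr):
        """Translate vaddr, handling a page fault if no mapping exists."""
        vpn = vaddr // self.PAGE_SIZE
        if vpn not in self.page_table:
            self.handle_fault(vaddr)
        return self.page_table[vpn] * self.PAGE_SIZE + vaddr % self.PAGE_SIZE

    def handle_fault(self, vaddr):
        """Check the VMAs; map a fresh physical page or report an error."""
        if not any(start <= vaddr < end for start, end in self.vmas):
            raise SegmentationFault(hex(vaddr))
        self.page_table[vaddr // self.PAGE_SIZE] = self.next_frame
        self.next_frame += 1


space = AddressSpace()
space.allocate(0x10000, 8192)      # two pages of virtual space, no mapping yet
mapped_before = len(space.page_table)
paddr = space.access(0x10010)      # first touch faults and maps one page
mapped_after = len(space.page_table)
```

Note that only the touched page gets a physical frame; the second page of the allocation stays unmapped until it is accessed.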

[Physical Memory Management] (figure: top right)
So how is the physical memory allocated?
First, Linux supports NUMA (non-uniform memory access), and the top level of physical memory management is the management of memory media (nodes, in NUMA terms). A pg_data_t structure describes one medium. In the common case the only medium we manage is ordinary memory, and it is uniform, so we can simply assume there is a single pg_data_t object in the system.
Each medium contains several zones, typically three: DMA, NORMAL, and HIGH.
DMA: on some hardware the DMA bus is narrower than the system bus, so only part of the address space can be used for DMA. That part is managed in the DMA zone (it is a scarce resource);
HIGH: high memory. On a 32-bit system the address space is 4 GB; the kernel reserves the 3~4 GB range as kernel space and leaves 0~3 GB as user space (every user process has a virtual space this large) (figure: lower). As mentioned earlier, kernel address mappings are hard-coded: the page table entries for this 3~4 GB range are fixed, mapping it to physical addresses 0~1 GB. (Actually not the whole 1 GB is mapped, only 896 MB. The remaining kernel address space is left for mapping the physical memory that is not covered, and that part is obviously not hard-coded.) So physical addresses above 896 MB have no fixed page table entries, and the kernel cannot access them directly (a mapping must be set up first); they are called high memory. (Of course, a machine with less than 896 MB of memory has no high memory. A 64-bit machine has none either, because the address space is so large that the kernel's share is far more than 1 GB);
NORMAL: memory that belongs to neither DMA nor HIGH is called NORMAL.
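With the 3 GB/1 GB split described for the HIGH zone, the hard-coded mapping amounts to adding a constant offset. A minimal sketch, assuming kernel space starts at 0xC0000000 (3 GB) and physical address 0 maps to that address; the constant and function names are hypothetical.

```python
KERNEL_BASE = 0xC0000000                 # assumed start of kernel space (3 GB)
DIRECT_MAP_LIMIT = 896 * 1024 * 1024     # only the first 896 MB are hard-mapped


def is_high_memory(paddr):
    """True if the physical address lies outside the hard-coded mapping."""
    return paddr >= DIRECT_MAP_LIMIT


def phys_to_kernel_virt(paddr):
    """Return the fixed kernel virtual address, or None for high memory."""
    if is_high_memory(paddr):
        return None   # the kernel must set up a mapping before touching it
    return KERNEL_BASE + paddr


print(hex(phys_to_kernel_virt(0x100000)))
```

The highest hard-mapped byte lands at 3 GB + 896 MB, which leaves the top 128 MB of the kernel's 1 GB free for other mappings.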
The zone_list above the zones represents the allocation policy, that is, the order of zone preference when allocating memory. An allocation is often not restricted to a single zone. For example, when allocating a page for the kernel's own use, the first choice is the NORMAL zone; if that fails, the DMA zone is tried (HIGH will not do, because it has no mapping yet). That is one allocation policy.
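The fallback order can be pictured as a walk over a list of zone names. This is a sketch only: the dictionary of free page counters and the function name are hypothetical, and the user-space ordering assumes the preference for HIGH that the article gives when it discusses mapping pages into user space.

```python
KERNEL_ZONELIST = ["NORMAL", "DMA"]        # HIGH excluded: not mapped
USER_ZONELIST = ["HIGH", "NORMAL", "DMA"]  # assumed order for user pages


def alloc_page(free_pages, zonelist):
    """Take one page from the first zone in zonelist that has a free page.

    free_pages maps a zone name to its count of free pages.
    Returns the zone that served the request, or None if all are empty.
    """
    for zone in zonelist:
        if free_pages[zone] > 0:
            free_pages[zone] -= 1
            return zone
    return None


free_pages = {"DMA": 1, "NORMAL": 1, "HIGH": 0}
first = alloc_page(free_pages, KERNEL_ZONELIST)   # served from NORMAL
second = alloc_page(free_pages, KERNEL_ZONELIST)  # NORMAL empty, falls to DMA
third = alloc_page(free_pages, KERNEL_ZONELIST)   # nothing left
```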
Each memory medium maintains a mem_map, which holds a page structure for every physical page in the medium and is used to manage physical memory.
Each zone records its starting position in mem_map, and the free pages of the zone are linked together through free_area. Physical memory allocation happens here: taking a page off free_area means it is allocated. (Kernel memory allocation differs from that of user processes. User memory use is supervised by the kernel, and misuse earns a segmentation fault; the kernel itself is unsupervised and relies on self-discipline: do not touch a page you did not take off free_area yourself.)

[Establishing Address Mappings]
When the kernel needs physical memory, in many cases it allocates whole pages, which simply means taking a page from mem_map as described above. For example, when the kernel catches the page fault mentioned earlier, it needs to allocate a page to establish the mapping.
This raises a question: while allocating pages and establishing address mappings, does the kernel use virtual addresses or physical addresses? First, the addresses that kernel code accesses are virtual, because CPU instructions take virtual addresses (address mapping is transparent to CPU instructions). But when establishing a mapping, what the kernel writes into the page table is a physical address, because the goal of address mapping is to obtain a physical address.
So how does the kernel obtain this physical address? As mentioned above, the page structures in mem_map are built according to physical memory, and each page structure corresponds to one physical page.
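Because mem_map has one page structure per physical page, the position of a page structure in mem_map determines the physical address. A sketch, assuming 4 KB pages and a mem_map whose first entry describes the physical page numbered start_pfn; the function names and the start_pfn parameter are illustrative.

```python
PAGE_SHIFT = 12   # assumed 4 KB pages


def page_to_phys(page_index, start_pfn=0):
    """Physical address of the page described by mem_map[page_index]."""
    return (start_pfn + page_index) << PAGE_SHIFT


def phys_to_page(paddr, start_pfn=0):
    """Index in mem_map of the page structure covering paddr."""
    return (paddr >> PAGE_SHIFT) - start_pfn


print(hex(page_to_phys(3)))
```

The value written into a page table entry is derived this way from the page structure that the allocator handed out.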
So we can say that the mapping of virtual addresses is accomplished through these page structures, which supply the final physical address. But the page structures themselves are obviously accessed through virtual addresses (as said before, CPU instructions take virtual addresses). So the page structures implement everyone else's virtual address mappings, but who implements the mapping for the page structures' own virtual addresses? Nobody can.
This brings us back to the point made earlier: the page table entries of kernel space are hard-coded. When the kernel initializes, the address mappings of its own address space are already fixed. The page structures clearly live in kernel space, so their address mapping problem is solved by being hard-coded.
Because kernel space page table entries are hard-coded, another issue arises: memory in the NORMAL (or DMA) zone may be mapped into both kernel space and user space. Being mapped into kernel space is obvious, since that mapping is hard-coded. These pages may also be mapped into user space, for instance in the page fault scenario described earlier. Pages to be mapped into user space should preferably come from the HIGH zone, because that memory is inconvenient for the kernel to access and is best given to user space. But the HIGH zone may be exhausted, or the system may have no HIGH zone at all because the machine has too little physical memory, so mapping NORMAL zone pages into user space is unavoidable.
However, having NORMAL zone memory mapped into both kernel space and user space is not a problem. If a page is in use by the kernel, it has already been taken off free_area, so the page fault handler will never map it into user space. Conversely, a page mapped into user space has also been taken off free_area, so the kernel will not use it either.

[Kernel Space Management] (figure: lower right)
Besides whole pages, the kernel sometimes needs to allocate space of arbitrary size, just as user programs do with malloc. This is implemented by the slab system.
Slab amounts to creating object pools for structures commonly used in the kernel, such as a pool for task structures, a pool for mm structures, and so on.
Slab also maintains general-purpose object pools, such as a pool of 32-byte objects, a pool of 64-byte objects, and so on. The kmalloc function commonly used in the kernel (similar to user-mode malloc) allocates from these general-purpose pools.
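Choosing a general-purpose pool for a request can be sketched as rounding the size up to the nearest size class. The list of classes below is an assumption for illustration (the text names only the 32-byte and 64-byte pools), and pick_pool is a hypothetical name.

```python
# Assumed power-of-two size classes; only 32 and 64 come from the text.
SIZE_CLASSES = [32, 64, 128, 256, 512, 1024, 2048, 4096]


def pick_pool(size):
    """Return the smallest size class that can hold `size` bytes, or None."""
    for size_class in SIZE_CLASSES:
        if size <= size_class:
            return size_class
    return None   # too large for the general-purpose pools in this sketch


print(pick_pool(33))
```

A 33-byte request is served from the 64-byte pool, so part of each object may go unused; that is the price of keeping a small, fixed set of pools.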
Besides the memory actually used by the objects, a slab has a corresponding control structure. It can be organized in two ways: if the objects are large, the control structure is stored in a dedicated page; if the objects are small, the control structure shares a page with the objects.
Besides slab, Linux 2.6 also introduced mempool (memory pool). The idea is that for some objects we do not want allocation to fail for lack of memory, so we allocate a few in advance and keep them in the mempool. Under normal conditions allocations do not touch the reserve in the mempool and go through slab as usual. Only when system memory runs short and slab can no longer allocate are the contents of the mempool used.
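The mempool idea can be sketched as a wrapper that tries the normal allocator first and falls back to a reserve filled at creation time. The class and its methods are hypothetical and only mirror the behaviour described above; alloc_fn stands in for the slab allocator and is assumed to return None on failure.

```python
class Mempool:
    def __init__(self, min_reserved, alloc_fn):
        self.min_reserved = min_reserved
        self.alloc_fn = alloc_fn
        # Pre-allocate the reserve while memory is still available.
        self.reserved = [alloc_fn() for _ in range(min_reserved)]

    def alloc(self):
        """Use the normal allocator; touch the reserve only if it fails."""
        obj = self.alloc_fn()
        if obj is not None:
            return obj
        if self.reserved:
            return self.reserved.pop()
        return None

    def free(self, obj):
        """Refill the reserve first; otherwise simply drop the object."""
        if len(self.reserved) < self.min_reserved:
            self.reserved.append(obj)


# A toy "slab" that can hand out only three objects in total.
remaining = [3]


def toy_slab_alloc():
    if remaining[0] == 0:
        return None
    remaining[0] -= 1
    return object()


pool = Mempool(2, toy_slab_alloc)   # reserve takes two, one is left
from_slab = pool.alloc()            # normal path
from_reserve = pool.alloc()         # slab exhausted, reserve is used
```

Freed objects go back into the reserve before anything else, so the guarantee is restored as soon as possible.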

[Page Swapping In and Out] (figure: upper left) (figure: upper right)
Page swapping is also a very complex system. Swapping a memory page out to disk and mapping a disk file into memory are very similar processes (the point of swapping a page out to disk is, after all, to load it back from disk later). So swap reuses some mechanisms of the file subsystem.
Page swapping costs a great deal of CPU and I/O, but historically memory was expensive, so disk had to be used to extend memory. Now that memory is getting cheaper and cheaper, we can easily install several gigabytes of memory and turn the swap system off. So the implementation of swap hardly arouses a desire to explore it, and it is not covered here. (See also: "Analysis on the recovery of Linux kernel pages")

[User Space Memory Management]
malloc is a libc library function; user programs typically allocate memory through it (or similar functions).
libc has two ways of allocating memory: adjusting the size of the heap, or mapping a new virtual memory area with mmap (the heap is also a VMA).
In the kernel, the heap is a VMA with one end fixed and the other end extensible (figure: left). The extensible end is adjusted through the brk system call. libc manages the heap space: when the user calls malloc, libc tries to allocate from the existing heap, and if the heap space is insufficient, it enlarges the heap through brk.
When the user frees allocated space, libc may shrink the heap through brk. But the heap is easy to grow and hard to shrink. Consider a case where the user allocates 10 blocks of memory in succession and then frees the first 9. The 10th block, even if it is only 1 byte, prevents libc from shrinking the heap, because only one end of the heap can move and the middle cannot be hollowed out. The 10th block sits firmly at the extensible end, so the heap cannot shrink and the resources cannot be returned to the kernel.
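The ten-block scenario can be reproduced with a toy heap whose only way of returning memory is to move its top end down. This is a deliberately simplified sketch: the class is hypothetical, it never reuses freed blocks, and a real libc allocator is far more elaborate.

```python
class ToyHeap:
    def __init__(self):
        self.brk = 0       # current top of the heap (the extensible end)
        self.blocks = []   # [start, size, in_use] entries in address order

    def malloc(self, size):
        """Always extend the heap; reuse of freed blocks is not modelled."""
        block = [self.brk, size, True]
        self.blocks.append(block)
        self.brk += size
        return block

    def free(self, block):
        block[2] = False
        # Only free blocks at the very top can be handed back to the kernel.
        while self.blocks and not self.blocks[-1][2]:
            released = self.blocks.pop()
            self.brk = released[0]


heap = ToyHeap()
blocks = [heap.malloc(1024) for _ in range(9)] + [heap.malloc(1)]
for block in blocks[:9]:
    heap.free(block)
brk_with_last_block = heap.brk   # still 9 * 1024 + 1: nothing was returned
heap.free(blocks[9])
brk_after_all_freed = heap.brk   # now the whole heap can shrink
```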
When the user mallocs a large block of memory, libc maps a new VMA through the mmap system call. Since resizing and managing the heap is rather troublesome, creating a new VMA is more convenient (the free problem described above is one of the reasons).
So why not always mmap a new VMA on malloc? First, for allocating and freeing small blocks, the heap space managed by libc is sufficient and no system call is needed every time; moreover, a VMA is page-granular, so the smallest allocation is one page. Second, too many VMAs degrade system performance. Page faults, creating and destroying VMAs, resizing the heap, and so on all operate on VMAs and must first find the one (or ones) to operate on among all the VMAs of the current process. Too many VMAs inevitably hurt performance. (When a process has few VMAs, the kernel manages them with a linked list; when there are many, it switches to a red-black tree.)
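Finding the VMA that contains an address is the lookup that all of these operations share. The sketch below uses binary search over a sorted array as a stand-in for the kernel's list or tree; the class and method names are hypothetical.

```python
import bisect


class VmaIndex:
    def __init__(self, vmas):
        """vmas: non-overlapping (start, end) half-open address ranges."""
        self.vmas = sorted(vmas)
        self.starts = [start for start, _ in self.vmas]

    def find(self, addr):
        """Return the VMA containing addr, or None if there is none."""
        # Locate the last VMA whose start is <= addr, then check its end.
        i = bisect.bisect_right(self.starts, addr) - 1
        if i >= 0 and addr < self.vmas[i][1]:
            return self.vmas[i]
        return None


index = VmaIndex([(0x8000, 0x9000), (0x1000, 0x3000)])
print(index.find(0x2000))
```

A linear scan costs time proportional to the number of VMAs, while an ordered structure keeps the lookup logarithmic, which is why the choice of data structure matters once a process has many VMAs.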

[User's Stack]
Like the heap, the stack is a VMA (figure: left) with one end fixed and the other end extensible (note: it cannot shrink). This VMA is special: there is no brk-like system call for extending it; it extends automatically.
When the user accesses a virtual address beyond this VMA, the kernel automatically enlarges the VMA while handling the page fault. The kernel checks the current stack pointer register (ESP, for example): the accessed virtual address must not be more than n bytes beyond ESP (n is the maximum number of bytes a single CPU push instruction pushes). In other words, the kernel uses ESP as the reference for deciding whether the access is out of bounds.
But ESP can be freely read and written by user-mode programs. What if a user program adjusts ESP to make the stack huge? The kernel has a set of per-process limits, one of which is the stack size: the stack can only grow that large, and anything beyond it is an error.
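The two checks (distance from ESP, then the configured size limit) can be sketched as one decision function. The sketch assumes a stack that grows toward lower addresses with its fixed end at stack_top; the function name, the slack value, and the string results are hypothetical.

```python
def handle_stack_fault(fault_addr, esp, stack_top, slack, limit):
    """Decide whether a fault outside the stack VMA may grow the stack.

    slack plays the role of n in the text; limit is the configured
    maximum stack size in bytes.
    """
    if fault_addr + slack < esp:
        return "segfault"   # too far beyond the stack pointer
    if stack_top - fault_addr > limit:
        return "segfault"   # growing would exceed the stack size limit
    return "grow"


MB = 1 << 20
TOP = 0xC0000000
esp = TOP - 0x2000
just_below = handle_stack_fault(esp - 8, esp, TOP, slack=32, limit=8 * MB)
far_below = handle_stack_fault(esp - 0x10000, esp, TOP, slack=32, limit=8 * MB)
# A program that moves ESP 16 MB down passes the first check but not the second.
huge = handle_stack_fault(TOP - 16 * MB, TOP - 16 * MB, TOP, slack=32, limit=8 * MB)
```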
For a process, the stack can generally grow quite large (8 MB, for example). But what about threads?
First, what is going on with a thread's stack? As mentioned earlier, a thread shares its mm with its parent process. The stack is a VMA in the mm, but a thread cannot share this VMA with its parent (two running entities obviously cannot share one stack). So when a thread is created, the thread library creates a new VMA through mmap to serve as the thread's stack (typically 2 MB in size).
As can be seen, a thread's stack is in a sense not a real stack: it is a fixed region with quite limited capacity.

Reposted from: Analysis of memory management. http://hi.baidu.com/_kouu/item/4c73532902a05299b73263d0

This article is an English version of an article which is originally in the Chinese language on aliyun.com and is provided for information purposes only. This website makes no representation or warranty of any kind, either expressed or implied, as to the accuracy, completeness ownership or reliability of the article or any translations thereof. If you have any concerns or complaints relating to the article, please send an email, providing a detailed description of the concern or complaint, to info-contact@alibabacloud.com. A staff member will contact you within 5 working days. Once verified, infringing content will be removed immediately.
