Linux memory management


[Reposted from] http://hi.baidu.com/_kouu/item/4c73532902a05299b73263d0

 


[Address mapping] (Figure: middle left)
The Linux kernel uses page-based memory management. The memory address an application program supplies is a virtual address; it has to pass through several levels of page-table translation before it becomes a real physical address.
Think about what this means. To access a memory location given by a virtual address, the page-table entry used for translation at each level must first be read from memory (the page tables themselves are stored in memory). In other words, a single memory access actually touches memory N + 1 times (N = number of page-table levels) and needs N addition operations to combine the indices and offsets.
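As a rough illustration of such a walk, here is a minimal sketch in C, assuming a hypothetical three-level table with 512 entries per level and 4 KB pages (the field widths and table layout are invented for the example; real architectures and the real kernel differ):

    /* Simplified, hypothetical 3-level page-table walk; layout is
     * illustrative only, not a real architecture. */
    #include <stdint.h>

    #define ENTRIES_PER_TABLE 512      /* assume 9 index bits per level */
    #define PAGE_SHIFT        12       /* assume 4 KB pages             */

    typedef uint64_t pte_t;            /* each entry holds a base address */

    uint64_t translate(pte_t *level1, uint64_t vaddr)
    {
        unsigned i1 = (vaddr >> 30) & (ENTRIES_PER_TABLE - 1);  /* level-1 index */
        unsigned i2 = (vaddr >> 21) & (ENTRIES_PER_TABLE - 1);  /* level-2 index */
        unsigned i3 = (vaddr >> 12) & (ENTRIES_PER_TABLE - 1);  /* level-3 index */

        pte_t *level2 = (pte_t *)level1[i1];   /* memory access #1 */
        pte_t *level3 = (pte_t *)level2[i2];   /* memory access #2 */
        uint64_t page = level3[i3];            /* memory access #3 */

        /* the data access itself would then be access #4 */
        return page + (vaddr & ((1 << PAGE_SHIFT) - 1));
    }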
Address mapping therefore has to be supported by hardware. The MMU (Memory Management Unit) is that hardware, and a cache is needed to hold page-table entries; this cache is the TLB (Translation Lookaside Buffer).
Even so, address mapping still carries a sizable overhead. Assuming the cache is 10 times as fast as memory, the hit rate is 40%, and the page table has three levels, then on average each virtual-address access costs roughly two extra physical-memory accesses.
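To see where that figure comes from: under these assumptions each page-table lookup costs on average 0.4 × 0.1 + 0.6 × 1 = 0.64 of a memory access, so the three levels together cost about 3 × 0.64 ≈ 1.9, i.e. roughly two physical-memory accesses of overhead per virtual-address access.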
For this reason, some embedded hardware simply drops the MMU. Such hardware can still run VxWorks (a very efficient embedded real-time operating system), Linux (which has a compile-time option to disable MMU support), and other systems.
The benefits of an MMU are great, however, above all for protection: each process gets an independent virtual address space, and processes cannot interfere with one another. Once address mapping is abandoned, all programs run in the same address space, so on a machine without an MMU a process that writes out of bounds may cause inexplicable errors in other processes or even crash the kernel.
For address mapping, the kernel only provides the page tables; the actual translation is done by hardware. So how does the kernel build those page tables? This involves two aspects: managing the virtual address space and managing physical memory. (In fact only user-space address mapping needs to be managed; kernel-space mappings are hard-coded.)

[Virtual Address Management] (Figure: Bottom left)
Each process corresponds to a task structure, which points to an mm structure: the process's memory manager. (Each thread also has its own task structure, but all threads of a process point to the same mm, which is why they share an address space.)
mm->pgd points to the memory holding the page tables. Each process has its own mm, and each mm has its own page tables, so page tables are switched during process scheduling (generally there is a CPU register that holds the address of the current page table; on x86, for example, switching page tables means changing that register's value). This is why the address spaces of different processes do not affect one another: since the page tables differ, one process normally cannot reach another's address space. Shared memory is the exception, and it works precisely because different page tables can map to the same physical address.
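As a concrete (and much simplified) picture, on x86 the register in question is CR3. A context switch conceptually does something like the sketch below; the real kernel does this inside switch_mm() with extra care for TLB handling, and next_pgd_phys here is just an assumed name for the next process's top-level table:

    /* Sketch only: load a new top-level page table on x86.
     * 'next_pgd_phys' is assumed to be the physical address of the next
     * process's page directory; real kernels wrap this in switch_mm(). */
    static inline void load_page_table(unsigned long next_pgd_phys)
    {
        asm volatile("mov %0, %%cr3" : : "r"(next_pgd_phys) : "memory");
    }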
User programs operate on memory (allocation, release, mapping, and so on) through the mm; more precisely, through the VMAs (virtual memory areas) of the mm. These VMAs represent the various regions of the process address space: the heap, the stack, the code area, the data area, various mapping areas, and so on.
A user program's memory operations do not directly touch the page tables, let alone physical memory allocation. For example, when malloc succeeds, only a VMA has changed; the page tables have not changed, and no physical memory has been allocated.
Suppose the user allocates memory and then accesses it. Since no corresponding mapping has been recorded in the page table, the CPU raises a page fault. The kernel catches the fault and checks whether the faulting address lies inside a valid VMA. If not, it delivers a "segmentation fault" to the process, which crashes; if so, it allocates a physical page and establishes the mapping for it.
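In pseudo-C, the flow just described looks roughly like this. It is only a sketch: find_vma() and current are real kernel names, while send_sigsegv() and map_page_into_pagetable() are hypothetical stand-ins for what the real fault path (do_page_fault/handle_mm_fault) does:

    /* Schematic demand-paging path for a fault at address 'addr'. */
    void fault_sketch(struct mm_struct *mm, unsigned long addr)
    {
        struct vm_area_struct *vma = find_vma(mm, addr);

        if (!vma || addr < vma->vm_start) {
            send_sigsegv(current);      /* no valid VMA: "segmentation fault" */
            return;
        }

        /* valid VMA: allocate a physical page and record the new mapping */
        struct page *page = alloc_page(GFP_HIGHUSER);
        map_page_into_pagetable(mm, addr, page);   /* hypothetical helper */
    }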

[Physical memory management] (Figure: Top right)
How is the physical memory allocated?
First, Linux supports NUMA (Non-Uniform Memory Access), so the first layer of physical memory management is the management of memory media (nodes), described by the pg_data_t structure. In the common case there is only one kind of medium, ordinary uniform RAM, so we can simply assume the system has a single pg_data_t object.
Under each medium there are several zones. Typically there are three: DMA, NORMAL, and HIGH.
DMA: on some hardware the DMA bus is narrower than the system bus, so only part of the address space can be used for DMA; that range of addresses is managed in the DMA zone (largely a legacy of older hardware);
HIGH: high memory. On a 32-bit system the address space is 4 GB; the kernel reserves the 3-4 GB range as kernel space, and 0-3 GB is user space (each user process sees a virtual space this large) (figure: bottom). As mentioned above, the kernel's address mapping is hard-coded: the page tables for the 3-4 GB range are set up once and map to physical addresses 0-1 GB. (In fact not the whole 1 GB is mapped this way; only 896 MB is. The remaining kernel space can be mapped to physical addresses above 1 GB, and that part is obviously not hard-coded.) Physical addresses above 896 MB therefore have no hard-coded page-table entries; the kernel cannot access them directly (a mapping has to be established first), and they are called high memory. (Of course, if the machine has less than 896 MB of RAM there is no high memory; and a 64-bit machine has no high memory either, because the address space is huge and the part belonging to the kernel is far more than 1 GB.)
NORMAL: memory that belongs to neither the DMA zone nor high memory is called NORMAL.
The zone_list above the zones represents the allocation policy, i.e. the priority order of zones when memory is allocated. An allocation is usually not limited to a single zone: for example, when allocating a page for the kernel's own use, the first choice is the NORMAL zone, and if that fails the allocation falls back to the DMA zone (HIGH will not do, because no mapping has been established for it). That fallback order is an allocation policy.
Each memory medium maintains a mem_map, in which a page structure is created for every physical page of the medium in order to manage physical memory.
Each zone records its starting position within mem_map. The free pages of a zone are strung together through free_area; physical page allocation comes from here, and taking a page off free_area means it has been allocated. (Kernel memory allocation differs from user-process allocation in that the user's use of memory is supervised by the kernel, and misuse results in a "segmentation fault", whereas the kernel is unsupervised and can only rely on discipline: never use a page you did not take from free_area.)
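A minimal sketch of whole-page allocation from this free_area machinery, using the real buddy-allocator entry points alloc_pages() and __free_pages() (error handling kept to a minimum):

    /* Sketch: allocating and freeing whole pages inside the kernel. */
    #include <linux/gfp.h>

    static struct page *grab_four_pages(void)
    {
        /* order 2 = 2^2 = 4 contiguous pages, taken from a zone's free_area */
        struct page *pages = alloc_pages(GFP_KERNEL, 2);
        return pages;   /* NULL if no suitable zone could satisfy the request */
    }

    static void drop_four_pages(struct page *pages)
    {
        if (pages)
            __free_pages(pages, 2);   /* back onto free_area */
    }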

[Establishing address mappings]
When the kernel needs physical memory, in many cases a whole page is allocated, and for that it is enough to pick a page out of the mem_map described above. For example, when the kernel catches a page fault it has to allocate a page in order to establish the mapping.
At this point a question arises: during page allocation and mapping setup, does the kernel use virtual addresses or physical addresses? First, the addresses that kernel code accesses are virtual addresses, because CPU instructions take virtual addresses (address mapping is transparent to CPU instructions). However, when establishing a mapping, what the kernel writes into the page table is a physical address, because the whole point of address mapping is to obtain a physical address.
So how does the kernel get that physical address? As mentioned, the page structures in mem_map are created according to physical memory, and each page structure corresponds to one physical page.
So we can say that the mapping of virtual addresses is completed with the help of the page structures, which yield the final physical address. But the page structures themselves are of course managed through virtual addresses (as noted, CPU instructions take virtual addresses). The page structures thus implement address mapping for everyone else; but who implements the address mapping for the page structures themselves? Nobody could: that would be circular.
This brings us back to a point mentioned earlier: the kernel-space page-table entries are hard-coded. When the kernel initializes, the mappings of its address space are set up once and for all. The page structures obviously live in kernel space, so their address-mapping problem is already solved by this hard-coding.
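Because that direct mapping is fixed at a constant offset, converting between a direct-mapped kernel virtual address, its physical address, and its page structure is plain arithmetic. A minimal sketch using the real helpers __pa, __va, virt_to_page and page_to_pfn (valid only for direct-mapped addresses, not for high memory):

    /* Sketch: address conversions within the kernel's direct mapping. */
    #include <linux/mm.h>

    static void mapping_example(void *kvaddr)   /* must be direct-mapped */
    {
        unsigned long phys = __pa(kvaddr);          /* virtual -> physical       */
        struct page *pg    = virt_to_page(kvaddr);  /* virtual -> page structure */
        unsigned long pfn  = page_to_pfn(pg);       /* page structure -> frame # */
        void *back         = __va(phys);            /* physical -> virtual again */
        (void)pfn; (void)back;
    }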
Because the kernel-space page-table entries are hard-coded, another question arises: memory in the NORMAL (or DMA) zone may end up mapped into both kernel space and user space. Being mapped into kernel space is obvious, since that mapping is hard-coded. But these pages may also be mapped into user space, which is possible in the page-fault scenario described above. Pages mapped into user space should preferably be taken from the HIGH zone, because that memory is inconvenient for the kernel to access, so handing it to user space is the better fit. But the HIGH zone may be exhausted, or the device may have so little physical memory that it has no HIGH zone at all, so mapping the NORMAL zone into user space is unavoidable.
Having NORMAL-zone memory mapped into both kernel space and user space poses no problem, however: if a page is in use by the kernel, its page structure has already been taken off free_area, so the page-fault handler will never map that page into user space. Conversely, a page that has been mapped into user space has naturally been removed from free_area as well, so the kernel will not use it either.

[Kernel space management] (Figure: bottom right)
Besides using memory in whole pages, the kernel sometimes needs to allocate a region of arbitrary size, just as a user program does with malloc. This is what the slab system provides.
Slab effectively sets up an object pool for each struct that is commonly used in the kernel, such as a pool for task structures, a pool for mm structures, and so on.
Slab also maintains general-purpose object pools, such as a "32-byte object" pool, a "64-byte object" pool, and so on. The kernel's commonly used kmalloc function (the analogue of user-space malloc) allocates from these general-purpose pools.
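A minimal sketch of both kinds of pool, using the real slab interfaces kmem_cache_create/kmem_cache_alloc and kmalloc; struct foo and the cache name are made up for the example:

    /* Sketch: a dedicated object pool vs. the general-purpose kmalloc pools. */
    #include <linux/slab.h>
    #include <linux/errno.h>

    struct foo { int a; char name[16]; };

    static struct kmem_cache *foo_cache;

    static int pools_example(void)
    {
        /* dedicated pool for struct foo objects (like the task/mm pools) */
        foo_cache = kmem_cache_create("foo_cache", sizeof(struct foo), 0, 0, NULL);
        if (!foo_cache)
            return -ENOMEM;

        struct foo *f = kmem_cache_alloc(foo_cache, GFP_KERNEL);

        /* general-purpose pool: this request lands in e.g. the "64-byte" slab */
        char *buf = kmalloc(40, GFP_KERNEL);

        kfree(buf);
        kmem_cache_free(foo_cache, f);
        kmem_cache_destroy(foo_cache);
        return 0;
    }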
Besides the memory actually occupied by the objects, a slab also has its own control structure. It is organized in one of two ways: if the objects are large, the control structure is kept on dedicated pages; if the objects are small, the control structure shares pages with the objects themselves.
Besides slab, Linux 2.6 also introduced the mempool (memory pool). The intent is this: we do not want certain objects to fail to be allocated just because memory is short, so several of them are allocated in advance and kept in the mempool. Under normal conditions the mempool's reserve is not touched when an object is allocated; allocation goes through slab as usual. The mempool's contents are used only when system memory is so tight that allocating through slab fails.
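A minimal sketch of such a reserve built on top of a slab cache, using the real interfaces mempool_create_slab_pool, mempool_alloc and mempool_free; the object size and reserve count are made up for the example:

    /* Sketch: a mempool keeping 16 pre-allocated objects above a slab cache. */
    #include <linux/mempool.h>
    #include <linux/slab.h>
    #include <linux/errno.h>

    static struct kmem_cache *obj_cache;
    static mempool_t *obj_pool;

    static int mempool_example(void)
    {
        obj_cache = kmem_cache_create("obj_cache", 128, 0, 0, NULL);
        obj_pool  = mempool_create_slab_pool(16, obj_cache);  /* reserve 16 objects */
        if (!obj_pool)
            return -ENOMEM;

        /* Normally served by the slab; the reserved objects are only touched
         * when slab allocation fails under memory pressure. */
        void *obj = mempool_alloc(obj_pool, GFP_KERNEL);
        mempool_free(obj, obj_pool);

        mempool_destroy(obj_pool);
        kmem_cache_destroy(obj_cache);
        return 0;
    }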

[Page swapping] (Figure: top left) (Figure: top right)
Page swap-in and swap-out is another fairly complex subsystem. Swapping a memory page out to disk and mapping a disk file into memory are very similar processes (the motivation for swapping a page out to disk is, after all, that it will later be brought back from disk into memory), so swap reuses some mechanisms of the file subsystem.
Swapping pages in and out is very costly in CPU and IO terms, but because memory used to be so scarce, the disk had to be used to extend it. These days memory is getting cheaper and cheaper; we can easily install several gigabytes of it and switch the swap system off. So the implementation of swap is hardly worth digging into, and we will not go into detail here. (See also: "Analysis of Linux kernel page reclaim".)

[User Space Memory Management]
malloc is a libc library function; user programs generally use it (or similar functions) to allocate memory.
libc has two ways of obtaining memory: one is to adjust the size of the heap, the other is to mmap a new virtual memory area (the heap is itself a VMA).
In the kernel, the heap is a VMA that is fixed at one end and extensible at the other (figure: middle left). The extensible end is adjusted with the brk system call. libc manages the heap space itself: when the user calls malloc, libc tries to satisfy the request from the existing heap, and only if the heap is too small does it enlarge it with brk.
When the user frees allocated space, libc may shrink the heap with brk. But shrinking is often impossible. Suppose the user has allocated 10 consecutive blocks of memory and the first 9 have been freed: even if the remaining 10th block is only one byte, libc cannot shrink the heap, because the heap can only grow or shrink at its one extensible end and cannot be hollowed out in the middle. The 10th block sits at the extensible end of the heap, so the heap cannot shrink and the freed space cannot be returned to the kernel.
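This behaviour can be observed from user space by watching the program break with sbrk(0). A minimal sketch (the exact outcome depends on the libc; with glibc, small blocks come from the heap and the break usually stays put here):

    /* Sketch: the heap end typically does not move down even after
     * freeing most blocks, because the last block pins the top end. */
    #include <stdio.h>
    #include <stdlib.h>
    #include <unistd.h>

    int main(void)
    {
        void *before = sbrk(0);            /* current end of the heap */

        char *blocks[10];
        for (int i = 0; i < 10; i++)
            blocks[i] = malloc(1000);      /* small blocks: served from the heap */

        for (int i = 0; i < 9; i++)        /* free the first nine ...            */
            free(blocks[i]);

        void *after = sbrk(0);             /* ... yet the break usually stays up */
        printf("break before: %p, after: %p\n", before, after);
        free(blocks[9]);
        return 0;
    }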
When the user mallocs a large block of memory, libc instead maps a new VMA with the mmap system call, because adjusting the heap and managing space inside it is troublesome, and it is more convenient to build a fresh VMA (the free problem just mentioned is also one of the reasons).
Then why not always mmap a new VMA on every malloc? First, for allocating and freeing small regions, the heap space managed by libc is already sufficient and avoids a system call on every allocation; also, a VMA has page granularity, so the smallest possible allocation would be one page. Second, too many VMAs degrade performance: page-fault handling, VMA creation and destruction, heap resizing and so on all need to operate on VMAs, and the kernel must find the VMA (or VMAs) concerned among all the VMAs of the current process. A large number of VMAs inevitably drags performance down. (When a process has few VMAs the kernel manages them with a linked list; when there are many, a red-black tree is used instead.)
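A minimal sketch of what libc does internally for such large requests, i.e. mapping an anonymous region directly (with glibc the switch-over point is the M_MMAP_THRESHOLD tunable, by default around 128 KB):

    /* Sketch: a large allocation as an anonymous mmap'd VMA. */
    #include <stdio.h>
    #include <sys/mman.h>

    int main(void)
    {
        size_t len = 1 << 20;   /* 1 MB: large enough that malloc would mmap it */
        void *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
                       MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
        if (p == MAP_FAILED) {
            perror("mmap");
            return 1;
        }
        /* ... use the region ... */
        munmap(p, len);         /* the whole VMA goes back to the kernel at once */
        return 0;
    }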

[User Stack]
Like the heap, the stack is also a VMA (figure: middle left), fixed at one end and extensible (note: not shrinkable) at the other. This VMA is special: there is no system call like brk to stretch it; it stretches automatically.
When a virtual address accessed by the user falls just beyond this VMA, the kernel automatically enlarges the VMA while handling the page fault. The kernel checks the current stack-pointer register (e.g. ESP): the accessed virtual address may not be more than N bytes beyond ESP (N being the maximum number of bytes a single CPU stack-push instruction can push at once). In other words, the kernel uses ESP as the baseline for deciding whether the access is out of bounds.
But the value of ESP can be read and written freely by user-mode programs. What if a user program adjusts ESP so as to make the stack look enormous? The kernel keeps a set of per-process resource limits, among them a limit on stack size: the stack may only grow that large, and growing beyond it produces an error.
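That limit is the one reported by ulimit -s; from a program it can be read with getrlimit(RLIMIT_STACK), as in this small sketch:

    /* Sketch: reading the per-process stack-size limit. */
    #include <stdio.h>
    #include <sys/resource.h>

    int main(void)
    {
        struct rlimit rl;
        if (getrlimit(RLIMIT_STACK, &rl) == 0)
            printf("stack limit: soft=%llu bytes, hard=%llu bytes\n",
                   (unsigned long long)rl.rlim_cur,
                   (unsigned long long)rl.rlim_max);
        return 0;
    }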
For a process, the stack can generally be stretched quite large (8 MB, for example). But what about threads?
First, what does a thread's stack look like? As mentioned earlier, a thread's mm is shared with its parent process. The stack is a VMA inside that mm, but a thread cannot share this VMA with its parent (two running entities obviously cannot share one stack). So when a thread is created, the thread library creates a new VMA with mmap to serve as the thread's stack (typically around 2 MB).
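The size of that mmap'd stack is what pthread_attr_setstacksize controls; the thread library then obtains the region with mmap internally. A minimal user-space sketch (the 2 MB figure is just an example value):

    /* Sketch: giving a new thread its own fixed-size, mmap'd stack. */
    #include <pthread.h>

    static void *worker(void *arg)
    {
        (void)arg;
        return NULL;
    }

    int main(void)
    {
        pthread_attr_t attr;
        pthread_attr_init(&attr);
        pthread_attr_setstacksize(&attr, 2 * 1024 * 1024);  /* a 2 MB thread stack */

        pthread_t tid;
        pthread_create(&tid, &attr, worker, NULL);
        pthread_join(tid, NULL);
        pthread_attr_destroy(&attr);
        return 0;
    }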
In this sense a thread's stack is not a real stack: it is a fixed region with limited capacity.
