On the Linux Memory Management Mechanism

Source: Internet
Author: User

Newcomers to Linux often ask why so much memory seems to be consumed. On Linux the reported free memory is usually very small, as if almost all memory had been taken by the system and were about to run out; in fact it is not. This is an excellent feature of Linux memory management, and one of the points where it differs from Windows memory management.

No matter how large the physical memory is, Linux makes full use of it: hard-disk data that programs have already read is kept in memory, exploiting the high speed of memory reads and writes to improve data access performance. Windows, by contrast, allocates memory to an application only when the application asks for it, and does not take full advantage of a large memory capacity. In other words, every additional piece of physical memory is fully exploited by Linux, whereas Windows treats much of it as mere decoration, even if you add 8GB or more. Linux achieves this mainly by carving part of the free physical memory into cache and buffers to improve data access performance.

The page cache is the main disk cache implemented by the Linux kernel. Its purpose is to reduce disk I/O: data on disk is cached in physical memory, so that accesses to the disk become accesses to memory. The value of a disk cache is twofold: first, accessing memory is much faster than accessing the disk, so data served from the cache is returned faster than data read from disk; second, once a piece of data has been accessed, it is very likely to be accessed again in the near future.

[Physical memory and virtual memory]

Reading and writing data in physical memory is far faster than reading and writing it on the hard disk, so ideally all reads and writes would happen in memory; but memory is limited, and this leads to the distinction between physical memory and virtual memory. Physical memory is the RAM provided by the system hardware, the real memory. In addition, Linux has the concept of virtual memory: a strategy for coping with a shortage of physical memory, in which a piece of logical memory is carved out of disk space. The disk space used for this purpose is called swap space. As an extension of physical memory, this swap-backed virtual memory is used when physical memory runs low: the kernel writes memory blocks that are temporarily unused out to the swap space, so the physical memory they occupied is freed and can be used for other purposes; when the original contents are needed again, they are read back from swap into physical memory.
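The division between physical memory and swap space can also be queried from a program. The following minimal sketch, written purely for illustration, uses the sysinfo() system call to print the total and free amounts of RAM and swap:

    #include <stdio.h>
    #include <sys/sysinfo.h>

    int main(void)
    {
        struct sysinfo si;

        if (sysinfo(&si) != 0) {
            perror("sysinfo");
            return 1;
        }
        /* mem_unit is the size in bytes of the unit the fields are counted in */
        unsigned long long unit = si.mem_unit ? si.mem_unit : 1;

        printf("physical RAM : total %llu MB, free %llu MB\n",
               (unsigned long long)si.totalram * unit >> 20,
               (unsigned long long)si.freeram  * unit >> 20);
        printf("swap space   : total %llu MB, free %llu MB\n",
               (unsigned long long)si.totalswap * unit >> 20,
               (unsigned long long)si.freeswap  * unit >> 20);
        return 0;
    }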
Linux memory management uses a paging mechanism. To keep physical memory fully utilized, the kernel automatically swaps infrequently used blocks of data out to virtual memory and keeps frequently used information in physical memory. To understand the Linux memory mechanism in more detail, the following points are worth knowing.

The Linux system performs paging operations from time to time in order to keep as much physical memory free as possible. Even when nothing currently needs memory, Linux will swap out memory pages that are temporarily unused; this avoids having to wait for the swap when the memory is actually needed.

Page swapping in Linux is conditional: not every page is swapped to virtual memory as soon as it is unused. The kernel follows a "least recently used" principle and swaps only infrequently used pages out to swap space. Sometimes you will therefore see the following phenomenon: there is still plenty of free physical memory, yet a lot of swap space is in use. This is not surprising. For example, a process with a very large memory footprint may, while it runs, force some infrequently used pages out to swap; when that memory-hungry process later exits and releases a large amount of memory, the pages that were swapped out are not automatically swapped back into physical memory unless they are actually needed. At that moment the system has plenty of idle physical memory while swap space is still in use, which is exactly the phenomenon just described. There is nothing to worry about here, as long as you know what is going on.

One warning: pages in swap space are first swapped back into physical memory when they are used; if there is not enough physical memory to hold them, they are immediately swapped out again. If this goes on, there may eventually not be enough swap space to hold these pages, which leads to symptoms such as apparent system hangs and service failures. Linux may recover by itself after a while, but the recovered system is largely unusable. It is therefore very important to plan and design memory usage on Linux sensibly.

[Memory monitoring]

As a Linux system administrator it is important to monitor memory usage; monitoring helps you understand whether memory consumption is normal, whether memory is becoming scarce, and so on. The most commonly used commands for monitoring memory are free and top. Below is the free output of one system:

    # free
                 total       used       free     shared    buffers     cached
    Mem:       3894036    3473544     420492          0      72972    1332348
    -/+ buffers/cache:    2068224    1825812
    Swap:      4095992     906036    3189956

The meaning of each field:
first line -- total: total size of physical memory; used: physical memory in use; free: idle physical memory; shared: memory shared by multiple processes; buffers/cached: size of the disk caches;
second line (Mem): physical memory usage;
third line (-/+ buffers/cache): memory usage with the disk caches taken into account;
fourth line (Swap): swap space usage.

The memory state reported by free can be viewed from two angles: from the kernel's point of view and from the application layer's point of view.
From the kernel's point of view, the memory state is the memory that the kernel can hand out directly, without any extra work: the free value on the Mem line of the free output above. This system has 3894036K of physical memory, of which only 420492K is free, a little over 400MB. The arithmetic is simply 3894036 - 3473544 = 420492: total physical memory minus used physical memory gives free physical memory. Note that this free value does not include the memory held in the buffers and cached states. If you conclude from this that the system has too little free memory, you are mistaken: the kernel is in full control of memory use, and when Linux needs memory, or as the system keeps running, it will turn memory in the buffers and cached states back into free memory for the system to use.

From the application layer's point of view, the memory state is the amount of memory that applications running on Linux can actually use: the -/+ buffers/cache line of the free output. On this system the used memory is 2068224K and the idle memory is 1825812K. Again a quick calculation: 420492 + 72972 + 1332348 = 1825812. In other words, the physical memory available to applications is the free value of the Mem line plus the buffers and cached values. For applications, the memory occupied by buffers/cached counts as available, because buffers/cached exists to improve the performance of file reads; when an application needs memory, buffers/cached is quickly reclaimed and handed to the application.

[Similarities and differences between buffers and cached]

In a Linux operating system, when an application needs to read data from a file, the operating system first allocates some memory, reads the data from disk into that memory, and then hands the data to the application; when data is written to a file, the operating system allocates memory to receive the user data first, and then writes it from memory to disk. However, if large amounts of data must be read from disk into memory or written from memory to disk, the read/write performance of the system becomes very poor, because both reading from and writing to disk are time-consuming and resource-intensive operations. For this situation Linux introduces the buffers and cached mechanisms.

Both buffers and cached live in memory and are used to keep files that the system has opened and their attribute information, so that when the operating system needs to read a file it first looks in the buffers and cached memory areas; if the data is found there it is returned directly to the application, and only if it is not found is it read from disk. This is the operating system's caching mechanism, and it greatly improves performance. The contents of buffers and cached differ, however: buffers is used to buffer block devices, recording file system metadata and tracking in-flight pages, while cached is used to cache file contents. Put more plainly, buffers mainly stores things such as directory contents, file attributes and permissions, whereas cached directly holds the files and programs we have opened.
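The two views can be reproduced with a short program. Below is a minimal sketch (not how free itself is implemented, although free reads the same source) that parses /proc/meminfo and prints the kernel-view free memory alongside the application-view figure, MemFree + Buffers + Cached:

    #include <stdio.h>
    #include <string.h>

    /* Return the value of one /proc/meminfo field in kB, or -1 if not found. */
    static long read_field(const char *name)
    {
        FILE *f = fopen("/proc/meminfo", "r");
        char line[256];
        long value = -1;
        size_t len = strlen(name);

        if (!f)
            return -1;
        while (fgets(line, sizeof(line), f)) {
            if (strncmp(line, name, len) == 0 && line[len] == ':') {
                sscanf(line + len + 1, "%ld", &value);
                break;
            }
        }
        fclose(f);
        return value;
    }

    int main(void)
    {
        long total   = read_field("MemTotal");
        long memfree = read_field("MemFree");
        long buffers = read_field("Buffers");
        long cached  = read_field("Cached");

        printf("total           : %ld kB\n", total);
        printf("kernel view free: %ld kB\n", memfree);
        printf("app view free   : %ld kB (MemFree + Buffers + Cached)\n",
               memfree + buffers + cached);
        return 0;
    }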
To verify that this conclusion is correct, open a very large file with vi and watch how the cached value changes; then open the same file with vi again and compare how fast the two opens feel -- the second open should be noticeably faster than the first. Then execute the following command:

    find /* -name *.conf

and see whether the value of buffers changes; repeat the find command and compare how quickly the results are displayed the second time. The way the Linux operating system handles memory is largely designed around the needs of servers: the buffering mechanism caches used files and data in cached, and Linux always tries to cache as much data and information as possible, so that the next time the data is needed it can be served directly from memory rather than through a lengthy disk operation. This design improves the overall performance of the system.
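For a rough measurement of the same effect from a program, the sketch below (illustrative only; the default path is just an example) reads a file twice and prints the elapsed time of each pass. The second, cache-warm pass is normally much faster:

    #include <fcntl.h>
    #include <stdio.h>
    #include <time.h>
    #include <unistd.h>

    /* Read the whole file and return the elapsed wall-clock time in seconds. */
    static double read_whole_file(const char *path)
    {
        char buf[1 << 16];
        struct timespec t0, t1;
        ssize_t n;
        int fd = open(path, O_RDONLY);

        if (fd < 0) {
            perror("open");
            return -1.0;
        }
        clock_gettime(CLOCK_MONOTONIC, &t0);
        while ((n = read(fd, buf, sizeof(buf))) > 0)
            ;                       /* discard the data, we only measure the I/O */
        clock_gettime(CLOCK_MONOTONIC, &t1);
        close(fd);
        return (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) / 1e9;
    }

    int main(int argc, char **argv)
    {
        /* example path only; pass any large file as the first argument */
        const char *path = argc > 1 ? argv[1] : "/var/log/syslog";

        printf("cold read: %.3f s\n", read_whole_file(path));
        printf("warm read: %.3f s\n", read_whole_file(path));
        return 0;
    }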

[Address mapping] (figure: left)

The Linux kernel uses page-based memory management. The address an application works with is a virtual address, and it must pass through several levels of page table translation before it becomes a real physical address. Thought about this way, address mapping is a rather frightening thing: when accessing a memory location identified by a virtual address, several extra memory accesses are needed to fetch the page table entry at each level (the page tables themselves live in memory) before the translation is complete. In other words, a single memory access actually costs N+1 memory accesses (where N is the number of page table levels), plus N additions.

Address mapping therefore has to be supported by hardware, and the MMU (memory management unit) is that hardware; a cache is also needed to hold page table entries, and that is the TLB (translation lookaside buffer). Even so, address mapping carries a significant overhead: assuming the cache is about ten times faster than memory with a reasonably high hit rate, and the page table has three levels, an average virtual address access still costs roughly the time of two physical memory accesses. For this reason, some embedded hardware does without an MMU altogether; such hardware can run VxWorks (a very efficient embedded real-time operating system), Linux (which has a compile-time option to disable the MMU), and other systems.

But the benefit of using an MMU is also large, above all for security: each process has its own independent virtual address space and cannot interfere with the others. Once address mapping is abandoned, all programs run in the same address space, so on a machine without an MMU a process that writes out of bounds can cause other processes to fail inexplicably, or even crash the kernel.

For address mapping, the kernel only provides the page tables; the actual translation is done by hardware. How, then, does the kernel build these page tables? This has two sides: management of the virtual address space, and management of physical memory. (In fact only user-space address mappings need to be managed; kernel-space address mappings are hard-wired.)

[Virtual address management] (figure: bottom left)

Each process corresponds to a task structure, which points to an mm structure: the memory manager of the process. (Each thread also has its own task structure, but they all point to the same mm, which is why threads share one address space.) mm->pgd points to the memory that holds the page table. Each process has its own mm, and each mm has its own page table, so when a process is scheduled the page table is switched as well (there is usually a CPU register that holds the page table address, such as CR3 on x86; switching the page table means changing the value of that register). Consequently the address spaces of different processes do not affect one another (the page tables are different, so one process simply cannot reach another's address space), with the exception of shared memory, which deliberately arranges for different page tables to reach the same physical address.

The operations a user program performs on memory (allocation, freeing, mapping, and so on) are operations on mm, more specifically on the VMAs (virtual memory areas) inside mm. These VMAs represent the various regions of the process's address space: heap, stack, code area, data area, the various mapping areas, and so on.
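The VMAs of a process can be observed directly through the proc filesystem: each line of /proc/<pid>/maps describes one VMA -- its address range, permissions, offset and backing object, with the heap and stack labelled [heap] and [stack]. A minimal sketch that dumps the current process's own VMAs:

    #include <stdio.h>

    int main(void)
    {
        /* Each line of /proc/self/maps is one VMA: address range, permissions,
         * offset, device, inode and the backing file; anonymous regions such
         * as the heap and stack are labelled [heap] and [stack]. */
        FILE *f = fopen("/proc/self/maps", "r");
        char line[512];

        if (!f) {
            perror("fopen");
            return 1;
        }
        while (fgets(line, sizeof(line), f))
            fputs(line, stdout);
        fclose(f);
        return 0;
    }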
The operations a user program performs on memory do not directly affect the page table, let alone the allocation of physical memory. When malloc succeeds, for example, only a certain VMA has changed; the page table has not changed, and the allocation of physical memory has not changed either. Suppose the user allocates memory and then accesses it. Because there is no corresponding mapping in the page table, the CPU raises a page fault exception. The kernel catches the exception and checks whether the faulting address lies inside a legitimate VMA. If it does not, the process gets a "segmentation fault" and crashes; if it does, the kernel allocates a physical page and establishes the mapping for it.

[Physical memory management] (figure: top right)

So how is physical memory allocated? First, Linux supports NUMA (non-uniform memory access), and the first level of physical memory management is the management of the memory media (nodes). The pg_data_t structure describes a medium. In the common case the managed medium is just ordinary, uniform RAM, so it is convenient to assume there is only one pg_data_t object in the system.

Beneath each medium there are several zones, typically three: DMA, NORMAL, and HIGH.
DMA: some hardware systems have a DMA bus narrower than the system bus, so only part of the address space can be used for DMA; that part is managed in the DMA zone (a special case for such hardware).
HIGH: high memory. On a 32-bit system the address space is 4G; the kernel stipulates that the 3~4G range is kernel space and 0~3G is user space (every user process has a virtual space this large) (figure: bottom). As mentioned earlier, the kernel's address mapping is hard-wired: the page table for this 3~4G range is fixed, and it maps onto the physical addresses 0~1G. (In fact not all of 1G is mapped; only 896M is. The remaining space is reserved for mapping physical addresses above 1G, and that part is obviously not hard-wired.) So physical addresses above 896M have no hard-wired page table entries, and the kernel cannot access them directly (it must map them first); they are called high memory. (Of course, if the machine has less than 896M of memory, there is no high memory; and on a 64-bit machine there is no high memory either, because the address space is enormous and the space belonging to the kernel is far more than 1G.)
NORMAL: memory that belongs to neither DMA nor HIGH is called NORMAL.

The zonelist above the zones represents the allocation policy, that is, the priority order of zones when allocating memory. A memory allocation is often not restricted to a single zone: for example, when allocating a page for the kernel's own use, the highest priority is to take it from NORMAL; failing that, take it from DMA (HIGH will not do, since it is not mapped) -- this is one allocation policy.

Each memory medium also maintains a mem_map, which creates a page structure for every physical page in the medium so that physical memory can be managed. Each zone records its starting position in mem_map, and the free pages within a zone are linked together through free_area. Physical memory allocation happens here: taking pages off free_area is what allocation means. (The kernel's memory allocation differs from a user process's: a user's use of memory is supervised by the kernel, and misuse results in a "segmentationation fault", whereas the kernel is unsupervised and can only rely on self-discipline; a page that you did not take off free_area yourself must not be touched.)
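The demand-paging behaviour described at the start of this section -- malloc only adjusts a VMA, and physical pages are attached one page fault at a time on first access -- can be observed by watching the process's minor page fault counter. A small illustrative sketch, with the 64MB figure chosen arbitrarily:

    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>
    #include <sys/resource.h>

    /* Current count of minor page faults for this process. */
    static long minor_faults(void)
    {
        struct rusage ru;
        getrusage(RUSAGE_SELF, &ru);
        return ru.ru_minflt;
    }

    int main(void)
    {
        const size_t size = 64 * 1024 * 1024;   /* 64 MB, arbitrary */
        long before, after_alloc, after_touch;

        before = minor_faults();
        char *p = malloc(size);        /* a VMA grows, no physical pages yet */
        if (!p)
            return 1;
        after_alloc = minor_faults();
        memset(p, 1, size);            /* first touch: faults map in the pages */
        after_touch = minor_faults();

        printf("minor faults: before=%ld, after malloc=%ld, after touch=%ld\n",
               before, after_alloc, after_touch);
        free(p);
        return 0;
    }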
[Establishing the address mapping]

When the kernel needs physical memory, in many cases whole pages are allocated, which simply means picking a page off mem_map; for example, when the kernel catches a page fault it needs to allocate a page to establish the mapping. Here a question arises: when the kernel allocates a page and establishes an address mapping, does it use a virtual address or a physical address? First of all, the addresses that kernel code itself accesses are virtual addresses, because CPU instructions take virtual addresses (the address mapping is transparent to CPU instructions). However, when establishing an address mapping, what the kernel fills into the page table is a physical address, because the whole point of address mapping is to obtain a physical address.

So how does the kernel get hold of this physical address? As mentioned above, the page structures in mem_map are created according to physical memory; each page corresponds to one physical page. We can therefore say that the mapping of virtual addresses is accomplished by these page structures: they supply the final physical address. But the page structures themselves are obviously managed through virtual addresses (as noted earlier, CPU instructions take virtual addresses). So the page structures implement everyone else's virtual address mappings -- who implements the mapping of the page structures' own virtual addresses? Nobody could. This brings us back to the point mentioned earlier: the page table entries for kernel space are hard-wired. When the kernel is initialized, the address mappings for the kernel's address space are already written in. The page structures obviously live in kernel space, so the problem of mapping them has been solved by this "hard-wiring".

Because the page table entries of kernel space are hard-wired, another issue appears: memory in the NORMAL (or DMA) zone may be mapped into kernel space and user space at the same time. Being mapped into kernel space is obvious, since that mapping is hard-wired; but such pages may also be mapped into user space, which can happen in the page fault scenario described earlier. Pages mapped into user space should be taken from the HIGH zone first, because that memory is inconvenient for the kernel to access and is well suited to user space; but the HIGH zone may be exhausted, or the machine may have no HIGH zone at all because it has little physical memory, so mapping the NORMAL zone into user space is unavoidable. However, there is nothing wrong with NORMAL-zone memory being mapped into both kernel space and user space: if a page is being used by the kernel, the corresponding page has already been taken off free_area, so the page fault handling code will no longer map that page into user space; conversely, a page that has been mapped into user space has naturally been removed from free_area, and the kernel will not use it any more.

[Kernel space management] (figure: bottom right)

Besides using whole pages, in some cases the kernel, just like a user program calling malloc, needs to allocate a space of arbitrary size. This is provided by the slab system. Slab essentially sets up object pools for the structure objects that the kernel uses frequently, such as a pool for task structures, a pool for mm structures, and so on. Slab also maintains general-purpose object pools, such as the "32-byte" object pool, the "64-byte" object pool, and so on.
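As an illustration of the two kinds of pools just described, here is a hedged sketch of a trivial kernel module (not part of the original article, and it requires the usual out-of-tree module build setup): it creates a dedicated cache for a made-up structure with kmem_cache_create(), and also takes a small buffer from the generic pools via kmalloc():

    #include <linux/module.h>
    #include <linux/slab.h>

    struct demo_obj {           /* a made-up structure, for illustration only */
        int id;
        char name[32];
    };

    static struct kmem_cache *demo_cache;

    static int __init slab_demo_init(void)
    {
        struct demo_obj *obj;
        void *buf;

        /* Dedicated pool for struct demo_obj, analogous to the pools the
         * kernel keeps for task structures, mm structures and so on. */
        demo_cache = kmem_cache_create("demo_obj_cache",
                                       sizeof(struct demo_obj), 0, 0, NULL);
        if (!demo_cache)
            return -ENOMEM;

        obj = kmem_cache_alloc(demo_cache, GFP_KERNEL);
        buf = kmalloc(64, GFP_KERNEL);   /* served from a generic 64-byte pool */

        pr_info("slab demo: obj=%p buf=%p\n", obj, buf);

        kfree(buf);                      /* kfree(NULL) is a no-op */
        if (obj)
            kmem_cache_free(demo_cache, obj);
        return 0;
    }

    static void __exit slab_demo_exit(void)
    {
        kmem_cache_destroy(demo_cache);
        pr_info("slab demo: unloaded\n");
    }

    module_init(slab_demo_init);
    module_exit(slab_demo_exit);
    MODULE_LICENSE("GPL");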
The kmalloc functions commonly used inside the kernel (much like user-space malloc) allocate from these general-purpose object pools. Besides the memory actually occupied by the objects, slab also has its own control structures, organized in one of two ways: if the objects are large, the control structures are kept on dedicated pages; if the objects are small, the control structures share pages with the object space.

In addition to slab, Linux 2.6 also introduced mempool (the memory pool). The intent is that there are objects we do not want to fail to allocate because of a memory shortage, so several of them are allocated in advance and kept in the mempool. Under normal circumstances allocating such objects does not touch the resources in the mempool; allocation goes through slab as usual. Only when system memory becomes scarce and allocation through slab is no longer possible are the contents of the mempool used.

[Page swapping in and out] (figure: upper left) (figure: top right)

Page swapping is another very complex system. Swapping memory pages out to disk and mapping disk files into memory are two very similar processes (the motivation for swapping a page out to disk is, after all, to load it back from disk into memory later), so swap reuses some mechanisms of the file subsystem. Swapping pages in and out is very much a matter of both CPU and I/O; it exists largely for the historical reason that memory used to be expensive, so the disk had to be used to extend it. Nowadays memory is cheap enough that we can easily install several gigabytes and then simply turn the swap system off, so the implementation of swap is hard to find much enthusiasm for exploring, and it will not be described further here. (See also: "Analysis of Linux kernel page reclaim".)

[User space memory management]

malloc is a library function of libc, and user programs generally allocate memory through it (or similar functions). libc has two ways of allocating memory: one is to adjust the size of the heap, the other is to mmap a new virtual memory area (the heap is itself a VMA). In the kernel the heap is a VMA that is fixed at one end and stretchable at the other (figure: left). The stretchable end is adjusted through the brk system call. libc manages the heap space itself: when the user calls malloc, libc tries to serve the allocation from the existing heap; if the heap space is insufficient, it enlarges the heap through brk. When the user frees allocated space, libc may shrink the heap through brk. However, growing the heap is easy and shrinking it is hard. Consider a situation where user space allocates 10 blocks of memory in succession and the first 9 are then freed. Even if the 10th block, still unfreed, is only 1 byte large, libc cannot reduce the size of the heap, because only one end of the heap can stretch or shrink and the middle cannot be hollowed out. The 10th block sits right at the stretchable end of the heap, so the heap cannot shrink and the associated resources cannot be returned to the kernel.

When the user mallocs a large block of memory, libc maps a new VMA through the mmap system call, because resizing and managing space on the heap is rather troublesome, and building a new VMA is more convenient (the free problem mentioned above is also one of the reasons). Then why not always mmap a new VMA on every malloc? First, for the allocation and release of small pieces of memory, the heap space managed by libc is usually sufficient, so a system call is not needed every time;
moreover, a VMA works at page granularity, and the smallest unit it can allocate is one page. Second, too many VMAs degrade system performance: page faults, creation and destruction of VMAs, heap resizing and so on all need to operate on VMAs, and they must first find the VMA (or VMAs) to operate on among all the VMAs of the current process; too many VMAs inevitably makes this slower. (When a process has few VMAs, the kernel manages them with a linked list; when there are more, it switches to a red-black tree.)

[The user's stack]

Like the heap, the stack is also a VMA (figure: left), fixed at one end and stretchable (note: not shrinkable) at the other. This VMA is special: there is no system call like brk to stretch it; it stretches automatically. When a user accesses a virtual address beyond this VMA, the kernel automatically grows the VMA while handling the resulting page fault. The kernel checks the current stack register (such as ESP): the accessed virtual address may not exceed ESP plus n (where n is the maximum number of bytes the CPU's push instruction can push at once). In other words, the kernel uses ESP as the baseline for checking whether an access is out of bounds.

However, the value of ESP can be freely read and written by user-mode programs. What if a user program adjusts ESP and makes the stack very large? The kernel has a set of configurable limits on processes, among them a limit on stack size: the stack may only grow that large, and anything beyond it is an error. For a process, the stack can generally stretch fairly large (8MB, for example). But what about threads? What happens with a thread's stack? As mentioned earlier, a thread's mm is shared with its parent process. Although the stack is a VMA inside mm, a thread cannot share this VMA with its parent process (two running entities obviously cannot share one stack). So when a thread is created, the thread library creates a new VMA through mmap to serve as the thread's stack (commonly around 2M). It follows that a thread's stack is not a true stack in the strict sense: it is a fixed region with a limited capacity.
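The fixed, limited nature of a thread's stack can be seen through the pthread attribute API. The sketch below (illustrative only; build with -pthread) requests a 2MB stack for a new thread and prints both the main thread's stack limit and the new thread's stack mapping:

    #define _GNU_SOURCE
    #include <pthread.h>
    #include <stdio.h>
    #include <sys/resource.h>

    static void *worker(void *arg)
    {
        pthread_attr_t attr;
        size_t stacksize = 0;
        void *stackaddr = NULL;

        /* pthread_getattr_np() (a GNU extension) reports the attributes of
         * the running thread, including its stack mapping. */
        pthread_getattr_np(pthread_self(), &attr);
        pthread_attr_getstack(&attr, &stackaddr, &stacksize);
        printf("thread stack: addr=%p size=%zu KB\n", stackaddr, stacksize / 1024);
        pthread_attr_destroy(&attr);
        return NULL;
    }

    int main(void)
    {
        pthread_t tid;
        pthread_attr_t attr;
        struct rlimit rl;

        /* RLIMIT_STACK bounds the main thread's auto-growing stack VMA
         * (the value may be "unlimited" on some systems). */
        getrlimit(RLIMIT_STACK, &rl);
        printf("main stack limit: %llu KB\n",
               (unsigned long long)rl.rlim_cur / 1024);

        /* Ask for a 2 MB stack for the new thread, as discussed above. */
        pthread_attr_init(&attr);
        pthread_attr_setstacksize(&attr, 2 * 1024 * 1024);
        pthread_create(&tid, &attr, worker, NULL);
        pthread_join(tid, NULL);
        pthread_attr_destroy(&attr);
        return 0;
    }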
