A brief analysis of Linux memory management mechanism
This article gives a brief analysis of the Linux memory management mechanism, aiming to help you quickly understand the concepts behind Linux memory management and make effective use of some of its management features.
NUMA
Linux has supported NUMA (Non-Uniform Memory Access) memory management since kernel 2.6. In a system with multiple CPUs, memory is divided into nodes by CPU: each CPU is attached to one node, and accessing the local node is much faster than accessing a node attached to another CPU.
Running numactl -H shows the NUMA hardware information: the size of the two nodes, the CPU cores belonging to each, and the access distances between CPUs and nodes. As shown below, the distance from a CPU to the remote node is roughly twice that to its local node.
# numactl -H
available: 2 nodes (0-1)
node 0 cpus: 0 1 2 3 4 5 6 7 16 17 18 19 20 21 22 23
node 0 size: 15870 MB
node 0 free: 13780 MB
node 1 cpus: 8 9 10 11 12 13 14 15 24 25 26 27 28 29 30 31
node 1 size: 16384 MB
node 1 free: 15542 MB
node distances:
node   0   1
  0:  10  21
  1:  21  10
Running numastat shows NUMA statistics, including the number of hits, misses, local allocations, and remote allocations.
# numastat
                           node0           node1
numa_hit              2351854045      3021228076
numa_miss               22736854         2976885
numa_foreign             2976885        22736854
interleave_hit             14144           14100
local_node            2351844760      3021220020
other_node              22746139         2984941
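To make the locality described above concrete, the sketch below uses the libnuma API to allocate memory on a specific node, which is what keeps numa_hit high and numa_miss low. This is only an illustrative sketch, assuming libnuma (the numactl development package) is installed; the buffer size and node number are arbitrary examples.

/* Minimal sketch: allocate memory on a chosen NUMA node with libnuma so
 * that accesses from CPUs on that node stay local.
 * Build with: gcc numa_alloc.c -lnuma */
#include <numa.h>
#include <stdio.h>
#include <string.h>

int main(void)
{
    if (numa_available() < 0) {                 /* kernel without NUMA support */
        fprintf(stderr, "NUMA is not available on this system\n");
        return 1;
    }

    size_t size = 64UL << 20;                   /* 64MB, arbitrary example size */
    void *buf = numa_alloc_onnode(size, 0);     /* ask for memory on node 0 */
    if (!buf)
        return 1;

    memset(buf, 0, size);                       /* touching the pages commits them */
    printf("allocated %zu bytes on node 0, max node id = %d\n",
           size, numa_max_node());

    numa_free(buf, size);
    return 0;
}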
Zone
Each node is further divided into one or more zones. Zones exist for two reasons: 1. some DMA devices can only access a limited range of memory (ISA devices can only access the first 16MB); 2. the address space of a 32-bit x86 system is limited (at most 4GB), so the HIGHMEM mechanism is needed to make use of larger amounts of memory.
ZONE_DMA
The memory area at the lowest addresses, used for DMA by ISA (Industry Standard Architecture) devices. On the x86 architecture, this zone is limited to 16MB.
ZONE_DMA32
This zone is used for DMA devices that support a 32-bit address bus; it only exists on 64-bit systems.
ZONE_NORMAL
Memory in this zone is mapped directly into the kernel's linear address space and can be used directly. On x86-32 this zone corresponds to the address range 16MB~896MB; on x86-64, all memory beyond DMA and DMA32 is managed in ZONE_NORMAL.
ZONE_HIGHMEM
This zone exists only on 32-bit systems and maps memory above 896MB by creating temporary page tables: a mapping between an address range and the memory is set up when access is needed and torn down when the access ends, so that the address range can be reused to map other HIGHMEM memory.
Zone-related information can be viewed through /proc/zoneinfo. As shown below for an x86-64 system, node0 has three zones (DMA, DMA32 and Normal), while node1 has only a Normal zone.
# cat /proc/zoneinfo | grep -E "zone| free|managed"
Node 0, zone      DMA
  pages free     3700
        managed  3975
Node 0, zone    DMA32
  pages free     291250
        managed  326897
Node 0, zone   Normal
  pages free     3232166
        managed  3604347
Node 1, zone   Normal
  pages free     3980110
        managed  4128056
Page
The page is the basic unit of low-level memory management in Linux, with a size of 4KB. Each page maps a contiguous block of physical memory, and memory is allocated and freed in units of pages. The mapping from a process's virtual addresses to physical addresses is also done through page tables, where each page table entry records the physical address corresponding to the virtual address of one page.
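As a quick illustration of page granularity (a small sketch, not part of the original article), the program below prints the system page size and maps anonymous memory with mmap; even a 1-byte request is still backed by a whole page.

/* Small sketch: the page is the unit of allocation, so even a tiny mmap
 * request consumes a whole page once it is touched. */
#include <stdio.h>
#include <sys/mman.h>
#include <unistd.h>

int main(void)
{
    long page_size = sysconf(_SC_PAGESIZE);     /* typically 4096 on x86 */
    printf("page size: %ld bytes\n", page_size);

    /* Request a single byte of anonymous memory: the kernel still backs it
     * with one full page when it is first written. */
    char *p = mmap(NULL, 1, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (p == MAP_FAILED)
        return 1;
    p[0] = 42;                                  /* first touch faults the page in */

    munmap(p, 1);
    return 0;
}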
TLB
A memory access needs to find the page corresponding to the address, and this information is recorded in the page tables. Every memory access is first resolved through the page tables, so the page tables are the most frequently accessed data structure. To speed up page table lookups, the TLB (Translation Lookaside Buffer) mechanism caches recently used page table entries in the CPU, which is why the TLB miss rate of the L1/L2 caches is an important CPU performance statistic. On a large-memory system, however, the page tables cannot be fully cached: mapping the full 256GB of memory requires 256GB/4KB = 67,108,864 entries, and at 16 bytes per entry that is about 1GB, far more than the CPU cache can hold. Accessing a wide range of memory then easily causes TLB misses and increased access latency.
Hugepages
To reduce the probability of TLB misses, Linux introduced the hugepages mechanism, which allows the page size to be set to 2MB or 1GB. With 2MB hugepages, the same 256GB of memory needs only 256GB/2MB = 131,072 page table entries, about 2MB in total, so the hugepage page tables can be fully cached in the CPU cache.
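The arithmetic in the last two paragraphs can be checked with a few lines of C; the memory size and the 16-byte entry figure are taken from the text above.

/* Quick check of the page table sizing figures quoted above. */
#include <stdio.h>

int main(void)
{
    unsigned long long mem   = 256ULL << 30;            /* 256GB of RAM */
    unsigned long long entry = 16;                      /* bytes per entry, as in the text */

    unsigned long long entries_4k = mem / (4ULL << 10); /* 4KB pages */
    unsigned long long entries_2m = mem / (2ULL << 20); /* 2MB hugepages */

    printf("4KB pages: %llu entries, %llu MB of page tables\n",
           entries_4k, entries_4k * entry >> 20);       /* 67108864 entries, 1024 MB */
    printf("2MB pages: %llu entries, %llu MB of page tables\n",
           entries_2m, entries_2m * entry >> 20);       /* 131072 entries, 2 MB */
    return 0;
}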
Running sysctl -w vm.nr_hugepages=1024 sets the number of hugepages to 1024, a total of 2GB with the default 2MB hugepage size. Note that setting up hugepages requests contiguous 2MB blocks of memory from the system and holds them (they are no longer available for normal allocations), so if the system has been running for a while and memory has become fragmented, the hugepage request may fail.
Shown below are the setup and mount steps for hugepages; applications then use these hugepages by mmap-ing files under the mount path.
sysctl -w vm.nr_hugepages=1024
mkdir -p /mnt/hugepages
mount -t hugetlbfs hugetlbfs /mnt/hugepages
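As a sketch of how an application consumes these pages, the program below creates a file under the hugetlbfs mount point and maps it with mmap. The file name and mapping size are just examples; the mapping length must be a multiple of the 2MB hugepage size.

/* Sketch: map 2MB-backed memory through a file on the hugetlbfs mount. */
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

#define HUGEPAGE_SIZE (2UL * 1024 * 1024)       /* default hugepage size on x86-64 */

int main(void)
{
    /* Hypothetical file under the mount point created above. */
    int fd = open("/mnt/hugepages/example", O_CREAT | O_RDWR, 0600);
    if (fd < 0) {
        perror("open");
        return 1;
    }

    size_t len = 4 * HUGEPAGE_SIZE;             /* must be a multiple of the hugepage size */
    void *addr = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    if (addr == MAP_FAILED) {
        perror("mmap");                         /* fails if not enough hugepages are free */
        close(fd);
        return 1;
    }

    memset(addr, 0, len);                       /* hugepages are faulted in here */
    printf("mapped %zu bytes of hugepage memory at %p\n", len, addr);

    munmap(addr, len);
    close(fd);
    unlink("/mnt/hugepages/example");
    return 0;
}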
Buddy System
The Linux buddy system is designed to address the external fragmentation caused by page-based memory allocation: the system may run out of contiguous pages, so requests that need contiguous pages cannot be satisfied. The principle is simple: blocks made of different numbers of contiguous pages are organized into 11 block lists by powers of two, corresponding to 1, 2, 4, 8, 16, 32, 64, 128, 256, 512 and 1024 contiguous pages. When the buddy system is asked for memory, it finds the most appropriate block for the requested size.
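To illustrate "finding the most appropriate block", the user-space sketch below (not the kernel's actual code) rounds a request up to the smallest power-of-two block that can hold it; for example, a 5-page request is served from the 8-page (order 3) list.

/* Illustrative sketch of buddy-style order selection: round a request up to
 * the smallest power-of-two block (order 0..10, i.e. 1..1024 pages). */
#include <stdio.h>

static int order_for_pages(unsigned int pages)
{
    int order = 0;
    unsigned int block = 1;
    while (block < pages && order < 10) {       /* 11 lists: orders 0 through 10 */
        block <<= 1;
        order++;
    }
    return block >= pages ? order : -1;         /* -1: larger than the biggest block */
}

int main(void)
{
    unsigned int requests[] = { 1, 5, 33, 1000 };
    for (unsigned int i = 0; i < sizeof(requests) / sizeof(requests[0]); i++) {
        int order = order_for_pages(requests[i]);
        printf("%4u pages -> order %d (%u-page block)\n",
               requests[i], order, 1u << order);
    }
    return 0;
}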
Shown below is the basic buddy system information for each zone; the 11 columns are the numbers of available blocks in the 11 block lists.
# cat /proc/buddyinfo
Node 0, zone      DMA      0      0      1      0      1      1      1      0      0      1
Node 0, zone    DMA32    102     79    179    229    230    166    251    168    107     78
Node 0, zone   Normal   1328    900   1985   1920   2261   1388    798    972    539    324
Node 1, zone   Normal    466   1476   2133   7715   6026   4737   2883   1532    778    490   2760
Slab
Memory from the buddy system comes in large chunks, but most requests are for small amounts of memory, such as data structures of a few hundred bytes; allocating a whole page for these would be very wasteful. To satisfy small and irregular allocation requests, Linux provides the slab allocator. The idea is to create a cache (kmem_cache) for a particular data structure, request pages from the buddy system, divide each page into multiple objects of that structure's size, and hand out one object whenever the user requests that data structure from the cache.
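Inside the kernel this is done with the kmem_cache API. The fragment below is a hedged sketch of a minimal kernel module (the structure "struct foo" is made up for illustration) that creates a cache for one data structure and allocates an object from it.

/* Sketch of kernel-side slab usage as a minimal module; "struct foo" is a
 * made-up example structure, not something from the article. */
#include <linux/module.h>
#include <linux/slab.h>

struct foo {
    int id;
    char name[56];
};

static struct kmem_cache *foo_cache;

static int __init foo_init(void)
{
    struct foo *f;

    /* One cache per data structure: the slab allocator takes pages from the
     * buddy system and carves them into sizeof(struct foo) objects. */
    foo_cache = kmem_cache_create("foo_cache", sizeof(struct foo), 0,
                                  SLAB_HWCACHE_ALIGN, NULL);
    if (!foo_cache)
        return -ENOMEM;

    f = kmem_cache_alloc(foo_cache, GFP_KERNEL);    /* one object from the cache */
    if (f) {
        f->id = 1;
        kmem_cache_free(foo_cache, f);              /* object goes back to the cache */
    }
    return 0;
}

static void __exit foo_exit(void)
{
    kmem_cache_destroy(foo_cache);
}

module_init(foo_init);
module_exit(foo_exit);
MODULE_LICENSE("GPL");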
Slab information can be viewed as follows:
# cat /proc/slabinfo
slabinfo - version: 2.1
# name            <active_objs> <num_objs> <objsize> <objperslab> <pagesperslab> : tunables <limit> <batchcount> <sharedfactor> : slabdata <active_slabs> <num_slabs> <sharedavail>
fat_inode_cache              720      8 : tunables 0 0 0 : slabdata    2    2    0
fat_cache               0      0    102   1 : tunables 0 0 0 : slabdata    0    0    0
kvm_vcpu                0      0  16576   1   8 : tunables 0 0 0 : slabdata    0    0    0
kvm_mmu_page_header     0      0    168   2 : tunables 0 0 0 : slabdata    0    0    0
ext4_groupinfo_4k    4440   4440    136   1 : tunables 0 0 0 : slabdata  148  148    0
ext4_inode_cache    63816  65100   1032   8 : tunables 0 0 0 : slabdata 2100 2100    0
ext4_xattr           1012   1012      1 : tunables 0 0 0 : slabdata   22   22    0
ext4_free_data      16896  17600      1 : tunables 0 0 0 : slabdata  275  275    0
Usually we view the sorted slab information with the slabtop command:
  OBJS ACTIVE  USE OBJ SIZE  SLABS OBJ/SLAB CACHE SIZE NAME
352014 352014 100%    0.10K   9026       39
 93492  93435  99%    0.19K   2226       42
 65100  63816  98%    1.01K   2100       31
 48128  47638  98%    0.06K    752       64
 47090  43684  92%    0.05K    554       85
 44892  44892 100%    0.11K   1247       36
 43624  43177  98%    0.07K    779       56
 43146  42842  99%    0.04K    423      102      1692K ext4_extent_status
kmalloc
Just as glibc provides malloc(), the kernel provides kmalloc() for allocating memory of arbitrary size. Allowing arbitrary-sized allocations out of a page, however, causes internal fragmentation within the page. To address this, Linux implements kmalloc on top of the slab mechanism: slab pools with power-of-two object sizes are created, and a kmalloc request is served from the slab whose object size fits best.
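A minimal kernel-side sketch of this behaviour (again assuming kernel-module context, not code from the article): a 100-byte kmalloc request is rounded up and served from the kmalloc-128 slab.

/* Sketch: kmalloc rounds the request up to the best-fitting kmalloc cache,
 * so these 100 bytes actually come from the kmalloc-128 slab. */
#include <linux/kernel.h>
#include <linux/slab.h>

static void kmalloc_example(void)
{
    char *buf = kmalloc(100, GFP_KERNEL);   /* served from kmalloc-128 */
    if (!buf)
        return;

    /* ksize() reports the usable size actually allocated (128 here). */
    pr_info("asked for 100 bytes, got %zu\n", ksize(buf));

    kfree(buf);
}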
The slabs used for kmalloc allocations look like this:
# cat /proc/slabinfo
slabinfo - version: 2.1
# name            <active_objs> <num_objs> <objsize> <objperslab> <pagesperslab> : tunables <limit> <batchcount> <sharedfactor> : slabdata <active_slabs> <num_slabs> <sharedavail>
kmalloc-8192      196         8192   4   8 : tunables 0 0 0 : slabdata   50   50   0
kmalloc-4096     1214   1288  4096   8   8 : tunables 0 0 0 : slabdata  161  161   0
kmalloc-2048     2861   2928  2048       8 : tunables 0 0 0 : slabdata  183  183   0
kmalloc-1024     7993   8320            8 : tunables 0 0 0 : slabdata  260  260   0
kmalloc-512      6030   6144            4 : tunables 0 0 0 : slabdata  192  192   0
kmalloc-256      7813   8576            2 : tunables 0 0 0 : slabdata  268  268   0
kmalloc-192     15542  15750   192      2 : tunables 0 0 0 : slabdata  375  375   0
kmalloc-128     16814  16896            1 : tunables 0 0 0 : slabdata  528  528   0
kmalloc-96      17507  17934            1 : tunables 0 0 0 : slabdata  427  427   0
kmalloc-64      48590  48704            1 : tunables 0 0 0 : slabdata  761  761   0
kmalloc-32       7296   7296            1 : tunables 0 0 0 : slabdata   57   57   0
kmalloc-16      14336  14336            1 : tunables 0 0 0 : slabdata   56   56   0
kmalloc-8       21504  21504     8      1 : tunables 0 0 0 : slabdata   42   42   0
Kernel parameters
Linux provides a number of memory-management related kernel parameters. They can be viewed in the /proc/sys/vm directory or with sysctl -a | grep vm:
# sysctl -a | grep vm
vm.admin_reserve_kbytes = 8192
vm.block_dump = 0
vm.dirty_background_bytes = 0
vm.dirty_background_ratio = 10
vm.dirty_bytes = 0
vm.dirty_expire_centisecs = 3000
vm.dirty_ratio = 20
vm.dirty_writeback_centisecs = 500
vm.drop_caches = 1
vm.extfrag_threshold = 500
vm.hugepages_treat_as_movable = 0
vm.hugetlb_shm_group = 0
vm.laptop_mode = 0
vm.legacy_va_layout = 0
vm.lowmem_reserve_ratio = 256 256 32
vm.max_map_count = 65530
vm.memory_failure_early_kill = 0
vm.memory_failure_recovery = 1
vm.min_free_kbytes = 1024000
vm.min_slab_ratio = 1
vm.min_unmapped_ratio = 1
vm.mmap_min_addr = 4096
vm.nr_hugepages = 0
vm.nr_hugepages_mempolicy = 0
vm.nr_overcommit_hugepages = 0
vm.nr_pdflush_threads = 0
vm.numa_zonelist_order = default
vm.oom_dump_tasks = 1
vm.oom_kill_allocating_task = 0
vm.overcommit_kbytes = 0
vm.overcommit_memory = 0
vm.overcommit_ratio = 50
vm.page-cluster = 3
vm.panic_on_oom = 0
vm.percpu_pagelist_fraction = 0
vm.stat_interval = 1
vm.swappiness = 60
vm.user_reserve_kbytes = 131072
vm.vfs_cache_pressure = 100
vm.zone_reclaim_mode = 0
vm.drop_caches
vm.drop_caches is the most commonly used of these parameters, because the Linux page cache mechanism causes a large amount of memory to be used by the file system cache, including both the data cache and the metadata (dentry, inode) caches. When memory is low, we can use this parameter to quickly release the file system cache:
To free pagecache:
    echo 1 > /proc/sys/vm/drop_caches
To free reclaimable slab objects (includes dentries and inodes):
    echo 2 > /proc/sys/vm/drop_caches
To free slab objects and pagecache:
    echo 3 > /proc/sys/vm/drop_caches
vm.min_free_kbytes
vm.min_free_kbytes determines the free-memory threshold below which memory reclaim starts (reclaiming the file system cache mentioned above as well as the reclaimable slab objects mentioned below). Its default value is fairly small; on a machine with a lot of memory, setting it to a larger value (such as 1GB) lets memory reclaim be triggered automatically while memory is not yet critically tight. It should not be set too large, however, or applications will frequently be OOM-killed.
sysctl -w vm.min_free_kbytes=1024000
vm.min_slab_ratio
vm.min_slab_ratio determines the percentage of reclaimable slab space in a zone at which slab reclaim is triggered; the default is 5%. In the author's tests, however, slab reclaim was not triggered while memory was plentiful; it only happened once free memory dropped to the min_free_kbytes watermark described above. The minimum value that can be set is 1%:
sysctl -w vm.min_slab_ratio=1
Summary
This article briefly described the Linux memory management mechanism and several commonly used memory management kernel parameters.
Resources
Understanding the Linux Kernel 3rd Edition
Linux Physical Memory Description: http://www.ilinuxkernel.com/files/Linux_Physical_Memory_Description.pdf