A brief analysis of Linux memory management mechanism


This article gives a brief analysis of the Linux memory management mechanism, aiming to help you quickly understand the key concepts and make effective use of some of the management facilities.

NUMA

Starting with version 2.6, Linux supports NUMA (Non-Uniform Memory Access) in its memory management. In a system with multiple CPUs, memory is divided into nodes by CPU: each CPU has a local node attached to it, and a CPU accesses its local node much faster than it accesses the nodes attached to other CPUs.
Running numactl -H shows the NUMA hardware information: the size of the two nodes, the CPU cores belonging to each node, and the distances for CPU access to each node. As shown below, the distance for accessing a remote node is about twice that of the local node.

# numactl -H
available: 2 nodes (0-1)
node 0 cpus: 0 1 2 3 4 5 6 7 16 17 18 19 20 21 22 23
node 0 size: 15870 MB
node 0 free: 13780 MB
node 1 cpus: 8 9 10 11 12 13 14 15 24 25 26 27 28 29 30 31
node 1 size: 16384 MB
node 1 free: 15542 MB
node distances:
node   0   1
  0:  10  21
  1:  21  10

numastat shows NUMA statistics, including the number of hits, misses, local allocations, and remote allocations for memory allocation.

# numastat
                           node0           node1
numa_hit              2351854045      3021228076
numa_miss               22736854         2976885
numa_foreign             2976885        22736854
interleave_hit             14144           14100
local_node            2351844760      3021220020
other_node              22746139         2984941
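
Applications can also place memory on a specific node themselves. Below is a minimal sketch using the libnuma user-space API (an illustration added here, not part of the original article; it assumes libnuma and its headers are installed, the program is linked with -lnuma, and the 64MB size and node 0 are arbitrary choices):

/* Minimal sketch of node-local allocation with libnuma (link with -lnuma).
 * The buffer size and node number are arbitrary illustrations. */
#include <numa.h>
#include <stdio.h>
#include <string.h>

int main(void)
{
    if (numa_available() < 0) {
        fprintf(stderr, "NUMA is not available on this system\n");
        return 1;
    }
    printf("nodes: 0..%d\n", numa_max_node());

    /* Allocate 64 MB on node 0; threads pinned to node 0's CPUs then avoid
     * the remote-access penalty shown in the distance table above. */
    size_t len = 64UL << 20;
    void *buf = numa_alloc_onnode(len, 0);
    if (!buf) {
        fprintf(stderr, "numa_alloc_onnode failed\n");
        return 1;
    }
    memset(buf, 0, len);   /* first touch actually places the pages */

    numa_free(buf, len);
    return 0;
}

An unmodified program can get a similar effect from the command line, for example numactl --cpunodebind=0 --membind=0 ./app.
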
Zone

Each node is further divided into one or more zones. Why have zones? Two reasons: 1. DMA devices can access only a limited range of memory (ISA devices can only access the first 16MB); 2. the address space of 32-bit x86 systems is limited (32 bits can address at most 4GB), so the HIGHMEM mechanism is needed to make use of larger memory.

ZONE_DMA

An area at the lowest end of the address range, used for DMA access by ISA (Industry Standard Architecture) devices. On the x86 architecture, this zone is limited to 16MB.

ZONE_DMA32

This zone is used for DMA devices that support a 32-bit address bus, and exists only on 64-bit systems.

ZONE_NORMAL

Memory in this zone is mapped directly into the kernel's linear address space and can be used directly. On the x86-32 architecture, this zone corresponds to the address range 16MB~896MB. On the x86-64 architecture, all memory beyond the DMA and DMA32 zones is managed in the normal zone.

ZONE_HIGHMEM

This zone exists only on 32-bit systems and maps memory above 896MB by creating temporary page table entries: a mapping between an address range and the memory is established when access is needed, and removed after the access finishes so that the address range can be reused to map other HIGHMEM memory.

Zone-related information can be viewed through /proc/zoneinfo. As shown below, on this x86-64 system node0 has the three zones DMA, DMA32, and Normal, while node1 has only a Normal zone.

# cat /proc/zoneinfo | grep -E "zone| free|managed"
Node 0, zone      DMA
  pages free     3700
        managed  3975
Node 0, zone    DMA32
  pages free     291250
        managed  326897
Node 0, zone   Normal
  pages free     3232166
        managed  3604347
Node 1, zone   Normal
  pages free     3980110
        managed  4128056
Page

The page is the basic unit of low-level memory management in Linux, with a size of 4KB. A page is mapped to a contiguous piece of physical memory, and memory allocation and deallocation are done in units of pages. The mapping from a process's virtual addresses to physical addresses is also done through page tables, where each page table entry records the physical address corresponding to the virtual address of one page.
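
As a small user-space illustration (a sketch added here, not from the original article), the page size can be queried with sysconf, and even a tiny mmap request consumes a whole page:

/* Query the system page size and observe that mmap works in whole pages.
 * The 100-byte request is an arbitrary example. */
#include <stdio.h>
#include <sys/mman.h>
#include <unistd.h>

int main(void)
{
    long page = sysconf(_SC_PAGESIZE);        /* typically 4096 on x86 */
    printf("page size: %ld bytes\n", page);

    /* Even a 100-byte anonymous mapping occupies one full page. */
    void *p = mmap(NULL, 100, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (p == MAP_FAILED) {
        perror("mmap");
        return 1;
    }
    munmap(p, 100);
    return 0;
}
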

TLB

A memory access has to find the page corresponding to the address, and this information is recorded in the page tables. Every memory access therefore involves a page table lookup first, so the page tables are the most frequently accessed data. To speed up page table access, the TLB (Translation Lookaside Buffer) mechanism was introduced, which caches recently used page table entries inside the CPU. This is why L1/L2 TLB miss counts are an important part of CPU performance statistics. On a large-memory system, for example with a full 256GB of RAM, the page tables contain 256GB/4KB = 67,108,864 entries; at 16 bytes per entry that is 1GB, which clearly cannot be cached in full by the CPU. Accessing a wide range of memory then easily causes TLB misses, which increase access latency.

Hugepages

To reduce the probability of TLB misses, Linux introduced the hugepages mechanism, which allows the page size to be set to 2MB or 1GB. With 2MB hugepages, the page table for the same 256GB of memory shrinks to 256GB/2MB = 131,072 entries and needs only 2MB, so the hugepages page table can be cached in the CPU in full.
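
The short sketch below merely reproduces the arithmetic above, carrying over the article's figure of 16 bytes per page table entry as an assumption:

/* Reproduce the page-table-size arithmetic from the text above.
 * The 16-byte entry size is the figure used in the article, taken as-is. */
#include <stdio.h>

int main(void)
{
    unsigned long long mem   = 256ULL << 30;  /* 256 GB of RAM */
    unsigned long long entry = 16;            /* bytes per page table entry (per the text) */

    unsigned long long small = 4ULL << 10;    /* 4 KB pages */
    unsigned long long huge  = 2ULL << 20;    /* 2 MB hugepages */

    printf("4KB pages : %llu entries, %llu MB of page tables\n",
           mem / small, mem / small * entry >> 20);   /* 67108864 entries, 1024 MB */
    printf("2MB pages : %llu entries, %llu MB of page tables\n",
           mem / huge, mem / huge * entry >> 20);     /* 131072 entries, 2 MB */
    return 0;
}
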
Running sysctl -w vm.nr_hugepages=1024 sets the number of hugepages to 1024, for a total of 2GB. Note that setting up hugepages reserves contiguous 2MB blocks of memory from the system and holds on to them (they are not available for normal memory allocations); if the system has been running for a while and memory has become fragmented, the hugepages request may fail.
The setup and mount steps for hugepages are shown below; an application then uses these hugepages by mapping files under the mount path with mmap.

sysctl -w vm.nr_hugepages=1024
mkdir -p /mnt/hugepages
mount -t hugetlbfs hugetlbfs /mnt/hugepages
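
Below is a minimal sketch of that mmap usage (an illustration added here, not part of the original article). It assumes hugetlbfs is mounted at /mnt/hugepages as above and that the hugepage size is 2MB; the file name /mnt/hugepages/example and the 128MB mapping length are arbitrary:

/* Map memory backed by 2MB hugepages through a file on the hugetlbfs mount.
 * Assumes the mount commands above have been run; the file name is arbitrary
 * and the mapping length must be a multiple of the hugepage size. */
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

#define HUGEPAGE_SIZE (2UL * 1024 * 1024)
#define LENGTH        (64 * HUGEPAGE_SIZE)   /* 128 MB backed by hugepages */

int main(void)
{
    int fd = open("/mnt/hugepages/example", O_CREAT | O_RDWR, 0600);
    if (fd < 0) {
        perror("open");
        return 1;
    }

    void *addr = mmap(NULL, LENGTH, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    if (addr == MAP_FAILED) {
        perror("mmap");          /* fails if not enough hugepages are reserved */
        close(fd);
        return 1;
    }

    memset(addr, 0, LENGTH);     /* the backing 2MB pages are faulted in here */

    munmap(addr, LENGTH);
    close(fd);
    unlink("/mnt/hugepages/example");
    return 0;
}
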
Buddy System

The Linux buddy system addresses the external fragmentation caused by page-based memory allocation: when the system runs out of contiguous pages, requests that need several contiguous pages cannot be satisfied. The principle is simple: free blocks made up of different numbers of contiguous pages are kept in 11 block lists, organized by powers of two, corresponding to blocks of 1, 2, 4, 8, 16, 32, 64, 128, 256, 512, and 1024 contiguous pages. When the buddy system is called to allocate memory, it finds the most suitable block for the requested size.
The basic buddy system information for each zone is shown below; the 11 columns are the number of available blocks in each of the 11 block lists.

# cat /proc/buddyinfo
Node 0, zone      DMA      0      0      1      0      1      1      1      0      0      1
Node 0, zone    DMA32    102     79    179    229    230    166    251    168    107     78
Node 0, zone   Normal   1328    900   1985   1920   2261   1388    798    972    539    324
Node 1, zone   Normal    466   1476   2133   7715   6026   4737   2883   1532    778    490   2760
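
To make the 11 columns concrete, the short sketch below (illustrative, assuming the usual 4KB page size) prints the block size that each column, i.e. each order from 0 to 10, corresponds to:

/* Print the block size behind each /proc/buddyinfo column:
 * column n holds free blocks of 2^n contiguous 4KB pages (order 0..10). */
#include <stdio.h>

int main(void)
{
    unsigned long page_size = 4096;
    for (int order = 0; order <= 10; order++) {
        unsigned long pages = 1UL << order;
        printf("order %2d: %4lu pages = %6lu KB\n",
               order, pages, pages * page_size / 1024);
    }
    return 0;
}
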
Slab

The buddy system allocates memory in large chunks, but most requests are for small amounts of memory, such as data structures of a few hundred bytes; allocating a whole page for each of these would be very wasteful. To satisfy small, irregular allocation requests, Linux provides the slab allocator. The idea is to create a memcache for a particular data structure, request pages from the buddy system, split each page into multiple objects of the data structure's size, and hand out one object whenever the user requests that data structure from the memcache.
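
As a hedged kernel-side sketch of this idea (added here, not part of the original article), the module below creates a dedicated slab cache for a made-up struct and allocates one object from it; the struct, the cache name demo_item_cache, and the overall flow are purely illustrative:

/* Illustrative out-of-tree kernel module: a dedicated slab cache for a
 * made-up struct; the struct, cache name and flow are examples only. */
#include <linux/errno.h>
#include <linux/init.h>
#include <linux/module.h>
#include <linux/slab.h>

struct demo_item {
    int  id;
    char name[64];
};

static struct kmem_cache *demo_cache;
static struct demo_item *item;

static int __init demo_init(void)
{
    /* Pages obtained from the buddy system are carved into demo_item objects. */
    demo_cache = kmem_cache_create("demo_item_cache", sizeof(struct demo_item),
                                   0, SLAB_HWCACHE_ALIGN, NULL);
    if (!demo_cache)
        return -ENOMEM;

    item = kmem_cache_alloc(demo_cache, GFP_KERNEL);   /* hand out one object */
    if (!item) {
        kmem_cache_destroy(demo_cache);
        return -ENOMEM;
    }
    item->id = 1;
    return 0;
}

static void __exit demo_exit(void)
{
    kmem_cache_free(demo_cache, item);
    kmem_cache_destroy(demo_cache);
}

module_init(demo_init);
module_exit(demo_exit);
MODULE_LICENSE("GPL");

A dedicated cache created this way normally appears as its own line in /proc/slabinfo, unless the kernel merges it with an existing cache of the same object size.
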
Slab information can be viewed on Linux as follows:

# cat /proc/slabinfo
slabinfo - version: 2.1
# name            <active_objs> <num_objs> <objsize> <objperslab> <pagesperslab> : tunables <limit> <batchcount> <sharedfactor> : slabdata <active_slabs> <num_slabs> <sharedavail>
fat_inode_cache       720     8 : tunables 0 0 0 : slabdata 2 2 0
fat_cache               0     0   102 1 : tunables 0 0 0 : slabdata 0 0 0
kvm_vcpu                0     0 16576 1 8 : tunables 0 0 0 : slabdata 0 0 0
kvm_mmu_page_header     0     0   168 2 : tunables 0 0 0 : slabdata 0 0 0
ext4_groupinfo_4k    4440  4440   136 1 : tunables 0 0 0 : slabdata 148 148 0
ext4_inode_cache    63816 65100  1032 8 : tunables 0 0 0 : slabdata 2100 2100 0
ext4_xattr           1012  1012 1 : tunables 0 0 0 : slabdata 22 22 0
ext4_free_data      16896 17600 1 : tunables 0 0 0 : slabdata 275 275 0

Usually we view the sorted slab information with the slabtop command:

  OBJS ACTIVE  USE OBJ SIZE  SLABS OBJ/SLAB CACHE SIZE NAME
352014 352014 100%    0.10K   9026       39
 93492  93435  99%    0.19K   2226       42
 65100  63816  98%    1.01K   2100       31
 48128  47638  98%    0.06K    752       64
 47090  43684  92%    0.05K    554       85
 44892  44892 100%    0.11K   1247       36
 43624  43177  98%    0.07K    779       56
 43146  42842  99%    0.04K    423      102      1692K ext4_extent_status
Kmalloc

As with malloc() in glibc, the kernel provides kmalloc() for allocating memory of arbitrary size. Similarly, letting callers request arbitrary sizes out of a page can cause memory fragmentation inside the page. To address this internal fragmentation, Linux implements kmalloc allocations on top of the slab mechanism: a set of general-purpose slab pools with power-of-two sizes (plus a few in-between sizes such as 96 and 192 bytes) is created, and a kmalloc request is served from the slab whose object size fits best.
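
As a brief kernel-context sketch (illustrative only; the 100-byte request size is arbitrary), such a request would be served from the kmalloc-128 cache shown in the listing below:

/* Illustrative kernel-context snippet (would live inside a module or driver):
 * a ~100-byte request is served from the best-fitting general-purpose cache,
 * kmalloc-128 in the listing below. */
#include <linux/errno.h>
#include <linux/slab.h>

static int demo_kmalloc(void)
{
    char *buf = kmalloc(100, GFP_KERNEL);   /* rounded up to a 128-byte slab object */
    if (!buf)
        return -ENOMEM;

    /* ... use the buffer ... */

    kfree(buf);
    return 0;
}
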
The slabs used for kmalloc allocations are as follows:

# cat /proc/slabinfo
slabinfo - version: 2.1
# name            <active_objs> <num_objs> <objsize> <objperslab> <pagesperslab> : tunables <limit> <batchcount> <sharedfactor> : slabdata <active_slabs> <num_slabs> <sharedavail>
kmalloc-8192          196  8192     4     8 : tunables 0 0 0 : slabdata 50 50 0
kmalloc-4096         1214  1288  4096     8     8 : tunables 0 0 0 : slabdata 161 161 0
kmalloc-2048         2861  2928  2048     8 : tunables 0 0 0 : slabdata 183 183 0
kmalloc-1024         7993  8320     8 : tunables 0 0 0 : slabdata 260 260 0
kmalloc-512          6030  6144     4 : tunables 0 0 0 : slabdata 192 192 0
kmalloc-256          7813  8576     2 : tunables 0 0 0 : slabdata 268 268 0
kmalloc-192         15542 15750   192     2 : tunables 0 0 0 : slabdata 375 375 0
kmalloc-128         16814 16896     1 : tunables 0 0 0 : slabdata 528 528 0
kmalloc-96          17507 17934     1 : tunables 0 0 0 : slabdata 427 427 0
kmalloc-64          48590 48704     1 : tunables 0 0 0 : slabdata 761 761 0
kmalloc-32           7296  7296     1 : tunables 0 0 0 : slabdata 57 57 0
kmalloc-16          14336 14336     1 : tunables 0 0 0 : slabdata 56 56 0
kmalloc-8           21504 21504     8     1 : tunables 0 0 0 : slabdata 42 42 0
Kernel parameters

Linux provides a number of memory-management-related kernel parameters, which can be viewed in the /proc/sys/vm directory or with sysctl -a | grep vm:

# sysctl -a | grep vm
vm.admin_reserve_kbytes = 8192
vm.block_dump = 0
vm.dirty_background_bytes = 0
vm.dirty_background_ratio = 10
vm.dirty_bytes = 0
vm.dirty_expire_centisecs = 3000
vm.dirty_ratio = 20
vm.dirty_writeback_centisecs = 500
vm.drop_caches = 1
vm.extfrag_threshold = 500
vm.hugepages_treat_as_movable = 0
vm.hugetlb_shm_group = 0
vm.laptop_mode = 0
vm.legacy_va_layout = 0
vm.lowmem_reserve_ratio = 256 256 32
vm.max_map_count = 65530
vm.memory_failure_early_kill = 0
vm.memory_failure_recovery = 1
vm.min_free_kbytes = 1024000
vm.min_slab_ratio = 1
vm.min_unmapped_ratio = 1
vm.mmap_min_addr = 4096
vm.nr_hugepages = 0
vm.nr_hugepages_mempolicy = 0
vm.nr_overcommit_hugepages = 0
vm.nr_pdflush_threads = 0
vm.numa_zonelist_order = default
vm.oom_dump_tasks = 1
vm.oom_kill_allocating_task = 0
vm.overcommit_kbytes = 0
vm.overcommit_memory = 0
vm.overcommit_ratio = 50
vm.page-cluster = 3
vm.panic_on_oom = 0
vm.percpu_pagelist_fraction = 0
vm.stat_interval = 1
vm.swappiness = 60
vm.user_reserve_kbytes = 131072
vm.vfs_cache_pressure = 100
vm.zone_reclaim_mode = 0
vm.drop_caches

vm.drop_caches is the most commonly used of these parameters, because the Linux page cache mechanism leads to a large amount of memory being used for file system caches, including both data caches and metadata (dentry, inode) caches. When memory is tight, this parameter lets us quickly release the file system caches:

To free pagecache:
    echo 1 > /proc/sys/vm/drop_caches
To free reclaimable slab objects (includes dentries and inodes):
    echo 2 > /proc/sys/vm/drop_caches
To free slab objects and pagecache:
    echo 3 > /proc/sys/vm/drop_caches
vm.min_free_kbytes

vm.min_free_kbytes determines the free-memory threshold below which the memory reclaim mechanism starts (reclaiming the file system caches mentioned above and the reclaimable slabs mentioned below). Its default value is fairly small; on a machine with plenty of memory, setting it to a larger value (such as 1GB) makes memory reclaim trigger automatically while free memory is not yet critically low. It must not be set too large, however, or applications will frequently be OOM-killed.

sysctl -w vm.min_free_kbytes=1024000
vm.min_slab_ratio

vm.min_slab_ratio determines the percentage of reclaimable slab space in a zone at which slab reclaim is triggered; the default is 5%. In the author's tests, however, slab reclaim is not triggered while memory is plentiful, but only once free memory reaches the min_free_kbytes watermark described above. The minimum value it can be set to is 1%:

sysctl -w vm.min_slab_ratio=1
Summary

The above briefly describes the Linux memory management mechanism and several common memory management kernel parameters.

Resources

Understanding the Linux Kernel 3rd Edition
Linux Physical Memory Description (http://www.ilinuxkernel.com/files/Linux_Physical_Memory_Description.pdf)
