/proc/sys/vm Final Version

Source: Internet
Author: User
I have recently been putting together a detailed explanation of the proc directory. You are welcome to translate it. Although it is in English, it is easy to understand. If you have any questions, please leave a message, and we will work together for the Linux community. Our translation is not necessarily good, because the original was written by foreigners after all, and translation may cause misunderstanding. Understanding this requires TCP/IP knowledge. Learning proc is very important for Linux performance optimization. This is my own compilation, and I hope it will be useful to you. /proc/sys/vm is mainly about virtual memory. This directory is as follows:

[root@jiangtao sys]# cd vm
[root@jiangtao vm]# ls
block_dump                  legacy_va_layout          overcommit_memory
dirty_background_ratio      lowmem_reserve_ratio      overcommit_ratio
dirty_expire_centisecs      max_map_count             page-cluster
dirty_ratio                 min_free_kbytes           panic_on_oom
dirty_writeback_centisecs   mmap_min_addr             percpu_pagelist_fraction
drop_caches                 nr_hugepages              stat_interval
highmem_is_dirtyable        nr_overcommit_hugepages   swappiness
hugepages_treat_as_movable  nr_pdflush_threads        vdso_enabled
hugetlb_shm_group           oom_dump_tasks            vfs_cache_pressure
laptop_mode                 oom_kill_allocating_task  would_have_oomkilled
[root@jiangtao vm]# pwd
/proc/sys/vm

1. block_dump
block_dump enables block I/O debugging when set to a nonzero value. If you want to find out which process caused the disk to spin up (see /proc/sys/vm/laptop_mode), you can gather information by setting the flag. When this flag is set, Linux reports all disk read and write operations that take place, and all block dirtyings done to files. This makes it possible to debug why a disk needs to spin up, and to increase battery life even more. The output of block_dump is written to the kernel output, and it can be retrieved using "dmesg". When you use block_dump and your kernel logging level also includes kernel debugging messages, you probably want to turn off klogd, otherwise the output of block_dump will be logged, causing disk activity that is not normally there.

2. dirty_background_ratio
Contains, as a percentage of total system memory, the number of pages at which the pdflush background writeback daemon will start writing out dirty data.

3. dirty_expire_centisecs
This tunable is used to define when dirty data is old enough to be eligible for writeout by the pdflush daemons. It is expressed in hundredths of a second. Data which has been dirty in memory for longer than this interval will be written out the next time a pdflush daemon wakes up.

4. dirty_ratio
Contains, as a percentage of total system memory, the number of pages at which a process which is generating disk writes will itself start writing out dirty data.

5. dirty_writeback_centisecs
The pdflush writeback daemons will periodically wake up and write "old" data out to disk. This tunable expresses the interval between those wakeups, in hundredths of a second. Setting this to zero disables periodic writeback altogether.

6. drop_caches
Writing to this will cause the kernel to drop clean caches, dentries and inodes from memory, causing that memory to become free. To free pagecache:
  • echo 1 > /proc/sys/vm/drop_caches
To free dentries and inodes:
  • echo 2 > /proc/sys/vm/drop_caches
To free pagecache, dentries and inodes:
  • echo 3 > /proc/sys/vm/drop_caches
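Because only clean objects can be dropped, it is common to flush dirty data before writing to drop_caches. The following is a minimal sketch of that sequence, not a definitive tool. The function name drop_caches and its optional second argument (a file to write to instead of the real tunable, added only so the helper can be tried without root) are assumptions made for illustration.

```shell
#!/bin/sh
# Hypothetical helper: validate the mode, flush dirty data, then write the mode.
#   $1 = 1 (pagecache), 2 (dentries and inodes) or 3 (both)
#   $2 = file to write to; defaults to the real tunable, which needs root
drop_caches() {
    mode=$1
    target=${2:-/proc/sys/vm/drop_caches}
    case "$mode" in
        1|2|3) ;;
        *) echo "usage: drop_caches 1|2|3 [file]" >&2; return 1 ;;
    esac
    sync                       # write dirty pages out so they become droppable
    # The space before '>' matters: 'echo 3>file' would redirect file
    # descriptor 3 and never write the digit into the file.
    echo "$mode" > "$target"
}
```

Run as root, `drop_caches 3` would sync and then ask the kernel to free pagecache, dentries and inodes.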
As this is a non-destructive operation, and dirty objects are not freeable, the user should run "sync" first in order to make sure all cached objects are freed. This tunable was added in 2.6.16.

7. hugepages_treat_as_movable
When a non-zero value is written to this tunable, future allocations for the huge page pool will use ZONE_MOVABLE. Despite huge pages being non-movable, we do not introduce additional external fragmentation of note, as huge pages are always the largest contiguous block we care about.

Huge pages are not movable, so they are not allocated from ZONE_MOVABLE by default. However, as ZONE_MOVABLE will always have pages that can be migrated or reclaimed, it can be used to satisfy hugepage allocations even when the system has been running a long time. This allows an administrator to resize the hugepage pool at runtime depending on the size of ZONE_MOVABLE.

8. hugetlb_shm_group
Contains the group ID that is allowed to create SysV shared memory segments using hugetlb pages.

9. laptop_mode
laptop_mode is a knob that controls "laptop mode". When the knob is set, any physical disk I/O (that might have caused the hard disk to spin up, see /proc/sys/vm/block_dump) causes Linux to flush all dirty blocks. The result of this is that after a disk has spun down, it will not be spun up anymore to write dirty blocks, because those blocks had already been written immediately after the most recent read operation. The value of the laptop_mode knob determines the time between the occurrence of disk I/O and when the flush is triggered. A sensible value for the knob is 5 seconds. Setting the knob to 0 disables laptop mode.

10. legacy_va_layout
If non-zero, this sysctl disables the new 32-bit mmap layout; the kernel will use the legacy (2.4) layout for all processes.

11. lowmem_reserve_ratio
Ratio of total pages to free pages for each memory zone.

12. max_map_count
This file contains the maximum number of memory map areas a process may have. Memory map areas are used as a side-effect of calling malloc, directly by mmap and mprotect, and also when loading shared libraries. While most applications need less than a thousand maps, certain programs, particularly malloc debuggers, may consume lots of them, e.g., up to one or two maps per allocation. The default value is 65536.

13. min_free_kbytes
This is used to force the Linux VM to keep a minimum number of kilobytes free. The VM uses this number to compute a pages_min value for each lowmem zone in the system. Each lowmem zone gets a number of reserved free pages based proportionally on its size.

14. mmap_min_addr
This file indicates the amount of address space which a user process will be restricted from mmaping. Since kernel NULL dereference bugs could accidentally operate based on the information in the first couple of pages of memory, userspace processes should not be allowed to write to them. By default this value is set to 0 and no protections will be enforced by the security module. Setting this value to something like 64k will allow the vast majority of applications to work correctly and provide defense in depth against future potential kernel bugs.

15. nr_hugepages
nr_hugepages configures the number of hugetlb pages reserved for the system.

16. nr_pdflush_threads
The count of currently-running pdflush threads. This is a read-only value.

17. numa_zonelist_order
This sysctl is only for NUMA. "Where the memory is allocated from" is controlled by zonelists. In the non-NUMA case, a zonelist for GFP_KERNEL is ordered as follows: ZONE_NORMAL -> ZONE_DMA. This means that a memory allocation request for GFP_KERNEL will get memory from ZONE_DMA only when ZONE_NORMAL is not available. In the NUMA case, you can think of the following 2 types of order. Assume a 2-node NUMA system; below is the zonelist of Node(0)'s GFP_KERNEL:
(A) Node(0) ZONE_NORMAL -> Node(0) ZONE_DMA -> Node(1) ZONE_NORMAL
(B) Node(0) ZONE_NORMAL -> Node(1) ZONE_NORMAL -> Node(0) ZONE_DMA
Type (A) offers the best locality for processes on Node(0), but ZONE_DMA will be used before ZONE_NORMAL is exhausted. This increases the possibility of out-of-memory (OOM) in ZONE_DMA, because ZONE_DMA tends to be small. Type (B) cannot offer the best locality but is more robust against OOM of the DMA zone. Type (A) is called "Node" order. Type (B) is "Zone" order.

"Node order" orders the zonelists by node, then by zone within each node. Specify "[Nn]ode" for node order. "Zone order" orders the zonelists by zone type, then by node within each zone. Specify "[Zz]one" for zone order. Specify "[Dd]efault" to request automatic configuration. Autoconfiguration will select "Node" order in the following cases:
(1) if the DMA zone does not exist, or
(2) if the DMA zone comprises greater than 50% of the available memory, or
(3) if any node's DMA zone comprises greater than 60% of its local memory and the amount of local memory is big enough.
Otherwise, "Zone" order will be selected. Default order is recommended unless this is causing problems for your system/application.

18. overcommit_memory
Controls overcommit of system memory, possibly allowing processes to allocate (but not use) more memory than is actually available.
  • 0: Heuristic overcommit handling. Obvious overcommits of address space are refused. Used for a typical system. It ensures a seriously wild allocation fails while allowing overcommit to reduce swap usage. root is allowed to allocate slightly more memory in this mode. This is the default.
  • 1: Always overcommit. Appropriate for some scientific applications.
  • 2: Don't overcommit. The total address space commit for the system is not permitted to exceed swap plus a configurable percentage (default is 50) of physical RAM. Depending on the percentage you use, in most situations this means a process will not be killed while attempting to use already-allocated memory but will receive errors on memory allocation as appropriate.
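For mode 2, the limit described above is swap plus a percentage of physical RAM. The following is a minimal sketch of that arithmetic only. The function name commit_limit_kb and the choice of kilobytes as the unit are assumptions made for illustration; integer division truncates, and on a real system the inputs would come from the swap and memory totals and from overcommit_ratio.

```shell
# Hypothetical helper: address space commit limit under overcommit_memory = 2.
#   $1 = total swap in kB
#   $2 = physical RAM in kB
#   $3 = percentage of RAM to include (defaults to 50, the documented default)
commit_limit_kb() {
    swap_kb=$1
    ram_kb=$2
    ratio=${3:-50}
    echo $(( swap_kb + ram_kb * ratio / 100 ))
}

# Example: 2 GiB of swap, 8 GiB of RAM, default percentage
commit_limit_kb 2097152 8388608    # prints 6291456 (6 GiB expressed in kB)
```

Raising the percentage to 100 with the same inputs would allow commits up to swap plus all of RAM.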
19. overcommit_ratio
Percentage of physical memory size to include in overcommit calculations.
Memory allocation limit = swapspace + physmem * (overcommit_ratio / 100)
swapspace = total size of all swap areas
physmem = size of physical memory in system

20. page-cluster
page-cluster controls the number of pages which are written to swap in a single attempt, that is, the swap I/O size. It is a logarithmic value: setting it to zero means "1 page", setting it to 1 means "2 pages", setting it to 2 means "4 pages", etc. The default value is three (eight pages at a time). There may be some small benefits in tuning this to a different value if your workload is swap-intensive.

21. panic_on_oom
This enables or disables the panic on out-of-memory feature. If this is set to 1, the kernel panics when out-of-memory happens. If this is set to 0, the kernel will kill some rogue process by calling oom_kill(). Usually, oom_killer can kill rogue processes and the system will survive. If you want to panic the system rather than killing rogue processes, set this to 1. The default value is 0.

22. percpu_pagelist_fraction
This is the fraction of pages at most (high mark pcp->high) in each zone that are allocated for each per-CPU page list. The minimum value for this is 8. It means that we don't allow more than 1/8th of pages in each zone to be allocated in any single per_cpu_pagelist. This entry only changes the value of hot per-CPU pagelists. The user can specify a number like 100 to allocate 1/100th of each zone to each per-CPU page list. The batch value of each per-CPU pagelist is also updated as a result. It is set to pcp->high/4. The upper limit of batch is (PAGE_SHIFT * 8). The initial value is zero. The kernel does not use this value at boot time to set the high water marks for each per-CPU page list.

23. stat_interval
With this tunable you can configure the VM statistics update interval. The default value is 1. This tunable first appeared in the 2.6.22 kernel.

24. swap_token_timeout
This file contains the valid hold time of the swap out protection token. The Linux VM has a token-based thrashing control mechanism and uses the token to prevent unnecessary page faults in thrashing situations. The unit of the value is seconds. The value would be useful to tune thrashing behavior. This tunable was removed in 2.6.20 when the algorithm got improved.

25. swappiness
swappiness is a parameter which sets the kernel's balance between reclaiming pages from the page cache and swapping process memory. The default value is 60. If you want the kernel to swap out more process memory and thus cache more file contents, increase the value. Otherwise, if you would like the kernel to swap less, decrease it.

26. vdso_enabled
When this flag is set, the kernel maps a vDSO page into newly created processes and passes its address down to glibc upon exec(). This feature is enabled by default. The vDSO is a virtual DSO (dynamic shared object) exposed by the kernel at some address in every process' memory. Its purpose is to speed up system calls. The mapping address used to be fixed (0xffffe000), but starting with 2.6.18 it is randomized (besides the security implications, this also helps debuggers).

27. vfs_cache_pressure
Controls the tendency of the kernel to reclaim the memory which is used for caching of directory and inode objects. At the default value of vfs_cache_pressure = 100 the kernel will attempt to reclaim dentries and inodes at a "fair" rate with respect to pagecache and swapcache reclaim. Decreasing vfs_cache_pressure causes the kernel to prefer to retain dentry and inode caches. Increasing vfs_cache_pressure beyond 100 causes the kernel to prefer to reclaim dentries and inodes.
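
Each tunable described in this article is exposed as a plain file, so the current settings of a system can be inspected simply by reading them. The following is a minimal sketch, assuming a hypothetical function name show_vm_tunables and an optional directory argument (defaulting to /proc/sys/vm) so that it can be tried against any directory. The selection of names is arbitrary, and files that a particular kernel does not provide are skipped.

```shell
# Hypothetical helper: print "name = value" for a few of the vm tunables.
#   $1 = directory to read from; defaults to /proc/sys/vm
show_vm_tunables() {
    dir=${1:-/proc/sys/vm}
    for name in swappiness vfs_cache_pressure dirty_ratio \
                dirty_background_ratio overcommit_memory overcommit_ratio \
                page-cluster min_free_kbytes; do
        # Not every kernel version exposes every file, so check before reading.
        if [ -r "$dir/$name" ]; then
            printf '%s = %s\n' "$name" "$(cat "$dir/$name")"
        fi
    done
}

show_vm_tunables
```

Reading these files does not require root; only writing new values does.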
