Linux memory fundamentals and related tuning scenarios


Memory is one of the most important components of a computer and serves as the bridge to the CPU. All programs run in memory, so memory performance has an enormous effect on the whole machine. Memory temporarily holds the data the CPU is operating on, as well as data exchanged with external storage such as the hard disk. Whenever the computer is running, the CPU loads the data it needs into memory, performs its operations, and transmits the results back; how memory behaves therefore determines how stably the computer runs. For the operating system as a whole, memory can be the most troublesome device, and its performance directly affects the entire system.

We know that the CPU cannot deal with the hard disk directly; data can only be used by the CPU once it has been loaded into memory. When the CPU accesses memory, it issues read and write requests to a component that controls and allocates memory, called the MMU (Memory Management Unit). The following describes the memory access process on a 32-bit system:

On a 32-bit system, every process accessing memory sees what appears to be its own 4 GB of space, called virtual memory (virtual addresses); translating virtual memory into physical memory is done by the MMU. To convert a linear address into a physical address, a page table is required, and the page table is loaded into the MMU. If the mapping from linear to physical addresses were maintained byte by byte, an enormous table would be needed and the translation would be very complex. Therefore memory is divided into fixed-size storage units called pages, usually 4K. Page sizes differ across hardware platforms: 32-bit x86 uses 4K pages, while 64-bit platforms support 4K pages as well as larger pages such as 2M; the default is 4K.

Each process has its own page table, and the kernel decides which page table is loaded. Each process can only see its own linear address space. To obtain new memory, a process can only request it within its own linear address space; the kernel then finds free space in the physical address space, adds a mapping to the page table, and tells the process that the linear address range is ready to be accessed. Only once that mapping exists in the page table can the physical memory actually be reached. This is what memory allocation means: every new request must go through the kernel, which locates physical memory, reports back to the linear address space, and finally establishes the mapping in the page table.
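
As a quick illustration, the page size and a process's virtual address layout can be inspected from the shell (a minimal sketch; the output varies by system):

getconf PAGESIZE          # page size in bytes, typically 4096
head -5 /proc/self/maps   # first few virtual address ranges of the current process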

(Figure omitted.) The figure illustrated the process described above: each user program has its own page table, which maps to the corresponding area of main storage.


Two questions emerge from the description above:
1. If every memory access requires the process to walk the page table, server performance will suffer.
2. What happens when main memory is full and an application still needs to allocate memory?

The first problem is addressed by the TLB (Translation Lookaside Buffer). The TLB is a cache inside the memory management unit that speeds up virtual-to-physical address translation. Before walking the page table, the CPU first checks the TLB for the corresponding entry; if it is present, the translation is returned directly without consulting the page table, and if not, the page table is walked and the result is cached in the TLB. While the TLB provides caching, lookups in the page table itself are still slow, so page tables are organized hierarchically: a level-1 directory, a level-2 directory, and an offset.
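
Where the perf tool is available, TLB behavior can actually be measured (a sketch; the dTLB events depend on the CPU model and kernel):

perf stat -e dTLB-loads,dTLB-load-misses ls   # data-TLB loads and misses while running one command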

However, a running process frequently opens and closes files, which means memory is frequently allocated and freed. Processes that cache data in memory allocate and reclaim even more, and every allocation creates a corresponding page-table entry. So even though memory itself is very fast, a large volume of frequent allocations and releases still reduces overall server performance. When memory space runs out entirely, the condition is called OOM (out of memory), and the whole operating system may hang. For this case we can configure a swap partition; but swap is virtual memory on the hard disk, so its performance is far worse than real memory, and its use should be avoided as much as possible. While physical memory is available, try to ensure all of it is used. The CPU does not deal with swap in any case: it can only address physical memory. So when physical memory runs short, the least recently used pages are moved into swap by an LRU algorithm, freeing physical space for new programs. But this raises another problem: when the original process later looks up the page table, the data for that address range is no longer where it was. The CPU then raises a notification or exception telling the program the address is not currently mapped, and there are two possible situations:

1. Physical memory has free space available: the CPU, following the earlier translation strategy, moves the page from the swap partition back into physical memory, but the physical address it lands at is not necessarily the one it occupied before, because the old address may already be in use by someone else.

2. Physical memory has no free space available: in this case the LRU algorithm is again used to move the least recently used pages of the current physical address space out to swap, the page the current process needs is brought from swap back into physical memory, and the mapping is re-established.

The notification or exception above is commonly called a page fault. Page faults come in two kinds: major and minor. A major fault occurs when the data being accessed is not in memory and must be loaded from the hard disk, whether from swap or directly from a file on the filesystem; either way a disk load is required, so this kind of fault takes a long time. A minor fault occurs with shared memory: a second process accesses a page that is not in its own memory mapping table, but another process already has that page in memory, so it can simply be mapped in; this kind of fault is usually resolved very quickly.
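
Fault counts can be checked per process or system-wide (a sketch using the standard procps and sysstat tools):

ps -o min_flt,maj_flt,comm -p $$   # minor and major fault counts for the current shell
sar -B 1 3                         # system-wide fault/s and majflt/s, three one-second samples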

When the operating system boots, each I/O device is assigned a series of addresses by the CPU, called I/O ports. In the IBM PC architecture, the I/O address space provides a total of 65,536 8-bit I/O ports. It is through these I/O ports that the CPU can exchange reads and writes with I/O devices. When performing a read or write, the CPU uses the address bus to select the requested I/O port and moves data between a CPU register and the port over the data bus. I/O ports can also be mapped into the physical address space, so that communication between the processor and an I/O device can use the same assembly instructions that operate on memory (for example, MOV, AND, OR, and so on). Modern hardware devices favor memory-mapped I/O because it is faster and can be combined with DMA. With DMA, data moving between an I/O device and memory does not need to pass through the CPU for every transfer: the CPU delegates control of the bus to the DMA controller once and is thereby freed up, and when the transfer finishes, the DMA controller notifies the CPU with an interrupt. The DMA controller holds control of the bus while it runs, and if the CPU finds that other processes need the bus, the two contend for it. The CPU and the DMA controller have equal rights to bus control, but once the CPU has delegated to DMA it cannot arbitrarily take the delegation back; it must wait for the DMA transfer to complete.

If no other process can run, or the other runnable work is very short, and the CPU finds that our I/O is still not complete, the CPU can only wait for the I/O. The CPU's time accounting has an iowait value for exactly this: the time the CPU spends waiting for I/O. In a synchronous call the CPU must wait for the I/O to complete; otherwise the CPU can start the I/O transfer, let it complete on its own, and go handle other things. When the hard disk finishes transferring data, it only needs to send the CPU a notification. Around the CPU sits a device called the programmable interrupt controller. So that every hardware device can talk to the CPU, at boot time, when the BIOS performs its detection, each device registers a so-called interrupt number with the programmable interrupt controller; that number is then used for that hardware. A host may have many devices, each with its own number, so the CPU, upon receiving an interrupt number, looks it up in the interrupt vector table to find the hardware device, and handles it through the corresponding I/O port.
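
The registered interrupt numbers and I/O port ranges can be listed from /proc (a sketch; port addresses may be hidden from non-root users):

cat /proc/interrupts   # interrupt numbers, per-CPU counts, and the owning devices
cat /proc/ioports      # I/O port ranges assigned to each device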

While the CPU is running another process and an interrupt request arrives, the CPU immediately suspends the process it is handling and services the interrupt; this is called an interrupt switch. This switch is cheaper than a full process switch, and an interrupt's priority is usually higher than any process's, because what we are referring to is a hardware interrupt. Interrupt handling is divided into a top half and a bottom half: in general, the top half is what the CPU handles immediately, taking the data in and placing it in memory. If the remaining work is not particularly urgent (the CPU or the kernel judges this itself), the CPU returns to the suspended process and continues executing it, and only later comes back to run the bottom half of the interrupt.

On a 32-bit system our memory (linear address) space is, in general, split so that the low 1 GB is used by the kernel and the upper 3 GB by processes. But within kernel memory there are further subdivisions, and 32-bit and 64-bit systems differ here (in physical addresses). On a 32-bit system, the lowest 16M of space is set aside for DMA use. The bus width of older DMA controllers is very small, perhaps only a few dozen address lines, so their addressable range is limited and the memory they can reach is limited. If DMA is to copy data, it must be able to address the physical memory involved and deposit data into it directly, so we must guarantee that the memory lies within DMA's addressable range; the precondition is to give DMA segments within that low 16M address range. From this perspective, our memory management is divided into zones.

On a 32-bit system, the first 16M of memory goes to ZONE_DMA (the physical address space used by DMA); from 16M to 896M is ZONE_NORMAL (the normal physical address space), which is the address space the Linux kernel can access directly. The space from 896M to 1G is called reserved (the reserved physical address space). The physical address space from 1G to 4G cannot be accessed by the kernel directly: to access it, a piece of it must first be mapped into the reserved area, and only through the addresses kept there can the kernel reach it. So the kernel never directly accesses physical address space above 1G, and on a 32-bit system, accessing that memory takes an extra step in the middle.

On a 64-bit system, ZONE_DMA gets the low 16M of address space, ZONE_DMA32 covers memory up to 4G (so DMA addressing ability is greatly enhanced for 32-bit-capable devices), and memory above that is divided into ZONE_NORMAL, all of which the kernel can access directly. So on 64-bit, when the kernel accesses high memory addresses there is no need for extra steps, and efficiency and performance are greatly improved; this is a major reason to use a 64-bit system. (The accompanying figure is omitted.)
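
The zones on a given machine can be listed from /proc/zoneinfo (a sketch; the zone names differ between 32-bit and 64-bit kernels):

grep '^Node' /proc/zoneinfo   # e.g. "Node 0, zone DMA", "Node 0, zone DMA32", "Node 0, zone Normal"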


In today's PC architectures, both AMD and Intel support a mechanism called PAE (Physical Address Extension). PAE extends the address bus of a 32-bit system by 4 bits, so that the physical address space can reach 64G. Of course, on a 32-bit system, no matter how large physical memory is, the space usable by a single process cannot be extended: the linear address space is still only 4G, and a single process can only use 3G of it.
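
Whether a CPU supports PAE can be checked from its flags (a sketch):

grep -wo pae /proc/cpuinfo | head -1   # prints "pae" if Physical Address Extension is supported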

The Linux virtual memory subsystem includes the following functional modules:

slab allocator, zoned buddy allocator, MMU, kswapd, pdflush

The buddy allocator, also known as the buddy system, is a page-level memory allocator; the slab allocator handles small-object allocation. The buddy system works on top of the MMU, and the slab allocator works on top of the buddy system.

Frequent allocation and freeing of memory inevitably causes fragmentation. To avoid fragmentation, the buddy system first divides memory into units of several different sizes. On allocation it tries to find the most suitably sized piece of memory to hand out, and when a process's memory is freed, it can merge multiple adjacent small free areas back into a larger contiguous area. Some processes require memory that must be physically contiguous and cannot be paged, and if no large contiguous region exists, that data cannot be stored. So allocation seeks the best-fitting gap, and address spaces freed by multiple processes are merged back into contiguous regions: this is how the buddy system avoids external fragmentation, which means memory containing many pages that are not contiguous.

The buddy allocator only handles pages or contiguous runs of pages, and space at that level is relatively large. Sometimes we need much smaller pieces of space, for example the inode when opening a file. An inode could be placed in a page of its own, but if it were the only thing in the page, that would be a huge waste of space; a page will certainly hold more than one such object. So how can these be stored quickly? Each inode is a particular data structure with various fields, so when storing it we store not only the data but also its structure, which is much smaller than a page. Enabling this fast small-object storage is the reason the slab allocator exists. It requests several pages for itself and divides them into slots shaped for storing a particular kind of object, with the structure laid out in advance; when an inode needs to be stored, its information is simply filled into a pre-allocated slot. When a file is closed, its inode is cleared out, and the slab allocator can hand that slot back out when another process opens a file. This avoids in-memory fragmentation and completes the allocation of small slices of memory.
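
The kernel's slab caches, including the inode and dentry caches, can be observed directly (a sketch; reading /proc/slabinfo normally requires root, and slabtop comes with procps):

sudo head -20 /proc/slabinfo   # per-cache object counts and object sizes
slabtop -o                     # one-shot view of the largest slab caches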

When the buddy system cannot find enough physical memory to satisfy an allocation, swapping may occur. kswapd is what implements swap-out and swap-in: it puts data onto swap and loads it back into memory from swap. It should also be understood that if a process modifies data, the data must eventually be written from memory back to disk: the process accesses data in memory, but permanent storage only happens once the data reaches the disk. That writeback is done by pdflush. Data written into memory is not written to disk immediately, because doing so synchronously would perform terribly; the process is asynchronous rather than synchronous. The kernel periodically synchronizes the saved data to disk. pdflush is a kernel thread, usually one per disk, that monitors which data in memory has been modified but not yet synchronized to disk (commonly called dirty pages) and writes it out. It does not only act on a timer: if the proportion of dirty pages in physical memory reaches a set percentage, it will also actively synchronize data.

Modified data in physical memory cannot simply be swapped out as-is; it must be written to disk, because swapping it could lead to failures. Therefore, data that goes into swap memory must be unmodified data; modified data that needs to be evicted can only be written to the hard disk.
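
How much modified data is currently waiting to be written back is visible in /proc/meminfo (a sketch):

grep -E '^(Dirty|Writeback):' /proc/meminfo   # dirty pages pending writeback, and pages being written now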


With these basic memory concepts in hand, here are some common memory tuning options:
I. Tuning parameters related to hugepages
cat /proc/zoneinfo shows how memory is divided into zones on the current operating system.

Hugepage: large page
64-bit CentOS supports not only large pages but also transparent huge pages (THP, transparent huge page).
Transparent huge pages, simply put, apply to anonymous memory segments: the operating system automatically manages anonymous memory with large pages behind the scenes, without any user involvement. Which memory is anonymous? RSS minus shared memory is anonymous memory. On 64-bit CentOS, transparent huge pages come in two sizes, 2M and 1G. 1G is often useful when memory reaches the terabyte level; with tens or hundreds of gigabytes of memory, 2M is usually a good choice. The transparent huge page mechanism usually only starts up when physical memory is larger than 4G. Transparent huge pages are used quietly by the system behind the scenes; they are called transparent precisely because the user does not need to participate.
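
THP status can be checked and toggled through sysfs (a sketch; the exact path varies by kernel version, and some RHEL kernels use redhat_transparent_hugepage instead):

cat /sys/kernel/mm/transparent_hugepage/enabled            # e.g. [always] madvise never
echo never > /sys/kernel/mm/transparent_hugepage/enabled   # disable THP (as root), if desired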

The nr_anon_transparent_hugepages counter under /proc/zoneinfo shows transparent huge page usage; one such page is normally 2M.
Under /proc/meminfo:
AnonHugePages: the total size of transparent huge pages in use
Hugepagesize: the size of a large page
HugePages_Total: the total number of large pages
HugePages_Free: the number of large pages remaining
HugePages_Rsvd: the number of large pages reserved
HugePages_Total counts pages the user specifies explicitly; it does not count transparent huge pages.

Use vm.nr_hugepages = N to manually set the number of large pages. In general, we can use them as a shared memory space and mount them, treating them as an in-memory partition mounted onto a filesystem directory. Usage:
mkdir /hugepages
mount -t hugetlbfs none /hugepages
You can then use hugepages like a RAM disk. Usually, though, nothing needs to be specified by hand: an application such as MySQL server has a variable which, when set to use large pages, makes it use them automatically.
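
A minimal end-to-end sketch (the count 512 is only an example; the pages must be reservable as contiguous physical memory):

sysctl vm.nr_hugepages=512   # reserve 512 large pages (1G total at the 2M page size)
mkdir -p /hugepages
mount -t hugetlbfs none /hugepages
grep -i huge /proc/meminfo   # verify HugePages_Total and HugePages_Free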

II. Tuning parameters related to buffer and cache
/proc/sys/vm/drop_caches can be used to manually force the release of buffers and caches. It accepts three values:
1: releases the pagecache (page cache)
2: releases the dentry and inode caches
3: releases the pagecache plus the dentry and inode caches

The caches counted under buffer and cache fall into two main categories:
1. pagecache: caches page data, usually file data, the contents of open files
2. buffers: caches file metadata (inodes and dentries), and is sometimes used to cache write requests

echo 1 > /proc/sys/vm/drop_caches    releases the first category above
echo 2 > /proc/sys/vm/drop_caches    releases the second category above
echo 3 > /proc/sys/vm/drop_caches    releases both categories
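
Because dirty pages are not dropped by this interface, it is common to flush them first (a sketch; run as root, and treat this as a diagnostic or benchmarking step rather than routine tuning):

sync                                # write dirty pages back to disk first
echo 3 > /proc/sys/vm/drop_caches   # then drop pagecache, dentries, and inodes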

III. Tuning parameters related to swap memory
/proc/sys/vm/swappiness indicates how strongly the kernel tends to use swap memory.
The larger the value, the more inclined the kernel is to use swap; the smaller the value, the less inclined (though this does not mean swap can never be used). The default value is 60 and the range is 0-100. On servers it is recommended to make this value small, even 0. Roughly speaking, when the percentage of memory currently mapped into page tables (that is, how much of physical memory is in use) plus vm.swappiness reaches or exceeds 100, the kernel starts using swap.
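
The change can be applied immediately and persisted across reboots (a sketch):

sysctl vm.swappiness=10                         # apply now
echo 'vm.swappiness = 10' >> /etc/sysctl.conf   # persist (or a file under /etc/sysctl.d/ on newer systems)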

Recommended swap sizing:
1. On servers performing batch computation (scientific computing), it can be set relatively large
2. On database servers, set it to 1G or less; a database server should avoid using swap as much as possible
3. On application servers, it can be set to RAM * 0.5, although this is a theoretical value

If swap has to be used at all, the swap partition should be placed on the outermost tracks of the disk, because the outermost tracks have the fastest access. If you have multiple hard drives, you can take a small slice of the outermost tracks of each drive as a swap partition. Swap partitions can be given priorities; setting the swap priority of these partitions to the same level produces a load-balancing effect. Swap priority is defined by editing /etc/fstab:
/dev/sda1 swap swap pri=5 0 0
/dev/sdb1 swap swap pri=5 0 0
/dev/sdc1 swap swap pri=5 0 0
/dev/sdd1 swap swap pri=5 0 0
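
After enabling them, the active swap areas and their priorities can be verified (a sketch):

swapon -a   # activate all swap entries listed in /etc/fstab
swapon -s   # list active swap devices with size, usage, and priority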

IV. Relevant tuning parameters when memory is exhausted
When Linux runs out of memory, it kills the process that consumes the most memory. The following three conditions characterize the situation:
1. All processes are active, and there are no idle processes to swap out
2. No pages are available in ZONE_NORMAL
3. A new process is starting and requests memory space, but no free memory can be found to map
Once memory is exhausted, the operating system invokes the OOM-kill mechanism.
In each /proc/PID/ directory there is a file called oom_score, which holds the process's OOM rating, its "badness" score.

To trigger the OOM-kill mechanism manually, run echo f > /proc/sysrq-trigger; the kernel will then kill the process with the highest badness score.
You can adjust a process's badness score with echo n > /proc/PID/oom_adj; the final score is scaled by 2 to the power of the oom_adj value n. If a process has an oom_adj of 5, its badness score is multiplied by 2 to the 5th.

If you want to disable the OOM-kill behavior entirely, set vm.panic_on_oom=1, which makes the kernel panic on OOM instead of killing processes.
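
Conversely, an important process can be shielded from the OOM killer by lowering its score (a sketch; oom_adj is the legacy interface where -17 means never kill, newer kernels use oom_score_adj with a range of -1000 to 1000, and PID 1234 is hypothetical):

echo -17 > /proc/1234/oom_adj   # exempt PID 1234 from OOM killing (legacy interface)
cat /proc/1234/oom_score        # inspect the resulting badness score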

V. Memory tuning parameters related to capacity:
overcommit_memory accepts three values and controls whether memory can be overcommitted:
0: the default setting; the kernel performs heuristic overcommit handling
1: the kernel performs no overcommit checking; this value increases the likelihood of memory overload
2: the amount of allocatable memory equals swap + RAM * overcommit_ratio. If you want to reduce memory overcommit, this value is the safest
overcommit_ratio:
Specifies the percentage of physical RAM counted when overcommit_memory is set to 2; the default is 50
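
For example, with 8G of RAM and 2G of swap, the following caps total allocatable memory at 2G + 8G * 80% = 8.4G (a sketch):

sysctl vm.overcommit_memory=2
sysctl vm.overcommit_ratio=80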

VI. Tuning parameters related to communication
Common ways for processes on the same host to communicate:
1. via message queues; 2. via semaphores; 3. via shared memory. Communication across hosts is commonly done via RPC.

Tuning parameters for processes communicating via message queues:
msgmax: specifies the maximum allowed size of any single message in a message queue, in bytes. This value must not exceed the queue size (msgmnb); the default is 65536
msgmnb: specifies the maximum size (total length) of a single message queue, in bytes; the default is 65536 bytes
msgmni: specifies the maximum number of message queue identifiers (the maximum number of queues). The default is 1985 on 64-bit machines and 1736 on 32-bit machines

Tuning parameters for processes communicating via shared memory:
shmall: specifies the total amount of shared memory that can be used system-wide at one time; note that it is measured in pages, not bytes
shmmax: specifies the maximum size of a single shared memory segment, in bytes
shmmni: specifies the maximum number of shared memory segments system-wide; the default is 4096 on both 64-bit and 32-bit systems
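
Current IPC limits can be read in one shot and adjusted with sysctl (a sketch; the shmmax value shown is only an example):

ipcs -l                            # message queue, shared memory, and semaphore limits
sysctl kernel.shmmax=68719476736   # e.g. raise the maximum segment size to 64G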

VII. Capacity-related filesystem tunable parameters:
file-max: the maximum number of file handles the kernel will allocate
dirty_ratio: a percentage; when dirty data reaches this percentage of total system memory, pdflush writeback is forced; the default is 20
dirty_background_ratio: a percentage; when dirty pages reach this percentage of total system memory, pdflush starts flushing in the background; the default is 10
dirty_expire_centisecs: how old, in hundredths of a second, a dirty page must be before pdflush writes it out; the default is 3000, i.e. dirty pages are flushed after 30 seconds
dirty_writeback_centisecs: how often, in hundredths of a second, the pdflush threads wake up to write dirty pages; the default is 500, i.e. every 5 seconds
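
A write-heavy workload is often tuned to start background flushing earlier (a sketch; suitable values depend on workload and RAM size):

sysctl vm.dirty_background_ratio=5   # begin background writeback at 5% dirty
sysctl vm.dirty_ratio=15             # force synchronous writeback at 15% dirty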

VIII. Common indicator commands for Linux memory:
Memory activity
vmstat [interval] [count]
sar -r [interval] [count]
Rate of change in memory
sar -R [interval] [count]
frmpg/s: memory pages freed or allocated per second; a positive value means pages freed, a negative value means pages allocated
bufpg/s: memory pages gained or lost by buffers per second; a positive value means pages gained, a negative value means pages lost
campg/s: memory pages gained or lost by the cache per second; a positive value means pages gained, a negative value means pages lost
Swap activity
sar -W [interval] [count]
Paging activity
sar -B [interval] [count]
pgpgin/s: amount of data paged in from disk to the kernel per second
pgpgout/s: amount of data paged out from the kernel to disk per second
fault/s: page faults per second
majflt/s: major page faults per second
pgfree/s: pages placed back on the free list (recycled) per second
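
For example, to watch paging behavior for a minute at five-second intervals (a sketch; sar requires the sysstat package):

vmstat 5 12   # memory, swap, and I/O summary, 12 samples at 5-second intervals
sar -B 5 12   # paging detail, including fault/s and majflt/s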

