When the CentOS server runs out of memory

Source: Internet
Author: User
Tags: centos, server, dmesg

You may rarely face this situation, but when you do, you need to know what went wrong: the system ran out of memory (OOM). The result is typical: no more memory can be allocated, and the kernel kills a task (usually the one currently running). Heavy swapping usually precedes this, so you can see it in screen lag and disk activity.

This raises an implicit question: how much memory can you allocate? How much does the operating system give you? The basic cause of OOM is simple: you requested more memory than the system can provide. I deliberately say virtual memory, because swap space counts too.

Understanding OOM

To understand OOM, first try this program, which allocates large amounts of memory but never touches it:

#include <stdio.h>
#include <stdlib.h>

#define MEGABYTE (1024*1024)

int main(int argc, char *argv[])
{
    void *myblock = NULL;
    int count = 0;

    while (1)
    {
        myblock = malloc(MEGABYTE);
        if (!myblock)
            break;
        printf("Currently allocating %d MB\n", ++count);
    }

    exit(0);
}

Compile it and run it for a while; sooner or later the system hits OOM. Then try this variant, which allocates the same way but also writes a 1 into every byte:

#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define MEGABYTE (1024*1024)

int main(int argc, char *argv[])
{
    void *myblock = NULL;
    int count = 0;

    while (1)
    {
        myblock = malloc(MEGABYTE);
        if (!myblock)
            break;
        memset(myblock, 1, MEGABYTE);
        printf("Currently allocating %d MB\n", ++count);
    }

    exit(0);
}

Is there a difference? A (the first program) allocates more memory than B, and B is killed earlier. Both programs end because no memory is available, but more precisely: A exits gracefully when malloc() fails, while B is killed by the OOM killer.

First, observe the number of allocated blocks. With 256 MB of RAM (as the /proc/meminfo output later shows) and an 888 MB swap partition (in my case), B ends at:

Currently allocating 1081 MB

while A ends at:

Currently allocating 3056 MB

How did A get another 1975 MB? Am I lying? No. Look closely: B fills every block with 1s, while A barely touches its memory at all. Linux allows deferred page allocation; in other words, physical pages are allocated only when you actually use the memory, for example when writing data to it. So unless you write data, you can keep asking for more memory. The technical term for this is optimistic memory allocation.

Check /proc/<pid>/status to confirm this:

$ cat /proc/<pid of program A>/status
VmPeak: 3141876 kB
VmSize: 3141876 kB
VmLck: 0 kB
VmHWM: 12556 kB
VmRSS: 12556 kB
VmData: 3140564 kB
VmStk: 88 kB
VmExe: 4 kB
VmLib: 1204 kB
VmPTE: 3072 kB

This is the record shortly before B is killed:

$ cat /proc/<pid of program B>/status
VmPeak: 1072512 kB
VmSize: 1072512 kB
VmLck: 0 kB
VmHWM: 234636 kB
VmRSS: 204692 kB
VmData: 1071200 kB
VmStk: 88 kB
VmExe: 4 kB
VmLib: 1204 kB
VmPTE: 1064 kB

VmRSS deserves a closer look. RSS stands for Resident Set Size: the portion of the process's allocated memory that actually sits in RAM. Also note that before B hit OOM, nearly all of swap was in use, while A used almost none. Clearly, for A, malloc() did little more than reserve address space.

Another question: since no page is written, why does A hit a ceiling at 3056 MB? This exposes another limit. On a 32-bit system, addresses span 4 GB. Typically 0-3 GB is user space and 3-4 GB is kernel space.

NOTE: kernel patches exist that give the full 4 GB to user space, at the cost of extra context-switching overhead.

In conclusion, OOM occurs when:

  1. There are no available pages left in the VM.

  2. There is not enough user address space.

  3. Both of the above.

So the strategy to avoid these situations is:

  1. Know how much user address space is left.

  2. Know how many pages are available.

When you request a memory block with malloc(), the runtime C library first checks whether a suitable pre-allocated free block exists, at least as large as the request. If so, malloc() hands that block to you and marks it used. Otherwise, malloc() must extend the heap to get more memory. All allocated blocks live on the heap; do not confuse it with the stack, which stores local variables and function return addresses.

Where is the heap? Look at the process's address mappings:

$ cat /proc/self/maps
0039d000-003b2000 r-xp 00000000 1080084 /lib/ld-2.3.3.so
003b2000-003b3000 r-xp 00014000 1080084 /lib/ld-2.3.3.so
003b3000-003b4000 rwxp 00015000 1080084 /lib/ld-2.3.3.so
003b6000-004cb000 r-xp 00000000 1080085 /lib/tls/libc-2.3.3.so
004cb000-004cd000 r-xp 00115000 1080085 /lib/tls/libc-2.3.3.so
004cd000-004cf000 rwxp 00117000 1080085 /lib/tls/libc-2.3.3.so
004cf000-004d1000 rwxp 004cf000 00:00 0
08048000-0804c000 r-xp 00000000 130592 /bin/cat
0804c000-0804d000 rwxp 00003000 130592 /bin/cat
0804d000-0806e000 rwxp 0804d000 00:00 0 [heap]
b7d95000-b7f95000 r-xp 00000000 2239455 /usr/lib/locale-archive
b7f95000-b7f96000 rwxp b7f95000 00:00 0
b7fa9000-b7faa000 r-xp b7fa9000 00:00 0 [vdso]
bfe96000-bfeab000 rw-p bfe96000 00:00 0 [stack]

This is the actual mapping for cat. Your results may differ depending on the kernel and the C library in use. Recent kernels (2.6.x) label some regions, such as [heap] and [stack], but don't rely on those labels entirely.

The heap occupies the free address space not claimed by program mappings or the stack, so every mapping reduces the space available to it: roughly 3 GB minus all mapped regions.

What does A's map look like when it cannot allocate any more blocks? With a trivial change that pauses the program (see loop.c and loop-calloc.c) just before it exits, the final map is:

0009a000-0039d000 rwxp 0009a000 00:00 0 ---------> (allocated block)
0039d000-003b2000 r-xp 00000000 1080084 /lib/ld-2.3.3.so
003b2000-003b3000 r-xp 00014000 1080084 /lib/ld-2.3.3.so
003b3000-003b4000 rwxp 00015000 1080084 /lib/ld-2.3.3.so
003b6000-004cb000 r-xp 00000000 1080085 /lib/tls/libc-2.3.3.so
004cb000-004cd000 r-xp 00115000 1080085 /lib/tls/libc-2.3.3.so
004cd000-004cf000 rwxp 00117000 1080085 /lib/tls/libc-2.3.3.so
004cf000-004d1000 rwxp 004cf000 00:00 0
005ce000-08048000 rwxp 005ce000 00:00 0 ---> (allocated block)
08048000-08049000 r-xp 00000000 1267 /test-program/loop
08049000-0804a000 rwxp 00000000 1267 /test-program/loop
0806d000-b7f62000 rwxp 0806d000 00:00 0 ---> (allocated block)
b7f73000-b7f75000 rwxp b7f73000 00:00 0 ---> (allocated block)
b7f75000-b7f76000 r-xp b7f75000 00:00 0 [vdso]
b7f76000-bf7ee000 rwxp b7f76000 00:00 0 ---> (allocated block)
bf80d000-bf822000 rw-p bf80d000 00:00 0 [stack]
bf822000-bff29000 rwxp bf822000 00:00 0 ---> (allocated block)

The six virtual memory areas (VMAs) marked above reflect the memory requests. A VMA is a group of memory pages with the same access permissions, and it can exist anywhere in user space.

Now you may wonder: why six regions instead of one big one? There are two reasons. First, it is often hard to find a single contiguous "hole" that large in the address space. Second, the program does not request all the memory at once, so the glibc allocator is free to place each chunk wherever pages are available.

Why do I say available pages? Memory allocation is done in page-sized units. This is not an OS limitation but a feature of the memory management unit (MMU). Page sizes vary, but on the x86 platform the page is typically 4 KB; you can query it with getpagesize() or sysconf(_SC_PAGESIZE). The libc allocator manages the pages: it divides them into smaller blocks, hands them to processes, frees them, and so on. For example, if a program uses 4097 bytes, it occupies two pages, even though the allocator's internal bookkeeping means you actually receive slightly more than 4097 usable bytes.

With 256 MB of memory and no swap partition, you have 65536 4 KB pages available. Right? Not quite. Some memory regions are occupied by kernel code and data, and some must be reserved for emergencies or high-priority allocations. dmesg reports these numbers:

$ dmesg | grep -n kernel
36: Memory: 255716k/262080k available (2083k kernel code, 5772k reserved,
637k data, 172k init, 0k highmem)
171: Freeing unused kernel memory: 172k freed

The 172 KB of init code and data is used only during initialization, and the kernel frees it afterwards. The rest permanently occupies 2083 + 5772 + 637 = 8492 KB; in other words, 2123 pages are gone. Enabling more kernel features and modules consumes even more.

Another kernel data structure is the page cache, which stores the contents of recently read block devices. The more it caches, the less free memory; however, when the system runs low on memory, the kernel reclaims memory occupied by the cache.

From the perspective of kernel and hardware, the following are very important:

  1. Allocated memory is not guaranteed to be physically contiguous; it is only virtually contiguous.
    This illusion comes from the address translation mechanism. In protected mode, user code works with virtual addresses, while the hardware uses physical addresses; the page directory and page tables translate between them. For example, two blocks at virtual addresses 0 and 4096 may actually map to the physical addresses 1024 and 8192.

  2. This makes allocation easier for the kernel. Finding contiguous physical blocks is hard, so the kernel grabs whatever free pages it can find and adjusts the page tables so that they appear virtually contiguous.
    This has a price. Because the memory is not physically contiguous, virtually contiguous memory may be scattered across different physical cache lines, so the CPU's L1 and L2 caches are used less effectively and sequential access can slow down.
    Memory allocation takes two steps: step 1 extends the memory area (the VMA); step 2 allocates pages as they are needed. This is demand paging. During the VMA extension, the kernel only checks whether the request overlaps an existing VMA and whether it stays inside the user address range; by default, it skips checking whether the pages can actually be backed by real memory.
    Therefore, don't be surprised if your application can request and obtain 1 GB while the machine has only 16 MB of RAM and 64 MB of swap. Everyone is happy with this optimistic approach, until the pages actually get used. The kernel has parameters to tune this overcommit behavior.

  3. There are two kinds of pages: anonymous pages and file pages. A file page is created when you mmap() a file on disk; anonymous pages come from malloc() and are not associated with any file. When memory runs short, the kernel swaps anonymous pages out and simply drops file pages, since those can be re-read from disk. In other words, anonymous pages are what consume swap space. The exception is a file mmap()ed with the MAP_PRIVATE flag: modified pages then exist only in memory.

    This is how swap effectively extends memory. Of course, accessing a swapped-out page requires bringing it back into RAM first.

Inside the allocator

The actual work is done by the glibc memory allocator: it hands blocks to the program, carving them out of the heap it obtains from the kernel.

The allocator is the manager; the kernel is the worker. From this it follows that the greatest efficiency gains come from a good allocator, not from the kernel.

glibc uses an allocator named ptmalloc, created by Wolfram Gloger as a modified version of Doug Lea's original malloc library. The allocator manages allocated blocks in terms of "chunks". A chunk represents the memory block you actually requested, but not its exact size: an extra header is stored inside the chunk alongside the user data.

The allocator uses two functions to get a chunk of memory from the kernel:

  • brk(), which sets the end of the process's data segment.

  • mmap(), which creates a new VMA and hands it to the allocator.

Of course, malloc() calls these only when no suitable chunk remains in the current pool.

The decision to use brk() or mmap() requires one simple check: if the request is equal to or larger than M_MMAP_THRESHOLD, the allocator uses mmap(); if it is smaller, the allocator calls brk(). By default M_MMAP_THRESHOLD is 128 KB, and you can change it with mallopt().

In the OOM context, how ptmalloc releases memory is interesting. Blocks allocated via mmap() are fully returned to the system once munmap()ed. Blocks allocated via brk() are merely marked free but remain under the allocator's control, to be reused if a later malloc() request fits within them. The allocator can merge multiple consecutive free chunks, or split one to satisfy a request.

This means a free chunk may sit unused because it cannot satisfy any request. Failure to merge free chunks can thus accelerate OOM, and it is a sign of bad memory fragmentation.

Recovery

What happens once OOM strikes? The kernel terminates a process. Why? Because this is the only way to stop further memory requests: the kernel cannot assume that a process will terminate on its own, so killing it is the only option.

How does the kernel decide whom to kill? The answer is in the mm/oom_kill.c source code. The so-called OOM killer uses the badness() function to score every existing process; the highest score becomes the victim. The scoring criteria are:

  1. VM size. This is not the number of allocated pages, but the total size of all VMAs owned by the process. The bigger it is, the higher the score.

  2. The VM sizes of the process's children also matter; the count is cumulative.

  3. Niced processes (running at a lower priority) get higher scores.

  4. Superuser processes are assumed to be more important and therefore score lower.

  5. Process runtime: the longer a process has been running, the lower its score.

  6. Processes performing direct hardware access are immune.

  7. Swapper, init, and other kernel threads are immune.

The process with the highest score wins the election and is killed.

This mechanism is not perfect, but it is generally effective. Criteria 1 and 2 make clear that it is the VMA size that counts, not the actual number of pages. You might think that VMA size could trigger false alarms, but it doesn't: the badness() call happens inside the page-allocation path, when only a few pages remain free and reclaim has failed, so at that point the VMA size is usually close to the number of pages the process really holds.

Why not count the actual pages? That would take more time and more locks, increasing the cost of a decision that must be made quickly. So the OOM killer is imperfect and can pick the wrong victim.

The kernel then notifies the target process with a signal that it must shut down (SIGTERM here; depending on the kernel version, an outright SIGKILL may be used).

How to Reduce OOM risks

A simple rule: do not allocate more memory than is actually free. However, many factors influence the outcome, so the strategy should be more refined:

Reduce fragmentation through ordered allocation

You don't need an advanced allocator; you can reduce fragmentation by allocating and freeing in an orderly way. Use a LIFO policy: free first what you allocated last.

For example, the following code:

void *a;
void *b;
void *c;
............
a = malloc(1024);
b = malloc(5678);
c = malloc(4096);

......................

free(b);
b = malloc(12345);

can be changed to:

a = malloc(1024);
c = malloc(4096);
b = malloc(5678);
......................

free(b);
b = malloc(12345);

This way, no hole is left between chunks a and c. You can also consider using realloc() to resize blocks you have already malloc()ed.

Two demo programs, fragmented1 and fragmented2, illustrate the effect. At the end of each run, they report the number of bytes allocated by the system (kernel plus glibc allocator) and the amount actually used. For example, on kernel 2.6.11.1 with glibc 2.3.3-27, fragmented1 wastes 319858832 bytes (about 305 MB) while fragmented2 wastes 2089200 bytes (about 2 MB): 152 times less!

You can experiment further by passing various values as the parameter, which is the request size handed to malloc().

Adjust overcommit behavior of the kernel

You can change the Linux kernel's behavior through the /proc filesystem, as documented in Documentation/vm/overcommit-accounting in the kernel source. You have three choices when tuning kernel overcommit, expressed as numbers in /proc/sys/vm/overcommit_memory:

  • 0 means use the default heuristic to decide whether to overcommit.

  • 1 means always overcommit. By now you should know how dangerous that is.

  • 2 means avoid overcommit. The ceiling is adjustable through /proc/sys/vm/overcommit_ratio; the maximum commit is swap + overcommit_ratio% of RAM.

Generally the default is enough, but mode 2 provides better protection. In return, mode 2 requires you to estimate your programs' requirements carefully: you don't want programs refused execution for lack of committable memory, although that still beats having them killed.

Check for NULL after allocation, and audit for memory leaks

This is a simple rule, but easy to neglect. A NULL check tells you whether the allocator could extend the memory area, though it does not guarantee that the needed pages can actually be allocated later. Usually you then need to defer the allocation or bail out gracefully, depending on the situation. Combined with overcommit mode 2, malloc() returns NULL when it cannot commit a free page, letting you avoid OOM.

A memory leak is unnecessary memory consumption: the application no longer tracks the leaked block, but the kernel does not reclaim it either, because from the kernel's point of view the program is still using it. Valgrind can track down such leaks.

Always query memory allocation statistics

The Linux kernel provides /proc/meminfo for memory status information; this is where tools like top, free, and vmstat get their data.

What you need to check is free and reclaimable memory. Free needs no explanation, but what is reclaimable? This refers to buffers and the page cache: when memory is tight, the system can write them back to disk and reclaim the pages.

$ cat /proc/meminfo
MemTotal: 255944 kB
MemFree: 3668 kB
Buffers: 13640 kB
Cached: 171788 kB
SwapCached: 0 kB
HighTotal: 0 kB
HighFree: 0 kB
LowTotal: 255944 kB
LowFree: 3668 kB
SwapTotal: 909676 kB
SwapFree: 909676 kB

Based on this output, the free virtual memory is MemFree + Buffers + Cached + SwapFree.

I failed to find any formalized C (glibc) function that reports free (including reclaimable) memory. The closest are get_avphys_pages() and sysconf() with the _SC_AVPHYS_PAGES parameter, but they report only the amount of free memory, not free plus reclaimable.

This means that for accurate information you need to parse /proc/meminfo and compute it yourself. If you are lazy, borrow from the procps source code, the package that includes the ps, top, and free tools.

Experiment with other memory allocators

Different allocators manage memory chunks in different ways. Hoard is one example, created by Emery Berger of the University of Massachusetts for high-performance memory allocation. It targets multithreaded programs and introduces the concept of per-CPU heaps.

Use a 64-bit Platform

Users who need a larger user address space should consider 64-bit computing. On 64-bit platforms, the kernel no longer splits virtual memory 3:1, so it also suits machines with more than 4 GB of RAM.

This has nothing to do with extended addressing such as Intel's PAE, which lets a 32-bit processor address up to 64 GB of RAM. That addressing is physical and means nothing to user space: each process still gets only its 3 GB of virtual addresses. The extra memory is accessible, but not all of it can be mapped into the address space at once, and unmapped regions are unusable.

Consider using packed types in structures

The packed attribute helps squeeze the size of structs, enums, and unions. It is a way to save bytes, especially for arrays of structs. Here is an example declaration:

struct test
{
    char a;
    long b;
} __attribute__((packed));

The catch is that packing misaligns the fields, which costs extra CPU cycles. Alignment means that a variable's address is an integer multiple of its data type's natural size. Access to misaligned data is slower, so weigh the space savings against how frequently the fields are accessed and how the ordering affects cache behavior.

Use ulimit in user processes

You can use ulimit -v to limit the amount of address space a process may map. Once the limit is reached, mmap(), and therefore malloc(), returns NULL, so the OOM killer never starts. This is very useful on multi-user systems, because it prevents one runaway process from harming the innocent ones.

