Recently, some of our memory-intensive online applications have occasionally been killed at traffic peaks, forcing the service to restart. Investigation showed that the kills were triggered by the Linux out-of-memory (OOM) killer mechanism.
http://linux-mm.org/OOM_Killer
When memory runs low, the OOM killer selects the process with the highest badness score, sends it a SIGKILL, and records the event in /var/log/messages. The record includes the pid, process name, CPU mask, call trace and other details, so this class of problem can be detected through log monitoring.
In this post we analyze how the OOM killer selects its victim and dig into the code a little. The mechanism feels simple, even crude, but it is quite effective. Here is what we found.
A simple code fragment that keeps allocating heap memory (big_mm.c):
#include <stdio.h>
#include <stdlib.h>
#include <strings.h>
#include <unistd.h>

#define MB 64L
#define BLOCK (1024L * 1024L * MB)

int main(void)
{
    unsigned long total = 0L;
    for (;;) {
        /* malloc a big block of memory and zero it */
        char *mm = (char *) malloc(BLOCK);
        usleep(100000);
        if (NULL == mm)
            continue;
        bzero(mm, BLOCK);
        total += MB;
        fprintf(stdout, "Alloc %luM mem\n", total);
    }
    return 0;
}
There are two things to be aware of here:
1. malloc only allocates virtual address space; if the block is never written (no memset/bzero), no physical pages are faulted in and no physical memory is mapped. That is why the code fills each block with bzero (the sketch after this list demonstrates the effect).
2. The size of each block matters. The Linux kernel divides memory into lowmem and highmem; lowmem holds memory-critical resources and has a reserve threshold. You can inspect the current low memory size with free -lm and the reserve threshold via /proc/sys/vm/lowmem_reserve_ratio. The OOM killer is only triggered once that threshold is crossed, so each block here is kept modest (64MB). With very large blocks (say 512MB), the brk or mmap syscall underneath malloc may block while the kernel triggers page cache write-back or slab reclaim.
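To see point 1 concretely, here is a minimal standalone sketch (mine, not part of big_mm.c) that prints the process's VmRSS from /proc/self/status around the allocation; RSS barely moves after malloc and only jumps once bzero touches the pages:

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <strings.h>

/* Print this process's VmRSS line from /proc/self/status. */
static void print_rss(const char *tag)
{
    char line[256];
    FILE *f = fopen("/proc/self/status", "r");
    if (!f)
        return;
    while (fgets(line, sizeof(line), f))
        if (strncmp(line, "VmRSS:", 6) == 0)
            printf("%s %s", tag, line);
    fclose(f);
}

int main(void)
{
    size_t sz = 64L * 1024L * 1024L;   /* 64 MB, same block size as big_mm.c */
    char *mm;

    print_rss("before malloc:");
    mm = malloc(sz);
    if (!mm)
        return 1;
    print_rss("after  malloc:");       /* RSS barely moves: only VA reserved */
    bzero(mm, sz);
    print_rss("after  bzero :");       /* RSS jumps ~64 MB: pages now mapped */
    free(mm);
    return 0;
}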
Test:
gcc big_mm.c -o big_mm; ./big_mm & ./big_mm & ./big_mm &
(start several big_mm processes at the same time to compete for memory)
After starting, some big_mm processes get killed, and /var/log/messages (tail -n 1000 | grep -i oom) shows:
Apr 18 16:56:16 v125000100.bja kernel: : [22254383.898423] Out of memory: Kill process 24894 (big_mm) score 277 or sacrifice child
Apr 18 16:56:16 v125000100.bja kernel: : [22254383.899708] Killed process 24894, UID 55120, (big_mm) total-vm:2301932kB, anon-rss:2228452kB, file-rss:24kB
Apr 18 16:56:18 v125000100.bja kernel: : [22254386.738942] big_mm invoked oom-killer: gfp_mask=0x280da, order=0, oom_adj=0, oom_score_adj=0
Apr 18 16:56:18 v125000100.bja kernel: : [22254386.738947] big_mm cpuset=/ mems_allowed=0
Apr 18 16:56:18 v125000100.bja kernel: : [22254386.738950] Pid: 24893, comm: big_mm Not tainted 2.6.32-220.23.2.ali878.el6.x86_64 #1
Apr 18 16:56:18 v125000100.bja kernel: : [22254386.738952] Call Trace:
Apr 18 16:56:18 v125000100.bja kernel: : [22254386.738961] [<ffffffff810c35e1>] ? cpuset_print_task_mems_allowed+0x91/0xb0
Apr 18 16:56:18 v125000100.bja kernel: : [22254386.738968] [<ffffffff81114d70>] ? dump_header+0x90/0x1b0
Apr 18 16:56:18 v125000100.bja kernel: : [22254386.738973] [<ffffffff810e1b2e>] ? __delayacct_freepages_end+0x2e/0x30
Apr 18 16:56:18 v125000100.bja kernel: : [22254386.738979] [<ffffffff81213ffc>] ? security_real_capable_noaudit+0x3c/0x70
Apr 18 16:56:18 v125000100.bja kernel: : [22254386.738982] [<ffffffff811151fa>] ? oom_kill_process+0x8a/0x2c0
Apr 18 16:56:18 v125000100.bja kernel: : [22254386.738985] [<ffffffff81115131>] ? select_bad_process+0xe1/0x120
Apr 18 16:56:18 v125000100.bja kernel: : [22254386.738989] [<ffffffff81115650>] ? out_of_memory+0x220/0x3c0
Apr 18 16:56:18 v125000100.bja kernel: : [22254386.738995] [<ffffffff81125929>] ? __alloc_pages_nodemask+0x899/0x930
Apr 18 16:56:18 v125000100.bja kernel: : [22254386.739001] [<ffffffff81159c6a>] ? alloc_pages_vma+0x9a/0x150
From the first two lines you can see that big_mm occupied total-vm 2301932kB, almost all of it anon-rss, i.e. the large anonymous blocks allocated via mmap. The call trace that follows identifies the kernel path into the OOM killer; we will analyze the OOM killer code along this call trace below.
- Analysis of the OOM killer mechanism
We have now triggered the OOM killer; so how does it decide which process to kill? Let's first look at the interfaces the kernel exposes to user space under /proc:
/proc/[pid]/oom_adj: the weight used when the OOM killer decides whether to kill the pid, in the range [-17, 15]. The higher the weight, the more likely the process is to be selected; -17 means killing the process is forbidden (a sketch of setting this value from a program follows).
/proc/[pid]/oom_score: the current kill score of the pid. The higher the score, the more likely the process is to be killed. It is computed taking oom_adj into account and is the OOM killer's main reference.
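As a small illustration of this interface, the following sketch (a hypothetical helper, not from the kernel or this post) writes an oom_adj value for a pid: -17 exempts the process, while positive values up to 15 make it a preferred victim. Lowering the value requires privilege (CAP_SYS_RESOURCE); on newer kernels the same idea applies to /proc/[pid]/oom_score_adj with the range [-1000, 1000].

#include <stdio.h>
#include <unistd.h>

/* Write an oom_adj value in [-17, 15] for the given pid; -17 disables
 * OOM killing for that process. Returns 0 on success, -1 on error. */
int set_oom_adj(pid_t pid, int adj)
{
    char path[64];
    FILE *f;

    snprintf(path, sizeof(path), "/proc/%d/oom_adj", (int) pid);
    f = fopen(path, "w");
    if (!f)
        return -1;
    fprintf(f, "%d\n", adj);
    fclose(f);
    return 0;
}

int main(void)
{
    /* Exempt ourselves from the OOM killer (needs CAP_SYS_RESOURCE). */
    if (set_oom_adj(getpid(), -17) != 0)
        perror("set_oom_adj");
    return 0;
}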
There are two configurable sysctl options:
vm.panic_on_oom = 0 # whether the kernel panics outright when memory runs out
vm.oom_kill_allocating_task = 1 # whether the OOM killer kills the process that is currently requesting memory (see the sketch below)
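These knobs live under /proc/sys/vm, so besides sysctl -w they can also be set programmatically. A minimal sketch, assuming root privileges:

#include <stdio.h>

/* Write a value to a /proc/sys knob; equivalent to `sysctl -w`. */
static int write_sysctl(const char *path, const char *val)
{
    FILE *f = fopen(path, "w");
    if (!f)
        return -1;
    fputs(val, f);
    fclose(f);
    return 0;
}

int main(void)
{
    /* Kill the allocating task instead of scanning for the worst one. */
    write_sysctl("/proc/sys/vm/oom_kill_allocating_task", "1");
    /* Do not panic on OOM; let the OOM killer run. */
    write_sysctl("/proc/sys/vm/panic_on_oom", "0");
    return 0;
}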
When the OOM killer is triggered, /var/log/messages also prints a score table for every process:
Apr 18 16:56:18 v125000100.bja kernel: : [22254386.758297] [ pid ]   uid  tgid total_vm      rss cpu oom_adj oom_score_adj name
Apr 18 16:56:18 v125000100.bja kernel: : [22254386.758311] [  399]     0   399     2709      133   2     -17         -1000 udevd
Apr 18 16:56:18 v125000100.bja kernel: : [22254386.758314] [  810]     0   810     2847       43   0       0             0 svscanboot
Apr 18 16:56:18 v125000100.bja kernel: : [22254386.758317] [  824]     0   824     1039       21   0       0             0 svscan
... [further entries for readproctitle, supervise, run and multilog]
So, if you want to change the probability of a process being selected by the OOM killer, tune the parameters above.
The strategies above cover the tuning knobs; next we analyze the corresponding kernel code to get a clear picture. The code is from kernel 3.0.12, source file mm/oom_kill.c. First, the invocation relationship behind the call trace:
__alloc_pages_nodemask() allocates memory → finds memory exhausted (or lowmem below its threshold) → out_of_memory() → select_bad_process() picks the process with the highest score → oom_kill_process() kills it.
/**
 * out_of_memory - kill the "best" process when we run out of memory
 */
void out_of_memory(struct zonelist *zonelist, gfp_t gfp_mask,
                   int order, nodemask_t *nodemask, bool force_kill)
{
    /* ... declarations trimmed ... */

    /* Run the OOM notifier call chain; if it freed memory, we are done. */
    blocking_notifier_call_chain(&oom_notify_list, 0, &freed);
    if (freed > 0)
        return;

    /* If the current task is already dying or exiting, its memory will
     * be freed soon, so just mark it and return. */
    if (fatal_signal_pending(current) || current->flags & PF_EXITING) {
        set_thread_flag(TIF_MEMDIE);
        return;
    }

    /* If sysctl panic_on_oom is set, the kernel panics directly. */
    check_panic_on_oom(constraint, gfp_mask, order, mpol_mask);

    /* If sysctl oom_kill_allocating_task is set, kill the task that is
     * currently requesting memory (unless it is unkillable). */
    if (sysctl_oom_kill_allocating_task && current->mm &&
        !oom_unkillable_task(current, NULL, nodemask) &&
        current->signal->oom_score_adj != OOM_SCORE_ADJ_MIN) {
        get_task_struct(current);
        oom_kill_process(current, gfp_mask, order, 0, totalpages, NULL,
                         nodemask,
                         "Out of memory (oom_kill_allocating_task)");
        goto out;
    }

    /* Otherwise use select_bad_process() to pick the task with the
     * highest badness score (oom_score). */
    p = select_bad_process(&points, totalpages, mpol_mask, force_kill);
    if (!p) {
        dump_header(NULL, gfp_mask, order, NULL, mpol_mask);
        panic("Out of memory and no killable processes...\n");
    }
    if (p != (void *)-1UL) {
        /* oom_kill_process() may choose to kill a child instead,
         * which spares the parent. */
        oom_kill_process(p, gfp_mask, order, points, totalpages,
                         NULL, nodemask, "Out of memory");
        killed = 1;
    }
out:
    /* Give the killed task a moment to exit before allocations retry. */
    if (killed)
        schedule_timeout_killable(1);
}
select_bad_process() calls oom_badness() to compute each candidate's score:
/**
 * oom_badness - heuristic function to determine which candidate task to kill
 */
unsigned long oom_badness(struct task_struct *p, struct mem_cgroup *memcg,
                          const nodemask_t *nodemask, unsigned long totalpages)
{
    long points;
    long adj;

    /* Skip tasks that must not be killed: the init process (pid 1),
     * kernel threads (kthreads), and tasks outside the allowed
     * cgroup/nodemask. */
    if (oom_unkillable_task(p, memcg, nodemask))
        return 0;

    p = find_lock_task_mm(p);
    if (!p)
        return 0;

    /* Read the /proc/[pid]/oom_score_adj weight; OOM_SCORE_ADJ_MIN means
     * the task is exempt, so its score is 0. */
    adj = (long) p->signal->oom_score_adj;
    if (adj == OOM_SCORE_ADJ_MIN) {
        task_unlock(p);
        return 0;
    }

    /* Base score: the task's RSS, page table pages and swap usage. */
    points = get_mm_rss(p->mm) + p->mm->nr_ptes +
             get_mm_counter(p->mm, MM_SWAPENTS);
    task_unlock(p);

    /* Root (CAP_SYS_ADMIN) processes get a 3% discount. */
    if (has_capability_noaudit(p, CAP_SYS_ADMIN))
        adj -= 30;

    /* oom_score_adj is normalized in units of totalpages/1000. */
    adj *= totalpages / 1000;
    points += adj;

    /* Never return 0 for an eligible task; 1 is the minimum. */
    return points > 0 ? points : 1;
}
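To make the heuristic concrete, here is a rough userspace approximation (my sketch; it mirrors the formula above but is not kernel code and omits the 3% root discount). It estimates a pid's badness from /proc, using VmRSS + VmPTE + VmSwap converted to pages, plus oom_score_adj * totalpages / 1000:

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

/* Read a "Key:  <n> kB" field from /proc/<pid>/status, in kB. */
static long status_kb(int pid, const char *key)
{
    char path[64], line[256];
    long val = 0;
    FILE *f;

    snprintf(path, sizeof(path), "/proc/%d/status", pid);
    f = fopen(path, "r");
    if (!f)
        return 0;
    while (fgets(line, sizeof(line), f))
        if (strncmp(line, key, strlen(key)) == 0)
            sscanf(line + strlen(key), "%ld", &val);
    fclose(f);
    return val;
}

int main(int argc, char **argv)
{
    int pid = (argc > 1) ? atoi(argv[1]) : getpid();
    long page_kb = sysconf(_SC_PAGESIZE) / 1024;
    long totalpages = sysconf(_SC_PHYS_PAGES);
    char path[64];
    long adj = 0;
    long points;
    FILE *f;

    /* points = rss + page table pages + swapped pages (all in pages) */
    points = (status_kb(pid, "VmRSS:") +
              status_kb(pid, "VmPTE:") +
              status_kb(pid, "VmSwap:")) / page_kb;

    /* oom_score_adj is applied in units of totalpages/1000 */
    snprintf(path, sizeof(path), "/proc/%d/oom_score_adj", pid);
    f = fopen(path, "r");
    if (f) {
        fscanf(f, "%ld", &adj);
        fclose(f);
    }
    points += adj * totalpages / 1000;

    printf("pid %d approx badness: %ld (of %ld pages)\n",
           pid, points > 0 ? points : 1, totalpages);
    return 0;
}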
Summing up: we can steer the OOM killer with the strategies above, e.g. forbid killing a critical process by giving it the minimum oom_adj (-17) or a small value, and we can adjust the OOM killer's overall behavior through sysctl.