OOM Kill Behavior Explained

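OOM is short for "out of memory". It is the Linux kernel's handling mechanism for the case where free memory has fallen to the low watermark and reclaim cannot recover enough. When an OOM occurs, the kernel preferentially kills the process that uses the most memory (the one with the highest score, the "baddest") by sending it SIGKILL, the equivalent of kill -9, which forces the process to exit.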
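
The score the kernel currently assigns to a process can be inspected from user space: Linux exposes it in /proc/<pid>/oom_score, and the tunable adjustment in /proc/<pid>/oom_score_adj. The following is a minimal sketch, assuming a Linux system with procfs mounted; the helper names `read_long_file` and `print_own_oom_score` are made up for illustration.

```c
#include <stdio.h>

/* Hypothetical helper: read one integer from a text file.
 * Returns 0 on success, -1 if the file cannot be opened or parsed. */
int read_long_file(const char *path, long *out)
{
    FILE *f = fopen(path, "r");
    if (!f)
        return -1;
    int ok = (fscanf(f, "%ld", out) == 1);
    fclose(f);
    return ok ? 0 : -1;
}

/* Print the OOM score and adjustment of the calling process. */
void print_own_oom_score(void)
{
    long score, adj;

    if (read_long_file("/proc/self/oom_score", &score) == 0 &&
        read_long_file("/proc/self/oom_score_adj", &adj) == 0)
        printf("oom_score=%ld oom_score_adj=%ld\n", score, adj);
    else
        printf("procfs OOM files not available\n");
}
```

Calling `print_own_oom_score()` from `main` prints the two values for the current process; replacing `self` with a pid inspects another process.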

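I. What triggers an OOM:

1. Most OOMs are caused by memory leaks, or by a burst of large allocations that trips the low-watermark protection;
2. Too many open files or other resources that are not released in time can also drag available memory down to the low watermark.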

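II. Call logic

out_of_memory  // entry point of the OOM killer
select_bad_process -> oom_scan_process_thread  // pick the "baddest" process
oom_kill_process -> oom_badness  // kill that process; if it has children that do not share its mm, a child is killed first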

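Note
One OOM kill invocation picks only one victim (plus any processes that share the victim's mm); another round is triggered only if memory is still short. select_bad_process picks the highest-scoring process, but oom_kill_process then looks at that process's children: if any child does not share the parent's mm, the highest-scoring such child is killed instead. So when a parent process has the highest score, what you observe is that the children it created are killed first, and the parent itself is killed only once memory is still insufficient.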
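
The "children first, parent last" behavior can be reproduced with a small user-space model. This is only a sketch under simplifying assumptions: scores are precomputed, `struct toy_task` and `toy_oom_round` are hypothetical names, and thread groups and locking are ignored.

```c
#include <stddef.h>

/* Toy stand-in for a process; all fields are illustrative. */
struct toy_task {
    int pid;
    int ppid;            /* parent pid, 0 if none */
    int mm_id;           /* tasks with the same mm_id share an address space */
    unsigned long score; /* precomputed badness score */
    int alive;
};

/* One OOM round: pick the highest-scoring live task, then prefer its
 * highest-scoring live child that does not share its mm.
 * Marks the victim dead and returns its pid, or -1 if nothing is killable. */
int toy_oom_round(struct toy_task *tasks, size_t n)
{
    struct toy_task *chosen = NULL;

    for (size_t i = 0; i < n; i++)
        if (tasks[i].alive && tasks[i].score > 0 &&
            (!chosen || tasks[i].score > chosen->score))
            chosen = &tasks[i];
    if (!chosen)
        return -1;

    struct toy_task *victim = chosen;
    unsigned long victim_points = 0; /* starts at 0, as in oom_kill_process */

    for (size_t i = 0; i < n; i++) {
        struct toy_task *c = &tasks[i];

        if (!c->alive || c->ppid != chosen->pid || c->mm_id == chosen->mm_id)
            continue;
        if (c->score > victim_points) {
            victim = c;
            victim_points = c->score;
        }
    }
    victim->alive = 0;
    return victim->pid;
}
```

With a parent (pid 100, score 900) that has two children with their own mm (pid 101 with score 300, pid 102 with score 200), successive calls return 101, then 102, and only then 100.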

III. Code walkthrough

The walkthrough below covers the functions involved in the flow described above.
Entry function:

    /*
     * Main flow of OOM handling.
     */
    void out_of_memory(struct zonelist *zonelist, gfp_t gfp_mask,
                       int order, nodemask_t *nodemask, bool force_kill)
    {
        const nodemask_t *mpol_mask;
        struct task_struct *p;
        unsigned long totalpages;
        unsigned long freed = 0;
        unsigned int uninitialized_var(points);
        enum oom_constraint constraint = CONSTRAINT_NONE;
        int killed = 0;

        /* Call the functions registered on the blocking notifier chain oom_notify_list. */
        blocking_notifier_call_chain(&oom_notify_list, 0, &freed);
        if (freed > 0)
            /* Got some memory back in the last second. */
            return;

        /*
         * If current has a pending SIGKILL or is exiting, then automatically
         * select it. The goal is to allow it to allocate so that it may
         * quickly exit and free its memory.
         */
        if (fatal_signal_pending(current) || current->flags & PF_EXITING) {
            set_thread_flag(TIF_MEMDIE);
            return;
        }

        /*
         * Check if there were limitations on the allocation (only relevant for
         * NUMA) that may require different handling.
         */
        constraint = constrained_alloc(zonelist, gfp_mask, nodemask,
                                       &totalpages);
        mpol_mask = (constraint == CONSTRAINT_MEMORY_POLICY) ? nodemask : NULL;
        /* Panic right away if /proc/sys/vm/panic_on_oom is configured to do so. */
        check_panic_on_oom(constraint, gfp_mask, order, mpol_mask);

        /*
         * If oom_kill_allocating_task is set, memory is reclaimed by killing
         * current instead of scanning: do so when current is killable.
         */
        if (sysctl_oom_kill_allocating_task && current->mm &&
            !oom_unkillable_task(current, NULL, nodemask) &&
            current->signal->oom_score_adj != OOM_SCORE_ADJ_MIN) {
            get_task_struct(current);
            /* Kill the selected process. */
            oom_kill_process(current, gfp_mask, order, 0, totalpages, NULL,
                             nodemask,
                             "Out of memory (oom_kill_allocating_task)");
            goto out;
        }

        /* Pick the process to kill according to the scoring policy. */
        p = select_bad_process(&points, totalpages, mpol_mask, force_kill);
        /* Found nothing?!?! Either we hang forever, or we panic. */
        /*
         * Nothing was selected, i.e. every candidate was skipped or is
         * unkillable, so panic. This path is normally not reached.
         */
        if (!p) {
            dump_header(NULL, gfp_mask, order, NULL, mpol_mask);
            panic("Out of memory and no killable processes...\n");
        }
        /*
         * Kill the selected process to free memory. ERR_PTR(-1UL) means the
         * scan was aborted, so nothing further is killed in that case.
         */
        if (PTR_ERR(p) != -1UL) {
            oom_kill_process(p, gfp_mask, order, points, totalpages, NULL,
                             nodemask, "Out of memory");
            killed = 1;
        }
    out:
        /*
         * Give the killed threads a good chance of exiting before trying to
         * allocate memory again. The argument is in jiffies, so this is a
         * one-tick sleep, not one second.
         */
        if (killed)
            schedule_timeout_killable(1);
    }

Selecting the process to kill:

    /*
     * Selects the process to be killed during OOM handling.
     * @ppoints: the score of the chosen process; the higher the score,
     *           the more likely a process is to be selected
     */
    static struct task_struct *select_bad_process(unsigned int *ppoints,
            unsigned long totalpages, const nodemask_t *nodemask,
            bool force_kill)
    {
        struct task_struct *g, *p;
        struct task_struct *chosen = NULL;
        unsigned long chosen_points = 0;

        rcu_read_lock();
        /* Walk every thread in the system and compute its score. */
        do_each_thread(g, p) {
            unsigned int points;

            /*
             * Handle special cases first, for example selecting the task
             * that triggered the OOM when it is already exiting, or
             * skipping tasks that have already exited.
             */
            switch (oom_scan_process_thread(p, totalpages, nodemask,
                                            force_kill)) {
            case OOM_SCAN_SELECT:
                chosen = p;
                chosen_points = ULONG_MAX;
                /* fall through */
            case OOM_SCAN_CONTINUE:
                continue;
            case OOM_SCAN_ABORT:
                rcu_read_unlock();
                return ERR_PTR(-1UL);
            case OOM_SCAN_OK:
                break;
            };
            /* Compute the score and keep the task with the highest one. */
            points = oom_badness(p, NULL, nodemask, totalpages);
            if (points > chosen_points) {
                chosen = p;
                chosen_points = points;
            }
        } while_each_thread(g, p);
        if (chosen)
            get_task_struct(chosen);
        rcu_read_unlock();

        *ppoints = chosen_points * 1000 / totalpages;
        return chosen;
    }
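
The scan loop and the final rescaling to a 0..1000 score can be modelled in user space. The sketch below assumes the per-task verdict and badness are already known; `enum toy_scan`, `struct toy_candidate` and `toy_select` are hypothetical names that only mirror the shape of the kernel code.

```c
#include <limits.h>
#include <stddef.h>

/* Toy mirror of the scan verdicts; names are illustrative. */
enum toy_scan { TOY_SCAN_OK, TOY_SCAN_CONTINUE, TOY_SCAN_ABORT, TOY_SCAN_SELECT };

struct toy_candidate {
    enum toy_scan verdict; /* what the per-task pre-check decided */
    unsigned long badness; /* score in pages, used only for TOY_SCAN_OK */
};

/* Returns the index of the chosen candidate, -1 if none, -2 if aborted.
 * On a normal return *ppoints holds the score rescaled by 1000/totalpages.
 * totalpages is assumed to be non-zero. */
int toy_select(const struct toy_candidate *c, size_t n,
               unsigned long totalpages, unsigned int *ppoints)
{
    int chosen = -1;
    unsigned long chosen_points = 0;

    for (size_t i = 0; i < n; i++) {
        switch (c[i].verdict) {
        case TOY_SCAN_SELECT:
            /* Forced choice: no later score can exceed ULONG_MAX. */
            chosen = (int)i;
            chosen_points = ULONG_MAX;
            continue;
        case TOY_SCAN_CONTINUE:
            continue;
        case TOY_SCAN_ABORT:
            return -2;
        case TOY_SCAN_OK:
            break;
        }
        if (c[i].badness > chosen_points) {
            chosen = (int)i;
            chosen_points = c[i].badness;
        }
    }
    /* Wraps around for a forced choice, as the same expression does in the
     * listing; there the value is only used in the log message. */
    *ppoints = (unsigned int)(chosen_points * 1000 / totalpages);
    return chosen;
}
```

For example, with 1,000,000 total pages and candidates scoring 100,000, 250,000 and 50,000 pages, the second one is chosen and `*ppoints` becomes 250.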

Computing the score:

    /*
     * Computes a task's score (how likely it is to be selected). The score is
     * based on the physical memory the task uses: the more it uses, the more
     * likely it is to be selected. Root processes get a 3% bonus.
     */
    unsigned long oom_badness(struct task_struct *p, struct mem_cgroup *memcg,
                              const nodemask_t *nodemask, unsigned long totalpages)
    {
        long points;
        long adj;

        if (oom_unkillable_task(p, memcg, nodemask))
            return 0;

        /* Find a thread in the group that still has an mm; none means nothing to score. */
        p = find_lock_task_mm(p);
        if (!p)
            return 0;

        adj = (long)p->signal->oom_score_adj;
        if (adj == OOM_SCORE_ADJ_MIN) {
            task_unlock(p);
            return 0;
        }

        /*
         * The baseline for the badness score is the proportion of RAM that each
         * task's rss, pagetable and swap space use.
         */
        /* points = rss (resident physical memory) + page-table pages + swap entries */
        points = get_mm_rss(p->mm) + p->mm->nr_ptes +
                 get_mm_counter(p->mm, MM_SWAPENTS);
        task_unlock(p);

        /*
         * Root processes get 3% bonus, just like the __vm_enough_memory()
         * implementation used by LSMs.
         */
        /*
         * The check is on CAP_SYS_ADMIN. Such a task may use 3% of total
         * memory more than other tasks for the same score; 3% = 30/1000.
         */
        if (has_capability_noaudit(p, CAP_SYS_ADMIN))
            adj -= 30;

        /* Normalize to oom_score_adj units */
        adj *= totalpages / 1000;
        points += adj;

        /*
         * Never return 0 for an eligible task regardless of the root bonus and
         * oom_score_adj (oom_score_adj can't be OOM_SCORE_ADJ_MIN here).
         */
        return points > 0 ? points : 1;
    }
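
The arithmetic of the score is easy to check outside the kernel. The sketch below restates it as a pure function; `toy_badness` and `TOY_SCORE_ADJ_MIN` are hypothetical names, all sizes are in pages, and the capability check is reduced to a flag. With totalpages = 1,000,000, one unit of oom_score_adj is worth 1,000 pages, so a task using 200,500 pages with oom_score_adj = 500 scores 200,500 + 500 × 1,000 = 700,500.

```c
#define TOY_SCORE_ADJ_MIN (-1000) /* mirrors OOM_SCORE_ADJ_MIN */

/* User-space restatement of the badness formula (illustrative only).
 * has_admin_cap models the CAP_SYS_ADMIN check. */
unsigned long toy_badness(unsigned long rss, unsigned long nr_ptes,
                          unsigned long swapents, long adj,
                          int has_admin_cap, unsigned long totalpages)
{
    long points;

    if (adj == TOY_SCORE_ADJ_MIN)
        return 0;                 /* task is exempt from OOM killing */

    points = (long)(rss + nr_ptes + swapents);
    if (has_admin_cap)
        adj -= 30;                /* 3% bonus in units of 1/1000 */
    adj *= (long)(totalpages / 1000);
    points += adj;

    /* An eligible task never scores 0. */
    return points > 0 ? (unsigned long)points : 1;
}
```

The same task with the administrative bonus and oom_score_adj = 0 scores 200,500 - 30,000 = 170,500, and a small privileged task whose usage is below the bonus still gets the minimum score of 1 rather than 0.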

Killing the bad process:

    /*
     * Kills the selected process; called from the OOM path.
     */
    void oom_kill_process(struct task_struct *p, gfp_t gfp_mask, int order,
                          unsigned int points, unsigned long totalpages,
                          struct mem_cgroup *memcg, nodemask_t *nodemask,
                          const char *message)
    {
        struct task_struct *victim = p;
        struct task_struct *child;
        struct task_struct *t = p;
        struct mm_struct *mm;
        unsigned int victim_points = 0;
        static DEFINE_RATELIMIT_STATE(oom_rs, DEFAULT_RATELIMIT_INTERVAL,
                                      DEFAULT_RATELIMIT_BURST);

        /*
         * If the task is already exiting, don't alarm the sysadmin or kill
         * its children or threads, just set TIF_MEMDIE so it can die quickly
         */
        if (p->flags & PF_EXITING) {
            set_tsk_thread_flag(p, TIF_MEMDIE);
            put_task_struct(p);
            return;
        }

        if (__ratelimit(&oom_rs))
            dump_header(p, gfp_mask, order, memcg, nodemask);

        task_lock(p);
        pr_err("%s: Kill process %d (%s) score %d or sacrifice child\n",
               message, task_pid_nr(p), p->comm, points);
        task_unlock(p);

        /*
         * If any of p's children has a different mm and is eligible for kill,
         * the one with the highest oom_badness() score is sacrificed for its
         * parent. This attempts to lose the minimal amount of work done while
         * still freeing memory.
         */
        /*
         * Children usually do not share the parent's mm. victim_points starts
         * at 0, so any killable child with its own mm replaces the parent as
         * the victim, whatever the parent's own score is.
         */
        read_lock(&tasklist_lock);
        do {
            /* Walk all children of the selected process. */
            list_for_each_entry(child, &t->children, sibling) {
                unsigned int child_points;

                /* Skip children that share the parent's mm. */
                if (child->mm == p->mm)
                    continue;
                /*
                 * oom_badness() returns 0 if the thread is unkillable
                 */
                /* Compute the child's oom_badness score. */
                child_points = oom_badness(child, memcg, nodemask,
                                           totalpages);
                /* Keep the highest-scoring child as the new victim. */
                if (child_points > victim_points) {
                    put_task_struct(victim);
                    victim = child;
                    victim_points = child_points;
                    get_task_struct(victim);
                }
            }
        } while_each_thread(p, t);
        read_unlock(&tasklist_lock);

        rcu_read_lock();
        /*
         * Look through the victim's thread group for a thread that still has
         * task_struct->mm. If there is none (the process may have exited or
         * released its mm in the meantime), there is nothing left to kill.
         * Otherwise use the thread that was found.
         */
        p = find_lock_task_mm(victim);
        if (!p) {
            rcu_read_unlock();
            put_task_struct(victim);
            return;
        } else if (victim != p) {
            /* The thread found differs from the earlier choice: update victim. */
            get_task_struct(p);
            put_task_struct(victim);
            victim = p;
        }

        /* mm cannot safely be dereferenced after task_unlock(victim) */
        mm = victim->mm;
        pr_err("Killed process %d (%s) total-vm:%lukB, anon-rss:%lukB, file-rss:%lukB\n",
               task_pid_nr(victim), victim->comm, K(victim->mm->total_vm),
               K(get_mm_counter(victim->mm, MM_ANONPAGES)),
               K(get_mm_counter(victim->mm, MM_FILEPAGES)));
        task_unlock(victim);

        /*
         * Kill all user processes sharing victim->mm in other thread groups, if
         * any. They don't get access to memory reserves, though, to avoid
         * depletion of all memory. This prevents mm->mmap_sem livelock when an
         * oom killed thread cannot exit because it requires the semaphore and
         * its contended by another thread trying to allocate memory itself.
         * That thread will now get access to memory reserves since it has a
         * pending fatal signal.
         */
        /*
         * Sharing the mm means sharing the address space, for example after
         * vfork() or clone() with CLONE_VM. The memory is only reclaimed once
         * every process sharing that mm has been killed. Kernel threads are
         * excluded.
         */
        for_each_process(p)
            if (p->mm == mm && !same_thread_group(p, victim) &&
                !(p->flags & PF_KTHREAD)) {
                if (p->signal->oom_score_adj == OOM_SCORE_ADJ_MIN)
                    continue;

                /* task_struct fields are normally accessed under this lock. */
                task_lock(p);    /* Protect ->comm from prctl() */
                pr_err("Kill process %d (%s) sharing same memory\n",
                       task_pid_nr(p), p->comm);
                task_unlock(p);
                /* Kill the process by sending it SIGKILL. */
                do_send_sig_info(SIGKILL, SEND_SIG_FORCED, p, true);
            }
        rcu_read_unlock();

        /* Set TIF_MEMDIE: the task is being terminated by the OOM killer. */
        set_tsk_thread_flag(victim, TIF_MEMDIE);
        /*
         * Finally kill the victim by sending it SIGKILL. The signal is handled
         * when the task returns from kernel mode to user mode. The victim may
         * be current itself, in which case current handles the signal on its
         * way back to user mode after the OOM path has finished.
         */
        do_send_sig_info(SIGKILL, SEND_SIG_FORCED, victim, true);
        put_task_struct(victim);
    }
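
The final loop, which also kills every other process that shares the victim's address space, amounts to a filter over the process list. The following user-space sketch shows that filter under simplified assumptions; `struct toy_proc` and `toy_collect_sharers` are hypothetical names, and an integer `mm_id` stands in for the mm pointer.

```c
#include <stddef.h>

#define TOY_SCORE_ADJ_MIN (-1000) /* mirrors OOM_SCORE_ADJ_MIN */

struct toy_proc {
    int pid;
    int tgid;       /* thread-group id */
    int mm_id;      /* same mm_id = shared address space */
    int is_kthread; /* models PF_KTHREAD */
    int adj;        /* models oom_score_adj */
};

/* Writes the pids that would receive SIGKILL in addition to the victim
 * into out[] (capacity cap) and returns how many were written. */
size_t toy_collect_sharers(const struct toy_proc *procs, size_t n,
                           const struct toy_proc *victim,
                           int *out, size_t cap)
{
    size_t count = 0;

    for (size_t i = 0; i < n && count < cap; i++) {
        const struct toy_proc *p = &procs[i];

        if (p->mm_id != victim->mm_id)
            continue;             /* different address space */
        if (p->tgid == victim->tgid)
            continue;             /* same thread group as the victim */
        if (p->is_kthread)
            continue;             /* kernel threads are never killed here */
        if (p->adj == TOY_SCORE_ADJ_MIN)
            continue;             /* explicitly protected */
        out[count++] = p->pid;
    }
    return count;
}
```

Only processes in other thread groups that share the mm, are not kernel threads, and are not protected by the minimum adjustment end up in the list.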
