kubernetes: Container with multiple processes not terminated when OOM

/kind bug

What happened:

A pod's container reached its memory limit, and the oom-killer killed only one process inside the container. The container runs a uwsgi Python server, which logged this error:

DAMN ! worker 1 (pid: 1432) died, killed by signal 9 :( trying respawn ...
Respawned uWSGI worker 1 (new pid: 1473)

The only errors I could find in k8s were in the syslog on the node:

Aug 14 12:06:33 ip-172-20-157-22 kernel: [1620280.105281] uwsgi invoked oom-killer: gfp_mask=0x24000c0, order=0, oom_score_adj=-998
Aug 14 12:06:33 ip-172-20-157-22 kernel: [1620280.109569] uwsgi cpuset=05d27aafc4e80e117506eb5da77dea2d881129d8db17466d31c0cc8ad8e13c52 mems_allowed=0
Aug 14 12:06:33 ip-172-20-157-22 kernel: [1620280.115236] CPU: 0 PID: 13965 Comm: uwsgi Tainted: G            E   4.4.65-k8s #1
Aug 14 12:06:33 ip-172-20-157-22 kernel: [1620280.118330] Hardware name: Xen HVM domU, BIOS 4.2.amazon 02/16/2017
Aug 14 12:06:33 ip-172-20-157-22 kernel: [1620280.120445]  0000000000000286 00000000a2a9f130 ffffffff812f67b5 ffff880011423e20
Aug 14 12:06:33 ip-172-20-157-22 kernel: [1620280.120445]  ffff8801fd092800 ffffffff811d8855 ffffffff81826173 ffff8801fd092800
Aug 14 12:06:33 ip-172-20-157-22 kernel: [1620280.120445]  ffffffff81a6b740 0000000000000206 0000000000000002 ffff8800e9f10ab8
Aug 14 12:06:33 ip-172-20-157-22 kernel: [1620280.120445] Call Trace:
Aug 14 12:06:33 ip-172-20-157-22 kernel: [1620280.120445]  [<ffffffff812f67b5>] ? dump_stack+0x5c/0x77
Aug 14 12:06:33 ip-172-20-157-22 kernel: [1620280.120445]  [<ffffffff811d8855>] ? dump_header+0x62/0x1d7
Aug 14 12:06:33 ip-172-20-157-22 kernel: [1620280.120445]  [<ffffffff8116ded1>] ? oom_kill_process+0x211/0x3d0
Aug 14 12:06:33 ip-172-20-157-22 kernel: [1620280.120445]  [<ffffffff811d0f4f>] ? mem_cgroup_iter+0x1cf/0x360
Aug 14 12:06:33 ip-172-20-157-22 kernel: [1620280.120445]  [<ffffffff811d2de3>] ? mem_cgroup_out_of_memory+0x283/0x2c0
Aug 14 12:06:33 ip-172-20-157-22 kernel: [1620280.120445]  [<ffffffff811d3abd>] ? mem_cgroup_oom_synchronize+0x32d/0x340
Aug 14 12:06:33 ip-172-20-157-22 kernel: [1620280.120445]  [<ffffffff811cf170>] ? mem_cgroup_begin_page_stat+0x90/0x90
Aug 14 12:06:33 ip-172-20-157-22 kernel: [1620280.120445]  [<ffffffff8116e5b4>] ? pagefault_out_of_memory+0x44/0xc0
Aug 14 12:06:33 ip-172-20-157-22 kernel: [1620280.120445]  [<ffffffff815a65b8>] ? page_fault+0x28/0x30
Aug 14 12:06:33 ip-172-20-157-22 kernel: [1620280.200077] Task in /kubepods/pod71cf1407-73c1-11e7-8d6e-063b53e2a39f/05d27aafc4e80e117506eb5da77dea2d881129d8db17466d31c0cc8ad8e13c52 killed as a result of limit of /kubepods/pod71cf1407-73c1-11e7-8d6e-063b53e2a39f
Aug 14 12:06:33 ip-172-20-157-22 kernel: [1620280.209412] memory: usage 256000kB, limit 256000kB, failcnt 379
Aug 14 12:06:33 ip-172-20-157-22 kernel: [1620280.212293] memory+swap: usage 0kB, limit 9007199254740988kB, failcnt 0
Aug 14 12:06:33 ip-172-20-157-22 kernel: [1620280.215517] kmem: usage 0kB, limit 9007199254740988kB, failcnt 0
Aug 14 12:06:33 ip-172-20-157-22 kernel: [1620280.219331] Memory cgroup stats for /kubepods/pod71cf1407-73c1-11e7-8d6e-063b53e2a39f: cache:0KB rss:0KB rss_huge:0KB mapped_file:0KB dirty:0KB writeback:0KB inactive_anon:0KB active_anon:0KB inactive_file:0KB active_file:0KB unevictable:0KB
Aug 14 12:06:33 ip-172-20-157-22 kernel: [1620280.231158] Memory cgroup stats for /kubepods/pod71cf1407-73c1-11e7-8d6e-063b53e2a39f/d836fcdfc1ab1f1d4ec7a49ba763b8770015248f3b7da43bdb0948faee5d6163: cache:0KB rss:36KB rss_huge:0KB mapped_file:0KB dirty:0KB writeback:0KB inactive_anon:0KB active_anon:36KB inactive_file:0KB active_file:0KB unevictable:0KB
Aug 14 12:06:33 ip-172-20-157-22 kernel: [1620280.245937] Memory cgroup stats for /kubepods/pod71cf1407-73c1-11e7-8d6e-063b53e2a39f/05d27aafc4e80e117506eb5da77dea2d881129d8db17466d31c0cc8ad8e13c52: cache:696KB rss:255268KB rss_huge:0KB mapped_file:696KB dirty:0KB writeback:0KB inactive_anon:300KB active_anon:255652KB inactive_file:0KB active_file:0KB unevictable:0KB
Aug 14 12:06:33 ip-172-20-157-22 kernel: [1620280.260703] [ pid ]   uid  tgid total_vm      rss nr_ptes nr_pmds swapents oom_score_adj name
Aug 14 12:06:33 ip-172-20-157-22 kernel: [1620280.265697] [ 6726]     0  6726      257        1       4       2        0          -998 pause
Aug 14 12:06:33 ip-172-20-157-22 kernel: [1620280.270433] [ 6857]     1  6857    29473     2393      29       3        0          -998 uwsgi
Aug 14 12:06:33 ip-172-20-157-22 kernel: [1620280.274906] [13957]     1 13957   385176    66344     270       5        0          -998 uwsgi
Aug 14 12:06:33 ip-172-20-157-22 kernel: [1620280.279180] Memory cgroup out of memory: Kill process 13957 (uwsgi) score 42 or sacrifice child
Aug 14 12:06:33 ip-172-20-157-22 kernel: [1620280.283460] Killed process 13957 (uwsgi) total-vm:1540704kB, anon-rss:252808kB, file-rss:12568kB

What you expected to happen:

I expected the whole container/pod to be terminated (and then restarted by the replica-set controller). I also expected the pod's “Restarts” count to go above 0, and events to show up on the pod or replica set.

According to documentation at https://kubernetes.io/docs/tasks/configure-pod-container/assign-memory-resource/#exceed-a-containers-memory-limit the whole container should be terminated:

If a container allocates more memory than its limit, the Container becomes a candidate for termination. If the Container continues to consume memory beyond its limit, the Container is terminated. If a terminated Container is restartable, the kubelet will restart it, as with any other type of runtime failure.

How to reproduce it (as minimally and precisely as possible):

Set up a multi-process server in a pod, e.g. uwsgi and Django, where uwsgi is the main process started in the container by Kubernetes. Then have a child process use more memory than the container limit, as in the sketch below.
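
A minimal Python sketch of that reproduction (hypothetical script name; any memory limit works, e.g. the 256Mi seen in the kernel log above). The parent process is the container's init pid and stays small, so the OOM killer only ever picks the child, mirroring the uwsgi master/worker behaviour:

```python
# oom_child_repro.py -- hypothetical stand-in for the uwsgi master/worker layout.
# Run as the container entrypoint in a pod with a memory limit set.
# The parent (init pid) stays small; the child allocates until the cgroup
# limit is hit, so the kernel OOM killer kills only the child and the
# container keeps running with no Kubernetes-visible restart or event.
import os
import time


def child_allocates_forever():
    blocks = []
    while True:
        blocks.append(bytearray(10 * 1024 * 1024))  # grab 10 MiB per iteration
        time.sleep(0.1)


def main():
    while True:
        pid = os.fork()
        if pid == 0:
            child_allocates_forever()
        _, status = os.waitpid(pid, 0)
        # Mirrors uwsgi's "worker died, killed by signal 9 ... trying respawn"
        print(f"child {pid} exited with status {status}, respawning", flush=True)
        time.sleep(1)


if __name__ == "__main__":
    main()
```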

Anything else we need to know?:

Another nice-to-have: a container that reaches its memory limit should immediately become not-ready, and its endpoints should be removed from Services until it passes health checks again. Because of the hard SIGKILL, we're not able to handle this condition gracefully, and client connections get dropped. I saw the workaround in #40157, so we will try that; a readiness-based mitigation is sketched below.
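
For reference, a hedged sketch of that readiness-based mitigation (hypothetical file name and threshold, assuming the cgroup v1 memory controller used by the 4.4.x kernel here). Wired up as an exec readinessProbe, it pulls the pod out of Service endpoints before the OOM killer fires:

```python
# check_memory_ready.py -- hypothetical readiness-probe helper, assuming the
# cgroup v1 memory controller. Exits non-zero when container memory usage
# crosses a threshold of the limit, so kubelet marks the pod not-ready and it
# is removed from Service endpoints before the OOM killer fires.
import sys

USAGE = "/sys/fs/cgroup/memory/memory.usage_in_bytes"
LIMIT = "/sys/fs/cgroup/memory/memory.limit_in_bytes"
THRESHOLD = 0.9  # fail readiness at 90% of the limit


def read_int(path):
    with open(path) as f:
        return int(f.read().strip())


def main():
    usage = read_int(USAGE)
    limit = read_int(LIMIT)
    if limit and usage / limit >= THRESHOLD:
        print(f"memory {usage}/{limit} above {THRESHOLD:.0%} threshold")
        sys.exit(1)
    sys.exit(0)


if __name__ == "__main__":
    main()
```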

Environment:

  • Kubernetes version: v1.6.4
  • Cloud provider or hardware configuration: AWS
  • OS: Debian GNU/Linux 8 (jessie)
  • Kernel: 4.4.65-k8s
  • Install tools: kops
  • Others:

About this issue

  • Original URL
  • State: closed
  • Created 7 years ago
  • Reactions: 2
  • Comments: 20 (10 by maintainers)

Most upvoted comments

Ok, this has bitten me in the a** bigtime. It cost me a day to find out that one of my Python child processes had been OOM-killed. I would absolutely vote for an oom-killer that always kills the parent process, no matter what; that would at least make the behaviour consistent. You assign a resource limit to the pod (as an entity), and the pod clearly went over that limit, so it should have been restarted.

I agree with @kellycampbell: this behavior is not very well documented…

I’d say this is Working as Intended.

I just ran into this issue too, and I agree that this isn't well documented. I can see how one would assume that k8s enforces the memory limit and communicates this via the API/events/metrics.

The real problem, IMO, is the lack of visibility when this happens. You can get the information from the kernel log, and more recent kernels expose it in vmstat (surfaced by node-exporter as node_vmstat_oom_kill), but that counter can't be correlated to a pod.
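
The kernel log itself does contain the pod UID in the cgroup path (see the "Task in /kubepods/pod…" line above), so a node-level correlation is possible. A hedged sketch (hypothetical script; the regex also tolerates an optional QoS slice segment such as burstable/) of pulling that UID out:

```python
# oom_pod_correlator.py -- hypothetical helper that scans kernel log lines
# (e.g. `dmesg | python3 oom_pod_correlator.py`) for memory-cgroup OOM kills
# and extracts the pod UID from the /kubepods/.../pod<uid>/<container-id>
# cgroup path, which can then be matched against a pod's metadata.uid.
import re
import sys

# Matches lines like the one in the kernel log above:
#   Task in /kubepods/pod71cf1407-.../05d27aaf... killed as a result of limit of ...
PATTERN = re.compile(
    r"Task in /kubepods/(?:[^/]+/)?pod(?P<pod_uid>[0-9a-f-]+)/(?P<container_id>[0-9a-f]+)"
    r" killed as a result of limit"
)

for line in sys.stdin:
    match = PATTERN.search(line)
    if match:
        print(f"OOM kill in pod uid={match.group('pod_uid')} "
              f"container={match.group('container_id')}")
```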

Hello, this behavior is quite misleading, as it effectively delegates the termination of the Pod to… the container itself.

This can leave misbehaving or degraded Pods running that still pass their health checks but should be destroyed anyway.

I actually had a case where the same process was killed over and over (~2000 times in 1 hour) but kept being re-spawned by its init process. Eventually the init process itself got OOMKilled and the container restarted.

I suppose this issue is more a Docker issue than a Kubernetes one.

@xiangpengzhao yes, that’s why I was looking at options for @kellycampbell where the main process is the only one in the container. I guess we need a big disclaimer about child process(es) being oom-killed.

Containers are marked as OOM-killed only when the init pid gets killed by the kernel OOM killer. There are apps that can tolerate OOM kills of non-init processes, so we chose not to track non-init process OOM kills.

I think part of the problem is that it’s not clear how the resource limits are enforced. After troubleshooting this issue, I discovered my own misunderstanding of how responsibility is split between k8s, the container runtime, and Linux cgroups.

I found this documentation helpful in understanding what is happening: https://access.redhat.com/documentation/en-US/Red_Hat_Enterprise_Linux/6/html/Resource_Management_Guide/sec-memory.html

This other page in the k8s docs could have better info under the “How Pods with resource limits are run” section: https://kubernetes.io/docs/concepts/configuration/manage-compute-resources-container/
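
To make that division of responsibility concrete, here is a hedged sketch (hypothetical helper, run on the node, assuming the cgroup v1 layout visible in the kernel log above) of where the kubelet-configured limit actually ends up. It is the kernel memory cgroup that rejects allocations and invokes the OOM killer, not Kubernetes:

```python
# inspect_pod_cgroup.py -- hypothetical helper; run on the node as
#   python3 inspect_pod_cgroup.py <pod-uid> <container-id>
# Assumes the cgroup v1 layout seen in the kernel log above:
#   /sys/fs/cgroup/memory/kubepods/pod<uid>/<container-id>/
import sys
from pathlib import Path

pod_uid, container_id = sys.argv[1], sys.argv[2]
cgroup = Path("/sys/fs/cgroup/memory/kubepods") / f"pod{pod_uid}" / container_id

# memory.limit_in_bytes is the container's resources.limits.memory;
# memory.failcnt counts how many times the kernel refused an allocation against it.
for name in ("memory.limit_in_bytes", "memory.usage_in_bytes", "memory.failcnt"):
    print(name, (cgroup / name).read_text().strip())
```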

Outside of documentation changes, the two other things that I think should be considered long-term for k8s are:

a) how to surface an event when a particular container has breached its memory limit and had processes killed (if the pod doesn’t terminate itself because of the oom-killed process), so admins know why things aren’t working.

b) a way to more gracefully handle containers reaching their memory limits, e.g. with a signal to the pod (#40157)