kubernetes: Log and expose cgroup OOM events to the associated Pod resource

What happened:

Previously, the cadvisor library had a regex parsing error that caused it to fall back to returning the string "/" when parsing system OOM messages on kernels >= 5.0. This was fixed in google/cadvisor#2813.

~Since cadvisor was bumped in #99875, system OOM event messages are no longer being emitted because of the following line, where it checks that event.VictimContainerName == "/": https://github.com/kubernetes/kubernetes/blob/e557f61784a90adf8dfe4a0bca875043e895cc8b/pkg/kubelet/oom/oom_watcher_linux.go#L75~

Since cadvisor was bumped in #99875, we can now retrieve the ID of the OOM’d pod, create a log entry, and emit an event for the associated Pod resource.
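
For illustration only (this is not the kubelet implementation; the helper name and path layouts are assumptions): a minimal Go sketch of how a pod UID could be recovered from the victim cgroup path that the fixed cadvisor now reports, which is what makes attributing the OOM to a Pod possible.

```go
// Illustrative only: the helper and path layouts are assumptions, not kubelet code.
package main

import (
	"fmt"
	"regexp"
	"strings"
)

// podUIDFromCgroupPath attempts to recover a pod UID from a kubepods cgroup
// path reported for the OOM victim, e.g.:
//   cgroupfs: /kubepods/burstable/pod8dbc5577-d0e2-4706-8787-57d52c03ddf2/<container-id>
//   systemd:  /kubepods.slice/kubepods-burstable.slice/kubepods-burstable-pod8dbc5577_d0e2_4706_8787_57d52c03ddf2.slice/...
func podUIDFromCgroupPath(path string) (string, bool) {
	// A pod UID is 36 characters; the systemd driver swaps dashes for underscores.
	re := regexp.MustCompile(`pod([0-9a-fA-F_-]{36})`)
	m := re.FindStringSubmatch(path)
	if m == nil {
		return "", false
	}
	return strings.ReplaceAll(m[1], "_", "-"), true
}

func main() {
	victim := "/kubepods/burstable/pod8dbc5577-d0e2-4706-8787-57d52c03ddf2/1f2a3b"
	if uid, ok := podUIDFromCgroupPath(victim); ok {
		// With the UID, the kubelet could look up the Pod object, log the OOM,
		// and record the event against the Pod instead of only the Node.
		fmt.Printf("OOM victim belongs to pod UID %s\n", uid)
	}
}
```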

What you expected to happen:

System OOM events to be emitted to the Node resource.

How to reproduce it (as minimally and precisely as possible):

On a 1.21.x cluster, create a pod that OOMs:

apiVersion: v1
kind: Pod
metadata:
  name: memory-demo-2
  namespace: default
spec:
  containers:
  - name: memory-demo-2-ctr
    image: polinux/stress
    resources:
      requests:
        memory: "50Mi"
      limits:
        memory: "100Mi"
    command: ["stress"]
    args: ["--vm", "1", "--vm-bytes", "250M", "--vm-hang", "1"]

Observe that no SystemOOM event is recorded for the node where that pod is running.
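
As a verification sketch (not part of the original report; it assumes a kubeconfig at the default location), the following client-go program lists SystemOOM events; on an affected 1.21.x cluster the list comes back empty even after the pod above has been OOM-killed.

```go
// Minimal sketch: list SystemOOM events cluster-wide using client-go.
package main

import (
	"context"
	"fmt"
	"path/filepath"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
	"k8s.io/client-go/util/homedir"
)

func main() {
	// Assumption: kubeconfig lives at the default ~/.kube/config location.
	kubeconfig := filepath.Join(homedir.HomeDir(), ".kube", "config")
	config, err := clientcmd.BuildConfigFromFlags("", kubeconfig)
	if err != nil {
		panic(err)
	}
	clientset, err := kubernetes.NewForConfig(config)
	if err != nil {
		panic(err)
	}

	// List events in all namespaces whose reason is SystemOOM.
	events, err := clientset.CoreV1().Events("").List(context.TODO(), metav1.ListOptions{
		FieldSelector: "reason=SystemOOM",
	})
	if err != nil {
		panic(err)
	}
	for _, e := range events.Items {
		fmt.Printf("%s\t%s\t%s\n", e.LastTimestamp, e.InvolvedObject.Name, e.Message)
	}
}
```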

Anything else we need to know?:

Environment:

  • Kubernetes version (use kubectl version): v1.21.0-beta.1.382+1a983bb958ba66
  • Cloud provider or hardware configuration:
  • OS (e.g: cat /etc/os-release):
  • Kernel (e.g. uname -a): Linux ip-172-31-48-224 5.4.0-1038-aws #40-Ubuntu SMP Fri Feb 5 23:50:40 UTC 2021 x86_64 x86_64 x86_64 GNU/Linux
  • Install tools:
  • Network plugin and version (if this is a network-related bug):
  • Others:

About this issue

  • State: closed
  • Created 3 years ago
  • Reactions: 6
  • Comments: 20 (4 by maintainers)

Most upvoted comments

Probably still needed/valid. Please remove stale/rotten label.

After this is implemented I’ll be able to associate an OOM with the pod where it happened, right? If so, that would be fantastic! With the current OOM description:

SystemOOM: System OOM encountered, victim process: celery, pid: 16858

I have no idea how to correlate a process ID with the culprit pod. Normally I’d just look at memory usage or restart count, but brief memory spikes might not be logged and, due to https://github.com/kubernetes/kubernetes/issues/50632, the pod might not be restarted either.

Other people have had the same difficulty: https://stackoverflow.com/questions/58749290/process-inside-pod-is-oomkilled-even-though-pod-limits-not-reached
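
For illustration, a hypothetical best-effort sketch of the only correlation available today from the Node-level event: reading the victim’s cgroup from /proc. In practice this usually fails because the process has already been reaped, which is exactly why having the kubelet attribute the event to the Pod is the better fix.

```go
// Hypothetical best-effort correlation from the Node-level SystemOOM event.
package main

import (
	"fmt"
	"os"
)

func main() {
	const victimPID = 16858 // PID taken from the SystemOOM event message
	data, err := os.ReadFile(fmt.Sprintf("/proc/%d/cgroup", victimPID))
	if err != nil {
		// The victim is usually already gone by the time the event is seen.
		fmt.Println("victim already gone; cannot recover the owning pod:", err)
		return
	}
	// Lines look like "0::/kubepods.slice/.../kubepods-...-pod<uid>.slice/...";
	// the pod UID embedded in the path identifies the culprit pod.
	fmt.Print(string(data))
}
```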

The issue of silent pod-process OOM kills is a mess and comes back to bite us over and over again: a crucial pod process that is not PID 1 dies and services go offline completely silently. This happens with Selenium, Spark Operator, any JFrog chart, Couchbase and so on, just to name a few.

  1. It is very hard to identify what the problem is and where it occurred.
  2. Kubernetes cannot recover on its own. Say a Pod goes OOM this way every 2 days: a Pod restart would keep the service up, but instead it just keeps malfunctioning until someone gets their hands dirty figuring out the root cause.

Thanks for the clarification. In that case I think it makes sense to update this issue and the associated PR to be a feature request instead of a bug, which would cover some of the criteria in #69676, specifically logging in the kubelet and emitting the event to the Pod resource.