kubernetes: Log something about OOMKilled containers

Is this a BUG REPORT or FEATURE REQUEST?:

/kind feature

What happened:

Container gets killed because it tries to use more memory than allowed.

What you expected to happen:

Have an OOMKilled event tied to the Pod and a log message about the kill

/sig node

About this issue

  • State: open
  • Created 6 years ago
  • Reactions: 106
  • Comments: 72 (25 by maintainers)

Most upvoted comments

This has been discussed in #sig-instrumentation on Slack and was brought up on the sig-node call yesterday to determine a path forward.

There are two requests:

  1. Have an OOMKilled event tied to the Pod (as noted by @sylr)
  2. Have a count of termination reasons per Pod in the Kubelet (or cAdvisor?), exposed to Prometheus as a monotonically increasing counter

To summarize what’s currently available in kube-state-metrics:

  • kube_pod_container_status_terminated_reason: a (binary) gauge with a value of 1 for the current termination reason and 0 for all other reasons. As soon as the Pod restarts, all reasons go to 0.

  • kube_pod_container_status_last_terminated_reason: the same as above, but for the prior termination reason, so it remains available after the Pod restarts.

  • kube_pod_container_status_restarts_total: a count of the restarts, with no detail on the reason.
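
For reference, the gauges above are typically consumed with queries along these lines (a sketch; the reason label value OOMKilled follows kube-state-metrics conventions):

# 1 while the container's current termination reason is OOMKilled
kube_pod_container_status_terminated_reason{reason="OOMKilled"} == 1

# 1 if the container's last termination reason was OOMKilled
kube_pod_container_status_last_terminated_reason{reason="OOMKilled"} == 1

# Restarts in the last hour, with no reason attached
increase(kube_pod_container_status_restarts_total[1h])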

The issues are:

  1. There is no way to get a count of the reasons over time (for alerting and debugging).
  2. Some termination reasons are never recorded by Prometheus, because the reason can change before the next Prometheus scrape.

For example, given a Pod that is sometimes OOMKilled and sometimes crashing, it is desirable to be able to view the historical termination reasons over time.

As a note: this was discussed, and it appears the design of kube-state-metrics prevents aggregating the reason gauge into counters; it is preferred that this aggregation happen at the source.
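
To make the second request concrete, a source-side counter could look roughly like the following. The metric name here is hypothetical and does not exist today; it is only meant to illustrate the idea:

# Hypothetical kubelet/cAdvisor counter, incremented on every container
# termination and labelled with the termination reason
# (metric name invented purely for illustration):
#   hypothetical_container_terminations_total{namespace, pod, container, reason}
#
# With such a counter, OOM kills over time could be counted and alerted on directly:
sum by (namespace, pod, container) (
  increase(hypothetical_container_terminations_total{reason="OOMKilled"}[1h])
) > 0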

Implementing both of the above requests will significantly improve the ability of cluster users and monitoring vendors to debug why Pods are failing.

Can @kubernetes/sig-node-feature-requests provide some guidance on the next steps here?

CC: @dchen1107

This query combines container restart and termination reason:

sum by (pod, container, reason) (kube_pod_container_status_last_terminated_reason{})
* on (pod,container) group_left
sum by (pod, container) (changes(kube_pod_container_status_restarts_total{}[1m]))

Our team came up with a custom controller to implement the idea of having an OOMKilled event tied to the Pod. Please find it here: https://github.com/xing/kubernetes-oom-event-generator

From the README: The Controller listens to the Kubernetes API for “Container Started” events and searches for those claiming they were OOMKilled previously. For matching ones an Event is generated as Warning with the reason PreviousContainerWasOOMKilled.

We would be very happy to get feedback on it.
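
For reference, the Event produced by such a controller would look roughly like this (a sketch; the names, namespace, and message wording are illustrative, only the type and reason come from the README above):

apiVersion: v1
kind: Event
metadata:
  name: my-pod.previous-oom          # illustrative name
  namespace: default                 # illustrative namespace
type: Warning
reason: PreviousContainerWasOOMKilled
message: Container "app" in Pod "my-pod" was OOMKilled before its last start  # illustrative wording
involvedObject:
  apiVersion: v1
  kind: Pod
  name: my-pod                       # illustrative Pod name
  namespace: default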

Indeed, it seems to work 😃

@brancz do you know why this happens? Also tried it in 1.3.1.

    - alert: OOMKilled
      expr: sum_over_time(kube_pod_container_status_terminated_reason{reason="OOMKilled"}[5m]) > 0
      for: 1m
      labels:
        severity: warning
      annotations:
        description: Pod {{$labels.pod}} in {{$labels.namespace}} got OOMKilled

Now that #87856 is closed, what is the best way to alert on OOMKilled containers?

@lukeschlather #100487 should cover the logging and the OOM event being created for the associated pod that you are asking for.

/remove-lifecycle stale

/remove-lifecycle stale

This query combines container restart and termination reason:

sum by (pod, container, reason) (kube_pod_container_status_last_terminated_reason{})
* on (pod,container) group_left
sum by (pod, container) (changes(kube_pod_container_status_restarts_total{}[1m]))

Thanks, this seems to work fine for my use case:

apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: oom-rules
  namespace: kube-prometheus-stack
spec:
  groups:
  - name: OOMKilled
    rules:
    - alert: OOMKilled
      expr: 'sum by (pod, container, reason, namespace) (kube_pod_container_status_last_terminated_reason{reason="OOMKilled"}) * on (pod,container) group_left
        sum by (pod, container) (changes(kube_pod_container_status_restarts_total{}[1m])) > 0'
      labels:
        severity: warning
      annotations:
        summary: "Container ({{ $labels.container }}) OOMKilled ({{ $labels.namespace }}/{{ $labels.pod }})"

This fires an alert on container OOM events and resolves it again shortly afterwards.
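
If you want the alert to stay active for a while after an OOM kill rather than resolving almost immediately, one option (a sketch along the same lines, not verified here) is to widen the changes() window, for example to 30 minutes:

# Same idea as the rule above, but the result stays non-zero for up to
# 30 minutes after the restart instead of roughly one scrape interval:
sum by (namespace, pod, container, reason) (kube_pod_container_status_last_terminated_reason{reason="OOMKilled"})
* on (namespace, pod, container) group_left
sum by (namespace, pod, container) (changes(kube_pod_container_status_restarts_total{}[30m])) > 0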

/remove-lifecycle stale

Is there a good way of probing for OOMKilled? My use case is that I want to detect OOMs and trigger actions based on them. Thanks!

/remove-lifecycle rotten

I still think this should be more properly addressed.

@bjhaid fwiw you can use mtail against dmesg to produce metrics about oomkill messages.

The problem here is that a pod can disappear with no record of why. A metric is useful in that it lets you know something is wrong, but it doesn’t actually tell you what is wrong. K8s shouldn’t be killing pods without leaving a record, in an obvious place, of which pod it killed and why.

/remove-lifecycle stale

There’s an in-progress PR about this now. https://github.com/kubernetes/kubernetes/pull/87856

@anderson4u2 I am a bit confused by your last comment. You wrote:

just tried kube_pod_container_status_last_terminated_reason in version 1.4.0

But in your example you use kube_pod_container_status_terminated_reason, not kube_pod_container_status_last_terminated_reason.

So as far as I see, the new (very useful) metric kube_pod_container_status_last_terminated_reason is still unreleased.

/remove-lifecycle stale

Is this still relevant after https://github.com/kubernetes/kubernetes/pull/108004? It seems to me that it covers the gaps kube-state-metrics has with OOMKilled events.
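
For anyone looking for a metric-based answer today: if your setup exposes cAdvisor’s container_oom_events_total counter (for example via the kubelet’s /metrics/cadvisor endpoint), OOM kills can be counted over time. A sketch, assuming that metric is available in your cAdvisor version:

# Containers with at least one OOM kill in the last 15 minutes
sum by (namespace, pod, container) (increase(container_oom_events_total[15m])) > 0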

Are the memory requests and limits just cgroups under the hood?

@lukeschlather for the record, the kernel kills pods, not k8s. That’s the whole problem with this issue 😦

Please google “oom kill kernel”.
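
For context on the cgroups question above: the memory values in a Pod spec are translated into cgroup settings on the node, and it is the kernel’s OOM killer acting on the cgroup limit that does the killing. A minimal illustrative spec (names and image are made up):

apiVersion: v1
kind: Pod
metadata:
  name: memory-limited-demo                  # illustrative name
spec:
  containers:
  - name: app                                # illustrative container name
    image: registry.example.com/app:latest   # illustrative image
    resources:
      requests:
        memory: "128Mi"                      # used by the scheduler for placement
      limits:
        memory: "256Mi"                      # enforced as the container's cgroup memory limit;
                                             # exceeding it triggers the kernel OOM killer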

/remove-lifecycle stale

The Kubernetes project currently lacks enough contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue or PR as fresh with /remove-lifecycle stale
  • Mark this issue or PR as rotten with /lifecycle rotten
  • Close this issue or PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

What is the component that actually OOM-kills a container for going over its memory limit? Can that component simply log something? Where would that log go in GKE? The kubernetes apiserver logs? The node logs?

It seems like a lot of the issues related to this one get bogged down in how to deal with pathological cases (stuff getting killed by the kernel rather than simply being killed for going over its limit). Also, I want an event, but if it’s going to be another 2 years before someone can figure out how to properly generate an event, I would settle for logging anything anywhere at all.

What’s the equivalent to looking in dmesg if you’re using a hosted solution like GKE (my actual question) or EKS/AKS?

To the best of my knowledge there is so far no built-in way for GKE.

We are using https://github.com/xing/kubernetes-oom-event-generator in combination with alerting on a metric. Just be aware: this only works if the main process is killed and the Pod gets evicted. If a subprocess (like a gunicorn worker) is killed, you need to rely on the logging of your running application. See e.g. https://github.com/benoitc/gunicorn/pull/2475