kubernetes: Standardization of the OOM kill communication between container runtime and kubelet

Opening the issue to move the discussion here, based on the message to CNCF Tag Runtime and the group meeting discussion.

The motivation comes from the KEP we are working on in Kubernetes: “Retriable and non-retriable Pod failures for Jobs” (https://github.com/kubernetes/enhancements/issues/3329). In particular, for Beta, we are going to add a Pod condition ResourceExhausted to the Pod status whenever a pod’s container is killed by the OOM killer.

Currently, the leading implementations of the CRI set the container’s reason field to OOMKilled:

Containerd (see: https://github.com/containerd/containerd/blob/23f66ece59654ea431700576b6020baffe1a4e49/pkg/cri/server/events.go#L344 and https://github.com/containerd/containerd/blob/36d0cfd0fddb3f2ca4301533a7e7dcf6853dc92c/pkg/cri/server/helpers.go#L62)

and CRI-O (see: https://github.com/cri-o/cri-o/blob/edf889bd277ae9a8aa699c354f12baaef3d9b71d/server/container_status.go#L88-L89).

However, there are two issues with the status quo:

1. The communication between container runtime and Kubelet is not standardized, leaving the reason field an unrestricted CamelCase string (see CRI API: https://github.com/kubernetes/cri-api/blob/3af67d6e7a5160e066444dac2a62e6218d67066b/pkg/apis/runtime/v1/api.proto#L1147).
2. There is no way to determine from within the Kubelet if the container was OOM killed due to exceeding its configured limits or due to the system running low on memory.

We suggest the following solutions (up for discussion):

For the first issue, we suggest extending the documentation of the CRI API field reason, to say that the reason field should be set to OOMKilled if the container is killed due to “OOM killer”. This way we would ensure (in a backwards-compatible way) that the systems which recognize the OOM kill events by observing the reason field equal to OOMKilled would not break in the future.
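To make the backwards-compatibility concern concrete, here is a minimal sketch of how a consumer might recognize an OOM kill today. The containerStatus type and the isOOMKilled helper are hypothetical stand-ins written for illustration; they are not taken from the Kubelet or from the CRI API Go bindings. Only the OOMKilled value itself comes from the runtimes linked above.

```go
package main

import "fmt"

// reasonOOMKilled is the value that containerd and CRI-O currently put in the
// reason field, according to the links in this issue.
const reasonOOMKilled = "OOMKilled"

// containerStatus is a hypothetical stand-in for the CRI container status;
// only the two fields discussed in this issue are modeled.
type containerStatus struct {
	Reason  string
	Message string
}

// isOOMKilled is a hypothetical helper. It uses an exact, case-sensitive
// comparison, which is the kind of check the proposed documentation change
// is meant to keep working.
func isOOMKilled(s containerStatus) bool {
	return s.Reason == reasonOOMKilled
}

func main() {
	fmt.Println(isOOMKilled(containerStatus{Reason: "OOMKilled"}))
	fmt.Println(isOOMKilled(containerStatus{Reason: "Error"}))
}
```

Because the reason field is currently an unrestricted string, nothing but convention stops a runtime from emitting a different spelling, at which point a check like this would silently stop matching.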

For the second issue, we suggest one of the following:

1. Extend the CRI API documentation of the message field so that implementations communicate the information via message in a standard way.
2. Introduce a new dedicated field, such as “oom_reason”.
3. Standardize OOMKilled as a prefix for the OOM kill reasons. In this approach we would introduce a pair of new reasons: OOMKilledNamespaceMemoryExceeded and OOMKilledMemoryPressure. However, this might be risky: the current implementations in containerd and CRI-O have been around for 5 years, so many systems may already depend on the field being exactly equal to OOMKilled.
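The risk with the prefix approach can be sketched as follows. The two helper functions are hypothetical and stand in for existing consumers (exact comparison) and updated consumers (prefix comparison); the two longer reason strings are the ones proposed in this issue and are not emitted by any runtime today.

```go
package main

import (
	"fmt"
	"strings"
)

// exactMatch models a consumer written against today's behavior.
func exactMatch(reason string) bool {
	return reason == "OOMKilled"
}

// prefixMatch models a consumer updated for the proposed prefix convention.
func prefixMatch(reason string) bool {
	return strings.HasPrefix(reason, "OOMKilled")
}

func main() {
	reasons := []string{
		"OOMKilled",
		"OOMKilledNamespaceMemoryExceeded", // proposed in this issue
		"OOMKilledMemoryPressure",          // proposed in this issue
		"Error",
	}
	for _, r := range reasons {
		fmt.Printf("%s exact=%t prefix=%t\n", r, exactMatch(r), prefixMatch(r))
	}
}
```

Under these assumptions, a consumer using the exact comparison would stop detecting OOM kills as soon as a runtime switched to one of the longer reasons, while the other two options leave the reason field unchanged.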

While both issues are related and important from the perspective of our work, we could also consider decoupling them, as the first issue (standardization) has a higher priority and should be just about freezing the status quo. Fixing the second issue may involve substantial work on the side of the container runtime implementations to convey the information.

About this issue

State: closed
Created 2 years ago
Comments: 16 (14 by maintainers)

Most upvoted comments

For the first issue, we suggest extending the documentation of the CRI API field reason, to say that the reason field should be set to OOMKilled if the container is killed due to “OOM killer”.

This is the reality today, and it totally makes sense to officially document it.

standardize the OOMKilled as a prefix for the OOM kill reasons. In this approach we would introduce a pair of new reasons: OOMKilledNamespaceMemoryExceeded and OOMKilledMemoryPressure.

At least based on the containerd API today, it does not distinguish these 2 cases. Is this distinguishable by containerd in theory? https://github.com/containerd/containerd/tree/77d53d2d230c3bcd3f02e6f493019a72905c875b/pkg/oom

Random-Liu on Oct 11, 2022