kubernetes: Pod pending in PodInitializing state after Init container OOMKilled

What happened?

We recently upgraded from 1.22 to 1.23 and started to experience sporadic stuck pods. This seems to happen when one of a pod's init containers exits with Reason: OOMKilled (and Exit Code: 0). We have restartPolicy set to `Never`.

What did you expect to happen?

The pod should be evicted.

How can we reproduce it (as minimally and precisely as possible)?

As noted above, the issue is sporadic and happens when there is high memory load on the node (but not always).
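A minimal manifest that might exercise this path (a sketch, not a confirmed reproducer — the pod name, image, limit, and allocation command below are all placeholders, not from the issue): an init container with a tight memory limit that allocates past it, in a pod with `restartPolicy: Never`. Since the bug is reported to trigger under node memory pressure, the OOM kill alone may not be sufficient to reproduce the inconsistent status.

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: oom-init-repro          # hypothetical name
spec:
  restartPolicy: Never
  initContainers:
    - name: oom-init
      image: python:3.11-slim   # placeholder image
      # Allocate well past the memory limit so the kernel OOM-kills the process.
      command: ["python", "-c", "x = bytearray(512 * 1024 * 1024)"]
      resources:
        limits:
          memory: 64Mi
  containers:
    - name: main
      image: busybox
      command: ["sleep", "10"]
```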

Anything else we need to know?

Here is a sanitized pod description of a stuck pod:

Name:         wf-3365-1282933487
Namespace:    default
Priority:     0
Node:         gke-producti-m128-c16-sa-work-e7608549-fq5h/10.132.0.139
Start Time:   Tue, 14 Mar 2023 23:42:54 +0200
Labels:       workflows.argoproj.io/completed=false
              workflows.argoproj.io/workflow=wf-3365
Annotations:  kubectl.kubernetes.io/default-container: main
              kubernetes.io/limit-ranger: LimitRanger plugin set: cpu request for container wait; cpu request for container main; cpu request for init container init; cpu request f…
              workflows.argoproj.io/node-id: wf-3365-1282933487
              workflows.argoproj.io/node-name: wf-3365(0)[0].wf-step0(0)
Status:       Pending
IP:           10.4.29.112
IPs:
  IP:  10.4.29.112
Controlled By:  Workflow/wf-3365
Init Containers:
  init:
    Container ID:  containerd://c9877…
    Image:         quay.io/argoproj/argoexec:v3.4.4
    Image ID:      quay.io/argoproj/argoexec@sha256:…
    Port:          <none>
    Host Port:     <none>
    Command:
      argoexec
      init
      --loglevel
      info
      --log-format
      text
    State:          Terminated
      Reason:       Completed
      Exit Code:    0
      Started:      Tue, 14 Mar 2023 23:42:55 +0200
      Finished:     Tue, 14 Mar 2023 23:42:55 +0200
    Ready:          True
    Restart Count:  0
    Requests:
      cpu:  100m
    Environment:
      ARGO_POD_NAME:                      wf-3365-1282933487 (v1:metadata.name)
      ARGO_POD_UID:                       (v1:metadata.uid)
      GODEBUG:                            x509ignoreCN=0
      ARGO_WORKFLOW_NAME:                 wf-3365
      ARGO_CONTAINER_NAME:                init
      ARGO_TEMPLATE:                      …
      ARGO_NODE_ID:                       wf-3365-1282933487
      ARGO_INCLUDE_SCRIPT_OUTPUT:         false
      ARGO_DEADLINE:                      2023-03-16T21:42:53Z
      ARGO_PROGRESS_FILE:                 /var/run/argo/progress
      ARGO_PROGRESS_PATCH_TICK_DURATION:  1m0s
      ARGO_PROGRESS_FILE_TICK_DURATION:   3s
    Mounts:
      …
  preparedata:
    Container ID:  containerd://9f26…
    Image:         google/cloud-sdk:latest
    Image ID:      docker.io/google/cloud-sdk@…
    Port:          <none>
    Host Port:     <none>
    Command:
      bash
      -c
    Args:
      … (a bunch of shell commands)
    State:          Terminated
      Reason:       OOMKilled
      Exit Code:    0
      Started:      Tue, 14 Mar 2023 23:42:59 +0200
      Finished:     Tue, 14 Mar 2023 23:43:58 +0200
    Ready:          True
    Restart Count:  0
    Requests:
      cpu:  100m
    Environment:
      ARGO_CONTAINER_NAME:                preparedata
      ARGO_TEMPLATE:                      ...
      ARGO_NODE_ID:                       wf-3365-1282933487
      ARGO_INCLUDE_SCRIPT_OUTPUT:         false
      ARGO_DEADLINE:                      2023-03-16T21:42:53Z
      ARGO_PROGRESS_FILE:                 /var/run/argo/progress
      ARGO_PROGRESS_PATCH_TICK_DURATION:  1m0s
      ARGO_PROGRESS_FILE_TICK_DURATION:   3s
    Mounts:
      ...

Containers:
  wait:
    Container ID:
    Image:         quay.io/argoproj/argoexec:v3.4.4
    Image ID:
    Port:          <none>
    Host Port:     <none>
    Command:
      argoexec
      wait
      --loglevel
      info
      --log-format
      text
    State:          Waiting
      Reason:       PodInitializing
    Ready:          False
    Restart Count:  0
    Requests:
      cpu:  100m
    Environment:
      ARGO_POD_NAME:                      wf-3365-1282933487 (v1:metadata.name)
      ARGO_POD_UID:                       (v1:metadata.uid)
      GODEBUG:                            x509ignoreCN=0
      ARGO_WORKFLOW_NAME:                 wf-3365
      ARGO_CONTAINER_NAME:                wait
      ARGO_TEMPLATE:                      …
      ARGO_NODE_ID:                       wf-3365-1282933487
      ARGO_INCLUDE_SCRIPT_OUTPUT:         false
      ARGO_DEADLINE:                      2023-03-16T21:42:53Z
      ARGO_PROGRESS_FILE:                 /var/run/argo/progress
      ARGO_PROGRESS_PATCH_TICK_DURATION:  1m0s
      ARGO_PROGRESS_FILE_TICK_DURATION:   3s
    Mounts:
      …
  main:
    Container ID:
    Image:         eu.gcr.io/…
    Image ID:
    Port:          <none>
    Host Port:     <none>
    Command:
      /var/run/argo/argoexec
      emissary
      --loglevel
      info
      --log-format
      text
      --
      sh
      -c
    Args:
      …
    State:          Waiting
      Reason:       PodInitializing
    Ready:          False
    Restart Count:  0
    Requests:
      cpu:     100m
      memory:  020Gi
    Environment Variables from:
      cr-configmap  ConfigMap  Optional: false
    Environment:
      CLIENT_ID:                          ...
      CLIENT_SECRET:                      ...
      CLIENT:                             Machine
      TENANT:                             Production
      SERVER_URL:                         ...
      SERVICE_ACCOUNT:                    ...
      GCS_AUTH_FILE:                      ...
      CREATED_BY_WORKFLOW:                wf-3365
      GOOGLE_APPLICATION_CREDENTIALS:     ...
      DATA_ACCESS:                        test
      POD_NAME:                           wf-3365-1282933487
      CURRENT_RETRY:                      0
      ARGO_CONTAINER_NAME:                main
      ARGO_TEMPLATE:                      ...
      ARGO_NODE_ID:                       wf-3365-1282933487
      ARGO_INCLUDE_SCRIPT_OUTPUT:         false
      ARGO_DEADLINE:                      2023-03-16T21:42:53Z
      ARGO_PROGRESS_FILE:                 /var/run/argo/progress
      ARGO_PROGRESS_PATCH_TICK_DURATION:  1m0s
      ARGO_PROGRESS_FILE_TICK_DURATION:   3s
    Mounts:
      ...

Conditions:
  Type              Status
  Initialized       True
  Ready             False
  ContainersReady   False
  PodScheduled      True
Volumes:
  var-run-argo:
    Type:       EmptyDir (a temporary directory that shares a pod's lifetime)
    Medium:
    SizeLimit:  <unset>
  tmp-dir-argo:
    Type:       EmptyDir (a temporary directory that shares a pod's lifetime)
    Medium:
    SizeLimit:  <unset>
  el-vol:
    Type:       EmptyDir (a temporary directory that shares a pod's lifetime)
    Medium:
    SizeLimit:  <unset>
  storage-credentials:
    Type:        Secret (a volume populated by a Secret)
    SecretName:  storage-secrets
    Optional:    false
  storage-secrets:
    Type:        Secret (a volume populated by a Secret)
    SecretName:  storage-secrets
    Optional:    false
  kube-api-access-pdhls:
    Type:                    Projected (a volume that contains injected data from multiple sources)
    TokenExpirationSeconds:  3607
    ConfigMapName:           kube-root-ca.crt
    ConfigMapOptional:       <nil>
    DownwardAPI:             true
QoS Class:       Burstable
Node-Selectors:  nodetype=WORKFLOW
Tolerations:     WORKFLOW=BASIC:NoSchedule
                 node.kubernetes.io/not-ready:NoExecute op=Exists for 300s
                 node.kubernetes.io/unreachable:NoExecute op=Exists for 300s
Events:          <none>

I see a lot of these messages in the kubelet log: `"No ready sandbox for pod can be found. Need to start a new one" pod="default/wf-3365-1282933487"`
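For anyone hitting this, a sketch of how one might detect affected pods from `kubectl get pod -o json` output (the helper name and the sample status below are illustrative, not from the issue): the telltale signature is an init container terminated with reason `OOMKilled` but exit code `0`.

```python
import json

def stuck_init_statuses(pod: dict) -> list:
    """Return names of init containers reporting OOMKilled with exit code 0,
    the inconsistent status described in this issue."""
    bad = []
    for cs in pod.get("status", {}).get("initContainerStatuses", []):
        term = cs.get("state", {}).get("terminated")
        if term and term.get("reason") == "OOMKilled" and term.get("exitCode") == 0:
            bad.append(cs["name"])
    return bad

# Minimal sample mimicking the sanitized status above (fields abbreviated).
sample = json.loads("""
{
  "status": {
    "phase": "Pending",
    "initContainerStatuses": [
      {"name": "init",
       "state": {"terminated": {"reason": "Completed", "exitCode": 0}}},
      {"name": "preparedata",
       "state": {"terminated": {"reason": "OOMKilled", "exitCode": 0}}}
    ]
  }
}
""")

print(stuck_init_statuses(sample))  # ['preparedata']
```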

Kubernetes version

1.23.14-gke.1800

Cloud provider

GCP

OS version


Install tools

Container runtime (CRI) and version (if applicable)

Related plugins (CNI, CSI, …) and versions (if applicable)

About this issue

  • Original URL
  • State: open
  • Created a year ago
  • Reactions: 2
  • Comments: 20 (15 by maintainers)

Most upvoted comments

Actually, I can only repro this issue on 1.26 (with containerd 1.6.18 and 1.7.0), but cannot repro it on 1.27.1 (with containerd 1.7.0 and 1.6.9).

On 1.27.1, the exit code of the init container is set correctly to 137:

Status:       Failed
...
    State:          Terminated
      Reason:       OOMKilled
      Exit Code:    137

whereas on 1.26, the exit code is incorrectly set to 0:

Status:       Pending
...
Init Containers:
...
    State:          Terminated
      Reason:       OOMKilled
      Exit Code:    0
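For context (background fact, not from the issue itself): 137 is the conventional exit code for a process terminated by a signal, 128 plus the signal number, and the kernel OOM killer uses SIGKILL (9). That is why 137 is the expected value and 0 on an OOMKilled container is clearly inconsistent. A quick check:

```python
import signal

# Exit code convention for signal-terminated processes: 128 + signo.
# The OOM killer sends SIGKILL (signal 9), so an OOMKilled container
# should report exit code 137, never 0.
OOM_EXIT_CODE = 128 + int(signal.SIGKILL)

print(OOM_EXIT_CODE)  # 137
```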

Synced with @bobbypage; this could be related to https://github.com/kubernetes/kubernetes/pull/115331 in 1.27.

So @yonirab and @EladProject - can you try 1.27 and see if you still have the same issue?