kubernetes: Pod pending in PodInitializing state after Init container OOMKilled
What happened?
We recently upgraded to 1.23 (from 1.22) and started to experience sporadic stuck pods. This seems to happen when one of a pod's init containers is terminated with reason OOMKilled but an exit code of 0. We have restartPolicy set to `Never`.
What did you expect to happen?
The pod would be evicted (i.e. marked Failed), since an init container was OOM-killed and restartPolicy is `Never`.
How can we reproduce it (as minimally and precisely as possible)?
As noted above, the issue is sporadic: it happens when there is high memory load on the node, but not always.
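For context, here is a minimal, hypothetical sketch of the kind of pod we believe triggers this (illustrative names, image, and limits; not our actual Argo Workflows pod), under the assumption that any init container OOM-killed with restartPolicy `Never` can hit the same path. Because the problem is load-dependent, this may not reproduce reliably:

```bash
# Hypothetical repro sketch: an init container with a tight memory limit that
# allocates memory until the kernel OOM-kills it, in a pod that never restarts.
kubectl apply -f - <<'EOF'
apiVersion: v1
kind: Pod
metadata:
  name: oom-init-repro
spec:
  restartPolicy: Never
  initContainers:
  - name: init-oom
    image: busybox:1.36
    # `tail /dev/zero` buffers an endless "line" in memory and exceeds the limit.
    command: ["sh", "-c", "tail /dev/zero"]
    resources:
      requests:
        cpu: 100m
        memory: 64Mi
      limits:
        memory: 64Mi
  containers:
  - name: main
    image: busybox:1.36
    command: ["sh", "-c", "echo main should never start; sleep 60"]
EOF

# Watch whether the pod goes to Failed (expected) or stays stuck in
# Init/PodInitializing with the init container reported as OOMKilled, exit code 0.
kubectl get pod oom-init-repro -w
```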
Anything else we need to know?
Here is a sanitized pod description of a stuck pod:

```
Name: wf-3365-1282933487
Namespace: default
Priority: 0
Node: gke-producti-m128-c16-sa-work-e7608549-fq5h/10.132.0.139
Start Time: Tue, 14 Mar 2023 23:42:54 +0200
Labels: workflows.argoproj.io/completed=false
        workflows.argoproj.io/workflow=wf-3365
Annotations: kubectl.kubernetes.io/default-container: main
             kubernetes.io/limit-ranger: LimitRanger plugin set: cpu request for container wait; cpu request for container main; cpu request for init container init; cpu request f…
             workflows.argoproj.io/node-id: wf-3365-1282933487
             workflows.argoproj.io/node-name: wf-3365(0)[0].wf-step0(0)
Status: Pending
IP: 10.4.29.112
IPs:
  IP: 10.4.29.112
Controlled By: Workflow/wf-3365
Init Containers:
init:
Container ID: containerd://c9877…
Image: quay.io/argoproj/argoexec:v3.4.4
Image ID: quay.io/argoproj/argoexec@sha256:…
Port: <none>
Host Port: <none>
Command: argoexec init --loglevel info --log-format text
State: Terminated
Reason: Completed
Exit Code: 0
Started: Tue, 14 Mar 2023 23:42:55 +0200
Finished: Tue, 14 Mar 2023 23:42:55 +0200
Ready: True
Restart Count: 0
Requests:
cpu: 100m
Environment:
ARGO_POD_NAME: wf-3365-1282933487 (v1:metadata.name)
ARGO_POD_UID: (v1:metadata.uid)
GODEBUG: x509ignoreCN=0
ARGO_WORKFLOW_NAME: wf-3365
ARGO_CONTAINER_NAME: init
ARGO_TEMPLATE: …
ARGO_NODE_ID: wf-3365-1282933487
ARGO_INCLUDE_SCRIPT_OUTPUT: false
ARGO_DEADLINE: 2023-03-16T21:42:53Z
ARGO_PROGRESS_FILE: /var/run/argo/progress
ARGO_PROGRESS_PATCH_TICK_DURATION: 1m0s
ARGO_PROGRESS_FILE_TICK_DURATION: 3s
Mounts:
…
preparedata:
Container ID: containerd://9f26…
Image: google/cloud-sdk:latest
Image ID: docker.io/google/cloud-sdk@…
Port: <none>
Host Port: <none>
Command: bash -c
Args: … A bunch of shell commands
State: Terminated
Reason: OOMKilled
Exit Code: 0
Started: Tue, 14 Mar 2023 23:42:59 +0200
Finished: Tue, 14 Mar 2023 23:43:58 +0200
Ready: True
Restart Count: 0
Requests:
cpu: 100m
Environment:
ARGO_CONTAINER_NAME: preparedata
ARGO_TEMPLATE: ...
ARGO_NODE_ID: wf-3365-1282933487
ARGO_INCLUDE_SCRIPT_OUTPUT: false
ARGO_DEADLINE: 2023-03-16T21:42:53Z
ARGO_PROGRESS_FILE: /var/run/argo/progress
ARGO_PROGRESS_PATCH_TICK_DURATION: 1m0s
ARGO_PROGRESS_FILE_TICK_DURATION: 3s
Mounts:
...
Containers:
wait:
Container ID:
Image: quay.io/argoproj/argoexec:v3.4.4
Image ID:
Port: <none>
Host Port: <none>
Command: argoexec wait --loglevel info --log-format text
State: Waiting
Reason: PodInitializing
Ready: False
Restart Count: 0
Requests:
cpu: 100m
Environment:
ARGO_POD_NAME: wf-3365-1282933487 (v1:metadata.name)
ARGO_POD_UID: (v1:metadata.uid)
GODEBUG: x509ignoreCN=0
ARGO_WORKFLOW_NAME: wf-3365
ARGO_CONTAINER_NAME: wait
ARGO_TEMPLATE: …
ARGO_NODE_ID: wf-3365-1282933487
ARGO_INCLUDE_SCRIPT_OUTPUT: false
ARGO_DEADLINE: 2023-03-16T21:42:53Z
ARGO_PROGRESS_FILE: /var/run/argo/progress
ARGO_PROGRESS_PATCH_TICK_DURATION: 1m0s
ARGO_PROGRESS_FILE_TICK_DURATION: 3s
Mounts:
…
main:
Container ID:
Image: eu.gcr.io/…
Image ID:
Port: <none>
Host Port: <none>
Command: /var/run/argo/argoexec emissary --loglevel info --log-format text -- sh -c
Args: …
State: Waiting
Reason: PodInitializing
Ready: False
Restart Count: 0
Requests:
cpu: 100m
memory: 020Gi
Environment Variables from:
cr-configmap ConfigMap Optional: false
Environment:
CLIENT_ID: ...
CLIENT_SECRET: ...
CLIENT: Machine
TENANT: Production
SERVER_URL: ...
SERVICE_ACCOUNT: ...
GCS_AUTH_FILE: ...
CREATED_BY_WORKFLOW: wf-3365
GOOGLE_APPLICATION_CREDENTIALS: ...
DATA_ACCESS: test
POD_NAME: wf-3365-1282933487
CURRENT_RETRY: 0
ARGO_CONTAINER_NAME: main
ARGO_TEMPLATE: ...
ARGO_NODE_ID: wf-3365-1282933487
ARGO_INCLUDE_SCRIPT_OUTPUT: false
ARGO_DEADLINE: 2023-03-16T21:42:53Z
ARGO_PROGRESS_FILE: /var/run/argo/progress
ARGO_PROGRESS_PATCH_TICK_DURATION: 1m0s
ARGO_PROGRESS_FILE_TICK_DURATION: 3s
Mounts:
...
Conditions:
Type              Status
Initialized       True
Ready             False
ContainersReady   False
PodScheduled      True
Volumes:
var-run-argo:
Type: EmptyDir (a temporary directory that shares a pod's lifetime)
Medium:
SizeLimit: <unset>
tmp-dir-argo:
Type: EmptyDir (a temporary directory that shares a pod's lifetime)
Medium:
SizeLimit: <unset>
el-vol:
Type: EmptyDir (a temporary directory that shares a pod's lifetime)
Medium:
SizeLimit: <unset>
storage-credentials:
Type: Secret (a volume populated by a Secret)
SecretName: storage-secrets
Optional: false
storage-secrets:
Type: Secret (a volume populated by a Secret)
SecretName: storage-secrets
Optional: false
kube-api-access-pdhls:
Type: Projected (a volume that contains injected data from multiple sources)
TokenExpirationSeconds: 3607
ConfigMapName: kube-root-ca.crt
ConfigMapOptional: <nil>
DownwardAPI: true
QoS Class: Burstable
Node-Selectors: nodetype=WORKFLOW
Tolerations: WORKFLOW=BASIC:NoSchedule
             node.kubernetes.io/not-ready:NoExecute op=Exists for 300s
             node.kubernetes.io/unreachable:NoExecute op=Exists for 300s
Events: <none>
```
I see a lot of these messages in the kubelet log:
`"No ready sandbox for pod can be found. Need to start a new one" pod="default/wf-3365-1282933487"`
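In case it helps, the combination that seems to characterize a stuck pod can be read straight from the pod status. Something like the following (using the pod from the description above) prints the terminated reason and exit code the kubelet recorded for each init container:

```bash
# Print name, terminated reason, and exit code for every init container of the
# stuck pod; on affected pods, preparedata shows OOMKilled with exit code 0.
kubectl get pod wf-3365-1282933487 -n default \
  -o jsonpath='{range .status.initContainerStatuses[*]}{.name}{"\t"}{.state.terminated.reason}{"\t"}{.state.terminated.exitCode}{"\n"}{end}'
```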
Kubernetes version
1.23.14-gke.1800
Cloud provider
GKE (Google Kubernetes Engine)
OS version
Install tools
Container runtime (CRI) and version (if applicable)
Related plugins (CNI, CSI, …) and versions (if applicable)
About this issue
- Original URL
- State: open
- Created a year ago
- Reactions: 2
- Comments: 20 (15 by maintainers)
Actually, I can only repro this issue on 1.26 (with containerd 1.6.18 and 1.7.0), but cannot repro it on 1.27.1 (with containerd 1.7.0 and 1.6.9).
On 1.27.1, the exit code of the init container is correctly set to 137, whereas on 1.26 the exit code is incorrectly set to 0.
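If it helps triage, one way to compare what containerd reports with what ends up in the pod status on an affected node (assuming crictl and jq are available there; the container name is the OOM-killed init container from the description above):

```bash
# On the affected node: find the OOM-killed init container and inspect the exit
# code containerd reports for it, to compare against the 0 in the pod status.
crictl ps -a | grep preparedata
crictl inspect <container-id> | jq '.status | {reason, exitCode}'
```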
Synced with @bobbypage, this could be related to https://github.com/kubernetes/kubernetes/pull/115331 in 1.27.
So @yonirab and @EladProject - can you try 1.27 and see if you still have the same issue?