argo-workflows: Pod OOM sometimes causes a workflow to get stuck in "Running" state
Checklist
- Double-checked my configuration.
- Tested using the latest version.
- Used the Emissary executor.
Summary
This follows up on #7457.
What happened/what you expected to happen?
A small percentage of our pods fail with an OOMKilled status and should be restarted (per our workflow config) but aren’t. The pod (and workflow) hangs around indefinitely without appearing to do anything. It seems like the controller loses track of the pod until it’s either manually deleted or its hosting node is shut down. If either of those things happens, then the pod is restarted and the workflow usually succeeds.
It’s expected that sometimes we’ll run out of memory when running this workload. What’s surprising is that most OOMKilled pods restart automatically, but some don’t.
For a sense of scale: we run roughly 90,000 workflows per day, and a few dozen of them get stuck in this state and have to be fixed manually.
The expected behavior is that Argo restarts the OOMKilled pod automatically until the retry limit is reached.
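For completeness, the manual fix we apply today is just deleting the stuck pod, which prompts the controller to re-reconcile and retry the node. A minimal sketch (the pod name is illustrative):

```bash
# Manual workaround: delete the stuck pod; the controller then recreates it
# and the retried workflow usually succeeds.
kubectl -n argo-workflows delete pod my-workflow-8acc183e-ffrf2-1452349024
```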
What version are you running?
3.3.2 for both the controller and executor.
Diagnostics
We’re running on EKS 1.21 and use cluster-autoscaler. I don’t think the problem is related to scaling, though, since we see the issue in pools of one node as well as pools of many nodes.
The stuck pods look like this:
$ kubectl -n argo-workflows get po | grep OOMKilled
my-workflow-8acc183e-ffrf2-1452349024 1/2 OOMKilled 0 38m
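To find all affected pods in bulk, we use a filter along these lines (a sketch; it assumes jq is installed and that the user container is named main, as in Argo-created pods):

```bash
# List pods still in phase Running whose main container was OOMKilled.
kubectl -n argo-workflows get pods -o json \
  | jq -r '.items[]
      | select(.status.phase == "Running"
          and any(.status.containerStatuses[]?;
                  .name == "main" and .state.terminated.reason == "OOMKilled"))
      | .metadata.name'
```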
Paste the smallest workflow that reproduces the bug. We must be able to run the workflow.
The workflow below matches our config as closely as possible (TTL, retries, etc.); the resource limits are slightly lower. Run it enough times and the bug should reproduce (see the submit loop after the spec).
```yaml
apiVersion: argoproj.io/v1alpha1
kind: WorkflowTemplate
metadata:
  namespace: argo-workflows
  name: oom-test
spec:
  ttlStrategy:
    secondsAfterSuccess: 60
    secondsAfterFailure: 300
  entrypoint: oom-test
  templates:
    - name: oom-test
      retryStrategy:
        limit: "3"
        retryPolicy: Always
      container:
        image: ruby:3.0-alpine
        imagePullPolicy: IfNotPresent
        command: ["ruby", "-e", "foo = []; while true; foo << \"abcdefgh\"; end"]
        resources:
          limits:
            cpu: 500m
            memory: 128Mi
```
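To reproduce, we submit the template in a loop until one of the runs gets stuck. A sketch using the argo CLI (the iteration count is arbitrary; only a small fraction of OOMKilled runs hit the bug):

```bash
# Submit the template repeatedly; most runs are OOMKilled and retried
# normally, but occasionally one sticks in Running.
for i in $(seq 1 100); do
  argo submit -n argo-workflows --from workflowtemplate/oom-test
done
```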
# Logs from the workflow controller:
$ kubectl logs -n argo deploy/workflow-controller | grep ${workflow}
Found 3 pods, using pod/workflow-controller-6dd5db7b67-7wbkr
time="2022-04-22T19:21:28.908Z" level=info msg="Processing workflow" namespace=argo-workflows workflow=my-workflow-8acc183e-ffrf2
time="2022-04-22T19:21:28.908Z" level=info msg="Task-result reconciliation" namespace=argo-workflows numObjs=0 workflow=my-workflow-8acc183e-ffrf2
time="2022-04-22T19:21:28.908Z" level=info msg="node unchanged" nodeID=my-workflow-8acc183e-ffrf2-1452349024
time="2022-04-22T19:21:28.909Z" level=info msg="TaskSet Reconciliation" namespace=argo-workflows workflow=my-workflow-8acc183e-ffrf2
time="2022-04-22T19:21:28.909Z" level=info msg=reconcileAgentPod namespace=argo-workflows workflow=my-workflow-8acc183e-ffrf2
# The workflow's pods that are problematic:
$ kubectl -n argo-workflows describe po my-workflow-8acc183e-ffrf2-1452349024
Name: my-workflow-8acc183e-ffrf2-1452349024
Namespace: argo-workflows
Priority: 0
Node: ip-10-161-69-226.ec2.internal/10.161.69.226
Start Time: Fri, 22 Apr 2022 14:39:19 -0400
Labels: workflows.argoproj.io/completed=false
workflows.argoproj.io/workflow=my-workflow-8acc183e-ffrf2
Annotations: cluster-autoscaler.kubernetes.io/safe-to-evict: true
cni.projectcalico.org/containerID: 4295e8d5db4862b7ec687add0197b23e86795f2cc52885eb279976b00a64d98c
cni.projectcalico.org/podIP: 192.168.80.248/32
cni.projectcalico.org/podIPs: 192.168.80.248/32
kubectl.kubernetes.io/default-container: main
kubernetes.io/psp: eks.privileged
workflows.argoproj.io/node-id: my-workflow-8acc183e-ffrf2-1452349024
workflows.argoproj.io/node-name: my-workflow-8acc183e-ffrf2(0)
Status: Running
IP: 192.168.80.248
IPs:
IP: 192.168.80.248
Controlled By: Workflow/my-workflow-8acc183e-ffrf2
Init Containers:
init:
Container ID: containerd://c7096309368e2dfc3ec809ec7dfbe9f5c164b3780b7fe0fe125023efddf93d1b
Image: quay.io/argoproj/argoexec:v3.3.2
Image ID: quay.io/argoproj/argoexec@sha256:817a790f1bbd2fd31b71c2126e5489e06b49a98936352e4851fd0f325e3332c3
Port: <none>
Host Port: <none>
Command:
argoexec
init
--loglevel
info
State: Terminated
Reason: Completed
Exit Code: 0
Started: Fri, 22 Apr 2022 14:39:48 -0400
Finished: Fri, 22 Apr 2022 14:39:48 -0400
Ready: True
Restart Count: 0
Environment:
ARGO_POD_NAME: my-workflow-8acc183e-ffrf2-1452349024 (v1:metadata.name)
ARGO_POD_UID: (v1:metadata.uid)
ARGO_CONTAINER_RUNTIME_EXECUTOR: emissary
GODEBUG: x509ignoreCN=0
ARGO_WORKFLOW_NAME: my-workflow-8acc183e-ffrf2
ARGO_CONTAINER_NAME: init
ARGO_TEMPLATE: ...
ARGO_NODE_ID: my-workflow-8acc183e-ffrf2-1452349024
ARGO_INCLUDE_SCRIPT_OUTPUT: false
ARGO_DEADLINE: 0001-01-01T00:00:00Z
ARGO_PROGRESS_FILE: /var/run/argo/progress
ARGO_PROGRESS_PATCH_TICK_DURATION: 1m0s
ARGO_PROGRESS_FILE_TICK_DURATION: 3s
Mounts:
/var/run/argo from var-run-argo (rw)
/var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-p6tnt (ro)
Containers:
wait:
Container ID: containerd://51b5e17ea47c919f6d4cc60ef359587c1da93d3e007b23312e43637c88abb78e
Image: quay.io/argoproj/argoexec:v3.3.2
Image ID: quay.io/argoproj/argoexec@sha256:817a790f1bbd2fd31b71c2126e5489e06b49a98936352e4851fd0f325e3332c3
Port: <none>
Host Port: <none>
Command:
argoexec
wait
--loglevel
info
State: Running
Started: Fri, 22 Apr 2022 14:39:55 -0400
Ready: True
Restart Count: 0
Environment:
ARGO_POD_NAME: my-workflow-8acc183e-ffrf2-1452349024 (v1:metadata.name)
ARGO_POD_UID: (v1:metadata.uid)
ARGO_CONTAINER_RUNTIME_EXECUTOR: emissary
GODEBUG: x509ignoreCN=0
ARGO_WORKFLOW_NAME: my-workflow-8acc183e-ffrf2
ARGO_CONTAINER_NAME: wait
ARGO_TEMPLATE: ...
ARGO_NODE_ID: my-workflow-8acc183e-ffrf2-1452349024
ARGO_INCLUDE_SCRIPT_OUTPUT: false
ARGO_DEADLINE: 0001-01-01T00:00:00Z
ARGO_PROGRESS_FILE: /var/run/argo/progress
ARGO_PROGRESS_PATCH_TICK_DURATION: 1m0s
ARGO_PROGRESS_FILE_TICK_DURATION: 3s
Mounts:
/mainctrfs/tmp from ephemeral-work-volume (rw)
/var/run/argo from var-run-argo (rw)
/var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-p6tnt (ro)
main:
Container ID: containerd://6abf9a62c12400543e2e358c074af5bbe0249f2781af699da2025fd19dee61f1
Image: ....dkr.ecr.us-east-1.amazonaws.com/argo-workflow-images:my-app-my-workflow-default-8acc183e
Image ID: ....dkr.ecr.us-east-1.amazonaws.com/argo-workflow-images@sha256:8bc8de084174cd09b7f9271e0468416d8354057f5227dea72028c656e79be981
Port: <none>
Host Port: <none>
Command:
/var/run/argo/argoexec
emissary
--
...
Args:
...
State: Terminated
Reason: OOMKilled
Exit Code: 137
Started: Fri, 22 Apr 2022 14:39:55 -0400
Finished: Fri, 22 Apr 2022 14:41:01 -0400
Ready: False
Restart Count: 0
Limits:
cpu: 500m
memory: 512Mi
Requests:
cpu: 250m
memory: 512Mi
Environment:
AWS_REGION: us-east-1
AWS_ACCOUNT_NUMBER: ...
AWS_SCHEME: https://
ENVIRONMENT: production
LOG_LEVEL: info
RUNNING_ON_ARGO_WORKER: true
ARGO_CONTAINER_NAME: main
ARGO_TEMPLATE: [...]
ARGO_NODE_ID: my-workflow-8acc183e-ffrf2-1452349024
ARGO_INCLUDE_SCRIPT_OUTPUT: false
ARGO_DEADLINE: 0001-01-01T00:00:00Z
ARGO_PROGRESS_FILE: /var/run/argo/progress
ARGO_PROGRESS_PATCH_TICK_DURATION: 1m0s
ARGO_PROGRESS_FILE_TICK_DURATION: 3s
Mounts:
/tmp from ephemeral-work-volume (rw)
/var/run/argo from var-run-argo (rw)
/var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-p6tnt (ro)
Conditions:
Type Status
Initialized True
Ready False
ContainersReady False
PodScheduled True
Volumes:
var-run-argo:
Type: EmptyDir (a temporary directory that shares a pod's lifetime)
Medium:
SizeLimit: <unset>
ephemeral-work-volume:
Type: EmptyDir (a temporary directory that shares a pod's lifetime)
Medium:
SizeLimit: <unset>
kube-api-access-p6tnt:
Type: Projected (a volume that contains injected data from multiple sources)
TokenExpirationSeconds: 3607
ConfigMapName: kube-root-ca.crt
ConfigMapOptional: <nil>
DownwardAPI: true
QoS Class: Burstable
Node-Selectors: ...
Tolerations: node.kubernetes.io/not-ready:NoExecute op=Exists for 300s
node.kubernetes.io/unreachable:NoExecute op=Exists for 300s
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Normal Scheduled 39m default-scheduler Successfully assigned argo-workflows/my-workflow-8acc183e-ffrf2-1452349024 to ip-10-161-69-226.ec2.internal
Normal Pulled 39m kubelet Container image "quay.io/argoproj/argoexec:v3.3.2" already present on machine
Normal Created 39m kubelet Created container init
Normal Started 39m kubelet Started container init
Normal Pulled 39m kubelet Container image "quay.io/argoproj/argoexec:v3.3.2" already present on machine
Normal Created 39m kubelet Created container wait
Normal Started 39m kubelet Started container wait
Normal Pulled 39m kubelet Container image "....dkr.ecr.us-east-1.amazonaws.com/argo-workflow-images:my-app-my-workflow-default-8acc183e" already present on machine
Normal Created 39m kubelet Created container main
Normal Started 39m kubelet Started container main
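The mismatch is also visible directly in the pod status: the phase stays Running even though the main container has terminated. A quick jsonpath check (sketch):

```bash
# Compare the pod phase with the main container's terminated reason.
kubectl -n argo-workflows get pod my-workflow-8acc183e-ffrf2-1452349024 \
  -o jsonpath='{.status.phase}{"\t"}{.status.containerStatuses[?(@.name=="main")].state.terminated.reason}{"\n"}'
# For a stuck pod this prints: Running  OOMKilled
```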
# Logs from the workflow's wait container:
$ kubectl -n argo-workflows logs my-workflow-8acc183e-ffrf2-1452349024 wait
time="2022-04-22T18:39:55.721Z" level=info msg="Starting Workflow Executor" executorType=emissary version=v3.3.2
time="2022-04-22T18:39:55.724Z" level=info msg="Creating a emissary executor"
time="2022-04-22T18:39:55.725Z" level=info msg="Using executor retry strategy" Duration=1s Factor=1.6 Jitter=0.5 Steps=5
time="2022-04-22T18:39:55.725Z" level=info msg="Executor initialized" deadline="0001-01-01 00:00:00 +0000 UTC" includeScriptOutput=false namespace=argo-workflows podName=my-workflow-8acc183e-ffrf2-1452349024 template="..."
time="2022-04-22T18:39:55.725Z" level=info msg="Starting deadline monitor"
time="2022-04-22T18:44:55.726Z" level=info msg="Alloc=4954 TotalAlloc=10471 Sys=19922 NumGC=5 Goroutines=7"
time="2022-04-22T18:49:55.726Z" level=info msg="Alloc=4976 TotalAlloc=10582 Sys=19922 NumGC=7 Goroutines=7"
time="2022-04-22T18:54:55.726Z" level=info msg="Alloc=4957 TotalAlloc=10692 Sys=19922 NumGC=10 Goroutines=7"
time="2022-04-22T18:59:55.726Z" level=info msg="Alloc=4978 TotalAlloc=10800 Sys=19922 NumGC=12 Goroutines=7"
time="2022-04-22T19:04:55.726Z" level=info msg="Alloc=4958 TotalAlloc=10909 Sys=19922 NumGC=15 Goroutines=7"
time="2022-04-22T19:09:55.726Z" level=info msg="Alloc=4979 TotalAlloc=11017 Sys=19922 NumGC=17 Goroutines=7"
time="2022-04-22T19:14:55.726Z" level=info msg="Alloc=4959 TotalAlloc=11126 Sys=19922 NumGC=20 Goroutines=7"
time="2022-04-22T19:19:55.725Z" level=info msg="Alloc=4979 TotalAlloc=11235 Sys=19922 NumGC=22 Goroutines=7"
time="2022-04-22T19:24:55.726Z" level=info msg="Alloc=4959 TotalAlloc=11343 Sys=19922 NumGC=25 Goroutines=7"
time="2022-04-22T19:29:55.726Z" level=info msg="Alloc=4979 TotalAlloc=11451 Sys=19922 NumGC=27 Goroutines=7"
time="2022-04-22T19:34:55.725Z" level=info msg="Alloc=4959 TotalAlloc=11560 Sys=19922 NumGC=30 Goroutines=7"
Message from the maintainers:
Impacted by this bug? Give it a 👍. We prioritise the issues with the most 👍.
About this issue
- State: closed
- Created 2 years ago
- Reactions: 3
- Comments: 19 (6 by maintainers)
Commits related to this issue
- fix: Pod `OOMKilled` should failed workflow. Fixes #8456 Signed-off-by: Alex Collins <alex_collins@intuit.com> — committed to alexec/argo-workflows by alexec 2 years ago
- fix: Pod `OOMKilled` should fail workflow. Fixes #8456 (#8478) Signed-off-by: Alex Collins <alex_collins@intuit.com> — committed to argoproj/argo-workflows by alexec 2 years ago
- Revert "fix: Pod `OOMKilled` should fail workflow. Fixes #8456 (#8478)" This reverts commit f2b075c29ee97c95cfebb453b18c0ce5f16a5f04. Signed-off-by: Alex Collins <alex_collins@intuit.com> — committed to argoproj/argo-workflows by alexec 2 years ago
Tip of master is published as `:latest`. You can run that.