argo-workflows: Wait container cannot exit if main container is OOMKilled
What happened/what you expected to happen?
The wait container cannot exit if the main container is OOMKilled.
Use case: Start a workflow, let the main container run out of memory, and observe that the wait container does not exit.
Reason: I read the source code and found that the wait container decides whether to exit by checking whether the file /var/run/argo/ctr/main/exitcode exists. The main container creates that file in a deferred function inside NewEmissaryCommand. When the main container is OOMKilled, its process is killed outright, so the deferred function never runs, the file is never created, and the wait container never exits.
What version are you running? v3.3.5
Diagnostics
Paste the smallest workflow that reproduces the bug. We must be able to run the workflow.
apiVersion: argoproj.io/v1alpha1
kind: Workflow
metadata:
  name: dag-test2
  namespace: icdc-test
spec:
  arguments: {}
  entrypoint: scheduler-alpha-entrypoint
  podSpecPatch: '{"terminationGracePeriodSeconds": 0, "enableServiceLinks": false}'
  serviceAccountName: argo-workflow
  templates:
    - name: A
      container:
        image: ubuntu
        resources:
          requests:
            cpu: 1
            memory: 10Ki
          limits:
            cpu: 1
            memory: 50Mi
        command: ["/bin/bash"]
        args: ["-c","for x in {1..200}; do echo 'Round $x'; bash -c 'for b in {0..99999999}; do a=$b$a; done'; done"]
    - name: scheduler-alpha-entrypoint
      dag:
        tasks:
          - arguments: { }
            name: C
            template: A
      inputs: {}
      metadata: {}
      outputs: {}
# Logs from the workflow controller:
kubectl logs -n argo deploy/workflow-controller | grep ${workflow}
# If the workflow's pods have not been created, you can skip the rest of the diagnostics.
# The workflow's pods that are problematic:
kubectl get pod -o yaml -l workflows.argoproj.io/workflow=${workflow},workflow.argoproj.io/phase!=Succeeded
# Logs from your workflow's wait container, something like:
kubectl logs -c wait -l workflows.argoproj.io/workflow=${workflow},workflow.argoproj.io/phase!=Succeeded
Message from the maintainers:
Impacted by this bug? Give it a 👍. We prioritise the issues with the most 👍.
About this issue
- State: closed
- Created 2 years ago
- Comments: 32 (18 by maintainers)
Commits related to this issue
- fix: Fixes #9083 (main oom but wait stuck in running state) — committed to firetaker/argo-workflows by firetaker 2 years ago
This is really cool, I’ll check out the master branch