argo-workflows: Wait container cannot exit if main container is OOMKilled

What happened/what you expected to happen?

The wait container cannot exit if the main container is OOMKilled.

Use case: start a workflow, let the main container run out of memory (OOM), and observe that the wait container does not exit.

[Screenshot, 2022-06-30 10:47:17 AM]

[Screenshot, 2022-06-30 10:48:03 AM]

Reason: From reading the source code, the wait container decides whether to exit by checking whether the file /var/run/argo/ctr/main/exitcode exists. The main container creates that file in a deferred function inside NewEmissaryCommand. When the main container is OOMKilled, its process is killed outright, so the deferred function never runs, the file is never created, and the wait container never exits.
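The failure mode can be reproduced outside Kubernetes with a small sketch. This is not Argo's code: it is a Python stand-in in which a `finally` block plays the role of the Go `defer`, SIGKILL plays the role of the kernel OOM killer, and the names `run_main`, `wait_for_exitcode` and `demo` are hypothetical. The timeout in the polling loop is an assumption added only so the demo terminates; a waiter without one would poll forever, which is the behaviour reported here. It assumes a POSIX system.

```python
import os
import signal
import subprocess
import sys
import tempfile
import textwrap
import time

# Child program: stands in for the process in the main container. It writes
# the exit code file from a finally block, the Python analogue of a Go defer.
CHILD = textwrap.dedent("""
    import sys, time
    path = sys.argv[1]
    try:
        print("ready", flush=True)
        time.sleep(float(sys.argv[2]))
    finally:
        with open(path, "w") as f:
            f.write("0")
""")


def run_main(exitcode_path, kill):
    """Run the child; optionally SIGKILL it, as the OOM killer would."""
    sleep_s = "30" if kill else "0.1"
    proc = subprocess.Popen(
        [sys.executable, "-c", CHILD, exitcode_path, sleep_s],
        stdout=subprocess.PIPE,
    )
    proc.stdout.readline()  # block until the child is inside the try block
    if kill:
        proc.send_signal(signal.SIGKILL)  # cannot be caught, no cleanup runs
    proc.wait()
    proc.stdout.close()
    return proc.returncode


def wait_for_exitcode(exitcode_path, timeout_s=1.0, poll_s=0.05):
    """Poll for the file; the timeout is an assumption made for this demo."""
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        if os.path.exists(exitcode_path):
            return True
        time.sleep(poll_s)
    return False


def demo():
    with tempfile.TemporaryDirectory() as d:
        clean = os.path.join(d, "clean-exitcode")
        killed = os.path.join(d, "killed-exitcode")
        rc_clean = run_main(clean, kill=False)
        rc_killed = run_main(killed, kill=True)
        return {
            "clean_rc": rc_clean,
            "clean_file": wait_for_exitcode(clean),
            "killed_rc": rc_killed,
            "killed_file": wait_for_exitcode(killed),
        }


if __name__ == "__main__":
    print(demo())
```

In the clean run the file appears and the waiter returns; in the killed run the cleanup is skipped and the waiter only stops because of the demo timeout. This suggests the exit signal cannot rely solely on cleanup code running inside the killed process.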

What version are you running? v3.3.5

Diagnostics

Paste the smallest workflow that reproduces the bug. We must be able to run the workflow.

apiVersion: argoproj.io/v1alpha1
kind: Workflow
metadata:
  name: dag-test2
  namespace: icdc-test
spec:
  arguments: {}
  entrypoint: scheduler-alpha-entrypoint
  podSpecPatch: '{"terminationGracePeriodSeconds": 0, "enableServiceLinks": false}'
  serviceAccountName: argo-workflow
  templates:
    - name: A
      container:
        image: ubuntu
        resources:
          requests:
            cpu: 1
            memory: 10Ki
          limits:
            cpu: 1
            memory: 50Mi
        command: ["/bin/bash"]
        args: ["-c","for x in {1..200}; do echo \"Round $x\"; bash -c 'for b in {0..99999999}; do a=$b$a; done'; done"]
    - name: scheduler-alpha-entrypoint
      dag:
        tasks:
          - arguments: { }
            name: C
            template: A
      inputs: {}
      metadata: {}
      outputs: {}

# Logs from the workflow controller:
kubectl logs -n argo deploy/workflow-controller | grep ${workflow} 

# If the workflow's pods have not been created, you can skip the rest of the diagnostics.

# The workflow's pods that are problematic:
kubectl get pod -o yaml -l workflows.argoproj.io/workflow=${workflow},workflow.argoproj.io/phase!=Succeeded

# Logs from your workflow's wait container, something like:
kubectl logs -c wait -l workflows.argoproj.io/workflow=${workflow},workflow.argoproj.io/phase!=Succeeded

Message from the maintainers:

Impacted by this bug? Give it a 👍. We prioritise the issues with the most 👍.

About this issue

  • State: closed
  • Created 2 years ago
  • Comments: 32 (18 by maintainers)

Most upvoted comments

This is really cool, I’ll check out the master branch.