argo-workflows: Pod OOM sometimes causes a workflow to get stuck in "Running" state

Checklist

  • Double-checked my configuration.
  • Tested using the latest version.
  • Used the Emissary executor.

Summary

This follows up on #7457.

What happened/what you expected to happen?

A small percentage of our pods fail with an OOMKilled status and should be retried per our workflow config, but aren’t. The pod (and its workflow) hangs around indefinitely without appearing to do anything. It seems like the controller loses track of the pod until it’s either manually deleted or its hosting node is shut down. If either of those happens, the pod is restarted and the workflow usually succeeds.
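
For reference, the manual fix is simply deleting the stuck pod (as noted above, the pod is then restarted and the workflow usually succeeds). Roughly:

$ kubectl -n argo-workflows delete pod my-workflow-8acc183e-ffrf2-1452349024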

It’s expected that sometimes we’ll run out of memory when running this workload. What’s surprising is that most OOMKilled pods restart automatically, but some don’t.

For a sense of scale: we run roughly 90,000 workflows per day, and a few dozen of them get stuck in this state and have to be fixed manually.

The expected behavior is that Argo restarts the OOMKilled pod automatically until the retry limit is reached.

What version are you running?

3.3.2 for both the controller and executor.

Diagnostics

We’re running on EKS 1.21 and use cluster-autoscaler. I don’t think the problem is related to scaling, though, since we see the issue in pools of one node as well as pools of many nodes.

The stuck pods look like this:

$ kubectl -n argo-workflows get po | grep OOMKilled
my-workflow-8acc183e-ffrf2-1452349024        1/2     OOMKilled         0          38m
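
Drilling into one of these pods shows the main container terminated with reason OOMKilled while the wait container is still running (full describe output below). A quick way to check the per-container states, as a sketch:

$ kubectl -n argo-workflows get pod my-workflow-8acc183e-ffrf2-1452349024 \
    -o jsonpath='{range .status.containerStatuses[*]}{.name}{": "}{.state}{"\n"}{end}'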

Paste the smallest workflow that reproduces the bug. We must be able to run the workflow.

The workflow below matches our config as closely as possible (TTL, retries, and so on); the resource limits are slightly lower. If you run it enough times, it should be possible to reproduce the bug (see the submission loop sketch after the template).

apiVersion: argoproj.io/v1alpha1
kind: WorkflowTemplate
metadata:
  namespace: argo-workflows
  name: oom-test
spec:
  ttlStrategy:
    secondsAfterSuccess: 60
    secondsAfterFailure: 300
  entrypoint: oom-test
  templates:
    - name: oom-test
      retryStrategy:
        limit: "3"
        retryPolicy: Always
      container:
        image: ruby:3.0-alpine
        imagePullPolicy: IfNotPresent
        command: ["ruby", "-e", "foo = []; while true; foo << \"abcdefgh\"; end"]
        resources:
          limits:
            cpu: 500m
            memory: 128Mi
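
To reproduce, register the template and submit it repeatedly. A minimal sketch, assuming the Argo CLI is available and the template above is saved as oom-test.yaml (the file name and iteration count are illustrative):

$ kubectl -n argo-workflows apply -f oom-test.yaml
$ for i in $(seq 1 50); do argo submit -n argo-workflows --from workflowtemplate/oom-test; done
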
# Logs from the workflow controller:
$ kubectl logs -n argo deploy/workflow-controller | grep ${workflow} 
Found 3 pods, using pod/workflow-controller-6dd5db7b67-7wbkr
time="2022-04-22T19:21:28.908Z" level=info msg="Processing workflow" namespace=argo-workflows workflow=my-workflow-8acc183e-ffrf2
time="2022-04-22T19:21:28.908Z" level=info msg="Task-result reconciliation" namespace=argo-workflows numObjs=0 workflow=my-workflow-8acc183e-ffrf2
time="2022-04-22T19:21:28.908Z" level=info msg="node unchanged" nodeID=my-workflow-8acc183e-ffrf2-1452349024
time="2022-04-22T19:21:28.909Z" level=info msg="TaskSet Reconciliation" namespace=argo-workflows workflow=my-workflow-8acc183e-ffrf2
time="2022-04-22T19:21:28.909Z" level=info msg=reconcileAgentPod namespace=argo-workflows workflow=my-workflow-8acc183e-ffrf2

# The workflow's pods that are problematic:
$ kubectl -n argo-workflows describe po my-workflow-8acc183e-ffrf2-1452349024
Name:         my-workflow-8acc183e-ffrf2-1452349024
Namespace:    argo-workflows
Priority:     0
Node:         ip-10-161-69-226.ec2.internal/10.161.69.226
Start Time:   Fri, 22 Apr 2022 14:39:19 -0400
Labels:       workflows.argoproj.io/completed=false
              workflows.argoproj.io/workflow=my-workflow-8acc183e-ffrf2
Annotations:  cluster-autoscaler.kubernetes.io/safe-to-evict: true
              cni.projectcalico.org/containerID: 4295e8d5db4862b7ec687add0197b23e86795f2cc52885eb279976b00a64d98c
              cni.projectcalico.org/podIP: 192.168.80.248/32
              cni.projectcalico.org/podIPs: 192.168.80.248/32
              kubectl.kubernetes.io/default-container: main
              kubernetes.io/psp: eks.privileged
              workflows.argoproj.io/node-id: my-workflow-8acc183e-ffrf2-1452349024
              workflows.argoproj.io/node-name: my-workflow-8acc183e-ffrf2(0)
Status:       Running
IP:           192.168.80.248
IPs:
  IP:           192.168.80.248
Controlled By:  Workflow/my-workflow-8acc183e-ffrf2
Init Containers:
  init:
    Container ID:  containerd://c7096309368e2dfc3ec809ec7dfbe9f5c164b3780b7fe0fe125023efddf93d1b
    Image:         quay.io/argoproj/argoexec:v3.3.2
    Image ID:      quay.io/argoproj/argoexec@sha256:817a790f1bbd2fd31b71c2126e5489e06b49a98936352e4851fd0f325e3332c3
    Port:          <none>
    Host Port:     <none>
    Command:
      argoexec
      init
      --loglevel
      info
    State:          Terminated
      Reason:       Completed
      Exit Code:    0
      Started:      Fri, 22 Apr 2022 14:39:48 -0400
      Finished:     Fri, 22 Apr 2022 14:39:48 -0400
    Ready:          True
    Restart Count:  0
    Environment:
      ARGO_POD_NAME:                      my-workflow-8acc183e-ffrf2-1452349024 (v1:metadata.name)
      ARGO_POD_UID:                        (v1:metadata.uid)
      ARGO_CONTAINER_RUNTIME_EXECUTOR:    emissary
      GODEBUG:                            x509ignoreCN=0
      ARGO_WORKFLOW_NAME:                 my-workflow-8acc183e-ffrf2
      ARGO_CONTAINER_NAME:                init
      ARGO_TEMPLATE:                      ...
      ARGO_NODE_ID:                       my-workflow-8acc183e-ffrf2-1452349024
      ARGO_INCLUDE_SCRIPT_OUTPUT:         false
      ARGO_DEADLINE:                      0001-01-01T00:00:00Z
      ARGO_PROGRESS_FILE:                 /var/run/argo/progress
      ARGO_PROGRESS_PATCH_TICK_DURATION:  1m0s
      ARGO_PROGRESS_FILE_TICK_DURATION:   3s
    Mounts:
      /var/run/argo from var-run-argo (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-p6tnt (ro)
Containers:
  wait:
    Container ID:  containerd://51b5e17ea47c919f6d4cc60ef359587c1da93d3e007b23312e43637c88abb78e
    Image:         quay.io/argoproj/argoexec:v3.3.2
    Image ID:      quay.io/argoproj/argoexec@sha256:817a790f1bbd2fd31b71c2126e5489e06b49a98936352e4851fd0f325e3332c3
    Port:          <none>
    Host Port:     <none>
    Command:
      argoexec
      wait
      --loglevel
      info
    State:          Running
      Started:      Fri, 22 Apr 2022 14:39:55 -0400
    Ready:          True
    Restart Count:  0
    Environment:
      ARGO_POD_NAME:                      my-workflow-8acc183e-ffrf2-1452349024 (v1:metadata.name)
      ARGO_POD_UID:                        (v1:metadata.uid)
      ARGO_CONTAINER_RUNTIME_EXECUTOR:    emissary
      GODEBUG:                            x509ignoreCN=0
      ARGO_WORKFLOW_NAME:                 my-workflow-8acc183e-ffrf2
      ARGO_CONTAINER_NAME:                wait
      ARGO_TEMPLATE:                      ...
      ARGO_NODE_ID:                       my-workflow-8acc183e-ffrf2-1452349024
      ARGO_INCLUDE_SCRIPT_OUTPUT:         false
      ARGO_DEADLINE:                      0001-01-01T00:00:00Z
      ARGO_PROGRESS_FILE:                 /var/run/argo/progress
      ARGO_PROGRESS_PATCH_TICK_DURATION:  1m0s
      ARGO_PROGRESS_FILE_TICK_DURATION:   3s
    Mounts:
      /mainctrfs/tmp from ephemeral-work-volume (rw)
      /var/run/argo from var-run-argo (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-p6tnt (ro)
  main:
    Container ID:  containerd://6abf9a62c12400543e2e358c074af5bbe0249f2781af699da2025fd19dee61f1
    Image:         ....dkr.ecr.us-east-1.amazonaws.com/argo-workflow-images:my-app-my-workflow-default-8acc183e
    Image ID:      ....dkr.ecr.us-east-1.amazonaws.com/argo-workflow-images@sha256:8bc8de084174cd09b7f9271e0468416d8354057f5227dea72028c656e79be981
    Port:          <none>
    Host Port:     <none>
    Command:
      /var/run/argo/argoexec
      emissary
      --
      ...
    Args:
      ...
    State:          Terminated
      Reason:       OOMKilled
      Exit Code:    137
      Started:      Fri, 22 Apr 2022 14:39:55 -0400
      Finished:     Fri, 22 Apr 2022 14:41:01 -0400
    Ready:          False
    Restart Count:  0
    Limits:
      cpu:     500m
      memory:  512Mi
    Requests:
      cpu:     250m
      memory:  512Mi
    Environment:
      AWS_REGION:                         us-east-1
      AWS_ACCOUNT_NUMBER:                 ...
      AWS_SCHEME:                         https://
      ENVIRONMENT:                        production
      LOG_LEVEL:                          info
      RUNNING_ON_ARGO_WORKER:             true
      ARGO_CONTAINER_NAME:                main
      ARGO_TEMPLATE:                      [...]
      ARGO_NODE_ID:                       my-workflow-8acc183e-ffrf2-1452349024
      ARGO_INCLUDE_SCRIPT_OUTPUT:         false
      ARGO_DEADLINE:                      0001-01-01T00:00:00Z
      ARGO_PROGRESS_FILE:                 /var/run/argo/progress
      ARGO_PROGRESS_PATCH_TICK_DURATION:  1m0s
      ARGO_PROGRESS_FILE_TICK_DURATION:   3s
    Mounts:
      /tmp from ephemeral-work-volume (rw)
      /var/run/argo from var-run-argo (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-p6tnt (ro)
Conditions:
  Type              Status
  Initialized       True
  Ready             False
  ContainersReady   False
  PodScheduled      True
Volumes:
  var-run-argo:
    Type:       EmptyDir (a temporary directory that shares a pod's lifetime)
    Medium:
    SizeLimit:  <unset>
  ephemeral-work-volume:
    Type:       EmptyDir (a temporary directory that shares a pod's lifetime)
    Medium:
    SizeLimit:  <unset>
  kube-api-access-p6tnt:
    Type:                    Projected (a volume that contains injected data from multiple sources)
    TokenExpirationSeconds:  3607
    ConfigMapName:           kube-root-ca.crt
    ConfigMapOptional:       <nil>
    DownwardAPI:             true
QoS Class:                   Burstable
Node-Selectors:              ...
Tolerations:                 node.kubernetes.io/not-ready:NoExecute op=Exists for 300s
                             node.kubernetes.io/unreachable:NoExecute op=Exists for 300s
Events:
  Type    Reason     Age   From               Message
  ----    ------     ----  ----               -------
  Normal  Scheduled  39m   default-scheduler  Successfully assigned argo-workflows/my-workflow-8acc183e-ffrf2-1452349024 to ip-10-161-69-226.ec2.internal
  Normal  Pulled     39m   kubelet            Container image "quay.io/argoproj/argoexec:v3.3.2" already present on machine
  Normal  Created    39m   kubelet            Created container init
  Normal  Started    39m   kubelet            Started container init
  Normal  Pulled     39m   kubelet            Container image "quay.io/argoproj/argoexec:v3.3.2" already present on machine
  Normal  Created    39m   kubelet            Created container wait
  Normal  Started    39m   kubelet            Started container wait
  Normal  Pulled     39m   kubelet            Container image "....dkr.ecr.us-east-1.amazonaws.com/argo-workflow-images:my-app-my-workflow-default-8acc183e" already present on machine
  Normal  Created    39m   kubelet            Created container main
  Normal  Started    39m   kubelet            Started container main

# Logs from the workflow's wait container:
$ kubectl -n argo-workflows logs my-workflow-8acc183e-ffrf2-1452349024 wait
time="2022-04-22T18:39:55.721Z" level=info msg="Starting Workflow Executor" executorType=emissary version=v3.3.2
time="2022-04-22T18:39:55.724Z" level=info msg="Creating a emissary executor"
time="2022-04-22T18:39:55.725Z" level=info msg="Using executor retry strategy" Duration=1s Factor=1.6 Jitter=0.5 Steps=5
time="2022-04-22T18:39:55.725Z" level=info msg="Executor initialized" deadline="0001-01-01 00:00:00 +0000 UTC" includeScriptOutput=false namespace=argo-workflows podName=my-workflow-8acc183e-ffrf2-1452349024 template="..."
time="2022-04-22T18:39:55.725Z" level=info msg="Starting deadline monitor"
time="2022-04-22T18:44:55.726Z" level=info msg="Alloc=4954 TotalAlloc=10471 Sys=19922 NumGC=5 Goroutines=7"
time="2022-04-22T18:49:55.726Z" level=info msg="Alloc=4976 TotalAlloc=10582 Sys=19922 NumGC=7 Goroutines=7"
time="2022-04-22T18:54:55.726Z" level=info msg="Alloc=4957 TotalAlloc=10692 Sys=19922 NumGC=10 Goroutines=7"
time="2022-04-22T18:59:55.726Z" level=info msg="Alloc=4978 TotalAlloc=10800 Sys=19922 NumGC=12 Goroutines=7"
time="2022-04-22T19:04:55.726Z" level=info msg="Alloc=4958 TotalAlloc=10909 Sys=19922 NumGC=15 Goroutines=7"
time="2022-04-22T19:09:55.726Z" level=info msg="Alloc=4979 TotalAlloc=11017 Sys=19922 NumGC=17 Goroutines=7"
time="2022-04-22T19:14:55.726Z" level=info msg="Alloc=4959 TotalAlloc=11126 Sys=19922 NumGC=20 Goroutines=7"
time="2022-04-22T19:19:55.725Z" level=info msg="Alloc=4979 TotalAlloc=11235 Sys=19922 NumGC=22 Goroutines=7"
time="2022-04-22T19:24:55.726Z" level=info msg="Alloc=4959 TotalAlloc=11343 Sys=19922 NumGC=25 Goroutines=7"
time="2022-04-22T19:29:55.726Z" level=info msg="Alloc=4979 TotalAlloc=11451 Sys=19922 NumGC=27 Goroutines=7"
time="2022-04-22T19:34:55.725Z" level=info msg="Alloc=4959 TotalAlloc=11560 Sys=19922 NumGC=30 Goroutines=7"

Message from the maintainers:

Impacted by this bug? Give it a 👍. We prioritise the issues with the most 👍.

About this issue

  • State: closed
  • Created 2 years ago
  • Reactions: 3
  • Comments: 19 (6 by maintainers)

Most upvoted comments

Tip of master is published as :latest. You can run that.
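
A minimal sketch of trying that on our setup (assuming the controller Deployment and container are both named workflow-controller in the argo namespace, as in the controller log command above; the executor image is configured separately, e.g. via the controller's --executor-image flag):

$ kubectl -n argo set image deployment/workflow-controller workflow-controller=quay.io/argoproj/workflow-controller:latest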