argo-workflows: Multi-step workflow does not terminate (wait container does not exit with Docker executor in v3.0)

Summary

A multi-step workflow neither terminates nor proceeds to the next step, even after the step's pod has terminated.

The issue started at https://github.com/hyfen-nl/PIVT/issues/106, where the developer identified this as a potential Argo issue. The logs below are based on the example at https://argoproj.github.io/argo-workflows/examples/#steps.

Diagnostics

What Kubernetes provider are you using? Digital Ocean

What version of Argo Workflows are you running? v3.0.7

Paste a workflow that reproduces the bug, including status:
kubectl get wf -o yaml ${workflow} 

apiVersion: argoproj.io/v1alpha1
kind: Workflow
metadata:
  creationTimestamp: "2021-06-02T07:19:32Z"
  generateName: steps-
  generation: 3
  labels:
    workflows.argoproj.io/phase: Running
  name: steps-5xt4d
  namespace: default
  resourceVersion: "3619084"
  uid: bc955ea0-7a0c-4a36-8a37-d3fb10e27615
spec:
  arguments: {}
  entrypoint: hello-hello-hello
  templates:
  - inputs: {}
    metadata: {}
    name: hello-hello-hello
    outputs: {}
    steps:
    - - arguments:
          parameters:
          - name: message
            value: hello1
        name: hello1
        template: whalesay
    - - arguments:
          parameters:
          - name: message
            value: hello2a
        name: hello2a
        template: whalesay
      - arguments:
          parameters:
          - name: message
            value: hello2b
        name: hello2b
        template: whalesay
  - container:
      args:
      - '{{inputs.parameters.message}}'
      command:
      - cowsay
      image: docker/whalesay
      name: ""
      resources: {}
    inputs:
      parameters:
      - name: message
    metadata: {}
    name: whalesay
    outputs: {}
status:
  artifactRepositoryRef:
    default: true
  conditions:
  - status: "True"
    type: PodRunning
  finishedAt: null
  nodes:
    steps-5xt4d:
      children:
      - steps-5xt4d-3743377224
      displayName: steps-5xt4d
      finishedAt: null
      id: steps-5xt4d
      name: steps-5xt4d
      phase: Running
      progress: 0/1
      startedAt: "2021-06-02T07:19:32Z"
      templateName: hello-hello-hello
      templateScope: local/steps-5xt4d
      type: Steps
    steps-5xt4d-293443185:
      boundaryID: steps-5xt4d
      displayName: hello1
      finishedAt: null
      hostNodeName: hlf-pool1-8rnem
      id: steps-5xt4d-293443185
      inputs:
        parameters:
        - name: message
          value: hello1
      name: steps-5xt4d[0].hello1
      phase: Running
      progress: 0/1
      startedAt: "2021-06-02T07:19:32Z"
      templateName: whalesay
      templateScope: local/steps-5xt4d
      type: Pod
    steps-5xt4d-3743377224:
      boundaryID: steps-5xt4d
      children:
      - steps-5xt4d-293443185
      displayName: '[0]'
      finishedAt: null
      id: steps-5xt4d-3743377224
      name: steps-5xt4d[0]
      phase: Running
      progress: 0/1
      startedAt: "2021-06-02T07:19:32Z"
      templateScope: local/steps-5xt4d
      type: StepGroup
  phase: Running
  progress: 0/1
  startedAt: "2021-06-02T07:19:32Z"
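
In the status above, every node is still in the Running phase with `finishedAt: null`, including the `hello1` pod node. A minimal sketch for listing such nodes, assuming the workflow has been exported as JSON and loaded into a dict; the helper name `unfinished_nodes` is hypothetical and not part of Argo:

```python
def unfinished_nodes(workflow):
    """Return (id, type, phase) for every node that has no finishedAt timestamp."""
    nodes = workflow.get("status", {}).get("nodes", {})
    return sorted(
        (node_id, node.get("type", ""), node.get("phase", ""))
        for node_id, node in nodes.items()
        if node.get("finishedAt") is None
    )


# Trimmed copy of the status reported above (only the fields the function reads).
workflow = {
    "status": {
        "nodes": {
            "steps-5xt4d": {"type": "Steps", "phase": "Running", "finishedAt": None},
            "steps-5xt4d-293443185": {"type": "Pod", "phase": "Running", "finishedAt": None},
            "steps-5xt4d-3743377224": {"type": "StepGroup", "phase": "Running", "finishedAt": None},
        }
    }
}

for node_id, node_type, phase in unfinished_nodes(workflow):
    print(node_id, node_type, phase)
```

For a live workflow, the dict could come from `json.load` applied to the output of `kubectl get wf -o json ${workflow}`.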

Paste the logs from the workflow controller:
kubectl logs -n argo deploy/workflow-controller | grep ${workflow}

time="2021-06-02T07:19:32.754Z" level=info msg="Processing workflow" namespace=default workflow=steps-5xt4d
time="2021-06-02T07:19:32.769Z" level=info msg="Updated phase  -> Running" namespace=default workflow=steps-5xt4d
time="2021-06-02T07:19:32.780Z" level=info msg="Steps node steps-5xt4d initialized Running" namespace=default workflow=steps-5xt4d
time="2021-06-02T07:19:32.786Z" level=info msg="StepGroup node steps-5xt4d-3743377224 initialized Running" namespace=default workflow=steps-5xt4d
time="2021-06-02T07:19:32.790Z" level=info msg="Pod node steps-5xt4d-293443185 initialized Pending" namespace=default workflow=steps-5xt4d
time="2021-06-02T07:19:32.808Z" level=info msg="Created pod: steps-5xt4d[0].hello1 (steps-5xt4d-293443185)" namespace=default workflow=steps-5xt4d
time="2021-06-02T07:19:32.808Z" level=info msg="Workflow step group node steps-5xt4d-3743377224 not yet completed" namespace=default workflow=steps-5xt4d
time="2021-06-02T07:19:32.852Z" level=info msg="Workflow update successful" namespace=default phase=Running resourceVersion=3619053 workflow=steps-5xt4d
time="2021-06-02T07:19:42.869Z" level=info msg="Processing workflow" namespace=default workflow=steps-5xt4d
time="2021-06-02T07:19:42.874Z" level=info msg="Updating node steps-5xt4d-293443185 status Pending -> Running" namespace=default workflow=steps-5xt4d
time="2021-06-02T07:19:42.882Z" level=info msg="Workflow step group node steps-5xt4d-3743377224 not yet completed" namespace=default workflow=steps-5xt4d
time="2021-06-02T07:19:42.930Z" level=info msg="Workflow update successful" namespace=default phase=Running resourceVersion=3619084 workflow=steps-5xt4d
time="2021-06-02T07:19:52.913Z" level=info msg="Processing workflow" namespace=default workflow=steps-5xt4d
time="2021-06-02T07:19:52.916Z" level=info msg="Workflow step group node steps-5xt4d-3743377224 not yet completed" namespace=default workflow=steps-5xt4d
time="2021-06-02T07:39:52.914Z" level=info msg="Processing workflow" namespace=default workflow=steps-5xt4d
time="2021-06-02T07:39:52.915Z" level=info msg="Workflow step group node steps-5xt4d-3743377224 not yet completed" namespace=default workflow=steps-5xt4d
time="2021-06-02T07:59:52.914Z" level=info msg="Processing workflow" namespace=default workflow=steps-5xt4d
time="2021-06-02T07:59:52.915Z" level=info msg="Workflow step group node steps-5xt4d-3743377224 not yet completed" namespace=default workflow=steps-5xt4d
time="2021-06-02T08:19:52.919Z" level=info msg="Processing workflow" namespace=default workflow=steps-5xt4d
time="2021-06-02T08:19:52.919Z" level=info msg="Workflow step group node steps-5xt4d-3743377224 not yet completed" namespace=default workflow=steps-5xt4d
time="2021-06-02T08:39:52.919Z" level=info msg="Processing workflow" namespace=default workflow=steps-5xt4d
time="2021-06-02T08:39:52.920Z" level=info msg="Workflow step group node steps-5xt4d-3743377224 not yet completed" namespace=default workflow=steps-5xt4d
time="2021-06-02T08:59:52.920Z" level=info msg="Processing workflow" namespace=default workflow=steps-5xt4d
time="2021-06-02T08:59:52.921Z" level=info msg="Workflow step group node steps-5xt4d-3743377224 not yet completed" namespace=default workflow=steps-5xt4d
time="2021-06-02T09:19:52.920Z" level=info msg="Processing workflow" namespace=default workflow=steps-5xt4d
time="2021-06-02T09:19:52.921Z" level=info msg="Workflow step group node steps-5xt4d-3743377224 not yet completed" namespace=default workflow=steps-5xt4d
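
After the node moves to Running, the controller logs show "Processing workflow" only at regular 20-minute intervals, with no further status change. That pattern may indicate periodic re-processing rather than a reaction to a pod event, though this reading is an assumption. A small sketch that measures the gaps between those entries, using a few lines copied from the log above; the function names are hypothetical:

```python
import re
from datetime import datetime

# Matches the time and msg fields of the controller log lines shown above.
LOG_LINE = re.compile(r'time="(?P<time>[^"]+)" level=\w+ msg="(?P<msg>[^"]*)"')


def processing_times(log_text):
    """Return the timestamps of every "Processing workflow" entry."""
    times = []
    for line in log_text.splitlines():
        match = LOG_LINE.search(line)
        if match and match.group("msg") == "Processing workflow":
            times.append(datetime.strptime(match.group("time"), "%Y-%m-%dT%H:%M:%S.%fZ"))
    return times


def gaps_in_seconds(times):
    """Return the rounded number of seconds between consecutive timestamps."""
    return [round((later - earlier).total_seconds()) for earlier, later in zip(times, times[1:])]


# A subset of the controller log lines from this report.
SAMPLE = '''
time="2021-06-02T07:19:32.754Z" level=info msg="Processing workflow" namespace=default workflow=steps-5xt4d
time="2021-06-02T07:19:42.869Z" level=info msg="Processing workflow" namespace=default workflow=steps-5xt4d
time="2021-06-02T07:19:42.874Z" level=info msg="Updating node steps-5xt4d-293443185 status Pending -> Running" namespace=default workflow=steps-5xt4d
time="2021-06-02T07:19:52.913Z" level=info msg="Processing workflow" namespace=default workflow=steps-5xt4d
time="2021-06-02T07:39:52.914Z" level=info msg="Processing workflow" namespace=default workflow=steps-5xt4d
time="2021-06-02T07:59:52.914Z" level=info msg="Processing workflow" namespace=default workflow=steps-5xt4d
'''

print(gaps_in_seconds(processing_times(SAMPLE)))
```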

Paste the logs from your workflow's wait container:
kubectl logs -c wait -l workflows.argoproj.io/workflow=${workflow}

time="2021-06-02T09:24:46.518Z" level=info msg="docker ps --all --no-trunc --format={{.Status}}|{{.Label \"io.kubernetes.container.name\"}}|{{.ID}}|{{.CreatedAt}} --filter=label=io.kubernetes.pod.namespace=default --filter=label=io.kubernetes.pod.name=steps-5xt4d-293443185"
time="2021-06-02T09:24:47.681Z" level=info msg="docker ps --all --no-trunc --format={{.Status}}|{{.Label \"io.kubernetes.container.name\"}}|{{.ID}}|{{.CreatedAt}} --filter=label=io.kubernetes.pod.namespace=default --filter=label=io.kubernetes.pod.name=steps-5xt4d-293443185"
time="2021-06-02T09:24:48.847Z" level=info msg="docker ps --all --no-trunc --format={{.Status}}|{{.Label \"io.kubernetes.container.name\"}}|{{.ID}}|{{.CreatedAt}} --filter=label=io.kubernetes.pod.namespace=default --filter=label=io.kubernetes.pod.name=steps-5xt4d-293443185"
time="2021-06-02T09:24:50.027Z" level=info msg="docker ps --all --no-trunc --format={{.Status}}|{{.Label \"io.kubernetes.container.name\"}}|{{.ID}}|{{.CreatedAt}} --filter=label=io.kubernetes.pod.namespace=default --filter=label=io.kubernetes.pod.name=steps-5xt4d-293443185"
time="2021-06-02T09:24:51.191Z" level=info msg="docker ps --all --no-trunc --format={{.Status}}|{{.Label \"io.kubernetes.container.name\"}}|{{.ID}}|{{.CreatedAt}} --filter=label=io.kubernetes.pod.namespace=default --filter=label=io.kubernetes.pod.name=steps-5xt4d-293443185"
time="2021-06-02T09:24:52.360Z" level=info msg="docker ps --all --no-trunc --format={{.Status}}|{{.Label \"io.kubernetes.container.name\"}}|{{.ID}}|{{.CreatedAt}} --filter=label=io.kubernetes.pod.namespace=default --filter=label=io.kubernetes.pod.name=steps-5xt4d-293443185"
time="2021-06-02T09:24:53.565Z" level=info msg="docker ps --all --no-trunc --format={{.Status}}|{{.Label \"io.kubernetes.container.name\"}}|{{.ID}}|{{.CreatedAt}} --filter=label=io.kubernetes.pod.namespace=default --filter=label=io.kubernetes.pod.name=steps-5xt4d-293443185"
time="2021-06-02T09:24:54.767Z" level=info msg="docker ps --all --no-trunc --format={{.Status}}|{{.Label \"io.kubernetes.container.name\"}}|{{.ID}}|{{.CreatedAt}} --filter=label=io.kubernetes.pod.namespace=default --filter=label=io.kubernetes.pod.name=steps-5xt4d-293443185"
time="2021-06-02T09:24:55.891Z" level=info msg="docker ps --all --no-trunc --format={{.Status}}|{{.Label \"io.kubernetes.container.name\"}}|{{.ID}}|{{.CreatedAt}} --filter=label=io.kubernetes.pod.namespace=default --filter=label=io.kubernetes.pod.name=steps-5xt4d-293443185"
time="2021-06-02T09:24:56.933Z" level=info msg="docker ps --all --no-trunc --format={{.Status}}|{{.Label \"io.kubernetes.container.name\"}}|{{.ID}}|{{.CreatedAt}} --filter=label=io.kubernetes.pod.namespace=default --filter=label=io.kubernetes.pod.name=steps-5xt4d-293443185"
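
The wait container keeps running the same `docker ps` query roughly once per second, with the output formatted as `Status|container name|ID|CreatedAt`. The report does not include the output of that command, so the sketch below parses invented sample output purely to illustrate the format; the container IDs, statuses, and timestamps are assumptions, and `parse_docker_ps` is a hypothetical helper, not Argo code:

```python
def parse_docker_ps(output):
    """Parse lines of the form Status|container name|ID|CreatedAt into dicts."""
    containers = []
    for line in output.splitlines():
        parts = line.split("|")
        if len(parts) != 4:
            continue  # skip blank or malformed lines
        status, name, container_id, created_at = parts
        containers.append(
            {
                "name": name,
                "id": container_id,
                "status": status,
                "created_at": created_at,
                # Docker reports finished containers with a status starting with "Exited".
                "exited": status.startswith("Exited"),
            }
        )
    return containers


# Invented sample output (NOT from the reporter's cluster).
SAMPLE_OUTPUT = """\
Up 2 hours|wait|aaa111|2021-06-02 07:19:34 +0000 UTC
Exited (0) 2 hours ago|main|bbb222|2021-06-02 07:19:35 +0000 UTC
"""

for container in parse_docker_ps(SAMPLE_OUTPUT):
    print(container["name"], "exited" if container["exited"] else "running")
```

Running the logged `docker ps` command by hand on the node and comparing its output with what the wait container expects could help show whether the main container is being detected as exited.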

Message from the maintainers:

Impacted by this bug? Give it a 👍. We prioritise the issues with the most 👍.

About this issue

  • State: closed
  • Created 3 years ago
  • Reactions: 4
  • Comments: 22 (11 by maintainers)

Most upvoted comments

I am using version 3.2.6 and get the same behavior. I opened a discussion on this issue: https://github.com/argoproj/argo-workflows/discussions/7480

It looks to me like your main container exited quickly, in under 1s?

Maybe that’s because it simply executes a hello-world statement. I’m testing with this example because it’s much simpler than my original workflow, which hit the same issue: it would stop at a step and not proceed to the next.