argo-workflows: Wait container stuck in a loop while main container finished successfully

Summary

The wait container occasionally gets stuck, so the workflow keeps running until it times out, even though the main container has finished successfully.

wait container log:

time="2021-06-09T18:01:47.217Z" level=info msg="docker ps --all --no-trunc --format={{.Status}}|{{.Label \"io.kubernetes.container.name\"}}|{{.ID}}|{{.CreatedAt}} --filter=label=io.kubernetes.pod.namespace=workflows --filter=label=io.kubernetes.pod.name=workflow-analytics-pipeline-cron-1623258000-2189159064"
time="2021-06-09T18:01:47.256Z" level=info msg="ignoring container \"main\" created at 2021-06-09 17:00:04 +0000 UTC, too long before process started"
time="2021-06-09T18:01:47.257Z" level=info msg="ignoring container \"wait\" created at 2021-06-09 17:00:03 +0000 UTC, too long before process started"
time="2021-06-09T18:01:48.257Z" level=info msg="docker ps --all --no-trunc --format={{.Status}}|{{.Label \"io.kubernetes.container.name\"}}|{{.ID}}|{{.CreatedAt}} --filter=label=io.kubernetes.pod.namespace=workflows --filter=label=io.kubernetes.pod.name=workflow-analytics-pipeline-cron-1623258000-2189159064"
time="2021-06-09T18:01:48.307Z" level=info msg="ignoring container \"main\" created at 2021-06-09 17:00:04 +0000 UTC, too long before process started"
time="2021-06-09T18:01:48.307Z" level=info msg="ignoring container \"wait\" created at 2021-06-09 17:00:03 +0000 UTC, too long before process started"

The pod is in Completed status, the main container is terminated - Completed (exit code: 0), and the wait container is still running and ready. The workflow (it’s a multi-step DAG workflow) hangs in Running, eventually terminating after exceeding its active deadline. The main container is a very simple bash step that records a timestamp (sh -c 'date +%s > /tmp/timestamp.txt') and takes perhaps a few seconds to complete.
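
For context, the failing step amounts to something like the minimal Workflow below. This is a sketch only, not the actual spec: the image, generated name, retry limit and output parameter are assumptions inferred from the node names and behaviour described here.

apiVersion: argoproj.io/v1alpha1
kind: Workflow
metadata:
  generateName: analytics-pipeline-repro-   # hypothetical name
spec:
  entrypoint: main
  activeDeadlineSeconds: 3600                # the real workflow is eventually killed by a deadline like this
  templates:
    - name: main
      dag:
        tasks:
          - name: generate-timestamp         # matches the node name in the controller logs below
            template: generate-timestamp
    - name: generate-timestamp
      retryStrategy:
        limit: 2                             # the controller logs show Retry nodes, so some retryStrategy is set
      container:
        image: alpine:3.13                   # assumption; the original image is not shown
        command: [sh, -c]
        args: ["date +%s > /tmp/timestamp.txt"]
      outputs:
        parameters:
          - name: timestamp                  # assumption: presumably consumed by a later DAG task
            valueFrom:
              path: /tmp/timestamp.txt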

Diagnostics

Running self-managed Kubernetes on AWS. Argo version 3.1.0-rc12, Docker executor.

We’ve had this issue from time to time and can’t pinpoint the version in which we started seeing it. Examining the controller logs for entries containing the workflow name:

time="2021-06-09T17:00:00.066Z" level=info msg="Processing workflow" namespace=workflows workflow=workflow-analytics-pipeline-cron-1623258000
time="2021-06-09T17:00:00.105Z" level=info msg="Updated phase -> Running" namespace=workflows workflow=workflow-analytics-pipeline-cron-1623258000
time="2021-06-09T17:00:00.108Z" level=info msg="Retry nodeworkflow-analytics-pipeline-cron-1623258000 initialized Running" namespace=workflows workflow=workflow-analytics-pipeline-cron-1623258000
time="2021-06-09T17:00:00.109Z" level=info msg="All of nodeworkflow-analytics-pipeline-cron-1623258000(0).generate-timestamp dependencies [] completed" namespace=workflows workflow=workflow-analytics-pipeline-cron-1623258000
time="2021-06-09T17:00:00.109Z" level=info msg="DAG node workflow-analytics-pipeline-cron-1623258000-1105574991 initialized Running" namespace=workflows workflow=workflow-analytics-pipeline-cron-1623258000
time="2021-06-09T17:00:00.110Z" level=info msg="Pod node workflow-analytics-pipeline-cron-1623258000-2189159064 initialized Pending" namespace=workflows workflow=workflow-analytics-pipeline-cron-1623258000
time="2021-06-09T17:00:00.110Z" level=info msg="Retry node workflow-analytics-pipeline-cron-1623258000-2617208257 initialized Running" namespace=workflows workflow=workflow-analytics-pipeline-cron-1623258000
time="2021-06-09T17:00:00.134Z" level=info msg="Created pod: workflow-analytics-pipeline-cron-1623258000(0).generate-timestamp(0) (workflow-analytics-pipeline-cron-1623258000-2189159064)" namespace=workflows workflow=workflow-analytics-pipeline-cron-1623258000
time="2021-06-09T17:00:00.157Z" level=info msg="Workflow update successful" namespace=workflows phase=Running resourceVersion=165964949 workflow=workflow-analytics-pipeline-cron-1623258000
time="2021-06-09T17:00:10.068Z" level=info msg="Updating node workflow-analytics-pipeline-cron-1623258000-2189159064 message: ContainerCreating" namespace=workflows workflow=workflow-analytics-pipeline-cron-1623258000
time="2021-06-09T17:00:10.084Z" level=info msg="Workflow update successful" namespace=workflows phase=Running resourceVersion=165965126 workflow=workflow-analytics-pipeline-cron-1623258000
time="2021-06-09T17:00:20.086Z" level=info msg="Processing workflow" namespace=workflows workflow=workflow-analytics-pipeline-cron-1623258000
time="2021-06-09T17:00:32.164Z" level=info msg="Processing workflow" namespace=workflows workflow=workflow-analytics-pipeline-cron-1623258000
time="2021-06-09T17:00:32.169Z" level=info msg="Updating node workflow-analytics-pipeline-cron-1623258000-2189159064 status Pending -> Running" namespace=workflows workflow=workflow-analytics-pipeline-cron-1623258000
time="2021-06-09T17:00:32.416Z" level=info msg="Workflow update successful" namespace=workflows phase=Running resourceVersion=165965421 workflow=workflow-analytics-pipeline-cron-1623258000
time="2021-06-09T17:00:42.418Z" level=info msg="Processing workflow" namespace=workflows workflow=workflow-analytics-pipeline-cron-1623258000
time="2021-06-09T17:20:42.420Z" level=info msg="Processing workflow" namespace=workflows workflow=workflow-analytics-pipeline-cron-1623258000
time="2021-06-09T17:40:42.421Z" level=info msg="Processing workflow" namespace=workflows workflow=workflow-analytics-pipeline-cron-1623258000
time="2021-06-09T18:00:00.027Z" level=info msg="stopping 'workflow-analytics-pipeline-cron-1623258000'" namespace=workflows workflow=workflow-analytics-pipeline-cron
time="2021-06-09T18:00:10.049Z" level=info msg="Processing workflow" namespace=workflows workflow=workflow-analytics-pipeline-cron-1623258000
time="2021-06-09T18:00:10.055Z" level=info msg="Applying shutdown deadline for pod workflow-analytics-pipeline-cron-1623258000-2189159064" namespace=workflows workflow=workflow-analytics-pipeline-cron-1623258000

Message from the maintainers:

Impacted by this bug? Give it a 👍. We prioritise the issues with the most 👍.

About this issue

  • State: closed
  • Created 3 years ago
  • Reactions: 13
  • Comments: 17 (6 by maintainers)

Most upvoted comments

Not sure why this was closed - the issue hasn’t been solved, it has been circumvented by using the k8sapi executor. Using the default Docker executor still has this issue.

@raviatmaryh thank you, the issue has been solved.
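
For completeness, the workaround mentioned in the comments above (switching from the Docker executor to the k8sapi executor) is a controller-level setting. A minimal sketch, assuming the default workflow-controller-configmap name and an install in the argo namespace:

apiVersion: v1
kind: ConfigMap
metadata:
  name: workflow-controller-configmap
  namespace: argo                    # adjust to wherever the workflow controller is installed
data:
  containerRuntimeExecutor: k8sapi   # the reports above only see the hang with the docker executor

Later Argo Workflows releases drop the containerRuntimeExecutor setting altogether in favour of the emissary executor, which does not poll docker ps, so upgrading is the longer-term fix.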