argo-workflows: Workflow Failed: wait container is forcefully stopped

Checklist

Double-checked my configuration.
Tested using the latest version.
Used the Emissary executor.

Summary

We have updated Argo Workflow to the version 3.3.8 and lately, we noticed some of the Workflows are failing due to the Pods being terminated.

When I analyse the Pod, I noticed the following:

- containerID: containerd://d10fcac735c905dc65717148e04437e32148bea560bb66e9ed35d718471f1bf8
   image: argoproj/workflow-controller:v3.3.8-linux-amd64
   imageID: argoproj/workflow-controller:v3.3.8-linux-amd64@sha256:1ecc5b305fb784271797c14436c30b4b6d23204d603d416fdb0382f62604325f
   lastState: {}
   name: wait
   ready: false
   restartCount: 0
   started: false
   state:
     terminated:
       containerID: containerd://d10fcac735c905dc65717148e04437e32148bea560bb66e9ed35d718471f1bf8
       exitCode: 2
       finishedAt: "2022-08-10T07:06:08Z"
       reason: Error
       startedAt: "2022-08-10T07:06:07Z"

Also, when I check the logs of the aforementioned wait container, I don’t see anything else than the following two lines:

fatal: bad g in signal handler
fatal: bad g in signal handler

It’s not a consistent error, and I have checked that it happens around 1/12 times, but it’s weird and I don’t see the way of fixing it.

What version are you running?

I’m running Argo Workflows 3.3.8, deployed by Helm in our cluster.

Diagnostics

Paste the smallest workflow that reproduces the bug. We must be able to run the workflow.

Any Workflow can reproduce this bug on our cluster at a point. Even the one in the examples.

# Logs from the workflow controller:
time="2022-08-10T08:51:07.741Z" level=info msg="Queueing Failed workflow metadata/metadata-process-6phwc for delete in 2m57s due to TTL"

# Logs from in your workflow's wait container, something like:
fatal: bad g in signal handler
fatal: bad g in signal handler

Message from the maintainers:

Impacted by this bug? Give it a 👍. We prioritise the issues with the most 👍.

About this issue

Original URL
State: closed
Created 2 years ago
Comments: 23 (13 by maintainers)

Most upvoted comments

@isubasinghe it was a problem with the 3.3.x version.

I don’t know what changed, but once I upgraded to 3.4.x, the error stopped happening.

BTW, I’m closing this, as the upgrade to the latest version solved the weird issue.

dacamposol on Oct 27, 2022

Do you mean with the version 3.4.0?

no. I mean the :latest image tag

alexec on Sep 9, 2022