argo-workflows: Pod failed with error: Pod was active on the node longer than the specified deadline - remains in status Running
Pre-requisites
- I have double-checked my configuration
- I can confirm the issue exists when I tested with :latest
- I’d like to contribute the fix myself (see contributing guide)
What happened/what you expected to happen?
I set a timeout of 10 seconds in a template with activeDeadlineSeconds: 10. After 10 seconds the pod received the error "Pod was active on the node longer than the specified deadline". After a few minutes the pod is deleted in Kubernetes, but it remains in status Pending or Running. The flow is blocked and does not continue to the next template. I expected the pod to get Failed status.
In a previous version the same workflow worked fine.
Version
V3.4.2
Paste a small workflow that reproduces the issue. We must be able to run the workflow; don’t enter a workflow that uses private images.
apiVersion: argoproj.io/v1alpha1
kind: Workflow
metadata:
  generateName: hello-world- # Name of this Workflow
spec:
  entrypoint: engine
  podGC:
    strategy: OnPodSuccess
  templates:
    - name: whalesay # Defining the "whalesay" template
      activeDeadlineSeconds: 10
      container:
        image: docker/whalesay
        command: ['sh','-c']
        args: ["cowsay hello world && sleep 600"] # This template runs "cowsay" in the "whalesay" image with arguments "hello world"
        resources:
          requests:
            memory: "3Gi"
            cpu: "2000m"
          limits:
            memory: "3Gi"
            cpu: "2000m"
    - name: engine
      parallelism: 7000
      steps:
        - - name: whalesay
            template: whalesay
            withSequence:
              count: 1
Logs from the workflow controller
kubectl logs -n argo deploy/workflow-controller | grep ${workflow}
workflow.log
Logs from your workflow’s wait container
kubectl logs -n argo -c wait -l workflows.argoproj.io/workflow=${workflow},workflow.argoproj.io/phase!=Succeeded
wait.log
About this issue
- State: open
- Created 2 years ago
- Reactions: 11
- Comments: 15 (4 by maintainers)
Any updates on this?
Is anyone working on this issue? The latest version has fixes for all the vulnerabilities, but because of this workflow failure issue we are not able to upgrade to it.
We are also facing the same issue with v3.4.7. Is there any ETA to fix this issue?
@sarabala1979 Hi, I got the same issue with timeout. The pod status is DeadlineExceeded but the workflow step phase is still Running.
I’m also experiencing this issue.
Example Workflow
My suspicion is that the deadlineExceeded node isn’t having its phase updated correctly here: https://github.com/argoproj/argo-workflows/blob/master/workflow/controller/steps.go#L249-L258
I think ErrDeadlineExceeded should get the same, or at least similar, handling as ErrTimeout.
Equivalent section of dag.go
However, using timeout instead of activeDeadlineSeconds did work.
Using timeout instead
EDIT: Update to this: it seems a longer timeout ends up with the same behaviour as activeDeadlineSeconds, so the step remains in Running and doesn’t exit.
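For readers comparing the two settings, here is a minimal sketch of what swapping activeDeadlineSeconds for the template-level timeout field might look like. It is adapted from the reproduction workflow at the top of this issue and is not the commenter's original manifest. The generateName and the "10s" value are illustrative assumptions.

```yaml
apiVersion: argoproj.io/v1alpha1
kind: Workflow
metadata:
  generateName: timeout-repro-   # hypothetical name, chosen for this sketch
spec:
  entrypoint: whalesay
  templates:
    - name: whalesay
      # Template-level timeout (a duration string) used in place of
      # activeDeadlineSeconds: 10 from the original reproduction.
      timeout: "10s"
      container:
        image: docker/whalesay
        command: ['sh', '-c']
        # Sleeps far longer than the timeout so the deadline is always hit.
        args: ["cowsay hello world && sleep 600"]
```

The thread reports mixed results with this swap, so it is better treated as a way to reproduce and compare the two behaviours than as a confirmed workaround.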
I also tried timeout, and the behavior was the same as with activeDeadlineSeconds. The pods stay in status Running and never change to Failed/Error.