argo-workflows: Pod failed with error "Pod was active on the node longer than the specified deadline" but remains in status Running

Pre-requisites

  • I have double-checked my configuration
  • I can confirm the issue exists when I tested with :latest
  • I’d like to contribute the fix myself (see contributing guide)

What happened/what you expected to happen?

I set a timeout of 10 seconds in a template with activeDeadlineSeconds: 10. After 10 seconds the pod received the error "Pod was active on the node longer than the specified deadline". After a few minutes the pod is deleted in Kubernetes, but it remains in status Pending or Running. The flow is blocked and does not continue to the next template. I expected the pod to get a Failed status.

In the previous version the same workflow worked fine.

Version

V3.4.2

Paste a small workflow that reproduces the issue. We must be able to run the workflow; don’t enter a workflow that uses private images.

apiVersion: argoproj.io/v1alpha1
kind: Workflow
metadata:
  generateName: hello-world-  # Name of this Workflow
spec:
  entrypoint: engine
  podGC:
    strategy: OnPodSuccess
  templates:
  - name: whalesay            # Defining the "whalesay" template
    activeDeadlineSeconds: 10
    container:
      image: docker/whalesay
      command: ['sh','-c']
      args: ["cowsay hello world && sleep 600"]   # This template runs "cowsay" in the "whalesay" image with arguments "hello world"
      resources:
        requests:
          memory: "3Gi"
          cpu: "2000m"
        limits:
          memory: "3Gi"
          cpu: "2000m"
  - name: engine
    parallelism: 7000
    steps:
      - - name: whalesay
          template: whalesay
          withSequence:
            count:  1

Logs from the workflow controller

kubectl logs -n argo deploy/workflow-controller | grep ${workflow}

Attached: workflow.log

Logs from your workflow’s wait container

kubectl logs -n argo -c wait -l workflows.argoproj.io/workflow=${workflow},workflow.argoproj.io/phase!=Succeeded

Attached: wait.log

About this issue

  • State: open
  • Created 2 years ago
  • Reactions: 11
  • Comments: 15 (4 by maintainers)

Most upvoted comments

Any updates on this?

Is anyone working on this issue? The latest version has fixes for all the vulnerabilities, but because of this workflow failure issue we are not able to upgrade to it.

We are also facing the same issue with v3.4.7. Is there any ETA to fix this issue?

@sarabala1979 Hi, I got the same issue with timeout. The pod status is DeadlineExceeded but the workflow step phase is still Running.
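The mismatch described in this comment (pod reports DeadlineExceeded while the workflow node stays Running) can be checked from the command line. The following is only a rough sketch, not an official diagnostic: the function names are hypothetical, the namespace is passed in by the caller, and the label selector is the one from the issue template above. It assumes kubectl is configured against the affected cluster and that the pod has not yet been deleted.

```shell
#!/usr/bin/env bash
# Sketch only: helper names (pod_states, node_phases, is_mismatch) are
# hypothetical and not part of Argo Workflows or kubectl.

# Print "<pod-name> <pod-phase> <pod-reason>" for every pod of the workflow.
pod_states() {
  local ns="$1" workflow="$2"
  kubectl get pods -n "$ns" \
    -l "workflows.argoproj.io/workflow=${workflow}" \
    -o jsonpath='{range .items[*]}{.metadata.name}{" "}{.status.phase}{" "}{.status.reason}{"\n"}{end}'
}

# Print the phase of every node recorded in the Workflow's status.
node_phases() {
  local ns="$1" workflow="$2"
  kubectl get workflow -n "$ns" "$workflow" \
    -o jsonpath='{.status.nodes.*.phase}'
}

# Succeed (exit 0) when the pod reason is DeadlineExceeded but the
# workflow node is still in a non-terminal phase.
is_mismatch() {
  local pod_reason="$1" node_phase="$2"
  [ "$pod_reason" = "DeadlineExceeded" ] || return 1
  [ "$node_phase" = "Running" ] || [ "$node_phase" = "Pending" ]
}
```

For example, running pod_states argo "$workflow" followed by node_phases argo "$workflow" shortly after the deadline should show whether the two views disagree; is_mismatch DeadlineExceeded Running returns success for the combination reported here.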

I’m also experiencing this issue.

Example Workflow
apiVersion: argoproj.io/v1alpha1
kind: WorkflowTemplate
metadata:
  name: active-deadline-test
spec:
  entrypoint: active-deadline-test
  templates:
  - name: active-deadline-test
    parallelism: 10
    steps:
    - - name: active-deadline-test-timeout
        inline:
          activeDeadlineSeconds: '5'
          script:
              image: alpine:{{.Chart.AppVersion}}
              command: [bin/bash]
              source: |
                sleep 100s

My suspicion is that the deadlineExceeded node isn’t having its phase updated correctly here: https://github.com/argoproj/argo-workflows/blob/master/workflow/controller/steps.go#L249-L258 I think ErrDeadlineExceeded should have the same, or at least similar, logic as ErrTimeout. Equivalent section of dag.go

Using timeout instead of activeDeadlineSeconds did, however, work.

Using timeout instead
apiVersion: argoproj.io/v1alpha1
kind: WorkflowTemplate
metadata:
  name: active-deadline-test
spec:
  entrypoint: active-deadline-test
  templates:
  - name: active-deadline-test
    parallelism: 10
    dag:
      tasks:
        - name: test-timeout-set
          template: test-timeout
          arguments:
            parameters:
              - name: timeout
                value: '5s'
        - name: test-timeout-unset
          template: test-timeout
        - name: test-timeout-set-empty
          template: test-timeout
          arguments:
            parameters:
              - name: timeout
                value: ''
        - name: test-timeout-set-zero
          template: test-timeout
          arguments:
            parameters:
              - name: timeout
                value: '0s'

  - name: test-timeout
    inputs:
      parameters:
        - name: timeout
          default: ''
    timeout: '{{`{{inputs.parameters.timeout}}`}}'
    script:
        image: alpine:{{.Chart.AppVersion}}
        command: [bin/bash]
        source: |
          sleep 100s

EDIT: An update to this: it seems a longer timeout ends up with the same behaviour as activeDeadlineSeconds, so it remains in Running and doesn’t exit.

I also tried “timeout”, and the behavior was the same as with “activeDeadlineSeconds”. The pods are still in status Running and they never change to Failed/Error.