argo-workflows: Exceeded Quota Causes Failed Workflows

Checklist:

  • I’ve included the version.
  • I’ve included reproduction steps.
  • I’ve included the workflow YAML.
  • I’ve included the logs.

What happened: The workflow failed immediately with an Error status because pod creation was rejected for exceeding the namespace CPU quota. The same happens when the memory quota is exceeded.

What you expected to happen: The pod should stay in the Pending state until it can get the necessary resources.

How to reproduce it (as minimally and precisely as possible):

apiVersion: argoproj.io/v1alpha1
kind: Workflow
metadata:
  generateName: cpu-limit-
spec:
  serviceAccountName: argo
  entrypoint: wait

  templates:
  - name: wait
    resubmitPendingPods: True
    script:
      image: alpine:latest
      command: [sh, -c]
      args: ["sleep 30s"]
      resources:
        requests:
          cpu: 200m
        limits:
          cpu: 200m

Save the workflow above as test-workflow.yaml and submit it 20 times:

for i in {1..20}
do
    argo submit test-workflow.yaml
done

Note: to test the memory quota instead, replace the cpu request and limit with memory values.
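
As a rough guide to how much quota the loop asks for, here is a small shell sketch. It only multiplies the numbers from the template and the loop above; the variable names are made up for illustration, and it counts only the main container, so the real total is likely higher once the containers Argo adds to each pod are included.

```shell
# Lower bound on the total CPU limit requested by the submit loop.
# Assumption: only the main container (200m) is counted; any extra
# containers Argo adds to the pod are ignored here.
workflows=20
cpu_per_pod_m=200                        # millicores, from the template above
total_m=$((workflows * cpu_per_pod_m))   # 20 * 200m
echo "total limits.cpu requested: ${total_m}m"
```

If the namespace quota for limits.cpu leaves less than this free while the pods overlap (each one sleeps for 30s), some of the submissions should hit the quota.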

Anything else we need to know?:

Environment:

  • Argo version:
$ argo version
argo: 2.8.2+8a151ae.dirty
  BuildDate: 2020-06-18T23:50:58Z
  GitCommit: 8a151aec6538c9442cf2380c2544ba3efb60ff60
  GitTreeState: dirty
  GitTag: 2.8.2
  GoVersion: go1.13
  Compiler: gc
  Platform: linux/amd64
  • Kubernetes version:
$ kubectl version -o yaml
clientVersion:
  buildDate: 2020-01-29T21:26:39Z
  compiler: gc
  gitCommit: d4cacc0
  gitTreeState: clean
  gitVersion: v1.10.0+d4cacc0
  goVersion: go1.14beta1
  major: "1"
  minor: 10+
  platform: linux/amd64
serverVersion:
  buildDate: 2020-05-04T12:54:43Z
  compiler: gc
  gitCommit: a3ec9df
  gitTreeState: clean
  gitVersion: v1.16.2
  goVersion: go1.12.12
  major: "1"
  minor: 16+
  platform: linux/amd64

Other debugging information (if applicable):

  • workflow result:
$ argo --loglevel DEBUG get <workflowname>
DEBU[0000] CLI version                                   version="{2.8.2+8a151ae.dirty 2020-06-18T23:50:58Z 8a151aec6538c9442cf2380c2544ba3efb60ff60 2.8.2 dirty go1.13 gc linux/amd64}"
DEBU[0000] Client options                                opts="{{ false false} 0x1574670 0xc000117900}"
Name:                cpu-limit-r4jsz
Namespace:           thoth-test-core
ServiceAccount:      argo
Status:              Error
Message:             pods "cpu-limit-r4jsz" is forbidden: exceeded quota: thoth-test-core-quota, requested: limits.memory=3048Mi, used: limits.memory=30096Mi, limited: limits.memory=32Gi
Conditions:          
 Completed           True
Created:             Thu Jul 30 14:11:30 -0400 (11 minutes ago)
Started:             Thu Jul 30 14:11:30 -0400 (11 minutes ago)
Finished:            Thu Jul 30 14:11:31 -0400 (11 minutes ago)
Duration:            1 second

STEP                TEMPLATE  PODNAME          DURATION  MESSAGE
 ⚠ cpu-limit-r4jsz  wait      cpu-limit-r4jsz  0s        pods "cpu-limit-r4jsz" is forbidden: exceeded quota: thoth-test-core-quota, requested: limits.memory=3048Mi, used: limits.memory=30096Mi, limited: limits.memory=32Gi 
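
For reference, the numbers in the message above can be checked with a few lines of shell. This is only a sketch of the arithmetic, not how the quota admission check is implemented; the variable names are my own.

```shell
# Values copied from the "exceeded quota" message above, all in Mi.
requested_mi=3048
used_mi=30096
limit_mi=$((32 * 1024))    # limited: limits.memory=32Gi

total_mi=$((used_mi + requested_mi))
over_mi=$((total_mi - limit_mi))

if [ "$total_mi" -gt "$limit_mi" ]; then
    echo "rejected: ${total_mi}Mi needed, ${limit_mi}Mi allowed (${over_mi}Mi over)"
else
    echo "admitted"
fi
```

The pod asked for 3048Mi while only 2672Mi of the quota was still free, so the request was refused and the workflow went straight to Error instead of waiting.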

Related #3419 #3490

Message from the maintainers:

If you are impacted by this bug please add a 👍 reaction to this issue! We often sort issues this way to know what to prioritize.

About this issue

  • State: closed
  • Created 4 years ago
  • Reactions: 1
  • Comments: 17 (10 by maintainers)

Most upvoted comments

I’ll take a look