argo-workflows: About failures due to exceeded resource quota
Motivation
Hello there! Some time back I asked a question on the Slack channel and still haven’t received any advice on the problem.
I need some help and advice about the way Argo handles resource quotas. We’re hitting the problem repeatedly in our namespace: a workflow fails because of quota limits and is never retried.
An example Workflow result:
pods "workflow-test-1578387522-82e06118" is forbidden: exceeded quota: diamand-quota, requested: limits.cpu=2, used: limits.cpu=23750m, limited: limits.cpu=24
Is there any advice with respect to the Workflow reconciliation? Any existing solutions? Does / should workflow-controller take care of that?
Summary
All in all, I need to know:
a) whether the problem is on our side
b) whether there is an easy way to work around Workflows failing due to resource quotas
c) whether anyone else is hitting this issue
d) whether there are any plans on the Argo side regarding this, and/or how I can contribute
I am ready and willing to tackle the implementation myself; I am just not experienced enough to tell whether this is something that can be implemented, or how to go about it. Again, any pointers are welcome! 😃
Cheers, Marek
About this issue
- Original URL
- State: closed
- Created 4 years ago
- Reactions: 2
- Comments: 30 (23 by maintainers)
v2.11
We are considering making this the default behaviour in v2.11. Thoughts?
@CermakM let me backport it to 2.6.1, and I’ll open a pull request.
@jamhed Just tested it out. Works like a charm! What a relief to see that … are there any plans for merging this upstream?!
@simster7 🙏
Thank you. I wanted to verify it worked better.
We’ve installed your build argoproj/workflow-controller:fix-3791 (sha256:2cc4166ce). I can confirm the workflow behaves much more stably with respect to resources compared to v2.9.5 (I haven’t tested any newer releases than that). Is there anything to watch for in the logs? (I didn’t see any relevant messages.)
Thank you. I’ve created a new image for testing, if you would like to try it: argoproj/workflow-controller:fix-3791
@CermakM @simster7 https://github.com/argoproj/argo/pull/2385
@CermakM, these are pretty much the changes you need to make in the argo helm chart:
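A minimal sketch of that override, assuming the chart exposes the controller image through values like these (the exact key names depend on the chart version in use):

```yaml
# values.yaml override for the argo helm chart (sketch; key names
# vary by chart version), pointing the controller at the patched build
controller:
  image:
    repository: jamhed/workflow-controller
    tag: v2.4.3-1
```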
@CermakM jamhed/workflow-controller:v2.4.3-1. If it can’t schedule the pod, it just keeps it in the Pending state. Details: https://github.com/argoproj/argo/compare/v2.4.3...jamhed:v2.4.3-1
@xogeny I’m going to implement it; if you’d like, you can be one of the beta testers 😃 Basically, the idea is to change how pods are scheduled: instead of failing when there are no resources, indicate a “Pending” state.
I second that. We use Argo Workflows with GPUs, and as of now a limit on the namespace causes the workflow to fail. To make it work, one needs to set the number of retries to unlimited, and this is very bad: suppose there is an error in the pod itself. I’d like the Argo controller to be a bit more mindful of resource allocation and take pod requests and namespace limits into consideration. A sketch of that workaround follows.
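A minimal sketch of the unlimited-retries workaround described above (the template name and image are hypothetical): an empty retryStrategy places no cap on retries, so a quota-rejected pod is eventually rescheduled, but a genuinely broken pod also retries forever.

```yaml
apiVersion: argoproj.io/v1alpha1
kind: Workflow
metadata:
  generateName: gpu-train-          # hypothetical name
spec:
  entrypoint: train
  templates:
    - name: train
      retryStrategy: {}             # no limit set, so failed pods retry
                                    # indefinitely, including real bugs
      container:
        image: example/gpu-train:latest   # hypothetical image
        resources:
          limits:
            nvidia.com/gpu: 1
```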