argo-workflows: About failures due to exceeded resource quota
Motivation
Hello there! Some time back I asked a question on the Slack channel and still haven’t received any advice on the problem.
I need some help and advice about the way Argo handles resource quotas. We’re hitting the problem repeatedly in our namespace: a workflow fails because of quota limits and is never retried.
An example Workflow result:
pods "workflow-test-1578387522-82e06118" is forbidden: exceeded quota: diamand-quota, requested: limits.cpu=2, used: limits.cpu=23750m, limited: limits.cpu=24
Is there any advice with respect to the Workflow reconciliation? Any existing solutions? Does / should workflow-controller take care of that?
Summary
All in all, I need to know:
a) whether the problem is on our side
b) whether there is an easy way to work around Workflows failing due to resource quotas
c) whether anyone else is hitting this issue
d) whether there are any plans on the Argo side regarding this, and/or how I can contribute
I am ready and willing to tackle the implementation myself; I am just not experienced enough to tell whether this is something that can be implemented, or how to go about it. Again, any pointers are welcome! 😃
Cheers, Marek
About this issue
- Original URL
- State: closed
- Created 4 years ago
- Reactions: 2
- Comments: 30 (23 by maintainers)
v2.11
We are considering making this the default behaviour in v2.11. Thoughts?
@CermakM let me backport it to 2.6.1, and I’ll open a pull request.
@jamhed Just tested it out. Works like a charm! What a relief to see that … are there any plans for merging this upstream?!
@simster7 🙏
Thank you. I wanted to verify it worked better.
We’ve installed your build argoproj/workflow-controller:fix-3791 (sha256:2cc4166ce). I can confirm the workflow behaves much more stably with respect to resources compared to v2.9.5 (I haven’t tested any newer releases than that). Is there anything to watch for in the logs? (I didn’t see any relevant messages.)
Thank you. I’ve created a new image for testing, if you would like to try it: argoproj/workflow-controller:fix-3791
@CermakM @simster7 https://github.com/argoproj/argo/pull/2385
@CermakM, these are pretty much the changes you need to make in the argo helm chart:
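A minimal sketch of that override, assuming the chart exposes the controller image through values like these (the exact key names depend on the chart version in use):

```yaml
# values.yaml override for the argo helm chart (sketch; key names
# vary by chart version), pointing the controller at the patched build
controller:
  image:
    repository: jamhed/workflow-controller
    tag: v2.4.3-1
```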
@CermakM jamhed/workflow-controller:v2.4.3-1. If it can’t schedule the pod, it just keeps it in the Pending state. Details: https://github.com/argoproj/argo/compare/v2.4.3...jamhed:v2.4.3-1
@xogeny I’m going to implement it; if you’d like, you can be one of the beta testers 😃 Basically, the idea is to change how pods are scheduled: instead of failing when there are no resources, indicate a “Pending” state.
I second that. We use Argo Workflows with GPUs, and as of now a limit on the namespace causes the workflow to fail. To make it work, one needs to set the number of retries to unlimited, and this is very bad: suppose there is an error in the pod itself. I’d like the Argo controller to be a bit more mindful of resource allocation and take pod requests and namespace limits into consideration. A sketch of that workaround follows.
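A minimal sketch of the unlimited-retries workaround described above (the template name and image are hypothetical): an empty retryStrategy places no cap on retries, so a quota-rejected pod is eventually rescheduled, but a genuinely broken pod also retries forever.

```yaml
apiVersion: argoproj.io/v1alpha1
kind: Workflow
metadata:
  generateName: gpu-train-          # hypothetical name
spec:
  entrypoint: train
  templates:
    - name: train
      retryStrategy: {}             # no limit set, so failed pods retry
                                    # indefinitely, including real bugs
      container:
        image: example/gpu-train:latest   # hypothetical image
        resources:
          limits:
            nvidia.com/gpu: 1
```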