kubernetes: Add max retries when pod's restartPolicy is RestartPolicyOnFailure
If a pod’s restartPolicy is RestartPolicyOnFailure, its containers are restarted constantly when they fail. However, this can be pointless if the failure is caused by a system fault or an internal error in the user’s program: the pod will keep failing no matter how many times it retries. At the same time, we cannot distinguish which cases should be restarted and which should not. I propose adding a max retries setting for the RestartPolicyOnFailure policy to the pod spec, so that the pod fails permanently after the maximum number of retries.
/kind feature
/sig apps
About this issue
- State: closed
- Created 6 years ago
- Reactions: 17
- Comments: 63 (29 by maintainers)
Finally I implemented it. I added `maxRetries` to PodSpec, e.g.:
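A minimal sketch of such a pod spec, assuming the field name `maxRetries` proposed here (it is not part of the upstream PodSpec API, and the pod name, image, and command are placeholders):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: flaky-task
spec:
  restartPolicy: OnFailure
  # maxRetries is the field proposed in this issue; it is NOT part of the
  # upstream PodSpec API, so this manifest is only illustrative.
  maxRetries: 3
  containers:
  - name: task
    image: busybox
    command: ["sh", "-c", "exit 1"]
```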
The above pod will retry 3 times and then fail for good. Without `maxRetries`, the pod will retry forever even though it will never succeed, and the retries waste time and resources.

BTW: I implemented another version using the annotation `kubernetes.io/maxRetriesForOnFailurePolicy`. Maybe we could use that for an alpha feature?

Sorry all, I’ll write a KEP for it.
Hi @filipre, this feature aims to solve exactly that kind of problem: a pod might fail because of all kinds of errors, and it is meaningless to always retry. This feature could reduce costs.
I’m glad to work on a KEP then, thanks. Hope to finish it in release 1.25.
You can use `.spec.backoffLimit`: https://kubernetes.io/docs/concepts/workloads/controllers/job/#pod-backoff-failure-policy

Can anyone confirm whether this feature has been implemented? I don’t see any mention of it in the official Kubernetes documentation. If it’s not implemented, I request that the owners kindly look into implementing it.
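For reference, a minimal Job that uses `.spec.backoffLimit` as suggested above (the Job name, image, and command are placeholders); the Job controller marks the Job as failed once the pod has failed more times than `backoffLimit` allows:

```yaml
apiVersion: batch/v1
kind: Job
metadata:
  name: retry-limited-job
spec:
  backoffLimit: 3                # give up after 3 failed retries
  template:
    spec:
      restartPolicy: OnFailure   # Job pods must use OnFailure or Never
      containers:
      - name: task
        image: busybox
        command: ["sh", "-c", "exit 1"]
```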
What is the status of this feature/PR? This is something we are really interested in.
Just sent a PR for it, implemented it using pod annotations. cc @filipre
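A sketch of what the annotation-based variant mentioned in this thread might look like; note that `kubernetes.io/maxRetriesForOnFailurePolicy` is the annotation proposed in this issue, not a standard Kubernetes annotation, and the name, image, and value are placeholders:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: flaky-task
  annotations:
    # Proposed in this issue; not recognized by stock Kubernetes.
    kubernetes.io/maxRetriesForOnFailurePolicy: "3"
spec:
  restartPolicy: OnFailure
  containers:
  - name: task
    image: busybox
    command: ["sh", "-c", "exit 1"]
```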
/reopen
I’d be interested in whether this feature would improve performance or reduce costs within the cluster. We have sometimes had services restart a thousand times before we noticed that something was wrong.
This issue shouldn’t go stale before receiving at least some feedback.
/remove-lifecycle stale