Retriable and non-retriable Pod failures for Jobs
Enhancement Description

- One-line enhancement description (can be used as a release note): An API to influence retries based on exit codes and/or pod deletion reasons (see the sketch below the tracking lists).
- Kubernetes Enhancement Proposal: https://git.k8s.io/enhancements/keps/sig-apps/3329-retriable-and-non-retriable-failures
- Discussion Link: https://github.com/kubernetes/kubernetes/issues/17244
- Primary contact (assignee): @alculquicondor
- Responsible SIGs: apps, api-machinery, scheduling
- Enhancement target (which target equals to which milestone):
  - Alpha release target (x.y): 1.25
  - Beta release target (x.y): 1.26
  - Stable release target (x.y):
- Alpha
  - KEP (k/enhancements) update PR(s):
  - Code (k/k) update PR(s):
    - kubernetes/kubernetes#111070
    - kubernetes/kubernetes#111084
    - kubernetes/kubernetes#111091
    - kubernetes/kubernetes#110959
    - kubernetes/kubernetes#111113
    - kubernetes/kubernetes#111475
  - Docs (k/website) update PR(s): kubernetes/website#35219
- Beta
  - KEP (k/enhancements) update PR(s):
  - Code (k/k) update PR(s):
    - https://github.com/kubernetes/kubernetes/pull/112360
    - https://github.com/kubernetes/kubernetes/pull/113324
    - https://github.com/kubernetes/kubernetes/pull/113304
    - https://github.com/kubernetes/kubernetes/pull/113360
    - https://github.com/kubernetes/kubernetes/pull/113812
    - https://github.com/kubernetes/kubernetes/pull/113580
    - https://github.com/kubernetes/kubernetes/pull/113856
    - https://github.com/kubernetes/kubernetes/pull/113860
    - https://github.com/kubernetes/kubernetes/pull/113927
    - kubernetes/kubernetes#114770
    - kubernetes/kubernetes#114914
    - kubernetes/kubernetes#115056
    - https://github.com/kubernetes/kubernetes/pull/115331
    - https://github.com/kubernetes/kubernetes/pull/116554
    - https://github.com/kubernetes/kubernetes/pull/117586
    - https://github.com/kubernetes/kubernetes/pull/117015
    - https://github.com/kubernetes/kubernetes/pull/121103
  - Docs (k/website) update(s):
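For reference, the sketch mentioned above: a minimal Job using the `podFailurePolicy` field this KEP introduces. The image name and exit code 42 are placeholders; the field names follow the KEP.

```yaml
apiVersion: batch/v1
kind: Job
metadata:
  name: job-pod-failure-policy-example
spec:
  backoffLimit: 6
  template:
    spec:
      restartPolicy: Never   # podFailurePolicy requires the template's restartPolicy to be Never
      containers:
      - name: main
        image: registry.example/batch-worker:latest   # placeholder image
  podFailurePolicy:
    rules:
    - action: FailJob        # exit code 42 signals a non-retriable failure; terminate the whole Job
      onExitCodes:
        containerName: main
        operator: In
        values: [42]
    - action: Ignore         # pod was disrupted (e.g. preemption, node drain); retry without counting toward backoffLimit
      onPodConditions:
      - type: DisruptionTarget
```

Besides `FailJob` and `Ignore`, a rule can use `Count`, which treats the failure as an ordinary retry that counts toward `backoffLimit`.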
skimming the discussion, my takeaways are: "What we should be doing for this is a set of single-choice cases, where one case is 'just X', another is 'just Y', and a third is 'X and Y'" doesn't seem like it scales well.

Hi @alculquicondor @mimowo, Enhancements team here again 👋
Checking in as we approach Code Freeze at 01:00 UTC on Wednesday, 3rd August 2022.
Please ensure that the following items are completed before the code freeze:
Let me know if there are any additional k/k PRs besides the ones listed above.
Currently, the status of the enhancement is marked as `at-risk`. Thanks 😃
We will also have an improvement to this KEP as part of https://github.com/kubernetes/kubernetes/pull/117015 in 1.28.
Yes, these are the pieces of work that need to be done to fully support restart policies `Never` and `OnFailure`:

1. `restartPolicy=Never` (this KEP, currently Beta without blocking issues open)
2. `maxRestartTimes` (started as a new KEP)
3. `restartPolicy=OnFailure` (not started, but seems a small follow-up KEP); plus a change to Job's `backoffLimit` when `maxRestartTimes` is specified
4. `containerRestartPolicy` in pod spec (not started, AFAIK purely node changes), enabling an API similar to the one in https://github.com/kubernetes/enhancements/issues/3329#issuecomment-1571643421

Note that, IMO, doing (1.)-(4.) under this KEP would prolong its graduation substantially, and different points have different priorities, so it is useful to de-couple. Recently we got a Slack ask to support `OnFailure`. We will need to keep track of how much demand there is for (3.) and (4.) and prioritize accordingly, but de-coupling allows that.

I was hand-waving that a Pod can fail or it can perma-fail. A perma-fail means "give up". This would give Pods the ability to use `restartPolicy=OnFailure` (e.g. for OOM) and still recognize that some failures are intentional and should not be restarted. It seems unfortunate to give up on `restartPolicy` when what you really want is a slightly different policy. I'm trying to emphasize that, in general, we want to pass information down as far as possible and enable the most-local control loop to make correct decisions. In this case, we can tell the kubelet "restart this pod until it exits cleanly, or until it hits a specific case". You still need the higher-level control plane (Job) to react to those specific cases; it just doesn't need to be in the loop for ALL POSSIBLE exit codes.

We can do it all in the higher-level control plane, but WHY? I am reminded of another system I worked on which did exactly this, and ultimately had to add the ability to make more localized retry decisions in order to scale.
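To make the "most-local control loop" idea concrete, here is a purely hypothetical sketch of what pod-level retry rules could look like. The `restartRules` field, the `GiveUp` action, and the image name are all invented for illustration; none of this is an existing Kubernetes API.

```yaml
# Hypothetical sketch only; restartRules/GiveUp are not real Kubernetes fields.
apiVersion: v1
kind: Pod
metadata:
  name: local-retry-sketch
spec:
  restartPolicy: OnFailure       # kubelet restarts failed containers locally
  restartRules:                  # hypothetical: exceptions evaluated by the kubelet itself
  - action: GiveUp               # "perma-fail": stop restarting and surface the failure upward
    onExitCodes: [42]            # an exit code the workload reserves for intentional failure
  containers:
  - name: main
    image: registry.example/batch-worker:latest   # placeholder image
```

Under a design like this, the kubelet would handle ordinary retries (e.g. after an OOM kill) locally, and the Job controller would only need to react when a rule fires.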
@npolshakova I added the KEP update PR to the description
I took another skim thru this and I noticed that the failure policy covers the whole pod, but speaks in terms like "exit code". Pods don't have exit codes - containers do. What happens when there's more than one container?

I still wonder if it would be better to delegate some aspects of this to the Pod definition - like:

or even:

We could even consider `restartPolicy: Bespoke` to trigger this processing.
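As an aside, the API as merged addresses the multi-container question by scoping exit-code matching: `onExitCodes` takes an optional `containerName`, so a rule can target one specific container, or any container in the pod when the field is omitted. A minimal fragment (the name and exit codes are placeholders):

```yaml
podFailurePolicy:
  rules:
  - action: FailJob
    onExitCodes:
      containerName: main   # match only this container; omit to match any container in the pod
      operator: In
      values: [1, 2, 3]     # placeholder exit codes
```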
hi @mimowo and @alculquicondor, I have found an issue with API-initiated eviction when used together with PodDisruptionConditions (https://github.com/kubernetes/kubernetes/issues/116552); I am trying to fix it in https://github.com/kubernetes/kubernetes/pull/116554.

+1 for handling OOM failures in a direct way. First, let me say that the current feature already works great, especially for situations where the team that is running the job also controls the container (since we can programmatically return certain exit codes). However, for situations where they are different (say you have a separate infra team), it would be great to be able to specify that OOM failures should never be retried.
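A rough workaround with the existing API, assuming the runtime reports an OOM-killed container as exit code 137 (common, but not guaranteed; the container name is a placeholder):

```yaml
podFailurePolicy:
  rules:
  - action: FailJob       # give up instead of retrying
    onExitCodes:
      containerName: main # placeholder container name
      operator: In
      values: [137]       # 128 + SIGKILL(9); what OOM-killed containers typically report
```

This is also why direct OOM matching would still be valuable: exit code 137 cannot distinguish the kernel OOM killer from any other SIGKILL.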
There is one more k/e PR whose purpose is to align the KEP with the decisions taken during the implementation phase. I'm not sure if it should be blocking for the Code Freeze. In any case, @alculquicondor, could you please add the KEP update to the list of PRs and review/approve it?
Hello @alculquicondor and @mimowo 👋 1.26 Release Docs shadow here!
This enhancement is marked as 'Needs Docs' for the 1.26 release. Please follow the steps detailed in the documentation to open a PR against the dev-1.26 branch in the k/website repo. This PR can be just a placeholder at this time, and must be created by November 9. Also, take a look at Documenting for a release to familiarize yourself with the docs requirements for the release.
Thank you!
Thanks @soltysh for bringing this to my notice (not sure how I missed this, sorry for the error), everything is up-to-date! I've updated the KEP status to `tracked` for the 1.26 release cycle 😃

@Atharva-Shinde the PRR has been approved at the correct beta level in https://github.com/kubernetes/enhancements/pull/3463/files, so I'm not quite sure what else you expect?
/milestone v1.26
/label lead-opted-in
(For sig-node, we see this is not attempting to derive any intelligence from kubelet/runtime initiated conditions)
Thanks @mimowo I’ve updated my comment 😃
Yes, the enhancement is marked as `tracked` for the 1.25 release cycle. Thank you so much! 🙂