enhancements: Retriable and non-retriable Pod failures for Jobs

Enhancement Description

About this issue

  • Original URL
  • State: open
  • Created 2 years ago
  • Reactions: 2
  • Comments: 83 (71 by maintainers)

Most upvoted comments

skimming the discussion, my takeaways are:

  • there’s information that could be useful (signal information, etc.) that isn’t available from container runtimes or the pod API today
    • if that information was made available in the future, it could be added as one of the things the job API considers to make it more natural to express some of the scenarios motivating this feature
    • however, even if that information was made available in the future, we would still need the ability to consider exit codes
    • starting with expressing job retry policy in terms of the information we currently have from container runtimes seems sensible
  • there are two levels of retry that could happen: node-local (defined in the pod, informing kubelet behavior), and job-level (defined in the job, informing job controller behavior)
    • node-local retry could gain sophistication in the future
    • even if node-local retry did get expanded/enhanced, job-level retry would still be necessary (nodes go away, or become disqualified from running a particular workload via taint or resource availability or preemption or …)
    • starting with expressing job retry policy to inform what the job controller is already doing seems sensible
  • for the structure of the failure policy rules, I don’t see those as union types, but as policy matching requirements
    • I think it is more powerful and more succinct to keep the requirements distinct and let multiple be specified than to require a new grouping. Tim’s comment ("What we should be doing for this is a set of single-choice cases, where one case is 'just X', another is 'just Y', and a third is 'X and Y'.") doesn’t seem like it scales well
    • as a thought experiment, imagine signal info gets added in the future, so we could match on exit codes, conditions, or signals, and for some reason it became important to match on combinations of these things. To allow matching combinations with the “single-choice” approach, you’d need 7 fields (see the sketch after this list):
      • onExitCodes
      • onConditions
      • onSignals
      • onExitCodesAndSignals
      • onExitCodesAndConditions
      • onExitCodesAndConditionsAndSignals
      • onConditionsAndSignals
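
To illustrate the argument above, here is a sketch of the “distinct requirements” shape (this is illustrative only, not the shipped API; the onSignals field is hypothetical): matching a combination is just a matter of specifying more than one requirement in the same rule, with implicit AND semantics, so no onExitCodesAndSignals-style fields are needed.

  podFailurePolicy:
    rules:
    # a rule matching on exit codes only
    - action: FailJob
      onExitCodes:
        operator: In
        values: [42]
    # a rule matching a combination: both requirements must hold for the rule to apply
    # (onSignals is hypothetical; it shows that a new requirement type composes
    # with the existing ones without needing combined "AndX" fields)
    - action: Ignore
      onExitCodes:
        operator: In
        values: [137]
      onSignals:
        values: [SIGKILL]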

Hi @alculquicondor @mimowo, Enhancements team here again 👋

Checking in as we approach Code Freeze at 01:00 UTC on Wednesday, 3rd August 2022.

Please ensure that the following items are completed before the code freeze:

Let me know if there are any additional k/k PRs besides the ones listed above.

Currently, the status of the enhancement is marked as at-risk.

Thanks 😃

We will also have an improvement to this KEP as part of: https://github.com/kubernetes/kubernetes/pull/117015 in 1.28.

The max-restarts KEP isn’t the same as restart-rules, though, right? They all seem complementary but not the same.

Yes, these are the pieces of work that need to be done to fully support restartPolicy Never and OnFailure:

  1. restartPolicy=Never (this KEP, currently Beta without blocking issues open)
  2. maxRestartTimes (started as a new KEP)
  3. enable restartPolicy=OnFailure (not started, but it seems like a small follow-up KEP); plus a change to the Job’s backoffLimit when maxRestartTimes is specified
  4. containerRestartPolicy in pod spec (not started, AFAIK purely node changes) (enabling API similar to the one in https://github.com/kubernetes/enhancements/issues/3329#issuecomment-1571643421)

Note that:

  • (2.) is a prerequisite for (3.)
  • (3.) and (4.) are technically independent

IMO doing (1.) - (4.) under this KEP would prolong its graduation substantially, and the different points have different priorities, so it is useful to de-couple them. Recently we got this Slack ask to support OnFailure. We will need to keep track of how much demand there is for (3.) and (4.) and prioritize accordingly; de-coupling allows that.
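
For reference, a minimal Job manifest illustrating point (1.), i.e. what this KEP supports today: a podFailurePolicy combined with restartPolicy: Never in the pod template (the names example-job, main, and the busybox image are just placeholders).

  apiVersion: batch/v1
  kind: Job
  metadata:
    name: example-job
  spec:
    backoffLimit: 6
    podFailurePolicy:
      rules:
      - action: FailJob        # fail the whole Job instead of retrying
        onExitCodes:
          operator: In
          values: [42]
    template:
      spec:
        restartPolicy: Never   # required with podFailurePolicy today; point (3.) would relax this
        containers:
        - name: main
          image: busybox
          command: ["sh", "-c", "exit 42"]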

Not sure I understand what failWithPrejudice is.

I was hand-waving that a Pod can fail or it can perma-fail. A perma-fail means “give up”. This would give Pods the ability to use restartPolicy=OnFailure (e.g. OOM) and still recognize that some failures are intentional and should not be restarted. It seems unfortunate to give up on restartPolicy when what you really want is a slightly different policy. I’m trying to emphasize that, in general, we want to pass information down as far as possible and enable the most-local control loop to make correct decisions. In this case, we can tell kubelet “restart this pod until it exits cleanly, or until it hits a specific case”.

You still need the higher-level control-plane (Job) to react to those specific cases. It just doesn’t need to be in the loop for ALL POSSIBLE exit codes.

We can do it all in the higher level control plane, but WHY? I am reminded of another system I worked on which did exactly this, and ultimately had to add the ability to make more localized retry decisions in order to scale.

@npolshakova I added the KEP update PR to the description

I took another skim thru this and I noticed that the failure-policy covers the whole pod, but speaks in terms like “exit code”:

  podFailurePolicy:
    rules:
    - action: FailJob
      onExitCodes:
        operator: NotIn
        values: [40,41,42]

Pods don’t have exit codes - containers do. What happens when there’s more than one container?
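
For reference, the rule as eventually implemented can be scoped to a single container via an optional containerName field under onExitCodes; when that field is omitted, the requirement is evaluated against the exit codes of all containers in the Pod. A minimal sketch reusing the example above:

  podFailurePolicy:
    rules:
    - action: FailJob
      onExitCodes:
        containerName: myapp   # restrict the check to this container; omit to check all containers
        operator: NotIn
        values: [40, 41, 42]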

I still wonder if it would be better to delegate some aspects of this to the Pod definition - like:

Pod.spec: {
    restartPolicy: Never
    containers:
    - name: myapp
      onExit:
      - statusCode:
          valueRange:
            from: 40
            to: 42
        action:
          failWithPrejudice: {}
}

or even:

Pod.spec: {
    restartPolicy: Always
    containers:
    - name: myapp
      onExit:
      - statusCode:
          valueRange:
            from: 42
        action:
          fail: {}
      - signal:
          name: SIGBUS
        action:
          failWithPrejudice: {}
}

We could even consider restartPolicy: Bespoke to trigger this processing.

hi @mimowo and @alculquicondor, I have found an issue with API-initiated eviction when used together with PodDisruptionConditions https://github.com/kubernetes/kubernetes/issues/116552, I am trying to fix it in https://github.com/kubernetes/kubernetes/pull/116554
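
For context, PodDisruptionConditions is the part of this KEP that adds disruption-related Pod conditions (such as DisruptionTarget), which podFailurePolicy rules can then match. A typical use, as shown in the Kubernetes documentation, is to avoid counting disruption-caused failures against the Job’s backoffLimit:

  podFailurePolicy:
    rules:
    - action: Ignore             # do not count the failure towards backoffLimit
      onPodConditions:
      - type: DisruptionTarget   # condition added by the control plane on disruption (e.g. eviction, preemption)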

However, handling OOM failures is considered potential future work, which fits into the framework of pod failure conditions. In particular, we considered introducing a new pod condition, called ResourceExhausted, to indicate the reason. We discuss it more here: https://github.com/kubernetes/enhancements/blob/9aa33b83c3fbaa73e4fa2b96335b5c7e613a2b51/keps/sig-apps/3329-retriable-and-non-retriable-failures/README.md#resource-limits-exceeded.

+1 for handling OOM failures in a direct way. First, let me say that the current feature already works great, especially when the team running the job also controls the container (since we can programmatically return certain exit codes). However, when they are different (say you have a separate infra team), it would be great to be able to specify that OOM failures should not be retried.
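
A rough sketch of what such a rule could look like under that future-work proposal; ResourceExhausted is the hypothetical condition type discussed in the KEP, not a shipped API, while onPodConditions is the existing rule form:

  podFailurePolicy:
    rules:
    - action: FailJob
      onPodConditions:
      - type: ResourceExhausted   # hypothetical condition from the KEP's future-work section
        status: "True"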

There is one more k/e PR whose purpose is to align the KEP with the decisions taken during the implementation phase. Not sure if it should be blocking for the Code Freeze. Anyway, @alculquicondor, could you please add the KEP update to the list of PRs and review/approve it?

Hey @mimowo and @alculquicondor,

As the code freeze is just a day away, I wanted to confirm that there are no open PRs in the k/k repo (or any other repo) for this enhancement other than the ones outlined in the issue description. Please get any open PRs merged before the code freeze so that the enhancement can be marked as tracked.

Hello @alculquicondor and @mimowo 👋 1.26 Release Docs shadow here!

This enhancement is marked as ‘Needs Docs’ for the 1.26 release. Please follow the steps detailed in the documentation to open a PR against the dev-1.26 branch in the k/website repo. This PR can be just a placeholder at this time, and must be created by November 9. Also, take a look at Documenting for a release to familiarize yourself with the docs requirements for the release.

Thank you!

Thanks @soltysh for bringing this to my notice (not sure how I missed it; sorry for the error), everything is up-to-date! I’ve updated the KEP status to tracked for the 1.26 release cycle 😃

@Atharva-Shinde the PRR has been approved at the correct beta level in https://github.com/kubernetes/enhancements/pull/3463/files so I’m not quite sure what else you expect.

/milestone v1.26
/label lead-opted-in

(For sig-node: we see this is not attempting to derive any intelligence from kubelet/runtime-initiated conditions.)

Thanks @mimowo I’ve updated my comment 😃

Yes, the enhancement is marked as tracked for the 1.25 release cycle. Thank you so much! 🙂