kubernetes: RFE: ability to define special exit code to terminate existing job

This came out of my discussion with @erictune earlier today. Our current implementation of Jobs ensures we always restart pods to fulfill JobSpec.Completions. What if, at a certain point in time, we know the job will never be able to finish successfully? We should be able to mark the job for termination without actually removing it from the system, because we want to keep the old data. This might be very useful with ScheduledJobs (#11980).

@pmorie @sdminonne wdyt?
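
For context, a minimal sketch of the behavior described above (the name and image are illustrative): with restartPolicy: Never, the Job controller keeps creating replacement pods for failed ones until spec.completions pods have succeeded, and there is no field that says "this exit code means give up".

```yaml
apiVersion: batch/v1
kind: Job
metadata:
  name: keeps-retrying          # illustrative name
spec:
  completions: 3                # the controller wants 3 successful pods
  template:
    spec:
      restartPolicy: Never      # failed pods are replaced by the Job controller
      containers:
      - name: worker
        image: busybox
        # Always fails; there is no way to tell the Job
        # "this exit code means stop retrying and mark me terminated".
        command: ["sh", "-c", "exit 1"]
```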

About this issue

  • State: closed
  • Created 9 years ago
  • Comments: 29 (23 by maintainers)

Most upvoted comments

/close With the feature already in beta, we can mark this issue as closed.

@jensentanlo could you create a new issue for what you hit?

My suspicion would be that the controller immediately creates a replacement pod when a pod has a deletion timestamp. But, once the pod actually finishes as succeeded, the controller would mark the job as succeeded and delete the extra pod.
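
The feature referred to above as being in beta is presumably the Job pod failure policy. A minimal sketch of how it addresses the original request, with illustrative names and a placeholder "give up" exit code of 42:

```yaml
apiVersion: batch/v1
kind: Job
metadata:
  name: fail-fast-example       # illustrative name
spec:
  completions: 5
  backoffLimit: 6
  template:
    spec:
      restartPolicy: Never      # required when podFailurePolicy is used
      containers:
      - name: main
        image: busybox
        command: ["sh", "-c", "exit 42"]
  podFailurePolicy:
    rules:
    - action: FailJob           # terminate the whole Job on this exit code
      onExitCodes:
        containerName: main
        operator: In
        values: [42]            # placeholder for the "won't ever succeed" code
```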

We could add a field .spec.containers[].abortOnExitCode, an integer in the 1-127 range. When a container that specifies this field exits with that exit code, the entire pod terminates abnormally, regardless of the restart policy.

Communicating a reason/message upward is orthogonal: you might want to communicate a message any time a container exits, with or without this special exit code.
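
A sketch of what the proposed field might look like in a pod spec; abortOnExitCode is purely hypothetical here and is not an existing Kubernetes API:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: abort-example           # illustrative name
spec:
  restartPolicy: OnFailure
  containers:
  - name: worker
    image: busybox
    command: ["sh", "-c", "exit 42"]
    # Hypothetical field from the proposal above, not a real API field:
    # exiting with 42 would fail the whole pod despite restartPolicy: OnFailure.
    abortOnExitCode: 42
```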

@gaocegege, I’m revisiting this feature request.

How do you determine which exit codes are safe to retry? Are these “retryable” exit codes coming from the TensorFlow framework?

We have a customer who would like to abort on any container error, which they consider “user error”. But if the pod was killed due to Eviction or Shutdown, they would like retries. This is visible in pod.status.reason. So the request is more about the “reason” than about exit codes.

Can you clarify your expectations so we can find a solution that works for both requests?
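
If the “reason”-based request maps onto pod conditions, a pod failure policy could express both behaviors in one Job spec fragment; a sketch, noting that rules are evaluated in order and the first matching rule wins:

```yaml
podFailurePolicy:
  rules:
  - action: Ignore              # retry: the pod was disrupted (eviction, node shutdown, ...)
    onPodConditions:
    - type: DisruptionTarget
  - action: FailJob             # any other non-zero exit is treated as a user error
    onExitCodes:
      operator: NotIn
      values: [0]
```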

@pacoxu Hi Paco, I think we still need this feature. I am trying to use Jobs to run distributed deep learning training workloads. In such a use case, we need to know whether a failure can be fixed by retrying.