kubernetes: Maximum number of failures or failure backoff policy for Jobs

A max number of failures or failure backoff policy for Jobs would be useful.

Imagine you have an ETL job in production that’s failing due to some pathological input. By the time you spot it, it’s already been rescheduled thousands of times. As an example, I had 20 broken jobs that kept getting rescheduled; killing them took forever – and crashed the Kubernetes dashboard in the process.

Today, “restartPolicy” is not respected by Jobs because the goal is to achieve successful completion. (Strangely, “restartPolicy: Never” is still valid YAML.) This means failed jobs keep getting rescheduled. When you go to delete them, you have to delete all the pods they’ve been scheduled on. Deletes are rate limited, and in v1.3+, the verbose “you’re being throttled” messages are hidden from you when you run the kubectl command to delete a job. So it just looks like it’s taking forever! This is not a pleasant UX if you have a runaway job in prod or if you’re testing out a new job and it takes minutes to clean up after a broken test.
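
For context, here is a minimal sketch of the kind of Job spec involved (the name and image are placeholders). Even with restartPolicy: Never, the Job controller keeps creating replacement pods until the job succeeds, which is how failed pods pile up:

```yaml
apiVersion: batch/v1
kind: Job
metadata:
  name: broken-etl              # placeholder name, for illustration only
spec:
  template:
    spec:
      containers:
      - name: etl
        image: example/etl:latest   # placeholder image
      # "Never" is accepted by validation, but the Job controller still
      # replaces every failed pod with a new one to drive the job toward
      # completion, so failures show up as an ever-growing list of pods.
      restartPolicy: Never
```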

What are your thoughts on specifying a maximum number of failures/retries or adding a failure backoff restart policy?

(cc @thockin , who suggested I file this feature request)


Ninja edit: per this SO thread, the throttling issue is avoidable by using the OnFailure restart policy, which restarts the failed container in the same pod instead of scheduling new pods, i.e. it prevents the explosion in the number of pods. And deadlines (activeDeadlineSeconds) can help weed out failures after a certain amount of time.

However, suppose my ETL job takes an hour to run properly but may fail within seconds if the input data is bad. I’d rather specify a maximum number of retries than a high deadline.
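
For reference, a minimal sketch of the workaround above, using the standard restartPolicy and activeDeadlineSeconds fields (the values are illustrative):

```yaml
apiVersion: batch/v1
kind: Job
metadata:
  name: etl-job                 # placeholder name
spec:
  # Fail the whole job after this many seconds, no matter how many retries.
  # A legitimate run takes about an hour, so this must be well above 3600,
  # which means a job with bad input can still churn retries for over an hour.
  activeDeadlineSeconds: 4500
  template:
    spec:
      containers:
      - name: etl
        image: example/etl:latest   # placeholder image
      # OnFailure restarts the failed container inside the same pod instead
      # of scheduling new pods, keeping the pod count bounded.
      restartPolicy: OnFailure
```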

About this issue

  • State: closed
  • Created 8 years ago
  • Reactions: 29
  • Comments: 31 (20 by maintainers)

Most upvoted comments

Hi @soltysh @nickschuch, a bunch of people on our side (CERN) are also interested in this (@diegodelemos, @rochaporto) and we'd be glad to help.

@thockin, @maximz re your Twitter convo: the docs (http://kubernetes.io/docs/user-guide/jobs/) imply that restartPolicy = Never is supported for jobs, which I found misleading.

  1. The first example sets restartPolicy = Never
  2. The Pod Template section reads, “Only a RestartPolicy equal to Never or OnFailure are allowed.”
  3. The Handling Pod and Container failures section reads, “Therefore, your program needs to handle the case when it is restarted locally, or else specify .spec.template.containers[].restartPolicy = “Never””

If restartPolicy = Never shouldn’t be allowed and the goal here is that jobs run until completion, then the docs need to change alongside the max failures/retries feature request.

What’s the status of this issue? I really want to use this feature. If I run a job, I just want to know success or failure with no further retries; if it always retries, it will always fail.

I’ve created this proposal to address the issue: https://github.com/kubernetes/community/issues/583

@nickschuch will you be able to commit yourself and have it done in time for 1.7?

Yep, absolutely. I’ve been looking for where to best contribute my time on K8s. I want to see this through.

I’d prefer to have it addressed once and for all (the API part, not the implementation), and not one step at a time, because this might cause problems later on.

I agree, it will make it easier for me to understand this issue.

How do we go about getting agreement? (sorry, still new to development in the K8s community)

I support adding a backoff policy to Jobs. I think exponential backoff with a max of like 5min would be fine. I don’t think it would be a breaking change to introduce a backoff policy by default. I suspect some users might want fast retry, but we can wait on adding a field until they do.

I also support having a “max failed pods to keep around”, which would cause the job controller to garbage collect some failed pods before creating new ones. At a minimum, keeping the first and the last failed pod would be useful for debugging. But keeping like 1000 failed pods is usually not useful, especially if parallelism is 1. I’m not sure if we can change this to be a default, but we can definitely make it a knob.

I’d want to discuss it a bit more, but I’d also be open to a field whose meaning is “the job can move to a failure state after this many pod failures”. We already have a way to fail after a certain amount of time.
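
To make the knobs discussed above concrete, here is a rough sketch of what they could look like on the Job spec; the field names and values are hypothetical illustrations, not an agreed API:

```yaml
apiVersion: batch/v1
kind: Job
metadata:
  name: etl-job                 # placeholder name
spec:
  # Hypothetical fields, purely illustrative:
  backoffPolicy: Exponential    # retry delay grows exponentially, capped around 5 minutes
  maxFailedPodsToKeep: 2        # garbage collect older failures, e.g. keep first and last
  maxPodFailures: 5             # move the job to a failed state after this many pod failures
  template:
    spec:
      containers:
      - name: etl
        image: example/etl:latest   # placeholder image
      restartPolicy: OnFailure
```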

I am not very enthusiastic about a “max pod failures before job fails” feature. In particular, I don’t know that we can easily guarantee that a job only ever tries once.

@Yancey1989 you’re welcome to submit a patch as well 😃

@diegodelemos you’re welcome to submit a patch with an appropriate fix at any time 😃