kubernetes: Maximum number of failures or failure backoff policy for Jobs
A max number of failures or failure backoff policy for Jobs would be useful.
Imagine you have an ETL job in production that’s failing due to some pathological input. By the time you spot it, it’s already been rescheduled thousands of times. As an example, I had 20 broken jobs that kept getting rescheduled; killing them took forever – and crashed the Kubernetes dashboard in the process.
Today, “restartPolicy” is not respected by Jobs because the goal is to achieve successful completion. (Strangely, “restartPolicy: Never” is still valid YAML.) This means failed jobs keep getting rescheduled, and when you go to delete them, you also have to delete all the pods they’ve been scheduled on. Those deletes are rate limited, and as of v1.3 the verbose “you’re being throttled” messages are hidden when you run the kubectl command to delete a job, so it just looks like the delete is taking forever. This is not a pleasant UX if you have a runaway job in prod, or if you’re testing a new job and it takes minutes to clean up after a broken test.
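(Roughly what that cleanup looks like; the job name below is made up. The second command is where the throttled deletes pile up when there are thousands of failed pods.)

```shell
# Delete the runaway Job object itself (hypothetical name).
kubectl delete job my-etl-job

# Delete the pods it spawned; with thousands of failed pods these
# deletes get rate limited and appear to hang with no feedback.
kubectl delete pods -l job-name=my-etl-job
```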
What are your thoughts on specifying a maximum number of failures/retries or adding a failure backoff restart policy?
(cc @thockin , who suggested I file this feature request)
Ninja edit: per this SO thread, the throttling issue is avoidable by using the OnFailure restart policy, which retries the container in the same pod rather than scheduling new pods – i.e. it prevents the explosion in the number of pods. And deadlines (activeDeadlineSeconds) can help weed out failures after a certain amount of time.
However, suppose my ETL job takes an hour to run properly but may fail within seconds if the input data is bad. I’d rather specify a maximum number of retries than a high deadline.
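For concreteness, a minimal sketch of that workaround, with made-up names, image, and timings: restartPolicy: OnFailure retries in the same pod, and activeDeadlineSeconds bounds the total time the Job may keep trying.

```yaml
apiVersion: batch/v1
kind: Job
metadata:
  name: etl-job                   # hypothetical name
spec:
  activeDeadlineSeconds: 7200     # give up after 2 hours; a blunt instrument when bad input fails within seconds
  template:
    spec:
      restartPolicy: OnFailure    # retry the container in the same pod instead of creating new pods
      containers:
      - name: etl
        image: example.com/etl:latest   # hypothetical image
```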
About this issue
- Original URL
- State: closed
- Created 8 years ago
- Reactions: 29
- Comments: 31 (20 by maintainers)
Commits related to this issue
- Merge pull request #583 from soltysh/job_failure_policy Automatic merge from submit-queue Backoff policy and failed pod limit This update addresses problems raised in https://github.com/kubernetes/... — committed to kubernetes/community by deleted user 7 years ago
- Merge pull request #48075 from clamoriniere1A/feature/job_failure_policy Automatic merge from submit-queue (batch tested with PRs 51335, 51364, 51130, 48075, 50920) [API] Feature/job failure policy ... — committed to kubernetes/kubernetes by deleted user 7 years ago
- jobs: clarify that there is no `restartPolicy` for the job itself (#18605) Sometimes, as it happened to me, a Pod's `restartPolicy` is mistakenly taken as the corresponding Job's restart policy. ... — committed to kubernetes/website by jgehrcke 4 years ago
Hi @soltysh @nickschuch, a bunch of people on our side (CERN) are also interested in this (@diegodelemos @rochaporto) and we’d be interested in helping.
@thockin, @maximz re your Twitter convo: the docs (http://kubernetes.io/docs/user-guide/jobs/) imply that restartPolicy = Never is supported for Jobs, which I found misleading.
If restartPolicy = Never shouldn’t be allowed and the goal here is that jobs run until completion, then the docs need to change along with this feature request for max failures/retries.
What is the status of this issue? I really want to use this feature. If I run a job, I just want to know whether it succeeded or failed, with no further retries; if it keeps retrying, it will just keep failing.
I’ve created this proposal to address the issue: https://github.com/kubernetes/community/issues/583
Yep, absolutely. I’ve been looking for where to best contribute my time on K8s. I want to see this through.
I agree, it will make it easier for me to understand this issue.
How do we go about getting agreement? (sorry, still new to development in the K8s community)
I support adding a backoff policy to Jobs. I think exponential backoff with a cap of something like 5 minutes would be fine. I don’t think it would be a breaking change to introduce a backoff policy by default. I suspect some users might want fast retries, but we can wait to add a field for that until they ask for it.
I also support having a “max failed pods to keep around”, which would cause the job controller to garbage collect some failed pods before creating new ones. At a minimum, keeping the first and the last failed pod would be useful for debugging, but keeping around something like 1000 failed pods is usually not useful, especially if parallelism is 1. I’m not sure we can change this to be the default, but we can definitely make it a knob.
I’d want to discuss it a bit more, but I’d also be open to a field whose meaning is “the job can move to a failure state after this many pod failures”. We already have a way to fail after a certain amount of time.
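As a rough sketch of how such a retry cap could surface in the Job spec (the backoffLimit field name follows what the PRs linked above eventually added; the values here are placeholders):

```yaml
apiVersion: batch/v1
kind: Job
metadata:
  name: etl-job                   # hypothetical name
spec:
  backoffLimit: 5                 # mark the Job failed after this many retried attempts
  activeDeadlineSeconds: 7200     # existing knob: fail after a fixed amount of wall-clock time
  template:
    spec:
      restartPolicy: OnFailure
      containers:
      - name: etl
        image: example.com/etl:latest   # hypothetical image
```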
I am not very enthusiastic about a “max pod failures before job fails” feature. In particular, I don’t know that we can easily guarantee that a job only ever tries one time.
@Yancey1989 you’re welcome to submit a patch as well 😃
@diegodelemos you’re welcome to submit a patch with an appropriate fix at any time 😃