kubernetes: Backoff Limit for Job does not work on Kubernetes 1.10.0
Is this a BUG REPORT or FEATURE REQUEST?: /kind bug
What happened:
.spec.backoffLimit for a Job does not work on Kubernetes 1.10.0
What you expected to happen:
.spec.backoffLimit should limit the number of times a pod is restarted when running inside a Job
How to reproduce it (as minimally and precisely as possible):
Use this resource file:
apiVersion: batch/v1
kind: Job
metadata:
  name: error
spec:
  backoffLimit: 1
  template:
    metadata:
      name: job
    spec:
      restartPolicy: Never
      containers:
      - name: job
        image: ubuntu:16.04
        args:
        - sh
        - -c
        - sleep 5; false
If the job is created in Kubernetes 1.9, it soon fails as expected:
...
status:
  conditions:
  - lastProbeTime: 2018-04-11T10:20:31Z
    lastTransitionTime: 2018-04-11T10:20:31Z
    message: Job has reach the specified backoff limit
    reason: BackoffLimitExceeded
    status: "True"
    type: Failed
  failed: 2
  startTime: 2018-04-11T10:20:00Z
...
When the same job is created in Kubernetes 1.10, the pod is restarted indefinitely:
...
status:
  active: 1
  failed: 8
  startTime: 2018-04-11T10:37:48Z
...
Anything else we need to know?:
Environment:
- Kubernetes version (use kubectl version): 1.10.0
- Cloud provider or hardware configuration: AWS
- OS (e.g. from /etc/os-release): Ubuntu 16.04
- Kernel (e.g. uname -a): Linux
- Install tools: kubeadm
- Others:
About this issue
- Original URL
- State: closed
- Created 6 years ago
- Reactions: 31
- Comments: 54 (27 by maintainers)
Links to this issue
Commits related to this issue
- Bug 1585648- Set timeout for ASB migration job (workaround for kubernetes/kubernetes#62382) — committed to fabianvf/openshift-ansible by fabianvf 6 years ago
- Merge pull request #63650 from soltysh/issue62382 Automatic merge from submit-queue (batch tested with PRs 64009, 64780, 64354, 64727, 63650). If you want to cherry-pick this change to another branch... — committed to kubernetes/kubernetes by deleted user 6 years ago
- Merge pull request #64813 from cblecker/automated-cherry-pick-of-#58972-#63650-upstream-release-1.10 Automatic merge from submit-queue. Automated cherry pick of #58972: Fix job's backoff limit for r... — committed to kubernetes/kubernetes by deleted user 6 years ago
I’m planning to cut 1.10.5 tomorrow, assuming all tests will be green. In case of any last minute problems it may slip by a day or two, but so far everything looks good for tomorrow.
This bug killed my test cluster over the long weekend (with 20000+ pods). Luckily we don’t yet use CronJobs in prod.
@dims I think this issue should be fixed in 1.10 regardless of 1.11, as it causes serious cluster instability in a fairly common scenario. Specifically for us, migration to 1.11 adds complexity and risk due to the change in the way NVIDIA GPUs are managed, so it will take us some time to upgrade to 1.11.
I’ve opened #63650 to address the issue. Sorry for the troubles y’all.
This is a regression in functionality, and when the proper fix is determined, it should be cherry picked back to 1.10. We need it fixed in master first, however.
I’m going to escalate this to sig-apps to see if we can get the right resources.
killed my cluster too 😦 still existing in kubernetes 1.10.2
Thank you to all involved! I really appreciate this being rolled back into 1.10.
This fix has been cherry picked back to 1.10, and should show up in the next patch release (1.10.5).
Unless I'm mistaken, the Job objects are considered stable APIs, right? (They are under the batch/v1 group.) If so, I also think this should be fixed in the next 1.10 point release. In general, like @aalubin, we would rather be able to use Jobs without being forced to upgrade to 1.11.
I had to add activeDeadlineSeconds to all Jobs and Hooks as a workaround…
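For anyone stuck on an affected 1.10.x release, a minimal sketch of that workaround applied to the reproduction Job above; the activeDeadlineSeconds value of 60 is an arbitrary example and should be set larger than the job's expected runtime:

apiVersion: batch/v1
kind: Job
metadata:
  name: error
spec:
  backoffLimit: 1
  # Workaround: even if backoffLimit is ignored on an affected 1.10.x release,
  # activeDeadlineSeconds marks the Job as Failed once the deadline passes,
  # which stops the controller from creating new pods.
  activeDeadlineSeconds: 60   # example value, not from the original report
  template:
    metadata:
      name: job
    spec:
      restartPolicy: Never
      containers:
      - name: job
        image: ubuntu:16.04
        args:
        - sh
        - -c
        - sleep 5; false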
In version v1.10.11, even though backoffLimit is set to 4 for a Job, the pod is evicted 5 times, and in a few cases 3 times. Note that we have also applied completions: 1 and parallelism: 1 (see the sketch after this comment); the version details of the cluster we are using are below. The eviction itself happens for a genuine reason: the node's disk space is exhausted when the application inside the pod runs, and we know why that happens. The point is that the Job's pod should be attempted 4 times, not 5. Strangely, we do not see this problem in another environment running v1.10.5 (version details of that environment also below).
There the pod eviction happens exactly as documented for eviction with backoffLimit, and we have no issues. The reported problem started when we upgraded to v1.10.11 about two days ago.
We would appreciate it if someone in this forum could advise on the root cause and a solution. Is there any flaw in version v1.10.5, such as a security loophole?
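For reference, a sketch of a Job spec matching the fields described in that comment; only backoffLimit, completions, and parallelism come from the report, while the name, image, and command are hypothetical placeholders:

apiVersion: batch/v1
kind: Job
metadata:
  name: disk-heavy-job            # hypothetical name for illustration
spec:
  backoffLimit: 4                 # reporter expects at most 4 attempts; 5 (sometimes 3) were observed on v1.10.11
  completions: 1
  parallelism: 1
  template:
    spec:
      restartPolicy: Never
      containers:
      - name: app                 # hypothetical container; the real workload exhausts node disk space
        image: ubuntu:16.04       # placeholder image
        command: ["sh", "-c", "run-the-real-workload"]   # placeholder command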
Just update to something newer than 1.10.0. That version is ages old and I’m sure it is fixed within the latest 1.10.x release. (Which is also out of support so update to 1.12/1.13 ASAP.)
@aalubin yes, I guess my question is: is it fixed in master (and hence already fixed for 1.11)? If not, do we need to get this fixed in master/1.11 and backport to 1.10?