kubernetes: Job backoffLimit does not cap pod restarts when restartPolicy: OnFailure
Is this a BUG REPORT or FEATURE REQUEST?:
/kind bug
What happened:
When creating a Job with `backoffLimit: 2` and `restartPolicy: OnFailure` (full manifest below), the pod kept restarting past the limit and the Job was never marked as failed:
```
$ kubectl get pods
NAME               READY     STATUS    RESTARTS   AGE
failed-job-t6mln   0/1       Error     4          57s
```
```
$ kubectl describe job failed-job
Name:           failed-job
Namespace:      default
Selector:       controller-uid=58c6d945-be62-11e7-86f7-080027797e6b
Labels:         controller-uid=58c6d945-be62-11e7-86f7-080027797e6b
                job-name=failed-job
Annotations:    kubectl.kubernetes.io/last-applied-configuration={"apiVersion":"batch/v1","kind":"Job","metadata":{"annotations":{},"name":"failed-job","namespace":"default"},"spec":{"backoffLimit":2,"template":{"met...
Parallelism:    1
Completions:    1
Start Time:     Tue, 31 Oct 2017 10:38:46 -0700
Pods Statuses:  1 Running / 0 Succeeded / 0 Failed
Pod Template:
  Labels:  controller-uid=58c6d945-be62-11e7-86f7-080027797e6b
           job-name=failed-job
  Containers:
   nginx:
    Image:  nginx:1.7.9
    Port:   <none>
    Command:
      bash
      -c
      exit 1
    Environment:  <none>
    Mounts:       <none>
  Volumes:        <none>
Events:
  Type    Reason            Age   From            Message
  ----    ------            ----  ----            -------
  Normal  SuccessfulCreate  1m    job-controller  Created pod: failed-job-t6mln
```
What you expected to happen:
We expected at most 2 attempted pod restarts before the Job was marked as failed. Instead, the pod kept restarting, and the Job's status was never set to failed; it remained active.
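A quick way to check whether the controller ever recorded the failure (a sketch; when the limit is enforced, the Job's conditions list gains a `Failed` condition whose reason is `BackoffLimitExceeded`, but in the report above it never appears):

```console
$ kubectl get job failed-job -o jsonpath='{.status.conditions}'
```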
How to reproduce it (as minimally and precisely as possible):
```yaml
apiVersion: batch/v1
kind: Job
metadata:
  name: failed-job
  namespace: default
spec:
  backoffLimit: 2
  template:
    metadata:
      name: failed-job
    spec:
      containers:
      - name: nginx
        image: nginx:1.7.9
        command: ["bash", "-c", "exit 1"]
      restartPolicy: OnFailure
```
Create the above job and observe the number of pod restarts.
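For example (the file name is arbitrary; the `job-name` label is set by the job controller, as seen in the describe output above):

```console
$ kubectl apply -f failed-job.yaml
$ kubectl get pods -l job-name=failed-job -w
```

The RESTARTS column keeps growing past the back-off limit of 2.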
Anything else we need to know?:
The `backoffLimit` field works as expected when `restartPolicy: Never`.
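A sketch of that workaround variant (the Job name is an arbitrary choice; everything else mirrors the reproduction manifest above). With `restartPolicy: Never`, each retry creates a fresh pod, and the number of failed pods is capped by `backoffLimit`:

```yaml
apiVersion: batch/v1
kind: Job
metadata:
  name: failed-job-never   # arbitrary name for the workaround variant
spec:
  backoffLimit: 2
  template:
    spec:
      containers:
      - name: nginx
        image: nginx:1.7.9
        command: ["bash", "-c", "exit 1"]
      restartPolicy: Never  # each retry starts a new pod, counted against backoffLimit
```

This is the same approach the reana-job-controller commits below adopted.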
Environment:
- Kubernetes version (use `kubectl version`): 1.8.0, using minikube
About this issue
- State: closed
- Created 7 years ago
- Reactions: 17
- Comments: 21 (14 by maintainers)
Commits related to this issue
- Merge pull request #58972 from soltysh/issue54870 Automatic merge from submit-queue (batch tested with PRs 61962, 58972, 62509, 62606). If you want to cherry-pick this change to another branch, pleas... — committed to kubernetes/kubernetes by deleted user 6 years ago
- k8s: use restartpolicy never * This will cause that for each retry a new pod will be started instead of trying with the same all the time. Even though it seems to be fixed https://github.com/kube... — committed to diegodelemos/reana-job-controller by deleted user 6 years ago
- installation: use kubernetes official client * Upgrades code to support Kubernetes 1.9.4 (1.10 compatible already). * Uses restartpolicy never. This will cause that for each retry a new pod will b... — committed to diegodelemos/reana-job-controller by deleted user 6 years ago
- Set restartPolicy and backOffLimit in example jobs This should prevent them from re-starting when the workflow fails Note that on kubernetes before 1.12, https://github.com/kubernetes/kubernetes/issu... — committed to Duke-GCB/calrissian by dleehr 5 years ago
I think backoffLimit doesn’t work at all. My Kubernetes version is 1.10.3.
@innovia I ran a simplified version of your manifest against my Kubernetes 1.8.2 cluster, with the command field replaced by `/bin/bash exit 1` so that every run fails, the schedule set to every 5 minutes, and the back-off limit reduced to 2. Here’s my manifest:
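(The manifest itself did not survive; below is a sketch reconstructed from the description above. The apiVersion matches CronJob on 1.8, but the names, image, and `restartPolicy` are assumptions, not the original.)

```yaml
apiVersion: batch/v1beta1
kind: CronJob
metadata:
  name: failing-every-5m        # assumed name
spec:
  schedule: "*/5 * * * *"       # every 5 minutes, as described
  jobTemplate:
    spec:
      backoffLimit: 2           # reduced back-off limit, as described
      template:
        spec:
          containers:
          - name: fail          # assumed container name
            image: nginx:1.7.9  # assumed; any image with bash works
            command: ["/bin/bash", "-c", "exit 1"]  # fails every run
          restartPolicy: Never  # assumed, consistent with new pods per failure
```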
Initially, I get a series of three failing pods. Then it takes pretty much 5 minutes for the next batch to start up and fail. Interestingly, that batch has only two new containers instead of three; the next batch, though, comes with three again, and so does the final batch I tested. Apart from the supposed off-by-one error, things seem to work for me.
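A simple way to eyeball the per-batch counts is to list the pods in creation order (a sketch):

```console
$ kubectl get pods --sort-by=.metadata.creationTimestamp
```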