kubernetes: Job backoffLimit does not cap pod restarts when restartPolicy: OnFailure

Is this a BUG REPORT or FEATURE REQUEST?:

/kind bug

What happened: When creating a job with backoffLimit: 2 and restartPolicy: OnFailure (full manifest below), the job's pod kept restarting and the job was never marked as failed:

$ kubectl get pods
NAME               READY     STATUS    RESTARTS   AGE
failed-job-t6mln   0/1       Error     4          57s
$ kubectl describe job failed-job
Name:           failed-job
Namespace:      default
Selector:       controller-uid=58c6d945-be62-11e7-86f7-080027797e6b
Labels:         controller-uid=58c6d945-be62-11e7-86f7-080027797e6b
                job-name=failed-job
Annotations:    kubectl.kubernetes.io/last-applied-configuration={"apiVersion":"batch/v1","kind":"Job","metadata":{"annotations":{},"name":"failed-job","namespace":"default"},"spec":{"backoffLimit":2,"template":{"met...
Parallelism:    1
Completions:    1
Start Time:     Tue, 31 Oct 2017 10:38:46 -0700
Pods Statuses:  1 Running / 0 Succeeded / 0 Failed
Pod Template:
  Labels:  controller-uid=58c6d945-be62-11e7-86f7-080027797e6b
           job-name=failed-job
  Containers:
   nginx:
    Image:  nginx:1.7.9
    Port:   <none>
    Command:
      bash
      -c
      exit 1
    Environment:  <none>
    Mounts:       <none>
  Volumes:        <none>
Events:
  Type    Reason            Age   From            Message
  ----    ------            ----  ----            -------
  Normal  SuccessfulCreate  1m    job-controller  Created pod: failed-job-t6mln

What you expected to happen:

We expected at most 2 pod restarts before the job was marked as failed. Instead, the pod kept restarting indefinitely; its status was never set to Failed and the job remained active.

How to reproduce it (as minimally and precisely as possible):

apiVersion: batch/v1
kind: Job
metadata:
  name: failed-job
  namespace: default
spec:
  backoffLimit: 2
  template:
    metadata:
      name: failed-job
    spec:
      containers:
      - name: nginx
        image: nginx:1.7.9
        command: ["bash", "-c", "exit 1"]
      restartPolicy: OnFailure

Create the above job and observe the number of pod restarts.
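
For reference, a minimal way to watch the behaviour (a sketch; it assumes the manifest above is saved as failed-job.yaml and relies on the job-name label that the job controller adds, visible in the describe output above):

$ kubectl apply -f failed-job.yaml
$ kubectl get pods -l job-name=failed-job -w
$ kubectl get job failed-job -o jsonpath='{.status}'

With restartPolicy: OnFailure the RESTARTS counter on the single pod keeps climbing, and the job status never reports a failed pod or a failure condition.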

Anything else we need to know?:

The backoffLimit field works as expected when restartPolicy: Never is set.
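
For comparison, here is a sketch of the equivalent job with restartPolicy: Never (the name failed-job-never is made up for illustration). With this variant the controller creates a new pod for each failure and, once backoffLimit is exceeded, stops creating pods and marks the job as failed:

apiVersion: batch/v1
kind: Job
metadata:
  name: failed-job-never
  namespace: default
spec:
  backoffLimit: 2
  template:
    spec:
      containers:
      - name: nginx
        image: nginx:1.7.9
        command: ["bash", "-c", "exit 1"]
      # with Never, each failure produces a new pod that counts against backoffLimit
      restartPolicy: Never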

Environment:

  • Kubernetes version (use kubectl version): 1.8.0 using minikube

Most upvoted comments

I think backoffLimit doesn’t work at all. My Kubernetes version is 1.10.3:

[root@k8s-m1 k8s]# cat job.yml 
apiVersion: batch/v1
kind: Job
metadata:
  name: test-job
spec:
  backoffLimit: 2
  template:
    spec:
      containers:
      - name: pi
        image: busybox
        command: ["exit 1"]
      restartPolicy: Never
[root@k8s-m1 k8s]# kubectl apply -f job.yml 
job.batch "test-job" created
[root@k8s-m1 k8s]# kubectl get pod
NAME             READY     STATUS               RESTARTS   AGE
pod-test         1/1       Running              0          6h
test-affinity    1/1       Running              0          6h
test-job-g4r8w   0/1       ContainerCreating    0          1s
test-job-tqbvl   0/1       ContainerCannotRun   0          3s
[root@k8s-m1 k8s]# kubectl get pod
NAME             READY     STATUS               RESTARTS   AGE
pod-test         1/1       Running              0          6h
test-affinity    1/1       Running              0          6h
test-job-2h5vp   0/1       ContainerCannotRun   0          13s
test-job-2klzs   0/1       ContainerCannotRun   0          15s
test-job-4c2xz   0/1       ContainerCannotRun   0          24s
test-job-6bcqc   0/1       ContainerCannotRun   0          8s
test-job-6r7wq   0/1       ContainerCannotRun   0          11s
test-job-7cz7c   0/1       Terminating          0          8s
test-job-7mkdn   0/1       Terminating          0          16s
test-job-88ws8   0/1       ContainerCannotRun   0          26s
test-job-bqfk4   0/1       ContainerCannotRun   0          22s
test-job-jh7dp   0/1       ContainerCannotRun   0          4s
test-job-k2c4r   0/1       ContainerCannotRun   0          18s
test-job-qfj7m   0/1       ContainerCannotRun   0          6s
test-job-r8794   0/1       ContainerCreating    0          1s
test-job-r9gz6   0/1       Terminating          0          6s
test-job-w4f9r   0/1       Terminating          0          1s
[root@k8s-m1 k8s]# kubectl delete  -f job.yml 
job.batch "test-job" deleted

@innovia I ran a simplified version of your manifest against my Kubernetes 1.8.2 cluster with the command field replaced by /bin/bash -c "exit 1" so that every run fails, the schedule set to every 5 minutes, and the back-off limit reduced to 2. Here’s my manifest:

apiVersion: batch/v1beta1
kind: CronJob
metadata:
  name: hello
spec:
  schedule: "*/5 * * * *"
  jobTemplate:
    spec:
      backoffLimit: 2
      template:
        spec:
          containers:
          - name: cron-job
            image: ubuntu
            command:
            - "/bin/bash"
            - "-c"
            - "exit 1"
          restartPolicy: Never

Initially, I get a series of three failing pods:

Mon Dec 25 23:15:28 CET 2017
NAME                     READY     STATUS    RESTARTS   AGE
hello-1514240100-7mktd   0/1       Error     0          19s
hello-1514240100-bf6rw   0/1       Error     0          5s
hello-1514240100-h44kz   0/1       Error     0          15s

Then it takes pretty much 5 minutes for the next batch to start up and fail:

Mon Dec 25 23:20:23 CET 2017
NAME                     READY     STATUS    RESTARTS   AGE
hello-1514240400-cj4bn   0/1       Error     0          4s
hello-1514240400-shnph   0/1       Error     0          14s

Interestingly, I only get two new pods instead of three. The next batch, though, comes with three again:

Mon Dec 25 23:25:16 CET 2017
NAME                     READY     STATUS    RESTARTS   AGE
hello-1514240700-jr9sp   0/1       Error     0          3s
hello-1514240700-jxck9   0/1       Error     0          13s
hello-1514240700-x26kg   0/1       Error     0          16s

and so does the final batch I tested:

Mon Dec 25 23:30:16 CET 2017
NAME                     READY     STATUS    RESTARTS   AGE
hello-1514241000-mm8c9   0/1       Error     0          2s
hello-1514241000-pqx8f   0/1       Error     0          16s
hello-1514241000-qjtn5   0/1       Error     0          12s

Apart from the supposed off-by-one error, things seem to work for me.
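
One way to confirm whether a job actually got marked as failed once the limit was hit is to inspect the job's conditions, e.g. for the first batch above (newer versions also report a BackoffLimitExceeded reason, though the exact string may differ by release):

$ kubectl get job hello-1514240100 -o jsonpath='{.status.conditions}'
$ kubectl describe job hello-1514240100

A job that stopped retrying because of backoffLimit should show a condition of type Failed, whereas the OnFailure case from the original report never gets one.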