kubernetes: Backoff Limit for Job does not work on Kubernetes 1.10.0
Is this a BUG REPORT or FEATURE REQUEST?: /kind bug
What happened:
.spec.backoffLimit for a Job does not work on Kubernetes 1.10.0
What you expected to happen:
.spec.backoffLimit should limit the number of times a pod is restarted when running inside a Job
How to reproduce it (as minimally and precisely as possible):
Use this resource file:
apiVersion: batch/v1
kind: Job
metadata:
  name: error
spec:
  backoffLimit: 1
  template:
    metadata:
      name: job
    spec:
      restartPolicy: Never
      containers:
      - name: job
        image: ubuntu:16.04
        args:
        - sh
        - -c
        - sleep 5; false
If the job is created in Kubernetes 1.9, it soon fails as expected:
...
status:
  conditions:
  - lastProbeTime: 2018-04-11T10:20:31Z
    lastTransitionTime: 2018-04-11T10:20:31Z
    message: Job has reach the specified backoff limit
    reason: BackoffLimitExceeded
    status: "True"
    type: Failed
  failed: 2
  startTime: 2018-04-11T10:20:00Z
...
When the same job is created in Kubernetes 1.10, the pod is restarted indefinitely:
...
status:
  active: 1
  failed: 8
  startTime: 2018-04-11T10:37:48Z
...
Anything else we need to know?:
Environment:
- Kubernetes version (use kubectl version): 1.10.0
- Cloud provider or hardware configuration: AWS
- OS (e.g. from /etc/os-release): Ubuntu 16.04
- Kernel (e.g. uname -a): Linux
- Install tools: kubeadm
- Others:
About this issue
- Original URL
- State: closed
- Created 6 years ago
- Reactions: 31
- Comments: 54 (27 by maintainers)
Links to this issue
Commits related to this issue
- Bug 1585648- Set timeout for ASB migration job (workaround for kubernetes/kubernetes#62382) — committed to fabianvf/openshift-ansible by fabianvf 6 years ago
- Merge pull request #63650 from soltysh/issue62382 Automatic merge from submit-queue (batch tested with PRs 64009, 64780, 64354, 64727, 63650). If you want to cherry-pick this change to another branch... — committed to kubernetes/kubernetes by deleted user 6 years ago
- Merge pull request #64813 from cblecker/automated-cherry-pick-of-#58972-#63650-upstream-release-1.10 Automatic merge from submit-queue. Automated cherry pick of #58972: Fix job's backoff limit for r... — committed to kubernetes/kubernetes by deleted user 6 years ago
I’m planning to cut 1.10.5 tomorrow, assuming all tests will be green. In case of any last minute problems it may slip by a day or two, but so far everything looks good for tomorrow.
This bug killed my test cluster over the long weekend (with 20000+ pods). Luckily we don’t yet use CronJobs in prod.
@dims I think this issue should be fixed in 1.10 regardless of 1.11, as it causes serious cluster instability in a fairly common scenario. Specifically for us, migration to 1.11 adds complexity and risk due to the change in the way NVIDIA GPUs are managed, so it will take us some time to upgrade to 1.11.
I’ve opened #63650 to address the issue. Sorry for the troubles y’all.
This is a regression in functionality, and when the proper fix is determined, it should be cherry picked back to 1.10. We need it fixed in master first, however.
I’m going to escalate this to sig-apps to see if we can get the right resources.
killed my cluster too 😦 still existing in kubernetes 1.10.2
Thank you to all involved! I really appreciate this being rolled back into 1.10.
This fix has been cherry picked back to 1.10, and should show up in the next patch release (1.10.5).
Unless I'm mistaken, the Job objects are considered stable APIs, right? (They are under the batch/v1 group.) If so, I also think this should be fixed in the next 1.10 point release. In general, like @aalubin, we would rather be able to use Jobs without being forced to upgrade to 1.11.
I had to add activeDeadlineSeconds to all Jobs and Hooks as a workaround…
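For anyone stuck on an affected 1.10.x release, a minimal sketch of that workaround applied to the reproduction Job above; the activeDeadlineSeconds value of 60 is an arbitrary example and should be set larger than the job's expected runtime:

apiVersion: batch/v1
kind: Job
metadata:
  name: error
spec:
  backoffLimit: 1
  # Workaround: even if backoffLimit is ignored on an affected 1.10.x release,
  # activeDeadlineSeconds marks the Job as Failed once the deadline passes,
  # which stops the controller from creating new pods.
  activeDeadlineSeconds: 60   # example value, not from the original report
  template:
    metadata:
      name: job
    spec:
      restartPolicy: Never
      containers:
      - name: job
        image: ubuntu:16.04
        args:
        - sh
        - -c
        - sleep 5; false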
In version v1.10.11, even though backoffLimit is set to 4 for a Job, the pod is evicted 5 times, and in a few cases 3 times. Note that we have also applied completions: 1 and parallelism: 1 (see the sketch after this comment); the version details of the cluster we are using are below. The eviction itself happens for a genuine reason: the node's disk space is exhausted when the application inside the pod runs, and we know why that happens. The point is that the Job's pod should be attempted 4 times, not 5. Strangely, we do not see this problem in another environment running v1.10.5 (version details of that environment also below).
There the pod eviction happens exactly as documented for eviction with backoffLimit, and we have no issues. The reported problem started when we upgraded to v1.10.11 about two days ago.
We would appreciate it if someone in this forum could advise on the root cause and a solution. Is there any flaw in version v1.10.5, such as a security loophole?
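For reference, a sketch of a Job spec matching the fields described in that comment; only backoffLimit, completions, and parallelism come from the report, while the name, image, and command are hypothetical placeholders:

apiVersion: batch/v1
kind: Job
metadata:
  name: disk-heavy-job            # hypothetical name for illustration
spec:
  backoffLimit: 4                 # reporter expects at most 4 attempts; 5 (sometimes 3) were observed on v1.10.11
  completions: 1
  parallelism: 1
  template:
    spec:
      restartPolicy: Never
      containers:
      - name: app                 # hypothetical container; the real workload exhausts node disk space
        image: ubuntu:16.04       # placeholder image
        command: ["sh", "-c", "run-the-real-workload"]   # placeholder command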
Just update to something newer than 1.10.0. That version is ages old and I’m sure it is fixed within the latest 1.10.x release. (Which is also out of support so update to 1.12/1.13 ASAP.)
@aalubin yes, I guess my question is: is it fixed in master (and hence already fixed for 1.11)? If not, do we need to get this fixed in master/1.11 and backport to 1.10?