kubernetes: Infinite ImagePullBackOff CronJob results in resource leak

What happened: A CronJob with no concurrencyPolicy or history limit that references an image that doesn’t exist will slowly consume almost all cluster resources. In our cluster we started hitting the pod limit on all of our nodes and began losing the ability to schedule new pods.

What you expected to happen: Even without a concurrencyPolicy, CronJob should probably behave like most of the other pod controllers. If I start a Deployment with X replicas and one of the containers in a pod hits ImagePullBackOff, the Deployment won’t keep scheduling more pods on different nodes until it consumes all cluster resources.

This is especially bad with CronJob because, unlike a Deployment, where an upper limit on horizontal scale has to be set, a CronJob with no history limit or concurrencyPolicy will slowly consume all resources on a cluster.

While this is up for debate, I would personally say that when a scheduled Job hits ImagePullBackOff, the controller shouldn’t keep scheduling new pods. It should either kill the pod that is failing to pull the image and create a new one, or wait for that pod to pull the image successfully.

In the worst case it consumes all cluster resources; in the best case there is a thundering herd of Jobs all rushing to completion once the image becomes available.

How to reproduce it (as minimally and precisely as possible):

apiVersion: batch/v1beta1
kind: CronJob
metadata:
  name: hello
spec:
  schedule: "* * * * *"        # fire every minute
  jobTemplate:
    spec:
      template:
        spec:
          restartPolicy: Never          # required for a Job's pod template
          containers:
          - name: hello
            image: darrienglasser.com/busybox:does-not-exist   # tag that does not exist

Deploy the above and wait. Your cluster will collapse over time.
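For reference, the repro can be driven with nothing more than kubectl (assuming the manifest above is saved as cronjob.yaml):

kubectl apply -f cronjob.yaml     # create the CronJob
kubectl get jobs --watch          # a new Job appears every minute and never completes
kubectl get pods                  # pods pile up in ImagePullBackOff, one per Job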

Anything else we need to know?: No

Environment:

  • Kubernetes version (use kubectl version):
Client Version: version.Info{Major:"1", Minor:"14", GitVersion:"v1.14.0", GitCommit:"641856db18352033a0d96dbc99153fa3b27298e5", GitTreeState:"clean", BuildDate:"2019-03-25T15:53:57Z", GoVersion:"go1.12.1", Compiler:"gc", Platform:"linux/amd64"}
Server Version: version.Info{Major:"1", Minor:"13", GitVersion:"v1.13.2", GitCommit:"cff46ab41ff0bb44d8584413b598ad8360ec1def", GitTreeState:"clean", BuildDate:"2019-01-10T23:28:14Z", GoVersion:"go1.11.4", Compiler:"gc", Platform:"linux/amd64"}
  • Cloud provider or hardware configuration: On-prem. A number of high-powered nodes with Xeons (256Gi+ memory and the latest Xeon Gold processors).

  • OS (e.g: cat /etc/os-release):

core@k8s-node [23:36:33]~ $ cat /etc/os-release
NAME="Container Linux by CoreOS"
ID=coreos
VERSION=1911.4.0
VERSION_ID=1911.4.0
BUILD_ID=2018-11-26-1924
PRETTY_NAME="Container Linux by CoreOS 1911.4.0 (Rhyolite)"
ANSI_COLOR="38;5;75"
HOME_URL="https://coreos.com/"
BUG_REPORT_URL="https://issues.coreos.com"
COREOS_BOARD="amd64-usr"
  • Kernel (e.g. uname -a):
Linux k8s-node 4.14.81-coreos #1 SMP Mon Nov 26 18:51:57 UTC 2018 x86_64 Intel(R) Xeon(R) Gold 5118 CPU @ 2.30GHz GenuineIntel GNU/Linux

About this issue

  • Original URL
  • State: closed
  • Created 5 years ago
  • Reactions: 8
  • Comments: 56 (12 by maintainers)

Most upvoted comments

Hello! What are your thoughts on adding a field to the CronJob spec to limit the number of concurrent Jobs spawned by the CronJob? For example, the spec could include a field such as concurrentJobsAllowed with a default of, say, 100. This would provide a general failsafe for situations like this, where you would not want one CronJob to accidentally spawn an unexpected number of Jobs. The field would let users specify how much concurrency a single CronJob should allow, preventing situations where they end up shooting themselves in the foot.
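Purely as an illustration of the proposal (the field below does not exist in the API; the name and default are hypothetical), it could look something like:

apiVersion: batch/v1beta1
kind: CronJob
metadata:
  name: hello
spec:
  schedule: "* * * * *"
  concurrentJobsAllowed: 100   # hypothetical cap on Jobs this CronJob may have active at once
  jobTemplate:
    spec:
      template:
        spec:
          restartPolicy: Never
          containers:
          - name: hello
            image: busybox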

This hit us too. Since this was a new development cluster and not production, monitoring was not fully configured. We eventually had 10,000 pods (which spun up 100+ extra nodes) from one CronJob before we noticed.

It would have been great if pods that have never started successfully were removed before new pods are created.
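For anyone else cleaning up after this: deleting the offending CronJob cascades to the Jobs it owns and their pods, so, assuming it is named hello as in the repro above, something like the following stops the growth:

kubectl delete cronjob hello     # owned Jobs and their pods are garbage-collected
kubectl get pods                 # confirm the backlog of ImagePullBackOff pods drains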

Hello! I’m a CS student at UT Austin. For a class project we’re tackling some open source issues. I’m going to try to see if I can make progress on this.

@k8s-triage-robot: Closing this issue, marking it as “Not Planned”.

In response to this:

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Reopen this issue with /reopen
  • Mark this issue as fresh with /remove-lifecycle rotten
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/close not-planned

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

Changing defaults has been done on previous occasions as well. Things need to be deprecated and communicated properly, but it’s not impossible (see this example).

This GitHub issue has enough watchers and affected users, and it’s causing real problems, which should be a strong argument for actually fixing it.

There are very few reports suggesting that this would break existing workloads.

@alculquicondor what’s the risk you see in addressing that footgun?

I haven’t worked on K8s in years now, but I still get emails about it sometimes (thanks, stale issue bot and other commenters). As the original filer, I can tell you the core issue is that this bug is very easy to hit: the default configuration of a YAML file without extra parameters can easily take down a whole cluster.

These two fields:

  • concurrencyPolicy
  • activeDeadlineSeconds

will most certainly fix the problem, but when you are in an org where lots of folks are writing new YAML files, modifying old ones, or relying on an unstable Docker registry where old images disappear over time, you still end up with issues like this (a sketch of the workaround is below). Not everyone runs rock-solid infra, and coping with that is supposed to be the point of K8s.
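A rough sketch of that workaround, with illustrative values (the field names are real, the numbers are not prescriptive):

apiVersion: batch/v1beta1
kind: CronJob
metadata:
  name: hello
spec:
  schedule: "* * * * *"
  concurrencyPolicy: Forbid           # don't start a new Job while the previous one is still active
  failedJobsHistoryLimit: 1           # keep finished failures from piling up
  jobTemplate:
    spec:
      activeDeadlineSeconds: 300      # kill the Job (and its pods) if it hasn't finished in 5 minutes
      template:
        spec:
          restartPolicy: Never
          containers:
          - name: hello
            image: darrienglasser.com/busybox:does-not-exist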

We had a YAML template for folks in the org to follow at Arista, but most folks did not use it. Baking in sane defaults is really important; otherwise it is very easy for folks to accidentally take down a K8s cluster, which makes life a mess for the cluster maintainers.

At the very least, a configuration parameter on the K8s cluster itself that could cap this would have been nice for me when I worked on K8s. We want to actively remove footguns where we can.
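Not the cluster-level switch described above, but as a blunt guardrail that exists today, a namespace ResourceQuota can at least cap how far the pile-up goes (the numbers here are only an example):

apiVersion: v1
kind: ResourceQuota
metadata:
  name: cronjob-blast-radius
  namespace: default
spec:
  hard:
    pods: "50"               # hard cap on non-terminal pods in the namespace
    count/jobs.batch: "20"   # hard cap on Job objects in the namespace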

At Meta we don’t really use K8s nowadays (well, more or less), and it’s no longer my job to care about it, but this issue gave me a lot of pain a few years ago and I would hate to see it closed just because there is a workaround.

Anyway, I’ll remove lifecycle/stale one last time to give members a chance to review, but this will probably be my last response here unless I have to (god willing) work on K8s again.

/remove-lifecycle stale

@DarrienG: Reopened this issue.

In response to this:

/reopen

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.