kubernetes: Make CrashLoopBackoff timing tuneable, or add mechanism to exempt some exits
Is this a BUG REPORT or FEATURE REQUEST?: Feature request
/kind feature
What happened:
As part of a development workflow, I intentionally killed a container in a pod with restartPolicy: Always. The plan was to do this repeatedly, as a quick way to restart the container and clear old state (and, in Minikube, to load image changes).
The container went into a crash-loop backoff, making this anything but a quick option.
What you expected to happen: I expected there to be some configuration allowing me to disable, or at least tune the timing of, the CrashLoopBackoff.
How to reproduce it (as minimally and precisely as possible):
Create a pod with restartPolicy: Always, and intentionally exit a container repeatedly.
Anything else we need to know?: I see that the backoff timing parameters are hard-coded constants here:
- https://github.com/kubernetes/kubernetes/blob/5f920426103085a28069a1ba3ec9b5301c19d075/pkg/kubelet/kubelet.go#L121
- https://github.com/kubernetes/kubernetes/blob/5f920426103085a28069a1ba3ec9b5301c19d075/pkg/kubelet/kubelet.go#L155
One might reasonably expect these to be configurable at least at the kubelet level - say, by a setting like these. That would be sufficient for my use-case (local development with fast restarts), and presumably useful as an advanced configuration setting for production workloads.
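As an illustration only, such a kubelet-level knob might look like the fragment below. Both backoff fields are hypothetical (no kubelet release has them); the hard-coded defaults they mirror are the constants linked above (a 10-second sync backoff period and a 5-minute MaxContainerBackOff cap).

```yaml
# Hypothetical KubeletConfiguration fragment. The two backoff fields are
# NOT real kubelet options; they only illustrate the kind of knob requested.
apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
containerBackoffPeriod: 1s   # hypothetical; today hard-coded as backOffPeriod (10s)
containerBackoffMax: 10s     # hypothetical; today hard-coded as MaxContainerBackOff (5m)
```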
A more aggressive change would allow tuning per-pod.
There are other options for my target workflow:
- Put the pod in a Deployment or similar, kubectl delete the pod, let Kubernetes schedule another, work with the new pod. However, this is much slower than a container restart without backoff (and ironically causes more kubelet load than the backoff avoids). It also relies on using kubectl/the Kubernetes API to do the restart, as opposed to just exiting the container.
- Run the server process as a secondary process in the container rather than the primary process. This means the server can be started/stopped without container backoff, but is trickier to implement and doesn’t offer the same isolation guarantees as exiting the container and starting fresh. It also means I probably can’t use the same image I deploy to production (because I probably don’t want this extra restart-support stuff floating around in the production image).
Environment:
- Kubernetes version (use kubectl version): v1.8.0
- Cloud provider or hardware configuration: Minikube 0.23.0 with Virtualbox driver on OSX
About this issue
- Original URL
- State: open
- Created 7 years ago
- Reactions: 302
- Comments: 94 (13 by maintainers)
IMO, this lack of configurability of CrashLoopBackoff is a huge design flaw in K8s. There are many use cases where a normal pod exit and restart is needed; K8s cannot handle that use case as-is. Why is there no input from the K8s team on this 5-year-old issue?
It would be ideal if the backoff could be disregarded in cases where the container exited with code 0. One could argue that containers exiting with such a code are not stuck in a “crash loop”; they have merely exited after successfully completing their work, and you want another one to start. This is tantamount to an infinite Job (no completion-count target).

This isn’t just an issue for dev; it’s also important for some production workloads.
We have some workloads that deliberately exit whenever they get any sort of error, including bad input data, expecting that they’ll be restarted in a clean state so that they can continue. Bad input data is ~2% of input, each unit takes ~5 seconds, so those workloads seem to spend more time in CrashLoopBackoff than they do processing jobs. Especially since bad input data tends to be clustered.
+1
I understand there could be risk associated with implementing this in a highly flexible manner, i.e. allowing rapid, infinite polling of the master, but some degree of configurability would avoid the need to add too much job-management logic to the average app container, which is something of an antipattern.
I propose allowing a maxBackoffInterval configuration option on the container spec with a hard minimum of, say, 10 seconds, and a default of nil. That would allow the user to set a reasonable ceiling on the delay between runs, enabling the operator to add (bounded) job-control capabilities with known worst-case effects on master health in the naive implementation; future iterations of the feature could implement the functionality in a manner that requires less coordination with the master, potentially enabling more granular control of CrashLoopBackoff behavior.
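A sketch of how that proposed field might read in a pod manifest; maxBackoffInterval is hypothetical (not part of any Kubernetes API), and the image name is a placeholder:

```yaml
# Sketch only: maxBackoffInterval is the field proposed above, not a real API.
apiVersion: v1
kind: Pod
metadata:
  name: fast-restart-worker
spec:
  restartPolicy: Always
  containers:
    - name: worker
      image: example/worker:latest   # placeholder image
      maxBackoffInterval: 10s        # proposed: hard minimum 10s, default nil
```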
/remove-lifecycle rotten
I also would like to adjust the CrashLoopBackoff timings: to make them shorter.
@mcfedr Such a design of K8S would probably strain the masters too much due to continual thrashing of container state. To solve your problem, how about something like this, where your Dockerfile runs a script that continually restarts the process you want to remain alive:

CMD bash /code/start.sh

Where start.sh is something to the effect of a restart loop. In other words, upon python exiting, just start it again.
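A minimal sketch of such a start.sh. The worker command and the demo cap are placeholders: a real script would run the actual server process (e.g. `python /code/server.py`) inside the loop and never break.

```shell
#!/bin/sh
# start.sh sketch: restart the worker every time it exits, with no backoff.
# The cap below exists only so this demo terminates; a real supervisor
# loops forever.
restarts=0
max_demo_restarts=3

while true; do
    # In a real start.sh, the next line would be the actual server process.
    echo "worker running (attempt $((restarts + 1)))"
    # The worker has exited at this point; restart it immediately.
    restarts=$((restarts + 1))
    if [ "$restarts" -ge "$max_demo_restarts" ]; then
        break    # demo cap only
    fi
done
echo "restarted $restarts times with no backoff"
```

Because the loop is the container's main process, the container itself never exits, so the kubelet never applies CrashLoopBackoff.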
/remove-lifecycle stale
246 is nothing compared to k8s’s user base.
Sure in absolute terms but most people won’t act even to upvote. The fact that it’s top 5 says a lot more than the absolute number.
I disagree with the labeling of this as a feature request. This is clearly a bug. Such a significant factor as an ever increasing delay in restarting containers that may be exiting for business-valid reasons is clearly a monumental oversight.
@sporkmonger change the command to while yourbinary; do sleep 0.1; done? 😃

We too are in dire need of this. We are using Kubernetes to run our Azure DevOps agents and use a setting which exits the process with a status code of 0 so that you get a guaranteed fresh container for builds and deploys. However, this quickly causes problems since Kubernetes will put these pods into CrashLoopBackOff status. We need this to be configurable!
This lack of feature has been a big pain point for my application, which relies on destroying containers after processing and starting up clean ones for security purposes.
We have this issue with workers that are restarted automatically every 5 minutes to clear any bad database connections, etc. The process quits itself automatically and then should just restart with no delay. It would be nice to be able to just disable this backoff.

Edit: We ended up using the solution from here, which allows us to restart the script every 5 minutes and not worry about CrashLoopBackOff.
I’d be happy if the amount of time that the pod had spent running was subtracted from the next CrashLoopBackoff. Our use case is an Azure Functions Runtime app, where I have basically implemented a single function inside a vanilla container and an app wrapper that I would prefer not to touch. Due to a workaround for a deadlock condition (installed by MS, not us), the container times out waiting for messages from Azure and exits every 7 to 15 minutes and we need the container to restart immediately and continue processing messages. Unfortunately, K8s still interprets this as a valid reason to use a CrashLoopBackoff even though the restarts are happening at a frequency well outside the hard-coded MaxContainerBackOff limit of 5 minutes.
Such an important parameter, and it is NOT tweakable!? I want a restart counter, so I cannot use a bash script. I am returning exit code 0, so the pod goes “Completed”, then it goes “CrashLoopBackOff”. I want to be able to set the max backoff to 10 seconds for my particular task, and have an exact restart counter in the get pods output. Who is responsible for implementing that?
Our workaround has been to run our own service which polls the pod list looking for CrashLoopBackoff pods in particular deployments and deletes them when they appear. This is certainly not ideal.
I still think that the most obvious thing is that if the pod has been pinging as ready and healthy for longer than the CrashLoopBackOff would pause, then the backoff should be skipped.
/remove-lifecycle stale
https://github.com/ankilosaurus/kube_remediator does that; https://github.com/kubernetes-sigs/descheduler also offers pod deletion on too many restarts.
I’d like to mention that ImagePullBackOff should also be configurable.

Generally, in certain scenarios (e.g. a local/dev/test cluster), all the master-performance concerns are simply irrelevant, and what really matters is developer productivity. If k8s is waiting for something, it should be as snappy as possible.
In my use case, in a development setup, we allow docker image builds to start in parallel to helm deploy, so that a lot of the deployment can happen while images are being built, saving time. But sometimes, when an image takes a bit longer than usual, the image backoff slows everything down considerably.

The backoff is used in Kubelet#syncPod.
To make MaxContainerBackOff flexible, maybe we can add a new annotation, say, io.kubernetes.container.maxBackoff (expressed in seconds), which the kubelet can use to customize the backoff period.
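Under that proposal, a pod opting in might look like the sketch below. The annotation name comes from the comment above but is not implemented anywhere; the pod name and image are placeholders.

```yaml
# Sketch only: the annotation below is proposed in this thread, not implemented.
apiVersion: v1
kind: Pod
metadata:
  name: fast-restart-worker
  annotations:
    io.kubernetes.container.maxBackoff: "10"   # proposed: cap backoff at 10 seconds
spec:
  restartPolicy: Always
  containers:
    - name: worker
      image: example/worker:latest             # placeholder image
```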
Here is my own “small team” scenario:

As k8s does not have any kind of dependency mechanism, if some frontend pods are in CrashLoopBackoff because another pod isn’t ready (e.g. a buggy backend service), then when the backend comes up again, the complete app will take 5 more minutes to become available. In this case it would be useful to kill the buggy backend, let the Deployment create a new one (probably pulling a bug-fixed image) and just wait 1 minute for the frontend to reconnect.
@dkrieger Are you defending the current state of the software? Maybe this is ok as an untuneable parameter for legitimate crash cases, but for situations where a container is exiting with code 0 it seems clearly incorrect, or at best highly unintuitive, to apply this “crash backoff” policy.
@errogaht Just wrap the worker command in an infinite loop in bash; there is no need to reboot the whole container. Optionally add set -e to properly fail on actual errors.

Please make such improvements. We are using PHP enqueue queue workers, which have a built-in feature to kill the process after a time limit or memory limit is passed, because PHP has some problems with memory leaks and we can’t determine where the leak happens, so the normal approach is to just restart the worker. But in that case Kubernetes keeps increasing the restart delay, so a lot of messages are left waiting until the worker comes back online… causing a service outage. Of course I want to restart the pod without any pause.
👍 to this idea as well. Some commentary from Stack Overflow users: https://stackoverflow.com/questions/41108713/configuring-kubernetes-restart-policy#comment83373490_41108798
I have an application that wants similar semantics to the SO comment.
The need to restart pods as part of a normal lifecycle seems important if you need to clear state between operations. An annotation or some form of configuration parameter for the pod to adjust what constitutes an unplanned number of restarts within a period of time would be nice. Although, any of the other suggested approaches would work as long as the net impact is that crashloopbackoffs only occur when pod restarts exceed a defined “normal” operating pattern.
Manipulating the job to not exit in some cases doesn’t make any sense and can be a security vulnerability. E.g. we’ve set up our CI runner to run as an ever-restarting deployment. Restarting with clean “storage” is a key security feature to prevent data leaking between jobs or to block an attack from persisting. The pod restart itself has lots of useful implications for security and reproducibility. Preventing people from taking advantage of them doesn’t make sense.
I’m glad this works for your workload and usage model. If Kubernetes were only for that type of workload that would be great, or even if it were only for workloads WE wrote, that would be great.
But Kubernetes is for ALL workloads, including unmodifiable images, for coding or licensing reasons. We shouldn’t decide how Kubernetes should work based on “it works on my machine”!
This feature should be switchable or configurable for situations where we can’t change the code to work without exiting.
long-term-issue (note to self)
Snippet showing the above idea (still need to plug custom max backoff to the flowcontrol.Backoff instance):
I would even prefer to set MaxContainerBackOff and backOffPeriod, or adjust crashBackOfPeriod, per container. Some tasks are isolated processes on purpose (think of privacy concerns) that you really don’t want to repeat within the same container like this, but would rather specify within your Dockerfile.
@dag24 this straining of masters depends on the lifecycle and scale of your tasks and so it must be configurable both ways.
I have a use-case where init containers are running into a rate-limit enforced by an external system if they go into crash loop, which just ensures the loop continues. Would like to be able to adjust crash back off to prevent hitting that externally enforced rate-limit without needing to resort to something hacky like e.g. a sleep inside the init container.
Lots of strong feelings on this one. I just wanted to chime in on what I think might be palatable.
First, let’s acknowledge that real users are really struggling with this. The current design was intended to balance the needs of the few (crashy apps) with the needs of the many (everyone else). Crashing and restarting in a really fast loop usually indicates that something is wrong. But clearly not ALWAYS.
There are a lot of good ideas in this thread. Some that stand out to me, with some commentary.
restart-on-exit-0 -- mycommand -arg -arg -arg. I don’t love this but it is totally back-compatible.

NOTE: These are not all mutually exclusive!
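A rough sketch of what a restart-on-exit-0 wrapper could do, assuming the semantics named above: rerun the command on a clean exit, but propagate a non-zero exit so genuine crashes still surface (and still get normal backoff treatment). The wrapper function and the demo worker below are hypothetical.

```shell
#!/bin/sh
# Hypothetical restart-on-exit-0 wrapper: rerun the wrapped command as long
# as it exits 0 (a deliberate restart); stop and propagate any non-zero exit.
restart_on_exit_0() {
    runs=0
    while :; do
        "$@"
        status=$?
        if [ "$status" -ne 0 ]; then
            echo "non-zero exit ($status) after $runs clean runs" >&2
            return "$status"
        fi
        runs=$((runs + 1))
        echo "clean exit #$runs, restarting..." >&2
    done
}

# Demo workload: exits 0 twice, then exits 1, so this sketch terminates.
demo_count=0
demo_worker() {
    demo_count=$((demo_count + 1))
    [ "$demo_count" -le 2 ]
}

wrapper_status=0
restart_on_exit_0 demo_worker || wrapper_status=$?
```

The appeal of this shape is that it needs no API change: the wrapper runs inside the container, so the kubelet only ever sees a long-running process or a real failure.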
It’s important to repeat - this is designed to protect the system from badly behaving apps. We don’t want to remove that safety. If we add an explicit API to bypass it, we can at least let cluster admins install policies that say “oh no you don’t”. But APIs have costs that we pay forever.
So my strong preference is to do something smart without any API surface, and only if we can’t make that work to add API. Some combination of options 1, 2, and 4 seem like we could put a real dent in the problem, and then see what’s left.
If you scroll up, multiple ready-made solutions have been linked that will delete pods in crashloopbackoff state. The objection that logs won’t be preserved that followed is not really compelling; if you’re not aggregating logs in your cluster, that’s the problem to solve. There are already multiple low effort paths for you to accomplish what you want to accomplish.
If this were a rampant high impact problem, there would be popular CRDs and operators for dealing with this, because k8s was thoughtfully designed for extensibility. There probably are some in the wild. For every use case described in here, there are plenty of other people with the same use case who have found suitable means to accomplish their ends with the current core k8s APIs. They’ve probably never even seen this gh issue.
Can you provide a concrete example of where a docker image cannot be FROM’d? I don’t believe that’s true. Even still, you could use volumes to inject a self-contained init process, then use cmd/args. Open/closed-source is completely irrelevant, as nothing I’m proposing involves touching source code. You’re describing contrived what-ifs that aren’t relevant to the proposition of wrapping an image’s default entrypoint with an init process.

No commercial license for a container image would be intentionally written such that it prevents running on Kubernetes, unless it’s intentionally blocking Kubernetes usage, in which case it’s a moot point.
We have an even more painful scenario, and a reason why MaxContainerBackOff and backOffPeriod must be tunable per node. We have two ingress nodes running Calico and nginx-ingress pods. Besides, we have Keepalived on the ingress nodes watching for specified pods to be in the Running state and moving a VIP (external real IP) according to that condition. In case all ingress nodes go down, stay unresponsive for some time, and then one of them returns, Kubernetes will try to restart the calico-node pod (the only available restartPolicy in the Calico DaemonSet is “Always”), but the pod will stay in CrashLoopBackOff state for up to 5 minutes, leaving the whole cluster unavailable from outside while Kubernetes simply does nothing and waits for the timeout to expire. Instead, it is vital to push pods through their internal cycle as hard and as fast as possible in that particular scenario.
I guess a more direct way to achieve what I am looking for would be a kubectl restart pod_name -c container_name that was explicitly exempted from crash-loop backoff (see https://github.com/kubernetes/kubernetes/issues/24957#issuecomment-221173305 for related discussion), or some other way to indicate that we’re bringing the container down on purpose and are not in an uncontrolled crash loop.

But the 5-minute max backoff / 10-minute backoff reset for image pull and crash backoff seems far too high for development environments regardless. I’d like to tune those down significantly on my Minikube anyway.
This is currently the 5th most upvoted open issue in this repository, so I think your assertion here is false.
This is a false premise, and I’ll explain why.
cmd and args are not part of the artifact, and that’s all you need to control how the application is initialized. There’s no such thing as a manifest you’re not able to control, be it for coding or licensing reasons. If you’re talking about a 3rd-party helm chart, it would be a bug in the chart (or a missing feature) if it doesn’t have an init process suitable to your needs. The way your process is initialized is akin to the env vars you choose to set; it is not part of your application’s source, whether it’s transparent and modifiable or opaque and unmodifiable.

I’ll go a step further and challenge your claim that you cannot legally build your image from another, potentially proprietary image, but I’d like you to describe your particular situation if that is wrong.

The bottom line is there is no scenario where you’re “allowed” to configure a yet-to-be-implemented crashloopbackoff setting but “not allowed” to control the process by which your app is executed, excluding literally a person or group of people you work with who exert arbitrary constraints not founded in technical or legal underpinnings.
I have the same problem: we process user input that may put the app in an invalid state, and we need to restart it immediately. We designed the system with this behavior (each pod processes one thing at a time, reading work from a queue), but the fact that it takes more and more time to restart each time is very annoying.
I want to see this implemented as well. I understand that people can misuse it, which would obviously cause them problems, but in certain cases this would make things much better. I just don’t understand why this is not configurable. We don’t even have shells in our containers anymore to implement a workaround 😐.
I have a similar requirement. Why not directly file a PR, @tedyu?
Defining a normal pattern is a subjective matter. From the above comments, some use cases would restart the container constantly (https://github.com/kubernetes/kubernetes/issues/57291#issuecomment-561634523).

My suggestion from https://github.com/kubernetes/kubernetes/issues/57291#issuecomment-553520647 can be refined by putting a lower bound on the backoff interval - say, 10 seconds. This way, there is a limit on the frequency of polling the master.
Having the ability to adjust this would make for better resiliency testing. In my case, I am starting a Deployment with a bunch of replicas with intentionally misconfigured memory limits to force the Pods to be OOMKilled, and I want it to spin like that for a while.

@eroldan I have been using init containers to make sure that backend services are running/updated before launching frontends. Of course this won’t prevent issues if the backend goes down during work, but maybe you can change the frontend liveness check to report healthy even when the backend is not working.