kubernetes: Make CrashLoopBackoff timing tuneable, or add mechanism to exempt some exits
Is this a BUG REPORT or FEATURE REQUEST?: Feature request
/kind feature
What happened:
As part of a development workflow, I intentionally killed a container in a pod with restartPolicy: Always. The plan was to do this repeatedly, as a quick way to restart the container and clear old state (and, in Minikube, to load image changes).
The container went into a crash-loop backoff, making this anything but a quick option.
What you expected to happen: I expected there to be some configuration allowing me to disable, or at least tune the timing of, the CrashLoopBackoff.
How to reproduce it (as minimally and precisely as possible):
Create a pod with restartPolicy: Always, and intentionally exit a container repeatedly.
Anything else we need to know?: I see that the backoff timing parameters are hard-coded constants here:
- https://github.com/kubernetes/kubernetes/blob/5f920426103085a28069a1ba3ec9b5301c19d075/pkg/kubelet/kubelet.go#L121
- https://github.com/kubernetes/kubernetes/blob/5f920426103085a28069a1ba3ec9b5301c19d075/pkg/kubelet/kubelet.go#L155
One might reasonably expect these to be configurable at least at the kubelet level - say, by a setting like these. That would be sufficient for my use-case (local development with fast restarts), and presumably useful as an advanced configuration setting for production workloads.
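As an illustration only, such a kubelet-level knob might look like the fragment below. Both backoff fields are hypothetical (no kubelet release has them); the hard-coded defaults they mirror are the constants linked above (a 10-second sync backoff period and a 5-minute MaxContainerBackOff cap).

```yaml
# Hypothetical KubeletConfiguration fragment. The two backoff fields are
# NOT real kubelet options; they only illustrate the kind of knob requested.
apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
containerBackoffPeriod: 1s   # hypothetical; today hard-coded as backOffPeriod (10s)
containerBackoffMax: 10s     # hypothetical; today hard-coded as MaxContainerBackOff (5m)
```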
A more aggressive change would allow tuning per-pod.
There are other options for my target workflow:
- Put the pod in a Deployment or similar, kubectl delete the pod, let Kubernetes schedule another, work with the new pod. However, this is much slower than a container restart without backoff (and ironically causes more kubelet load than the backoff avoids). It also relies on using kubectl/the Kubernetes API to do the restart, as opposed to just exiting the container.
- Run the server process as a secondary process in the container rather than the primary process. This means the server can be started/stopped without container backoff, but is trickier to implement and doesn’t offer the same isolation guarantees as exiting the container and starting fresh. It also means I probably can’t use the same image I deploy to production (because I probably don’t want this extra restart-support stuff floating around in the production image).
Environment:
- Kubernetes version (use kubectl version): v1.8.0
- Cloud provider or hardware configuration: Minikube 0.23.0 with Virtualbox driver on OSX
About this issue
- Original URL
- State: open
- Created 7 years ago
- Reactions: 302
- Comments: 94 (13 by maintainers)
IMO, this lack of configurability of CrashLoopBackoff is a huge design flaw in K8s. There are many use cases where a normal pod exit and restart is needed; K8s cannot handle that use case as-is. Why is there no input from the K8s team on this 5-year-old issue?
It would be ideal if the backoff could be disregarded in cases where the container exited with code 0. One could argue that containers exiting with such a code are not stuck in a “crash loop”; they have merely exited after successfully completing their work, and you want another one to start. This is tantamount to an infinite Job (no completion-count target).

This isn’t just an issue for dev; it’s also important for some production workloads.
We have some workloads that deliberately exit whenever they get any sort of error, including bad input data, expecting that they’ll be restarted in a clean state so that they can continue. Bad input data is ~2% of input, each unit takes ~5 seconds, so those workloads seem to spend more time in CrashLoopBackoff than they do processing jobs. Especially since bad input data tends to be clustered.
+1
I understand there could be risk associated with implementing this in a highly flexible manner, i.e. allowing rapid, infinite polling of the master, but some degree of configurability would avoid the need to add too much job-management logic to the average app container, which is something of an antipattern.
I propose allowing a maxBackoffInterval configuration option on the container spec with a hard minimum of, say, 10 seconds, and a default of nil. That would allow the user to set a reasonable ceiling on the delay between runs, enabling the operator to add (bounded) job-control capabilities with known worst-case effects on master health in the naive implementation; future iterations of the feature could implement the functionality in a manner that requires less coordination with the master, potentially enabling more granular control of CrashLoopBackoff behavior.
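A sketch of how that proposed field might read in a pod manifest; maxBackoffInterval is hypothetical (not part of any Kubernetes API), and the image name is a placeholder:

```yaml
# Sketch only: maxBackoffInterval is the field proposed above, not a real API.
apiVersion: v1
kind: Pod
metadata:
  name: fast-restart-worker
spec:
  restartPolicy: Always
  containers:
    - name: worker
      image: example/worker:latest   # placeholder image
      maxBackoffInterval: 10s        # proposed: hard minimum 10s, default nil
```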
/remove-lifecycle rotten
I also would like to adjust the CrashLoopBackoff timings: to make them shorter.
@mcfedr Such a design of K8S would probably strain the masters too much due to continual thrashing of container state. To solve your problem, how about something like this, where your Dockerfile runs a script that continually restarts the process you want to remain alive:

CMD bash /code/start.sh

Where start.sh is something to the effect of a restart loop. In other words, upon python exiting, just start it again.
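A minimal sketch of such a start.sh. The worker command and the demo cap are placeholders: a real script would run the actual server process (e.g. `python /code/server.py`) inside the loop and never break.

```shell
#!/bin/sh
# start.sh sketch: restart the worker every time it exits, with no backoff.
# The cap below exists only so this demo terminates; a real supervisor
# loops forever.
restarts=0
max_demo_restarts=3

while true; do
    # In a real start.sh, the next line would be the actual server process.
    echo "worker running (attempt $((restarts + 1)))"
    # The worker has exited at this point; restart it immediately.
    restarts=$((restarts + 1))
    if [ "$restarts" -ge "$max_demo_restarts" ]; then
        break    # demo cap only
    fi
done
echo "restarted $restarts times with no backoff"
```

Because the loop is the container's main process, the container itself never exits, so the kubelet never applies CrashLoopBackoff.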
/remove-lifecycle stale
246 is nothing compared to k8s’s user base.
Sure in absolute terms but most people won’t act even to upvote. The fact that it’s top 5 says a lot more than the absolute number.
I disagree with the labeling of this as a feature request. This is clearly a bug. Such a significant factor as an ever increasing delay in restarting containers that may be exiting for business-valid reasons is clearly a monumental oversight.
@sporkmonger change the command to while yourbinary; do sleep 0.1; done? 😃

We too are in dire need of this. We are using Kubernetes to run our Azure DevOps agents and use a setting which exits the process with a status code of 0 so that you get a guaranteed fresh container for builds and deploys. However, this quickly causes problems since Kubernetes will put these pods into CrashLoopBackOff status. We need this to be configurable!
This lack of feature has been a big pain point for my application, which relies on destroying containers after processing and starting up clean ones for security purposes.
We have this issue with workers that are restarted automatically every 5 minutes to clear any bad database connections, etc. The process quits itself automatically and then should just restart with no delay. It would be nice to be able to just disable this backoff.

Edit: We ended up using the solution from here, which allows us to restart the script every 5 minutes and not worry about CrashLoopBackOff.
I’d be happy if the amount of time that the pod had spent running was subtracted from the next CrashLoopBackoff. Our use case is an Azure Functions Runtime app, where I have basically implemented a single function inside a vanilla container and an app wrapper that I would prefer not to touch. Due to a workaround for a deadlock condition (installed by MS, not us), the container times out waiting for messages from Azure and exits every 7 to 15 minutes and we need the container to restart immediately and continue processing messages. Unfortunately, K8s still interprets this as a valid reason to use a CrashLoopBackoff even though the restarts are happening at a frequency well outside the hard-coded MaxContainerBackOff limit of 5 minutes.
Such an important parameter, and it is NOT tweakable!? I want a restart counter, so I cannot use a bash script. I am returning exit code 0, so the pod goes “Completed”, then it goes “CrashLoopBackOff”. I want to be able to set the max backoff to 10 seconds for my particular task, and have an exact restart counter in the get pods output. Who is responsible for implementing that?
Our workaround has been to run our own service which polls the pod list looking for CrashLoopBackoff pods in particular deployments and deletes them when they appear. This is certainly not ideal.
I still think that the most obvious thing is that if the pod has been pinging as ready and healthy for longer than the CrashLoopBackOff would pause, then the backoff should be skipped.
/remove-lifecycle stale
https://github.com/ankilosaurus/kube_remediator does that; https://github.com/kubernetes-sigs/descheduler also offers pod deletion on too many restarts.
I’d like to mention that ImagePullBackOff should also be configurable.

Generally, in certain scenarios (e.g. a local/dev/test cluster), all the master-performance concerns are simply irrelevant, and what really matters is developer productivity. If k8s is waiting for something, it should be as snappy as possible.
In my use case, in a development setup, we allow docker image builds to start in parallel to helm deploy, so that a lot of the deployment can happen while images are being built, saving time. But sometimes, when an image takes a bit longer than usual, the image backoff slows everything down considerably.

The backoff is used in Kubelet#syncPod.
To make MaxContainerBackOff flexible, maybe we can add a new annotation, say, io.kubernetes.container.maxBackoff (expressed in seconds), which the kubelet can use to customize the backoff period.
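Under that proposal, a pod opting in might look like the sketch below. The annotation name comes from the comment above but is not implemented anywhere; the pod name and image are placeholders.

```yaml
# Sketch only: the annotation below is proposed in this thread, not implemented.
apiVersion: v1
kind: Pod
metadata:
  name: fast-restart-worker
  annotations:
    io.kubernetes.container.maxBackoff: "10"   # proposed: cap backoff at 10 seconds
spec:
  restartPolicy: Always
  containers:
    - name: worker
      image: example/worker:latest             # placeholder image
```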
Here is my own “small team” scenario:

As k8s does not have any kind of dependency mechanism, if some frontend pods are in CrashLoopBackoff because another pod isn’t ready (e.g. a buggy backend service), then when the backend comes up again, the complete app will take 5 more minutes to become available. In this case it would be useful to kill the buggy backend, let the Deployment create a new one (probably pulling a bug-fixed image) and just wait 1 minute for the frontend to reconnect.
@dkrieger Are you defending the current state of the software? Maybe this is ok as an untuneable parameter for legitimate crash cases, but for situations where a container is exiting with code 0 it seems clearly incorrect, or at best highly unintuitive, to apply this “crash backoff” policy.
@errogaht Just wrap the worker command in an infinite loop in bash; there is no need to reboot the whole container. Optionally add set -e to properly fail on actual errors.

Please make such improvements. We are using PHP enqueue queue workers, which have a built-in feature to kill the process after a time limit or memory limit is passed, because PHP has some problems with memory leaks and we can’t determine where the leak happens, so the normal approach is to just restart the worker. But in that case Kubernetes keeps increasing the restart delay, so a lot of messages are left waiting until the worker comes back online… causing a service outage. Of course I want to restart the pod without any pause.
👍 to this idea as well. Some commentary from Stack Overflow users: https://stackoverflow.com/questions/41108713/configuring-kubernetes-restart-policy#comment83373490_41108798
I have an application that wants similar semantics to the SO comment.
The need to restart pods as part of a normal lifecycle seems important if you need to clear state between operations. An annotation or some form of configuration parameter for the pod to adjust what constitutes an unplanned number of restarts within a period of time would be nice. Although, any of the other suggested approaches would work as long as the net impact is that crashloopbackoffs only occur when pod restarts exceed a defined “normal” operating pattern.
Manipulating the job to not exit in some cases doesn’t make any sense and can be a security vulnerability. E.g. we’ve set up our CI runner to run as an ever-restarting deployment. Restarting with clean “storage” is a key security feature to prevent data leaking between jobs or to block an attack from persisting. The pod restart itself has lots of useful implications for security and reproducibility. Preventing people from taking advantage of them doesn’t make sense.
I’m glad this works for your workload and usage model. If Kubernetes were only for that type of workload that would be great, or even if it were only for workloads WE wrote, that would be great.
But Kubernetes is for ALL workloads, including unmodifiable images, for coding or licensing reasons. We shouldn’t decide how Kubernetes should work based on “it works on my machine”!
This feature should be switchable or configurable for situations where we can’t change the code to work without exiting.
long-term-issue (note to self)
Snippet showing the above idea (still need to plug custom max backoff to the flowcontrol.Backoff instance):
I would even prefer to set MaxContainerBackOff and backOffPeriod, or adjust crashBackOfPeriod, per container. Some tasks are isolated processes on purpose (think of privacy concerns) that you really don’t want to repeat within the same container like this, but would rather specify within your Dockerfile.
@dag24 this straining of masters depends on the lifecycle and scale of your tasks and so it must be configurable both ways.
I have a use-case where init containers are running into a rate-limit enforced by an external system if they go into crash loop, which just ensures the loop continues. Would like to be able to adjust crash back off to prevent hitting that externally enforced rate-limit without needing to resort to something hacky like e.g. a sleep inside the init container.
Lots of strong feelings on this one. I just wanted to chime in on what I think might be palatable.
First, let’s acknowledge that real users are really struggling with this. The current design was intended to balance the needs of the few (crashy apps) with the needs of the many (everyone else). Crashing and restarting in a really fast loop usually indicates that something is wrong. But clearly not ALWAYS.
There are a lot of good ideas in this thread. Some that stand out to me, with some commentary.
restart-on-exit-0 -- mycommand -arg -arg -arg. I don’t love this but it is totally back-compatible.

NOTE: These are not all mutually exclusive!
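A rough sketch of what a restart-on-exit-0 wrapper could do, assuming the semantics named above: rerun the command on a clean exit, but propagate a non-zero exit so genuine crashes still surface (and still get normal backoff treatment). The wrapper function and the demo worker below are hypothetical.

```shell
#!/bin/sh
# Hypothetical restart-on-exit-0 wrapper: rerun the wrapped command as long
# as it exits 0 (a deliberate restart); stop and propagate any non-zero exit.
restart_on_exit_0() {
    runs=0
    while :; do
        "$@"
        status=$?
        if [ "$status" -ne 0 ]; then
            echo "non-zero exit ($status) after $runs clean runs" >&2
            return "$status"
        fi
        runs=$((runs + 1))
        echo "clean exit #$runs, restarting..." >&2
    done
}

# Demo workload: exits 0 twice, then exits 1, so this sketch terminates.
demo_count=0
demo_worker() {
    demo_count=$((demo_count + 1))
    [ "$demo_count" -le 2 ]
}

wrapper_status=0
restart_on_exit_0 demo_worker || wrapper_status=$?
```

The appeal of this shape is that it needs no API change: the wrapper runs inside the container, so the kubelet only ever sees a long-running process or a real failure.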
It’s important to repeat - this is designed to protect the system from badly behaving apps. We don’t want to remove that safety. If we add an explicit API to bypass it, we can at least let cluster admins install policies that say “oh no you don’t”. But APIs have costs that we pay forever.
So my strong preference is to do something smart without any API surface, and only if we can’t make that work to add API. Some combination of options 1, 2, and 4 seem like we could put a real dent in the problem, and then see what’s left.
If you scroll up, multiple ready-made solutions have been linked that will delete pods in crashloopbackoff state. The objection that logs won’t be preserved that followed is not really compelling; if you’re not aggregating logs in your cluster, that’s the problem to solve. There are already multiple low effort paths for you to accomplish what you want to accomplish.
If this were a rampant high impact problem, there would be popular CRDs and operators for dealing with this, because k8s was thoughtfully designed for extensibility. There probably are some in the wild. For every use case described in here, there are plenty of other people with the same use case who have found suitable means to accomplish their ends with the current core k8s APIs. They’ve probably never even seen this gh issue.
Can you provide a concrete example of where a docker image cannot be FROM’d? I don’t believe that’s true. Even still, you could use volumes to inject a self-contained init process, then use cmd/args. Open/closed-source is completely irrelevant, as nothing I’m proposing involves touching source code. You’re describing contrived what-ifs that aren’t relevant to the proposition of wrapping an image’s default entrypoint with an init process.

No commercial license for a container image would be intentionally written such that it prevents running on Kubernetes, unless it’s intentionally blocking Kubernetes usage, in which case it’s a moot point.
We have an even more painful scenario, and a reason why MaxContainerBackOff and backOffPeriod must be tunable per node. We have two ingress nodes running Calico and nginx-ingress pods. Besides, we have Keepalived on the ingress nodes watching for specified pods to be in the Running state and moving a VIP (external real IP) according to that condition. In case all ingress nodes go down, stay unresponsive for some time, and then one of them returns, Kubernetes will try to restart the calico-node pod (the only available restartPolicy in the Calico DaemonSet is “Always”), but the pod will stay in CrashLoopBackOff state for up to 5 minutes, leaving the whole cluster unavailable from outside while Kubernetes simply does nothing and waits for the timeout to expire. Instead, it is vital to push pods through their internal cycle as hard and as fast as possible in that particular scenario.
I guess a more direct way to achieve what I am looking for would be a kubectl restart pod_name -c container_name that was explicitly exempted from crash-loop backoff (see https://github.com/kubernetes/kubernetes/issues/24957#issuecomment-221173305 for related discussion), or some other way to indicate that we’re bringing the container down on purpose and are not in an uncontrolled crash loop.

But the 5-minute max backoff / 10-minute backoff reset for image pull and crash backoff seems far too high for development environments regardless. I’d like to tune those down significantly on my Minikube anyway.
This is currently the 5th most upvoted open issue in this repository, so I think your assertion here is false.
This is a false premise, and I’ll explain why.
cmd and args are not part of the artifact, and that’s all you need to control how the application is initialized. There’s no such thing as a manifest you’re not able to control, be it for coding or licensing reasons. If you’re talking about a 3rd-party helm chart, it would be a bug in the chart (or a missing feature) if it doesn’t have an init process suitable to your needs. The way your process is initialized is akin to the env vars you choose to set; it is not part of your application’s source, whether it’s transparent and modifiable or opaque and unmodifiable.

I’ll go a step further and challenge your claim that you cannot legally build your image from another, potentially proprietary image, but I’d like you to describe your particular situation if that is wrong.

The bottom line is there is no scenario where you’re “allowed” to configure a yet-to-be-implemented crashloopbackoff setting but “not allowed” to control the process by which your app is executed, excluding literally a person or group of people you work with who exert arbitrary constraints not founded in technical or legal underpinnings.
I have the same problem: we process user input that may put the app in an invalid state, and we need to restart it immediately. We designed the system with this behavior (each pod processes one thing at a time, reading work from a queue), but the fact that it takes more and more time to restart each time is very annoying.
I want to see this implemented as well. I understand that people can misuse it, which would obviously cause them problems, but in certain cases this would make things much better. I just don’t understand why this is not configurable. We don’t even have shells in our containers anymore to implement a workaround 😐.
I have a similar requirement. Why not directly file a PR, @tedyu?
Defining a normal pattern is a subjective matter. From the above comments, some use cases would restart the container constantly (https://github.com/kubernetes/kubernetes/issues/57291#issuecomment-561634523).

My suggestion from https://github.com/kubernetes/kubernetes/issues/57291#issuecomment-553520647 can be refined by putting a lower bound on the backoff interval - say, 10 seconds. This way, there is a limit on the frequency of polling the master.
Having the ability to adjust this would make for better resiliency testing. In my case, I am starting a Deployment with a bunch of replicas with intentionally misconfigured memory limits to force the Pods to be OOMKilled, and I want it to spin like that for a while.

@eroldan I have been using init containers to make sure that backend services are running/updated before launching frontends. Of course this won’t prevent issues if the backend goes down during work, but maybe you can change the frontend liveness check to report healthy even when the backend is not working.