kubernetes: deployments do not support (honor) container restartPolicy

Steps: Create a deployment file and set restartPolicy to “Never”. Result:

The Deployment "foo" is invalid.
spec.template.spec.restartPolicy: Unsupported value: "Never": supported values: Always
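
For context, a minimal manifest of the kind that triggers this validation error might look like the following (the API version, names, and image are illustrative):

apiVersion: apps/v1
kind: Deployment
metadata:
  name: foo
spec:
  replicas: 1
  selector:
    matchLabels:
      app: foo
  template:
    metadata:
      labels:
        app: foo
    spec:
      restartPolicy: Never        # rejected: Deployment pod templates only accept Always
      containers:
      - name: app
        image: example/app:latest # placeholder image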

About this issue

  • State: closed
  • Created 8 years ago
  • Reactions: 70
  • Comments: 93 (7 by maintainers)

Most upvoted comments

If deployments support only restartPolicy: Always, why does that parameter exist at all? It doesn’t make much sense to have a ‘parameter’ that can take only one value…

What is the alternative for running one-time or periodic tasks, like data import or backups?

Does anyone know why deployments only support restartPolicy == Always? It’s not clear from the documentation and I can’t spot the reason easily in the code.

I think there are legitimate use cases for “Never”. For example, I have a pod running on a node and it has a liveness probe. That node then fails in some way that makes it remain up but perform slowly. The liveness probe fails and the kubelet will now try to recreate the container. But the problem is with the node, so the new container doesn’t become healthy. And my ReplicaSet doesn’t see the pod as failed, so it doesn’t try to create a new pod elsewhere.

If I could set the restart policy to Never then I imagine in this scenario: liveness probe fails, container terminated, pod marked as failed, replica set creates a new pod (maybe or maybe not on the same node).

Or have I misunderstood something?

Sorry for a late follow up, but what if I’d like to deliberately kill the pods instead of restarting? I’ve got a deployment with pods that are somewhat buggy. They do not restore themselves well after one of the containers fails and thus my deployment becomes unstable after some time. How can I maintain a Deployment (and consequently a ReplicaSet) that would kill and recreate failing Pods rather than restarting them?

Another example that is difficult to handle: sidecars and sidecar proxies, in particular. If an application container does not properly clean up after itself (e.g. UDS sockets), then the pod stays alive since the sidecar is still running. The application keeps trying to restart, and keeps failing since it attempts to use some resources that have not been cleaned up.

I’m experiencing the same problem. Ugh, three years since the ticket was opened. You need to fix the documentation!

+1 Our application is designed in such a way that if a container crashes it will fail to start again. We need the option “spec.template.spec.restartPolicy”: “Never”.

Why was this closed? This issue is important.

@ichekrygin Deployments only support restartPolicy = Always; so do Replication Controllers, Replica Sets, and Daemon Sets.

I’ll document this.

I also think it would be great to reopen this.

I genuinely cannot believe the amount of people requesting this and the stubbornness displayed in this issue.

There is a clear use-case for this feature.

I am very confused - can the maintainers explain clearly why this issue has been closed multiple times when there is clear community desire for this feature? We just ran into a cluster-level failure triggered by the fact that we could not have the calico-typha pod re-scheduled on a different host due to a random port conflict that happened (tldr: something on the host randomly used port 9093… causing calico to fail to start up and bind its metrics listener).

The desired behavior here seems very reasonable… can someone clarify why it’s still not implemented?

Why is this issue closed? How do I get k8s to not restart a pod? What is the point of a parameter if it can only take one value?

I wonder why, even though this was raised 3 years back, the documentation here https://v1-12.docs.kubernetes.io/docs/concepts/workloads/pods/pod-lifecycle/#restart-policy still says

A PodSpec has a restartPolicy field with possible values Always, OnFailure, and Never. 

while I get

... is invalid: spec.template.spec.restartPolicy: Unsupported value: "Never": supported values: "Always"

kubectl version

Client Version: version.Info{Major:"1", Minor:"14", GitVersion:"v1.14.3", GitCommit:"5e53fd6bc17c0dec8434817e69b04a25d8ae0ff0", GitTreeState:"clean", BuildDate:"2019-06-06T01:44:30Z", GoVersion:"go1.12.5", Compiler:"gc", Platform:"darwin/amd64"}
Server Version: version.Info{Major:"1", Minor:"12", GitVersion:"v1.12.8", GitCommit:"a89f8c11a5f4f132503edbc4918c98518fd504e3", GitTreeState:"clean", BuildDate:"2019-04-23T04:41:47Z", GoVersion:"go1.10.8", Compiler:"gc", Platform:"linux/amd64"}

The Never value is required for some troubleshooting purposes, so that logs are not lost during development stages when time is tight and things like the fluentd configuration are not complete yet. Kindly add support for the Never restart policy.

So many years have passed and so many people have raised this question, yet it just doesn’t get solved. I can’t understand it.

I’d like to have the ability to set Never in order to be able to debug in a pod which is having some issues only in a kubernetes environment.

I think restartPolicy: recreate is more like it. It’s not about restarting; it’s more “terminate this pod and replace it with a new one” instead of trying to restart containers.

I feel like a broken record… but this is a really pretty big issue. There are tons of cases where you do not want to restart a broken pod, but replace it. I am really surprised at how little movement there is here. 😕

Just use a pod then, you don’t need a deployment

Edit: unless you do need a deployment for some reason idk. But in my experience to do what you described I would just use a pod instead of a deployment.

Please reopen because this is critical and particularly bad when combined with an unstable volume mount bug like this: https://github.com/kubernetes/kubernetes/issues/67643 or a dead FUSE daemon.

If any volume mount is broken for a pod, we have no choice but to restart the whole pod. We have put in a liveness probe and an in-code health check that exits after a write failure or any mountpoint abnormality. But if restartPolicy is Always, the pod simply enters a crash loop without ever being recreated.

We used to be able to set “Never” on deployments and it worked as expected. However, we recently found that this validation is now enforced and our existing mechanism has stopped working.

Another possible solution is to add a time limit on CrashLoopBackOff and kill the pod after that limit. We can write a script to do so, but it would be better if it could be built into the official controller or kubelet.

@llech you can always use a CronJob. But I fully agree: if there is just one supported policy, why does that parameter exist at all? Is it a placeholder for future extensions?
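
To illustrate the CronJob route for the periodic-task question above (backups, data imports): a CronJob’s pod template does accept OnFailure or Never. A minimal sketch, where every name and the image are placeholders:

apiVersion: batch/v1
kind: CronJob
metadata:
  name: nightly-backup            # hypothetical name
spec:
  schedule: "0 3 * * *"           # run at 03:00 every day
  jobTemplate:
    spec:
      backoffLimit: 2             # retry a failed run at most twice
      template:
        spec:
          restartPolicy: OnFailure   # Never is also accepted here, unlike in a Deployment
          containers:
          - name: backup
            image: example/backup:latest              # placeholder image
            args: ["--target", "s3://example-bucket"] # placeholder arguments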

Or if you won’t fix it, could you please fix the documentation, specifically https://kubernetes.io/docs/concepts/workloads/pods/pod-lifecycle/#restart-policy ?

And if you’re going to say, “well, Pod Containers can have a non-Always restart policy, but not when they’re wrapped in a Deployment” then maybe a note to that effect on the restart-policy page would be helpful?

I mean, is there some practical reason that the Deployment can’t support Never, or is it just a principle or something? Can we please have it?

I think I have a different use case than the ones I’ve read above.

In this use case your cluster should stream the k8s logs to a long term storage (which you probably should do anyway). A common pattern is to eg run Fluentbit as a DaemonSet and stream the container logs from the k8s nodes to eg Elasticsearch. This allows indexing, searching and browsing the logs with eg Kibana. It also decouples the access control of log viewing from the access control of the k8s resources (kubectl logs).

Compiling the various known use cases for “If Pod-XYZABC exits for any reason, do not restart it, provide me a new one”, either mentioned in other issues, or from my own personal experience (in providing infrastructure consulting to startups) - so as to avoid confusing it with issues that would be solved with “restartTogether”, or other potential fixes.

  • Persistent errors that survive restarts

    • Due to possible volume driver issues : https://github.com/kubernetes/kubernetes/issues/67643
    • Due to mounting of host resources that has failed
      • mounting of a cache SSD that is now physically corrupted
      • GPU mounting issues
    • Other weird driver issues (I run X11 with Selenium in Docker; I can’t explain why there is a 1 in 50 chance that a new container will persistently have an error with its graphics driver which survives restarts)
    • Application logic failures, especially of legacy / buggy apps, despite removing all known file changes (just what is it modifying at the kernel level, wth?)
  • Security of container appspace

    • For QEMU/KVM (or VMware, etc.), “snapshot” volumes or images are used to guarantee that no changes persist between restarts [link-to-docs]. This helps guarantee that in the event of a server runtime compromise, if files were to be injected into the container, this can be resolved by a simple “restart”. Compliance teams also really love this feature, and this was nearly a major show stopper for one known bank migration from VM to K8S
    • Strictly speaking this is not the same feature, but a pod recreation is effectively a clean restart

Note: For most of the examples listed above, there is no issue with the PVC itself or its data, so those PVCs are reused across pods; a PVC is also relatively easy to “reset” with an entrypoint script or sidecar, though it may overlap with the “mount issues” in the driver.


In general, while most issues could probably be “fixed” by “fixing” the application software - especially application logic failures - it is important to recognise that a large percentage of the people in charge of a company’s Kubernetes infrastructure may not have the ability to make such changes. They may not even have a say in the applications involved, as many enterprises have started forcing the move to docker/cloud “at all costs”.

  • Lack of domain expertise to make changes to volumedrivers / linux kernel code
  • Legacy applications may not have any developers, or even the source code, left
  • Legacy will be legacy

In all the above cases, implementing health checks does not fully “work”: while it does help K8S operators notice the issue (infinite restart loop), they still need manual intervention with vanilla k8s. It really prevented some operators from being able to “sleep in peace”, knowing that if enough of their replicas entered a restart loop without their daily/weekly intervention, it would mean downtime (that was until I advised automating the restarts on a schedule).


As a result of the above, currently in the field I have seen the following being done to work around this lack of a “replace pod on error” feature.

  • Custom Schedule / descheduler (what the bank did)
  • Service account integrated into a sidecar with a custom program doing the termination: https://github.com/kubernetes/kubernetes/issues/24725#issuecomment-724834010
  • Custom scripts (watchdog container) which detect such issues and run the respective kubectl/service account delete command (a sketch of this pattern follows after this list)
  • Less than ideal fixes
    • Oh, it enters an infinite failure loop, just ignore it cause we have Y+ redundancy and this will be fixed when
      • We redeploy every 1 hour (or X minutes, number depends on frequency of issue and redundancy involved)
      • We run on GKE pre-emptible nodes, eventually that 24 hours would replace the host
    • Here at secure enterprise X, we do not allow anyone to provision service accounts + other restrictions
      • I know of two sysadmins who are running kubectl with bash scripts to automate a fix for this issue.
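
For reference, a rough sketch of what such a watchdog commonly looks like, written here as a CronJob rather than a sidecar. Everything in it is an assumption: the names, the app=my-app label selector, the pod-reaper ServiceAccount (which needs list and delete on pods), and an image that ships both kubectl and jq.

apiVersion: batch/v1
kind: CronJob
metadata:
  name: crashloop-reaper              # hypothetical name
spec:
  schedule: "*/5 * * * *"             # check every five minutes
  jobTemplate:
    spec:
      template:
        spec:
          serviceAccountName: pod-reaper        # assumed ServiceAccount with list/delete on pods
          restartPolicy: OnFailure              # allowed here, since this is a Job pod
          containers:
          - name: reaper
            image: example/kubectl-jq:latest    # assumed image containing kubectl and jq
            command: ["/bin/sh", "-c"]
            args:
            - |
              # Delete pods of the target app that are stuck in CrashLoopBackOff,
              # so the ReplicaSet replaces them with fresh pods.
              kubectl get pods -l app=my-app -o json \
                | jq -r '.items[] | select(any(.status.containerStatuses[]?;
                    .state.waiting.reason == "CrashLoopBackOff")) | .metadata.name' \
                | xargs -r kubectl delete pod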

Finally, in my opinion, it’s the ultimate expression of “cattle” vs “pets”: if there is any issue, I kill the pod and get a new one. And if operators want to live by this ideal to the extreme, so be it =)

In general most of the above is fixed if “restartPolicy: Never” is accepted and new pods are created on failure, or, as mentioned, a new type of “ReplicaSet” is created to support this behaviour.

(to be clear, this should not be the default due to the potential increase in resource drain, and such usage should be a decision of the operator)

Please re-open this, we need the pod not to restart, what’s the point of “Never” if we can’t use it?

I’m experiencing the same problem, +1. We have some sidecars running in our pod which will keep running forever, and we want the pod to fail and restart if the primary container fails.

@kubernetes/kubectl we should document which restart policies Deployments support.

It would be better to provide some context here, like why you need this capability or what you have tried; bare +1s are not ideal for collecting use cases.

I would like to be able to not restart a deployment’s pods, in order to understand what happened when they fail and to block the constant reboots that may occur, for example during debugging.

Now I’m trying a workaround of changing the Deployment into a Job, because that way it seems possible to not restart on failure.
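
For what it’s worth, a Job’s pod template does accept Never, so a failed pod (and its logs) stays around for inspection instead of crash-looping. A minimal sketch with placeholder names:

apiVersion: batch/v1
kind: Job
metadata:
  name: debug-run                 # hypothetical name
spec:
  backoffLimit: 0                 # do not create replacement pods after a failure
  template:
    spec:
      restartPolicy: Never        # accepted for Job pods; the failed pod is kept for log inspection
      containers:
      - name: app
        image: example/app:latest # placeholder image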

Is the documentation supposed to be up to date? I feel like I’m missing something here: https://kubernetes.io/docs/concepts/workloads/pods/pod-lifecycle/#restart-policy

Please don’t spam.

I also have a use case like @kachkaev where I would prefer to have a pod die and be rescheduled as a new pod than be restarted. This feature would be useful.

Just my 2 cents: I need restartPolicy: Never to run load testing tools. I cannot launch a plain pod (which supports Never) and have > 1 replicas. Naturally I shifted to looking at a Deployment, but the same incompatibility exists. If my pods restart it will result in a never-ending load test. In the case of my database load testing tools, they end up sampling rows out of a database to generate read operations. Over time that sample set gets larger and larger because the pods are restarting. This effectively halts the cluster.

I am looking forward to ephemeral containers for this use case, but I fear they may have the same drawbacks. I need a way to launch containers in parallel without restarts.
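
One sketch that fits the parallel, no-restart load-testing case is a Job with parallelism instead of a Deployment (all names, the image, and the arguments are placeholders):

apiVersion: batch/v1
kind: Job
metadata:
  name: load-test                 # hypothetical name
spec:
  parallelism: 10                 # run ten pods at once, roughly analogous to replicas
  completions: 10                 # each pod runs exactly once
  backoffLimit: 0                 # failed pods are not retried, so the test cannot loop forever
  template:
    spec:
      restartPolicy: Never
      containers:
      - name: loadgen
        image: example/loadgen:latest   # placeholder load-testing image
        args: ["--duration", "10m"]     # placeholder arguments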

Mh, I must’ve missed that, sorry

I kinda understand that, but I don’t see a difference between restarting a Pod and just letting it die and creating a new one; the end result is the desired number of replicas

The same thing that happens when you delete a Pod or a node dies; the current number of replicas is below the desired one, therefore the ReplicaSet controller creates a new pod

Additionally, it doesn’t click for me why it’s necessary for Deployments to be restartPolicy: Always

I get that that’s your desired concept, but I don’t see any drawbacks to allowing other policies and leaving Always as the default; on the other side I see quite a few advantages, like the fact that this portion of the community would benefit from it.

A custom controller is not something you’d write in a day and push to production, but I’d think allowing other policies would be.

As for your examples;

I wouldn’t say that it’s in any way close to being akin to a Job, as Jobs run to completion and stop. Deployments keep a number of replicas up, forever.

Also, there is a very big difference between a Deployment with restartPolicy: Never and creating the Pods manually; the first one is automatic.

There are also advantages like https://github.com/kubernetes/kubernetes/issues/24725#issuecomment-561396435;

Please reopen because this is critical and particularly bad when combined with an unstable volume mount bug like this: #67643 or a dead FUSE daemon.

If any volume mount is broken for a pod, we have no choice but to restart the whole pod. We have put in a liveness probe and an in-code health check that exits after a write failure or any mountpoint abnormality. But if restartPolicy is Always, the pod simply enters a crash loop without ever being recreated.

We used to be able to set “Never” on deployments and it worked as expected. However, we recently found that this validation is now enforced and our existing mechanism has stopped working.

It’s part of the nature of the deployment that it keeps things alive. If this is unfortunate for you, then deployment is not what you should use.

An interesting use case mentioned here is that some people need to tear down and recreate the whole pod. @diranged You may want that, too? But note that this could not be achieved with restartPolicy anyway. It would need a wholly new top-level setting in the YAML spec.

Ok, so I think it’s clear from the community that there is a desire to have a way of saying “If Pod-XYZABC exits for any reason, do not restart it. Let it die and go away.” I think beyond that, the mechanics of how we accomplish such a thing is up to the experts in the community who understand the tooling the most.

As a non expert, my mental model is that a Deployment handles replacing Pods that go away quite well, so if we could just tell the PodSpec to “go away when the processes exit”, that feels like the answer I think people are grasping for. However, perhaps that is simply not how the interaction between a Pod and Deployment and the Scheduler works… so maybe the parameter should be somewhere else.

I will note though that the community is confused by the appearance of the restartPolicy parameter on the PodSpec itself in this case (though, the argument that it’s just part of the PodSpec, so it can’t be hidden is reasonable). It just seems to me like the most intuitive thing is for the behavior to be the following:

  • DeploymentA creates PodA, PodB, PodC (with restartPolicy: Always)
  • PodC exits (either 0 or 1 exit code)
  • PodC is restarted by the Kubelet, and the Deployment is none the wiser…

Alternatively

  • DeploymentA creates PodA, PodB, and PodC with restartPolicy: Never
  • PodC exits (either 0/1 exit code)
  • PodC is then terminated by the Kubelet
  • DeploymentA then creates PodD to replace the PodC which went away

Please add OnFailure and Never. +1, we need it to be configurable as OnFailure or Never. Always does not fit the real cases here.

Well, to be fair, the parameter exists because it is inherited from the pod spec. A deployment spec contains a pod spec with some constraints, one of them being that restartPolicy can only be Always.

I think the reasoning behind this is that deployments are not jobs. Just as a container must be able to cope with the fact that it can be shut down at any time, a deployment must be able to cope with the fact that it is kept alive by restarts.

Probably not quite. A rolling restart seems to be something triggered by a user, while what I’m asking for is pod recreation on failure (rather than a restart).

We would like this functionality too, particularly restartPolicy: OnFailure.

Our application sometimes needs to restart, but that triggers alerts because the Pod restarted. Currently, there is no metric failureRestart vs successRestart so we can’t distinguish between these two cases.

If the Pod just got deleted on container exit and then recreated that would solve that little annoyance for us.

But our maybe weird use-case aside, one big question which I didn’t find answered here is still open;

Why does this constraint exist in the first place?

Maybe if you could explain why the DeploymentSpec only allows PodSpecs with restartPolicy: Always we can accept that, but as it stands now I can only surmise that it’s just some arbitrary decision that has been made

I feel like a broken record… but this is a really pretty big issue. There are tons of cases where you do not want to restart a broken pod, but replace it. I am really surprised at how little movement there is here. 😕

Comments in this issue mix two separate aspects:

  • run-to-completion or fail-and-exit use cases (these are addressed by Jobs)
  • use cases that want to keep the application running, but require the restart of the entire pod instead of just the container

The second case doesn’t seem to be addressed in any way yet. Another example of that is a main container that needs the action of an init container to be run on every start.

Sounds like this should get addressed by https://github.com/kubernetes/enhancements/pull/912. My workaround for now has been to set up a ServiceAccount that can delete pods, and then run kubectl delete pods $HOSTNAME from within my container.
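
For anyone copying that workaround, the RBAC side looks roughly like the following; the names and namespace are assumptions, and the Role is scoped to just what the in-container kubectl delete pods $HOSTNAME call needs:

apiVersion: v1
kind: ServiceAccount
metadata:
  name: self-deleter              # hypothetical name
  namespace: my-namespace         # hypothetical namespace
---
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: pod-deleter
  namespace: my-namespace
rules:
- apiGroups: [""]
  resources: ["pods"]
  verbs: ["get", "delete"]        # enough to look up and delete the pod by name
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: pod-deleter
  namespace: my-namespace
subjects:
- kind: ServiceAccount
  name: self-deleter
  namespace: my-namespace
roleRef:
  kind: Role
  name: pod-deleter
  apiGroup: rbac.authorization.k8s.io

The Deployment’s pod template then sets serviceAccountName: self-deleter so that the kubectl call from inside the container is authorized.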

I was also confused by the failure message, which said ‘Unsupported value: "Never": supported values: "Always"’. There is no document saying that this value isn’t supported by Deployments.

I think I have a different use case than the ones I’ve read above. I have a sidecar which is failing on startup that I need to troubleshoot. The catch is that it’s in an OpenShift environment where I don’t have direct access via kubectl; I can only access the container logs via its web UI, and every time the container restarts the UI loses the old container’s log output. I tried setting the restart policy to “Never” to prevent this from happening and see if that would preserve the logs long enough for me to read them, but got the error that this isn’t supported, which led me here.

Maybe if you could explain why the DeploymentSpec only allows PodSpecs with restartPolicy: Always we can accept that, but as it stands now I can only surmise that it’s just some arbitrary decision that has been made

@cwrau This has been explained in this thread. The deployment controller is designed for maintaining availability of long running services that are not routinely designed to exit. (Web/application servers, queue workers that poll for new work forever etc.)

Having a deployment with a restart policy of OnFailure is basically akin to having a Job; Jobs are designed to run to completion and not restart.

With a restart policy of Never, you may as well create the pods directly.

I’d suggest that if you need some other behavior, it’s a different class of thing and you should consider writing a custom controller to handle it.

Only a RestartPolicy equal to Never or OnFailure is allowed.

Only a .spec.template.spec.restartPolicy equal to Always is allowed, which is the default if not specified.

The doc says only Always is supported for Deployments

You might want to consider switching to a Job

Only a RestartPolicy equal to Never or OnFailure is allowed.

See https://kubernetes.io/docs/concepts/workloads/controllers/deployment/#pod-template https://kubernetes.io/docs/concepts/workloads/controllers/jobs-run-to-completion/#pod-template

Jobs are immutable; I need a run-to-completion workload that can be kicked off again and again whenever I want.

+1, we need it to be configurable as OnFailure or Never. Always does not fit the real cases here.

This has me so confused. I am creating a deployment that automatically creates a replicaset - I am not creating the replicaset directly. How do I get that replicaset to have restartPolicy Never?

Edit: oh now I see that replica sets only support restartPolicy = Always

@ichekrygin Deployments only support restartPolicy = Always; so do Replication Controllers, Replica Sets, and Daemon Sets.

I’ll document this.

😦 even more confused now

Another edit: seems like I just need to use a job instead of a deployment

A final edit: since our applications are pretty ephemeral and can easily be recreated, we switched to naked pods instead of deployments or jobs and it is working great 😃
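
For comparison, a bare Pod does accept the field, which is part of what makes the Deployment restriction surprising at first glance (placeholder name and image):

apiVersion: v1
kind: Pod
metadata:
  name: one-shot                  # hypothetical name
spec:
  restartPolicy: Never            # valid on a plain Pod; the kubelet will not restart its containers
  containers:
  - name: app
    image: example/app:latest     # placeholder image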

I have a similar requirement. @kachkaev

I’ll throw another rock into this huge pile.

So, I was working on upgrading Kubernetes (oh noes!). Among mountains of things that don’t work, there’s this particular tidbit: after you’ve finished running kubeadm upgrade apply, you need to drain the node. And if you happen to have single pods on that node not managed by deployments / replica sets / daemon sets and so on – well, tough luck: instead of moving your pods to a different node, they are going to be killed.

Sucks, right? – well, that’s not the end of the story… I was about to get smart and just automatically wrap any pod in a deployment object – how bad can it be, right? – and now I’m here – fun, fun, fun!

A pod reserves the node’s resources to make space for your containers. Killing the pod means losing this reservation.

I don’t think this is a good idea.

This is exactly why we need this. We want the pod to lose the reservation. The pod blocks our limited GPU resources and we want them to be free for other pods. Having the deployment’s pods never exit keeps those resources blocked.

For me, I don’t get why we would want both restarting pods and restarting containers in the pod. Especially if the pod has one container, why do we need to restart the container, and why can’t it just be a new pod? A deployment starts a new pod when the pod fails. Why do we need to implement all the features on all the different objects? Doesn’t make sense to me. And somehow the Never or OnFailure options are blocked just because it does support Always? That seems totally silly.

I kinda understand that, but I don’t see a difference between restarting a Pod and just letting it die and creating a new one; the end result is the desired number of replicas

The same thing that happens when you delete a Pod or a node dies; the current number of replicas is below the desired one, therefore the ReplicaSet controller creates a new pod

As for your examples;

I wouldn’t say that it’s in any way close to being akin to a Job, as Jobs run to completion and stop. Deployments keep a number of replicas up, forever.

Also, there is a very big difference between a Deployment with restartPolicy: Never and creating the Pods manually; the first one is automatic.

I completely agree, considering this would be very much appreciated.

Yes; honestly that was the most frustrating part of this issue for me. If I can’t change it, remove it as a parameter. Having a parameter that can only be set to a single value is a frustrating user experience. It’s like that South Park episode where Cartman buys an amusement park but no one is allowed to buy tickets to enter.

Deployment templates are pod specs, and this is good because you specify a pod after all. And a pod may have an arbitrary restartPolicy. But as part of a deployment spec, there is the additional constraint that restartPolicy can only be “Always”.

The alternative would be to have pod-spec and pod-within-deployment-template-spec, which I would find even more confusing.

It’s part of the nature of the deployment that it keeps things alive. If this is unfortunate for you, then deployment is not what you should use.

An interesting use case mentioned here is that some people need to tear down and recreate the whole pod. @diranged You may want that, too? But note that this could not be achieved with restartPolicy anyway. It would need a wholly new top-level setting in the YAML spec.

I feel like a broken record… but this is a really pretty big issue. There are tons of cases where you do not want to restart a broken pod, but replace it. I am really surprised at how little movement there is here. 😕

Just use a pod then, you don’t need a deployment

That is not what we are saying… we are saying that there are deployment-like models where you cannot or do not want the pod to restart in a failure, you want a replacement pod that has a new identity.

  • run-to-completion or fail-and-exit use cases (these are addressed by Jobs)

The issue with jobs for me is the fact that one can’t run a job with the same name. Even if I use a plugin that adds random gibberish to the job name, they are going to be piling up in the cluster. Lots of pollution after some time.

If your expectation is that your containers exit, you should use a Job. If you are concerned about these building up, then use a label and delete the old jobs before you spin up a new one. Also, timestamping is better than random gibberish.
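
A sketch of that pattern, with every name here a placeholder: a shared label makes old runs easy to sweep with kubectl delete jobs -l app=load-test, and ttlSecondsAfterFinished (on clusters where the TTL-after-finished controller is available) lets finished Jobs clean themselves up.

apiVersion: batch/v1
kind: Job
metadata:
  name: load-test-20240101-1200   # timestamped name, as suggested above (placeholder)
  labels:
    app: load-test                # shared label for sweeping old runs
spec:
  ttlSecondsAfterFinished: 3600   # optional: auto-delete the Job an hour after it finishes
  backoffLimit: 0
  template:
    metadata:
      labels:
        app: load-test
    spec:
      restartPolicy: Never
      containers:
      - name: loadgen
        image: example/loadgen:latest   # placeholder image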

+1, we need it to be configurable as OnFailure or Never. Always does not fit the real cases here.