kubernetes: StatefulSet - can't roll back from a broken state

/kind bug

What happened:

I updated a StatefulSet with a non-existent Docker image. As expected, one of the StatefulSet's pods is destroyed and can't be recreated (ErrImagePull). However, when I change the StatefulSet back to an existing image, it doesn't try to remove the broken pod and replace it with a good one; it keeps trying to pull the non-existent image. You have to delete the broken pod manually to unblock the situation.
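
For reference, the manual workaround looks roughly like this. It is only a sketch: the pod name web-2 assumes the example StatefulSet from the repro steps below, where the highest ordinal is updated first (RollingUpdate proceeds from the largest ordinal down) and is therefore the one that gets stuck.

```sh
# Delete the stuck pod by hand; the StatefulSet controller then recreates it
# from the current (rolled-back) revision.
kubectl delete pod web-2

# Watch the replacement come up.
kubectl get pods -l app=nginx -w
```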

Related Stackoverflow question

What you expected to happen:

When rolling back the bad config, I expected the StatefulSet to remove the broken pod and replace it with a good one.

How to reproduce it (as minimally and precisely as possible):

  1. Deploy the following StatefulSet:
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: web
spec:
  selector:
    matchLabels:
      app: nginx # has to match .spec.template.metadata.labels
  serviceName: "nginx"
  replicas: 3 # by default is 1
  template:
    metadata:
      labels:
        app: nginx # has to match .spec.selector.matchLabels
    spec:
      terminationGracePeriodSeconds: 10
      containers:
      - name: nginx
        image: k8s.gcr.io/nginx-slim:0.8
        ports:
        - containerPort: 80
          name: web
        volumeMounts:
        - name: www
          mountPath: /usr/share/nginx/html
  volumeClaimTemplates:
  - metadata:
      name: www
    spec:
      accessModes: [ "ReadWriteOnce" ]
      storageClassName: "standard"
      resources:
        requests:
          storage: 10Gi
  2. Once the 3 pods are running, update the StatefulSet spec and change the image to k8s.gcr.io/nginx-slim:foobar.
  3. Observe the new pod failing to pull the image (ErrImagePull).
  4. Roll back the change (set the image back to k8s.gcr.io/nginx-slim:0.8).
  5. Observe that the broken pod is never deleted or replaced (see the command sketch right after this list).
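
For convenience, the steps above translate roughly to the commands below. This is a sketch, not part of the original report: the manifest filename and the use of kubectl patch are my own choices.

```sh
# 1. Deploy the StatefulSet manifest above (saved here as web-statefulset.yaml)
#    and wait for web-0..web-2 to be Running.
kubectl apply -f web-statefulset.yaml
kubectl rollout status statefulset/web

# 2. Switch the single container's image to a tag that does not exist.
kubectl patch statefulset web --type='json' \
  -p='[{"op":"replace","path":"/spec/template/spec/containers/0/image","value":"k8s.gcr.io/nginx-slim:foobar"}]'

# 3. web-2 (the highest ordinal, updated first) goes into ErrImagePull / ImagePullBackOff.
kubectl get pods -l app=nginx

# 4. Roll back to the working image.
kubectl patch statefulset web --type='json' \
  -p='[{"op":"replace","path":"/spec/template/spec/containers/0/image","value":"k8s.gcr.io/nginx-slim:0.8"}]'

# 5. web-2 stays stuck on the bad image and is never deleted or replaced
#    until you delete it by hand.
kubectl get pods -l app=nginx
```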

Anything else we need to know?:

  • I observed this behaviour both on 1.8 and 1.10.
  • This seems related to the discussion in #18568

Environment:

  • Kubernetes version (use kubectl version):
Client Version: version.Info{Major:"1", Minor:"9", GitVersion:"v1.9.7", GitCommit:"dd5e1a2978fd0b97d9b78e1564398aeea7e7fe92", GitTreeState:"clean", BuildDate:"2018-04-19T00:05:56Z", GoVersion:"go1.9.3", Compiler:"gc", Platform:"linux/amd64"}
Server Version: version.Info{Major:"1", Minor:"10+", GitVersion:"v1.10.5-gke.3", GitCommit:"6265b9797fc8680c8395abeab12c1e3bad14069a", GitTreeState:"clean", BuildDate:"2018-07-19T23:02:51Z", GoVersion:"go1.9.3b4", Compiler:"gc", Platform:"linux/amd64"}
  • Cloud provider or hardware configuration: Google Kubernetes Engine
  • OS (e.g. from /etc/os-release): COS

cc @joe-boyce

About this issue

  • Original URL
  • State: open
  • Created 6 years ago
  • Reactions: 133
  • Comments: 63 (26 by maintainers)

Most upvoted comments

/reopen

/assign I'll take a look, as it seems this is highly needed by the community.

This is a real blocker for programmatic usage of StatefulSets (from an operator, for example). The real use case: the operator creates the StatefulSet with memory/CPU limits that cannot be fulfilled. So 2 pods are running and the 3rd stays Pending because it cannot be scheduled onto a node (no available resources). Trying to fix this by changing the specification to smaller limits doesn't help: the StatefulSet specification is updated, but all the pods stay unchanged forever while the 3rd pod remains Pending. The only way out is to delete the pod manually, which totally contradicts the nature of operators (see the sketch below).
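
A rough illustration of that sequence with plain kubectl. The names and resource values (myapp, 512Mi, etc.) are made up for the example; they are not from the original comment.

```sh
# The third replica cannot be scheduled because its requests are too large.
kubectl get pods -l app=myapp
#   myapp-2   0/1   Pending   0   10m

# Lower the requests/limits on the StatefulSet (kubectl set resources works on
# any workload with a pod template).
kubectl set resources statefulset/myapp -c=myapp \
  --requests=cpu=250m,memory=256Mi --limits=cpu=500m,memory=512Mi

# The StatefulSet's updateRevision changes, but the Pending pod is never
# replaced: it still carries the old controller-revision-hash and stays
# Pending forever unless it is deleted by hand.
kubectl get pod myapp-2 -o jsonpath='{.metadata.labels.controller-revision-hash}'
kubectl get statefulset myapp -o jsonpath='{.status.updateRevision}'
```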

How are people managing this in the meantime?

manually 😦

This continues to bite us. How are people managing this in the meantime? The manual intervention of deleting the stuck pod is far from ideal.

After some initial investigation, I believe the mitigation described above should be feasible:

For this specific incarnation of the general problem (a Pod that has never made it to Running state), we might be able to do something more automatic. Perhaps we can argue that it should always be safe to delete and replace such a Pod, as long as none of the containers (even init containers) ever ran. We’d also need to agree on whether this automatic fix-up can be enabled by default without being considered a breaking change, or whether it needs to be gated by a new field in StatefulSet that’s off by default (until apps/v2).

The specific incarnation we’re talking about is when all of the following are true:

  1. There’s a Pod stuck Pending because it was created from a bad StatefulSet revision.
  2. Deleting that stuck Pod (i.e. the workaround discussed above) would result in the Pod being replaced at a different revision (meaning we have reason to expect a different result; we won’t hot-loop).

In this situation, I think we can argue that it’s safe for StatefulSet to delete the Pending Pod for you, as long as we can ensure the Pod has not started running before we get around to deleting it. The argument would be, if the Pod never ran, then the application should not be affected one way or another if we delete it. We could potentially use the new ResourceVersion precondition on Delete to give a high confidence level that the Pod never started running.

There is still a very slight chance that a container started running and had some effect on the application already but the kubelet has not updated the Pod status yet. However, I would argue that the chance of that is small enough that we should take the risk in order to prevent StatefulSet from getting stuck in this common situation.
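
As a rough manual approximation of those two conditions with kubectl (the pod and StatefulSet names come from the repro above; the proposal itself does not prescribe these commands):

```sh
# Condition 2: the stuck pod was created from an old revision, so recreating it
# would produce a pod at the updated revision (no hot-loop). The two values
# below should differ.
kubectl get pod web-2 -o jsonpath='{.metadata.labels.controller-revision-hash}'
kubectl get statefulset web -o jsonpath='{.status.updateRevision}'

# Condition 1 / safety check: the pod is Pending and none of its containers
# (including init containers) ever started running.
kubectl get pod web-2 -o jsonpath='{.status.phase}'
kubectl get pod web-2 -o jsonpath='{.status.containerStatuses[*].state}'

# Only if both hold is it clearly safe to delete the pod (still the manual
# workaround; the proposal would have the controller do this automatically).
```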

I probably won’t have time to work on the code for this any time soon since I’m about to change jobs. However, I’m willing to commit to being a reviewer if anyone is available to work on this.

As far as I can tell, StatefulSet doesn’t make any attempt to support this use case, namely using a rolling update to fix a StatefulSet that’s in a broken state. If any of the existing Pods are broken, it appears that StatefulSet bails out before even reaching the rolling update code:

https://github.com/kubernetes/kubernetes/blob/30e4f528ed30a70bdb0c14b5cfe49d00a78194c2/pkg/controller/statefulset/stateful_set_control.go#L428-L435

I haven’t found any mention of this limitation in the docs, but it’s possible that it was a choice made intentionally to err on the side of caution (stop and make the human decide) since stateful data is at stake and stateful Pods often have dependencies on each other (e.g. they may form a cluster/quorum).

With that said, I agree it would be ideal if StatefulSet supported this, at least for clear cases like this one where deleting a Pod that’s stuck Pending is unlikely to cause any additional damage.

cc @kow3ns

+1

Is this fixed?

+1. This is a pretty big landmine in using StatefulSet: if you ever make a mistake, you're stuck with destroying your StatefulSet and starting over. In other words, if you ever make a mistake with a StatefulSet, you need to cause an outage to recover 😦

This issue is similar to #78007