kubernetes: StatefulSet - can't rollback from a broken state
/kind bug
What happened:
I updated a StatefulSet with a non-existent Docker image. As expected, one pod of the StatefulSet is destroyed and can’t be recreated (ErrImagePull). However, when I revert the StatefulSet to an existing image, the StatefulSet doesn’t remove the broken pod and replace it with a good one; it keeps trying to pull the non-existent image. You have to delete the broken pod manually to unblock the situation.
Related Stack Overflow question
What you expected to happen:
When rolling back the bad config, I expected the StatefulSet to remove the broken pod and replace it with a good one.
How to reproduce it (as minimally and precisely as possible):
- Deploy the following StatefulSet:
```yaml
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: web
spec:
  selector:
    matchLabels:
      app: nginx # has to match .spec.template.metadata.labels
  serviceName: "nginx"
  replicas: 3 # by default is 1
  template:
    metadata:
      labels:
        app: nginx # has to match .spec.selector.matchLabels
    spec:
      terminationGracePeriodSeconds: 10
      containers:
      - name: nginx
        image: k8s.gcr.io/nginx-slim:0.8
        ports:
        - containerPort: 80
          name: web
        volumeMounts:
        - name: www
          mountPath: /usr/share/nginx/html
  volumeClaimTemplates:
  - metadata:
      name: www
    spec:
      accessModes: [ "ReadWriteOnce" ]
      storageClassName: "standard"
      resources:
        requests:
          storage: 10Gi
```
- Once the 3 pods are running, update the StatefulSet spec and change the image to k8s.gcr.io/nginx-slim:foobar.
- Observe the new pod failing to pull the image.
- Roll back the change.
- Observe that the broken pod is not deleted (a kubectl sketch of these steps follows below).
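Put concretely, here is a rough kubectl sketch of the reproduction. It assumes the manifest above was saved as `statefulset.yaml` (the filename is arbitrary), and `web-2` stands in for whichever ordinal the rolling update breaks first:

```sh
kubectl apply -f statefulset.yaml        # the manifest above
kubectl rollout status statefulset/web   # wait until web-0..web-2 are Running

# Break it: point the container at a tag that does not exist.
kubectl set image statefulset/web nginx=k8s.gcr.io/nginx-slim:foobar
kubectl get pods -l app=nginx            # one pod (e.g. web-2) goes into ErrImagePull

# Roll back to the working image.
kubectl set image statefulset/web nginx=k8s.gcr.io/nginx-slim:0.8
kubectl get pods -l app=nginx            # the broken pod is never replaced
```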
Anything else we need to know?:
- I observed this behaviour both on 1.8 and 1.10.
- This seems related to the discussion in #18568
Environment:
- Kubernetes version (use `kubectl version`):
Client Version: version.Info{Major:"1", Minor:"9", GitVersion:"v1.9.7", GitCommit:"dd5e1a2978fd0b97d9b78e1564398aeea7e7fe92", GitTreeState:"clean", BuildDate:"2018-04-19T00:05:56Z", GoVersion:"go1.9.3", Compiler:"gc", Platform:"linux/amd64"}
Server Version: version.Info{Major:"1", Minor:"10+", GitVersion:"v1.10.5-gke.3", GitCommit:"6265b9797fc8680c8395abeab12c1e3bad14069a", GitTreeState:"clean", BuildDate:"2018-07-19T23:02:51Z", GoVersion:"go1.9.3b4", Compiler:"gc", Platform:"linux/amd64"}
- Cloud provider or hardware configuration: Google Kubernetes Engine
- OS (e.g. from /etc/os-release): COS
cc @joe-boyce
About this issue
- State: open
- Created 6 years ago
- Reactions: 133
- Comments: 63 (26 by maintainers)
Commits related to this issue
- Support StatefuleSet upgrade with workaround This option gives us the option to workaround current StatefulSet limitations around updates See: https://github.com/kubernetes/kubernetes/issues/67250 By ... — committed to bank-vaults/bank-vaults by bonifaido 6 years ago
- Fix statefulset can not rollback from a broken state docs: https://kubernetes.io/docs/concepts/workloads/controllers/statefulset/#forced-rollback known issue: https://github.com/kubernetes/kubernetes... — committed to fyuan1316/kubeflow by fyuan1316 5 years ago
- allows to changes rolling update strategy for statefulset applications with rollingUpdateStrategy: RollingUpdate, operator doens't perform statefulset update it delegates it to the kubernetes statefu... — committed to VictoriaMetrics/operator by f41gh7 3 years ago
- Relax zuul-scheduler pod failure when wrong config location This change ensures that the generate-tenant-config script when running via the zuul-scheduler pod's init-container won't fail when the con... — committed to softwarefactory-project/sf-operator by morucci a year ago
- Always reconcile ingester StatefulSet Changes to a StatefulSet are not propagated to pods in a broken state (e.g. CrashLoopBackOff) See https://github.com/kubernetes/kubernetes/issues/67250 This is ... — committed to andreasgerstmayr/tempo-operator by andreasgerstmayr 10 months ago
- Always reconcile ingester StatefulSet (#597) Changes to a StatefulSet are not propagated to pods in a broken state (e.g. CrashLoopBackOff) See https://github.com/kubernetes/kubernetes/issues/67250 ... — committed to grafana/tempo-operator by andreasgerstmayr 9 months ago
/reopen
/assign I’ll take a look, as this is highly needed by the community.
This is a real blocker for programmatic usage of StatefulSets (from an operator, for example). The real use case: the operator creates the StatefulSet with memory/CPU limits that cannot be fulfilled, so 2 pods are running and the 3rd stays Pending because no node has the resources to schedule it. Fixing the spec with smaller limits doesn’t help: the StatefulSet specification is updated, but the pods stay unchanged forever because the 3rd pod is still Pending. The only way out is to delete the pod manually, which totally contradicts the nature of operators 😦
This continues to bite us. How are people managing this in the meantime? The manual intervention of deleting the stuck pod is far from ideal.
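In the meantime, the only reliable workaround seems to be what the original report describes: fix the StatefulSet spec, then delete the stuck pod by hand so the controller recreates it from the updated revision. A rough sketch, using the example StatefulSet from this issue (the stuck pod’s name will vary; PVCs created by volumeClaimTemplates are not deleted with the pod):

```sh
# After rolling the StatefulSet spec back to a working image/limits:
kubectl delete pod web-2                 # the stuck pod; its PVC and data are kept
kubectl rollout status statefulset/web   # the controller recreates it from the fixed revision
```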
After some initial investigation, I believe the mitigation described above should be feasible:
The specific incarnation we’re talking about is when all of the following are true: the Pod is stuck Pending, it has never started running, and the StatefulSet has since been updated with a spec that should fix it.
In this situation, I think we can argue that it’s safe for StatefulSet to delete the Pending Pod for you, as long as we can ensure the Pod has not started running before we get around to deleting it. The argument would be, if the Pod never ran, then the application should not be affected one way or another if we delete it. We could potentially use the new ResourceVersion precondition on Delete to give a high confidence level that the Pod never started running.
There is still a very slight chance that a container started running and had some effect on the application already but the kubelet has not updated the Pod status yet. However, I would argue that the chance of that is small enough that we should take the risk in order to prevent StatefulSet from getting stuck in this common situation.
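To make the ResourceVersion-precondition idea concrete, here is a rough sketch of my own (not the actual controller change) using a recent client-go; `deleteIfStillPending` is a hypothetical helper name. The API server only honors the delete if the Pod object is unchanged since we observed it, which gives high confidence it never started running:

```go
// Rough sketch: delete a stuck Pending pod only if it has not changed since we
// observed it, via a ResourceVersion precondition on Delete.
package sketch

import (
	"context"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

func deleteIfStillPending(ctx context.Context, cs kubernetes.Interface, ns, name string) error {
	pod, err := cs.CoreV1().Pods(ns).Get(ctx, name, metav1.GetOptions{})
	if err != nil {
		return err
	}
	// Caller is expected to have verified, on this same object, that the pod is
	// Pending and that no container has ever started.
	rv := pod.ResourceVersion
	return cs.CoreV1().Pods(ns).Delete(ctx, name, metav1.DeleteOptions{
		// The delete only succeeds if the object still has this ResourceVersion;
		// otherwise the API server returns a Conflict and we should re-evaluate.
		Preconditions: &metav1.Preconditions{ResourceVersion: &rv},
	})
}
```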
I probably won’t have time to work on the code for this any time soon since I’m about to change jobs. However, I’m willing to commit to being a reviewer if anyone is available to work on this.
As far as I can tell, StatefulSet doesn’t make any attempt to support this use case, namely using a rolling update to fix a StatefulSet that’s in a broken state. If any of the existing Pods are broken, it appears that StatefulSet bails out before even reaching the rolling update code:
https://github.com/kubernetes/kubernetes/blob/30e4f528ed30a70bdb0c14b5cfe49d00a78194c2/pkg/controller/statefulset/stateful_set_control.go#L428-L435
I haven’t found any mention of this limitation in the docs, but it’s possible that it was a choice made intentionally to err on the side of caution (stop and make the human decide) since stateful data is at stake and stateful Pods often have dependencies on each other (e.g. they may form a cluster/quorum).
With that said, I agree it would be ideal if StatefulSet supported this, at least for clear cases like this one where deleting a Pod that’s stuck Pending is unlikely to cause any additional damage.
cc @kow3ns
+1
Is this fixed?
+1. This is a pretty big landmine when using StatefulSet: if you ever make a mistake, you’re stuck destroying your StatefulSet and starting over. In other words, a single bad update means you have to cause an outage to recover 😦
This issue is similar to #78007