kubernetes: StatefulSet - can't roll back from a broken state

/kind bug

What happened:

I updated a StatefulSet with a non-existent Docker image. As expected, one of the StatefulSet's pods is destroyed and can't be recreated (ErrImagePull). However, when I change the StatefulSet back to an existing image, it doesn't try to remove the broken pod and replace it with a good one; it keeps trying to pull the non-existent image. You have to delete the broken pod manually to unblock the situation.
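
For reference, the manual workaround looks roughly like this. It is only a sketch: the pod name web-2 assumes the example StatefulSet from the repro steps below, where the highest ordinal is updated first (RollingUpdate proceeds from the largest ordinal down) and is therefore the one that gets stuck.

```sh
# Delete the stuck pod by hand; the StatefulSet controller then recreates it
# from the current (rolled-back) revision.
kubectl delete pod web-2

# Watch the replacement come up.
kubectl get pods -l app=nginx -w
```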

Related Stackoverflow question

What you expected to happen:

When rolling back the bad config, I expected the StatefulSet to remove the broken pod and replace it with a good one.

How to reproduce it (as minimally and precisely as possible):

  1. Deploy the following StatefulSet:
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: web
spec:
  selector:
    matchLabels:
      app: nginx # has to match .spec.template.metadata.labels
  serviceName: "nginx"
  replicas: 3 # by default is 1
  template:
    metadata:
      labels:
        app: nginx # has to match .spec.selector.matchLabels
    spec:
      terminationGracePeriodSeconds: 10
      containers:
      - name: nginx
        image: k8s.gcr.io/nginx-slim:0.8
        ports:
        - containerPort: 80
          name: web
        volumeMounts:
        - name: www
          mountPath: /usr/share/nginx/html
  volumeClaimTemplates:
  - metadata:
      name: www
    spec:
      accessModes: [ "ReadWriteOnce" ]
      storageClassName: "standard"
      resources:
        requests:
          storage: 10Gi
  2. Once the 3 pods are running, update the StatefulSet spec and change the image to k8s.gcr.io/nginx-slim:foobar.
  3. Observe the new pod failing to pull the image (ErrImagePull).
  4. Roll back the change (set the image back to k8s.gcr.io/nginx-slim:0.8).
  5. Observe that the broken pod is never deleted or replaced (see the command sketch right after this list).
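
For convenience, the steps above translate roughly to the commands below. This is a sketch, not part of the original report: the manifest filename and the use of kubectl patch are my own choices.

```sh
# 1. Deploy the StatefulSet manifest above (saved here as web-statefulset.yaml)
#    and wait for web-0..web-2 to be Running.
kubectl apply -f web-statefulset.yaml
kubectl rollout status statefulset/web

# 2. Switch the single container's image to a tag that does not exist.
kubectl patch statefulset web --type='json' \
  -p='[{"op":"replace","path":"/spec/template/spec/containers/0/image","value":"k8s.gcr.io/nginx-slim:foobar"}]'

# 3. web-2 (the highest ordinal, updated first) goes into ErrImagePull / ImagePullBackOff.
kubectl get pods -l app=nginx

# 4. Roll back to the working image.
kubectl patch statefulset web --type='json' \
  -p='[{"op":"replace","path":"/spec/template/spec/containers/0/image","value":"k8s.gcr.io/nginx-slim:0.8"}]'

# 5. web-2 stays stuck on the bad image and is never deleted or replaced
#    until you delete it by hand.
kubectl get pods -l app=nginx
```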

Anything else we need to know?:

  • I observed this behaviour both on 1.8 and 1.10.
  • This seems related to the discussion in #18568

Environment:

  • Kubernetes version (use kubectl version):
Client Version: version.Info{Major:"1", Minor:"9", GitVersion:"v1.9.7", GitCommit:"dd5e1a2978fd0b97d9b78e1564398aeea7e7fe92", GitTreeState:"clean", BuildDate:"2018-04-19T00:05:56Z", GoVersion:"go1.9.3", Compiler:"gc", Platform:"linux/amd64"}
Server Version: version.Info{Major:"1", Minor:"10+", GitVersion:"v1.10.5-gke.3", GitCommit:"6265b9797fc8680c8395abeab12c1e3bad14069a", GitTreeState:"clean", BuildDate:"2018-07-19T23:02:51Z", GoVersion:"go1.9.3b4", Compiler:"gc", Platform:"linux/amd64"}
  • Cloud provider or hardware configuration: Google Kubernetes Engine
  • OS (e.g. from /etc/os-release): COS

cc @joe-boyce

About this issue

  • Original URL
  • State: open
  • Created 6 years ago
  • Reactions: 133
  • Comments: 63 (26 by maintainers)

Most upvoted comments

/reopen

/assign I'll take a look, as it seems this is highly needed by the community.

This is a real blocker for programmatic usage of StatefulSets (from an operator, for example). The real use case: the operator creates the StatefulSet with memory/CPU limits that cannot be fulfilled. So 2 pods are running and the 3rd stays Pending because it cannot be scheduled onto a node (no available resources). Trying to fix this by changing the specification to smaller limits doesn't help: the StatefulSet specification is updated, but all the pods stay unchanged forever while the 3rd pod remains Pending. The only way out is to delete the pod manually, which totally contradicts the nature of operators (see the sketch below).
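
A rough illustration of that sequence with plain kubectl. The names and resource values (myapp, 512Mi, etc.) are made up for the example; they are not from the original comment.

```sh
# The third replica cannot be scheduled because its requests are too large.
kubectl get pods -l app=myapp
#   myapp-2   0/1   Pending   0   10m

# Lower the requests/limits on the StatefulSet (kubectl set resources works on
# any workload with a pod template).
kubectl set resources statefulset/myapp -c=myapp \
  --requests=cpu=250m,memory=256Mi --limits=cpu=500m,memory=512Mi

# The StatefulSet's updateRevision changes, but the Pending pod is never
# replaced: it still carries the old controller-revision-hash and stays
# Pending forever unless it is deleted by hand.
kubectl get pod myapp-2 -o jsonpath='{.metadata.labels.controller-revision-hash}'
kubectl get statefulset myapp -o jsonpath='{.status.updateRevision}'
```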

How are people managing this in the meantime?

manually 😦

This continues to bite us. How are people managing this in the meantime? The manual intervention of deleting the stuck pod is far from ideal.

After some initial investigation, I believe the mitigation described above should be feasible:

For this specific incarnation of the general problem (a Pod that has never made it to Running state), we might be able to do something more automatic. Perhaps we can argue that it should always be safe to delete and replace such a Pod, as long as none of the containers (even init containers) ever ran. We’d also need to agree on whether this automatic fix-up can be enabled by default without being considered a breaking change, or whether it needs to be gated by a new field in StatefulSet that’s off by default (until apps/v2).

The specific incarnation we’re talking about is when all of the following are true:

  1. There’s a Pod stuck Pending because it was created from a bad StatefulSet revision.
  2. Deleting that stuck Pod (i.e. the workaround discussed above) would result in the Pod being replaced at a different revision (meaning we have reason to expect a different result; we won’t hot-loop).

In this situation, I think we can argue that it’s safe for StatefulSet to delete the Pending Pod for you, as long as we can ensure the Pod has not started running before we get around to deleting it. The argument would be, if the Pod never ran, then the application should not be affected one way or another if we delete it. We could potentially use the new ResourceVersion precondition on Delete to give a high confidence level that the Pod never started running.

There is still a very slight chance that a container started running and had some effect on the application already but the kubelet has not updated the Pod status yet. However, I would argue that the chance of that is small enough that we should take the risk in order to prevent StatefulSet from getting stuck in this common situation.
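
As a rough manual approximation of those two conditions with kubectl (the pod and StatefulSet names come from the repro above; the proposal itself does not prescribe these commands):

```sh
# Condition 2: the stuck pod was created from an old revision, so recreating it
# would produce a pod at the updated revision (no hot-loop). The two values
# below should differ.
kubectl get pod web-2 -o jsonpath='{.metadata.labels.controller-revision-hash}'
kubectl get statefulset web -o jsonpath='{.status.updateRevision}'

# Condition 1 / safety check: the pod is Pending and none of its containers
# (including init containers) ever started running.
kubectl get pod web-2 -o jsonpath='{.status.phase}'
kubectl get pod web-2 -o jsonpath='{.status.containerStatuses[*].state}'

# Only if both hold is it clearly safe to delete the pod (still the manual
# workaround; the proposal would have the controller do this automatically).
```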

I probably won’t have time to work on the code for this any time soon since I’m about to change jobs. However, I’m willing to commit to being a reviewer if anyone is available to work on this.

As far as I can tell, StatefulSet doesn’t make any attempt to support this use case, namely using a rolling update to fix a StatefulSet that’s in a broken state. If any of the existing Pods are broken, it appears that StatefulSet bails out before even reaching the rolling update code:

https://github.com/kubernetes/kubernetes/blob/30e4f528ed30a70bdb0c14b5cfe49d00a78194c2/pkg/controller/statefulset/stateful_set_control.go#L428-L435

I haven’t found any mention of this limitation in the docs, but it’s possible that it was a choice made intentionally to err on the side of caution (stop and make the human decide) since stateful data is at stake and stateful Pods often have dependencies on each other (e.g. they may form a cluster/quorum).

With that said, I agree it would be ideal if StatefulSet supported this, at least for clear cases like this one where deleting a Pod that’s stuck Pending is unlikely to cause any additional damage.

cc @kow3ns

+1

Is this fixed?

+1. This is a pretty big landmine in using StatefulSet: if you ever make a mistake, you're stuck with destroying your StatefulSet and starting over. In other words, if you ever make a mistake with a StatefulSet, you need to cause an outage to recover 😦

This issue is similar to #78007