kubernetes: docs: StatefulSet pod is never evicted from shutdown node

Is this a BUG REPORT or FEATURE REQUEST?:

/kind bug

What happened:

  1. A StatefulSet with 1 replica is created.
  2. Its only pod is scheduled and running on one of 2 worker nodes.
  3. The worker node running the pod shuts down and never starts up again.
  4. The pod is never moved to the other node.

What you expected to happen:

  1. The pod would move to the other node after expiration of the default tolerations “node.alpha.kubernetes.io/notReady:NoExecute for 300s” and “node.alpha.kubernetes.io/unreachable:NoExecute for 300s”.
  2. “kubectl delete pod pod-on-shutdown-node” would induce the expected movement while the node is down; this did not happen either.
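
For reference, the default tolerations added by the DefaultTolerationSeconds admission controller appear in the pod spec roughly as follows (a sketch; the key names follow the 1.8-era aliases quoted above):

    tolerations:
    - key: node.alpha.kubernetes.io/notReady
      operator: Exists
      effect: NoExecute
      tolerationSeconds: 300
    - key: node.alpha.kubernetes.io/unreachable
      operator: Exists
      effect: NoExecute
      tolerationSeconds: 300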

How to reproduce it (as minimally and precisely as possible):

  1. Create a StatefulSet spec with one container and one replica in, say, sset.yml (a minimal example spec is sketched below).
  2. Have a Kubernetes installation with 2 worker nodes.
  3. kubectl create -f sset.yml
  4. kubectl get pod -o wide, to check on which node the only pod is scheduled, say, node N.
  5. Shut down node N with “shutdown -h”.
  6. Check that the pod has not moved to the other worker node within 10 minutes of node N halting.
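
A minimal sketch of such a spec (names and image are placeholders, not the exact manifest used in this report; a matching headless Service named sset is assumed to exist):

    apiVersion: apps/v1beta2        # apps/v1beta2 is the StatefulSet API group in 1.8
    kind: StatefulSet
    metadata:
      name: sset
    spec:
      serviceName: sset             # headless Service governing the pods' network identity
      replicas: 1
      selector:
        matchLabels:
          app: sset
      template:
        metadata:
          labels:
            app: sset
        spec:
          containers:
          - name: main
            image: nginx:1.13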

Anything else we need to know?:

  1. A Deployment behaves as indicated in the “What you expected to happen” section.

Environment:

  • Kubernetes version (use kubectl version): 1.8.1
  • Cloud provider or hardware configuration: Virtual machines with Vagrant 2.0.0 and VirtualBox 5.1.28-117968 on Intel® Xeon® CPU E5-2690 v3 (24 cores) with Ubuntu 16.04 LTS
  • OS (e.g. from /etc/os-release): Ubuntu 16.04.3 LTS (VM)
  • Kernel (e.g. uname -a): 4.4.0-96-generic (VM)
  • Install tools: kubeadm 1.8.1-00
  • Others:

Edit: The goal of this issue is to update the documentation and clarify the expected behavior as per: https://github.com/kubernetes/kubernetes/issues/54368#issuecomment-339378597

About this issue

  • State: closed
  • Created 7 years ago
  • Comments: 25 (14 by maintainers)

Most upvoted comments

This is by design. When a node goes “down”, the master does not know whether it was a safe down (deliberate shutdown) or a network partition. If the master said “ok, the pod is deleted”, the pod could actually still be running somewhere on the cluster, thus violating the StatefulSet guarantee of at most one pod per identity.

In your case, if you intend the node to be deleted, you must delete the node object. That will cause the master to understand that you wish the node to be gone, and delete the pods.
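
For example (the node name is a placeholder for the node that was shut down):

    # Tell the master the node is gone for good; its pods are then deleted
    # and the StatefulSet controller recreates them on a remaining node.
    kubectl delete node node-n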

If you think that https://kubernetes.io/docs/concepts/workloads/controllers/statefulset/#pod-identity does not clearly explain this behavior, we should fix the documentation to describe the expected outcome.

You can also see https://github.com/kubernetes/community/blob/master/contributors/design-proposals/storage/pod-safety.md for a more detailed explanation why this is by design.

Why do Deployment and ReplicationController behave differently from StatefulSet in the described scenario?

@at1984z This is because StatefulSet is designed to maintain a sticky identity for each of its Pods. These Pods are created in order from the same spec, with a stable network identity and stable storage, but they are not interchangeable: each has a persistent identifier that it maintains across any rescheduling.

Deployments and ReplicationControllers do not apply such restrictions.

How will the “Taint based Evictions” and “Taint Nodes by Conditions” features (see https://kubernetes.io/docs/concepts/configuration/taint-and-toleration/) introduced in 1.8 be implemented and used in light of the implied inability of Kubernetes to deal with network partitioning? Will the system operator have to monitor the nodes and set taints manually?

If the node is down or inaccessible, the master cannot receive heartbeats from the node. New pods will not be scheduled onto NotReady nodes, and the master cannot successfully evict pods from those NotReady nodes either; the node’s Ready condition is left as Unknown:

            {
                "lastHeartbeatTime": "2017-10-26T02:41:02Z",
                "lastTransitionTime": "2017-10-26T02:41:46Z",
                "message": "Kubelet stopped posting node status.",
                "reason": "NodeStatusUnknown",
                "status": "Unknown",
                "type": "Ready"
            }
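
This condition can be inspected for the affected node with kubectl describe node <node-name> or kubectl get node <node-name> -o json (node name is a placeholder).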

Taints and tolerations work together to ensure that pods are not scheduled onto inappropriate nodes.
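
For example (illustrative key and value, not from this issue), a node can be tainted with:

    kubectl taint nodes node-n dedicated=db:NoSchedule

and only pods whose spec carries a matching toleration will be scheduled onto it:

    tolerations:
    - key: dedicated
      operator: Equal
      value: db
      effect: NoSchedule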

old instance is not evicted from the lost node and it cannot be deleted either

@at1984z Yes. This is because pod.spec.terminationGracePeriodSeconds is set to a non-zero value, so the deletion is graceful: the API server waits for the kubelet on the (unreachable) node to confirm termination, and the pod stays in Terminating instead of being removed.
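
If the node is known to be permanently gone, the pod can still be removed by skipping that confirmation with a force deletion (pod name as in the example above); this is safe only when the pod is certain not to be running anywhere anymore:

    kubectl delete pod pod-on-shutdown-node --grace-period=0 --force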

Please refer to #54472 and my comment.