kubernetes: volume.kubernetes.io/selected-node never cleared for non-existent nodes on PVC without PVs

What happened:

  • A PVC was created, along with a pod consuming it. The PVC uses a StorageClass with the WaitForFirstConsumer volume binding mode.
  • The controller running with the persistent-volume-binder service account in the kube-system namespace edited the PVC and attached the following annotations:
    volume.beta.kubernetes.io/storage-provisioner: ebs.csi.aws.com
    volume.kubernetes.io/selected-node: ip-10-0-92-212.ec2.internal
  • The PVC is in the Pending phase; it was never bound to a PV.
  • The node in question was deleted by cluster autoscaler 1.20 before the PV was provisioned and attached to the node.
  • The volume.kubernetes.io/selected-node annotation remains on the PVC.
  • The cluster autoscaler (https://github.com/kubernetes/autoscaler/tree/master/cluster-autoscaler) v1.20 cannot scale up, because the volume.kubernetes.io/selected-node annotation still points to the deleted node: it sees the PVC, and therefore the pod, as already bound to that node, but the node no longer exists.
  • The pod stays in the Pending state forever.
  • Deleting this annotation (see the kubectl commands after this list) allows the cluster autoscaler to do its job: it scales up the cluster and the pod gets scheduled on a newly provisioned node.
  • The end.
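
For reference, a manual version of the cleanup for a single PVC; the PVC name repro-pvc and namespace default below are placeholders:

# Show which node the PVC is currently pinned to.
kubectl get pvc repro-pvc -n default -o json |
jq -r '.metadata.annotations["volume.kubernetes.io/selected-node"]'

# Remove the stale annotation so the scheduler and autoscaler can pick a new node.
kubectl annotate -n default pvc/repro-pvc volume.kubernetes.io/selected-node-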

What you expected to happen:

  • Upon node deletion, the volume.kubernetes.io/selected-node annotation should be cleared.

How to reproduce it (as minimally and precisely as possible):

  • This is a race condition, but the sequence described above should sometimes be reproducible; see the sketch below.
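
A rough sketch of the sequence, assuming an EBS CSI driver and purely illustrative resource names (ebs-wffc, repro-pvc, repro-pod); step 3 has to win the race against volume provisioning, which is what makes this hard to trigger reliably:

# 1. StorageClass with delayed binding.
kubectl apply -f - <<'EOF'
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: ebs-wffc
provisioner: ebs.csi.aws.com
volumeBindingMode: WaitForFirstConsumer
EOF

# 2. A PVC plus a pod that consumes it, so the scheduler picks a node and
#    sets the volume.kubernetes.io/selected-node annotation on the PVC.
kubectl apply -f - <<'EOF'
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: repro-pvc
spec:
  accessModes: ["ReadWriteOnce"]
  storageClassName: ebs-wffc
  resources:
    requests:
      storage: 1Gi
---
apiVersion: v1
kind: Pod
metadata:
  name: repro-pod
spec:
  containers:
  - name: app
    image: busybox
    command: ["sleep", "3600"]
    volumeMounts:
    - name: data
      mountPath: /data
  volumes:
  - name: data
    persistentVolumeClaim:
      claimName: repro-pvc
EOF

# 3. Once the annotation appears, delete the selected node before the PV is
#    provisioned (in the report the cluster autoscaler did this).
node=$(kubectl get pvc repro-pvc -o json |
  jq -r '.metadata.annotations["volume.kubernetes.io/selected-node"]')
kubectl delete node "$node"

# 4. The PVC keeps the stale annotation and the pod stays Pending.
kubectl get pvc repro-pvc -o json |
  jq -r '.metadata.annotations["volume.kubernetes.io/selected-node"]'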

Anything else we need to know?:

Environment:

  • Kubernetes version (use kubectl version):
Server Version: version.Info{Major:"1", Minor:"19+", GitVersion:"v1.19.6-eks-49a6c0", GitCommit:"49a6c0bf091506e7bafcdb1b142351b69363355a", GitTreeState:"clean", BuildDate:"2020-12-23T22:10:21Z", GoVersion:"go1.15.5", Compiler:"gc", Platform:"linux/amd64"}
  • Cloud provider or hardware configuration: AWS EKS

About this issue

  • Original URL
  • State: closed
  • Created 3 years ago
  • Reactions: 35
  • Comments: 30 (19 by maintainers)

Most upvoted comments

Do you think it makes sense to return rescheduling for any other final error?

It makes sense to remove this annotation if the node in question no longer exists. If the underlying PV can be attached to some other node, as many cloud PVs can, then that should be allowed.

I believe I hit this issue: a PVC carrying the annotation for a dead node.

My workaround:

#!/bin/bash

# Build a map of node names that currently exist in the cluster.
declare -A nodes
while read node; do
  nodes["${node#node/}"]=exists   # strip the "node/" prefix from "kubectl get -o name"
done < <(kubectl get nodes -o name)

# For every PVC in every namespace, read its selected-node annotation and
# drop the annotation if it points at a node that no longer exists.
kubectl get pvc -A -o json |
jq '.items[].metadata | [.namespace, .name, .annotations["volume.kubernetes.io/selected-node"]] | @tsv' -r |
while read namespace name node; do
  test -n "$node" || continue                    # skip PVCs without the annotation
  if ! [[ ${nodes[$node]-} == "exists" ]]; then
    # The trailing "-" removes the annotation from the PVC.
    kubectl annotate -n "${namespace}" "pvc/${name}" volume.kubernetes.io/selected-node-
  fi
done