kubernetes: volume.kubernetes.io/selected-node never cleared for non-existent nodes on PVC without PVs

What happened:

  • A PVC was created, along with a pod consuming it. The PVC uses a StorageClass with the WaitForFirstConsumer volume binding mode.
  • The controller running with the persistent-volume-binder service account in the kube-system namespace edited the PVC and attached the following annotations:
    volume.beta.kubernetes.io/storage-provisioner: ebs.csi.aws.com
    volume.kubernetes.io/selected-node: ip-10-0-92-212.ec2.internal
  • The PVC is in the Pending phase; it was never bound to a PV.
  • The node in question was deleted by cluster autoscaler 1.20 before the PV was provisioned and attached to the node.
  • The volume.kubernetes.io/selected-node annotation remains on the PVC.
  • The cluster autoscaler (https://github.com/kubernetes/autoscaler/tree/master/cluster-autoscaler) v1.20 cannot scale up, because the volume.kubernetes.io/selected-node annotation still points to the deleted node: it sees the PVC, and therefore the pod, as already bound to that node, but the node no longer exists.
  • The pod stays in the Pending state forever.
  • Deleting this annotation (see the kubectl commands after this list) allows the cluster autoscaler to do its job: it scales up the cluster and the pod gets scheduled on a newly provisioned node.
  • The end.
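
For reference, a manual version of the cleanup for a single PVC; the PVC name repro-pvc and namespace default below are placeholders:

# Show which node the PVC is currently pinned to.
kubectl get pvc repro-pvc -n default -o json |
jq -r '.metadata.annotations["volume.kubernetes.io/selected-node"]'

# Remove the stale annotation so the scheduler and autoscaler can pick a new node.
kubectl annotate -n default pvc/repro-pvc volume.kubernetes.io/selected-node-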

What you expected to happen:

  • Upon node deletion, the volume.kubernetes.io/selected-node annotation should be cleared.

How to reproduce it (as minimally and precisely as possible):

  • This is a race condition, but the sequence described above should sometimes be reproducible; see the sketch below.
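
A rough sketch of the sequence, assuming an EBS CSI driver and purely illustrative resource names (ebs-wffc, repro-pvc, repro-pod); step 3 has to win the race against volume provisioning, which is what makes this hard to trigger reliably:

# 1. StorageClass with delayed binding.
kubectl apply -f - <<'EOF'
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: ebs-wffc
provisioner: ebs.csi.aws.com
volumeBindingMode: WaitForFirstConsumer
EOF

# 2. A PVC plus a pod that consumes it, so the scheduler picks a node and
#    sets the volume.kubernetes.io/selected-node annotation on the PVC.
kubectl apply -f - <<'EOF'
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: repro-pvc
spec:
  accessModes: ["ReadWriteOnce"]
  storageClassName: ebs-wffc
  resources:
    requests:
      storage: 1Gi
---
apiVersion: v1
kind: Pod
metadata:
  name: repro-pod
spec:
  containers:
  - name: app
    image: busybox
    command: ["sleep", "3600"]
    volumeMounts:
    - name: data
      mountPath: /data
  volumes:
  - name: data
    persistentVolumeClaim:
      claimName: repro-pvc
EOF

# 3. Once the annotation appears, delete the selected node before the PV is
#    provisioned (in the report the cluster autoscaler did this).
node=$(kubectl get pvc repro-pvc -o json |
  jq -r '.metadata.annotations["volume.kubernetes.io/selected-node"]')
kubectl delete node "$node"

# 4. The PVC keeps the stale annotation and the pod stays Pending.
kubectl get pvc repro-pvc -o json |
  jq -r '.metadata.annotations["volume.kubernetes.io/selected-node"]'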

Anything else we need to know?:

Environment:

  • Kubernetes version (use kubectl version):
Server Version: version.Info{Major:"1", Minor:"19+", GitVersion:"v1.19.6-eks-49a6c0", GitCommit:"49a6c0bf091506e7bafcdb1b142351b69363355a", GitTreeState:"clean", BuildDate:"2020-12-23T22:10:21Z", GoVersion:"go1.15.5", Compiler:"gc", Platform:"linux/amd64"}
  • Cloud provider or hardware configuration: AWS EKS

About this issue

  • Original URL
  • State: closed
  • Created 3 years ago
  • Reactions: 35
  • Comments: 30 (19 by maintainers)

Most upvoted comments

Do you think it makes sense to return rescheduling for any other final error?

It makes sense to remove this annotation if the node in question no longer exists. If the underlying PV can be attached to some other node, as many cloud PVs can, then that should be allowed.

I believe I hit this issue: a PVC carrying the annotation for a dead node.

My workaround:

#!/bin/bash

# Build a map of node names that currently exist in the cluster.
declare -A nodes
while read node; do
  nodes["${node#node/}"]=exists   # strip the "node/" prefix from "kubectl get -o name"
done < <(kubectl get nodes -o name)

# For every PVC in every namespace, read its selected-node annotation and
# drop the annotation if it points at a node that no longer exists.
kubectl get pvc -A -o json |
jq '.items[].metadata | [.namespace, .name, .annotations["volume.kubernetes.io/selected-node"]] | @tsv' -r |
while read namespace name node; do
  test -n "$node" || continue                    # skip PVCs without the annotation
  if ! [[ ${nodes[$node]-} == "exists" ]]; then
    # The trailing "-" removes the annotation from the PVC.
    kubectl annotate -n "${namespace}" "pvc/${name}" volume.kubernetes.io/selected-node-
  fi
done