vsphere-csi-driver: "failed to set control flag keepAfterDeleteVm" after CSI migration is enabled

/kind bug

What happened: After I enabled CSI migration in a 1.25 cluster, some Pods were stuck Pending because their volumes could not be attached. The CSI driver fails every ControllerPublish call with `failed to set control flag keepAfterDeleteVm for volumeID "bf019d24-72e2-4ba0-9b72-10df23a69156" with err: ServerFaultCode: The operation is not allowed in the current state.`

What you expected to happen: Volumes are attached just fine.

How to reproduce it (as minimally and precisely as possible):

  1. Create 50 PVCs + Deployments that use them on a cluster without CSI migration.
  2. Enable CSI migration (a rough command sketch follows after this list):
    1. Enable it in kube-apiserver, then KCM and scheduler.
    2. Enable it in the CSI driver ConfigMap and restart the driver Pods.
    3. Drain nodes one by one and enable it in kubelet + restart kubelet.
  3. Check the Deployments from step 1.
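For reference, on a plain upstream install the equivalent toggles look roughly like the sketch below. This is a hedged sketch, not what OpenShift does internally: the CSIMigrationvSphere feature gate is real, but the vmware-system-csi namespace and object names are assumptions based on a default vSphere CSI 2.4.x deployment.

```sh
# Sketch of step 2 on a plain upstream install (not how OpenShift wires it).
# Assumed: default vSphere CSI 2.4.x objects in the vmware-system-csi namespace.

# 2.1 Control plane: add the gate to kube-apiserver, then kube-controller-manager
#     and kube-scheduler (static pod manifests, or however they are managed):
#       --feature-gates=CSIMigrationvSphere=true

# 2.2 CSI driver: flip the migration flag in the driver feature-state ConfigMap
#     and restart the driver Pods.
kubectl -n vmware-system-csi patch configmap \
  internal-feature-states.csi.vsphere.vmware.com \
  --type merge -p '{"data":{"csi-migration":"true"}}'
kubectl -n vmware-system-csi rollout restart deployment/vsphere-csi-controller
kubectl -n vmware-system-csi rollout restart daemonset/vsphere-csi-node

# 2.3 Per node: drain, add CSIMigrationvSphere=true to the kubelet feature gates,
#     restart kubelet, uncordon.
kubectl drain <node> --ignore-daemonsets --delete-emptydir-data
# ...edit the kubelet config/flags on <node>, then:
systemctl restart kubelet
kubectl uncordon <node>
```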

In one out of 8 attempts I got `failed to set control flag keepAfterDeleteVm` for ~5 of the Deployments; most of the Deployments were fine. The error kept repeating for at least 30 minutes, on every ControllerPublish retry of each affected volume, after which I gave up and finished my tests.
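For anyone trying to confirm the same symptom, the repeated failures are visible from the Kubernetes side alone, e.g. (a sketch; the namespace assumes a default upstream driver install):

```sh
# VolumeAttachments for the affected PVs stay ATTACHED=false.
kubectl get volumeattachment

# The Pod events show FailedAttachVolume repeating with the same message.
kubectl describe pod <stuck-pod>

# The CSI controller logs the keepAfterDeleteVm error on every retry.
kubectl -n vmware-system-csi logs deploy/vsphere-csi-controller \
  -c vsphere-csi-controller | grep keepAfterDeleteVm
```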

Anything else we need to know?:

The Deployment Pods were running just fine before CSI migration was enabled. There is no pending volume resize or anything else that could change the state of a volume, at least not in Kubernetes. I don’t have the cluster any longer, so I cannot check what the volumes look like in vCenter. What volume “state” would prevent a volume from being marked with keepAfterDeleteVm? And how could Kubernetes / the CSI driver / anything else put a volume into such a state?
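If I (or someone else) hit this again with a live cluster, the vCenter-side view of the volumes could be captured with govc, roughly like this (a sketch; command availability and output depend on the govc version, and the connection values are placeholders):

```sh
# Connection details for govc (placeholders).
export GOVC_URL='https://vcenter.example.com'
export GOVC_USERNAME='administrator@vsphere.local'
export GOVC_PASSWORD='...'

# List CNS volumes and look for the affected volume ID.
govc volume.ls

# List the first-class disks (FCDs) that back the CNS volumes.
govc disk.ls
```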

Environment:

  • csi-vsphere version: 2.4.1
  • vsphere-cloud-controller-manager version: ?
  • Kubernetes version: v1.25.4
  • vSphere version: 7.0.3
  • OS (e.g. from /etc/os-release): Red Hat Enterprise Linux CoreOS 412.86.202301100600-0
  • Kernel (e.g. uname -a): 4.18.0-372.40.1.el8_6.x86_64
  • Install tools: OpenShift 4.12

About this issue

  • State: open
  • Created a year ago
  • Comments: 26 (22 by maintainers)

Most upvoted comments

7.0p07 and 8.0u1 have the fix that allows setting the control flag even when the volume is attached to some VM, so we will no longer hit this error when setting the control flag on a disk that is attached to a VM.

I will get back to you after checking with the team regarding the second issue - volume is lost from CNS.