gcp-compute-persistent-disk-csi-driver: `VolumeAttachment` cannot detach from a removed node after `gcloud compute instances delete` or `gcloud compute instances simulate-maintenance-event` is run

What happened: After running any of the following actions:

  • gcloud --project <project-id> compute instances delete <node-id-to-delete> --zone=<zone-id>
  • gcloud --project <project-id> compute instances simulate-maintenance-event <node-id-to-delete> --zone=<zone-id>
  • GCP triggers a preemption of a node running on a spot instance.

The following happens:

  • The selected <node-id-to-delete> gets removed from the GKE cluster as expected.
  • Pods that were running on the removed node with a PVC attached get evicted and rescheduled onto an available node from the pool.
  • Pods then get stuck initializing on the newly assigned node:
    • Status: Pending
    • State: Waiting
      • Reason: PodInitializing
    • Events:
      Warning  FailedMount  96s (x6 over 71m)  kubelet  Unable to attach or mount volumes: unmounted volumes=[infrastructure-prometheus], unattached volumes=[config config-out prometheus-infrastructure-rulefiles-0 kube-api-access-cvf76 tls-assets infrastructure-prometheus web-config]: timed out waiting for the condition
  • The VolumeAttachment still shows attached: true for the node that was deleted/removed, with a detachError:
apiVersion: storage.k8s.io/v1
kind: VolumeAttachment
metadata:
  annotations:
    csi.alpha.kubernetes.io/node-id: projects/<project-id>/zones/europe-west1-d/instances/<node-id-to-delete>
  creationTimestamp: "2022-05-13T01:42:43Z"
  deletionGracePeriodSeconds: 0
  deletionTimestamp: "2022-05-13T07:22:53Z"
  finalizers:
  - external-attacher/pd-csi-storage-gke-io
  name: csi-51371274f186e0e259e907a06cfe6d4d5ff27c2079a097caf29c883424efe9ee
  resourceVersion: "320247861"
  uid: 14d1f4c7-f05e-4eb0-9a6b-4a8e1463e8df
spec:
  attacher: pd.csi.storage.gke.io
  nodeName: <node-id-to-delete>
  source:
    persistentVolumeName: pvc-65e91226-24d6-4308-b35b-d29b2026ffff
status:
  attached: true
  detachError:
    message: 'rpc error: code = Unavailable desc = Request queued due to error condition
      on node'
    time: "2022-05-13T10:42:38Z"
  • The pod stays stuck permanently in the init state. The only way to fix it is to manually edit the VolumeAttachment and remove the finalizers entry external-attacher/pd-csi-storage-gke-io (see the kubectl sketch below).
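
A minimal sketch of that manual workaround with kubectl, assuming the VolumeAttachment name from the example above (object names, node IDs, and the affected PVC will differ per cluster):

  # Find VolumeAttachments still pointing at the deleted node (the NODE column is in the default output).
  kubectl get volumeattachment | grep <node-id-to-delete>

  # Remove the external-attacher finalizer so the stale attachment can be garbage collected.
  # This bypasses the attacher's normal detach flow, so only do it for nodes that no longer exist.
  kubectl patch volumeattachment csi-51371274f186e0e259e907a06cfe6d4d5ff27c2079a097caf29c883424efe9ee \
    --type=merge -p '{"metadata":{"finalizers":null}}'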

What you expected to happen:

The VolumeAttachment is detached from the non-existent node after that node is deleted with either of the gcloud CLI commands compute instances delete or compute instances simulate-maintenance-event.

Environment:

  • GKE Rev: v1.22.8-gke.200
  • csi-node-driver-registrar: v2.5.0-gke.1
  • gcp-compute-persistent-disk-csi-driver: v1.5.1-gke.0

About this issue

  • Original URL
  • State: closed
  • Created 2 years ago
  • Reactions: 7
  • Comments: 22 (10 by maintainers)

Most upvoted comments

+1 to @dllegru comments. We have rolled back the pd csi component to 1.3.4 on 1.22 clusters—look for the 0.11.4 component version stamped on the pdcsi-node pods (not the daemonset, just the pods).
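
To check which driver version your nodes are actually running, something like the following should work (the label selector and custom-columns paths here are assumptions, not GKE-documented tooling; if the selector matches nothing, grep the kube-system pod list for pdcsi-node instead):

  # Show the PD CSI driver image (and hence version) running in each pdcsi-node pod.
  kubectl -n kube-system get pods -l k8s-app=gcp-compute-persistent-disk-csi-driver \
    -o custom-columns='POD:.metadata.name,IMAGES:.spec.containers[*].image'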

The 0.11.7 component that has the 1.7.0 driver is enabled for newly created 1.22 clusters, and we’re rolling out automatic upgrades over the next week.

We did not roll back the driver in 1.23 as there are new features in it that some customers are testing.

1.23 uses the 0.12 pdcsi component (yes, I know you were worried we wouldn’t have enough different version numbers floating around… 😛). The 0.12.2 component is the bad one, the 0.12.4 component is the good one with the 1.7.0 driver. New 1.23 clusters should already be getting the new component; the auto-upgrades are being rolled out at the same time as for 1.22.

Something else: in the web console, under the cluster settings "Details" section, the CSI driver can be disabled, and this should then use the "gcePersistentDisk in-tree volume plugin". I don't know enough about GCE/GKE to understand the implications, but could this be a workaround?

No. In GKE 1.22+, CSI Migration is enabled, which means that the gce-pd provisioner uses the PD CSI driver as a back end. You either need to use the managed PD CSI driver or manually deploy one as @dllegru mentioned.
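
As a quick way to see the migration in effect (a generic kubectl check, not a GKE-specific tool): even volumes provisioned through the legacy kubernetes.io/gce-pd provisioner end up with VolumeAttachments handled by pd.csi.storage.gke.io, which you can confirm with:

  # List all VolumeAttachments with their attacher, node, and attach status.
  kubectl get volumeattachment \
    -o custom-columns='NAME:.metadata.name,ATTACHER:.spec.attacher,NODE:.spec.nodeName,ATTACHED:.status.attached'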

The upgrade config is rolling out and is about 15% done (it rolls out zone-by-zone over the course of a week). Unfortunately there’s no good way to tell when your zone has received the upgrade until your master is able to upgrade.

Yup 😕

@Keramblock @kekscode I have all my GKE clusters running on v1.22.8-gke.200, and GCP rolled back gcp-compute-persistent-disk-csi-driver to v1.3 and v1.4 a couple of weeks ago, so I’m no longer having the issues induced by v1.5. Also, what I understood from GCP is that they rolled it back everywhere, so theoretically you should not have v1.5 of the pd-driver running on your GKE clusters; if you do, I would suggest opening a support ticket with them.

Hi @saikat-royc, could you provide an ETA for when this fix reaches GA? Right now I need to manually fix the Prometheus PVC every time the spot instance running it is preempted by GCP, and I believe I have the same issue.