gcp-compute-persistent-disk-csi-driver: `VolumeAttachment` is not able to detach from a removed node after `gcloud compute instances delete` or `gcloud compute instances simulate-maintenance-event` is run
What happened: After any of the following actions:

- `gcloud --project <project-id> compute instances delete <node-id-to-delete> --zone=<zone-id>`
- `gcloud --project <project-id> compute instances simulate-maintenance-event <node-id-to-delete> --zone=<zone-id>`
- GCP triggers a preemption on a node running as a spot instance.

The following happens:
- The selected `<node-id-to-delete>` gets removed from the GKE cluster as expected.
- Pods that were running with a PVC attached on that removed node get evicted and scheduled onto a new available node from the pool.
- Pods get stuck initializing on the assigned node:
  - Status: `Pending`
  - State: `Waiting`
  - Reason: `PodInitializing`
  - Events:

    ```
    Warning  FailedMount  96s (x6 over 71m)  kubelet  Unable to attach or mount volumes: unmounted volumes=[infrastructure-prometheus], unattached volumes=[config config-out prometheus-infrastructure-rulefiles-0 kube-api-access-cvf76 tls-assets infrastructure-prometheus web-config]: timed out waiting for the condition
    ```

- The `VolumeAttachment` still shows as `attached: true` to the previous node that was deleted/removed, with a `detachError`:
```yaml
apiVersion: storage.k8s.io/v1
kind: VolumeAttachment
metadata:
  annotations:
    csi.alpha.kubernetes.io/node-id: projects/<project-id>/zones/europe-west1-d/instances/<node-id-to-delete>
  creationTimestamp: "2022-05-13T01:42:43Z"
  deletionGracePeriodSeconds: 0
  deletionTimestamp: "2022-05-13T07:22:53Z"
  finalizers:
  - external-attacher/pd-csi-storage-gke-io
  name: csi-51371274f186e0e259e907a06cfe6d4d5ff27c2079a097caf29c883424efe9ee
  resourceVersion: "320247861"
  uid: 14d1f4c7-f05e-4eb0-9a6b-4a8e1463e8df
spec:
  attacher: pd.csi.storage.gke.io
  nodeName: <node-id-to-delete>
  source:
    persistentVolumeName: pvc-65e91226-24d6-4308-b35b-d29b2026ffff
status:
  attached: true
  detachError:
    message: 'rpc error: code = Unavailable desc = Request queued due to error condition on node'
    time: "2022-05-13T10:42:38Z"
```
The Pod stays permanently stuck in the init state. The only way to fix it is to manually edit the `VolumeAttachment` and delete the `external-attacher/pd-csi-storage-gke-io` entry from `finalizers:`.
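For reference, the same workaround expressed as a single kubectl patch (a sketch only; the attachment name is the one from the report above, and clearing the finalizer just unblocks deletion of the stale API object, it does not trigger a detach on the GCE side):

```sh
# Remove the external-attacher finalizer from the stale VolumeAttachment so the
# API object can finally be deleted. Equivalent to `kubectl edit` plus removing
# the finalizers entry by hand. Only do this when the node is really gone.
kubectl patch volumeattachment \
  csi-51371274f186e0e259e907a06cfe6d4d5ff27c2079a097caf29c883424efe9ee \
  --type=json -p='[{"op": "remove", "path": "/metadata/finalizers"}]'
```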
What you expected to happen:
The `VolumeAttachment` is detached from the non-existing node after the node is deleted with either of the gcloud CLI commands `compute instances delete` or `compute instances simulate-maintenance-event`.
Environment:
- GKE Rev: `v1.22.8-gke.200`
- csi-node-driver-registrar: `v2.5.0-gke.1`
- gcp-compute-persistent-disk-csi-driver: `v1.5.1-gke.0`
About this issue
- State: closed
- Created 2 years ago
- Reactions: 7
- Comments: 22 (10 by maintainers)
+1 to @dllegru's comments. We have rolled back the PD CSI component to 1.3.4 on 1.22 clusters; look for the 0.11.4 component version stamped on the pdcsi-node pods (not the daemonset, just the pods).
The 0.11.7 component that has the 1.7.0 driver is enabled for newly created 1.22 clusters, and we’re rolling out automatic upgrades over the next week.
We did not roll back the driver in 1.23 as there are new features in it that some customers are testing.
1.23 uses the 0.12 pdcsi component (yes, I know you were worried we wouldn’t have enough different version numbers floating around… 😛). The 0.12.2 component is the bad one, the 0.12.4 component is the good one with the 1.7.0 driver. New 1.23 clusters should already be getting the new component; the auto-upgrades are being rolled out at the same time as for 1.22.
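If you want to verify which pdcsi component and driver build your own nodes are actually running, a rough sketch (assuming the GKE-managed `pdcsi-node` pod naming; the exact label carrying the component-version stamp isn't given here, so this just greps the pod metadata for it):

```sh
# Print the container images of each pdcsi-node pod (the driver version is in the
# image tag) and grep the pod metadata for any component-version stamp.
for pod in $(kubectl -n kube-system get pods -o name | grep pdcsi-node); do
  echo "== ${pod}"
  kubectl -n kube-system get "${pod}" -o jsonpath='{.spec.containers[*].image}{"\n"}'
  kubectl -n kube-system get "${pod}" -o yaml | grep -i 'component' || true
done
```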
No. In GKE 1.22+, CSI Migration is enabled, which means that the gce-pd provisioner uses the PD CSI driver as a back end. You either need to use the managed PD CSI driver or manually deploy one as @dllegru mentioned.
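As an illustration of what CSI Migration means in practice (a sketch, assuming the default GKE StorageClass is named `standard`): even volumes provisioned through the legacy in-tree `kubernetes.io/gce-pd` provisioner are attached via the CSI driver, which is visible on the `VolumeAttachment` objects.

```sh
# With CSI Migration enabled, VolumeAttachments for gce-pd volumes report the CSI
# driver (pd.csi.storage.gke.io) as the attacher, even when the StorageClass still
# references the legacy in-tree provisioner.
kubectl get volumeattachments \
  -o custom-columns='PV:.spec.source.persistentVolumeName,ATTACHER:.spec.attacher,NODE:.spec.nodeName'

# The StorageClass itself may still show the in-tree provisioner name.
kubectl get storageclass standard -o jsonpath='{.provisioner}{"\n"}'
```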
The upgrade config is rolling out and is about 15% done (it rolls out zone by zone over the course of a week). Unfortunately there's no good way to tell when your zone has received the upgrade until your master is able to upgrade.
Yup 😕
@Keramblock @kekscode I have all my GKE clusters running on `v1.22.8-gke.200`, and GCP rolled back `gcp-compute-persistent-disk-csi-driver` to `v1.3`/`v1.4` a couple of weeks ago, so I'm no longer seeing these issues, which were induced by `v1.5`. Also, what I understood from GCP is that they rolled it back everywhere, so theoretically you should not have `v1.5` of the pd-driver running on your GKE clusters; if you do, I would suggest opening a support ticket with them.

Hi @saikat-royc, could you provide any ETA for GA of this fix? Right now I have to manually fix the Prometheus PVC every time the spot instance it runs on is revoked by GCP, and I believe I have the same issue.