csi-driver: Volume deletion after release sometimes fails
Hi,
Sometimes the deletion of volumes with reclaimPolicy: Delete fails:
Name:            pvc-4017a6e3-2af2-11e9-9340-96000019d538
Labels:          <none>
Annotations:     pv.kubernetes.io/provisioned-by: csi.hetzner.cloud
Finalizers:      [kubernetes.io/pv-protection external-attacher/csi-hetzner-cloud]
StorageClass:    kubermatic-fast
Status:          Released
Claim:           cluster-prow-e2e-q8lt7fwh/data-etcd-0
Reclaim Policy:  Delete
Access Modes:    RWO
VolumeMode:      Filesystem
Capacity:        10Gi
Node Affinity:   <none>
Message:
Source:
    Type:              CSI (a Container Storage Interface (CSI) volume source)
    Driver:            csi.hetzner.cloud
    VolumeHandle:      1759327
    ReadOnly:          false
    VolumeAttributes:  storage.kubernetes.io/csiProvisionerIdentity=1549140804874-8081-csi.hetzner.cloud
Events:
  Type     Reason              Age                From                                                                            Message
  ----     ------              ----               ----                                                                            -------
  Warning  VolumeFailedDelete  15m (x16 over 3h)  csi.hetzner.cloud_hcloud-csi-controller-0_c502e394-272c-11e9-a078-9a17553d75a7  rpc error: code = Internal desc = volume with ID '1759327' is still attached to a server (service_error)
The associated claim does not exist anymore, neither does the namespace it was in:
k get pvc -n cluster-prow-e2e-q8lt7fwh data-etcd-0
Error from server (NotFound): namespaces "cluster-prow-e2e-q8lt7fwh" not found
About this issue
- State: closed
- Created 5 years ago
- Reactions: 1
- Comments: 26 (12 by maintainers)
I’ve just released version 1.1.3. Thank you for your help on this, @alvaroaleman!
We ran into the same problem and rc.2 seems to fix it. Will probably release 1.1.3 tomorrow.
I implemented “detach before delete” and pushed hetznercloud/hcloud-csi-driver:1.1.3-rc.2. That should finally solve the issue. Please test.
I pushed hetznercloud/hcloud-csi-driver:1.1.3-rc.1, which fixes DeleteVolume not returning the correct error code when the volume is still in use. Please test that version.
I am also experiencing this issue. I was testing yesterday (deleting & creating a lot of PVs), and just happened at the end of the day to check the number of Hetzner volumes and all deleted volumes were still there. I’ll see if 1.1.2 makes a difference. Thanks for the work on this @thcyron.
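The “detach before delete” approach mentioned above can be sketched as follows. This is a stdlib-only illustration, not the driver's actual code: the `fakeVolumeAPI` type, its methods, and `deleteVolume` are all hypothetical stand-ins for the hcloud API client.

```go
package main

import (
	"errors"
	"fmt"
)

// fakeVolumeAPI is a stand-in for the hcloud volume API; all names here
// are assumptions for illustration, not the driver's actual types.
type fakeVolumeAPI struct {
	attachedTo string // server the volume is attached to, "" if detached
}

var errStillAttached = errors.New("volume is still attached to a server (service_error)")

func (a *fakeVolumeAPI) Detach(volID string) error {
	a.attachedTo = ""
	return nil
}

func (a *fakeVolumeAPI) Delete(volID string) error {
	if a.attachedTo != "" {
		return errStillAttached
	}
	return nil
}

// deleteVolume sketches the "detach before delete" idea: if the volume
// is still attached, detach it first so the subsequent delete cannot
// fail with "still attached".
func deleteVolume(api *fakeVolumeAPI, volID string) error {
	if api.attachedTo != "" {
		if err := api.Detach(volID); err != nil {
			return err
		}
	}
	return api.Delete(volID)
}

func main() {
	api := &fakeVolumeAPI{attachedTo: "server-1"}
	fmt.Println(deleteVolume(api, "1759327")) // prints <nil>: delete succeeds after detach
}
```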
Will do
Great! That sheds some light on the problem.
First call to delete the volume:
Fails because the volume is still attached to the node. The CO then performs a detach call:
Fails because the server is locked. The next call is delete again, not detach, i.e. detaching is not retried:
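The sequence above can be replayed as a toy simulation. The CO behavior modeled here is an assumption for illustration: after DeleteVolume fails it issues one detach, and when the detach fails with a server lock it goes back to DeleteVolume instead of retrying the detach, so the volume stays attached and delete keeps failing.

```go
package main

import "fmt"

// replaySequence models the observed call sequence: delete fails
// (volume attached), detach fails (server locked), delete is retried
// and fails again because the detach was never retried.
func replaySequence() []string {
	attached := true
	serverLocked := true
	var results []string

	for _, call := range []string{"DeleteVolume", "ControllerUnpublishVolume", "DeleteVolume"} {
		switch call {
		case "DeleteVolume":
			if attached {
				results = append(results, call+" -> Internal: volume is still attached")
			} else {
				results = append(results, call+" -> OK")
			}
		case "ControllerUnpublishVolume":
			if serverLocked {
				results = append(results, call+" -> Aborted: server is locked")
			} else {
				attached = false
				results = append(results, call+" -> OK")
			}
		}
	}
	return results
}

func main() {
	for _, r := range replaySequence() {
		fmt.Println(r)
	}
}
```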
In #4, we introduced special handling of server locked errors and return the gRPC error ABORTED in that case. From the CSI spec:
The second part worries me and tells me that returning ABORTED in a server-locked case is not the right thing to do, since we actually want the call to be retried. Will do some more testing/investigation and probably revert #4.
I’ve just released version 1.1.1, which adds logging to api.VolumeService so we don’t have to guess which code path was taken the next time this problem occurs.