csi-driver: Volume deletion after release sometimes fails

Hi,

Sometimes the deletion of volumes with reclaimPolicy: Delete fails:

Name:            pvc-4017a6e3-2af2-11e9-9340-96000019d538
Labels:          <none>
Annotations:     pv.kubernetes.io/provisioned-by: csi.hetzner.cloud
Finalizers:      [kubernetes.io/pv-protection external-attacher/csi-hetzner-cloud]
StorageClass:    kubermatic-fast
Status:          Released
Claim:           cluster-prow-e2e-q8lt7fwh/data-etcd-0
Reclaim Policy:  Delete
Access Modes:    RWO
VolumeMode:      Filesystem
Capacity:        10Gi
Node Affinity:   <none>
Message:         
Source:
    Type:              CSI (a Container Storage Interface (CSI) volume source)
    Driver:            csi.hetzner.cloud
    VolumeHandle:      1759327
    ReadOnly:          false
    VolumeAttributes:      storage.kubernetes.io/csiProvisionerIdentity=1549140804874-8081-csi.hetzner.cloud
Events:
  Type     Reason              Age                From                                                                            Message
  ----     ------              ----               ----                                                                            -------
  Warning  VolumeFailedDelete  15m (x16 over 3h)  csi.hetzner.cloud_hcloud-csi-controller-0_c502e394-272c-11e9-a078-9a17553d75a7  rpc error: code = Internal desc = volume with ID '1759327' is still attached to a server (service_error)

The associated claim no longer exists, and neither does the namespace it was in:

k get pvc -n cluster-prow-e2e-q8lt7fwh data-etcd-0
Error from server (NotFound): namespaces "cluster-prow-e2e-q8lt7fwh" not found

Most upvoted comments

I’ve just released version 1.1.3. Thank you for your help on this, @alvaroaleman!

We ran into the same problem and rc.2 seems to fix it. Will probably release 1.1.3 tomorrow.

I implemented “detach before delete” and pushed hetznercloud/hcloud-csi-driver:1.1.3-rc.2. That should finally solve the issue. Please test.
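
For readers following along, this is roughly what "detach before delete" means, as a minimal sketch; the VolumeAPI interface and helper names here are illustrative, not the driver's actual code:

package sketch

import (
	"context"

	"google.golang.org/grpc/codes"
	"google.golang.org/grpc/status"
)

// VolumeAPI is a hypothetical abstraction over the cloud volume API; the real
// driver talks to the Hetzner Cloud API instead.
type VolumeAPI interface {
	AttachedServerID(ctx context.Context, volumeID string) (serverID int, attached bool, err error)
	Detach(ctx context.Context, volumeID string) error
	Delete(ctx context.Context, volumeID string) error
}

// deleteVolume deletes a volume, detaching it first if it is still attached,
// instead of letting the delete call fail with "still attached to a server".
func deleteVolume(ctx context.Context, api VolumeAPI, volumeID string) error {
	_, attached, err := api.AttachedServerID(ctx, volumeID)
	if err != nil {
		return status.Errorf(codes.Internal, "looking up volume %s: %v", volumeID, err)
	}
	if attached {
		if err := api.Detach(ctx, volumeID); err != nil {
			// Detaching can fail transiently (for example while the server is
			// locked); return a retriable error so the CO calls DeleteVolume
			// again later. Which gRPC code fits best is discussed further
			// down in the thread.
			return status.Errorf(codes.Unavailable, "detaching volume %s: %v", volumeID, err)
		}
	}
	if err := api.Delete(ctx, volumeID); err != nil {
		return status.Errorf(codes.Internal, "deleting volume %s: %v", volumeID, err)
	}
	return nil
}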

I pushed hetznercloud/hcloud-csi-driver:1.1.3-rc.1 which fixes DeleteVolume not returning the correct error code when the volume is still in use. Please test that version.
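
For context, the CSI spec's DeleteVolume error table uses FAILED_PRECONDITION for a volume that is still in use, so the fix amounts to translating the API's "still attached" error into that code rather than a generic INTERNAL. A minimal sketch, with a crude, hypothetical check for that API error:

package sketch

import (
	"strings"

	"google.golang.org/grpc/codes"
	"google.golang.org/grpc/status"
)

// isStillAttached is a crude, illustrative check for the Hetzner API error
// shown above ("volume with ID '…' is still attached to a server").
func isStillAttached(err error) bool {
	return err != nil && strings.Contains(err.Error(), "is still attached to a server")
}

// mapDeleteError translates a failed delete into the gRPC code the CSI spec
// expects: FAILED_PRECONDITION when the volume is still in use, INTERNAL for
// anything else.
func mapDeleteError(volumeID string, err error) error {
	if err == nil {
		return nil
	}
	if isStillAttached(err) {
		return status.Errorf(codes.FailedPrecondition,
			"volume %s is still attached to a server", volumeID)
	}
	return status.Errorf(codes.Internal, "failed to delete volume %s: %v", volumeID, err)
}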

I am also experiencing this issue. I was testing yesterday (creating and deleting a lot of PVs), and when I checked the number of Hetzner volumes at the end of the day, all the deleted volumes were still there.

I’ll see if 1.1.2 makes a difference. Thanks for the work on this, @thcyron.

Will do

Great! That sheds some light on what is going on.

First call to delete the volume:

level=info ts=2019-02-13T15:00:14.075550654Z component=api-volume-service msg="deleting volume" volume-id=1807518
level=info ts=2019-02-13T15:00:14.19667338Z component=api-volume-service msg="failed to delete volume" volume-id=1807518 err="volume with ID '1807518' is still attached to a server (service_error)"

Fails because the volume is still attached to the node. The CO then performs a detach call:

level=info ts=2019-02-13T15:00:16.679636537Z component=api-volume-service msg="detaching volume" volume-id=1807518 server-id=1809247
level=info ts=2019-02-13T15:00:17.010339926Z component=api-volume-service msg="failed to detach volume" volume-id=1807518 err="cannot perform operation because server is locked (locked)"

Fails because the server is locked. The next call is a delete again, not a detach, i.e. the detach is not retried:

level=info ts=2019-02-13T15:00:29.23479975Z component=api-volume-service msg="deleting volume" volume-id=1807518
level=info ts=2019-02-13T15:00:29.35502017Z component=api-volume-service msg="failed to delete volume" volume-id=1807518 err="volume with ID '1807518' is still attached to a server (service_error)"

In #4, we introduced special handling of server-locked errors and returned the gRPC error ABORTED in that case. From the CSI spec:

The Plugin, SHOULD handle this as gracefully as possible, and MAY return this error code to reject secondary calls.

The second part worries me: it suggests that returning ABORTED for a server-locked error is not the right thing to do, since we actually want the call to be retried.

Will do some more testing/investigation and probably revert #4.
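
To make the trade-off concrete, here is roughly the kind of mapping in question, as a sketch; isServerLocked is a hypothetical check, and the actual code in #4 may look different:

package sketch

import (
	"strings"

	"google.golang.org/grpc/codes"
	"google.golang.org/grpc/status"
)

// isServerLocked is a crude, illustrative check for the "(locked)" API error
// seen in the detach log above.
func isServerLocked(err error) bool {
	return err != nil && strings.Contains(err.Error(), "server is locked")
}

// mapLockedError shows the choice being debated: #4 maps a locked server to
// ABORTED, but since the spec allows the CO to reject secondary calls for
// ABORTED instead of retrying them, a code that plainly signals a transient
// failure may be the safer choice if #4 is reverted.
func mapLockedError(err error) error {
	if err == nil {
		return nil
	}
	if isServerLocked(err) {
		// Behaviour introduced in #4:
		//   return status.Error(codes.Aborted, "server is locked, retry later")
		// One possible alternative after reverting #4:
		return status.Error(codes.Unavailable, "server is locked, retry later")
	}
	return status.Error(codes.Internal, err.Error())
}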

I’ve just released version 1.1.1, which adds logging to api.VolumeService, so we don’t have to guess which code path was taken the next time this problem occurs.
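
For reference, the log lines above (level=…, ts=…, component=api-volume-service) look like go-kit logfmt output, so the added logging presumably amounts to a wrapper along these lines; the VolumeDeleter interface and the wiring are illustrative, not the driver's actual code:

package sketch

import (
	"context"
	"os"

	"github.com/go-kit/kit/log"
	"github.com/go-kit/kit/log/level"
)

// VolumeDeleter is an illustrative stand-in for the driver's volume service.
type VolumeDeleter interface {
	Delete(ctx context.Context, volumeID string) error
}

// loggingVolumeService logs every call and its outcome so the code path taken
// can be reconstructed from the logs instead of guessed.
type loggingVolumeService struct {
	logger log.Logger
	next   VolumeDeleter
}

func (s *loggingVolumeService) Delete(ctx context.Context, volumeID string) error {
	level.Info(s.logger).Log("msg", "deleting volume", "volume-id", volumeID)
	if err := s.next.Delete(ctx, volumeID); err != nil {
		level.Info(s.logger).Log("msg", "failed to delete volume", "volume-id", volumeID, "err", err)
		return err
	}
	return nil
}

// newVolumeServiceLogger produces logfmt output in the format shown above.
func newVolumeServiceLogger() log.Logger {
	logger := log.NewLogfmtLogger(os.Stderr)
	return log.With(logger, "ts", log.DefaultTimestampUTC, "component", "api-volume-service")
}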