external-snapshotter: Problems dealing with snapshot create requests timing out

When a CSI plugin is passed a CreateSnapshot request and the caller (the snapshotter sidecar) times out, the sidecar marks this as an error and does not retry the snapshot. Further, because the call only timed out and did not fail, the storage provider may still have created the snapshot, just later than expected.
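Roughly what happens today, as a minimal Go sketch (not the actual sidecar code; the function name, the 60-second deadline, and the identifiers are illustrative):

```go
package main

import (
	"context"
	"time"

	"github.com/container-storage-interface/spec/lib/go/csi"
	"google.golang.org/grpc/codes"
	"google.golang.org/grpc/status"
)

// createSnapshotOnce illustrates how a single, deadline-bound CreateSnapshot
// call can time out on the caller's side even though the plugin may still
// finish cutting the snapshot later.
func createSnapshotOnce(client csi.ControllerClient, name, volumeID string) (string, error) {
	ctx, cancel := context.WithTimeout(context.Background(), 60*time.Second)
	defer cancel()

	resp, err := client.CreateSnapshot(ctx, &csi.CreateSnapshotRequest{
		Name:           name,     // idempotency key chosen by the caller
		SourceVolumeId: volumeID, // volume to snapshot
	})
	if status.Code(err) == codes.DeadlineExceeded {
		// Only the caller's deadline expired; the storage provider may still
		// create the snapshot, but we never receive its SnapshotId, so a
		// later DeleteSnapshot cannot be issued for it.
		return "", err
	}
	if err != nil {
		return "", err // a real failure from the plugin
	}
	return resp.GetSnapshot().GetSnapshotId(), nil
}
```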

When such a snapshot is then deleted, no delete request is sent to the CSI plugin; the sidecar cannot issue one because it never received the SnapID.

The end result of this is that the snapshot is leaked on the storage provider.

Hence, the question/issue is as follows:

Should the snapshot be retried on timeouts from the CreateSnapshot call?

Based on the ready_to_use parameter in the CSI spec [1] and the possibility that the application is frozen while the snapshot is taken, I would assume this operation cannot be retried indefinitely. But, also per the spec, the behavior on timeout errors should be to retry, as implemented for volume create and delete operations in the provisioner sidecar [2].

So, to fix the potential snapshot leak on the storage provider, should the snapshotter sidecar retry until it gets either an error from the plugin or a success with a SnapID, but mark the snapshot as bad/unusable because it was not completed in time (to honor application freeze windows and the like)?
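A minimal sketch of that idea, assuming a hypothetical helper around the CSI ControllerClient (the function name, deadlines, backoff values, and the way "not ready" is surfaced are all assumptions, not the sidecar's actual implementation):

```go
package main

import (
	"context"
	"time"

	"github.com/container-storage-interface/spec/lib/go/csi"
	"google.golang.org/grpc/codes"
	"google.golang.org/grpc/status"
)

// createSnapshotWithRetry keeps re-issuing CreateSnapshot with the same name
// (the CSI idempotency key) until the plugin gives a final answer: either a
// snapshot with a SnapshotId or a non-timeout error. The returned ready flag
// is what the caller could use to mark a late snapshot as unusable.
func createSnapshotWithRetry(client csi.ControllerClient, name, volumeID string) (string, bool, error) {
	backoff := time.Second
	for {
		ctx, cancel := context.WithTimeout(context.Background(), 60*time.Second)
		resp, err := client.CreateSnapshot(ctx, &csi.CreateSnapshotRequest{
			Name:           name, // same name on every attempt
			SourceVolumeId: volumeID,
		})
		cancel()

		switch {
		case err == nil:
			// Final answer: we now hold the SnapshotId, so the snapshot can
			// always be deleted later, even if it is marked as unusable.
			snap := resp.GetSnapshot()
			return snap.GetSnapshotId(), snap.GetReadyToUse(), nil
		case status.Code(err) == codes.DeadlineExceeded:
			// Only our deadline expired; the backend may still be working.
			// Retry so we eventually learn the SnapshotId (or a real error).
			time.Sleep(backoff)
			if backoff < 5*time.Minute {
				backoff *= 2
			}
		default:
			// A final (non-timeout) error from the plugin: stop retrying.
			return "", false, err
		}
	}
}
```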

[1] CSI spec ready_to_use section: https://github.com/container-storage-interface/spec/blob/master/spec.md#the-ready_to_use-parameter

[2] timeout handling in provisioner sidecar: https://github.com/kubernetes-csi/external-provisioner#csi-error-and-timeout-handling


Most upvoted comments

@ggriffiths will be helping out with this bug fix. Thanks.

IMO, the proper solution without changing the CSI spec is the same as in volume provisioning. The snapshotter should retry until it gets a final response and then decide what to do: if the snapshot is too old or the user deleted the VolumeSnapshot object in the meantime, delete the snapshot; otherwise, create the VolumeSnapshotContent.

As you can immediately spot, if the snapshotter is restarted while the snapshot is being cut and after the user has deleted the related VolumeSnapshot object, the newly started snapshotter does not know that it should resume creating the snapshot. Volume provisioning has the same issue. We hope that volume taints could help here: we could create an empty PV / VolumeSnapshotContent before knowing the real volume / snapshot ID, as a memento that some operation is in progress on the storage backend.
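To make the memento idea concrete, here is a rough sketch using today's VolumeSnapshotContent Go types from the external-snapshotter client module (the import version and the helper name are assumptions, and the API group was still alpha/beta when this issue was filed):

```go
package main

import (
	snapshotv1 "github.com/kubernetes-csi/external-snapshotter/client/v6/apis/volumesnapshot/v1"
	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// mementoContent builds a placeholder VolumeSnapshotContent recording that a
// snapshot of this volume is being cut for this VolumeSnapshot, before the
// backend SnapshotHandle is known. A restarted snapshotter could list these
// and resume the in-flight operations instead of forgetting them.
func mementoContent(contentName, driver, volumeHandle, snapName, snapNamespace string) *snapshotv1.VolumeSnapshotContent {
	return &snapshotv1.VolumeSnapshotContent{
		ObjectMeta: metav1.ObjectMeta{Name: contentName},
		Spec: snapshotv1.VolumeSnapshotContentSpec{
			Driver:         driver,
			DeletionPolicy: snapshotv1.VolumeSnapshotContentDelete,
			VolumeSnapshotRef: corev1.ObjectReference{
				Name:      snapName,
				Namespace: snapNamespace,
			},
			Source: snapshotv1.VolumeSnapshotContentSource{
				// Only the source volume is known at this point; the
				// SnapshotHandle is filled in once CreateSnapshot returns.
				VolumeHandle: &volumeHandle,
			},
		},
	}
}
```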