external-snapshotter: Problems dealing with snapshot create requests timing out

When a CSI plugin is passed a CreateSnapshot request and the caller (the snapshotter sidecar) times out, the sidecar marks this as an error and does not retry the snapshot. Further, because the call only timed out and did not fail, the storage provider may still have created the snapshot, just later than expected.
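Roughly what happens today, as a minimal Go sketch (not the actual sidecar code; the function name, the 60-second deadline, and the identifiers are illustrative):

```go
package main

import (
	"context"
	"time"

	"github.com/container-storage-interface/spec/lib/go/csi"
	"google.golang.org/grpc/codes"
	"google.golang.org/grpc/status"
)

// createSnapshotOnce illustrates how a single, deadline-bound CreateSnapshot
// call can time out on the caller's side even though the plugin may still
// finish cutting the snapshot later.
func createSnapshotOnce(client csi.ControllerClient, name, volumeID string) (string, error) {
	ctx, cancel := context.WithTimeout(context.Background(), 60*time.Second)
	defer cancel()

	resp, err := client.CreateSnapshot(ctx, &csi.CreateSnapshotRequest{
		Name:           name,     // idempotency key chosen by the caller
		SourceVolumeId: volumeID, // volume to snapshot
	})
	if status.Code(err) == codes.DeadlineExceeded {
		// Only the caller's deadline expired; the storage provider may still
		// create the snapshot, but we never receive its SnapshotId, so a
		// later DeleteSnapshot cannot be issued for it.
		return "", err
	}
	if err != nil {
		return "", err // a real failure from the plugin
	}
	return resp.GetSnapshot().GetSnapshotId(), nil
}
```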

When such a snapshot is then deleted, no delete request is sent to the CSI plugin; the sidecar cannot issue one because it never received the SnapID.

The end result of this is that the snapshot is leaked on the storage provider.

Hence, the question/issue is as follows:

Should the snapshot be retried on timeouts from the CreateSnapshot call?

Based on the ready_to_use parameter in the CSI spec [1] and the possibility that the application is frozen while the snapshot is taken, I would assume this operation cannot be retried indefinitely. But, also per the spec, the behavior on timeout errors should be to retry, as implemented for volume create and delete operations in the provisioner sidecar [2].

So, to fix the potential snapshot leak on the storage provider, should the snapshotter sidecar retry until it gets either an error from the plugin or a success with a SnapID, but mark the snapshot as bad/unusable because it was not completed in time (to honor application freeze windows and the like)?
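A minimal sketch of that idea, assuming a hypothetical helper around the CSI ControllerClient (the function name, deadlines, backoff values, and the way "not ready" is surfaced are all assumptions, not the sidecar's actual implementation):

```go
package main

import (
	"context"
	"time"

	"github.com/container-storage-interface/spec/lib/go/csi"
	"google.golang.org/grpc/codes"
	"google.golang.org/grpc/status"
)

// createSnapshotWithRetry keeps re-issuing CreateSnapshot with the same name
// (the CSI idempotency key) until the plugin gives a final answer: either a
// snapshot with a SnapshotId or a non-timeout error. The returned ready flag
// is what the caller could use to mark a late snapshot as unusable.
func createSnapshotWithRetry(client csi.ControllerClient, name, volumeID string) (string, bool, error) {
	backoff := time.Second
	for {
		ctx, cancel := context.WithTimeout(context.Background(), 60*time.Second)
		resp, err := client.CreateSnapshot(ctx, &csi.CreateSnapshotRequest{
			Name:           name, // same name on every attempt
			SourceVolumeId: volumeID,
		})
		cancel()

		switch {
		case err == nil:
			// Final answer: we now hold the SnapshotId, so the snapshot can
			// always be deleted later, even if it is marked as unusable.
			snap := resp.GetSnapshot()
			return snap.GetSnapshotId(), snap.GetReadyToUse(), nil
		case status.Code(err) == codes.DeadlineExceeded:
			// Only our deadline expired; the backend may still be working.
			// Retry so we eventually learn the SnapshotId (or a real error).
			time.Sleep(backoff)
			if backoff < 5*time.Minute {
				backoff *= 2
			}
		default:
			// A final (non-timeout) error from the plugin: stop retrying.
			return "", false, err
		}
	}
}
```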

[1] CSI spec ready_to_use section: https://github.com/container-storage-interface/spec/blob/master/spec.md#the-ready_to_use-parameter

[2] timeout handling in provisioner sidecar: https://github.com/kubernetes-csi/external-provisioner#csi-error-and-timeout-handling


Most upvoted comments

@ggriffiths will be helping out with this bug fix. Thanks.

IMO, the proper solution without changing the CSI spec is the same as in volume provisioning. The snapshotter should retry until it gets a final response and then decide what to do: if the snapshot is too old or the user deleted the VolumeSnapshot object in the meantime, delete the snapshot; otherwise, create the VolumeSnapshotContent.

As you can immediately spot, if the snapshotter is restarted while the snapshot is being cut and after the user has deleted the related VolumeSnapshot object, the newly started snapshotter does not know that it should resume creating the snapshot. Volume provisioning has the same issue. We hope that volume taints could help here: we could create an empty PV / VolumeSnapshotContent before knowing the real volume / snapshot ID, as a memento that some operation is in progress on the storage backend.
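To make the memento idea concrete, here is a rough sketch using today's VolumeSnapshotContent Go types from the external-snapshotter client module (the import version and the helper name are assumptions, and the API group was still alpha/beta when this issue was filed):

```go
package main

import (
	snapshotv1 "github.com/kubernetes-csi/external-snapshotter/client/v6/apis/volumesnapshot/v1"
	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// mementoContent builds a placeholder VolumeSnapshotContent recording that a
// snapshot of this volume is being cut for this VolumeSnapshot, before the
// backend SnapshotHandle is known. A restarted snapshotter could list these
// and resume the in-flight operations instead of forgetting them.
func mementoContent(contentName, driver, volumeHandle, snapName, snapNamespace string) *snapshotv1.VolumeSnapshotContent {
	return &snapshotv1.VolumeSnapshotContent{
		ObjectMeta: metav1.ObjectMeta{Name: contentName},
		Spec: snapshotv1.VolumeSnapshotContentSpec{
			Driver:         driver,
			DeletionPolicy: snapshotv1.VolumeSnapshotContentDelete,
			VolumeSnapshotRef: corev1.ObjectReference{
				Name:      snapName,
				Namespace: snapNamespace,
			},
			Source: snapshotv1.VolumeSnapshotContentSource{
				// Only the source volume is known at this point; the
				// SnapshotHandle is filled in once CreateSnapshot returns.
				VolumeHandle: &volumeHandle,
			},
		},
	}
}
```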