external-snapshotter: Failure events are not propagated to VolumeSnapshot or VolumeSnapshotContent

When the driver fails to create a snapshot and returns an appropriate gRPC error code, the error is not propagated to the VolumeSnapshot object. The VolumeSnapshot remains at ReadyToUse=False, as expected:

# kubectl get volumesnapshot
NAME          READYTOUSE   SOURCEPVC   SOURCESNAPSHOTCONTENT   RESTORESIZE   SNAPSHOTCLASS   SNAPSHOTCONTENT                                    CREATIONTIME   AGE
snapdep5      false        pvcdep4                                           snapclass1      snapcontent-9c441220-ce0d-4b41-8d0a-9b0372082495                  10m

However, kubectl describe on the object does not show any failure events; it continues to show “CreatingSnapshot”, as below:

Events:
  Type    Reason            Age   From                 Message
  ----    ------            ----  ----                 -------
  Normal  CreatingSnapshot  12m   snapshot-controller  Waiting for a snapshot default/snapdep5 to be created by the CSI driver.

Similarly, the corresponding VolumeSnapshotContent doesn't show any events.
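
To make the expected behavior concrete, here is a minimal Python model of it. This is only a sketch, not the snapshotter's actual code (which is written in Go): `KubeObject`, `DriverError`, `create_snapshot`, and the `SnapshotCreationFailed` reason string are all hypothetical names chosen for illustration. The point is that a driver failure should leave a Warning event on both the VolumeSnapshot and the VolumeSnapshotContent.

```python
from dataclasses import dataclass, field


@dataclass
class Event:
    type: str      # "Normal" or "Warning", as shown by kubectl describe
    reason: str
    message: str


@dataclass
class KubeObject:
    """Hypothetical stand-in for a VolumeSnapshot or VolumeSnapshotContent."""
    kind: str
    name: str
    events: list = field(default_factory=list)


class DriverError(Exception):
    """Hypothetical stand-in for a gRPC error returned by the CSI driver."""

    def __init__(self, code, message):
        super().__init__(f"rpc error: code = {code} desc = {message}")
        self.code = code


def create_snapshot(snapshot, content, driver_call):
    """Call the driver; on failure, record a Warning event on both objects."""
    snapshot.events.append(Event(
        "Normal", "CreatingSnapshot",
        f"Waiting for a snapshot {snapshot.name} to be created by the CSI driver."))
    try:
        driver_call()
    except DriverError as err:
        # The reason string is an assumption made for this sketch.
        for obj in (snapshot, content):
            obj.events.append(Event("Warning", "SnapshotCreationFailed", str(err)))
        return False
    return True
```

With this model, a failing `driver_call` leaves the VolumeSnapshot with a Normal event followed by a Warning event, and the VolumeSnapshotContent with a single Warning event, which is what the report above expects to see in the describe output.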

Env Details: Kubernetes v1.18.2, Snapshotter v2.2.0-rc2, Red Hat Enterprise Linux Server 7.6

About this issue

  • State: closed
  • Created 4 years ago
  • Comments: 18 (1 by maintainers)

Most upvoted comments

The other aspect is that, on a final error, the external snapshotter should stop retrying the CreateSnapshot call. For instance, there are valid failures where a storage system limits the number of snapshots that can be created on a particular source volume. In such scenarios, when the driver returns a valid final error, the external snapshotter should honor it by not retrying.

I disagree with not retrying. Kubernetes controllers follow a reconciling architecture that always retries until the Kubernetes object is deleted. In your example failure case, someone can clean up and delete some snapshots from the storage system; if the snapshot-controller did not retry, it could not recover automatically.
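
The recovery argument can be illustrated with a small Python simulation. It is a sketch under stated assumptions, not the snapshot-controller's implementation: `FakeStorage`, `SnapshotLimitReached`, `backoff_delays`, and `reconcile` are hypothetical names, the backoff parameters are arbitrary, and the `max_attempts` bound exists only so the example terminates (the reply above describes retrying until the object is deleted).

```python
class SnapshotLimitReached(Exception):
    """Hypothetical driver error: the source volume is at its snapshot limit."""


class FakeStorage:
    """Toy storage backend with a per-volume snapshot limit (assumed for illustration)."""

    def __init__(self, limit, existing):
        self.limit = limit
        self.snapshots = list(existing)

    def create_snapshot(self, name):
        if name in self.snapshots:
            return  # creation is idempotent: the snapshot already exists
        if len(self.snapshots) >= self.limit:
            raise SnapshotLimitReached(f"limit of {self.limit} snapshots reached")
        self.snapshots.append(name)


def backoff_delays(base=1.0, factor=2.0, cap=300.0):
    """Yield exponentially growing delays, capped at `cap` seconds."""
    delay = base
    while True:
        yield delay
        delay = min(delay * factor, cap)


def reconcile(storage, name, sleep, max_attempts=10):
    """Retry creation with backoff, recording an event for every outcome."""
    events = []
    delays = backoff_delays()
    for _ in range(max_attempts):  # bounded only to keep the sketch finite
        try:
            storage.create_snapshot(name)
        except SnapshotLimitReached as err:
            events.append(("Warning", str(err)))
            sleep(next(delays))
            continue
        events.append(("Normal", "snapshot created"))
        return True, events
    return False, events
```

Passing a `sleep` callable keeps the sketch testable without real waiting. If that callable also removes an old snapshot from `storage.snapshots` during one of the waits, simulating an administrator freeing a slot, the next attempt succeeds without any change to the snapshot object, which is the automatic recovery described in the reply.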