longhorn: [BUG] Longhorn Snapshots are not deleted after expired Backups (Velero)

Describe the bug (🐛 if you encounter this issue)

We are using Velero to back up the Kubernetes manifests and the persistent volumes (in our example we back up Harbor). When we create a backup, Velero saves the K8s manifests to an object storage (MinIO) and, via the velero-plugin-for-csi, creates VolumeSnapshot resources to trigger Longhorn backups. Longhorn writes the backups to another MinIO bucket. However, when we delete a Velero backup or the backup expires, the corresponding snapshots (snapshots.longhorn.io) are not deleted:

We are using Velero v1.9.4 with EnableCSI feature and the following plugins:

  • velero/velero-plugin-for-csi:v0.4.0
  • velero/velero-plugin-for-aws:v1.6.0

We have the same issue in Velero v1.11.0 with EnableCSI feature and the following plugins:

  • velero/velero-plugin-for-csi:v0.5.0
  • velero/velero-plugin-for-aws:v1.6.0

To Reproduce

Steps to reproduce the behavior:

  1. Install the newest version of Velero and Rancher-Longhorn
  2. In Longhorn, configure an S3 backup target (we are using MinIO for this)
  3. Enable CSI Snapshot Support for Longhorn.
  4. Create a backup (for example with the Schedule below): velero backup create --from-schedule harbor-daily-0200
  5. Delete the backup: velero backup delete <BACKUPNAME>
  6. The snapshot (snapshots.longhorn.io) is not deleted.
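The leftover objects after step 5 can be confirmed from the CLI; a sketch along these lines (resource names and namespaces are examples from this setup):

```shell
# Delete the Velero backup; the CSI VolumeSnapshot/VolumeSnapshotContent
# objects created for it are removed as expected:
velero backup delete <BACKUPNAME> --confirm
kubectl get volumesnapshots.snapshot.storage.k8s.io -n harbor
kubectl get volumesnapshotcontents.snapshot.storage.k8s.io

# ...but the Longhorn-native snapshot objects remain behind:
kubectl get snapshots.longhorn.io -n longhorn-system
```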

Expected behavior

The snapshot is deleted.

Environment

  • Longhorn version: 102.2.0+up1.4.1
  • Velero version:
  • Installation method (e.g. Rancher Catalog App/Helm/Kubectl): Rancher-Longhorn Helm Chart
  • Kubernetes distro (e.g. RKE/K3s/EKS/OpenShift) and version: RKE2, v1.25.7+rke2r1
    • Number of management nodes in the cluster: 1
    • Number of worker nodes in the cluster: 3
  • Node config
    • OS type and version: Ubuntu
  • Underlying Infrastructure (e.g. on AWS/GCE, EKS/GKE, VMWare/KVM, Baremetal): VMs on Proxmox
  • Number of Longhorn volumes in the cluster: 17

Additional context

Velero Backup Schedule for Harbor

---
apiVersion: velero.io/v1
kind: Schedule
metadata:
  name: harbor-daily-0200
  namespace: velero #Must be the namespace of the Velero server
spec:
  schedule: 0 0 * * *
  template:
    includedNamespaces:
    - 'harbor'
    includedResources:
    - '*'
    snapshotVolumes: true
    storageLocation: minio
    volumeSnapshotLocations:
      - longhorn
    ttl: 168h0m0s #7 Days retention
    defaultVolumesToRestic: false
    hooks:
      resources:
        - name: postgresql
          includedNamespaces:
          - 'harbor'
          includedResources:
          - pods
          excludedResources: []
          labelSelector:
            matchLabels:
              statefulset.kubernetes.io/pod-name: harbor-database-0
          pre:
            - exec:
                container: database
                command:
                  - /bin/bash
                  - -c
                  - "psql -U postgres -c \"CHECKPOINT\";"
                onError: Fail
                timeout: 30s

VolumeSnapshotClass

apiVersion: snapshot.storage.k8s.io/v1
kind: VolumeSnapshotClass
metadata:
  name: longhorn
  namespace: longhorn-system
  labels:
    velero.io/csi-volumesnapshot-class: "true"
driver: driver.longhorn.io
deletionPolicy: Delete

VolumeSnapshotClass

In our second cluster, with Velero v1.11.0 installed, we created the following resource (but same issue here):

apiVersion: snapshot.storage.k8s.io/v1
kind: VolumeSnapshotClass
metadata:
  name: longhorn
  namespace: longhorn-system
  labels:
    velero.io/csi-volumesnapshot-class: 'true'
driver: driver.longhorn.io
deletionPolicy: Delete
parameters:
  type: bak

VolumeSnapshotLocation

apiVersion: velero.io/v1
kind: VolumeSnapshotLocation
metadata:
  name: longhorn
  namespace: velero
spec:
  provider: longhorn.io/longhorn

About this issue

  • Original URL
  • State: open
  • Created a year ago
  • Reactions: 7
  • Comments: 24 (3 by maintainers)

Most upvoted comments

Thanks for the valuable info.

We will improve this, as it’s quite important for space efficiency.

@R-Studio @innobead

I think this issue can be simplified to completely exclude Velero.

At the core the issue here is that Longhorn does not delete snapshots or backups when the backing CSI VolumeSnapshot resource is deleted.

As a user of Longhorn that is interfacing with CSI and not native Longhorn resources, I expect the state of Longhorn resources to reflect the state of my CSI resources.

  • If I create a CSI VolumeSnapshot I expect Longhorn to create a snapshot/backup/bi. This works!
  • If I delete a CSI VolumeSnapshot I expect Longhorn to delete the backing snapshot/backup/bi that it created. This doesn’t work.

Therefore I think it’s fair to state that Longhorn is currently only providing a partial implementation of the CSI interface/spec.

Velero is just using this common CSI interface as it is intended to be used and expecting it to have the desired effect. This is not a Velero issue.
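The expected cascade can be sketched as a toy model (simplified illustration only, not Longhorn's or the external snapshot-controller's actual code; all names are made up): with deletionPolicy: Delete, removing a VolumeSnapshot should remove the VolumeSnapshotContent and the storage-side object.

```python
# Toy model of the CSI snapshot deletion cascade (illustrative, not real code).
# With deletionPolicy=Delete, deleting a VolumeSnapshot should remove the
# VolumeSnapshotContent AND the backing storage-side object (here, the
# Longhorn snapshot/backup).

def delete_volume_snapshot(name, contents, backend, policy="Delete"):
    """Simulate snapshot-controller + CSI driver behavior on VolumeSnapshot deletion."""
    content = contents.pop(name, None)   # controller deletes the VolumeSnapshotContent
    if content and policy == "Delete":
        backend.discard(content)         # CSI DeleteSnapshot removes the backing object
    return backend

contents = {"velero-harbor-abc": "snapshot-04db0d0f"}
backend = {"snapshot-04db0d0f"}          # Longhorn-side snapshot

delete_volume_snapshot("velero-harbor-abc", contents, backend)
print(backend)  # -> set(): the backing snapshot is gone, which is the expected behavior
```

With deletionPolicy: Retain the backing object would be kept; the report here uses Delete, so the cascade should reach the Longhorn snapshot.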


Perhaps this should be opened as a new issue with a smaller scope (CSI spec conformance).

@innobead thanks for your reply. What I want: I have a Velero schedule that creates/triggers backups of my persistent volumes with a retention period of e.g. 7 days. After this retention period, Velero deletes these backups, but the corresponding Longhorn snapshots are not deleted and consume disk space that I don't want.
As a workaround, I have a recurring job that deletes these snapshots (retain 7), but this has two disadvantages:

  • I’m using up disk space for snapshots I don’t want, even though the data is already stored in my object store.
  • The recurring job retains a fixed count instead of pruning by creation timestamp like Velero does. For example, if I trigger 3 manual backups with Velero, the job evicts older scheduled snapshots, and I lose backup data that is older than 4 days.
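The mismatch between the retain-count workaround and Velero's TTL can be illustrated with a small sketch (hypothetical helpers, not part of Velero or Longhorn): pruning by age keeps everything younger than the TTL, while pruning by count silently drops young snapshots as soon as extra manual backups exist.

```python
from datetime import datetime, timedelta

def prune_by_ttl(snapshots, now, ttl):
    """Keep snapshots younger than the TTL (what Velero's ttl field does)."""
    return [s for s in snapshots if now - s[1] < ttl]

def prune_by_count(snapshots, retain):
    """Keep only the newest `retain` snapshots (the recurring-job workaround)."""
    return sorted(snapshots, key=lambda s: s[1], reverse=True)[:retain]

now = datetime(2023, 5, 8)
# 7 daily backups plus 3 manual ones taken today:
snaps = [(f"daily-{i}", now - timedelta(days=i)) for i in range(7)]
snaps += [(f"manual-{i}", now) for i in range(3)]

kept_ttl = prune_by_ttl(snaps, now, timedelta(days=7))  # all 10 survive
kept_cnt = prune_by_count(snaps, retain=7)              # daily-4..daily-6 are dropped
```

With retain=7 and 3 extra manual backups, the count-based job drops the snapshots older than 4 days even though they are still inside the 7-day retention window.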

I’m having almost the same setup and versions and the same issue! One interesting log line found on longhorn-csi-plugin:

longhorn-csi-plugin-5k8lg longhorn-csi-plugin time="2023-10-06T08:12:20Z" level=info msg="DeleteSnapshot: req: {\"snapshot_id\":\"bak://pvc-c57da450-ce82-44c8-ac83-0a039634a334/backup-04db0d0fe4ef49f1\"}"
longhorn-csi-plugin-5k8lg longhorn-csi-plugin time="2023-10-06T08:12:20Z" level=info msg="DeleteSnapshot: rsp: {}"
csi-snapshotter-5d899fdcfc-xv627 csi-snapshotter E1006 08:12:20.143392 1 snapshot_controller_base.go:265] could not sync content "snapcontent-55c4399b-1dec-4cf2-b9bd-a4eff27f315e": snapshot controller failed to update snapcontent-55c4399b-1dec-4cf2-b9bd-a4eff27f315e on API server: Operation cannot be fulfilled on volumesnapshotcontents.snapshot.storage.k8s.io "snapcontent-55c4399b-1dec-4cf2-b9bd-a4eff27f315e": StorageError: invalid object, Code: 4, Key: /registry/snapshot.storage.k8s.io/volumesnapshotcontents/snapcontent-55c4399b-1dec-4cf2-b9bd-a4eff27f315e, ResourceVersion: 0, AdditionalErrorMsg: Precondition failed: UID in precondition: d4fa9b1d-416e-4df5-ad74-d3ac6bec3b66, UID in object meta:

@tcoupin thanks, but this is not a solution: if I use --snapshot-volumes=false, Velero does not trigger a backup of the persistent volumes at all, so it only backs up the manifests/YAMLs.

@weizhe0422 here the support bundle. Thanks for any help. Info

  • Start Backup: 2023-05-01 09:12:59 +0200 CEST
  • Complete Backup: 2023-05-01 09:14:29 +0200 CEST
  • Delete Backup: 2023-05-01 09:17:34 +0200 CEST

supportbundle_a3236774-99ca-4ab5-a2a5-74c925273bb4_2023-05-01T07-20-00Z.zip