velero: [BUG] Longhorn Snapshots are not deleted after expired Backups (Velero)
Describe the bug
We are using Velero to create backups of the Kubernetes manifests and the persistent volumes (in our example we back up Harbor).
When we create a backup, Velero saves the K8s manifests to an object storage (MinIO) and creates snapshot resources that trigger Longhorn backups via the velero-plugin-for-csi plugin. Longhorn writes the backups to another MinIO bucket.
If we delete a Velero backup, or the backup expires, the snapshots (`snapshots.longhorn.io`) are not deleted.
We are using Velero v1.9.4 with the `EnableCSI` feature and the following plugins:
- velero/velero-plugin-for-csi:v0.4.0
- velero/velero-plugin-for-aws:v1.6.0

We have the same issue in Velero v1.11.0 with the `EnableCSI` feature and the following plugins:
- velero/velero-plugin-for-csi:v0.5.0
- velero/velero-plugin-for-aws:v1.6.0
To Reproduce
Steps to reproduce the behavior:
- Install the newest versions of Velero and Rancher-Longhorn.
- In Longhorn, configure an S3 backup target (we are using MinIO for this).
- Enable CSI snapshot support for Longhorn.
- Create a backup (for example with the `Schedule` below): `velero backup create --from-schedule harbor-daily-0200`
- Delete the backup: `velero backup delete <BACKUPNAME>`
- The snapshot (`snapshots.longhorn.io`) is not deleted.
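A quick way to confirm the leftover resources after the Velero backup is gone (a sketch; `longhorn-system` is the default Longhorn namespace, adjust if yours differs):

```shell
# Longhorn snapshot CRs that should have been cleaned up
kubectl -n longhorn-system get snapshots.longhorn.io

# Cross-check: the CSI-level objects are already gone at this point
kubectl get volumesnapshots --all-namespaces
kubectl get volumesnapshotcontents
```

These commands only inspect state; they require access to the affected cluster and do not modify anything.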
Expected behavior
The snapshot is deleted.
Environment
- Longhorn version: 102.2.0+up1.4.1
- Velero version:
- Installation method (e.g. Rancher Catalog App/Helm/Kubectl): Rancher-Longhorn Helm Chart
- Kubernetes distro (e.g. RKE/K3s/EKS/OpenShift) and version: RKE2, v1.25.7+rke2r1
- Number of management nodes in the cluster: 1
- Number of worker nodes in the cluster: 3
- Node config
  - OS type and version: Ubuntu
- Underlying Infrastructure (e.g. on AWS/GCE, EKS/GKE, VMWare/KVM, Baremetal): VMs on Proxmox
- Number of Longhorn volumes in the cluster: 17
- Velero features (use `velero client config get features`):
Additional context
Velero Backup Schedule for Harbor
```yaml
---
apiVersion: velero.io/v1
kind: Schedule
metadata:
  name: harbor-daily-0200
  namespace: velero # Must be the namespace of the Velero server
spec:
  schedule: 0 0 * * *
  template:
    includedNamespaces:
      - 'harbor'
    includedResources:
      - '*'
    snapshotVolumes: true
    storageLocation: minio
    volumeSnapshotLocations:
      - longhorn
    ttl: 168h0m0s # 7 days retention
    defaultVolumesToRestic: false
    hooks:
      resources:
        - name: postgresql
          includedNamespaces:
            - 'harbor'
          includedResources:
            - pods
          excludedResources: []
          labelSelector:
            matchLabels:
              statefulset.kubernetes.io/pod-name: harbor-database-0
          pre:
            - exec:
                container: database
                command:
                  - /bin/bash
                  - -c
                  - "psql -U postgres -c \"CHECKPOINT\";"
                onError: Fail
                timeout: 30s
```
VolumeSnapshotClass
```yaml
apiVersion: snapshot.storage.k8s.io/v1
kind: VolumeSnapshotClass
metadata:
  name: longhorn
  namespace: longhorn-system
  labels:
    velero.io/csi-volumesnapshot-class: "true"
driver: driver.longhorn.io
deletionPolicy: Delete
```
VolumeSnapshotClass
In our second cluster, with Velero v1.11.0 installed, we created the following resource (but we see the same issue there):
```yaml
apiVersion: snapshot.storage.k8s.io/v1
kind: VolumeSnapshotClass
metadata:
  name: longhorn
  namespace: longhorn-system
  labels:
    velero.io/csi-volumesnapshot-class: 'true'
driver: driver.longhorn.io
deletionPolicy: Delete
parameters:
  type: bak
```
VolumeSnapshotLocation
```yaml
apiVersion: velero.io/v1
kind: VolumeSnapshotLocation
metadata:
  name: longhorn
  namespace: velero
spec:
  provider: longhorn.io/longhorn
```
About this issue
- Original URL
- State: open
- Created a year ago
- Reactions: 4
- Comments: 23 (7 by maintainers)
Hi everyone! I am from the Longhorn team. It is a great discussion so far in this thread and I would like to join the conversation.
First of all, as others already mentioned, a CSI VolumeSnapshot (a Kubernetes upstream CRD) can be associated with either a Longhorn snapshot (which lives inside the cluster with the data) or a Longhorn backup (which lives outside of the cluster in an S3 endpoint). For example, the CSI VolumeSnapshot created by one VolumeSnapshotClass corresponds to a Longhorn snapshot (link):
and the CSI VolumeSnapshot created by another VolumeSnapshotClass corresponds to a Longhorn backup (link):
CSI VolumeSnapshot of `type: snap`
When you create a CSI VolumeSnapshot of `type: snap`, Longhorn provisions an in-cluster Longhorn snapshot (the ones shown in this picture: https://user-images.githubusercontent.com/7498854/234019298-e7cd2853-199b-4702-b020-24227c0a13bc.png). When you delete this CSI VolumeSnapshot, Longhorn deletes that snapshot. The benefit of this approach is that there are no leftover resources. However, as pointed out by others, this CSI VolumeSnapshot is local to the cluster; it is not backed up to the remote S3 endpoint. If someone (like the Velero 1.12.2 data mover) tries to mount (which also means clone) this CSI VolumeSnapshot and upload the data to the S3 endpoint, Longhorn first needs to fully copy the data to a new PVC, then the data mover can upload the data to the S3 endpoint, and finally the data mover can delete the newly cloned PVC. This is a costly operation and doesn't seem to fit this backup use case well. This feature (cloning a new PVC from a CSI VolumeSnapshot) was intended for use cases like VM cloning (in Harvester), in which a brand new VM and its data are cloned from another VM.
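For reference, a minimal sketch of what such a `type: snap` class could look like (the class name `longhorn-snap` is a hypothetical placeholder; `type` is the parameter Longhorn documents for choosing between snapshot and backup behavior):

```yaml
apiVersion: snapshot.storage.k8s.io/v1
kind: VolumeSnapshotClass
metadata:
  name: longhorn-snap   # hypothetical name
driver: driver.longhorn.io
deletionPolicy: Delete
parameters:
  type: snap            # in-cluster Longhorn snapshot, no upload to S3
```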
CSI VolumeSnapshot of `type: bak`
On the other hand, when you create a CSI VolumeSnapshot of `type: bak`, Longhorn will take an in-cluster Longhorn snapshot and then upload it as a Longhorn backup to the S3 endpoint. When you delete a CSI VolumeSnapshot of `type: bak`, Longhorn will delete the backup from the S3 endpoint. The downside of the `type: bak` CSI VolumeSnapshot currently is that there is a leftover Longhorn snapshot (the ones shown in this picture: https://user-images.githubusercontent.com/7498854/234019298-e7cd2853-199b-4702-b020-24227c0a13bc.png) after deleting the CSI VolumeSnapshot. However, the upside is huge: this method is the native way to back up data to an S3 endpoint in Longhorn. It is fast and efficient.
Conclusion: I would recommend using the CSI VolumeSnapshot of `type: bak`, as it is the native way to back up data to an S3 endpoint in Longhorn. It is fast and efficient. To overcome its limitation (the leftover Longhorn snapshot), I suggest:

Firstly, there are a couple of points I would like to highlight about your setup:
The value “bak” tells the Longhorn driver to do an actual “backup” when a CSI snapshot is taken. This was the default behavior of the Longhorn CSI driver until version 1.3. Since then, there is a different value you can use, called “snap”, which causes the CSI driver to take a real “snapshot” without triggering data movement. Just wanted to mention it in case you want to use this feature. See https://longhorn.io/docs/1.4.1/snapshots-and-backups/csi-snapshot-support/csi-volume-snapshot-associated-with-longhorn-snapshot/ for details.
Now, coming to the actual snapshot deletion: if the VolumeSnapshot and VolumeSnapshotContent resources are gone but the storage snapshots remain, the most probable cause is an issue with the CSI driver. You should check the Longhorn CSI driver logs and verify whether there are any messages corresponding to the VolumeSnapshotContent that was deleted. You can also try to reproduce the problem by creating a VolumeSnapshot manually and then deleting it to see what happens. We, at CloudCasa, have seen snapshot deletion issues with Longhorn, but the driver version was pre-1.3. Are you using 1.4.1?
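The manual check described above might look like this (a sketch; the snapshot name and PVC name are hypothetical placeholders for one of the Harbor volumes, and `longhorn` is the VolumeSnapshotClass shown earlier in this issue):

```yaml
# Hypothetical manual test: take one CSI snapshot, then delete it and
# observe whether the backing Longhorn snapshot is cleaned up.
apiVersion: snapshot.storage.k8s.io/v1
kind: VolumeSnapshot
metadata:
  name: manual-test-snap                 # hypothetical name
  namespace: harbor
spec:
  volumeSnapshotClassName: longhorn      # class from this issue
  source:
    persistentVolumeClaimName: data-harbor-database-0   # hypothetical PVC name
```

After the snapshot reports `readyToUse: true`, delete it with `kubectl -n harbor delete volumesnapshot manual-test-snap`, then watch the Longhorn CSI driver logs and `kubectl -n longhorn-system get snapshots.longhorn.io` to see whether the backing snapshot is removed.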
Thanks, Raghu (https://cloudcasa.io).