longhorn: [BUG] Unable to create backups after snapshot is generated

Hi! I’m having issues with the backup system. I just managed to make the error message described in issue #4187 disappear (redeploying everything after removing the backup target from the config fixed it, thanks for the tip!), and now another problem has come up.

Bug description

When clicking “Create Backup” in the volume management UI, a snapshot is created but then nothing happens: no backup follows. Through the Rancher UI, I tried to manually run the cronjob for a scheduled backup; the job takes forever to complete because, for every available volume, the pod log shows the following error: failed to complete backupAndCleanup for volume-name: timeout waiting for the backup.

No backup is shown in the Longhorn UI.

To Reproduce

In the Rancher UI, use the “Run now” button to start a scheduled backup job and check the log.
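For anyone without Rancher, roughly the same thing can be done with kubectl; this is only a sketch, the namespace and recurring job name (sched-backup) are the ones from my log, and the manual job name is just an example:

    # list the CronJobs Longhorn created for the recurring jobs (check the exact name here)
    kubectl -n longhorn-system get cronjobs

    # trigger the scheduled backup job manually, equivalent to "Run now" in the Rancher UI
    kubectl -n longhorn-system create job --from=cronjob/sched-backup sched-backup-manual

    # follow the log of the job's pod and watch for the backupAndCleanup timeout errors
    kubectl -n longhorn-system logs -f job/sched-backup-manual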

Log or Support bundle

The log of the pod running the cronjob shows this error multiple times: level=error msg="failed to run job for volume" concurrent=1 error="failed to complete backupAndCleanup for volume-name: timeout waiting for the backup of the snapshot sched-ba-16b52a09-ce71-429b-9ee5-618087750787 of volume volume-name to start" groups=default job=sched-backup labels="{\"RecurringJob\":\"sched-backup\"}" retain=30 task=backup volume=volume-name.

Edit: I’ve just sent an email to longhorn-support-bundle@suse.com with the support bundle attached.

Environment

  • Longhorn version: v1.3.0
  • Installation method (e.g. Rancher Catalog App/Helm/Kubectl): kubectl
  • Kubernetes distro (e.g. RKE/K3s/EKS/OpenShift) and version: k3s version v1.22.7+k3s1
    • Number of management nodes in the cluster: 2
    • Number of worker nodes in the cluster: 4
  • Node config
    • OS type and version: debian bullseye 11.4, kernel release 5.10.0-16-amd64
    • CPU per node: 2 vCPU
    • Memory per node: 4096MB on the control nodes, 5120MB on the worker nodes
  • Disk type (e.g. SSD/NVMe): SSD
    • Network bandwidth between the nodes: 1Gbps
  • Underlying Infrastructure (e.g. on AWS/GCE, EKS/GKE, VMWare/KVM, Baremetal): Proxmox hypervisor
  • Number of Longhorn volumes in the cluster: 15

Additional context

Previously, when running Longhorn v1.2.4, backups were working flawlessly; the issues appeared after updating to v1.3.0. By manually opening a shell inside a longhorn-manager pod, I can successfully create files and folders in the NFS share through the mount in /var/lib/longhorn-backupstore-mounts/, so I’d say this is not an NFS permissions issue.
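For reference, this is roughly how I checked write access; the pod name and the mount subdirectory are placeholders, the exact names depend on the cluster and the configured backup target:

    # open a shell in one of the longhorn-manager pods (pod name is an example)
    kubectl -n longhorn-system exec -it longhorn-manager-xxxxx -- sh

    # inside the pod: the backupstore is mounted under this path; try to write to it
    ls /var/lib/longhorn-backupstore-mounts/
    touch /var/lib/longhorn-backupstore-mounts/<mount-dir>/write-test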

About this issue

  • State: open
  • Created 2 years ago
  • Comments: 15 (7 by maintainers)

Most upvoted comments

Didn’t know which component(s) to examine in detail, but here are some additional details and a workaround that I found for my situation…

Went into the UI, selected a volume, and clicked “Create Backup”. A snapshot gets created, but not a backup. Here’s what I observed (screenshot omitted).

In the instance manager log I see: time="2022-08-18T13:58:39Z" level=warning msg="Snapshot backup timeout" backup=backup-b66f715210e84c74 controller=longhorn-backup error="proxyServer=192.168.80.215:8501 destination=192.168.80.215:10001: failed to backup snapshot e2ff9529-06c1-4d72-a215-c9b26b90c59b to backup-b66f715210e84c74: rpc error: code = Unknown desc = failed to create backup to nfs://nfs.longhorn-backup.svc.cluster.local:/ for volume pvc-63cfac91-a2d6-40f4-9a31-57184571f2bf: rpc error: code = DeadlineExceeded desc = context deadline exceeded" node=[redacted]

Note that I am using k8s to manage the NFS server for the backups, so the backup target points at the service’s FQDN. I modified the backup target to use the IP address of the NFS service instead of the FQDN. I don’t know why I had to change this, since the FQDN was working in 1.2.4, but I guess some component is trying to reach the NFS server and can’t use the cluster DNS properly (if something tries in-cluster DNS resolution during start-up, I know this fails because that DNS is only available from inside the containers).

So, after switching the backup target from the FQDN of the service to the IP address of the service, everything is working as intended.
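In case it helps, this is roughly how the change can be checked and made from the CLI; I did it through the Longhorn UI, but editing the backup-target setting directly should be equivalent (service name and namespace are the ones from my log, the rest is a sketch):

    # current backup target (FQDN-based)
    kubectl -n longhorn-system get settings.longhorn.io backup-target -o jsonpath='{.value}'

    # ClusterIP of the in-cluster NFS service, to use in the new target URL
    kubectl -n longhorn-backup get svc nfs -o jsonpath='{.spec.clusterIP}'

    # change the setting value to nfs://<cluster-ip>:/ (I used the UI; editing the CR should do the same)
    kubectl -n longhorn-system edit settings.longhorn.io backup-target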

I know this doesn’t actually identify why the FQDN approach isn’t working (maybe someone who knows where the DNS resolution happens can chime in on what might cause this failure). I am hoping to put together a test case using a development deployment soon, but I need to leave my production deployments alone for now since this works for me. It would still be ideal to use the FQDN of the service, though, in case I ever need to remake the service for some reason and get a new IP address from k8s.
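A quick check that might help narrow it down (the pod name is a placeholder, and this assumes nslookup is available in the instance-manager image):

    # pick one of the engine instance-manager pods
    kubectl -n longhorn-system get pods | grep instance-manager-e

    # see whether the NFS service FQDN resolves from inside that pod
    kubectl -n longhorn-system exec -it instance-manager-e-xxxxxxxx -- nslookup nfs.longhorn-backup.svc.cluster.local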

@derekbit I will look into getting the bundle, but need to explore it for concerns about sensitive data…

You can check the error during the backup and its associated snapshot. It sounds like the snapshot is somehow deleted before the backup is taken.

I saw tons of log lines like the following:

2022-08-07T15:44:51.547638859+02:00 time="2022-08-07T13:44:51Z" level=warning msg="Cannot take snapshot backup" backup=backup-989d2d761c4844c3 controller=longhorn-backup error="could not find snapshot 'sched-ba-353e2804-4e70-48b3-8f39-a91d375c366e' to backup, engine 'unifi-controller-vol-e-36c0bc0d'" node=nem-03
2022-08-07T15:44:51.547722181+02:00 time="2022-08-07T13:44:51Z" level=warning msg="Error syncing Longhorn backup longhorn-system/backup-989d2d761c4844c3" controller=longhorn-backup error="longhorn-backup: fail to sync backup longhorn-system/backup-989d2d761c4844c3: could not find snapshot 'sched-ba-353e2804-4e70-48b3-8f39-a91d375c366e' to backup, engine 'unifi-controller-vol-e-36c0bc0d'" node=nem-03

This means the volume snapshots are being removed during the backup creation… Not sure how that could happen…
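To cross-check from the CLI, something like the following should show whether the snapshot a backup refers to still exists; the CR names are taken from the log above, and this assumes the snapshots.longhorn.io CRD present in v1.3:

    # state and error recorded on the backup CR
    kubectl -n longhorn-system get backups.longhorn.io backup-989d2d761c4844c3 -o yaml

    # does the referenced snapshot still exist as a CR?
    kubectl -n longhorn-system get snapshots.longhorn.io | grep sched-ba-353e2804-4e70-48b3-8f39-a91d375c366e

    # snapshots the engine itself still knows about for that volume
    kubectl -n longhorn-system get engines.longhorn.io unifi-controller-vol-e-36c0bc0d -o yaml | grep sched-ba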