longhorn: [BUG] Backup Job returns "Completed" despite running into errors
Describe the bug
The scheduled backup of one of our volumes fails every time it runs: it always gets to 94/95% and then throws an error. Some other volumes appear to have the same issue, but only sporadically. The snapshot and backup do get created, but the backup appears to be corrupted, as a restore does not work. The job is reported as successful/Completed in Kubernetes despite an error in the logs.
Doing the backup manually works without issue.
The biggest issue I have with this is that the backup job returns as “Completed” and creates a seemingly viable backup despite running into an error, although it would also be nice to find the actual cause of the backup failure. I only noticed this because I tried to restore one of the corrupt backups as part of a routine check.
Log or Support bundle
level=error msg="failed to run job for volume" concurrent=1 error="failed to complete backupAndCleanup for pvc-e8c4857a-33bd-4de8-9377-e92933f624bb: cannot find the status of the backup for snapshot gitlab-m-8013b8e2-826d-4639-bdb3-92c443c49f83. It might because the engine has restarted" groups=gitlab-minio job=gitlab-minio labels="{\"RecurringJob\":\"gitlab-minio\"}" retain=3 task=backup volume=pvc-e8c4857a-33bd-4de8-9377-e92933f624bb
I looked through the longhorn-manager logs and found the following; the backup name refers to the backup that’s failing:
level=debug msg="Requeue backup-dbb8cedbc1b049d9 due to conflict" backup=backup-dbb8cedbc1b049d9 controller=longhorn-backup error="Operation cannot be fulfilled on backups.longhorn.io \"backup-dbb8cedbc1b049d9\": the object has been modified; please apply your changes to the latest version and try again"
The longhorn-manager pod doesn’t actually appear to have restarted, though. Of course, that might be because the health check only restarts the pod after 3 connection failures.
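For context, the “object has been modified” part of that requeue message is Kubernetes’ standard optimistic-concurrency conflict: the controller tried to update the Backup object with a stale resourceVersion, so it requeues and reconciles against the newer copy, which by itself is usually benign. Below is a minimal sketch of the generic client-go pattern for handling such conflicts. It is illustrative only, operates on a ConfigMap rather than a Longhorn CR, and is not Longhorn’s actual controller code (which, as the log shows, handles the conflict by requeueing instead):

```go
package main

import (
	"context"
	"fmt"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
	"k8s.io/client-go/util/retry"
)

// updateWithConflictRetry shows the generic client-go pattern for handling
// "the object has been modified" errors: re-read the latest version of the
// object and re-apply the change until the update goes through.
func updateWithConflictRetry(ctx context.Context, client kubernetes.Interface, namespace, name string) error {
	return retry.RetryOnConflict(retry.DefaultRetry, func() error {
		// Fetch the latest copy so the resourceVersion is current.
		cm, err := client.CoreV1().ConfigMaps(namespace).Get(ctx, name, metav1.GetOptions{})
		if err != nil {
			return err
		}
		// Mutate and attempt the update; on a 409 Conflict this closure runs again.
		if cm.Labels == nil {
			cm.Labels = map[string]string{}
		}
		cm.Labels["touched"] = "true"
		_, err = client.CoreV1().ConfigMaps(namespace).Update(ctx, cm, metav1.UpdateOptions{})
		return err
	})
}

func main() {
	config, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		panic(err)
	}
	client := kubernetes.NewForConfigOrDie(config)
	if err := updateWithConflictRetry(context.Background(), client, "default", "example"); err != nil {
		fmt.Println("update failed:", err)
	}
}
```

In other words, the requeue itself is probably noise; the failure that matters is the missing backup status in the first error.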
To Reproduce
(I tried to reproduce this myself but was unsuccessful in doing so reliably, since it doesn’t happen every time. I think it has to do with the duration of the backup job, but I can’t be sure.)
Steps to reproduce the behavior:
- Create a large volume (500 GB+) (the size is fairly arbitrary, but I’ve had more luck reproducing this with larger volumes)
- Set a MinIO/S3 backup target (not sure if this is relevant, since I unfortunately don’t have the option of trying a different backup target right now)
- Create a Recurring Backup Job for the group (Concurrent: 1, Retain: 1; see the sketch after these steps)
- Assign volume to group
- Wait for backup job to run
- See error
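To script the recurring-job step above, the job can be created through Longhorn’s RecurringJob custom resource. Below is a minimal sketch using the Kubernetes dynamic client, assuming the longhorn.io/v1beta2 RecurringJob schema; the cron value is an arbitrary example, and the group name matches the report:

```go
package main

import (
	"context"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/apis/meta/v1/unstructured"
	"k8s.io/apimachinery/pkg/runtime/schema"
	"k8s.io/client-go/dynamic"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	config, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		panic(err)
	}
	client := dynamic.NewForConfigOrDie(config)

	// GVR of the Longhorn RecurringJob custom resource (v1beta2 as of Longhorn 1.3).
	gvr := schema.GroupVersionResource{Group: "longhorn.io", Version: "v1beta2", Resource: "recurringjobs"}

	// A backup job for the "gitlab-minio" group with concurrency 1 and retain 1,
	// matching the settings from the reproduction steps above.
	job := &unstructured.Unstructured{Object: map[string]interface{}{
		"apiVersion": "longhorn.io/v1beta2",
		"kind":       "RecurringJob",
		"metadata": map[string]interface{}{
			"name":      "gitlab-minio",
			"namespace": "longhorn-system",
		},
		"spec": map[string]interface{}{
			"name":        "gitlab-minio",
			"task":        "backup",
			"groups":      []interface{}{"gitlab-minio"},
			"cron":        "0 2 * * *", // example schedule: daily at 02:00
			"retain":      int64(1),
			"concurrency": int64(1),
			"labels":      map[string]interface{}{},
		},
	}}

	if _, err := client.Resource(gvr).Namespace("longhorn-system").Create(context.Background(), job, metav1.CreateOptions{}); err != nil {
		panic(err)
	}
}
```

Assigning the volume to the group (the next step) can then be done from the Longhorn UI or via a recurring-job-group label on the volume.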
Environment
- Longhorn version: 1.3.0
- Installation method: Helm
- Kubernetes distro (e.g. RKE/K3s/EKS/OpenShift) and version: RKE 1.3.12 - Kubernetes 1.22.10
- Node config
- Disk type (e.g. SSD/NVMe): SSD
- Underlying Infrastructure: Baremetal
About this issue
- State: closed
- Created 2 years ago
- Comments: 20 (13 by maintainers)
For this issue, we will only focus on making the job fail if the execution inside it fails, to prevent any confusion.
/app/recurring_job.go#L275-L314
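For illustration only (the function names below are hypothetical stand-ins, not the code behind that link): the direction of the fix is to let the error escape the process as a non-zero exit code, because the Kubernetes Job controller decides between Completed and Failed purely from the container’s exit status.

```go
package main

import (
	"errors"
	"fmt"
	"os"
)

// doRecurringJob is a stand-in for the real snapshot/backup logic in
// app/recurring_job.go; the important part is that it returns the error
// instead of only logging it.
func doRecurringJob() error {
	// ... run snapshot + backupAndCleanup ...
	return errors.New("failed to complete backupAndCleanup: cannot find the status of the backup")
}

func main() {
	if err := doRecurringJob(); err != nil {
		// Logging alone is not enough: if the process still exits 0, the
		// Job controller counts the pod as succeeded and the recurring
		// backup shows up as "Completed".
		fmt.Fprintln(os.Stderr, "recurring job failed:", err)
		// Exiting non-zero makes the pod fail, so the Job (and therefore
		// the recurring backup run) is reported as failed.
		os.Exit(1)
	}
}
```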
Same issue happened last night. This is the full error message:
As you can see, the Job returns a “Success” status despite running into an error. There are no other error messages in the job pod itself.
The longhorn-manager log is completely filled with these (it seems to write them multiple times a second for hours, which might explain why I can’t generate a support bundle 😕).
These seem to be older backups that failed in the past but are still logging, multiple times a second, that they have reached their retry limit?
Other than that, it also reports a successful snapshot:
But immediately afterwards it writes this requeue-backup message:
It also writes this message during the backup job:
The instance-manager reports the same successful snapshot, but only once. No mention of a requeue:
It logs the start of the backup process but no errors during or at the end:
In case I missed something, I’ve attached the logs filtered by the PVC, backup, and snapshot names for the time frame of the backup. Unfortunately, I can’t provide all Longhorn logs for that time frame, since those massively exceed the line limit of our Loki instance. snapshot-gitlab-m-e44c1ef4-44a2-4354-9d40-0da910656705.txt backup-7e4216f214624e33.txt pvc-e8c4857a-33bd-4de8-9377-e92933f624bb.txt
When the progress passes 20%, we are collecting the pod logs at that point. I think we can improve this by adding a timeout for streaming the logs of each pod here. If streaming the logs of one pod takes too long, we can skip it and move on to the next pod instead of getting stuck there forever. Also, I’m not sure if we can stream the logs in reverse time order, so that in the event of a timeout we still have the latest logs. What do you think, @shuo-wu?
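A rough sketch of that idea, using generic client-go calls rather than the actual support-bundle code: give each pod’s log stream its own timeout and request only the most recent lines, so a single flooded pod (like the longhorn-manager above) cannot stall the whole collection.

```go
package main

import (
	"context"
	"fmt"
	"io"
	"os"
	"time"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

// collectPodLogs streams the last `tail` lines of every pod in the namespace,
// giving each pod at most `perPodTimeout` before moving on to the next one.
func collectPodLogs(client kubernetes.Interface, namespace string, tail int64, perPodTimeout time.Duration, out io.Writer) error {
	pods, err := client.CoreV1().Pods(namespace).List(context.Background(), metav1.ListOptions{})
	if err != nil {
		return err
	}
	for _, pod := range pods.Items {
		// Per-pod deadline: a pod that floods its log (or hangs) is skipped
		// instead of blocking the whole collection.
		ctx, cancel := context.WithTimeout(context.Background(), perPodTimeout)
		opts := &corev1.PodLogOptions{TailLines: &tail} // only the most recent lines
		stream, err := client.CoreV1().Pods(namespace).GetLogs(pod.Name, opts).Stream(ctx)
		if err != nil {
			fmt.Fprintf(out, "skipping %s: %v\n", pod.Name, err)
			cancel()
			continue
		}
		fmt.Fprintf(out, "==== %s ====\n", pod.Name)
		if _, err := io.Copy(out, stream); err != nil {
			fmt.Fprintf(out, "log stream for %s ended early: %v\n", pod.Name, err)
		}
		stream.Close()
		cancel()
	}
	return nil
}

func main() {
	config, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		panic(err)
	}
	client := kubernetes.NewForConfigOrDie(config)
	if err := collectPodLogs(client, "longhorn-system", 5000, 2*time.Minute, os.Stdout); err != nil {
		panic(err)
	}
}
```

TailLines doesn’t give true reverse-chronological streaming, but it does ensure the newest lines are what gets collected before any timeout hits.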