longhorn: [BUG] Backup Job returns "Completed" despite running into errors

Describe the bug

The scheduled backup of one of our volumes fails every time it runs. It always gets to 94-95% and then throws an error. Some other volumes appear to have the same issue, but only sporadically. The snapshot and backup do get created, but the backup appears to be corrupted, since a restore does not work. The job is reported as successful/completed in Kubernetes despite the error in the logs.

Doing the backup manually works without issue.

The biggest issue I have with this is that the backup job returns as “Completed” and creates a seemingly viable backup despite running into an error. It would also be nice if we could find the actual issue with the backup itself. I only noticed this because I tried to restore one of the corrupt backups as part of a routine check.

Log or Support bundle

level=error msg=\"failed to run job for volume\" concurrent=1 error=\"failed to complete backupAndCleanup for pvc-e8c4857a-33bd-4de8-9377-e92933f624bb: cannot find the status of the backup for snapshot gitlab-m-8013b8e2-826d-4639-bdb3-92c443c49f83. It might because the engine has restarted\" groups=gitlab-minio job=gitlab-minio labels=\"{\\\"RecurringJob\\\":\\\"gitlab-minio\\\"}\" retain=3 task=backup volume=pvc-e8c4857a-33bd-4de8-9377-e92933f624bb\n

I looked through the longhorn-manager logs and found the following (the backup name refers to the backup that is failing):

level=debug msg=\"Requeue backup-dbb8cedbc1b049d9 due to conflict\" backup=backup-dbb8cedbc1b049d9 controller=longhorn-backup error=\"Operation cannot be fulfilled on backups.longhorn.io \\\"backup-dbb8cedbc1b049d9\\\": the object has been modified; please apply your changes to the latest version and try again\"

The longhorn-manager pod doesn’t appear to have actually restarted, though. Of course, this might be because the health check only restarts the pod after 3 connection failures.

To Reproduce

(I tried to reproduce this myself but was unsuccessful in doing so reliably, since it doesn’t happen every time. I think it has to do with the duration of the backup job, but I can’t be sure.)

Steps to reproduce the behavior:

  1. Create a large volume (500 GB+) (the size is fairly arbitrary, but I’ve had more luck reproducing this with larger volumes)
  2. Set a MinIO/S3 backup target (not sure if this is relevant, since I unfortunately don’t have the option of trying a different backup target right now)
  3. Create Recurring Backup Job for Group (Concurrent: 1, Retain: 1)
  4. Assign volume to group
  5. Wait for backup job to run
  6. See error

Environment

  • Longhorn version: 1.3.0
  • Installation method: Helm
  • Kubernetes distro (e.g. RKE/K3s/EKS/OpenShift) and version: RKE 1.3.12 - Kubernetes 1.22.10
  • Node config
    • Disk type (e.g. SSD/NVMe): SSD
  • Underlying Infrastructure: Baremetal

About this issue

  • State: closed
  • Created 2 years ago
  • Comments: 20 (13 by maintainers)

Most upvoted comments

For this issue, we will only focus on making the job fail if the execution inside it fails, to prevent any confusion.

/app/recurring_job.go#L275-L314

func (job *Job) run() (err error) {
	job.logger.Info("job starts running")

	volumeAPI := job.api.Volume
	volumeName := job.volumeName
	volume, err := volumeAPI.ById(volumeName)
	if err != nil {
		return errors.Wrapf(err, "could not get volume %v", volumeName)
	}

	if len(volume.Controllers) > 1 {
		return fmt.Errorf("cannot run job for volume %v that is using %v engines", volume.Name, len(volume.Controllers))
	}

	if volume.State != string(longhorn.VolumeStateAttached) && volume.State != string(longhorn.VolumeStateDetached) {
		return fmt.Errorf("volume %v is in an invalid state for recurring job: %v. Volume must be in state Attached or Detached", volumeName, volume.State)
	}

	// only recurring job types `snapshot` and `backup` need to check if old snapshots can be deleted or not before creating
	switch job.task {
	case longhorn.RecurringJobTypeSnapshot, longhorn.RecurringJobTypeBackup:
		if err := job.doSnapshotCleanup(false); err != nil {
			return err
		}
	}

	switch job.task {
	case longhorn.RecurringJobTypeFilesystemTrim:
		job.logger.Infof("Running recurring filesystem trim for volume %v", volumeName)
		return job.doRecurringFilesystemTrim(volume)

	case longhorn.RecurringJobTypeBackup, longhorn.RecurringJobTypeBackupForceCreate:
		job.logger.Infof("Running recurring backup for volume %v", volumeName)
		return job.doRecurringBackup()

	default:
		job.logger.Infof("Running recurring snapshot for volume %v", volumeName)
		return job.doRecurringSnapshot()
	}
}
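
In other words, the error returned by run() needs to surface as a non-zero exit code of the recurring-job pod, so that the Kubernetes Job records the execution as Failed instead of Completed. A minimal sketch of that idea, assuming a plain entrypoint; runRecurringJob and the error text are illustrative stand-ins, not Longhorn’s actual code:

package main

import (
	"errors"
	"os"

	"github.com/sirupsen/logrus"
)

// runRecurringJob stands in for the job.run() shown above; here it simply
// returns an error to demonstrate the failure path.
func runRecurringJob() error {
	return errors.New("failed to complete backupAndCleanup: cannot find the status of the backup")
}

func main() {
	if err := runRecurringJob(); err != nil {
		logrus.WithError(err).Error("failed to run recurring job")
		// Exiting non-zero is what makes Kubernetes count this Job execution
		// as Failed instead of Completed, even though the error is also logged.
		os.Exit(1)
	}
	logrus.Info("recurring job completed")
}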

Same issue happened last night. This is the full error message:

{"status":"success","data":{"resultType":"streams","result":[{"stream":{"job":"longhorn-system/gitlab-minio","namespace":"longhorn-system","node_name":"nwp-kube02","pod":"gitlab-minio-27656475-q24md","app":"gitlab-minio","cluster":"nwpcluster01","container":"gitlab-minio","filename":"/var/log/pods/longhorn-system_gitlab-minio-27656475-q24md_9cd00d24-75c0-4a6b-b90a-068b48cb2458/gitlab-minio/0.log"},"values":[["1659390894526196271","{\"log\":\"time=\\\"2022-08-01T21:54:54Z\\\" level=error msg=\\\"failed to run job for volume\\\" concurrent=10 error=\\\"failed to complete backupAndCleanup for pvc-e8c4857a-33bd-4de8-9377-e92933f624bb: cannot find the status of the backup for snapshot gitlab-m-e44c1ef4-44a2-4354-9d40-0da910656705. It might because the engine has restarted\\\" groups=gitlab-minio job=gitlab-minio labels=\\\"{\\\\\\\"RecurringJob\\\\\\\":\\\\\\\"gitlab-minio\\\\\\\"}\\\" retain=3 task=backup volume=pvc-e8c4857a-33bd-4de8-9377-e92933f624bb\\n\",\"stream\":\"stderr\",\"time\":\"2022-08-01T21:54:54.45993679Z\"}"]]}],"stats":{"summary":{"bytesProcessedPerSecond":841161348,"linesProcessedPerSecond":3388879,"totalBytesProcessed":1316298073,"totalLinesProcessed":5303115,"execTime":1.564858009,"queueTime":0.000754722,"subqueries":5,"totalEntriesReturned":1},"querier":{"store":{"totalChunksRef":282,"totalChunksDownloaded":282,"chunksDownloadTime":152922480,"chunk":{"headChunkBytes":0,"headChunkLines":0,"decompressedBytes":1316298073,"decompressedLines":5303115,"compressedBytes":75990081,"totalDuplicates":0}}},"ingester":{"totalReached":5,"totalChunksMatched":0,"totalBatches":0,"totalLinesSent":0,"store":{"totalChunksRef":0,"totalChunksDownloaded":0,"chunksDownloadTime":0,"chunk":{"headChunkBytes":0,"headChunkLines":0,"decompressedBytes":0,"decompressedLines":0,"compressedBytes":0,"totalDuplicates":0}}}}}}

As you can see, the job returns a “Success” status despite running into an error. There are no other error messages in the job pod itself.


The longhorn-manager log is completely filled with these (it seems to write them multiple times a second for hours, which might explain why I can’t generate a support bundle 😕):

level=error msg=\"Max retry period 24h0m0s reached in exponential backoff timer\" backup=backup-7e4216f214624e33 controller=longhorn-backup 

These seem to refer to older backups that failed in the past but are still logging, multiple times a second, that they have reached their retry limit?

Other than that, it also reports a successful snapshot:

"time=\"2022-08-01T21:15:46Z\" level=info msg=\"Event(v1.ObjectReference{Kind:\\\"Volume\\\", Namespace:\\\"longhorn-system\\\", Name:\\\"pvc-e8c4857a-33bd-4de8-9377-e92933f624bb\\\", UID:\\\"391d9d39-ff9c-4023-aad9-2f9f763b4f2a\\\", APIVersion:\\\"longhorn.io/v1beta2\\\", ResourceVersion:\\\"571904558\\\", FieldPath:\\\"\\\"}): type: 'Normal' reason: 'InProgress' Snapshot gitlab-m-e44c1ef4-44a2-4354-9d40-0da910656705 backup backup-26256a9ae9184243 label map[KubernetesStatus:{\\\"pvName\\\":\\\"pvc-e8c4857a-33bd-4de8-9377-e92933f624bb\\\",\\\"pvStatus\\\":\\\"Bound\\\",\\\"namespace\\\":\\\"gitlab\\\",\\\"pvcName\\\":\\\"gitlab-minio\\\",\\\"lastPVCRefAt\\\":\\\"\\\",\\\"workloadsStatus\\\":[{\\\"podName\\\":\\\"gitlab-minio-59b7dd4d56-dq2gq\\\",\\\"podStatus\\\":\\\"Running\\\",\\\"workloadName\\\":\\\"gitlab-minio-59b7dd4d56\\\",\\\"workloadType\\\":\\\"ReplicaSet\\\"}],\\\"lastPodRefAt\\\":\\\"\\\"} RecurringJob:gitlab-minio]\"\n"
"time=\"2022-08-01T21:54:52Z\" level=info msg=\"Event(v1.ObjectReference{Kind:\\\"Volume\\\", Namespace:\\\"longhorn-system\\\", Name:\\\"pvc-e8c4857a-33bd-4de8-9377-e92933f624bb\\\", UID:\\\"391d9d39-ff9c-4023-aad9-2f9f763b4f2a\\\", APIVersion:\\\"longhorn.io/v1beta2\\\", ResourceVersion:\\\"571945009\\\", FieldPath:\\\"\\\"}): type: 'Normal' reason: 'Completed' Snapshot gitlab-m-e44c1ef4-44a2-4354-9d40-0da910656705 backup backup-26256a9ae9184243 label map[KubernetesStatus:{\\\"pvName\\\":\\\"pvc-e8c4857a-33bd-4de8-9377-e92933f624bb\\\",\\\"pvStatus\\\":\\\"Bound\\\",\\\"namespace\\\":\\\"gitlab\\\",\\\"pvcName\\\":\\\"gitlab-minio\\\",\\\"lastPVCRefAt\\\":\\\"\\\",\\\"workloadsStatus\\\":[{\\\"podName\\\":\\\"gitlab-minio-59b7dd4d56-dq2gq\\\",\\\"podStatus\\\":\\\"Running\\\",\\\"workloadName\\\":\\\"gitlab-minio-59b7dd4d56\\\",\\\"workloadType\\\":\\\"ReplicaSet\\\"}],\\\"lastPodRefAt\\\":\\\"\\\"} RecurringJob:gitlab-minio]\"\n"

But immediately afterwards it writes this requeue message for the backup:

"time=\"2022-08-01T21:54:53Z\" level=debug msg=\"Requeue backup-26256a9ae9184243 due to conflict\" backup=backup-26256a9ae9184243 controller=longhorn-backup error=\"Operation cannot be fulfilled on backups.longhorn.io \\\"backup-26256a9ae9184243\\\": the object has been modified; please apply your changes to the latest version and try again\" node=nwp-kube02\n"
"time=\"2022-08-01T21:54:53Z\" level=info msg=\"Event(v1.ObjectReference{Kind:\\\"Volume\\\", Namespace:\\\"longhorn-system\\\", Name:\\\"pvc-e8c4857a-33bd-4de8-9377-e92933f624bb\\\", UID:\\\"391d9d39-ff9c-4023-aad9-2f9f763b4f2a\\\", APIVersion:\\\"longhorn.io/v1beta2\\\", ResourceVersion:\\\"571945020\\\", FieldPath:\\\"\\\"}): type: 'Normal' reason: '' Snapshot gitlab-m-e44c1ef4-44a2-4354-9d40-0da910656705 backup backup-26256a9ae9184243 label map[KubernetesStatus:{\\\"pvName\\\":\\\"pvc-e8c4857a-33bd-4de8-9377-e92933f624bb\\\",\\\"pvStatus\\\":\\\"Bound\\\",\\\"namespace\\\":\\\"gitlab\\\",\\\"pvcName\\\":\\\"gitlab-minio\\\",\\\"lastPVCRefAt\\\":\\\"\\\",\\\"workloadsStatus\\\":[{\\\"podName\\\":\\\"gitlab-minio-59b7dd4d56-dq2gq\\\",\\\"podStatus\\\":\\\"Running\\\",\\\"workloadName\\\":\\\"gitlab-minio-59b7dd4d56\\\",\\\"workloadType\\\":\\\"ReplicaSet\\\"}],\\\"lastPodRefAt\\\":\\\"\\\"} RecurringJob:gitlab-minio]\"\n"
"time=\"2022-08-01T21:54:55Z\" level=info msg=\"Event(v1.ObjectReference{Kind:\\\"Volume\\\", Namespace:\\\"longhorn-system\\\", Name:\\\"pvc-e8c4857a-33bd-4de8-9377-e92933f624bb\\\", UID:\\\"391d9d39-ff9c-4023-aad9-2f9f763b4f2a\\\", APIVersion:\\\"longhorn.io/v1beta2\\\", ResourceVersion:\\\"571945020\\\", FieldPath:\\\"\\\"}): type: 'Normal' reason: 'Completed' Snapshot gitlab-m-e44c1ef4-44a2-4354-9d40-0da910656705 backup backup-26256a9ae9184243 label map[KubernetesStatus:{\\\"pvName\\\":\\\"pvc-e8c4857a-33bd-4de8-9377-e92933f624bb\\\",\\\"pvStatus\\\":\\\"Bound\\\",\\\"namespace\\\":\\\"gitlab\\\",\\\"pvcName\\\":\\\"gitlab-minio\\\",\\\"lastPVCRefAt\\\":\\\"\\\",\\\"workloadsStatus\\\":[{\\\"podName\\\":\\\"gitlab-minio-59b7dd4d56-dq2gq\\\",\\\"podStatus\\\":\\\"Running\\\",\\\"workloadName\\\":\\\"gitlab-minio-59b7dd4d56\\\",\\\"workloadType\\\":\\\"ReplicaSet\\\"}],\\\"lastPodRefAt\\\":\\\"\\\"} RecurringJob:gitlab-minio]\"\n"

It also writes this message during the backup job:

"time=\"2022-08-01T21:54:55Z\" level=info msg=\"Found 1 backups in the backup target that do not exist in the cluster and need to be pulled\" backupVolume=pvc-e8c4857a-33bd-4de8-9377-e92933f624bb controller=longhorn-backup-volume node=nwp-kube02\n"

The instance-manager reports the same successful snapshot, but only once. No mention of a requeue:

"[pvc-e8c4857a-33bd-4de8-9377-e92933f624bb-e-0769cdd6] time=\"2022-08-01T21:15:02Z\" level=info msg=\"Starting snapshot\" snapshot=gitlab-m-e44c1ef4-44a2-4354-9d40-0da910656705 volume=pvc-e8c4857a-33bd-4de8-9377-e92933f624bb\n"
"[pvc-e8c4857a-33bd-4de8-9377-e92933f624bb-e-0769cdd6] time=\"2022-08-01T21:15:02Z\" level=info msg=\"Requesting system sync before snapshot\" snapshot=gitlab-m-e44c1ef4-44a2-4354-9d40-0da910656705 volume=pvc-e8c4857a-33bd-4de8-9377-e92933f624bb\n"
"[pvc-e8c4857a-33bd-4de8-9377-e92933f624bb-e-0769cdd6] time=\"2022-08-01T21:15:02Z\" level=info msg=\"Starting to snapshot: 10.42.2.91:10075 gitlab-m-e44c1ef4-44a2-4354-9d40-0da910656705 UserCreated true Created at 2022-08-01T21:15:02Z, Labels map[RecurringJob:gitlab-minio]\"\n"
"[pvc-e8c4857a-33bd-4de8-9377-e92933f624bb-e-0769cdd6] time=\"2022-08-01T21:15:02Z\" level=info msg=\"Starting to snapshot: 10.42.1.21:10435 gitlab-m-e44c1ef4-44a2-4354-9d40-0da910656705 UserCreated true Created at 2022-08-01T21:15:02Z, Labels map[RecurringJob:gitlab-minio]\"\n"
"[pvc-e8c4857a-33bd-4de8-9377-e92933f624bb-e-0769cdd6] time=\"2022-08-01T21:15:02Z\" level=info msg=\"Finished to snapshot: 10.42.1.21:10435 gitlab-m-e44c1ef4-44a2-4354-9d40-0da910656705 UserCreated true Created at 2022-08-01T21:15:02Z, Labels map[RecurringJob:gitlab-minio]\"\n"
"[pvc-e8c4857a-33bd-4de8-9377-e92933f624bb-e-0769cdd6] time=\"2022-08-01T21:15:02Z\" level=info msg=\"Finished to snapshot: 10.42.0.192:10480 gitlab-m-e44c1ef4-44a2-4354-9d40-0da910656705 UserCreated true Created at 2022-08-01T21:15:02Z, Labels map[RecurringJob:gitlab-minio]\"\n"
"[pvc-e8c4857a-33bd-4de8-9377-e92933f624bb-e-0769cdd6] time=\"2022-08-01T21:15:02Z\" level=info msg=\"Finished to snapshot: 10.42.2.91:10075 gitlab-m-e44c1ef4-44a2-4354-9d40-0da910656705 UserCreated true Created at 2022-08-01T21:15:02Z, Labels map[RecurringJob:gitlab-minio]\"\n"
"time=\"2022-08-01T21:15:02Z\" level=info msg=\"Finished snapshot\" snapshot=gitlab-m-e44c1ef4-44a2-4354-9d40-0da910656705 volume=pvc-e8c4857a-33bd-4de8-9377-e92933f624bb\n"

It logs the start of the backup process but no errors during or at the end:

"[longhorn-instance-manager] time=\"2022-08-01T21:15:02Z\" level=debug msg=\"Backing up snapshots gitlab-m-e44c1ef4-44a2-4354-9d40-0da910656705 to backup backup-26256a9ae9184243\" serviceURL=\"10.42.1.19:10027\"\n"
"[longhorn-instance-manager] time=\"2022-08-01T21:15:02Z\" level=info msg=\"Backing up gitlab-m-e44c1ef4-44a2-4354-9d40-0da910656705 on tcp://10.42.1.21:10435, to s3://nwp-kube-cluster01@us-east-1/\"\n"

In case I missed something, I’ve attached the logs filtered by the PVC, backup, and snapshot names for the time frame of the backup. Unfortunately I can’t provide all Longhorn logs for that time frame, since they massively exceed the line limit of our Loki instance.
snapshot-gitlab-m-e44c1ef4-44a2-4354-9d40-0da910656705.txt
backup-7e4216f214624e33.txt
pvc-e8c4857a-33bd-4de8-9377-e92933f624bb.txt

It seems that the logs in the Longhorn system are too large and there is something wrong with this log-collection feature…

When the progress passes 20%, we collect the logs of the pods at that point. I think we can improve this by having a timeout for streaming the logs of each pod here. If it takes too long to stream the logs of one pod, we can skip it and move on to the next pod instead of getting stuck there forever. Also, I’m not sure if we can stream the logs in reverse time order so that, in the event of a timeout, we still have some of the latest logs. What do you think @shuo-wu?
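
The timeout idea could look roughly like the following sketch using client-go directly. This is only an illustration of the concept, not the actual support-bundle code path: the helper name collectPodLogs, the 30-second per-pod timeout, and the 10,000-line tail are assumptions, and for brevity it assumes single-container pods (a real implementation would also pass the container name to GetLogs).

package main

import (
	"context"
	"fmt"
	"io"
	"os"
	"time"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

// collectPodLogs streams the most recent log lines of every pod in the
// namespace, with a per-pod timeout so one slow or oversized log cannot
// stall the whole collection.
func collectPodLogs(ctx context.Context, client kubernetes.Interface, namespace string, out io.Writer) error {
	pods, err := client.CoreV1().Pods(namespace).List(ctx, metav1.ListOptions{})
	if err != nil {
		return err
	}
	tailLines := int64(10000) // only request the newest lines so huge logs stay bounded
	for _, pod := range pods.Items {
		// Per-pod timeout: skip or truncate this pod instead of blocking forever.
		podCtx, cancel := context.WithTimeout(ctx, 30*time.Second)
		req := client.CoreV1().Pods(namespace).GetLogs(pod.Name, &corev1.PodLogOptions{TailLines: &tailLines})
		stream, err := req.Stream(podCtx)
		if err != nil {
			fmt.Fprintf(os.Stderr, "skipping logs of pod %s: %v\n", pod.Name, err)
			cancel()
			continue
		}
		if _, err := io.Copy(out, stream); err != nil {
			// Timeout hit mid-stream: keep whatever was copied and move on.
			fmt.Fprintf(os.Stderr, "truncated logs of pod %s: %v\n", pod.Name, err)
		}
		stream.Close()
		cancel()
	}
	return nil
}

func main() {
	config, err := clientcmd.BuildConfigFromFlags("", os.Getenv("KUBECONFIG"))
	if err != nil {
		panic(err)
	}
	client := kubernetes.NewForConfigOrDie(config)
	if err := collectPodLogs(context.Background(), client, "longhorn-system", os.Stdout); err != nil {
		panic(err)
	}
}

Requesting only a bounded tail also roughly covers the “latest logs” concern: even if the timeout cuts a stream short, the newest lines are the ones that were requested.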