longhorn: [BUG] Restoring volume stuck forever if the backup is already deleted.

Describe the bug

The restoring volume is stuck in a null state forever if the backup has already been deleted.

To Reproduce

Steps to reproduce the behavior:

  1. Create a volume and write 100 MB of data to it.
  2. Take a backup.
  3. Restore the backup and immediately delete the backup.
  4. The restoring volume gets stuck and never recovers or becomes faulted.

Expected behavior

The restoring volume should become faulted after retrying for some time. There should be a timeout for the retries.

Log

longhorn-support-bundle_e636dffc-08d6-4fd1-8aad-7a5fe167b4d2_2020-10-08T22-41-51Z.zip

  • Backup volume - pvc-f8473372-108d-4667-a54b-23b99595de66
  • Backup name - backup-708a7f01339f467a
  • Restoring volume - restore-2
  • Time - ~2020-10-08 22:24:00

Environment:

  • Longhorn version: Longhorn Master - 10/08/2020
  • Kubernetes version: 19.2
  • Node OS type and version: Ubuntu 18.04

Additional context

The backupfrom of the restoring volume refers to a non-existent backup because the backup has been deleted.

About this issue

  • State: closed
  • Created 4 years ago
  • Comments: 16 (15 by maintainers)

Most upvoted comments

@chriscchien

Got it. This is different from backup deletion; it is a backupVolume deletion case.

The backupVolume deletion code does the following:

  1. Deletes all backup CRs and triggers deletion of the backups in the backupstore
  2. Deletes the backup volume in the backupstore

Deleting all backup CRs should succeed, but the deletions in the backupstore are rejected until the restoration process is complete. This behavior is expected.

The restoration is successful, but the volume state is stuck in attached with frontendDisabled=true because of this error:

time="2023-03-07T14:23:37Z" level=warning msg="Dropping Longhorn volume longhorn-system/bbb out of the queue" controller=longhorn-volume error="failed to sync longhorn-system/bbb: failed to reconcile volume state for bbb: failed to get backup volume: bbb: backupvolume.longhorn.io \"pvc-0639d954-0043-4ae8-a274-05ab7731204c\" not found" node=rancher60-worker1

The root cause is the check in checkForAutoDetachment. We should ignore the IsNotFound error.

cc @weizhe0422

Yeah.

I think ignoring the IsNotFound error might not be a good solution. We probably need to check whether the backupVolume is being used for any restoration before removing it from the backupstore, which would be consistent with backup deletion.
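As an illustration of that alternative, here is a small sketch of an in-use check that runs before the backup volume is removed. The `volume` type, its fields, and both functions are hypothetical simplifications and do not mirror Longhorn's real types.

```go
package main

import "fmt"

// volume is a hypothetical, simplified view of a Longhorn volume.
type volume struct {
	Name             string
	BackupVolumeName string // backup volume being restored from, empty if none
	Restoring        bool
}

// restoringFrom returns the names of volumes still restoring from the given
// backup volume.
func restoringFrom(volumes []volume, backupVolumeName string) []string {
	var users []string
	for _, v := range volumes {
		if v.Restoring && v.BackupVolumeName == backupVolumeName {
			users = append(users, v.Name)
		}
	}
	return users
}

// canDeleteBackupVolume rejects the deletion while any restoration still
// depends on the backup volume.
func canDeleteBackupVolume(volumes []volume, backupVolumeName string) error {
	if users := restoringFrom(volumes, backupVolumeName); len(users) > 0 {
		return fmt.Errorf("backup volume %s is in use by restoring volumes %v", backupVolumeName, users)
	}
	return nil
}

func main() {
	volumes := []volume{
		{Name: "bbb", BackupVolumeName: "pvc-0639d954-0043-4ae8-a274-05ab7731204c", Restoring: true},
	}
	fmt.Println(canDeleteBackupVolume(volumes, "pvc-0639d954-0043-4ae8-a274-05ab7731204c"))
}
```

The caller would retry the deletion later, once no volume is restoring from that backup volume.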

Hi, in longhorn-manager master 44f425, I can delete a backup from the UI while a volume is restoring. After the restoration completes, the volume stays attached with ROBUSTNESS = healthy and Ready for workload = Not Ready. Do we have a definition of whether a backup can be deleted while a volume is restoring?

@chriscchien A bit confused. I didn’t see any backupDelete API call in the support bundle. I tried deleting the backup while restoring it again. The backup was deleted after restoration, and the volume became detached in the end.

Hi @derekbit, I can only reproduce the situation by clicking Delete All Backups in the UI while volume restoration is in progress. From the terminal I can see that the backup CR is deleted (it disappears) too. Below is the support bundle: supportbundle_4cb5bb67-7d01-4b27-91ef-775a969a3043_2023-03-07T13-54-09Z.zip

If I delete the backup CR by command or delete a single backup in the UI during volume restoration, the restoration completes, the backup is deleted after restoration, and the volume becomes detached.

Closing this ticket because in longhorn-manager master 44f425 the actions below worked as expected:

  • Restore volume from backup, then immediately delete the backup.
    • Volume becomes faulted immediately
  • Restore volume from backup, then delete the backup while restoring is in progress.
    • Backup is deleted after the restoration completes, and the volume is in detached state after restoration

Because of the latest PR, the behavior won’t be changed.

cc @longhorn/qa

@mantissahz This needs to be backported to 1.3 and 1.2.