longhorn: [BUG] Create backup failed: failed lock lock-*.lck type 1 acquisition

Describe the bug

Run the test case test_recurring_job_restored_from_backup_target[nfs]. It's possible to encounter a backup creation failure with the following error message:

  error: 'proxyServer=10.42.2.10:8501 destination=10.42.2.10:10000: failed to backup
    snapshot backup1--ef2a60d6-5123-4f7d-8da8-ea6eff29154d to backup-8d164319ad7140d1:
    rpc error: code = Internal desc = failed to create backup: failed to create backup
    to nfs://longhorn-test-nfs-svc.default:/opt/backupstore for volume record-recurring-job:
    rpc error: code = Unknown desc = failed lock backupstore/volumes/c5/4e/record-recurring-job/locks/lock-97560b52474d4658.lck
    type 1 acquisition'

https://ci.longhorn.io/job/public/job/v1.6.x/job/v1.6.x-longhorn-upgrade-tests-sles-amd64/8/testReport/junit/tests/test_recurring_job/test_recurring_job_restored_from_backup_target_nfs_/
https://ci.longhorn.io/job/private/job/longhorn-tests-regression/6026/console

To Reproduce

Run test case test_recurring_job_restored_from_backup_target[nfs]

Expected behavior

Support bundle for troubleshooting

supportbundle_e6e4d284-3981-4fc0-8f00-8051c70902dc_2024-01-22T05-36-02Z.zip

Environment

  • Longhorn version: v1.6.x-head (longhorn-manager 0c1a49) or master-head (longhorn-manager 8a8414)
  • Impacted volume (PV):
  • Installation method (e.g. Rancher Catalog App/Helm/Kubectl): kubectl
  • Kubernetes distro (e.g. RKE/K3s/EKS/OpenShift) and version: v1.28.4+k3s1
    • Number of control plane nodes in the cluster:
    • Number of worker nodes in the cluster:
  • Node config
  • OS type and version: SLES 15-SP5
    • Kernel version:
    • CPU per node:
    • Memory per node:
    • Disk type (e.g. SSD/NVMe/HDD):
    • Network bandwidth between the nodes (Gbps):
  • Underlying Infrastructure (e.g. on AWS/GCE, EKS/GKE, VMWare/KVM, Baremetal):
  • Number of Longhorn volumes in the cluster:

Additional context

About this issue

  • Original URL
  • State: closed
  • Created 5 months ago
  • Comments: 28 (28 by maintainers)

Most upvoted comments

Verified passed on master-head (longhorn-engine 3c1301) and v1.6.x-head (longhorn-engine 5a9b9d) by running test case test_recurring_job_restored_from_backup_target 300 times. All test cases passed.

Test results:

  • master-head: https://ci.longhorn.io/job/private/job/longhorn-tests-regression/6130/ ~ https://ci.longhorn.io/job/private/job/longhorn-tests-regression/6132/
  • v1.6.x-head: https://ci.longhorn.io/job/private/job/longhorn-tests-regression/6133/ ~ https://ci.longhorn.io/job/private/job/longhorn-tests-regression/6135/

Leaving the issue closing to @chriscchien in case you're already in the middle of testing.

@mantissahz Agreed. Let me finish the test to see if it passes all 100 runs; I will then create the PR.

Note

This issue can happen when we unmount the NFS share during cleanup, and it could be caused by a network issue as well. In our original implementation, the lock has a 150-second expiration time, so a single run is fine. But if we run the test 100 times in quick succession, we may hit the issue because the lock has not expired yet and we unmount the NFS share too quickly. The fix is to prevent such an aggressive test.
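
As an illustration only (not the actual fix), a test cleanup could wait for the backupstore lock files to disappear or expire before unmounting the NFS share. This is a minimal sketch; the directory layout mirrors the path in the error message above, and the helper name and timings are hypothetical:

    import os
    import time

    LOCK_EXPIRATION_SECONDS = 150  # backupstore lock lifetime mentioned above

    def wait_for_backupstore_locks_released(locks_dir,
                                            timeout=LOCK_EXPIRATION_SECONDS + 30,
                                            poll_interval=5):
        # e.g. locks_dir = "<nfs mount>/backupstore/volumes/c5/4e/<volume>/locks"
        deadline = time.time() + timeout
        while time.time() < deadline:
            lock_files = [f for f in os.listdir(locks_dir) if f.endswith(".lck")]
            if not lock_files:
                return True  # no *.lck files left, safe to unmount the backupstore
            time.sleep(poll_interval)
        return False  # locks still present after the expiration window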

Moved this back to 1.6.0 to see if we can get this fix in.

Since we didn't change any backup logic in v1.6.0, I suspect this behavior has existed for a very long time. From the testing side, we can apply the PR I provided to remove the possibility by using a different volume name. From the Longhorn side, I still need some time to figure out how it happens.
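
A minimal sketch of the volume-naming idea, assuming a hypothetical helper (the actual PR may generate names differently):

    import random
    import string

    def generate_test_volume_name(prefix="record-recur"):
        # A random suffix keeps each run's backupstore volume directory
        # (and its lock files) separate from the previous run's.
        suffix = "".join(random.choices(string.ascii_lowercase + string.digits, k=6))
        return f"{prefix}-{suffix}"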

@yangchiu @innobead

I have run the test and confirmed that the root cause is that we get the volume before the backup is ready. In the logs below, the Backup becomes ready after a while, but the volume's backup status is empty because we fetched the volume before the Backup was ready.

  test_recurring_job.py::test_recurring_job_restored_from_backup_target[nfs-43-100] wait_for_volume_status: record-recur-longhorn-testvol-rclsga
  wait_for_volume_status: record-recur-longhorn-testvol-rclsga
  wait_for_volume_status: record-recur-longhorn-testvol-rclsga
  Volume record-recur-longhorn-testvol-rclsga
  Backup {'backupURL': '', 'error': '', 'progress': 0, 'replica': '', 'size': '', 'snapshot': '', 'state': ''}
  Volume record-recur-longhorn-testvol-rclsga
  Backup {'backupURL': '', 'error': '', 'progress': 0, 'replica': '', 'size': '', 'snapshot': '', 'state': ''}
  Volume record-recur-longhorn-testvol-rclsga
  Backup {'backupURL': '', 'error': '', 'progress': 0, 'replica': '', 'size': '', 'snapshot': '', 'state': ''}
  Volume record-recur-longhorn-testvol-rclsga
  Backup {'backupURL': 'nfs://longhorn-test-nfs-svc.default:/opt/backupstore?backup=backup-90a16d4a5ef544ba&volume=record-recur-longhorn-testvol-rclsga', 'error': '', 'progress': 100, 'replica': 'record-recur-longhorn-testvol-rclsga-r-d9b01af4', 'size': '2097152', 'snapshot': 'backup1--877b36f7-f959-4a26-967c-111dabce7492', 'state': 'Completed'}
  volume.backupStatus: [{'backupURL': '', 'error': '', 'progress': 0, 'replica': '', 'size': '', 'snapshot': '', 'state': ''}]
  FAILED
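
For context, a simplified sketch of what a completion wait that returns the refreshed volume could look like; the real wait_for_backup_completion helper in longhorn-tests differs in details such as state strings, retry settings, and signature:

    import time

    def wait_for_backup_completion(client, volume_name, snapshot_name=None,
                                   retry_count=150, retry_interval=2):
        # Re-fetch the volume on every iteration so the returned object
        # carries the up-to-date backupStatus, then hand it back to the caller.
        for _ in range(retry_count):
            volume = client.by_id_volume(volume_name)
            for b in volume.backupStatus:
                if snapshot_name and snapshot_name not in b.snapshot:
                    continue
                if b.state == "Completed" and b.backupURL != "":
                    return volume
            time.sleep(retry_interval)
        raise AssertionError(
            "backup of volume %s did not complete in time" % volume_name)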

Conclusion

This issue comes down to two causes:

  1. We use the same volume name across tests. When the next test is creating a Backup, the previous test might still be deleting its Backup at the same time, which leads to the lock issue. => We should use a random volume name.
  2. We should get the volume only after the Backup is ready; otherwise Volume.BackupStatus will not be correct.

We shouldn't fetch the volume first and then wait for backup completion, because the volume object we fetched might not have the latest backupStatus; the backup may not have completed yet.

Yes, it is.

@ChanYiLin We should use the latest volume from wait_for_backup_completion, or get the volume by id after wait_for_backup_completion. WDYT?

    complete_backup_1_count = 0
    restore_snapshot_name = ""
    # Don't fetch the volume before the backup is ready:
    #volume = client.by_id_volume(volume_name1)
    # Use the volume returned by wait_for_backup_completion so its
    # backupStatus is up to date.
    volume = wait_for_backup_completion(client, volume_name1)
    for b in volume.backupStatus:
        if back1+"-" in b.snapshot:
            complete_backup_1_count += 1
            restore_snapshot_name = b.snapshot

    assert complete_backup_1_count == 1

Wondering if this issue happened in the v1.5.x regression tests? @yangchiu

Tested with the same test image and ran the test case 100 times; didn't find this issue in v1.5.x: https://ci.longhorn.io/job/private/job/longhorn-tests-regression/6046/