longhorn: [BUG] Create backup failed: failed lock lock-*.lck type 1 acquisition
Describe the bug
Run test case test_recurring_job_restored_from_backup_target[nfs]. It is possible to encounter a backup creation failure with the following error message:
error: 'proxyServer=10.42.2.10:8501 destination=10.42.2.10:10000: failed to backup
snapshot backup1--ef2a60d6-5123-4f7d-8da8-ea6eff29154d to backup-8d164319ad7140d1:
rpc error: code = Internal desc = failed to create backup: failed to create backup
to nfs://longhorn-test-nfs-svc.default:/opt/backupstore for volume record-recurring-job:
rpc error: code = Unknown desc = failed lock backupstore/volumes/c5/4e/record-recurring-job/locks/lock-97560b52474d4658.lck
type 1 acquisition'
https://ci.longhorn.io/job/public/job/v1.6.x/job/v1.6.x-longhorn-upgrade-tests-sles-amd64/8/testReport/junit/tests/test_recurring_job/test_recurring_job_restored_from_backup_target_nfs_/
https://ci.longhorn.io/job/private/job/longhorn-tests-regression/6026/console
To Reproduce
Run test case test_recurring_job_restored_from_backup_target[nfs]
Expected behavior
The backup should be created successfully.
Support bundle for troubleshooting
supportbundle_e6e4d284-3981-4fc0-8f00-8051c70902dc_2024-01-22T05-36-02Z.zip
Environment
- Longhorn version: v1.6.x-head (longhorn-manager 0c1a49) or master-head (longhorn-manager 8a8414)
- Impacted volume (PV):
- Installation method (e.g. Rancher Catalog App/Helm/Kubectl): kubectl
- Kubernetes distro (e.g. RKE/K3s/EKS/OpenShift) and version: v1.28.4+k3s1
- Number of control plane nodes in the cluster:
- Number of worker nodes in the cluster:
- Node config
- OS type and version: sles 15-sp5
- Kernel version:
- CPU per node:
- Memory per node:
- Disk type (e.g. SSD/NVMe/HDD):
- Network bandwidth between the nodes (Gbps):
- Underlying Infrastructure (e.g. on AWS/GCE, EKS/GKE, VMWare/KVM, Baremetal):
- Number of Longhorn volumes in the cluster:
Additional context
About this issue
- Original URL
- State: closed
- Created 5 months ago
- Comments: 28 (28 by maintainers)
Verified passed on master-head (longhorn-engine 3c1301) and v1.6.x-head (longhorn-engine 5a9b9d) by running test case test_recurring_job_restored_from_backup_target 300 times. All test cases passed.
Test results:
- master-head: https://ci.longhorn.io/job/private/job/longhorn-tests-regression/6130/ ~ https://ci.longhorn.io/job/private/job/longhorn-tests-regression/6132/
- v1.6.x-head: https://ci.longhorn.io/job/private/job/longhorn-tests-regression/6133/ ~ https://ci.longhorn.io/job/private/job/longhorn-tests-regression/6135/
Leaving the issue closing to @chriscchien in case you're already in the middle of testing.
@mantissahz Agreed. Let me finish the test to see if it passes all 100 runs; I will then create the PR.
Note
This issue could happen when we unmount the NFS share during cleanup, and it could be caused by a network issue as well. In our original implementation the lock has a 150-second expiration time, so normally this is fine. But if we run the test 100 times in quick succession, we may hit the issue because the lock has not expired yet and we unmount the NFS share too soon. The fix is to prevent such an aggressive test.
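For illustration only, here is a minimal Python sketch of how an expiration-based lock file can block the next run; the file layout, field names, and conflict rule are assumptions rather than the actual backupstore implementation, and only the 150-second window comes from the comment above.

```python
import json
import time
from pathlib import Path

LOCK_EXPIRATION_SECS = 150  # expiration window mentioned above


def try_acquire_lock(lock_dir: Path, lock_type: int) -> bool:
    """Refuse a new lock while an unexpired lock of a conflicting type exists.

    Simplified model: each lock-*.lck file stores its type and acquisition time.
    """
    now = time.time()
    for lck in lock_dir.glob("lock-*.lck"):
        info = json.loads(lck.read_text())
        unexpired = (now - info["acquired_at"]) < LOCK_EXPIRATION_SECS
        if unexpired and info["type"] != lock_type:
            # A leftover, not-yet-expired lock from a previous run blocks us:
            # roughly the "failed lock ... type 1 acquisition" error above.
            return False
    name = f"lock-{int(now * 1000):x}.lck"
    (lock_dir / name).write_text(json.dumps({"type": lock_type, "acquired_at": now}))
    return True
```

Under this model, if a test unmounts the NFS backupstore and immediately re-runs against the same volume path, any lock file still inside its 150-second window keeps the new acquisition from succeeding.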
Moved this back to 1.6.0 to see if we can get this fix in.
Since we didn't change any backup logic in v1.6.0, I suspect this behavior has existed for a very long time. From the testing side, we can apply the PR I provided to remove the possibility by using a different volume name. From the Longhorn side, I still need some time to figure out how it happens.
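As a rough sketch of that testing-side workaround (a different volume name per run), the test could generate names like this; `unique_volume_name` is a hypothetical helper, not part of longhorn-tests.

```python
import uuid


def unique_volume_name(prefix: str = "record-recurring-job") -> str:
    # Hypothetical helper: a short random suffix gives each run its own
    # backupstore path, so a stale lock file left behind for the same volume
    # name cannot block the next run's backup.
    return f"{prefix}-{uuid.uuid4().hex[:8]}"
```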
@yangchiu @innobead
I have run the test and confirmed that the root cause is that we get the volume before the backup is ready. From the logs below, the Backup was ready after a while, but the volume status was empty because we fetched the volume before the Backup was ready.
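A simplified sketch of that racy ordering, assuming a longhorn-tests-style client; `by_id_volume`, `snapshotCreate`, `snapshotBackup`, and `backupStatus` are illustrative names here and may differ from the actual test code.

```python
def racy_backup_status(client, volume_name):
    """Read backup status off a volume object fetched before the Backup is ready.

    Sketch only: `client` is assumed to be a Longhorn API client as used by the
    test suite, and the attribute names are illustrative.
    """
    volume = client.by_id_volume(volume_name)   # fetched before the backup exists
    snap = volume.snapshotCreate()
    volume.snapshotBackup(name=snap.name)
    # The Backup may not be ready yet, so this can come back empty, which
    # matches the empty volume status described above.
    return [b for b in volume.backupStatus if b.snapshot == snap.name]
```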
Conclusion
This issue can be attributed to two reasons.
Yes, it is.
@ChanYiLin We should use the latest volume from wait_for_backup_completion, or get the volume by id after wait_for_backup_completion. WDYT?
Tested with the same test image, ran the test case 100 times, and didn't find this issue in v1.5.x: https://ci.longhorn.io/job/private/job/longhorn-tests-regression/6046/
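A sketch of that suggested ordering, assuming the wait_for_backup_completion helper named above lives in the test suite's common module and that `by_id_volume` re-fetches the volume; the import path, signatures, and attribute names are assumptions.

```python
from common import wait_for_backup_completion  # helper named above; import path assumed


def backup_status_after_completion(client, volume_name, snapshot_name):
    """Wait for the Backup to complete, then re-fetch the volume by id so its
    status already reflects the finished backup before anything is asserted."""
    wait_for_backup_completion(client, volume_name, snapshot_name)
    volume = client.by_id_volume(volume_name)   # latest volume object
    return [b for b in volume.backupStatus if b.snapshot == snapshot_name]
```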