longhorn: [BUG] test_recurring_jobs_allow_detached_volume failed
Describe the bug (🐛 if you encounter this issue)
I have observed that the test cases test_recurring_jobs_allow_detached_volume and test_recurring_jobs_when_volume_detached_unexpectedly are failing intermittently on SLES and SLE-Micro. The volume backup status shows Error. I also tried to reproduce the issue manually and was able to reproduce it.
To Reproduce
Steps to reproduce the behavior:
- https://ci.longhorn.io/job/public/job/v1.5.x/job/v1.5.x-longhorn-tests-sles-amd64/
- Verify the test result of `test_recurring_jobs_allow_detached_volume`

Alternatively,
- Given `allow-recurring-job-while-volume-detached` set to `true`.
- And a volume created and attached.
- And 50 MB of data written to the volume.
- And the volume detached.
- When a recurring job is created that runs every minute.
- And wait for the backup to complete.
Expected behavior
We should have consistent test results on all distros.
Log or Support bundle
supportbundle_623f96f1-0ad6-487b-b6e2-7f961da0af8f_2023-06-14T08-34-19Z.zip
supportbundle_623f96f1-0ad6-487b-b6e2-7f961da0af8f_2023-06-14T14-23-13Z.zip
Environment
- Longhorn version: v1.5.0-rc2
- Installation method (e.g. Rancher Catalog App/Helm/Kubectl): Kubectl
- Kubernetes distro (e.g. RKE/K3s/EKS/OpenShift) and version: v1.25.9+k3s1
- Number of management node in the cluster: 1
- Number of worker node in the cluster: 3
- Node config
  - OS type and version: SLE-Micro 5.3
  - CPU per node: 4
  - Memory per node: 16G
  - Disk type (e.g. SSD/NVMe): SSD
  - Network bandwidth between the nodes:
- Underlying Infrastructure (e.g. on AWS/GCE, EKS/GKE, VMWare/KVM, Baremetal): AWS
- Number of Longhorn volumes in the cluster:
Additional context
About this issue
- State: closed
- Created a year ago
- Comments: 15 (13 by maintainers)
Thanks @ChanYiLin for the excellent explanation! I agree with your analysis that the error popped from `enableBackupMonitor`, which is part of the function `checkMonitor`. Your fix is also very clever! 🚀

From the log I found that it deleted the attachment, then the update failed, so for the next run it re-does the sync.
The error happens in `checkMonitor()` when enabling the `syncMonitor`: https://github.com/longhorn/longhorn-manager/blob/master/controller/backup_controller.go#L646-L652
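To make the failure path concrete, here is a rough, hypothetical Go sketch of the flow described above (the real code lives in longhorn-manager's `controller/backup_controller.go` with different signatures; the `backup` struct, `reconcile`, and the simulated connection error are illustrative only): an error returned while enabling the monitor bubbles up through `checkMonitor`, and the reconcile then records the backup as `Error`, which is the state the test observes.

```go
package example

import (
	"errors"
	"fmt"
)

// Hypothetical, simplified stand-in for the backup status tracked by the controller.
type backup struct {
	State   string
	Message string
}

// enableBackupMonitor stands in for the call that failed in the logs,
// e.g. because the engine connection was not ready yet (simulated here).
func enableBackupMonitor() error {
	return errors.New("failed to connect to engine")
}

// checkMonitor propagates the failure to its caller.
func checkMonitor() error {
	if err := enableBackupMonitor(); err != nil {
		return fmt.Errorf("checkMonitor: %w", err)
	}
	return nil
}

// reconcile marks the backup as Error when checkMonitor fails, even though a
// later retry could have succeeded once the engine became reachable.
func reconcile(b *backup) {
	if err := checkMonitor(); err != nil {
		b.State = "Error"
		b.Message = err.Error()
	}
}
```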
Verified on v1.5.x-head 20230620
Verified on master-head 20230620
Result Passed
The `test_recurring_jobs_allow_detached_volume` test has passed.
The v1.5.x-head test results: https://ci.longhorn.io/job/private/job/longhorn-tests-regression/4236/
The master-head test results: https://ci.longhorn.io/job/private/job/longhorn-tests-regression/4260/

The `test_recurring_jobs_when_volume_detached_unexpectedly` test has passed.
The v1.5.x-head test results: https://ci.longhorn.io/job/private/job/longhorn-tests-regression/4237/
The master-head test results: https://ci.longhorn.io/job/private/job/longhorn-tests-regression/4261/

I think there is a problem in the backup monitoring instead of the backup controller @ChanYiLin
However, my head has stopped working now so I will do some more investigations and update tomorrow.
Oh, I got one
https://github.com/longhorn/longhorn-manager/blob/af2f111f6dfba09aaf24eaf2860ef3300af96c2a/controller/backup_controller.go#L321-L338
This issue is caused by the status not being updated to k8s (etcd); because the second defer function executes first, it uses the in-memory status, which is not empty, so it detaches the volume.
Maybe we can move the second defer function to the first position, so it runs after the controller updates the backup status, and get the backup again before reading the status. Then, if the status has not been updated, it won't detach the volume.
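Here is a minimal, self-contained sketch of the ordering problem (not the actual longhorn-manager code; the `backupStatus` struct, `reconcile`, and `updateFails` flag are stand-ins): deferred functions run in LIFO order, so the cleanup defer registered second runs before the status-update defer, and acts on in-memory state that may never have reached the API server.

```go
package main

import "fmt"

// Hypothetical stand-in for backup.Status; not the real longhorn-manager type.
type backupStatus struct {
	State string
}

// reconcile sketches the ordering problem: deferred functions run in LIFO
// order, so the cleanup defer (registered second) runs BEFORE the
// status-update defer (registered first) has persisted anything.
func reconcile(persisted *backupStatus, updateFails bool) {
	inMemory := *persisted // working copy mutated during the sync

	// Registered first => runs LAST: persist the in-memory status.
	defer func() {
		if updateFails {
			fmt.Println("status update conflicted; next reconcile must redo the sync")
			return
		}
		*persisted = inMemory
	}()

	// Registered second => runs FIRST: cleanup reads the in-memory status,
	// which may never reach the API server if the update above fails.
	defer func() {
		if inMemory.State == "Completed" {
			fmt.Println("detaching volume based on in-memory state:", inMemory.State)
		}
	}()

	inMemory.State = "Completed" // the sync work itself
}

func main() {
	status := &backupStatus{State: "InProgress"}
	reconcile(status, true) // simulate the failed status update
	fmt.Println("persisted state after reconcile:", status.State)
}
```

Reordering the defers and re-fetching the backup before reading its status, as proposed, would make the detach decision depend on what was actually persisted.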
WDYT?
cc @innobead @PhanLe1010
manager’s log
another manager’s log for the `engine`

I kind of found the root cause; it is a flaky situation, and I might need some time to summarize the issue.
After some investigation, I think the error is because of the following flow in `checkMonitor`. I can see the log when reproducing the issue.
I think that is why the backup failed and the message was related to the connection.
The next reconcile loop should reattach the volume again, but as you can see from the above logs, it failed to sync the VolumeAttachment because it had been modified.
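For reference, a common client-go pattern for this class of "the object has been modified" error is to re-fetch the latest object and retry the update on conflict. This is a general illustration only, not the Longhorn fix: it uses the built-in `storage.k8s.io` VolumeAttachment type and a hypothetical `mutate` callback for brevity, whereas Longhorn manages its own VolumeAttachment CRD.

```go
package example

import (
	"context"

	storagev1 "k8s.io/api/storage/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/util/retry"
)

// updateVolumeAttachment retries on optimistic-concurrency conflicts instead
// of failing the reconcile outright. mutate applies the desired change to the
// freshly fetched object.
func updateVolumeAttachment(ctx context.Context, client kubernetes.Interface, name string,
	mutate func(*storagev1.VolumeAttachment)) error {
	return retry.RetryOnConflict(retry.DefaultRetry, func() error {
		// Always work on the latest resourceVersion to avoid the
		// "the object has been modified" error.
		va, err := client.StorageV1().VolumeAttachments().Get(ctx, name, metav1.GetOptions{})
		if err != nil {
			return err
		}
		mutate(va)
		_, err = client.StorageV1().VolumeAttachments().Update(ctx, va, metav1.UpdateOptions{})
		return err
	})
}
```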
cc @innobead @PhanLe1010