longhorn: [BUG][v1.5.x] Recurring job fails to create backup when volume detached
Describe the bug
Running test case test_recurring_jobs_when_volume_detached_unexpectedly[s3], it randomly fails to create backup with error message:
proxyServer=10.42.3.8:8501 destination=10.42.2.10:10010:
failed to backup snapshot recurrin-532a51eb-2fda-4efb-8772-b4c41472eabb to backup-2bde369ad7864927:
rpc error: code = Unknown desc = failed to get volume 10.42.2.10:10010: rpc error: code = Unavailable
desc = error reading from server: EOF
https://ci.longhorn.io/job/private/job/longhorn-tests-regression/6238/consoleFull
The reproducibility is about 1/10.
To Reproduce
Running test case test_recurring_jobs_when_volume_detached_unexpectedly[s3] repeatedly.
Expected behavior
Support bundle for troubleshooting
supportbundle_df8064e0-2445-44fd-9af4-63ed76159e29_2024-02-16T02-11-07Z.zip
Environment
- Longhorn version: v1.5.x-head
- Impacted volume (PV):
- Installation method (e.g. Rancher Catalog App/Helm/Kubectl): kubectl
- Kubernetes distro (e.g. RKE/K3s/EKS/OpenShift) and version: v1.28.4+k3s1
- Number of control plane nodes in the cluster: 1
- Number of worker nodes in the cluster: 3
- Node config
- OS type and version: sles 15-sp5
- Kernel version:
- CPU per node:
- Memory per node:
- Disk type (e.g. SSD/NVMe/HDD):
- Network bandwidth between the nodes (Gbps):
- Underlying Infrastructure (e.g. on AWS/GCE, EKS/GKE, VMWare/KVM, Baremetal):
- Number of Longhorn volumes in the cluster:
Additional context
About this issue
- Original URL
- State: closed
- Created 5 months ago
- Reactions: 1
- Comments: 22 (21 by maintainers)
This issue is first found on Feb. 8 with longhorn-manager
485f498, which isv1.5.4-rc2: https://ci.longhorn.io/job/public/job/v1.5.x/job/v1.5.x-longhorn-upgrade-tests-sles-amd64/281/ So I believe it doesn’t happen in v1.5.4-rc1 or earlier version.And I also found this issue in v1.6.x-head on Feb. 9 and master-head on Feb. 7: https://ci.longhorn.io/job/public/job/v1.6.x/job/v1.6.x-longhorn-tests-sles-amd64/34/testReport/junit/tests/test_recurring_job/test_recurring_jobs_when_volume_detached_unexpectedly_s3_/ https://ci.longhorn.io/job/public/job/master/job/sles/job/amd64/job/longhorn-tests-sles-amd64/812/testReport/tests/test_recurring_job/test_recurring_jobs_allow_detached_volume_s3_/
Looks like @PhanLe1010 proposed a better fix. I will also test it.
I still need more investigation but I think I kind of see the root cause of race condition that @c3y1huang also saw at https://github.com/longhorn/longhorn/issues/7937#issuecomment-1956182452:
The race condition is that we sometimes mark the volumeattachment ticket as satisfied even though the volume has empty spec.NodeID and has started the detaching process.
This is the flow when I watch the changes of the
volumeandvolumeattachmentobject when the issue happen:First, snapshot controller create a VA ticket for the snapshot:
Then, volume.spec is set to the corresponding node:
The snapshot VA ticket initially not satisfied:
Then, once the volume is attached
Snapshot VA ticket has satisfied true
Once the snapshot is created, snapshot controller delete the snapshot VA
This lead to volume.Spec.NodeID being reset
Note that at this moment the volume still have status.state as attached because the volume has not started to detach the volume yet:
Then a backup VA ticket is created
Then backup VA ticket is immediate set to have
satisfied: trueeven though the volume is being detached with the spec.NodeID emptyAt this moment, backup controller starts to initiate the backup:
The backup later failed because the volume is in detaching process and it start killing engine process:
This is the error:
We’ve also discussed this in the local meeting. Maybe we can check again after the fix.
On observation, the same issue can also be seen before the engine crash.
Test exit before engine crash:
Analyse the log:
Based on the log, there seems to be a potential scenario of race condition where the snapshot ticket is removed from the volume attachment, resulting in the detachment of the volume (initiated by the snapshot controller), even as the backup process has already been initiated by the recurring job.
If true, then the commits in RC1 might not be directly linked to the test failure. However, they could potentially impact the processing speed, thereby triggering the condition.
I will experiment and gather more logs to confirm the case.
According to https://github.com/longhorn/longhorn/issues/7937#issuecomment-1947706261, it doesn’t. Since it’s happening from Feb. 7, the only affected release is v1.5.4.
Verified passed on master-head (longhorn-manager 917762ab) by running test case
test_recurring_jobs_when_volume_detached_unexpectedly100times.Test results: https://ci.longhorn.io/job/private/job/longhorn-tests-regression/6409/ ~ https://ci.longhorn.io/job/private/job/longhorn-tests-regression/6413/
@c3y1huang Please help drive the tickets here to be ready for testing. Thanks.
Hi Jack, the race here is when volume.spec.nodeID is cleared, volume controller start detaching the volume but not reach the step of setting the engine.spec.desirestate to stop yet.
Now we can turn our investigation to why it is happening only in recent Longhorn version. Is it possible that there a bigger delay between volume.spec.nodeID is cleared and engine.spec.desirestate state is set to stop in recent Longhorn version?
Proposed solution:
Add a recurring job volume-attachement-ticket for the job. This should prevent unnecessary intermediate detachment during snapshot and backup operations triggered by the recurring backup job.
Looks like I got different error on Longhorn v1.5.4-rc3:
proxyServer=10.42.191.99:8501 destination=10.42.191.99:10010: failed to backup snapshot c-00cx94-c844d1f7-a38b-4151-8058-e77e9b2f422e to backup-dfeb5dad14aa4e6d: rpc error: code = Unknown desc = failed to create backup to for volume testvol: rpc error: code = Unknown desc = missing input parameterError logs from instance manager:
2024-02-21T23:04:13.127508386Z time="2024-02-21T23:04:13Z" level=error msg="Failed to initialize backup backup-55ef929c06664af9" func="rpc.(*SyncAgentServer).BackupCreate" file="server.go:714" error="missing input parameter"Talked to @mantissahz and he will investigate more on https://github.com/longhorn/longhorn-engine/commit/55c5c5ad6720d37b3243d3bc649c2619aa8a628d.
@mantissahz Please help check this .