longhorn: [BUG] The last healthy replica may be evicted or removed
Describe the bug
test_disk_eviction_with_node_level_soft_anti_affinity_disabled failed on master-head edc1b83.
Double-checked against the release version: the failure does not occur on v1.3.0.
To Reproduce
Steps to reproduce the behavior:
- Set up Longhorn with 3 nodes
- Deploy longhorn-test
- Run test_disk_eviction_with_node_level_soft_anti_affinity_disabled
- After test step 6, the volume stays in the attaching state and no replica exists
Expected behavior
The test case should pass.
Log or Support bundle
longhorn-support-bundle_35fabdcc-d73a-4168-a2dd-65c2298709b1_2022-07-15T06-48-21Z.zip
Environment
- Longhorn version: edc1b83
- Installation method (e.g. Rancher Catalog App/Helm/Kubectl): kubectl
- Kubernetes distro (e.g. RKE/K3s/EKS/OpenShift) and version: k3s
- Number of management node in the cluster: 1
- Number of worker node in the cluster: 3
- Node config
- OS type and version: Ubuntu 20.04
Additional context
About this issue
- State: closed
- Created 2 years ago
- Comments: 15 (14 by maintainers)
Fixing issue 2 (retaining the evicting replica if it's the only healthy replica) would not resolve the test failure. During the replica removal for the eviction, the other replicas may not have been removed yet. On the other hand, retaining the last RW replica for the ReplicaRemove API or for eviction may not work either, since these 2 operations can be executed simultaneously (that is what the test case does) and there is no lock protecting the replicas.
I will continue the investigation tomorrow.
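To illustrate why a "keep the last healthy replica" check is not enough without a lock, here is a minimal sketch in Python. It is not Longhorn code: `ReplicaSet`, `can_remove`, and `remove_guarded` are hypothetical names, and the replicas are plain strings in memory. It only shows how two removal paths (eviction and the ReplicaRemove API) can each pass the check on their own and together remove every replica, and how doing the check and the removal under one lock avoids that.

```python
import threading


class ReplicaSet:
    """Hypothetical in-memory stand-in for a volume's healthy replicas."""

    def __init__(self, names):
        self.healthy = set(names)
        self._lock = threading.Lock()

    def can_remove(self, name):
        # Allowed only if at least one other healthy replica would remain.
        return name in self.healthy and len(self.healthy) > 1

    def remove_guarded(self, name):
        # Check and removal happen under the same lock, so a second caller
        # always sees the result of the first one.
        with self._lock:
            if not self.can_remove(name):
                return False
            self.healthy.discard(name)
            return True


def unguarded_interleaving():
    """Replay the problematic ordering: both paths check, then both remove."""
    rs = ReplicaSet(["r1", "r2"])
    eviction_ok = rs.can_remove("r1")  # eviction path checks
    api_ok = rs.can_remove("r2")       # ReplicaRemove path checks
    if eviction_ok:
        rs.healthy.discard("r1")       # both act on a stale answer
    if api_ok:
        rs.healthy.discard("r2")
    return rs.healthy


def guarded_concurrent():
    """Run both removals in threads, each going through the locked path."""
    rs = ReplicaSet(["r1", "r2"])
    threads = [
        threading.Thread(target=rs.remove_guarded, args=(name,))
        for name in ("r1", "r2")
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return rs.healthy
```

In the unguarded ordering no replica is left, which matches the symptom in the report (volume stuck attaching with no replica). With the lock, whichever removal runs second is refused, so one replica always survives.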
IMO we shouldn’t add in the wait because it could potentially happen to users.
Sounds good to me.
The cause of this issue:
Healthy during the eviction. And these checks cannot make the test wait for the new replica rebuild to complete before executing the next step.
There are 2 issues here:
This issue is probably related to https://github.com/longhorn/longhorn/issues/4294 as well.
Note: this is a regression from https://github.com/longhorn/longhorn-manager/commit/1bdd786158f1162f7a89d2c67a7c69694efd25e3. Setting auto-cleanup-system-generated-snapshot to false allows the test to pass.