longhorn: [BUG] Failed to drain nodes because IM pods are unable to evict due to timeout
Describe the bug
After adding/removing nodes, the cluster is stuck in a deadlock where the control plane cannot be updated: the system tries to evict pods that no longer exist in the cluster, and the drain hangs because the eviction can never complete.
[controlPlane] Failed to upgrade Control Plane: [[error draining node s1: [error when evicting pods/"instance-manager-e-689a397e" -n "longhorn-system": global timeout reached: 2m0s, error when evicting pods/"instance-manager-r-68277463" -n "longhorn-system": global timeout reached: 2m0s]]]
To Reproduce
It occurred randomly after removing two nodes and adding two new nodes right after.
Expected behavior
Longhorn should automatically detect that the instance manager pods are no longer present and resolve the deadlock.
Environment
- Longhorn version: 1.3.1
- Installation method (e.g. Rancher Catalog App/Helm/Kubectl): rancher catalog
- Kubernetes distro (e.g. RKE/K3s/EKS/OpenShift) and version: 1.23.7
- Number of management node in the cluster: 1
- Number of worker node in the cluster: 3
- Node config
- OS type and version: Ubuntu
- Underlying Infrastructure (e.g. on AWS/GCE, EKS/GKE, VMWare/KVM, Baremetal): Hetzner
- Number of Longhorn volumes in the cluster: ~60
Additional context
This is quite critical, since we cannot add any further nodes to the cluster as long as it is in this error state. Is there any way to clear the error state manually, given that the pods to be evicted no longer exist anyway?
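In case it helps with debugging, here is a minimal sketch of how the objects named in the drain error can be checked directly (the pod names and the node s1 are copied from the error message above):

    # Do the two pods still exist from the API server's point of view?
    kubectl -n longhorn-system get pod instance-manager-e-689a397e instance-manager-r-68277463
    # Is s1 still cordoned from the failed drain? (STATUS would show SchedulingDisabled)
    kubectl get node s1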
Thanks!
About this issue
- State: open
- Created 2 years ago
- Comments: 20 (11 by maintainers)
The logs
[controlPlane] Failed to upgrade Control Plane: [[error draining node s1: [error when evicting pods/"instance-manager-e-689a397e" -n "longhorn-system": global timeout reached: 2m0s, error when evicting pods/"instance-manager-r-68277463" -n "longhorn-system": global timeout reached: 2m0s]]]
are complaining that the drain cannot evict 2 pods. This error doesn't seem to be related to a PDB, because a PDB block would produce something like "cannot evict pod xxx because it violates PDB y". On the other hand, you are saying that you cannot see the pods using kubectl. This leads me to the theory of a possible inconsistency in etcd.
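To double-check that no PDB is involved, a quick sketch of the check would be to list the PodDisruptionBudgets in the Longhorn namespace and see which pods they select:

    # Any PodDisruptionBudget blocking eviction would show up here (ALLOWED DISRUPTIONS of 0 is the usual sign)
    kubectl -n longhorn-system get pdb
    # Describe them to see which pods each budget currently selects
    kubectl -n longhorn-system describe pdb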
This cluster has 2 etcd nodes (s1, l1) and 1 CP node (s1). This is a dangerous situation, because if you lose l1 the etcd quorum will be permanently lost (https://etcd.io/docs/v3.3/faq/#why-an-odd-number-of-cluster-members). And indeed, you are trying to delete l1. To verify this theory, can you increase the number of etcd nodes to 3?
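To look at the etcd side directly, something along these lines can be run on one of the etcd nodes (a sketch only; the etcdctl location and the certificate paths are placeholders and depend on how the distro deploys etcd):

    # List the current members as etcd itself sees them
    ETCDCTL_API=3 etcdctl --endpoints=https://127.0.0.1:2379 \
      --cacert=<ca.pem> --cert=<client.pem> --key=<client-key.pem> \
      member list
    # Check that every endpoint still answers (quorum health)
    ETCDCTL_API=3 etcdctl --endpoints=https://127.0.0.1:2379 \
      --cacert=<ca.pem> --cert=<client.pem> --key=<client-key.pem> \
      endpoint health --cluster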
Weird… I checked the support bundle. Based on the longhorn-manager logs, both the instance manager CRs and the pods are removed.
At least I didn’t find any leftovers related to these 2 instance managers in Longhorn.
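For anyone checking the same thing on their own cluster, a sketch of how to list the objects referred to here (the label selector is an assumption about Longhorn's pod labels and may vary between versions):

    # Instance manager custom resources tracked by Longhorn
    kubectl -n longhorn-system get instancemanagers.longhorn.io
    # Instance manager pods as the API server sees them (label assumed, not confirmed)
    kubectl -n longhorn-system get pods -l longhorn.io/component=instance-manager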