longhorn: [BUG] Failed to drain nodes because IM pods are unable to evict due to timeout

Describe the bug

After removing and adding nodes, the cluster is stuck in a deadlock where the control plane cannot be upgraded: the drain tries to evict pods that no longer exist in the cluster, so the task can never complete.

[controlPlane] Failed to upgrade Control Plane: [[error draining node s1: [error when evicting pods/"instance-manager-e-689a397e" -n "longhorn-system": global timeout reached: 2m0s, error when evicting pods/"instance-manager-r-68277463" -n "longhorn-system": global timeout reached: 2m0s]]]

To Reproduce

It occurred randomly after removing two nodes and adding two new nodes right after.

Expected behavior

Longhorn should automatically detect that the instance manager pods are no longer present and resolve the deadlock.

Environment

  • Longhorn version: 1.3.1
  • Installation method (e.g. Rancher Catalog App/Helm/Kubectl): rancher catalog
  • Kubernetes distro (e.g. RKE/K3s/EKS/OpenShift) and version: 1.23.7
    • Number of management node in the cluster: 1
    • Number of worker node in the cluster: 3
  • Node config
    • OS type and version: ubuntu
  • Underlying Infrastructure (e.g. on AWS/GCE, EKS/GKE, VMWare/KVM, Baremetal): hetzner
  • Number of Longhorn volumes in the cluster: ~60

Additional context

This is quite critical, since we cannot add any further nodes to the cluster as long as it is in this error state. Is there any way to clear the error state manually, given that the pods to be evicted no longer exist anyway?
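For anyone in the same spot, a minimal client-go sketch of what can be checked before any manual cleanup, assuming the pod names from the error above and a default kubeconfig: it looks up the two pods and any PodDisruptionBudget of the same name in longhorn-system (Longhorn creates a PDB per instance manager, so a stale one is worth ruling out).

```go
package main

import (
	"context"
	"fmt"

	apierrors "k8s.io/apimachinery/pkg/api/errors"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	// Assumes the default kubeconfig location (~/.kube/config).
	cfg, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		panic(err)
	}
	cs, err := kubernetes.NewForConfig(cfg)
	if err != nil {
		panic(err)
	}

	ns := "longhorn-system"
	// The two pods named in the drain error above.
	for _, name := range []string{"instance-manager-e-689a397e", "instance-manager-r-68277463"} {
		_, err := cs.CoreV1().Pods(ns).Get(context.TODO(), name, metav1.GetOptions{})
		switch {
		case apierrors.IsNotFound(err):
			fmt.Printf("pod %s: gone\n", name)
		case err != nil:
			fmt.Printf("pod %s: error: %v\n", name, err)
		default:
			fmt.Printf("pod %s: still exists\n", name)
		}

		// Longhorn normally creates a PodDisruptionBudget named after each
		// instance manager; a leftover one can keep a drain waiting.
		_, err = cs.PolicyV1().PodDisruptionBudgets(ns).Get(context.TODO(), name, metav1.GetOptions{})
		switch {
		case apierrors.IsNotFound(err):
			fmt.Printf("pdb %s: gone\n", name)
		case err != nil:
			fmt.Printf("pdb %s: error: %v\n", name, err)
		default:
			fmt.Printf("pdb %s: still exists\n", name)
		}
	}
}
```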

Thanks!

About this issue

  • Original URL
  • State: open
  • Created 2 years ago
  • Comments: 20 (11 by maintainers)

Most upvoted comments

The log line [controlPlane] Failed to upgrade Control Plane: [[error draining node s1: [error when evicting pods/"instance-manager-e-689a397e" -n "longhorn-system": global timeout reached: 2m0s, error when evicting pods/"instance-manager-r-68277463" -n "longhorn-system": global timeout reached: 2m0s]]] is complaining that it cannot evict 2 pods. This error doesn’t seem related to a PDB, because a PDB block would produce something like: cannot evict pod xxx because it violates PDB y
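For context, a drain goes through the Eviction API pod by pod, and a PDB block surfaces as an HTTP 429 with exactly that kind of message, while a pod the API server does not know about comes back as a 404. A minimal client-go sketch (kubeconfig path is assumed; the pod name is taken from the error above) showing how the two cases differ:

```go
package main

import (
	"context"
	"fmt"

	policyv1 "k8s.io/api/policy/v1"
	apierrors "k8s.io/apimachinery/pkg/api/errors"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	cfg, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		panic(err)
	}
	cs, err := kubernetes.NewForConfig(cfg)
	if err != nil {
		panic(err)
	}

	// Same per-pod API call a drain issues.
	ev := &policyv1.Eviction{
		ObjectMeta: metav1.ObjectMeta{
			Name:      "instance-manager-e-689a397e", // pod from the error message
			Namespace: "longhorn-system",
		},
	}
	err = cs.PolicyV1().Evictions("longhorn-system").Evict(context.TODO(), ev)
	switch {
	case apierrors.IsTooManyRequests(err):
		// This is what a PDB block looks like: HTTP 429,
		// "Cannot evict pod as it would violate the pod's disruption budget."
		fmt.Println("blocked by a PodDisruptionBudget:", err)
	case apierrors.IsNotFound(err):
		// Pod is already gone from the API server's point of view.
		fmt.Println("pod not found:", err)
	case err != nil:
		fmt.Println("eviction failed:", err)
	default:
		fmt.Println("eviction accepted")
	}
}
```

If these pods were really gone from the API server's point of view, the eviction would come back as a 404 and the drain helper would normally just skip them; the fact that it waits until the global timeout is another hint that the API server (or the etcd behind it) still holds some record of them.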

On the other hand, you are saying that you cannot see the pods using kubectl. This leads me to a theory about a possible inconsistency in etcd.

This cluster has 2 etcd nodes (s1, l1) and 1 CP node (s1). This is a dangerous situation: the quorum of a 2-member cluster is 2, so if you lose l1, the etcd quorum will be permanently lost ( https://etcd.io/docs/v3.3/faq/#why-an-odd-number-of-cluster-members). And indeed, you are trying to delete l1.

To verify this theory, can you increase the number of etcd nodes to 3?
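A rough sketch for checking the current membership and the resulting quorum size, assuming direct access to an etcd endpoint (TLS setup is omitted for brevity; a secured etcd such as RKE's needs the client certs, typically found under /etc/kubernetes/ssl on the node):

```go
package main

import (
	"context"
	"fmt"
	"time"

	clientv3 "go.etcd.io/etcd/client/v3"
)

func main() {
	// Endpoint is a placeholder; add a TLS config for a secured cluster.
	cli, err := clientv3.New(clientv3.Config{
		Endpoints:   []string{"https://127.0.0.1:2379"},
		DialTimeout: 5 * time.Second,
	})
	if err != nil {
		panic(err)
	}
	defer cli.Close()

	resp, err := cli.MemberList(context.TODO())
	if err != nil {
		panic(err)
	}

	n := len(resp.Members)
	quorum := n/2 + 1
	// With 2 members the quorum is 2, so losing either member loses quorum.
	fmt.Printf("members=%d quorum=%d\n", n, quorum)
	for _, m := range resp.Members {
		fmt.Println(" -", m.Name, m.ClientURLs)
	}
}
```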

Weird… I checked the support bundle. Based on the longhorn-manager logs, both the instance manager CRs and their pods have been removed.

2022-08-20T19:03:55.198830249Z time="2022-08-20T19:03:55Z" level=info msg="Longhorn instance manager instance-manager-e-689a397e has been deleted, will try best to do cleanup"
2022-08-20T19:03:55.225826131Z time="2022-08-20T19:03:55Z" level=warning msg="Deleted instance manager pod instance-manager-e-689a397e for instance manager instance-manager-e-689a397e"
2022-08-20T19:03:55.233928010Z time="2022-08-20T19:03:55Z" level=warning msg="Can't find instance manager for pod instance-manager-e-689a397e, may be deleted"
2022-08-20T19:07:02.390705626Z time="2022-08-20T19:07:02Z" level=info msg="Longhorn instance manager instance-manager-r-68277463 has been deleted, will try best to do cleanup"
2022-08-20T19:07:02.426218720Z time="2022-08-20T19:07:02Z" level=warning msg="Can't find instance manager for pod instance-manager-r-68277463, may be deleted"
2022-08-20T19:07:02.441253260Z time="2022-08-20T19:07:02Z" level=warning msg="Deleted instance manager pod instance-manager-r-68277463 for instance manager instance-manager-r-68277463"

At least I didn’t find any leftovers related to these 2 instance managers in Longhorn.
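For completeness, the same check can be reproduced against the live cluster with the dynamic client by listing the remaining instancemanagers.longhorn.io objects in longhorn-system (the API version below assumes the v1beta2 CRDs shipped with Longhorn 1.3; adjust if the cluster serves v1beta1):

```go
package main

import (
	"context"
	"fmt"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/runtime/schema"
	"k8s.io/client-go/dynamic"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	cfg, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		panic(err)
	}
	dyn, err := dynamic.NewForConfig(cfg)
	if err != nil {
		panic(err)
	}

	// Longhorn instance manager custom resources (assumed API version).
	gvr := schema.GroupVersionResource{Group: "longhorn.io", Version: "v1beta2", Resource: "instancemanagers"}
	list, err := dyn.Resource(gvr).Namespace("longhorn-system").List(context.TODO(), metav1.ListOptions{})
	if err != nil {
		panic(err)
	}
	for _, im := range list.Items {
		// The two names from the drain error should not appear here.
		fmt.Println(im.GetName())
	}
}
```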