longhorn: [BUG] kubectl drain node gets stuck forever

Describe the bug When we want to drain a node (RKE2 1.20.7 rke2r2 / Longhorn 1.1.100), the drain gets stuck forever in

evicting pod longhorn-system/instance-manager-r-b4be9e85
error when evicting pods/"instance-manager-r-b4be9e85" -n "longhorn-system" (will retry after 5s): Cannot evict pod as it would violate the pod's disruption budget.

To Reproduce Steps to reproduce the behavior:

  • Deploy RKE2 (3 master, >=4 worker nodes)
  • Deploy Longhorn
  • Deploy rancher-monitoring, which creates two PVCs
  • Run kubectl drain on one worker that holds a replica of the Grafana or Prometheus PV (example invocation below)
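
For reference, a typical drain invocation that triggers the hang (the node name is a placeholder; the flags are standard kubectl drain options):

kubectl drain <worker-node> --ignore-daemonsets --delete-emptydir-data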

Expected behavior The drain should complete

Log

evicting pod longhorn-system/instance-manager-r-b4be9e85
error when evicting pods/"instance-manager-r-b4be9e85" -n "longhorn-system" (will retry after 5s): Cannot evict pod as it would violate the pod's disruption budget.

You can also attach a Support Bundle here. You can generate a Support Bundle using the link at the footer of the Longhorn UI. -> Will attach this in a few minutes.

Environment:

  • Longhorn version: 1.1.100
  • Installation method (e.g. Rancher Catalog App/Helm/Kubectl): rancher catalog app
  • Kubernetes distro (e.g. RKE/K3s/EKS/OpenShift) and version: RKE2 1.20.7 rke2r2
    • Number of management nodes in the cluster: 3
    • Number of worker nodes in the cluster: 14
  • Node config
    • OS type and version: SLES 15 SP2
    • CPU per node:
    • Memory per node:
    • Disk type (e.g. SSD/NVMe):
    • Network bandwidth between the nodes:
  • Underlying Infrastructure (e.g. on AWS/GCE, EKS/GKE, VMWare/KVM, Baremetal): Baremetal
  • Number of Longhorn volumes in the cluster: 4 (2 active)

About this issue

  • State: closed
  • Created 3 years ago
  • Comments: 20 (11 by maintainers)

Most upvoted comments

Thanks @Martin-Weiss for reporting

I think this is a Longhorn bug:

When a volume is created for the first time (via the UI or via a PVC YAML manifest) and has never been attached to a node, Longhorn doesn't remove the PDB for the instance-manager-r-xxx pod that contains the volume's replicas when the user runs kubectl drain. This blocks the kubectl drain command.

The reason Longhorn doesn't remove the PDB is that it tries to find a healthy replica on a different node by checking r.Spec.HealthyAt != "". This check always fails for a volume that has never been attached to a node, since r.Spec.HealthyAt has never been set.
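
To see what is blocking the eviction, you can list the disruption budgets Longhorn creates for its instance-manager pods (Longhorn typically creates one PDB per instance-manager pod with minAvailable: 1):

kubectl -n longhorn-system get poddisruptionbudgets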

From the support bundle you provided, I can see that you have 2 volumes that have never been attached.

Workaround:

Find the volumes that have never been attached to a node, attach them, then detach them. This sets r.Spec.HealthyAt for the volumes' replicas, so Longhorn will remove the PDB for the instance-manager-r-xxx pod that contains the volumes' replicas when the user runs kubectl drain. To find those volumes, run kubectl get replicas -n longhorn-system -o yaml, look for replicas that have failedAt == "" and healthyAt == "", and get the volume name from replica.metadata.ownerReferences[0].name.
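
A rough shell sketch of that lookup (requires jq; the spec.healthyAt / spec.failedAt field paths are assumed from the r.Spec.HealthyAt / r.Spec.FailedAt references above and may differ between Longhorn versions):

# Print the names of volumes whose replicas have never become healthy and never failed
kubectl -n longhorn-system get replicas.longhorn.io -o json \
  | jq -r '.items[]
      | select((.spec.healthyAt // "") == "" and (.spec.failedAt // "") == "")
      | .metadata.ownerReferences[0].name' \
  | sort -u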

@mantissahz please don't remove the previous flag; this issue is a regression only in 1.3.0-rc. Good catch! Also, please create another issue to track it instead of reopening an already closed and released issue.

@Martin-Weiss So far I have been able to identify several scenarios that cause PDB errors. I saw from the logs that you have created longhorn-test-pvc-rwx. Was it running at the time of draining? For the monitoring storage, which accessModes are you using?

Known Scenarios

Scenario 1: Storage class has numberOfReplicas of 1

  • If the volume's StorageClass numberOfReplicas is 1, increase it to 2. Otherwise, you will encounter PDB errors during draining and the upgrade will time out. See the sketch below.
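
A rough sketch of checking and raising the replica count, assuming the default longhorn StorageClass name and the spec.numberOfReplicas field on the Longhorn Volume CR (the volume name is a placeholder):

# Check the replica count configured in the StorageClass
kubectl get storageclass longhorn -o jsonpath='{.parameters.numberOfReplicas}{"\n"}'

# Raise the replica count on an already-created volume
kubectl -n longhorn-system patch volumes.longhorn.io <volume-name> --type merge -p '{"spec":{"numberOfReplicas":2}}'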

Scenario 2: PVC/PV/LHV is created through the Longhorn UI, but has not yet been attached and replicated

  • After the volume has been attached, replicated, and detached, the nodes holding its replicas can be drained successfully.
  • This does not seem to be a problem if the volume is created through a PVC using a manifest.

Scenario 3: PVC/PV/LHV is created through the Longhorn UI and attached to a host node

  • The volume needs to be detached; then the node can be drained successfully.

Scenario 4: RWX volume attached to a node

  • Scale down the workload that uses the volume, then drain (example below).
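
For example, assuming the RWX consumer is a Deployment (namespace, workload, and node names are placeholders):

# Stop the workload that keeps the RWX volume attached
kubectl -n <namespace> scale deployment <workload> --replicas=0

# Then drain the node
kubectl drain <node> --ignore-daemonsets --delete-emptydir-data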

Scenario 5: RWO volume with last healthy replica

  • Set allow-node-drain-with-last-healthy-replica to true to be able to drain (see the sketch below).
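
The setting can be changed in the Longhorn UI; as a sketch of an alternative, patching the corresponding Setting CR should have the same effect (the top-level value field is assumed from the Longhorn Setting CRD):

kubectl -n longhorn-system patch settings.longhorn.io allow-node-drain-with-last-healthy-replica --type merge -p '{"value":"true"}'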