longhorn: [BUG] kubectl drain node is blocked by unexpected orphan engine processes
I just cannot drain a node. This happens at least half of the time I need to drain one. The drain fails with: Cannot evict pod as it would violate the pod’s disruption budget.
To Reproduce
Simple 4-node cluster; I tried to drain one of the nodes. I could see in the Longhorn UI that all volumes had been detached and reattached and that the pods were running on other nodes, but the drain never finished.
longhorn-system pod/share-manager-pvc-64f91ac7-abcf-4ffa-8925-89de19818d3b 1/1 Running 0 16h 10.238.68.16 node2 <none> <none>
longhorn-system pod/share-manager-pvc-8325be46-d63e-4fc8-9fef-fa4f089d8490 1/1 Running 0 16h 10.238.73.14 node1 <none> <none>
longhorn-system pod/share-manager-pvc-93649f22-80e0-499d-bfeb-9aeb10c5ad55 1/1 Running 0 16h 10.238.67.11 node3 <none> <none>
wp pod/wp-mariadb-0 1/1 Running 0 10m 10.238.67.15 node3 <none> <none>
wp pod/wp-wordpress-cfcc7fd6f-nh8fl 1/1 Running 0 10m 10.238.68.20 node2 <none> <none>
wp pod/wp-wordpress-cfcc7fd6f-nws8m 1/1 Running 3 (9m53s ago) 16h 10.238.73.16 node1 <none> <none>
wp2 pod/wp-mariadb-0 1/1 Running 0 16h 10.238.67.12 node3 <none> <none>
wp2 pod/wp-wordpress-cfcc7fd6f-bdvzd 1/1 Running 1 (16h ago) 16h 10.238.73.17 node1 <none> <none>
wp2 pod/wp-wordpress-cfcc7fd6f-sv8fw 1/1 Running 0 10m 10.238.68.19 node2 <none> <none>
wp3 pod/wp-mariadb-0 1/1 Running 0 10m 10.238.67.14 node3 <none> <none>
wp3 pod/wp-wordpress-cfcc7fd6f-qw27r 1/1 Running 0 10m 10.238.68.18 node2 <none> <none>
wp3 pod/wp-wordpress-cfcc7fd6f-wppst 1/1 Running 3 (9m51s ago) 16h 10.238.73.15 node1 <none> <none>
wp persistentvolumeclaim/data-wp-mariadb-0 Bound pvc-efa3b148-a765-430f-a94e-a5df20757c0b 5Gi RWO longhorn 16h Filesystem
wp persistentvolumeclaim/wp-wordpress Bound pvc-93649f22-80e0-499d-bfeb-9aeb10c5ad55 2Gi RWX longhorn 16h Filesystem
wp2 persistentvolumeclaim/data-wp-mariadb-0 Bound pvc-d11ab7d8-6a5c-48b1-9260-4bb5b0043953 5Gi RWO longhorn 16h Filesystem
wp2 persistentvolumeclaim/wp-wordpress Bound pvc-64f91ac7-abcf-4ffa-8925-89de19818d3b 2Gi RWX longhorn 16h Filesystem
wp3 persistentvolumeclaim/data-wp-mariadb-0 Bound pvc-51f9a96d-410f-49cd-be2d-d25ee33eac27 5Gi RWO longhorn 16h Filesystem
wp3 persistentvolumeclaim/wp-wordpress Bound pvc-8325be46-d63e-4fc8-9fef-fa4f089d8490 2Gi RWX longhorn 16h Filesystem
I used this command to drain:
kubectl drain --force --ignore-daemonsets --delete-emptydir-data --grace-period=10 node4
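If the drain hangs like this, a quick way to see what is blocking it (a diagnostic sketch; node4 is the node from the drain command above) is to list the pods still scheduled on the node and the PodDisruptionBudgets Longhorn creates for its instance managers:

kubectl get pods --all-namespaces -o wide --field-selector spec.nodeName=node4
kubectl get pdb -n longhorn-system

A leftover instance-manager PDB in the second listing is what produces the "disruption budget" eviction error.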
Expected behavior
After all pods have been removed from the node, the controller should remove the PDB that blocks eviction of the instance manager, allowing the drain to finish.
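If the controller never removes it, a manual workaround (hedged; the PDB name below is hypothetical, copy the real one from the listing) is to delete the leftover instance-manager PDB yourself so the eviction can proceed. Only do this once all volumes on the node are detached or healthy elsewhere:

kubectl get pdb -n longhorn-system
kubectl delete pdb instance-manager-xxxxxxxx -n longhorn-system   # hypothetical name, take it from the output above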
Support bundle for troubleshooting
Environment
- Longhorn version: 1.5.1
- Installation method (e.g. Rancher Catalog App/Helm/Kubectl): kubespray
- Kubernetes distro (e.g. RKE/K3s/EKS/OpenShift) and version: on-prem vanilla Kubernetes installed by kubespray
- Number of management nodes in the cluster: 3
- Number of worker nodes in the cluster: 4
Additional context
I have opened https://github.com/longhorn/longhorn/issues/6978 to track the feature request for automatically detaching manually attached volumes during drain.
Let’s refocus this one on the original issue. Both @jsalatiel and @docbobo have clusters with multiple engine processes for a volume running on different nodes, and @docbobo has seen engine processes running for detached volumes. Another user has provided a support bundle with similar symptoms offline. We need to discover how Longhorn is “forgetting” to clean up old engine processes, as these processes prevent drains.
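One way to spot duplicate or orphaned engines (a sketch; the custom-columns paths follow the Engine CRD as I understand it and may vary between Longhorn versions) is to list the Engine CRs with their volume, node, and state, then look for two engines on different nodes for the same volume, or a running engine for a detached volume:

kubectl get engines.longhorn.io -n longhorn-system \
  -o custom-columns=NAME:.metadata.name,VOLUME:.metadata.labels.longhornvolume,NODE:.spec.nodeID,STATE:.status.currentState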
I like the icon idea, but I still think that drain should always take priority and detach even volumes attached via the UI.
Agree with @ejweber that the correct action is to detach it via the UI. This behavior has been the same in older Longhorn versions: if the user manually attaches a volume using the UI, the volume needs to be detached via the UI first before a workload on another node can attach it.
If the user has never used the UI/Longhorn API to attach the volume, then this behavior is unexpected.
@schmidp you can send it over to longhorn-support-bundle@suse.com.
@ejweber For volumes attached via the UI: once detached via the UI, will the volume be automatically reattached, or does it become a UI-managed-only volume forever? That does not seem like nice behavior. If I try to drain a node, I would expect it to drain regardless of whether a volume was attached via the UI.
Thanks @schmidp. Do you have a support bundle? For this issue, it is best to capture one while the drain is failing; after the problem is resolved, we will not see the state of the attachmentTickets from when the detach was failing. Note that if you have attached a volume via the UI, kubectl will not be able to drain the node it is attached to. Volumes attached via the UI must be detached via the UI.
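For reference, the attachment tickets live on the Longhorn VolumeAttachment CR, so you can check who is holding a volume (a sketch, using one of the RWX volumes from the listing above; as I understand it, a ticket of type longhorn-api comes from a UI/API attach, while csi-attacher comes from a workload attach):

kubectl get volumeattachments.longhorn.io -n longhorn-system
kubectl get volumeattachments.longhorn.io pvc-93649f22-80e0-499d-bfeb-9aeb10c5ad55 -n longhorn-system -o yaml
# inspect spec.attachmentTickets for each ticket's type and nodeID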
Sorry for the confusion; last time I mounted the volume to the host directly, so I couldn’t drain the node because the volume could not be detached from the host automatically.
By mounting the volume in a pod instead, I can successfully drain the node: the pod is evicted and rescheduled, the volume is reattached to the node where the pod is recreated, and Longhorn can then delete the PDB for the instance-manager to unblock the drain.
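In other words, let a pod own the mount instead of mounting the volume on the host. A minimal sketch (the pod name is illustrative; the PVC name is one from the listing above):

kubectl apply -f - <<EOF
apiVersion: v1
kind: Pod
metadata:
  name: data-consumer        # illustrative name
  namespace: wp
spec:
  containers:
  - name: app
    image: busybox
    command: ["sleep", "infinity"]
    volumeMounts:
    - name: data
      mountPath: /data
  volumes:
  - name: data
    persistentVolumeClaim:
      claimName: data-wp-mariadb-0   # illustrative; use your own PVC
EOF

When the node is drained, this pod is evicted, Longhorn detaches and reattaches the volume wherever the pod lands, and the instance-manager PDB no longer blocks the drain.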