longhorn: [BUG] Cannot evict pod as it would violate the pod's disruption budget.
Describe the bug
After upgrading RKE2 control nodes from v1.22.8+rke2r1 to v1.25.9+rke2r1, agent nodes cannot be drained and get stuck with "Cannot evict pod as it would violate the pod's disruption budget."
To Reproduce
On an RKE2 control node, update to the latest release and restart rke2:
curl -sfL https://get.rke2.io | sh -
systemctl restart rke2-server
Perform these steps on all control nodes. Once the upgrade is complete, try draining an agent node:
kubectl cordon agent-node
kubectl drain agent-node --ignore-daemonsets --delete-emptydir-data --pod-selector='app!=csi-attacher,app!=csi-provisioner,app!=longhorn-admission-webhook,app!=longhorn-conversion-webhook,app!=longhorn-driver-deployer' --grace-period=10
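While the drain is stuck, the blocking pod and its PDB can be inspected with standard kubectl commands. A minimal sketch; the instance-manager name used here is just the one from this report:
# list the Longhorn instance managers running on the node being drained
kubectl -n longhorn-system get pods -o wide --field-selector spec.nodeName=agent-node
# the matching PodDisruptionBudgets show ALLOWED DISRUPTIONS 0 while the drain is blocked
kubectl -n longhorn-system get pdb
kubectl -n longhorn-system describe pdb instance-manager-e-6defff48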
Expected behavior
Agent node is drained.
Log or Support bundle
If applicable, add the Longhorn managers' log or support bundle when the issue happens. You can generate a Support Bundle using the link at the footer of the Longhorn UI.
Environment
- Longhorn version: v1.2.6 (have also tried upgrading Longhorn to v1.3.3)
- Installation method (e.g. Rancher Catalog App/Helm/Kubectl): Helm
- Kubernetes distro (e.g. RKE/K3s/EKS/OpenShift) and version: v1.22.8+rke2r1
- Number of management node in the cluster: 3
- Number of worker node in the cluster: 4
- Node config
- OS type and version: Ubuntu 20.04
- CPU per node: 4
- Memory per node: 8GB
- Disk type(e.g. SSD/NVMe): NVMe
- Network bandwidth between the nodes: 10GB
- Underlying Infrastructure (e.g. on AWS/GCE, EKS/GKE, VMWare/KVM, Baremetal): VMWare
- Number of Longhorn volumes in the cluster: 1
Additional context
The RKE2 upgrade is being done from 1.22 to 1.25, not in incremental steps (I don't know whether that has any impact). From what I can tell, the PDBs are not being deleted when the agent node is cordoned.
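One way to check this (a sketch; the PDB names mirror the instance-manager pod names shown in the listings below) is to compare the instance managers on the cordoned node against the PDBs that remain after cordoning:
# instance managers still scheduled on the cordoned node
kubectl -n longhorn-system get pods -o wide | grep instance-manager | grep agent-node
# PDBs that still exist for that node's instance managers
kubectl -n longhorn-system get pdb | grep instance-manager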
Pods and PDBs before cordon
NAME READY STATUS RESTARTS AGE IP
csi-attacher-5f46994f7-9629c 1/1 Running 0 20h 10.42.3.10
csi-resizer-6dd8bd4c97-srr7d 1/1 Running 0 20h 10.42.3.11
csi-snapshotter-86f65d8bc-p7p6q 1/1 Running 0 20h 10.42.3.12
engine-image-ei-9bf563e8-wjm88 1/1 Running 0 20h 10.42.3.7
instance-manager-e-6defff48 1/1 Running 0 20h 10.42.3.8
instance-manager-r-d2f0328a 1/1 Running 0 20h 10.42.3.9
longhorn-csi-plugin-rvhgz 2/2 Running 0 20h 10.42.3.13
longhorn-manager-bx4x6 1/1 Running 0 20h 10.42.3.5
longhorn-ui-57c49478dc-s4vtl 1/1 Running 0 20h 10.42.3.6
after cordon
kubectl cordon agent-node
node/agent-node cordoned
kubectl -n longhorn-system get pdb
NAME MIN AVAILABLE MAX UNAVAILABLE ALLOWED DISRUPTIONS AGE
instance-manager-e-17712f99 1 N/A 0 20h
instance-manager-e-6defff48 1 N/A 0 20h
instance-manager-e-9c26f77c 1 N/A 0 20h
instance-manager-e-d82e16a5 1 N/A 0 20h
instance-manager-r-23246e55 1 N/A 0 20h
instance-manager-r-37ca8394 1 N/A 0 20h
instance-manager-r-d2f0328a 1 N/A 0 20h
instance-manager-r-e3eda6dd 1 N/A 0 20h
longhorn-support-bundle_6c16098f-a97d-41e6-8aa8-1329a6535eb8_2023-05-10T16-57-07Z.zip
About this issue
- State: open
- Created a year ago
- Reactions: 5
- Comments: 24 (6 by maintainers)
This still happens on v1.4.2
I have the same problem:
(the longhorn-manager has such a low age because I tried killing it)
To fix it, I manually modified the pod disruption budgets.
I noticed that on the nodes where the upgrade worked, the pod disruption budget had been created recently (its age was about 30 minutes, i.e. from when the node was updated). For the nodes where the drain got stuck, the PDB was 30 days old (probably from when I installed Longhorn).
Not sure if this is the reason, but I use strictly-local volumes on 3 out of 6 nodes. One node with a strictly-local volume upgraded without this problem; two experienced the problem.
I sadly didn't take a screenshot of the PDBs, but I deleted them for the 2 failed nodes and Longhorn recreated them. I am now waiting for k3s 1.27.6 to see how it behaves on the next update.
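For reference, a hedged sketch of the manual workaround described above (the PDB names are the ones from this report, and Longhorn recreates the PDBs on its own afterwards):
# delete the stale PDBs for the stuck node's instance managers
kubectl -n longhorn-system delete pdb instance-manager-e-6defff48 instance-manager-r-d2f0328a
# then retry the drain
kubectl drain agent-node --ignore-daemonsets --delete-emptydir-data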
Facing the same issue currently on Longhorn v1.4.3. What can we do to help you debug the issue?
This started happening to me last week when I upgraded Longhorn from 1.5.1 to 1.5.4. The RKE clusters are running 1.26.8.
Draining procedure gets stuck at:
error when evicting pods/"instance-manager-992d684c4a70f8d17d28be3fe762bf61" -n "longhorn-system" (will retry after 5s): Cannot evict pod as it would violate the pod's disruption budget.
This is very frustrating, as my drain-and-patch Ansible playbook can't be used anymore.
When it happens next time, could you send us a support bundle to longhorn-support-bundle@suse.com, taken while the issue is happening? Thank you.
It is unfortunately not consistent. It happens sometimes as I drain the nodes for patches. The node just kept reporting that it couldn't evict the pod for hours until I saw it. There are no disks with less than 3 replicas, and all nodes have enough space for the replicas if two nodes fail.
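One way to sanity-check that state before retrying a drain (a sketch assuming a default Longhorn install, where volumes and replicas are exposed as CRDs in the longhorn-system namespace):
# volume state/robustness should be healthy before draining
kubectl -n longhorn-system get volumes.longhorn.io
# replica placement per node
kubectl -n longhorn-system get replicas.longhorn.io -o wide
# and the current PDB state
kubectl -n longhorn-system get pdb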