longhorn: [BUG] Cannot evict pod as it would violate the pod's disruption budget.

Describe the bug (🐛 if you encounter this issue)

After upgrading the RKE2 control nodes from v1.22.8+rke2r1 to v1.25.9+rke2r1, agent nodes cannot be drained; the drain gets stuck with "Cannot evict pod as it would violate the pod's disruption budget."

To Reproduce

On each RKE2 control node, update to the latest release and restart rke2:

curl -sfL https://get.rke2.io | sh -
systemctl restart rke2-server

Perform these steps on all controller nodes. Once the upgrade is complete, try draining an agent node:

kubectl cordon agent-node
kubectl drain agent-node --ignore-daemonsets --delete-emptydir-data --pod-selector='app!=csi-attacher,app!=csi-provisioner,app!=longhorn-admission-webhook,app!=longhorn-conversion-webhook,app!=longhorn-driver-deployer' --grace-period=10
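
For context while reproducing, a quick way to see which Longhorn pods sit on the node being drained and which PDBs could block their eviction (the node name agent-node is a placeholder):

kubectl -n longhorn-system get pods -o wide --field-selector spec.nodeName=agent-node
kubectl -n longhorn-system get pdb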

Expected behavior

Agent node is drained.

Log or Support bundle

If applicable, add the Longhorn managers’ log or support bundle when the issue happens. You can generate a Support Bundle using the link at the footer of the Longhorn UI.

Environment

  • Longhorn version: 1.2.6 (have also tried upgrading Longhorn to 1.3.3)
  • Installation method (e.g. Rancher Catalog App/Helm/Kubectl): Helm
  • Kubernetes distro (e.g. RKE/K3s/EKS/OpenShift) and version: v1.22.8+rke2r1
    • Number of management node in the cluster: 3
    • Number of worker node in the cluster: 4
  • Node config
    • OS type and version: Ubuntu 20.04
    • CPU per node: 4
    • Memory per node: 8GB
    • Disk type (e.g. SSD/NVMe): NVMe
    • Network bandwidth between the nodes: 10GB
  • Underlying Infrastructure (e.g. on AWS/GCE, EKS/GKE, VMWare/KVM, Baremetal): VMWare
  • Number of Longhorn volumes in the cluster: 1

Additional context

The RKE2 upgrade is being done from 1.22 to 1.25 directly, not in incremental steps (I don't know whether that has any impact). From what I can tell, the PDBs are not being deleted when the agent node is cordoned.

Pods and PDBs before cordon

csi-attacher-5f46994f7-9629c                             1/1     Running   0          20h   10.42.3.10  
csi-resizer-6dd8bd4c97-srr7d                             1/1     Running   0          20h   10.42.3.11
csi-snapshotter-86f65d8bc-p7p6q                          1/1     Running   0          20h   10.42.3.12
engine-image-ei-9bf563e8-wjm88                           1/1     Running   0          20h   10.42.3.7
instance-manager-e-6defff48                              1/1     Running   0          20h   10.42.3.8
instance-manager-r-d2f0328a                              1/1     Running   0          20h   10.42.3.9
longhorn-csi-plugin-rvhgz                                2/2     Running   0          20h   10.42.3.13
longhorn-manager-bx4x6                                   1/1     Running   0          20h   10.42.3.5
longhorn-ui-57c49478dc-s4vtl                             1/1     Running   0          20h   10.42.3.6

After cordon

kubectl cordon agent-node
node/agent-node cordoned

kubectl -n longhorn-system get pdb
NAME                          MIN AVAILABLE   MAX UNAVAILABLE   ALLOWED DISRUPTIONS   AGE
instance-manager-e-17712f99   1               N/A               0                     20h
instance-manager-e-6defff48   1               N/A               0                     20h
instance-manager-e-9c26f77c   1               N/A               0                     20h
instance-manager-e-d82e16a5   1               N/A               0                     20h
instance-manager-r-23246e55   1               N/A               0                     20h
instance-manager-r-37ca8394   1               N/A               0                     20h
instance-manager-r-d2f0328a   1               N/A               0                     20h
instance-manager-r-e3eda6dd   1               N/A               0                     20h
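
A sketch of how the blocking PDB can be confirmed, assuming (as the listings above suggest) that the PDB carries the same name as the instance-manager pod on the cordoned node; the concrete name below is taken from this report and will differ per cluster:

# Show the selector and status of the PDB matching the node's instance-manager pod
kubectl -n longhorn-system get pdb instance-manager-e-6defff48 -o yaml
# 0 allowed disruptions here is what makes kubectl drain retry indefinitely
kubectl -n longhorn-system get pdb instance-manager-e-6defff48 -o jsonpath='{.status.disruptionsAllowed}{"\n"}'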

longhorn-support-bundle_6c16098f-a97d-41e6-8aa8-1329a6535eb8_2023-05-10T16-57-07Z.zip

About this issue

  • State: open
  • Created a year ago
  • Reactions: 5
  • Comments: 24 (6 by maintainers)

Most upvoted comments

This still happens on v1.4.2

I have the same problem:

chart: longhorn
version: 1.5.1

(the longhorn-manager has such a low age because I tried killing it)

[Screenshots: 2023-09-05 at 22:28:39 and 2023-09-05 at 22:28:48]

To fix it, I manually modified the pod disruption budgets.

I noticed that on the nodes where the upgrade worked, the pod disruption budget had been created recently (its age was about 30 minutes, i.e. from when the node was updated). On the nodes where it got stuck, the PDB was 30 days old (probably from when I installed Longhorn).

Not sure if this is the reason, but I use strictly-local volumes on 3 out of 6 nodes. One node with a strictly-local volume upgraded without this problem; two experienced the problem.
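
To help correlate this, a hedged way to list which volumes use strict-local data locality (assuming the Longhorn Volume CRD exposes dataLocality in its spec, as recent releases do):

kubectl -n longhorn-system get volumes.longhorn.io -o custom-columns=NAME:.metadata.name,DATA_LOCALITY:.spec.dataLocality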

Sadly I didn't take a screenshot of the PDBs, but I deleted them for the 2 failed nodes and Longhorn recreated them. I am now waiting for k3s 1.27.6 to see how it behaves on the next update.
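
A minimal sketch of the manual workaround described in this comment (the PDB name below is a placeholder taken from the original report; Longhorn recreates the PDB on its own, as noted above):

# Find the stale instance-manager PDB(s) for the node that will not drain
kubectl -n longhorn-system get pdb | grep instance-manager
# Delete the one matching the stuck node's instance-manager pod, then re-run the drain
kubectl -n longhorn-system delete pdb instance-manager-e-6defff48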

Facing the same issue, currently on Longhorn v1.4.3. What can we do to help you debug the issue?

This started happening to me last week when I upgraded from Longhorn 1.5.1 -> 1.5.4. The RKE clusters are running 1.26.8.

Draining procedure gets stuck at: error when evicting pods/"instance-manager-992d684c4a70f8d17d28be3fe762bf61" -n "longhorn-system" (will retry after 5s): Cannot evict pod as it would violate the pod's disruption budget.

This is very frustrating, as my drain-and-patch Ansible playbook can't be used anymore.
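
Not from this thread, but a hedged sketch of a pre-drain check an automation step could run so the playbook fails fast instead of retrying eviction forever (NODE is a placeholder; it relies on the PDB being named after the instance-manager pod, as seen in the listings above):

NODE=agent-node
for pod in $(kubectl -n longhorn-system get pods -o name --field-selector spec.nodeName="$NODE" | grep instance-manager); do
  name=${pod#pod/}
  # A budget with 0 allowed disruptions is what blocks kubectl drain
  kubectl -n longhorn-system get pdb "$name" -o custom-columns=NAME:.metadata.name,ALLOWED:.status.disruptionsAllowed
done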

When it happens next time, could you send us a support bundle, taken while the issue is happening, to longhorn-support-bundle@suse.com? Thank you.

It is unfortunately not consistent. It happens sometimes as I drain the nodes for patches. The drain just kept reporting that it couldn't evict the pod, for hours, until I noticed it. There are no volumes with fewer than 3 replicas, and all nodes have enough space for the replicas if two nodes fail.
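
For reference, a hedged way to double-check the replica count and health per volume when ruling out the last-healthy-replica case (assuming the Volume CRD exposes numberOfReplicas in its spec and robustness in its status):

kubectl -n longhorn-system get volumes.longhorn.io -o custom-columns=NAME:.metadata.name,REPLICAS:.spec.numberOfReplicas,ROBUSTNESS:.status.robustness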