longhorn: [BUG] 1.5.1: AttachVolume.Attach failed for volume "pvc-xxx" […] volume pvc-xxx failed to attach to node node0x with attachmentID csi-xxx
Describe the bug (🐛 if you encounter this issue)
Sometimes, after cordoning / draining a cluster node, some pods remain in the Pending state.
The events generally display:
```
LAST SEEN   TYPE      REASON              OBJECT                          MESSAGE
75s         Warning   FailedAttachVolume  pod/keycloak-test-postgresql-0  AttachVolume.Attach failed for volume "pvc-b91abf30-485d-4aeb-96e1-938a50bdb291" : rpc error: code = Internal desc = volume pvc-b91abf30-485d-4aeb-96e1-938a50bdb291 failed to attach to node node03 with attachmentID csi-a782ae047fc39dbf4c999f20196ca3ce16d7d9bf7cb5382a57839f1c2b9f6e8a: the volume is currently attached to different node node04
10m         Warning   FailedMount         pod/keycloak-test-postgresql-0  Unable to attach or mount volumes: unmounted volumes=[data], unattached volumes=[dshm data kube-api-access-cjdtl]: timed out waiting for the condition
```
I have found no way to get this pod working again, and the volume appears healthy in the UI.
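For anyone hitting the same state, here is a minimal sketch of the commands we use to inspect a stuck attachment. The resource names come from the events above; the Longhorn-side CRD names (`volumes.longhorn.io`, `volumeattachments.longhorn.io`) are what I recall from 1.5.x, so please verify them on your own cluster:

```bash
# Kubernetes-side CSI attachment objects referencing the PV
kubectl get volumeattachments.storage.k8s.io | grep pvc-b91abf30

# Longhorn-side view of which node the volume thinks it is attached to
kubectl -n longhorn-system get volumes.longhorn.io \
  pvc-b91abf30-485d-4aeb-96e1-938a50bdb291 -o yaml | grep -i nodeid

# Longhorn attachment tickets for the same volume (CRD reworked in 1.5.x)
kubectl -n longhorn-system get volumeattachments.longhorn.io \
  pvc-b91abf30-485d-4aeb-96e1-938a50bdb291 -o yaml
```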
Environment
- Longhorn version: 1.5.1
- Installation method: Helm
- Kubernetes distro and version: Official Kubernetes deployed with Kubespray, 1.24
- Number of management nodes in the cluster: 3
- Number of worker nodes in the cluster: 4
- Node config
- OS type and version: Debian 10
- Kernel version: 4.19.0-25-amd64
- CPU per node: 96
- Memory per node: 256
- Disk type(e.g. SSD/NVMe/HDD): SSD
My support bundle is ~600 MB; I'll prepare the upload and send it to support.
The same goes for me: after a reboot campaign, a lot of pods got stuck. I took the opportunity to regenerate a support bundle, if that helps.
Hello @PhanLe1010, best wishes for 2024.
We have found that a Helm Kafka deployment (bitnami/kafka:3.4.0-debian-11-r28) often runs into the problem.
Our configuration is very basic; under persistence we declare:
```yaml
persistence:
  enabled: true
  size: 2Gi
  storageClass: longhorn
```
Then it works for a while, and one day, after a node reboot or a helm upgrade, you are back to the initial problem. It is hard to reproduce starting from the given workaround: https://github.com/longhorn/longhorn/issues/7313#issuecomment-1863606967
If you need any further details, don't hesitate to ask; a sketch of how we deploy it is below.
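For context, this is roughly how such a deployment might be applied; the release name and namespace here are placeholders, and only the persistence values above are our actual settings:

```bash
# Hypothetical reproduction sketch; release name and namespace are examples only.
helm repo add bitnami https://charts.bitnami.com/bitnami
helm upgrade --install kafka-test bitnami/kafka \
  --namespace kafka-test --create-namespace \
  --set persistence.enabled=true \
  --set persistence.size=2Gi \
  --set persistence.storageClass=longhorn
```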
I also encountered a curious situation with the attachmentTicket on pvc-a574efdb-67fc-49f5-b592-9f595d1e3b9a: I was forced to delete a null ticket so that the pod could be rescheduled.
Hi @PhanLe1010, we have been dealing with this type of problem since July, i.e. since 1.5.0. When I encountered the current problem I immediately upgraded to 1.5.1, but there has been no improvement. Each Kubernetes upgrade is complicated because a large proportion of our volumes remain pending, since the pods are not scheduled on the right node. Before upgrading to this major release we were on 1.4.3. We have been updating Longhorn since 1.2.x, or a little earlier.
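In case it helps others, this is roughly how I removed the null ticket. The `volumeattachments.longhorn.io` resource and its `spec.attachmentTickets` field are as I remember them from 1.5.x; please double-check the exact names on your cluster before editing anything:

```bash
# Inspect the Longhorn VolumeAttachment for the stuck volume
kubectl -n longhorn-system get volumeattachments.longhorn.io \
  pvc-a574efdb-67fc-49f5-b592-9f595d1e3b9a -o yaml

# Open it for editing, remove the entry under spec.attachmentTickets whose
# value is null (or points at the wrong node), then save; longhorn-manager
# should retry the attachment afterwards.
kubectl -n longhorn-system edit volumeattachments.longhorn.io \
  pvc-a574efdb-67fc-49f5-b592-9f595d1e3b9a
```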