longhorn: [BUG] 1.5.1: AttachVolume.Attach failed for volume "pvc-xxx" […] volume pvc-xxx failed to attach to node node0x with attachmentID csi-xxx
Describe the bug (🐛 if you encounter this issue)
Sometimes, after cordoning / draining a cluster node, some pods remain in the Pending state.
The events generally display:
```
LAST SEEN   TYPE      REASON              OBJECT                          MESSAGE
75s         Warning   FailedAttachVolume  pod/keycloak-test-postgresql-0  AttachVolume.Attach failed for volume "pvc-b91abf30-485d-4aeb-96e1-938a50bdb291" : rpc error: code = Internal desc = volume pvc-b91abf30-485d-4aeb-96e1-938a50bdb291 failed to attach to node node03 with attachmentID csi-a782ae047fc39dbf4c999f20196ca3ce16d7d9bf7cb5382a57839f1c2b9f6e8a: the volume is currently attached to different node node04
10m         Warning   FailedMount         pod/keycloak-test-postgresql-0  Unable to attach or mount volumes: unmounted volumes=[data], unattached volumes=[dshm data kube-api-access-cjdtl]: timed out waiting for the condition
```
I have found no way to get this pod working again, and the volume appears healthy in the UI.
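For anyone hitting the same state, here is a minimal sketch of the commands we use to inspect a stuck attachment. The resource names come from the events above; the Longhorn-side CRD names (`volumes.longhorn.io`, `volumeattachments.longhorn.io`) are what I recall from 1.5.x, so please verify them on your own cluster:

```bash
# Kubernetes-side CSI attachment objects referencing the PV
kubectl get volumeattachments.storage.k8s.io | grep pvc-b91abf30

# Longhorn-side view of which node the volume thinks it is attached to
kubectl -n longhorn-system get volumes.longhorn.io \
  pvc-b91abf30-485d-4aeb-96e1-938a50bdb291 -o yaml | grep -i nodeid

# Longhorn attachment tickets for the same volume (CRD reworked in 1.5.x)
kubectl -n longhorn-system get volumeattachments.longhorn.io \
  pvc-b91abf30-485d-4aeb-96e1-938a50bdb291 -o yaml
```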
Environment
- Longhorn version: 1.5.1
- Installation method: Helm
- Kubernetes distro and version: Official Kubernetes deployed with Kubespray, 1.24
- Number of management nodes in the cluster: 3
- Number of worker nodes in the cluster: 4
- Node config
- OS type and version: Debian 10
- Kernel version: 4.19.0-25-amd64
- CPU per node: 96
- Memory per node: 256
- Disk type(e.g. SSD/NVMe/HDD): SSD
My support bundle is ~600 MB; I'll prepare the upload and send it to support.
The same goes for me: after a reboot campaign, a lot of pods got stuck. I took the opportunity to regenerate a support bundle, if that helps.
Hello @PhanLe1010, best wishes for 2024.
We have found that a Helm Kafka deployment (bitnami/kafka:3.4.0-debian-11-r28) often runs into the problem.
Our configuration is very basic; under persistence we declare:
```yaml
persistence:
  enabled: true
  size: 2Gi
  storageClass: longhorn
```
Then it works for a while, and one day, after a node reboot or a helm upgrade, you are back to the initial problem. It is hard to reproduce starting from the given workaround: https://github.com/longhorn/longhorn/issues/7313#issuecomment-1863606967
If you need any further details, don't hesitate to ask; a sketch of how we deploy it is below.
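For context, this is roughly how such a deployment might be applied; the release name and namespace here are placeholders, and only the persistence values above are our actual settings:

```bash
# Hypothetical reproduction sketch; release name and namespace are examples only.
helm repo add bitnami https://charts.bitnami.com/bitnami
helm upgrade --install kafka-test bitnami/kafka \
  --namespace kafka-test --create-namespace \
  --set persistence.enabled=true \
  --set persistence.size=2Gi \
  --set persistence.storageClass=longhorn
```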
I also encountered a curious situation with the attachmentTicket on pvc-a574efdb-67fc-49f5-b592-9f595d1e3b9a: I was forced to delete a null ticket so that the pod could be rescheduled.
Hi @PhanLe1010, we have been dealing with this type of problem since July, i.e. since 1.5.0. When I encountered the current problem I immediately upgraded to 1.5.1, but there has been no improvement. Each Kubernetes upgrade is complicated because a large proportion of our volumes remain pending, since the pods are not scheduled on the right node. Before upgrading to this major release we were on 1.4.3. We have been updating Longhorn since 1.2.x, or a little earlier.
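In case it helps others, this is roughly how I removed the null ticket. The `volumeattachments.longhorn.io` resource and its `spec.attachmentTickets` field are as I remember them from 1.5.x; please double-check the exact names on your cluster before editing anything:

```bash
# Inspect the Longhorn VolumeAttachment for the stuck volume
kubectl -n longhorn-system get volumeattachments.longhorn.io \
  pvc-a574efdb-67fc-49f5-b592-9f595d1e3b9a -o yaml

# Open it for editing, remove the entry under spec.attachmentTickets whose
# value is null (or points at the wrong node), then save; longhorn-manager
# should retry the attachment afterwards.
kubectl -n longhorn-system edit volumeattachments.longhorn.io \
  pvc-a574efdb-67fc-49f5-b592-9f595d1e3b9a
```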