longhorn: [BUG] longhorn rwx volume fails to mount on first pod
Describe the bug (🐛 if you encounter this issue)
Deployed an RWX PVC and a pod that uses it. The pod fails to mount the PVC.
```
Events:
  Type     Reason                  Age                    From                     Message
  Normal   Scheduled               5m41s                  default-scheduler        Successfully assigned default/rwx-node-a to 8c857dec-7cb8-45cb-8e20-a6c7569446c4
  Normal   SuccessfulAttachVolume  5m29s                  attachdetach-controller  AttachVolume.Attach succeeded for volume "pvc-254b9a67-3a84-4309-b591-b83bac9c4900"
  Warning  FailedMount             5m12s (x6 over 5m28s)  kubelet                  MountVolume.MountDevice failed for volume "pvc-254b9a67-3a84-4309-b591-b83bac9c4900" : rpc error: code = Aborted desc = volume pvc-254b9a67-3a84-4309-b591-b83bac9c4900 share not yet available
  Warning  FailedMount             84s (x2 over 3m38s)    kubelet                  Unable to attach or mount volumes: unmounted volumes=[rwx1], unattached volumes=[rwx1 kube-api-access-bkm5r]: timed out waiting for the condition
  Warning  FailedMount             78s (x4 over 4m56s)    kubelet                  MountVolume.MountDevice failed for volume "pvc-254b9a67-3a84-4309-b591-b83bac9c4900" : rpc error: code = Internal desc = mount failed: exit status 32
    Mounting command: /usr/local/sbin/nsmounter
    Mounting arguments: mount -t nfs -o vers=4.1,noresvport,intr,hard 10.43.186.142:/pvc-254b9a67-3a84-4309-b591-b83bac9c4900 /var/lib/kubelet/plugins/kubernetes.io/csi/driver.longhorn.io/3816fa16e001ad6c7497b189d35ec998cc5b5416c8d37ba291636e1f9b007e0e/globalmount
    Output: mount: /var/lib/kubelet/plugins/kubernetes.io/csi/driver.longhorn.io/3816fa16e001ad6c7497b189d35ec998cc5b5416c8d37ba291636e1f9b007e0e/globalmount: mount point does not exist.
    dmesg(1) may have more information after failed mount system call.
```
To Reproduce
```yaml
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: pvc-rwx
spec:
  accessModes:
    - ReadWriteMany
  resources:
    requests:
      storage: 1Gi
  storageClassName: longhorn
---
apiVersion: v1
kind: Pod
metadata:
  name: rwx-node-a
  namespace: default
spec:
  restartPolicy: Always
  containers:
    - name: rwx-node-a
      image: busybox
      imagePullPolicy: IfNotPresent
      command: ["/bin/sh"]
      args:
        - -c
        - |
          while true; do echo "$(date) node-$(cat /sys/class/net/eth0/address)" >> /rwx-vol/shared-$(cat /sys/class/net/eth0/address).log; sleep $(($RANDOM % 5 + 5)); done
      volumeMounts:
        - name: rwx1
          mountPath: /rwx-vol
  volumes:
    - name: rwx1
      persistentVolumeClaim:
        claimName: pvc-rwx
```
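A hypothetical way to run the reproduction, assuming the two manifests above are saved together as `rwx-repro.yaml` (the file name is only an assumption):

```sh
kubectl apply -f rwx-repro.yaml
kubectl get pvc pvc-rwx            # wait until the claim reports Bound
kubectl describe pod rwx-node-a    # FailedMount events like the ones above show up here
```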
Expected behavior
The pod successfully mounts the RWX PVC.
Support bundle for troubleshooting
Environment
The OS environment is Alpine Linux (kernel 5.10.186), with k3s installed in an OCI container atop Alpine, and Longhorn installed within that k3s container.
- Longhorn version: 1.4.x
- Installation method (e.g. Rancher Catalog App/Helm/Kubectl): kubectl
- Kubernetes distro (e.g. RKE/K3s/EKS/OpenShift) and version: k3s version v1.26.3+k3s1 (01ea3ff2)
- Number of management node in the cluster: 1
- Number of worker node in the cluster: 1
- Node config
- OS type and version:
- Kernel version: 5.10.186
- CPU per node:
- Memory per node: 16GB
- Disk type(e.g. SSD/NVMe/HDD): SATA SSD
- Network bandwidth between the nodes: n/a
- Underlying Infrastructure (e.g. on AWS/GCE, EKS/GKE, VMWare/KVM, Baremetal):
- Number of Longhorn volumes in the cluster: 1
- Impacted Longhorn resources:
- Volume names:
Additional context
If I manually create the mount point and run the same mount command, the NFS export mounts successfully and the pod starts writing to it.
e.g.:
```
mkdir -p /var/lib/kubelet/plugins/kubernetes.io/csi/driver.longhorn.io/3816fa16e001ad6c7497b189d35ec998cc5b5416c8d37ba291636e1f9b007e0e/globalmount
mount -t nfs -o vers=4.1,noresvport,intr,hard 10.43.186.142:/pvc-254b9a67-3a84-4309-b591-b83bac9c4900 /var/lib/kubelet/plugins/kubernetes.io/csi/driver.longhorn.io/3816fa16e001ad6c7497b189d35ec998cc5b5416c8d37ba291636e1f9b007e0e/globalmount

tail -f /var/lib/kubelet/plugins/kubernetes.io/csi/driver.longhorn.io/3816fa16e001ad6c7497b189d35ec998cc5b5416c8d37ba291636e1f9b007e0e/globalmount/shared-18:66:da:0b:b4:e6.log
Thu Oct 5 22:37:16 UTC 2023 node-ee:4e:68:2a:14:f9
Thu Oct 5 22:37:24 UTC 2023 node-ee:4e:68:2a:14:f9
Thu Oct 5 22:37:32 UTC 2023 node-ee:4e:68:2a:14:f9
Thu Oct 5 22:37:40 UTC 2023 node-ee:4e:68:2a:14:f9
Thu Oct 5 22:37:48 UTC 2023 node-ee:4e:68:2a:14:f9
Thu Oct 5 22:37:56 UTC 2023 node-ee:4e:68:2a:14:f9
Thu Oct 5 22:38:04 UTC 2023 node-ee:4e:68:2a:14:f9
Thu Oct 5 22:38:12 UTC 2023 node-ee:4e:68:2a:14:f9
Thu Oct 5 22:38:20 UTC 2023 node-ee:4e:68:2a:14:f9
Thu Oct 5 22:38:28 UTC 2023 node-ee:4e:68:2a:14:f9
Thu Oct 5 22:38:36 UTC 2023 node-ee:4e:68:2a:14:f9
```
Additionally, if I create a second pod referencing this RWX volume, the second pod mounts it without issue, since the mount point now exists.
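As a rough node-side check (`<volume-hash>` is a placeholder for the globalmount path from the events above), one can verify whether the directory exists and whether the NFS share is mounted at all:

```sh
# does the globalmount directory exist in the namespace the CSI plugin mounts into?
ls -ld /var/lib/kubelet/plugins/kubernetes.io/csi/driver.longhorn.io/<volume-hash>/globalmount

# list NFS mounts on the node to see whether the share-manager export is mounted anywhere
findmnt -t nfs,nfs4
```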
About this issue
- Original URL
- State: closed
- Created 9 months ago
- Comments: 67 (23 by maintainers)
For anyone coming across this running on Ubuntu 22.04 LTS (Jammy Jellyfish), we have had this occur after an upgrade to the 5.15.0-94 kernel.
After rolling back to 5.15.0-92 our RWX volumes mount fine again on the affected cluster.
Thanks for the tcpdump; it did help. I believe the bug is fixed by https://lore.kernel.org/all/20231009145901.99260-1-olga.kornievskaia@gmail.com/, which was posted about a week ago. It doesn't seem to have been picked up by the NFS maintainers yet, so it might be a few weeks before it lands in upstream-stable. I have submitted it to the SUSE stable kernel, which feeds into Tumbleweed, so the next kernel released for Tumbleweed should include it. Until then (the fix will land in 6.5.7 or later), I suggest using the 6.5.5 kernel.
I think I am facing the same or a somewhat similar issue.
I'm using https://github.com/kube-hetzner/terraform-hcloud-kube-hetzner to set up K8s on Hetzner based on MicroOS, with a custom-installed Longhorn (1.5.1 and 1.4.2). I've been hitting this issue in multiple clusters since today.
I see this error message in the system journal:
Any help or suggestion is appreciated!
@james-munson Let's make sure our docs have this notice. Besides the docs, it would also help to improve our preflight check script so it warns users about this.
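A rough sketch of what such a preflight warning might look like; the kernel versions flagged here are only the ones reported in this thread (6.5.6 and the Ubuntu 5.15.0-94 series), not an official compatibility list, and this is not the actual Longhorn check script:

```sh
#!/bin/sh
# Sketch only: warn about kernels with the NFS client regression reported in this issue.
KERNEL=$(uname -r)
case "$KERNEL" in
  6.5.6*|5.15.0-94*)
    echo "WARNING: kernel $KERNEL has a reported NFS client regression that can break RWX volume mounts."
    echo "         Consider moving to a fixed kernel (6.5.7 or later) or rolling back (e.g. 5.15.0-92)."
    ;;
  *)
    echo "Kernel $KERNEL: no RWX/NFS regression reported for this version in this issue."
    ;;
esac
```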
Yep. I’ve got the same issue currently. After 195 days of no issues, suddenly it updated and stopped working… Fun.
Edit: rolled back to 5.15.0-91 and it resolved the issue. Everything is working again.
Seems to be the issue here as well. One of my nodes, which does not have the latest kernel, does not have this problem. Thanks for the heads-up.
I would like to add another data point. I have rebooted my servers with kernel `6.5.5-200.fc38.x86_64` and the NFS issue went away. So something between versions 6.5.5 and 6.5.6 introduced the bug.
@janosmiko Thank you for the clarification.
This looks like a kernel issue and is currently being investigated by @neilbrown, according to the bugzilla ticket and the experiments.
Hello @neilbrown, please let me know if there is anything I can help with on the client side. Thank you.
In addition, we will prepare a knowledge base doc for the workaround.
cc @innobead
For those facing this issue, here's a (rollback) solution that worked for me:
- `ssh -i ~/.ssh/<your_private_key> root@<machine-ip>`
- `sudo snapper list`
- `sudo snapper rollback <previous_version>`
- `sudo reboot`

After some time, all the volumes should be attached again.
Thanks to @janosmiko for highlighting this issue.
I’ve tried restarting my cluster twice since and things are holding up fine 😊
@derekbit thanks for the super quick answer 🙌
I've tried upgrading to kernel v6.5.7 as per the article, as it was easier to upgrade to a newer version, but it causes the same issue.
I'll try downgrading instead.
@derekbit it's the same problem I reported in #8018! Kernel version: 5.15.0.97, Ubuntu 20.04.
@andrewheberle, @dbeltman and @ZandercraftGames
The issue is the same as https://longhorn.io/kb/troubleshooting-rwx-volume-fails-to-attached-caused-by-protocol-not-supported/. Please see https://github.com/longhorn/longhorn/issues/6887#issuecomment-1946197456 for more information.
cc @james-munson @yardenshoham
It may be https://bugs.launchpad.net/ubuntu/+source/linux/+bug/2049689, so theoretically a kernel upgrade should fix it.
I figured out that my issue is due to the mount being completed in the wrong mount namespace (pid 1). I have a patch ready to resolve it, which I'll submit soon.
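For context, the mount in the failing events above is wrapped by nsmounter rather than invoked directly, and this comment indicates it ends up executing in pid 1's mount namespace. A minimal sketch of that pattern (illustration only, not the actual nsmounter code or the pending patch; `<volume-hash>` is a placeholder):

```sh
# Illustration: run the same mount inside pid 1's mount namespace instead of the
# caller's own. In this setup (k3s inside an OCI container on Alpine), pid 1's
# namespace is not the one where kubelet created the globalmount directory,
# which matches the "mount point does not exist" error above.
nsenter --mount=/proc/1/ns/mnt -- \
  mount -t nfs -o vers=4.1,noresvport,intr,hard \
  10.43.186.142:/pvc-254b9a67-3a84-4309-b591-b83bac9c4900 \
  /var/lib/kubelet/plugins/kubernetes.io/csi/driver.longhorn.io/<volume-hash>/globalmount
```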
@derekbit,
I just verified it again, and it looks like downgrading the kernel finally solved the issue. 🎉 (Maybe the node where the nfs-server runs had still not been restarted.)
What I previously tested: the `nfs-client` package, and I was on a newer MicroOS than 20231010.
@derekbit I tried to downgrade the kernel and restarted the nodes, but the issue still persists.
See here: https://bugzilla.opensuse.org/show_bug.cgi?id=1216201
Edit: maybe it's not in the kernel-default package but in other kernel-related packages?
It was not loaded previously, but even after loading it, it fails:
This can be related: https://bugzilla.opensuse.org/show_bug.cgi?id=1214540
I have had the same problem since yesterday, using MicroOS. I also fixed it with a snapper rollback. The problematic version is: