longhorn: [BUG] longhorn rwx volume fails to mount on first pod

Describe the bug (🐛 if you encounter this issue)

Deployed a rwx pvc and pod. The pod fails to mount the pvc.

` Events: Type Reason Age From Message

Normal Scheduled 5m41s default-scheduler Successfully assigned default/rwx-node-a to 8c857dec-7cb8-45cb-8e20-a6c7569446c4 Normal SuccessfulAttachVolume 5m29s attachdetach-controller AttachVolume.Attach succeeded for volume “pvc-254b9a67-3a84-4309-b591-b83bac9c4900” Warning FailedMount 5m12s (x6 over 5m28s) kubelet MountVolume.MountDevice failed for volume “pvc-254b9a67-3a84-4309-b591-b83bac9c4900” : rpc error: code = Aborted desc = volume pvc-254b9a67-3a84-4309-b591-b83bac9c4900 share not yet available Warning FailedMount 84s (x2 over 3m38s) kubelet Unable to attach or mount volumes: unmounted volumes=[rwx1], unattached volumes=[rwx1 kube-api-access-bkm5r]: timed out waiting for the condition Warning FailedMount 78s (x4 over 4m56s) kubelet MountVolume.MountDevice failed for volume “pvc-254b9a67-3a84-4309-b591-b83bac9c4900” : rpc error: code = Internal desc = mount failed: exit status 32 Mounting command: /usr/local/sbin/nsmounter Mounting arguments: mount -t nfs -o vers=4.1,noresvport,intr,hard 10.43.186.142:/pvc-254b9a67-3a84-4309-b591-b83bac9c4900 /var/lib/kubelet/plugins/kubernetes.io/csi/driver.longhorn.io/3816fa16e001ad6c7497b189d35ec998cc5b5416c8d37ba291636e1f9b007e0e/globalmount Output: mount: /var/lib/kubelet/plugins/kubernetes.io/csi/driver.longhorn.io/3816fa16e001ad6c7497b189d35ec998cc5b5416c8d37ba291636e1f9b007e0e/globalmount: mount point does not exist. dmesg(1) may have more information after failed mount system call. `

To Reproduce

`

apiVersion: v1 kind: PersistentVolumeClaim metadata: name: pvc-rwx spec: accessModes:

ReadWriteMany resources: requests: storage: 1Gi storageClassName: longhorn

apiVersion: v1 kind: Pod metadata: name: rwx-node-a namespace: default spec: restartPolicy: Always containers:

name: rwx-node-a image: busybox imagePullPolicy: IfNotPresent command: [“/bin/sh”] args:
- -c
- | while true; do echo “$(date) node-$(cat /sys/class/net/eth0/address)” >> /rwx-vol/shared-$(cat /sys/class/net/eth0/address).log; sleep $(($RANDOM % 5 + 5)); done volumeMounts:
- name: rwx1 mountPath: /rwx-vol volumes:
name: rwx1 persistentVolumeClaim: claimName: pvc-rwx `

Expected behavior

Pod successfully mounting the rwx pvc.

Support bundle for troubleshooting

Environment

The os environment is alpine linux kernel 5.10.186, k3s installed in an oci container atop alpine, and then longhorn installed within that k3s container.

Longhorn version: 1.4.x
Installation method (e.g. Rancher Catalog App/Helm/Kubectl): kubectl
Kubernetes distro (e.g. RKE/K3s/EKS/OpenShift) and version: k3s version v1.26.3+k3s1 (01ea3ff2)
- Number of management node in the cluster: 1
- Number of worker node in the cluster: 1
Node config
- OS type and version:
- Kernel version: 5.10.186
- CPU per node:
- Memory per node: 16GB
- Disk type(e.g. SSD/NVMe/HDD): SATA SSD
- Network bandwidth between the nodes: n/a
Underlying Infrastructure (e.g. on AWS/GCE, EKS/GKE, VMWare/KVM, Baremetal):
Number of Longhorn volumes in the cluster: 1
Impacted Longhorn resources:
- Volume names:

Additional context

If I manually create the mount’s path and run the same mount command the nfs export allows the pod to start writing to it.

eg. ` mkdir -p /var/lib/kubelet/plugins/kubernetes.io/csi/driver.longhorn.io/3816fa16e001ad6c7497b189d35ec998cc5b5416c8d37ba291636e1f9b007e 0e/globalmount; mount -t nfs -o vers=4.1,noresvport,intr,hard 10.43.186.142:/pvc-254b9a67-3a84-4309-b591-b83bac9c4900 /var/lib/kubelet/plugins/kubernetes.io/csi/driver.longh orn.io/3816fa16e001ad6c7497b189d35ec998cc5b5416c8d37ba291636e1f9b007e0e/globalmount

tail -f /var/lib/kubelet/plugins/kubernetes.io/csi/driver.longhorn.io/3816fa16e001ad6c7497b189d35ec998cc5b5416c8d37ba291636e1f9b007e0

e/globalmount/shared-18:66:da:0b:b4:e6.log Thu Oct 5 22:37:16 UTC 2023 node-ee:4e:68:2a:14:f9 Thu Oct 5 22:37:24 UTC 2023 node-ee:4e:68:2a:14:f9 Thu Oct 5 22:37:32 UTC 2023 node-ee:4e:68:2a:14:f9 Thu Oct 5 22:37:40 UTC 2023 node-ee:4e:68:2a:14:f9 Thu Oct 5 22:37:48 UTC 2023 node-ee:4e:68:2a:14:f9 Thu Oct 5 22:37:56 UTC 2023 node-ee:4e:68:2a:14:f9 Thu Oct 5 22:38:04 UTC 2023 node-ee:4e:68:2a:14:f9 Thu Oct 5 22:38:12 UTC 2023 node-ee:4e:68:2a:14:f9 Thu Oct 5 22:38:20 UTC 2023 node-ee:4e:68:2a:14:f9 Thu Oct 5 22:38:28 UTC 2023 node-ee:4e:68:2a:14:f9 Thu Oct 5 22:38:36 UTC 2023 node-ee:4e:68:2a:14:f9

Additionally if I create a second pod referencing this rwx volume, the second pod mounts it without issue since the mount point exists. `

About this issue

Original URL
State: closed
Created 9 months ago
Comments: 67 (23 by maintainers)

Most upvoted comments

For anyone coming across this running on Ubuntu 22.04 LTS (Jammy Jellyfish), we have had this occur after an upgrade to the 5.15.0-94 kernel.

After rolling back to 5.15.0-92 our RWX volumes mount fine again on the affected cluster.

+19

andrewheberle on Feb 12, 2024

Thanks for the tcpdump. It did help. I believe the bug is fixed by https://lore.kernel.org/all/20231009145901.99260-1-olga.kornievskaia@gmail.com/ which was posted about a week ago. It doesn’t seem to have been picked up by the nfs maintainers yet so it might be a few weeks before it lands in upstream-stable. I have submitted to the SUSE stable kernel which feeds into tumbleweed, so the next kernel released for tumbleweed should have it. Until then (which will have a 6.5.7 number (or later)) I suggest using the 6.5.5 kernel.

neilbrown on Oct 16, 2023

I think I face the same or somehow similar issue.

I’m using https://github.com/kube-hetzner/terraform-hcloud-kube-hetzner to setup K8s on Hetzner based on MicroOS and a custom installed Longhorn (1.5.1 and 1.4.2), I’m facing this issue in multiple clusters since today.

I see this error message in the system journal:

Oct 11 07:42:23 dev-worker-1 k3s[1294]: Mounting command: /usr/local/sbin/nsmounter
Oct 11 07:42:23 dev-worker-1 k3s[1294]: Mounting arguments: mount -t nfs -o vers=4.1,noresvport,intr,hard 10.43.207.185:/pvc-13538170-4278-4467-b2b0-1f1ba6f54a4c /var/lib/kubelet/plugins/kubernetes.io/csi/driver.longhorn.io/185c34f566c2eca6e8c7c6a2ede2094c076d7d25ddae286dc633eeef80551af0/globalmount
Oct 11 07:42:23 dev-worker-1-autoscaled-small-19baf778f50efd8c k3s[1294]: Output: mount.nfs: Protocol not supported for 10.43.207.185:/pvc-13538170-4278-4467-b2b0-1f1ba6f54a4c on /var/lib/kubelet/plugins/kubernetes.io/csi/driver.longhorn.io/185c34f566c2eca6e8c7c6a2ede2094c076d7d25ddae286dc633eeef80551af0/globalmount

Any help or suggestion is appreciated!

janosmiko on Oct 11, 2023

@james-munson Let’s make sure our doc has this notice. Besides doc, also helps improve our preflight check script to give users a warning about that.

innobead on Feb 15, 2024

For anyone coming across this running on Ubuntu 22.04 LTS (Jammy Jellyfish), we have had this occur after an upgrade to the 5.15.0-94 kernel.

After rolling back to 5.15.0-92 our RWX volumes mount fine again on the affected cluster.

Yep. I’ve got the same issue currently. After 195 days of no issues, suddenly it updated and stopped working… Fun.

Edit: rolled back to 5.15.0-91 and it resolved the issue. Everything is working again.

ZandercraftGames on Feb 15, 2024

For anyone coming across this running on Ubuntu 22.04 LTS (Jammy Jellyfish), we have had this occur after an upgrade to the 5.15.0-94 kernel.

After rolling back to 5.15.0-92 our RWX volumes mount fine again on the affected cluster.

Seems to be the issue here as well. One of my nodes which does not have latest kernel does not have this problem. Thanks for the heads up

dbeltman on Feb 12, 2024

I would like to add another data point. I have rebooted my servers with kernel 6.5.5-200.fc38.x86_64 and the NFS issue went away. So something between versions 6.5.5 and 6.5.6 introduced the bug.

rome-user on Oct 16, 2023

@janosmiko Thank you for the clarification.

The issue looks like a kernel issue and is checked by @neilbrown currently according to the bugzilla ticket and the experiments.

Hello @neilbrown please let me know if anything I can help in the client side. Thank you.

In addition, we will prepare a knowledge base doc for the workaround.

cc @innobead

derekbit on Oct 16, 2023

For those facing this issue, here’s a (rollback)solution that worked for me:

Connect to the affected machine via SSH: ssh -i ~/.ssh/<your_private_key> root@<machine-ip>
Once connected, list the snapshots to identify the previous version: sudo snapper list
Roll back to the desired previous version: sudo snapper rollback <previous_version>
Reboot the machine sudo reboot

After some time, all the volumes should be attached again.

Thanks to @janosmiko for highlighting this issue.

mahmuzicamel on Oct 12, 2023

@vincentkelleher even if you restart the node again, won’t cause a problem ?

I’ve tried restarting my cluster twice since and things are holding up fine 😊

vincentkelleher on Feb 29, 2024

@derekbit thanks for the super quick answer 🙌

I’ve tried upgrading to kernel v6.5.7 as per the article, as it was easier tu upgrade to a newer version, but it causes the same issue.

I’ll try downgrading so.

vincentkelleher on Feb 28, 2024

@derekbit its the same problem i reported on #8018 ! kernel version: 5.15.0.97 ubuntu 20.04

Siradjedd on Feb 28, 2024

@andrewheberle, @dbeltman and @ZandercraftGames

The issue is same as https://longhorn.io/kb/troubleshooting-rwx-volume-fails-to-attached-caused-by-protocol-not-supported/. Please see https://github.com/longhorn/longhorn/issues/6887#issuecomment-1946197456 for more information.

cc @james-munson @yardenshoham

derekbit on Feb 15, 2024

It may be https://bugs.launchpad.net/ubuntu/+source/linux/+bug/2049689 So theoretically a kernel upgrade should fix it

yardenshoham on Feb 15, 2024

I figured out my issue is due to the mount being completed in the wrong mount namespace (pid 1). I have a patch ready to resolve my issue which I’ll submit soon.

andrewd-zededa on Feb 1, 2024

@derekbit ,

I just verified it again, and looks like downgrading the kernel solved the issue finally. 🎉 (Maybe the node where the nfs-server run was still not restarted.)

What I previously tested:

it worked for me when I downgraded MicroOS to a snapshot before 20231010. Eg 20231008.
it did not work when I just downgraded the nfs-client package and I was on a newer MicroOS then 20231010.

janosmiko on Oct 16, 2023

@derekbit I tried to downgrade the kernel and restarted the nodes but the issue still persist.

See here: https://bugzilla.opensuse.org/show_bug.cgi?id=1216201

Edit: maybe it’s not in the kernel-default package, but other kernel related packages?

janosmiko on Oct 16, 2023

It was not loaded previously, but even after loading it, it fails:

worker-autoscaled-medium-47604670eec959dc:~ # lsmod | grep nfs
worker-autoscaled-medium-47604670eec959dc:~ # modprobe nfs
worker-autoscaled-medium-47604670eec959dc:~ # modprobe nfsv4
worker-autoscaled-medium-47604670eec959dc:~ # lsmod | grep nfs
nfsv4                1241088  0
dns_resolver           16384  1 nfsv4
nfs                   634880  1 nfsv4
lockd                 184320  1 nfs
sunrpc                892928  3 nfsv4,lockd,nfs
fscache               405504  1 nfs
netfs                  69632  2 fscache,nfs

Oct 12 07:29:02 worker-autoscaled-medium-47604670eec959dc k3s[1291]: Mounting arguments: mount -t nfs -o vers=4.1,noresvport,intr,hard 10.43.108.246:/pvc-35f9342e-2c29-4858-944e-e30410105151 /var/lib/kubelet/plugins/kubernetes.io/csi/driver.longhorn.io/bde63d3d40aa181876e33213ba79359708f2dbfcb84a3e8c09bb30bd56097da3/globalmount
Oct 12 07:29:02 worker-autoscaled-medium-47604670eec959dc k3s[1291]: Output: mount.nfs: Protocol not supported for 10.43.108.246:/pvc-35f9342e-2c29-4858-944e-e30410105151 on /var/lib/kubelet/plugins/kubernetes.io/csi/driver.longhorn.io/bde63d3d40aa181876e33213ba79359708f2dbfcb84a3e8c09bb30bd56097da3/globalmount
^C

janosmiko on Oct 12, 2023

I have the same problem since yesterday, using MicroOS. Also fixed it with a snapper rollback. The problematic version is:

NAME="openSUSE MicroOS"
# VERSION="20231010"
ID="opensuse-microos"
ID_LIKE="suse opensuse opensuse-tumbleweed"
VERSION_ID="20231010"
PRETTY_NAME="openSUSE MicroOS"
ANSI_COLOR="0;32"
CPE_NAME="cpe:/o:opensuse:microos:20231010"
BUG_REPORT_URL="https://bugzilla.opensuse.org"
SUPPORT_URL="https://bugs.opensuse.org"
HOME_URL="https://www.opensuse.org/"
DOCUMENTATION_URL="https://en.opensuse.org/Portal:MicroOS"
LOGO="distributor-logo-MicroOS"

> cat /boot/config-`uname -r`| grep CONFIG_NFS_
CONFIG_NFS_FS=m
CONFIG_NFS_V2=m
CONFIG_NFS_V3=m
CONFIG_NFS_V3_ACL=y
CONFIG_NFS_V4=m
CONFIG_NFS_SWAP=y
CONFIG_NFS_V4_1=y
CONFIG_NFS_V4_2=y
CONFIG_NFS_V4_1_IMPLEMENTATION_ID_DOMAIN="kernel.org"
# CONFIG_NFS_V4_1_MIGRATION is not set
CONFIG_NFS_V4_SECURITY_LABEL=y
CONFIG_NFS_FSCACHE=y
# CONFIG_NFS_USE_LEGACY_DNS is not set
CONFIG_NFS_USE_KERNEL_DNS=y
CONFIG_NFS_DEBUG=y
# CONFIG_NFS_DISABLE_UDP_SUPPORT is not set
# CONFIG_NFS_V4_2_READ_PLUS is not set
CONFIG_NFS_ACL_SUPPORT=m
CONFIG_NFS_COMMON=y
CONFIG_NFS_V4_2_SSC_HELPER=y

primeXchange on Oct 12, 2023