longhorn: [BUG] MountVolume.MountDevice failed for volume Output: mount.nfs: Protocol not supported
Describe the bug (🐛 if you encounter this issue)
My pods (using a specific volume) are no longer starting (they used to), and I get an error stating the following:
MountVolume.MountDevice failed for volume "pvc-d11e720d-2d0d-48d0-8b11-82ddf4ca7a58" : rpc error: code = Internal desc = mount failed: exit status 32 Mounting command: /usr/local/sbin/nsmounter Mounting arguments: mount -t nfs -o vers=4.1,noresvport,intr,hard 10.43.76.13:/pvc-d11e720d-2d0d-48d0-8b11-82ddf4ca7a58 /var/lib/kubelet/plugins/kubernetes.io/csi/driver.longhorn.io/12de6c62f175ce990f279cc34d4f579f36032e64e5d21392aa59f1ed192758cd/globalmount Output: mount.nfs: Protocol not supported for 10.43.76.13:/pvc-d11e720d-2d0d-48d0-8b11-82ddf4ca7a58 on /var/lib/kubelet/plugins/kubernetes.io/csi/driver.longhorn.io/12de6c62f175ce990f279cc34d4f579f36032e64e5d21392aa59f1ed192758cd/globalmount
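For reference, this error typically shows up in the pod's events; a command along these lines surfaces it (the pod name is a placeholder for whichever pod mounts the claim):
# Show the mount failure from the pod's events.
kubectl -n attic describe pod <pod-using-attic-db> | grep -A 5 MountVolume
# The same failure is also recorded as FailedMount events in the namespace.
kubectl -n attic get events --field-selector reason=FailedMount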
To Reproduce
I don't know how to reproduce this error. It happens for one of my volumes, and I don't know how to resolve it.
The configuration I have is the following:
apiVersion: v1
kind: PersistentVolume
metadata:
  name: pvc-d11e720d-2d0d-48d0-8b11-82ddf4ca7a58
  uid: 080fa053-2048-41fb-9ed7-c4f9e9214cc1
  resourceVersion: '318384351'
  creationTimestamp: '2023-10-03T19:53:29Z'
  annotations:
    longhorn.io/volume-scheduling-error: ''
    pv.kubernetes.io/provisioned-by: driver.longhorn.io
    volume.kubernetes.io/provisioner-deletion-secret-name: ''
    volume.kubernetes.io/provisioner-deletion-secret-namespace: ''
  selfLink: /api/v1/persistentvolumes/pvc-d11e720d-2d0d-48d0-8b11-82ddf4ca7a58
status:
  phase: Bound
spec:
  capacity:
    storage: 10Gi
  csi:
    driver: driver.longhorn.io
    volumeHandle: pvc-d11e720d-2d0d-48d0-8b11-82ddf4ca7a58
    fsType: ext4
    volumeAttributes:
      dataLocality: disabled
      fromBackup: ''
      fsType: ext4
      numberOfReplicas: '2'
      share: 'true'
      staleReplicaTimeout: '30'
      storage.kubernetes.io/csiProvisionerIdentity: 1696306401569-8081-driver.longhorn.io
  accessModes:
    - ReadWriteOnce
    - ReadWriteMany
  claimRef:
    kind: PersistentVolumeClaim
    namespace: attic
    name: attic-db
    uid: d11e720d-2d0d-48d0-8b11-82ddf4ca7a58
    apiVersion: v1
    resourceVersion: '318373363'
  persistentVolumeReclaimPolicy: Delete
  storageClassName: longhorn
  volumeMode: Filesystem
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: attic-db
  namespace: attic
  uid: d11e720d-2d0d-48d0-8b11-82ddf4ca7a58
  resourceVersion: '318373431'
  creationTimestamp: '2023-10-03T19:53:27Z'
  labels:
    kustomize.toolkit.fluxcd.io/name: 99-attic
    kustomize.toolkit.fluxcd.io/namespace: flux-system
  annotations:
    pv.kubernetes.io/bind-completed: 'yes'
    pv.kubernetes.io/bound-by-controller: 'yes'
    volume.beta.kubernetes.io/storage-provisioner: driver.longhorn.io
    volume.kubernetes.io/storage-provisioner: driver.longhorn.io
  selfLink: /api/v1/namespaces/attic/persistentvolumeclaims/attic-db
status:
  phase: Bound
  accessModes:
    - ReadWriteOnce
    - ReadWriteMany
  capacity:
    storage: 10Gi
spec:
  accessModes:
    - ReadWriteOnce
    - ReadWriteMany
  resources:
    requests:
      storage: 10Gi
  volumeName: pvc-d11e720d-2d0d-48d0-8b11-82ddf4ca7a58
  storageClassName: longhorn
  volumeMode: Filesystem
This is then used in a pod (owned by a deployment) with the following config:
volumes:
  - name: db
    persistentVolumeClaim:
      claimName: attic-db
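For reference, binding of the claim and the backing PV can be checked with standard kubectl commands, for example:
# Confirm the claim is Bound and which PV backs it.
kubectl -n attic get pvc attic-db
kubectl get pv pvc-d11e720d-2d0d-48d0-8b11-82ddf4ca7a58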
Expected behavior
The volume should get mounted normally and the pod should start.
Support bundle for troubleshooting
supportbundle_fb30f00f-8c5c-46cf-9f98-f097746ddc7e_2023-10-13T21-09-15Z.zip
Environment
- Longhorn version: v1.5.1
- Installation method (e.g. Rancher Catalog App/Helm/Kubectl): helm
- Kubernetes distro (e.g. RKE/K3s/EKS/OpenShift) and version: K3s v1.28.2+k3s1
- Number of management nodes in the cluster: 3
- Number of worker nodes in the cluster: 3
- Node config
- OS type and version: OpenSuse MicroOS
- Kernel version: 6.5.6-1-default
- CPU per node: 3
- Memory per node: 8G
- Disk type (e.g. SSD/NVMe/HDD): SSD
- Network bandwidth between the nodes: All nodes are VMs on the same physical machine, so I don't think there's much of a limit.
- Underlying Infrastructure (e.g. on AWS/GCE, EKS/GKE, VMWare/KVM, Baremetal): Baremetal
- Number of Longhorn volumes in the cluster: 6
- Impacted Longhorn resources: 1
- Volume names: pvc-d11e720d-2d0d-48d0-8b11-82ddf4ca7a58
Additional context
About this issue
- State: closed
- Created 9 months ago
- Comments: 37 (10 by maintainers)
We have exactly the same problem with the same Kernel version.
@lcapka @baskinsy
Thanks for your information.
I've reviewed your symptom description and the kernel 5.15.0-94 code. It appears the issue is related to an NFS commit, not the bug reported at https://bugs.launchpad.net/bugs/2052842.
The fix is applied in 5.15.0-100.110, but it does not seem to have been released yet.
@derekbit
Hi!
The issue is connected with RWX volumes that use NFS in Longhorn. Our cluster has a few nodes, all based on Ubuntu 22.04, and we are currently running RKE2 v1.27.10+rke2r1. Everything worked up to kernel 5.15.0-92; since 5.15.0-94 it stopped. Longhorn was (unfortunately) installed via RKE2's apps, so it is on the latest version available there, 102.3.1+up1.5.3. NFS is installed via the nfs-common apt package at the latest version Ubuntu offers, 2.6.1.
The problem can be reproduced even outside the cluster itself when a pod is in a back-off loop because of the mount error. In our case we manually ran the mount command (found in the failing pod's kubectl describe output) directly on the Ubuntu node.
Example from the node machine:
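A minimal sketch of that kind of manual test, reusing the mount arguments from the error at the top of this issue (the service IP and export path are specific to that volume, and the target directory here is just a throwaway mount point):
# Run directly on an affected node; values come from the failing volume's
# mount error and will differ per volume and cluster.
mkdir -p /tmp/nfs-test
mount -t nfs -o vers=4.1,noresvport,intr,hard \
  10.43.76.13:/pvc-d11e720d-2d0d-48d0-8b11-82ddf4ca7a58 /tmp/nfs-test
# On an affected kernel this fails with "mount.nfs: Protocol not supported".
umount /tmp/nfs-test 2>/dev/null || true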
We have even captured and looked into the TCP frames using tcpdump, but honestly no errors are visible there, and we don't have NFS protocol experts on our team. Anyway, you can find the tcpdump output attached.
mount-nfs-longhorn.dump.zip
I'm not sure whether it helps or not but here it is. Let me know what information you might need.
Letās track the issue in https://github.com/longhorn/longhorn/issues/7931.
Nope, still the same with the -97 kernel; it seems we have to wait for -100.
@james-munson
Can you help add the problematic version to the KB doc and update the check environment script, as @innobead mentioned in https://github.com/longhorn/longhorn/issues/6857#issuecomment-1945507658? Thank you.
Downgrading to 5.15.0-92 fixed it, but this seems to be related to a kernel change…
It's this one: nfs-client-2.6.3-39.5
See the related issue and discussion for more useful info:
https://github.com/kube-hetzner/terraform-hcloud-kube-hetzner/discussions/1018
https://github.com/kube-hetzner/terraform-hcloud-kube-hetzner/issues/1016
1.6.0 also fixed the issue for me
Hi @baskinsy, this issue is connected to NFS and the kernel, and NFS-backed volumes are only used for the ReadWriteMany (RWX) access mode. Volumes with the ReadWriteOnce (RWO) access mode are unaffected.
And I can confirm that as well with our production Longhorn deployment.
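For anyone checking whether their cluster has volumes that could be affected, a plain kubectl query like this lists PVCs with their access modes so the RWX ones stand out:
# RWX (ReadWriteMany) claims are the ones served over NFS by the
# share-manager; RWO claims use a block device and are not affected.
kubectl get pvc --all-namespaces \
  -o custom-columns=NAMESPACE:.metadata.namespace,NAME:.metadata.name,MODES:.spec.accessModes,STORAGECLASS:.spec.storageClassName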
Could be related to this:
https://bugs.launchpad.net/bugs/2052842
as I noticed that no block devices were created under /dev/longhorn after the Longhorn volume was mounted.
@derekbit IMHO this needs to be reopened, as it might affect other users on other distros that ship that kernel version.
I am experiencing the same issue after updating my Ubuntu 20.04 workers to kernel 5.15.0-94-generic, which was just released by Canonical. Maybe a faulty backport?
Same here. Volumes are successfully mounting again. Feel free to close this issue.
Thanks for this info. I updated all of my hosts from kernel 6.5.7 to 6.5.9 and this has resolved my issue with RWX volumes.
@derekbit Awesome! Thanks for letting us know!
@Alxandr @whezzel @a13xie The issue has been identified as a Linux kernel bug in kernel 6.5.6; please refer to https://github.com/longhorn/longhorn/issues/6857#issuecomment-1765376826.
It is fixed in Linux kernel 6.5.7 and newer. Until your OS distro releases a newer kernel, you can downgrade the kernel to an older one.
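As a rough sketch of that workaround (the uname check applies anywhere; the apt package names are Ubuntu-specific examples, and 5.15.0-92 is the known-good version mentioned in the comments above, so adjust for your distro and kernel series):
# Check the running kernel on each node.
uname -r
# Ubuntu example: install and reboot into a known-good kernel such as 5.15.0-92.
sudo apt-get install linux-image-5.15.0-92-generic linux-headers-5.15.0-92-generic
sudo reboot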