longhorn: [BUG] MountVolume.MountDevice failed for volume Output: mount.nfs: Protocol not supported

Describe the bug (šŸ› if you encounter this issue)

My pods (using a specific volume) are no longer starting (they used to), and I get an error stating the following:

MountVolume.MountDevice failed for volume "pvc-d11e720d-2d0d-48d0-8b11-82ddf4ca7a58" : rpc error: code = Internal desc = mount failed: exit status 32
Mounting command: /usr/local/sbin/nsmounter
Mounting arguments: mount -t nfs -o vers=4.1,noresvport,intr,hard 10.43.76.13:/pvc-d11e720d-2d0d-48d0-8b11-82ddf4ca7a58 /var/lib/kubelet/plugins/kubernetes.io/csi/driver.longhorn.io/12de6c62f175ce990f279cc34d4f579f36032e64e5d21392aa59f1ed192758cd/globalmount
Output: mount.nfs: Protocol not supported for 10.43.76.13:/pvc-d11e720d-2d0d-48d0-8b11-82ddf4ca7a58 on /var/lib/kubelet/plugins/kubernetes.io/csi/driver.longhorn.io/12de6c62f175ce990f279cc34d4f579f36032e64e5d21392aa59f1ed192758cd/globalmount

To Reproduce

I don’t know how to reproduce this error. It happens for one of my volumes, and I haven’t found a way to resolve it.

The configuration I have is the following:

apiVersion: v1
kind: PersistentVolume
metadata:
  name: pvc-d11e720d-2d0d-48d0-8b11-82ddf4ca7a58
  uid: 080fa053-2048-41fb-9ed7-c4f9e9214cc1
  resourceVersion: '318384351'
  creationTimestamp: '2023-10-03T19:53:29Z'
  annotations:
    longhorn.io/volume-scheduling-error: ''
    pv.kubernetes.io/provisioned-by: driver.longhorn.io
    volume.kubernetes.io/provisioner-deletion-secret-name: ''
    volume.kubernetes.io/provisioner-deletion-secret-namespace: ''
  selfLink: /api/v1/persistentvolumes/pvc-d11e720d-2d0d-48d0-8b11-82ddf4ca7a58
status:
  phase: Bound
spec:
  capacity:
    storage: 10Gi
  csi:
    driver: driver.longhorn.io
    volumeHandle: pvc-d11e720d-2d0d-48d0-8b11-82ddf4ca7a58
    fsType: ext4
    volumeAttributes:
      dataLocality: disabled
      fromBackup: ''
      fsType: ext4
      numberOfReplicas: '2'
      share: 'true'
      staleReplicaTimeout: '30'
      storage.kubernetes.io/csiProvisionerIdentity: 1696306401569-8081-driver.longhorn.io
  accessModes:
    - ReadWriteOnce
    - ReadWriteMany
  claimRef:
    kind: PersistentVolumeClaim
    namespace: attic
    name: attic-db
    uid: d11e720d-2d0d-48d0-8b11-82ddf4ca7a58
    apiVersion: v1
    resourceVersion: '318373363'
  persistentVolumeReclaimPolicy: Delete
  storageClassName: longhorn
  volumeMode: Filesystem
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: attic-db
  namespace: attic
  uid: d11e720d-2d0d-48d0-8b11-82ddf4ca7a58
  resourceVersion: '318373431'
  creationTimestamp: '2023-10-03T19:53:27Z'
  labels:
    kustomize.toolkit.fluxcd.io/name: 99-attic
    kustomize.toolkit.fluxcd.io/namespace: flux-system
  annotations:
    pv.kubernetes.io/bind-completed: 'yes'
    pv.kubernetes.io/bound-by-controller: 'yes'
    volume.beta.kubernetes.io/storage-provisioner: driver.longhorn.io
    volume.kubernetes.io/storage-provisioner: driver.longhorn.io
  selfLink: /api/v1/namespaces/attic/persistentvolumeclaims/attic-db
status:
  phase: Bound
  accessModes:
    - ReadWriteOnce
    - ReadWriteMany
  capacity:
    storage: 10Gi
spec:
  accessModes:
    - ReadWriteOnce
    - ReadWriteMany
  resources:
    requests:
      storage: 10Gi
  volumeName: pvc-d11e720d-2d0d-48d0-8b11-82ddf4ca7a58
  storageClassName: longhorn
  volumeMode: Filesystem

This is then used in a pod (owned by a Deployment) with the following config:

      volumes:
        - name: db
          persistentVolumeClaim:
            claimName: attic-db
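
Since the PV above has share: 'true' and ReadWriteMany among its access modes, Longhorn serves it over NFS via a share-manager pod, and the 10.43.76.13 address in the mount error is the ClusterIP of that share manager's service. A minimal sketch for inspecting its state, assuming the default longhorn-system install namespace:

# Look for the share-manager pod and service backing the RWX volume
kubectl -n longhorn-system get pods,svc | grep share-manager

# Inspect Longhorn's own view of the volume
kubectl -n longhorn-system get volumes.longhorn.io pvc-d11e720d-2d0d-48d0-8b11-82ddf4ca7a58 -o yaml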

Expected behavior

The volume should get mounted normally and the pod should start.

Support bundle for troubleshooting

supportbundle_fb30f00f-8c5c-46cf-9f98-f097746ddc7e_2023-10-13T21-09-15Z.zip

Environment

  • Longhorn version: v1.5.1
  • Installation method (e.g. Rancher Catalog App/Helm/Kubectl): helm
  • Kubernetes distro (e.g. RKE/K3s/EKS/OpenShift) and version: K3s v1.28.2+k3s1
    • Number of management nodes in the cluster: 3
    • Number of worker nodes in the cluster: 3
  • Node config
    • OS type and version: openSUSE MicroOS
    • Kernel version: 6.5.6-1-default
    • CPU per node: 3
    • Memory per node: 8G
    • Disk type (e.g. SSD/NVMe/HDD): SSD
    • Network bandwidth between the nodes: All nodes are VMs on the same physical machine, so I don’t think there’s much of a limit.
  • Underlying Infrastructure (e.g. on AWS/GCE, EKS/GKE, VMWare/KVM, Baremetal): Baremetal
  • Number of Longhorn volumes in the cluster: 6
  • Impacted Longhorn resources: 1
    • Volume names: pvc-d11e720d-2d0d-48d0-8b11-82ddf4ca7a58

About this issue

  • State: closed
  • Created 9 months ago
  • Comments: 37 (10 by maintainers)

Most upvoted comments

I am experiencing the same issue after updating my Ubuntu 20.04 workers to kernel 5.15.0-94-generic which was just released from Canonical. Maybe a faulty backport?

We have exactly the same problem with the same kernel version.

@lcapka @baskinsy

Thanks for your information.

I’ve reviewed your symptom description and the kernel 5.15.0-94 code. It appears the issue is related to an NFS commit, not the bug reported at https://bugs.launchpad.net/bugs/2052842.

The fix is applied in 5.15.0-100.110, but it does not seem to be released yet.

@derekbit

Hi!

The issue is connected with RWX volumes, which use NFS in Longhorn. Our cluster has a few nodes, all based on Ubuntu 22.04, and we are currently running RKE2 v1.27.10+rke2r1. Everything worked until kernel 5.15.0-92; since 5.15.0-94 it has stopped. Longhorn has (unfortunately) been installed via RKE2’s apps, so it is on the latest version available there, which is 102.3.1+up1.5.3. NFS is installed via the apt package nfs-common at the latest version Ubuntu offers, 2.6.1.

The problem can be simulated even outside the cluster itself when a pod is in a back-off loop because of the mount error. In our case, we manually ran the mount command (found in the failing pod’s kubectl describe output) on the Ubuntu node directly.

Events:
  Type     Reason       Age                   From     Message
  ----     ------       ----                  ----     -------
  Warning  FailedMount  8m5s (x752 over 25h)  kubelet  MountVolume.MountDevice failed for volume "pvc-f7f7aa18-999e-44aa-8391-00df9e099168" : rpc error: code = Internal desc = mount failed: exit status 32
Mounting command: /usr/local/sbin/nsmounter
Mounting arguments: mount -t nfs -o vers=4.1,noresvport,timeo=600,retrans=5,softerr 10.43.50.95:/pvc-f7f7aa18-999e-44aa-8391-00df9e099168 /var/lib/kubelet/plugins/kubernetes.io/csi/driver.longhorn.io/b5aa5195e7e7969102a79de3e5f76163fa63e95687985a08e93f840019b74f48/globalmount
Output: mount.nfs: Protocol not supported
  Warning  FailedMount  3m25s (x671 over 25h)  kubelet  Unable to attach or mount volumes: unmounted volumes=[ghost-data], unattached volumes=[], failed to process volumes=[]: timed out waiting for the condition

Example from the node machine:

# mount -v -t nfs -o vers=4.1,noresvport,timeo=600,retrans=5,softerr 10.43.50.95:/pvc-f7f7aa18-999e-44aa-8391-00df9e099168 /mnt/x
mount.nfs: timeout set for Thu Feb 15 11:17:14 2024
mount.nfs: trying text-based options 'vers=4.1,noresvport,timeo=600,retrans=5,softerr,addr=10.43.50.95,clientaddr=10.77.1.9'
mount.nfs: mount(2): Protocol not supported
mount.nfs: Protocol not supported

We even captured and inspected the TCP frames using tcpdump, but honestly, no errors are visible there, and we don’t have NFS protocol experts on our team. Anyway, you can find the tcpdump output attached.

mount-nfs-longhorn.dump.zip

I’m not sure whether it helps or not but here it is. Let me know what information you might need.

@derekbit let’s not reopen an existing resolved issue in a released milestone. Create another issue instead. We can just use #6857 (comment).

Let’s track the issue in https://github.com/longhorn/longhorn/issues/7931.

Nope, still the same with the -97 kernel; it seems we have to wait for -100.

@james-munson

Can you help add the problematic version to the KB doc and update the check environment script, as @innobead mentioned in https://github.com/longhorn/longhorn/issues/6857#issuecomment-1945507658? Thank you.

Downgrading to 5.15.0-92 fixed it, but this seems to be related to a kernel change…

Longhorn 1.6.0 also fixed the issue for me.

Hi @baskinsy, this issue is connected to NFS and the kernel, and NFS-backed volumes are used only for the ReadWriteMany (RWX) access mode. Volumes with access mode ReadWriteOnce (RWO) are unaffected.
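
As a quick way to see which claims in a cluster fall into the affected RWX category, the access modes can be listed with kubectl (a generic sketch):

# List every PVC whose access modes include ReadWriteMany (RWX)
kubectl get pvc --all-namespaces \
  -o custom-columns='NAMESPACE:.metadata.namespace,NAME:.metadata.name,MODES:.spec.accessModes[*],VOLUME:.spec.volumeName' \
  | grep -E 'ReadWriteMany|NAMESPACE'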

And I can confirm that as well with our production Longhorn deployment.

Could be related to this:

https://bugs.launchpad.net/bugs/2052842

I noticed that no block devices were created under /dev/longhorn after the Longhorn volume was mounted.

@derekbit IMHO this needs to be reopened, as it might affect users on other distros that ship this kernel version.

Same here. Volumes are successfully mounting again. Feel free to close this issue.

Thanks for this info. I updated all of my hosts from kernel 6.5.7 to 6.5.9 and this has resolved my issue with RWX volumes.

@derekbit Awesome! Thanks for letting us know!

@Alxandr @whezzel @a13xie The issue has been identified as a Linux kernel bug in kernel 6.5.6; please refer to https://github.com/longhorn/longhorn/issues/6857#issuecomment-1765376826.

It is fixed in Linux kernel 6.5.7 and newer. Until an OS distro release ships a newer kernel, you can downgrade the kernel to an older one.
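
For Ubuntu nodes hit by the 5.15.0-94 variant discussed above, a downgrade could look roughly like the sketch below; the package names are illustrative for Ubuntu’s generic kernels, and other distros (such as the openSUSE MicroOS nodes in this report) have their own mechanisms:

# Check whether the running kernel is one of the known-bad versions (6.5.6, 5.15.0-94)
uname -r

# Install a known-good kernel, e.g. 5.15.0-92, and keep apt from replacing it
sudo apt install linux-image-5.15.0-92-generic linux-modules-5.15.0-92-generic
sudo apt-mark hold linux-image-5.15.0-92-generic

# Reboot and select the older kernel from the GRUB menu if it is not the default
sudo reboot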