longhorn: [BUG] MountVolume.MountDevice failed for volume Output: mount.nfs: Protocol not supported
Describe the bug (🐛 if you encounter this issue)
My pods (using a specific volume) are no longer starting (they used to), and I get an error stating the following:
MountVolume.MountDevice failed for volume "pvc-d11e720d-2d0d-48d0-8b11-82ddf4ca7a58" : rpc error: code = Internal desc = mount failed: exit status 32 Mounting command: /usr/local/sbin/nsmounter Mounting arguments: mount -t nfs -o vers=4.1,noresvport,intr,hard 10.43.76.13:/pvc-d11e720d-2d0d-48d0-8b11-82ddf4ca7a58 /var/lib/kubelet/plugins/kubernetes.io/csi/driver.longhorn.io/12de6c62f175ce990f279cc34d4f579f36032e64e5d21392aa59f1ed192758cd/globalmount Output: mount.nfs: Protocol not supported for 10.43.76.13:/pvc-d11e720d-2d0d-48d0-8b11-82ddf4ca7a58 on /var/lib/kubelet/plugins/kubernetes.io/csi/driver.longhorn.io/12de6c62f175ce990f279cc34d4f579f36032e64e5d21392aa59f1ed192758cd/globalmount
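For reference, this error typically shows up in the pod's events; a command along these lines surfaces it (the pod name is a placeholder for whichever pod mounts the claim):
# Show the mount failure from the pod's events.
kubectl -n attic describe pod <pod-using-attic-db> | grep -A 5 MountVolume
# The same failure is also recorded as FailedMount events in the namespace.
kubectl -n attic get events --field-selector reason=FailedMount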
To Reproduce
I don't know how to reproduce this error. It happens for one of my volumes, and I don't know how to resolve it.
The configuration I have is the following:
apiVersion: v1
kind: PersistentVolume
metadata:
  name: pvc-d11e720d-2d0d-48d0-8b11-82ddf4ca7a58
  uid: 080fa053-2048-41fb-9ed7-c4f9e9214cc1
  resourceVersion: '318384351'
  creationTimestamp: '2023-10-03T19:53:29Z'
  annotations:
    longhorn.io/volume-scheduling-error: ''
    pv.kubernetes.io/provisioned-by: driver.longhorn.io
    volume.kubernetes.io/provisioner-deletion-secret-name: ''
    volume.kubernetes.io/provisioner-deletion-secret-namespace: ''
  selfLink: /api/v1/persistentvolumes/pvc-d11e720d-2d0d-48d0-8b11-82ddf4ca7a58
status:
  phase: Bound
spec:
  capacity:
    storage: 10Gi
  csi:
    driver: driver.longhorn.io
    volumeHandle: pvc-d11e720d-2d0d-48d0-8b11-82ddf4ca7a58
    fsType: ext4
    volumeAttributes:
      dataLocality: disabled
      fromBackup: ''
      fsType: ext4
      numberOfReplicas: '2'
      share: 'true'
      staleReplicaTimeout: '30'
      storage.kubernetes.io/csiProvisionerIdentity: 1696306401569-8081-driver.longhorn.io
  accessModes:
    - ReadWriteOnce
    - ReadWriteMany
  claimRef:
    kind: PersistentVolumeClaim
    namespace: attic
    name: attic-db
    uid: d11e720d-2d0d-48d0-8b11-82ddf4ca7a58
    apiVersion: v1
    resourceVersion: '318373363'
  persistentVolumeReclaimPolicy: Delete
  storageClassName: longhorn
  volumeMode: Filesystem
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: attic-db
  namespace: attic
  uid: d11e720d-2d0d-48d0-8b11-82ddf4ca7a58
  resourceVersion: '318373431'
  creationTimestamp: '2023-10-03T19:53:27Z'
  labels:
    kustomize.toolkit.fluxcd.io/name: 99-attic
    kustomize.toolkit.fluxcd.io/namespace: flux-system
  annotations:
    pv.kubernetes.io/bind-completed: 'yes'
    pv.kubernetes.io/bound-by-controller: 'yes'
    volume.beta.kubernetes.io/storage-provisioner: driver.longhorn.io
    volume.kubernetes.io/storage-provisioner: driver.longhorn.io
  selfLink: /api/v1/namespaces/attic/persistentvolumeclaims/attic-db
status:
  phase: Bound
  accessModes:
    - ReadWriteOnce
    - ReadWriteMany
  capacity:
    storage: 10Gi
spec:
  accessModes:
    - ReadWriteOnce
    - ReadWriteMany
  resources:
    requests:
      storage: 10Gi
  volumeName: pvc-d11e720d-2d0d-48d0-8b11-82ddf4ca7a58
  storageClassName: longhorn
  volumeMode: Filesystem
This is then used in a pod (owned by a deployment) with the following config:
volumes:
  - name: db
    persistentVolumeClaim:
      claimName: attic-db
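For reference, binding of the claim and the backing PV can be checked with standard kubectl commands, for example:
# Confirm the claim is Bound and which PV backs it.
kubectl -n attic get pvc attic-db
kubectl get pv pvc-d11e720d-2d0d-48d0-8b11-82ddf4ca7a58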
Expected behavior
The volume should get mounted normally and the pod should start.
Support bundle for troubleshooting
supportbundle_fb30f00f-8c5c-46cf-9f98-f097746ddc7e_2023-10-13T21-09-15Z.zip
Environment
- Longhorn version: v1.5.1
- Installation method (e.g. Rancher Catalog App/Helm/Kubectl): helm
- Kubernetes distro (e.g. RKE/K3s/EKS/OpenShift) and version: K3s v1.28.2+k3s1
- Number of management nodes in the cluster: 3
- Number of worker nodes in the cluster: 3
- Node config
- OS type and version: OpenSuse MicroOS
- Kernel version: 6.5.6-1-default
- CPU per node: 3
- Memory per node: 8G
- Disk type (e.g. SSD/NVMe/HDD): SSD
- Network bandwidth between the nodes: All nodes are VMs on the same physical machine, so I don't think there's much of a limit.
- Underlying Infrastructure (e.g. on AWS/GCE, EKS/GKE, VMWare/KVM, Baremetal): Baremetal
- Number of Longhorn volumes in the cluster: 6
- Impacted Longhorn resources: 1
- Volume names: pvc-d11e720d-2d0d-48d0-8b11-82ddf4ca7a58
Additional context
About this issue
- State: closed
- Created 9 months ago
- Comments: 37 (10 by maintainers)
We have exactly the same problem with the same Kernel version.
@lcapka @baskinsy
Thanks for your information.
I've reviewed your symptom description and the kernel 5.15.0-94 code. It appears the issue is related to an NFS commit, not the bug reported at https://bugs.launchpad.net/bugs/2052842.
The fix is applied in 5.15.0-100.110, but it does not seem to have been released yet.
@derekbit
Hi!
The issue is connected with RWX volumes that use NFS in Longhorn. Our cluster has a few nodes, all based on Ubuntu 22.04, and we are currently running RKE2 v1.27.10+rke2r1. Everything worked up to kernel 5.15.0-92; since 5.15.0-94 it stopped. Longhorn was (unfortunately) installed via RKE2's apps, so it is on the latest version available there, 102.3.1+up1.5.3. NFS is installed via the nfs-common apt package at the latest version Ubuntu offers, 2.6.1.
The problem can be reproduced even outside the cluster itself when a pod is in a back-off loop because of the mount error. In our case we manually ran the mount command (found in the failing pod's kubectl describe output) directly on the Ubuntu node.
Example from the node machine:
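A minimal sketch of that kind of manual test, reusing the mount arguments from the error at the top of this issue (the service IP and export path are specific to that volume, and the target directory here is just a throwaway mount point):
# Run directly on an affected node; values come from the failing volume's
# mount error and will differ per volume and cluster.
mkdir -p /tmp/nfs-test
mount -t nfs -o vers=4.1,noresvport,intr,hard \
  10.43.76.13:/pvc-d11e720d-2d0d-48d0-8b11-82ddf4ca7a58 /tmp/nfs-test
# On an affected kernel this fails with "mount.nfs: Protocol not supported".
umount /tmp/nfs-test 2>/dev/null || true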
We have even captured and looked into the TCP frames using tcpdump, but honestly no errors are visible there, and we don't have NFS protocol experts on our team. Anyway, you can find the tcpdump output attached.
mount-nfs-longhorn.dump.zip
I'm not sure whether it helps or not but here it is. Let me know what information you might need.
Letās track the issue in https://github.com/longhorn/longhorn/issues/7931.
Nope, still the same with the -97 kernel; it seems we have to wait for -100.
@james-munson
Can you help add the problematic version to the KB doc and update the check environment script, as @innobead mentioned in https://github.com/longhorn/longhorn/issues/6857#issuecomment-1945507658? Thank you.
Downgrading to 5.15.0-92 fixed it, but this seems to be related to a kernel change…
It's this one: nfs-client-2.6.3-39.5
See the related issue and discussion for more useful info:
https://github.com/kube-hetzner/terraform-hcloud-kube-hetzner/discussions/1018
https://github.com/kube-hetzner/terraform-hcloud-kube-hetzner/issues/1016
1.6.0 also fixed the issue for me
Hi @baskinsy, this issue is connected to NFS and the kernel, and NFS-backed volumes are only used for the ReadWriteMany (RWX) access mode. Volumes with the ReadWriteOnce (RWO) access mode are unaffected.
And I can confirm that as well with our production Longhorn deployment.
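For anyone checking whether their cluster has volumes that could be affected, a plain kubectl query like this lists PVCs with their access modes so the RWX ones stand out:
# RWX (ReadWriteMany) claims are the ones served over NFS by the
# share-manager; RWO claims use a block device and are not affected.
kubectl get pvc --all-namespaces \
  -o custom-columns=NAMESPACE:.metadata.namespace,NAME:.metadata.name,MODES:.spec.accessModes,STORAGECLASS:.spec.storageClassName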
Could be related to this:
https://bugs.launchpad.net/bugs/2052842
as I noticed that no block devices were created under /dev/longhorn after the Longhorn volume was mounted.
@derekbit IMHO this needs to be reopened, as it might affect other users on other distros that ship that kernel version.
I am experiencing the same issue after updating my Ubuntu 20.04 workers to kernel 5.15.0-94-generic, which was just released by Canonical. Maybe a faulty backport?
Same here. Volumes are successfully mounting again. Feel free to close this issue.
Thanks for this info. I updated all of my hosts from kernel 6.5.7 to 6.5.9 and this has resolved my issue with RWX volumes.
@derekbit Awesome! Thanks for letting us know!
@Alxandr @whezzel @a13xie The issue has been identified as a Linux kernel bug in kernel 6.5.6; please refer to https://github.com/longhorn/longhorn/issues/6857#issuecomment-1765376826.
It is fixed in Linux kernel 6.5.7 and newer. Until your OS distro releases a newer kernel, you can downgrade the kernel to an older one.
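As a rough sketch of that workaround (the uname check applies anywhere; the apt package names are Ubuntu-specific examples, and 5.15.0-92 is the known-good version mentioned in the comments above, so adjust for your distro and kernel series):
# Check the running kernel on each node.
uname -r
# Ubuntu example: install and reboot into a known-good kernel such as 5.15.0-92.
sudo apt-get install linux-image-5.15.0-92-generic linux-headers-5.15.0-92-generic
sudo reboot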