longhorn: [BUG] RWX volume is hang on Photon OS

Describe the bug

When there are multiple pods on the same node access a RWX volume and doing IOs. The mount point of the RWX volume inside the pod is frozen and pod cannot read/write

To Reproduce

Install Photon OS real time flavor https://packages.vmware.com/photon/5.0/GA/iso/photon-rt-5.0-dde71ec57.x86_64.iso
Install K3s and Longhorn 1.6.0 on it

Deploy this workload:

# Long-running test with prometheus metric exporter to monitor the data
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: pvc-1
spec:
  volumeMode: Filesystem
  storageClassName: longhorn  # replace with your storage class
  accessModes:
    - ReadWriteMany
  resources:
    requests:
      storage: 2Gi
---
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: test-workload
  namespace: default
spec:
  serviceName: test-workload
  replicas: 1
  selector:
    matchLabels:
      app: test-workload
  podManagementPolicy: Parallel
  template:
    metadata:
      labels:
        app: test-workload
    spec:
      nodeName: photon-02 # Replace this one by one of the node in your cluster to force all pods of this workload on that node
      containers:
        - name: kbench-1
          image: phanle1010/kbench:broadcom
          imagePullPolicy: Always
          env:
            - name: MODE
              value: "random-write-iops"
            - name: OUTPUT
              value: /test-result/pvc1
            - name: FILE_NAME
              value: "/pvc1/test.img"
            - name: SIZE
              value: "1638M" # 80% of the pvc-1's size
            - name: CPU_IDLE_PROF
              value: "disabled"
            - name: SKIP_PARSE
              value: "true"
            - name: LONG_RUN
              value: "true"
            - name: RATE_IOPS
              value: "1167"
          volumeMounts:
            - name: pvc1
              mountPath: /pvc1/
            - name: shared-data
              mountPath: /test-result
        - name: metric-exporter
          image: phanle1010/kbench:broadcom
          imagePullPolicy: Always
          command:
            - metric-exporter
            - -d
            - start
          env:
            - name: DATA_DIR
              value: /test-result
            - name: VOLUME_ACCESS_MODE
              value: rwx
            - name: TEST_MODE
              value: write-only
            - name: RATE_LIMIT_TYPE
              value: rate-limit
          ports:
            - containerPort: 8080
              name: metrics
          volumeMounts:
            - name: shared-data
              mountPath: /test-result
      volumes:
        - name: shared-data
          emptyDir: {}
        - name: pvc1
          persistentVolumeClaim:
            claimName: pvc-1

Scale up the stateful set to 6
After 2-6 mins, exec into the pods and check that the mount point of RWX is frozen. I.e., running ls -l /pvc1 is hang

Expected behavior

Should not hang

Environment

Longhorn version: Longhorn v1.6.0
Impacted volume (PV):
Installation method (e.g. Rancher Catalog App/Helm/Kubectl): Helm
Kubernetes distro (e.g. RKE/K3s/EKS/OpenShift) and version: v1.28.7+k3s1
- Number of control plane nodes in the cluster: 1
- Number of worker nodes in the cluster: 4
Node config
- OS type and version: Photon RT OS or Photon minimal at https://github.com/vmware/photon/wiki/Downloading-Photon-OS#downloading-photon-os-50-ga
- Kernel version: 6.1.10-10.ph5-rt OR 6.1.10-10.ph5
- CPU per node: 64
- Memory per node: 503GB
- Disk type (e.g. SSD/NVMe/HDD): NVMe
- Network bandwidth between the nodes (Gbps): 20Gbps
Underlying Infrastructure (e.g. on AWS/GCE, EKS/GKE, VMWare/KVM, Baremetal): Baremetal
Number of Longhorn volumes in the cluster: 1

Additional context

When I added an Ubuntu node to the cluster and deploy the sts on to ubuntu node, RWX volume is not frozen

About this issue

Original URL
State: open
Created 3 months ago
Reactions: 1
Comments: 34 (33 by maintainers)

Most upvoted comments

Update:

The hanging issue actually also happens in Ubuntu. It just takes a lot longer time to stuck. In my case, it is stuck after letting the test run for a night.
After applying the (pending, not merged yet) upstream patches, Photon RT/ Photon non-RT/Ubuntu no longer have the frozen issue

PhanLe1010 on Mar 31, 2024

Update: The culprit might be Longhorn share manager

On Photon non-RT OS with latest kernel 6.1.81-2.ph5, we performed the following tests:

Case 1: Test with Kubernetes external NFS provisioner + hostPath (without any Longhorn component) ✅

This is the yaml to deploy this setup

<be>

apiVersion: v1
kind: ServiceAccount
metadata:
  name: nfs-provisioner
---
kind: Service
apiVersion: v1
metadata:
  name: nfs-provisioner
  labels:
    app: nfs-provisioner
spec:
  ports:
    - name: nfs
      port: 2049
    - name: nfs-udp
      port: 2049
      protocol: UDP
    - name: nlockmgr
      port: 32803
    - name: nlockmgr-udp
      port: 32803
      protocol: UDP
    - name: mountd
      port: 20048
    - name: mountd-udp
      port: 20048
      protocol: UDP
    - name: rquotad
      port: 875
    - name: rquotad-udp
      port: 875
      protocol: UDP
    - name: rpcbind
      port: 111
    - name: rpcbind-udp
      port: 111
      protocol: UDP
    - name: statd
      port: 662
    - name: statd-udp
      port: 662
      protocol: UDP
  selector:
    app: nfs-provisioner
---
kind: Deployment
apiVersion: apps/v1
metadata:
  name: nfs-provisioner
spec:
  selector:
    matchLabels:
      app: nfs-provisioner
  replicas: 1
  strategy:
    type: Recreate 
  template:
    metadata:
      labels:
        app: nfs-provisioner
    spec:
      nodeName: photon-02 # Replace this one by one of the node in your cluster to force all pods of this workload on that node
      serviceAccount: nfs-provisioner
      containers:
        - name: nfs-provisioner
          image: registry.k8s.io/sig-storage/nfs-provisioner:v4.0.8
          ports:
            - name: nfs
              containerPort: 2049
            - name: nfs-udp
              containerPort: 2049
              protocol: UDP
            - name: nlockmgr
              containerPort: 32803
            - name: nlockmgr-udp
              containerPort: 32803
              protocol: UDP
            - name: mountd
              containerPort: 20048
            - name: mountd-udp
              containerPort: 20048
              protocol: UDP
            - name: rquotad
              containerPort: 875
            - name: rquotad-udp
              containerPort: 875
              protocol: UDP
            - name: rpcbind
              containerPort: 111
            - name: rpcbind-udp
              containerPort: 111
              protocol: UDP
            - name: statd
              containerPort: 662
            - name: statd-udp
              containerPort: 662
              protocol: UDP
          securityContext:
            capabilities:
              add:
                - DAC_READ_SEARCH
                - SYS_RESOURCE
          args:
            - "-provisioner=example.com/nfs"
          env:
            - name: POD_IP
              valueFrom:
                fieldRef:
                  fieldPath: status.podIP
            - name: SERVICE_NAME
              value: nfs-provisioner
            - name: POD_NAMESPACE
              valueFrom:
                fieldRef:
                  fieldPath: metadata.namespace
          imagePullPolicy: "IfNotPresent"
          volumeMounts:
            - name: export-volume
              mountPath: /export
      volumes:
        - name: export-volume
          hostPath:
            path: /tmp/nfs-provisioner
---
kind: ClusterRole
apiVersion: rbac.authorization.k8s.io/v1
metadata:
  name: nfs-provisioner-runner
rules:
  - apiGroups: [""]
    resources: ["persistentvolumes"]
    verbs: ["get", "list", "watch", "create", "delete"]
  - apiGroups: [""]
    resources: ["persistentvolumeclaims"]
    verbs: ["get", "list", "watch", "update"]
  - apiGroups: ["storage.k8s.io"]
    resources: ["storageclasses"]
    verbs: ["get", "list", "watch"]
  - apiGroups: [""]
    resources: ["events"]
    verbs: ["create", "update", "patch"]
  - apiGroups: [""]
    resources: ["services", "endpoints"]
    verbs: ["get"]
  - apiGroups: ["extensions"]
    resources: ["podsecuritypolicies"]
    resourceNames: ["nfs-provisioner"]
    verbs: ["use"]
---
kind: ClusterRoleBinding
apiVersion: rbac.authorization.k8s.io/v1
metadata:
  name: run-nfs-provisioner
subjects:
  - kind: ServiceAccount
    name: nfs-provisioner
     # replace with namespace where provisioner is deployed
    namespace: default
roleRef:
  kind: ClusterRole
  name: nfs-provisioner-runner
  apiGroup: rbac.authorization.k8s.io
---
kind: Role
apiVersion: rbac.authorization.k8s.io/v1
metadata:
  name: leader-locking-nfs-provisioner
rules:
  - apiGroups: [""]
    resources: ["endpoints"]
    verbs: ["get", "list", "watch", "create", "update", "patch"]
---
kind: RoleBinding
apiVersion: rbac.authorization.k8s.io/v1
metadata:
  name: leader-locking-nfs-provisioner
subjects:
  - kind: ServiceAccount
    name: nfs-provisioner
    # replace with namespace where provisioner is deployed
    namespace: default
roleRef:
  kind: Role
  name: leader-locking-nfs-provisioner
  apiGroup: rbac.authorization.k8s.io
---
kind: StorageClass
apiVersion: storage.k8s.io/v1
metadata:
  name: example-nfs
provisioner: example.com/nfs
mountOptions:
  - vers=4.1
---
kind: PersistentVolumeClaim
apiVersion: v1
metadata:
  name: nfs-pvc
  annotations:
    volume.beta.kubernetes.io/storage-class: "example-nfs"
spec:
  accessModes:
    - ReadWriteMany
  resources:
    requests:
      storage: 8Gi
---
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: test-workload
  namespace: default
spec:
  serviceName: test-workload
  replicas: 1
  selector:
    matchLabels:
      app: test-workload
  podManagementPolicy: Parallel
  template:
    metadata:
      labels:
        app: test-workload
    spec:
      nodeName: photon-02 # Replace this one by one of the node in your cluster to force all pods of this workload on that node
      containers:
        - name: kbench-1
          image: phanle1010/kbench:broadcom
          imagePullPolicy: IfNotPresent
          env:
            - name: MODE
              value: "random-write-iops"
            - name: OUTPUT
              value: /tmp/pvc1
            - name: FILE_NAME
              value: "/pvc1/test.img"
            - name: SIZE
              value: "1638M" # 80% of the pvc-1's size
            - name: CPU_IDLE_PROF
              value: "disabled"
            - name: SKIP_PARSE
              value: "true"
            - name: LONG_RUN
              value: "true"
            - name: RATE_IOPS
              value: "1000"
          volumeMounts:
            - name: pvc1
              mountPath: /pvc1/
      volumes:
        - name: pvc1
          persistentVolumeClaim:
            claimName: nfs-pvc

Result: NFS mounts are NOT frozen

Case 2: Test with Kubernetes external NFS provisioner + Longhorn RWO PVC ✅

This is the yaml to deploy this setup

<be>

apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: nfs-provisioner-storage # longhorn backing pvc
spec:
  storageClassName: longhorn
  accessModes:
    - ReadWriteOnce
  resources:
    requests:
      storage: "10Gi" # make this 10% bigger then the workload pvc
---
apiVersion: v1
kind: ServiceAccount
metadata:
  name: nfs-provisioner
---
kind: Service
apiVersion: v1
metadata:
  name: nfs-provisioner
  labels:
    app: nfs-provisioner
spec:
  ports:
    - name: nfs
      port: 2049
    - name: nfs-udp
      port: 2049
      protocol: UDP
    - name: nlockmgr
      port: 32803
    - name: nlockmgr-udp
      port: 32803
      protocol: UDP
    - name: mountd
      port: 20048
    - name: mountd-udp
      port: 20048
      protocol: UDP
    - name: rquotad
      port: 875
    - name: rquotad-udp
      port: 875
      protocol: UDP
    - name: rpcbind
      port: 111
    - name: rpcbind-udp
      port: 111
      protocol: UDP
    - name: statd
      port: 662
    - name: statd-udp
      port: 662
      protocol: UDP
  selector:
    app: nfs-provisioner
---
kind: Deployment
apiVersion: apps/v1
metadata:
  name: nfs-provisioner
spec:
  selector:
    matchLabels:
      app: nfs-provisioner
  replicas: 1
  strategy:
    type: Recreate 
  template:
    metadata:
      labels:
        app: nfs-provisioner
    spec:
      nodeName: photon-02 # Replace this one by one of the node in your cluster to force all pods of this workload on that node
      serviceAccount: nfs-provisioner
      containers:
        - name: nfs-provisioner
          image: registry.k8s.io/sig-storage/nfs-provisioner:v4.0.8
          ports:
            - name: nfs
              containerPort: 2049
            - name: nfs-udp
              containerPort: 2049
              protocol: UDP
            - name: nlockmgr
              containerPort: 32803
            - name: nlockmgr-udp
              containerPort: 32803
              protocol: UDP
            - name: mountd
              containerPort: 20048
            - name: mountd-udp
              containerPort: 20048
              protocol: UDP
            - name: rquotad
              containerPort: 875
            - name: rquotad-udp
              containerPort: 875
              protocol: UDP
            - name: rpcbind
              containerPort: 111
            - name: rpcbind-udp
              containerPort: 111
              protocol: UDP
            - name: statd
              containerPort: 662
            - name: statd-udp
              containerPort: 662
              protocol: UDP
          securityContext:
            capabilities:
              add:
                - DAC_READ_SEARCH
                - SYS_RESOURCE
          args:
            - "-provisioner=example.com/nfs"
          env:
            - name: POD_IP
              valueFrom:
                fieldRef:
                  fieldPath: status.podIP
            - name: SERVICE_NAME
              value: nfs-provisioner
            - name: POD_NAMESPACE
              valueFrom:
                fieldRef:
                  fieldPath: metadata.namespace
          imagePullPolicy: "IfNotPresent"
          volumeMounts:
            - name: export-volume
              mountPath: /export
      volumes:
        # - name: export-volume
        #   hostPath:
        #     path: /tmp/nfs-provisioner
        - name: export-volume
          persistentVolumeClaim:
            claimName: nfs-provisioner-storage
---
kind: ClusterRole
apiVersion: rbac.authorization.k8s.io/v1
metadata:
  name: nfs-provisioner-runner
rules:
  - apiGroups: [""]
    resources: ["persistentvolumes"]
    verbs: ["get", "list", "watch", "create", "delete"]
  - apiGroups: [""]
    resources: ["persistentvolumeclaims"]
    verbs: ["get", "list", "watch", "update"]
  - apiGroups: ["storage.k8s.io"]
    resources: ["storageclasses"]
    verbs: ["get", "list", "watch"]
  - apiGroups: [""]
    resources: ["events"]
    verbs: ["create", "update", "patch"]
  - apiGroups: [""]
    resources: ["services", "endpoints"]
    verbs: ["get"]
  - apiGroups: ["extensions"]
    resources: ["podsecuritypolicies"]
    resourceNames: ["nfs-provisioner"]
    verbs: ["use"]
---
kind: ClusterRoleBinding
apiVersion: rbac.authorization.k8s.io/v1
metadata:
  name: run-nfs-provisioner
subjects:
  - kind: ServiceAccount
    name: nfs-provisioner
     # replace with namespace where provisioner is deployed
    namespace: default
roleRef:
  kind: ClusterRole
  name: nfs-provisioner-runner
  apiGroup: rbac.authorization.k8s.io
---
kind: Role
apiVersion: rbac.authorization.k8s.io/v1
metadata:
  name: leader-locking-nfs-provisioner
rules:
  - apiGroups: [""]
    resources: ["endpoints"]
    verbs: ["get", "list", "watch", "create", "update", "patch"]
---
kind: RoleBinding
apiVersion: rbac.authorization.k8s.io/v1
metadata:
  name: leader-locking-nfs-provisioner
subjects:
  - kind: ServiceAccount
    name: nfs-provisioner
    # replace with namespace where provisioner is deployed
    namespace: default
roleRef:
  kind: Role
  name: leader-locking-nfs-provisioner
  apiGroup: rbac.authorization.k8s.io
---
kind: StorageClass
apiVersion: storage.k8s.io/v1
metadata:
  name: example-nfs
provisioner: example.com/nfs
mountOptions:
  - vers=4.1
---
kind: PersistentVolumeClaim
apiVersion: v1
metadata:
  name: nfs-pvc
  annotations:
    volume.beta.kubernetes.io/storage-class: "example-nfs"
spec:
  accessModes:
    - ReadWriteMany
  resources:
    requests:
      storage: 8Gi
---
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: test-workload
  namespace: default
spec:
  serviceName: test-workload
  replicas: 1
  selector:
    matchLabels:
      app: test-workload
  podManagementPolicy: Parallel
  template:
    metadata:
      labels:
        app: test-workload
    spec:
      nodeName: photon-02 # Replace this one by one of the node in your cluster to force all pods of this workload on that node
      containers:
        - name: kbench-1
          image: phanle1010/kbench:broadcom
          imagePullPolicy: IfNotPresent
          env:
            - name: MODE
              value: "random-write-iops"
            - name: OUTPUT
              value: /tmp/pvc1
            - name: FILE_NAME
              value: "/pvc1/test.img"
            - name: SIZE
              value: "1638M" # 80% of the pvc-1's size
            - name: CPU_IDLE_PROF
              value: "disabled"
            - name: SKIP_PARSE
              value: "true"
            - name: LONG_RUN
              value: "true"
            - name: RATE_IOPS
              value: "1000"
          volumeMounts:
            - name: pvc1
              mountPath: /pvc1/
      volumes:
        - name: pvc1
          persistentVolumeClaim:
            claimName: nfs-pvc

Result: NFS mounts are NOT frozen

Case 3: Test with Longhorn RWX PVC ❌

Result: NFS mounts are frozen

Analysis:

From the test results, we can see that the difference between frozen case (case 3) and non frozen cases (case 1 and case 2) is at Longhorn CSI/Longhorn share manager VS Kubernetes external NFS provisioner + nfs-ganesha.
Since the workload are mounting the volume OK at the beginning, the Longhorn CSI plugin might not be the culprit. So Is it possible that something is wrong in our Longhorn share-manager?
The main component of Longhorn share-manager (frozen case) and Kubernetes external NFS provisioner (non-frozen cases) is the nfs-ganesha. I see that Kubernetes external NFS provisioner is using nfs-ganesha: V4.0.8 while Longhorn share manager is using nfs-ganesha: V5.7:
```
# Kubernetes external NFS provisioner 
bash-5.2# ./usr/local/bin/ganesha.nfsd -v
NFS-Ganesha Release = V4.0.8

# Longhorn share manager
share-manager-pvc-0bcdb279-04f7-4a8c-a83b-d301c0e37237:/ # ./usr/local/bin/ganesha.nfsd -v
NFS-Ganesha Release = V5.7
```
In order to confirm whether nfs-ganesha version difference is the culprit, I revert the share-manager to use nfs-ganesha: V4.2.3:
```
share-manager-pvc-adf07290-9d3b-4683-8637-da6cedf8fa6b:/ # ./usr/local/bin/ganesha.nfsd -v
NFS-Ganesha Release = V4.2.3
```
Result: NFS mounts are NOT frozen
I am doing more investigating but I suspect that the root cause might be related to:
- https://github.com/nfs-ganesha/nfs-ganesha/issues/1102
- https://github.com/nfs-ganesha/nfs-ganesha/issues/1063

cc @innobead @shuo-wu @derekbit

PhanLe1010 on Mar 30, 2024

Let’s do this if reasonable:

backport the patch to our downstream v5 for our existing Longhorn releases like 1.6, 1.5 (update the patch release of v5 including the fix if upstream has a newer release)
upgrade to v6 in Longhorn 1.7.0

innobead on Mar 30, 2024

The latest version of nfs-ganesha is v5.7 hitting the fronzen issue. The fix will be included in v6.0.

Downgranding our nfs-ganesha temporarily is on choice. IIRC, v4.x we used before works well after fixing the daemon crash issue. cc @james-munson

Or, backporting the fix in the thread to v5.7 by ourselves is feasible as well.

derekbit on Mar 30, 2024

Update:

Sorry! After repeatedly stress testing, the Photon non-RT OS with latest kernel 6.1.81-2.ph5 still suffers the NFS mounts frozen 😔

cc @innobead @derekbit @shuo-wu

PhanLe1010 on Mar 29, 2024

Do we know about the nfs kernel module? But doubt we are able to check it.

Is it this one?

root@photon-03 [ ~ ]# modinfo nfs
filename:       /lib/modules/6.1.10-10.ph5-rt/kernel/fs/nfs/nfs.ko.xz
license:        GPL
author:         Olaf Kirch <okir@monad.swb.de>
alias:          nfs4
alias:          fs-nfs4
alias:          fs-nfs
depends:        sunrpc,fscache,lockd
retpoline:      Y
intree:         Y
name:           nfs
vermagic:       6.1.10-10.ph5-rt SMP preempt_rt mod_unload modversions 
sig_id:         PKCS#7
signer:         Build time autogenerated kernel key
sig_key:        7D:F5:20:BC:BD:79:8F:D8:46:EF:B2:DB:45:C4:2F:FC:55:E1:22:EE
sig_hashalgo:   sha512
=========
ubuntu@ubuntu-01:~$ modinfo nfs
filename:       /lib/modules/5.15.0-101-generic/kernel/fs/nfs/nfs.ko
license:        GPL
author:         Olaf Kirch <okir@monad.swb.de>
alias:          nfs4
alias:          fs-nfs4
alias:          fs-nfs
srcversion:     E46B121D3C0A67D41B92D03
depends:        fscache,sunrpc,lockd
retpoline:      Y
intree:         Y
name:           nfs
vermagic:       5.15.0-101-generic SMP mod_unload modversions 
sig_id:         PKCS#7
signer:         Build time autogenerated kernel key
sig_key:        7D:31:86:13:F2:76:80:28:1A:E7:90:1A:C6:12:9B:85:D8:E4:8E:E9
sig_hashalgo:   sha512

Looks like the one on Photon RT OS is newer

PhanLe1010 on Mar 27, 2024

Test Plan

These simplified reproduction steps should trigger the issue quickly on both Ubuntu and Photon OS

Install Longhorn v1.6.1/v1.6.0
Deploy this simple test pod with Longhorn RWX PVC

apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: test-pvc
spec:
  volumeMode: Filesystem
  storageClassName: longhorn  # replace with your storage class
  accessModes:
    - ReadWriteMany
  resources:
    requests:
      storage: 5Gi
---
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: test-workload
  namespace: default
spec:
  serviceName: test-workload
  replicas: 1
  selector:
    matchLabels:
      app: test-workload
  podManagementPolicy: Parallel
  template:
    metadata:
      labels:
        app: test-workload
    spec:
      containers:
        - name: test
          image: ubuntu:xenial
          imagePullPolicy: IfNotPresent
          command: ["/bin/bash", "-c", "sleep 360000000000"]
          volumeMounts:
            - name: test-pvc
              mountPath: /test-pvc
      volumes:
        - name: test-pvc
          persistentVolumeClaim:
            claimName: test-pvc

Exec into the pod this fio command apt update && apt install fio -y && echo "writing 3G file size ..." && dd if=/dev/urandom of=/test-pvc/test.img bs=1M count=3072 && sleep 1 && fio --name=random_write --rw=randwrite --direct=1 --size=3G --numjobs=50 --filename=/test-pvc/test.img
Wait for 30s if it doesn’t stuck run cancel it and run this command again fio --name=random_write --rw=randwrite --direct=1 --size=3G --numjobs=50 --filename=/test-pvc/test.img. Now it should stuck. The mount point /test-pvc/ should be frozen and cannot read/write to it
Repeat the steps with Longhorn master, making sure that it is not stuck

PhanLe1010 on Apr 8, 2024

Remove backport 1.5.x, because v1.5.x is using nfs-ganesh v4.2.3 which is not impacted by the frozen issue. Ref: https://github.com/longhorn/longhorn/issues/8253#issuecomment-2027851926

derekbit on Apr 8, 2024

Thanks @derekbit ! Yeah, I was also looking at the ticket https://github.com/nfs-ganesha/nfs-ganesha/issues/1102 in the above analysis as well

I agree that we either need to pump the nfs-ganesha version or revert it to avoid the hanging issue

PhanLe1010 on Mar 30, 2024

Update:

❌ Disable 1 NIC

Since each machine has 2 NICs and I saw that some time NFS conntion is going to the 2nd NICs while I specify K3s to use the 1st NIC only. I disable the 2nd NIC on all machines --> NFS mounts are still stuck on Phonton RT node

❌ Try with Photon non RT OS

Add a Photon non RT node and test it --> NFS mounts are still stuck on Phonton non RT node

❌ Trying to upgrade kernel and NFS module for Photon node

Case 1: Upgrade kernel, nfs module from 6.1.10-11.ph5 to the latest version 6.1.81-2.ph5 Photon non RT node:

root@photon-04 [ ~ ]# uname -r
6.1.81-2.ph5
root@photon-04 [ ~ ]# cat /etc/os-release 
NAME="VMware Photon OS"
VERSION="5.0"
ID=photon
VERSION_ID=5.0
PRETTY_NAME="VMware Photon OS/Linux"
ANSI_COLOR="1;34"
HOME_URL="https://vmware.github.io/photon/"
BUG_REPORT_URL="https://github.com/vmware/photon/issues"
root@photon-04 [ ~ ]# modinfo nfs
filename:       /lib/modules/6.1.81-2.ph5/kernel/fs/nfs/nfs.ko.xz
license:        GPL
author:         Olaf Kirch <okir@monad.swb.de>
alias:          nfs4
alias:          fs-nfs4
alias:          fs-nfs
srcversion:     6D9CA0D100226F58BDAFF1B
depends:        sunrpc,fscache,lockd
retpoline:      Y
intree:         Y
name:           nfs
vermagic:       6.1.81-2.ph5 SMP preempt mod_unload modversions 
sig_id:         PKCS#7
signer:         Build time autogenerated kernel key
sig_key:        41:DE:42:C5:51:E8:4E:A1:D5:DC:75:E9:1B:67:5D:0A:56:05:DF:4C
sig_hashalgo:   sha512

Result: no hanging (run test for 1 hour) but update https://github.com/longhorn/longhorn/issues/8253#issuecomment-2026383213

Case 2: Upgrade kernel, nfs module from 6.1.10-10.ph5-rt to the latest version 6.1.81-2.ph5-rt Photon RT node: After upgrading, the Photon RT OS node always loose public IPs. This happen on all of our server labs. I am not sure if there is a bug in Photon RT OS new kernel. As the result, I cannot test the upgraded Photon RT OS. I will run the report on upgraded Photon non-RT OS instead. I will create a ticket to Photon repo for this one later

⌛ If all fails and we cannot identify the root cause, I will apply the test plan on the Ubuntu cluster and report the data

Testing …

PhanLe1010 on Mar 29, 2024

Simpler reproduce steps:

Deploy this simple workload

# Long-running test with prometheus metric exporter to monitor the data
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: pvc-1
spec:
  volumeMode: Filesystem
  storageClassName: longhorn  # replace with your storage class
  accessModes:
    - ReadWriteMany
  resources:
    requests:
      storage: 2Gi
---
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: test-workload
  namespace: default
spec:
  serviceName: test-workload
  replicas: 1
  selector:
    matchLabels:
      app: test-workload
  podManagementPolicy: Parallel
  template:
    metadata:
      labels:
        app: test-workload
    spec:
      nodeName: photon-03 # Replace this one by one of the node in your cluster to force all pods of this workload on that node
      containers:
        - name: kbench-1
          image: ubuntu:xenial
          imagePullPolicy: IfNotPresent
          command: ["/bin/bash", "-c", "while : ;do dd if=/dev/urandom of=/pvc1/test.img bs=4096 count=102400 conv=notrunc; sleep 1; done"]
          volumeMounts:
            - name: pvc1
              mountPath: /pvc1/
      volumes:
        - name: pvc1
          persistentVolumeClaim:
            claimName: pvc-1

Check that read/write to /pvc1 in the pod is ok
Scale it up to 6 pods
Wait for 5 mins. check that read/write to /pvc1 in the pods are frozen

Note: If we replace command: ["/bin/bash", "-c", "while : ;do dd if=/dev/urandom of=/pvc1/test.img bs=4096 count=102400 conv=notrunc; sleep 1; done"] by command: ["sleep", "36000000"], the /pvc1 in pods are not frozen after 5 mins. This led me to believe that the nfs mount is frozen under IO load cc @shuo-wu @derekbit @james-munson

PhanLe1010 on Mar 26, 2024

For https://github.com/longhorn/nfs-ganesha/pull/9, I think we need to be careful because we are cherry picking from upstream PRs but the upstream fixed PRs have not been merged:

They are actively reviewing and might change it so maybe better to wait for them to merge first before we add this to our stable releases. Hope that they will merge in the next few days though.

TODO:

Once upstream merges the PRs, I will check and cherry-pick them again (if needed) to our nfs-ganesha and update Longhorn share manager v1.5.x, v1.6.x, and master to point to new Longhorn nfs-ganesha

PhanLe1010 on Mar 31, 2024

@PhanLe1010 I’ve patched the fixes to nfs-ganesha v5.7 and created a share-image v1.6.0 with it. You can test it, derekbit/longhorn-share-manager:v1.6.0-fix-hang, to see if the frozen issue is gone.

derekbit on Mar 30, 2024

@PhanLe1010 After patching, I think we can continue using the RT-enabled distro for the following testing.

innobead on Mar 30, 2024

What’s NFS server used in NFS provisioned? Is it nfs ganesha? If yes, is the used version the same as us?

Hi @innobead! I see that Kubernetes external NFS provisioner is using nfs-ganesha: V4.0.8 while Longhorn share manager is using nfs-ganesha: V5.7:

# Kubernetes external NFS provisioner 
bash-5.2# ./usr/local/bin/ganesha.nfsd -v
NFS-Ganesha Release = V4.0.8

# Longhorn share manager
share-manager-pvc-0bcdb279-04f7-4a8c-a83b-d301c0e37237:/ # ./usr/local/bin/ganesha.nfsd -v
NFS-Ganesha Release = V5.7

Can we downgrade the version to see if the problem is gone?

Yes, after downgrading nfs-ganesha, the problem is gone. The analysis section above has more details

Can we try to update NFS provisioner using the same version as us?

Looks like the version 4.0.8 is their latest version https://github.com/kubernetes-sigs/nfs-ganesha-server-and-external-provisioner/blob/master/deploy/docker/Dockerfile . So I cannot do this

Is there any latest patch released at ganesha upstream?

I will check this

PhanLe1010 on Mar 30, 2024

What’s NFS server used in NFS provisioned? Is it nfs ganesha? If yes, is the used version the same as us?

Can we try to update NFS provisioner using the same version as us?
Is there any latest patch released at ganesha upstream?
Can we downgrade the version to see if the problem is gone?

innobead on Mar 30, 2024

As discussed, @PhanLe1010 will also check if nfs mount has problems w/o Longhorn.

innobead on Mar 29, 2024

The module version seem to be related to the kernel version. I am wondering if there will be a similar issue when Ubuntu kernel version is the same as that of Photon RT OS.

shuo-wu on Mar 27, 2024

Do we know about the nfs kernel module? But doubt we are able to check it.

innobead on Mar 27, 2024

Update:

I did some tests with different distributions of workload pods and share-manager pod. Result:

share-manager pod’s node	workload pods’ node	state
Photon RT OS	Photon RT OS	nfs mounts are frozen
Photon RT OS	Ubtunu 22.04	nfs mounts are frozen
Ubtunu 22.04	Photon RT OS	nfs mounts are frozen but it takes longer time than the above 2 cases
Ubtunu 22.04	Ubtunu 22.04	nfs mounts are NOT frozen

PhanLe1010 on Mar 26, 2024

A few things can also be checked:

if reduce the replica of the workload, will the stuck IO be resolved at runtime?
during IO stuck, any logs happening in the share manager pod? Are you able to manually mount it in another workload and also encounter the IO stuck issue?

innobead on Mar 26, 2024

Does this Only happen in this specific OS? @PhanLe1010

@innobead it happens in Photon OS RT but doesn’t happen on the Ubuntu 22.04 cluster. I am going to test photon OS non RT to see if it can happen

Is it possible due to nfs host package issue?

Good idea, I will check this

PhanLe1010 on Mar 26, 2024

How about launching a new workload pod with the same image then doing a manual mount in the CSI plugin for the pod?

shuo-wu on Mar 26, 2024

@shuo-wu @james-munson I followed your recommendation of manually mounting the nfs server on same node OR use Ubuntu:

Manually mounting the nfs server on same node (photon-03) still works even though the mount points inside pods are stuck
I install Ubuntu 22.04 on another baremetal node and join that node to this cluster. Then force all pods of the sts on the ubuntu node. Result: no hanging after about 20 mins

PhanLe1010 on Mar 26, 2024