longhorn: [BUG] RWX volume hangs on Photon OS
Describe the bug
When multiple pods on the same node access an RWX volume and perform I/O, the mount point of the RWX volume inside the pods freezes and the pods cannot read or write.
To Reproduce
- Install Photon OS real time flavor https://packages.vmware.com/photon/5.0/GA/iso/photon-rt-5.0-dde71ec57.x86_64.iso
- Install K3s and Longhorn 1.6.0 on it
- Deploy this workload:
```yaml
# Long-running test with prometheus metric exporter to monitor the data
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: pvc-1
spec:
  volumeMode: Filesystem
  storageClassName: longhorn # replace with your storage class
  accessModes:
    - ReadWriteMany
  resources:
    requests:
      storage: 2Gi
---
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: test-workload
  namespace: default
spec:
  serviceName: test-workload
  replicas: 1
  selector:
    matchLabels:
      app: test-workload
  podManagementPolicy: Parallel
  template:
    metadata:
      labels:
        app: test-workload
    spec:
      nodeName: photon-02 # Replace this with one of the nodes in your cluster to force all pods of this workload onto that node
      containers:
        - name: kbench-1
          image: phanle1010/kbench:broadcom
          imagePullPolicy: Always
          env:
            - name: MODE
              value: "random-write-iops"
            - name: OUTPUT
              value: /test-result/pvc1
            - name: FILE_NAME
              value: "/pvc1/test.img"
            - name: SIZE
              value: "1638M" # 80% of the pvc-1's size
            - name: CPU_IDLE_PROF
              value: "disabled"
            - name: SKIP_PARSE
              value: "true"
            - name: LONG_RUN
              value: "true"
            - name: RATE_IOPS
              value: "1167"
          volumeMounts:
            - name: pvc1
              mountPath: /pvc1/
            - name: shared-data
              mountPath: /test-result
        - name: metric-exporter
          image: phanle1010/kbench:broadcom
          imagePullPolicy: Always
          command:
            - metric-exporter
            - -d
            - start
          env:
            - name: DATA_DIR
              value: /test-result
            - name: VOLUME_ACCESS_MODE
              value: rwx
            - name: TEST_MODE
              value: write-only
            - name: RATE_LIMIT_TYPE
              value: rate-limit
          ports:
            - containerPort: 8080
              name: metrics
          volumeMounts:
            - name: shared-data
              mountPath: /test-result
      volumes:
        - name: shared-data
          emptyDir: {}
        - name: pvc1
          persistentVolumeClaim:
            claimName: pvc-1
```
- Scale the StatefulSet up to 6 replicas
- After 2-6 mins, exec into the pods and check that the mount point of the RWX volume is frozen, i.e., running `ls -l /pvc1` hangs (see the sketch below).
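A minimal sketch for driving and checking this from outside the pods (it assumes the StatefulSet and container names from the manifest above, and that the kbench image ships coreutils `timeout`):

```bash
# Scale the workload to 6 pods pinned to the same node
kubectl scale statefulset test-workload --replicas=6

# After a few minutes, probe each pod's RWX mount point.
# A healthy mount returns immediately; a frozen one blocks until the
# timeout kills it (exit code 124).
for i in $(seq 0 5); do
  kubectl exec "test-workload-$i" -c kbench-1 -- timeout 10 ls -l /pvc1 \
    && echo "test-workload-$i: mount OK" \
    || echo "test-workload-$i: mount appears frozen"
done
```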
Expected behavior
Should not hang
Environment
- Longhorn version: Longhorn v1.6.0
- Impacted volume (PV):
- Installation method (e.g. Rancher Catalog App/Helm/Kubectl): Helm
- Kubernetes distro (e.g. RKE/K3s/EKS/OpenShift) and version: v1.28.7+k3s1
- Number of control plane nodes in the cluster: 1
- Number of worker nodes in the cluster: 4
- Node config
- OS type and version: Photon RT OS or Photon minimal at https://github.com/vmware/photon/wiki/Downloading-Photon-OS#downloading-photon-os-50-ga
- Kernel version: 6.1.10-10.ph5-rt OR 6.1.10-10.ph5
- CPU per node: 64
- Memory per node: 503GB
- Disk type (e.g. SSD/NVMe/HDD): NVMe
- Network bandwidth between the nodes (Gbps): 20Gbps
- Underlying Infrastructure (e.g. on AWS/GCE, EKS/GKE, VMWare/KVM, Baremetal): Baremetal
- Number of Longhorn volumes in the cluster: 1
Additional context
When I added an Ubuntu node to the cluster and deployed the StatefulSet onto the Ubuntu node, the RWX volume was not frozen.
Update: The culprit might be Longhorn share manager
On Photon non-RT OS with latest kernel 6.1.81-2.ph5, we performed the following tests:
Case 1: Test with Kubernetes external NFS provisioner + hostPath (without any Longhorn component) ✅
This is the yaml to deploy this setup
Result: NFS mounts are NOT frozen
Case 2: Test with Kubernetes external NFS provisioner + Longhorn RWO PVC ✅
This is the yaml to deploy this setup
Result: NFS mounts are NOT frozen
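The original YAMLs for cases 1 and 2 aren't reproduced here. As a rough sketch, the same provisioner can also be deployed with its upstream Helm chart; the repo URL, chart name, and `persistence.*` values below come from the kubernetes-sigs project and should be verified against the chart version you use:

```bash
helm repo add nfs-ganesha-server-and-external-provisioner \
  https://kubernetes-sigs.github.io/nfs-ganesha-server-and-external-provisioner/

# Case 1-style setup: provisioner backed by node-local storage (no Longhorn component)
helm install nfs-server-provisioner \
  nfs-ganesha-server-and-external-provisioner/nfs-server-provisioner

# Case 2-style setup: provisioner backed by a Longhorn RWO PVC
helm install nfs-server-provisioner-longhorn \
  nfs-ganesha-server-and-external-provisioner/nfs-server-provisioner \
  --set persistence.enabled=true \
  --set persistence.storageClass=longhorn \
  --set persistence.size=10Gi
```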
Case 3: Test with Longhorn RWX PVC ❌
Result: NFS mounts are frozen
Analysis:
I see that the Kubernetes external NFS provisioner is using nfs-ganesha V4.0.8 while the Longhorn share manager is using nfs-ganesha V5.7. To verify whether the nfs-ganesha version difference is the culprit, I reverted the share-manager to use nfs-ganesha V4.2.3. Result: NFS mounts are NOT frozen. cc @innobead @shuo-wu @derekbit
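As a quick way to compare the two servers, one can ask each nfs-ganesha binary for its version. This is a sketch: the label selector and the provisioner workload name are assumptions, and it presumes `ganesha.nfsd` is on the PATH in both images:

```bash
# nfs-ganesha version inside a Longhorn share-manager pod
# (label selector is an assumption; adjust to your cluster)
SM_POD=$(kubectl -n longhorn-system get pod -l longhorn.io/component=share-manager -o name | head -n1)
kubectl -n longhorn-system exec "$SM_POD" -- ganesha.nfsd -v

# nfs-ganesha version inside the external NFS provisioner pod
# (workload name "nfs-server-provisioner" is hypothetical)
kubectl exec statefulset/nfs-server-provisioner -- ganesha.nfsd -v
```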
Let’s do this if reasonable:
The latest version of nfs-ganesha, v5.7, hits the frozen issue. The fix will be included in v6.0.
Downgrading our nfs-ganesha temporarily is one choice. IIRC, the v4.x we used before worked well after fixing the daemon crash issue. cc @james-munson
Or, backporting the fix from the thread to v5.7 ourselves is feasible as well.
Update:
Sorry! After repeated stress testing, Photon non-RT OS with the latest kernel 6.1.81-2.ph5 still suffers from frozen NFS mounts 😔
cc @innobead @derekbit @shuo-wu
Is it this one?
Looks like the one on Photon RT OS is newer
Test Plan
These simplified reproduction steps should trigger the issue quickly on both Ubuntu and Photon OS
Run the following inside a pod with the RWX volume mounted at /test-pvc:

```bash
apt update && apt install fio -y && \
  echo "writing 3G file size ..." && \
  dd if=/dev/urandom of=/test-pvc/test.img bs=1M count=3072 && \
  sleep 1 && \
  fio --name=random_write --rw=randwrite --direct=1 --size=3G --numjobs=50 --filename=/test-pvc/test.img
```

Now it should get stuck: the mount point /test-pvc/ should be frozen and cannot be read from or written to.

Remove backport 1.5.x, because v1.5.x is using nfs-ganesha v4.2.3, which is not impacted by the frozen issue. Ref: https://github.com/longhorn/longhorn/issues/8253#issuecomment-2027851926

Thanks @derekbit! Yeah, I was also looking at the ticket https://github.com/nfs-ganesha/nfs-ganesha/issues/1102 in the above analysis as well.
I agree that we either need to bump the nfs-ganesha version or revert it to avoid the hanging issue.
Update:
❌ Disable 1 NIC
Each machine has 2 NICs, and I saw that the NFS connection sometimes goes over the 2nd NIC even though I configured K3s to use the 1st NIC only (see the sketch below). I disabled the 2nd NIC on all machines --> NFS mounts are still stuck on the Photon RT node.
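For reference, pinning K3s to the first NIC and taking the second one down is typically done like this (the interface names and IP are placeholders; `--node-ip` and `--flannel-iface` are standard K3s flags):

```bash
# Install K3s bound to the first NIC only (example values)
curl -sfL https://get.k3s.io | \
  INSTALL_K3S_EXEC="server --node-ip 10.0.0.11 --flannel-iface eth0" sh -

# Temporarily disable the second NIC on every node (example interface name)
ip link set eth1 down
```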
❌ Try with Photon non-RT OS
Added a Photon non-RT node and tested it --> NFS mounts are still stuck on the Photon non-RT node.
❌ Trying to upgrade kernel and NFS module for Photon node
Case 1: Upgrade the kernel and nfs module from 6.1.10-11.ph5 to the latest version 6.1.81-2.ph5 on the Photon non-RT node. Result: no hanging (ran the test for 1 hour), but see the update at https://github.com/longhorn/longhorn/issues/8253#issuecomment-2026383213
Case 2: Upgrade the kernel and nfs module from 6.1.10-10.ph5-rt to the latest version 6.1.81-2.ph5-rt on the Photon RT node. After upgrading, the Photon RT OS node always loses its public IPs. This happens in all of our server labs; I am not sure if there is a bug in the new Photon RT OS kernel. As a result, I cannot test the upgraded Photon RT OS and will run the report on the upgraded Photon non-RT OS instead. I will create a ticket in the Photon repo for this later.
⌛ If all fails and we cannot identify the root cause, I will apply the test plan on the Ubuntu cluster and report the data.
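For completeness, the kernel upgrades in the two cases above would normally go through tdnf on Photon OS; a sketch, assuming the default `linux` and `linux-rt` kernel package names (verify with `tdnf list`):

```bash
# Photon non-RT node
tdnf makecache && tdnf upgrade -y linux && reboot

# Photon RT node
tdnf makecache && tdnf upgrade -y linux-rt && reboot
```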
Testing …
Simpler reproduce steps:
- With a single pod writing, /pvc1 in the pod is OK.
- With multiple pods writing, /pvc1 in the pods is frozen.

Note: If we replace `command: ["/bin/bash", "-c", "while : ;do dd if=/dev/urandom of=/pvc1/test.img bs=4096 count=102400 conv=notrunc; sleep 1; done"]` with `command: ["sleep", "36000000"]`, /pvc1 in the pods is not frozen after 5 mins. This led me to believe that the NFS mount freezes under I/O load. cc @shuo-wu @derekbit @james-munson

For https://github.com/longhorn/nfs-ganesha/pull/9, I think we need to be careful because we are cherry-picking from upstream PRs that have not been merged yet:
They are actively reviewing and might change it, so it may be better to wait for them to merge before we add this to our stable releases. Hopefully they will merge it in the next few days.
TODO:
- nfs-ganesha: update Longhorn share manager v1.5.x, v1.6.x, and master to point to the new Longhorn nfs-ganesha

@PhanLe1010 I've patched the fixes into nfs-ganesha v5.7 and created a share-manager image based on v1.6.0 with it. You can test it, derekbit/longhorn-share-manager:v1.6.0-fix-hang, to see if the frozen issue is gone (see the sketch below).
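One way to try the patched image, assuming Longhorn was installed via the Helm chart (the `image.longhorn.shareManager.*` values are the chart's image overrides; double-check them against your chart version):

```bash
helm upgrade longhorn longhorn/longhorn -n longhorn-system --reuse-values \
  --set image.longhorn.shareManager.repository=derekbit/longhorn-share-manager \
  --set image.longhorn.shareManager.tag=v1.6.0-fix-hang

# The new image only applies to share-manager pods created after the change,
# so detach/reattach the RWX volume (or delete its share-manager pod) afterwards.
```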
@PhanLe1010 After patching, I think we can continue using the RT-enabled distro for the following testing.
Hi @innobead! I see that the Kubernetes external NFS provisioner is using nfs-ganesha V4.0.8 while the Longhorn share manager is using nfs-ganesha V5.7.

Yes, after downgrading nfs-ganesha, the problem is gone. The analysis section above has more details.

Looks like version 4.0.8 is their latest version: https://github.com/kubernetes-sigs/nfs-ganesha-server-and-external-provisioner/blob/master/deploy/docker/Dockerfile. So I cannot do this.
I will check this
What NFS server is used in the NFS provisioner? Is it nfs-ganesha? If yes, is the version the same as ours?
As discussed, @PhanLe1010 will also check if nfs mount has problems w/o Longhorn.
The module version seems to be tied to the kernel version. I am wondering if there would be a similar issue when the Ubuntu kernel version is the same as that of Photon RT OS.
Do we know anything about the nfs kernel module? I doubt we are able to check it, though.
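A quick way to inspect the NFS client pieces on a node; a sketch, noting that the in-tree nfs modules usually report only the kernel's vermagic rather than an independent version:

```bash
# Which kernel build the nfs modules belong to
modinfo nfs nfsv4 2>/dev/null | grep -E '^(filename|vermagic)'

# Running kernel and the NFS versions/options of current mounts
uname -r
nfsstat -m
```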
Update:
I did some tests with different distributions of workload pods and share-manager pod. Result:
A few things can also be checked:
@innobead it happens on Photon OS RT but doesn't happen on the Ubuntu 22.04 cluster. I am going to test Photon OS non-RT to see if it happens there too.
Good idea, I will check this
How about launching a new workload pod with the same image and then doing a manual mount in the CSI plugin for the pod?
@shuo-wu @james-munson I followed your recommendation of manually mounting the NFS server on the same node OR using Ubuntu:
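For reference, the manual mount looks roughly like this. This is a sketch: the service IP, export path, and mount options are placeholders and vary per volume; Longhorn serves each RWX volume over NFSv4 from its share-manager pod:

```bash
# Find the share-manager service for the RWX volume
kubectl -n longhorn-system get svc | grep share-manager

# Mount the export directly on the node (IP and volume name are examples)
mkdir -p /mnt/manual-test
mount -t nfs4 -o vers=4.1,noresvport 10.43.12.34:/pvc-xxxxxxxx /mnt/manual-test

# Generate load against it and watch whether the mount freezes
dd if=/dev/urandom of=/mnt/manual-test/test.img bs=1M count=1024
```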