longhorn: [BUG] Since 1.4.0 RWX volume failing regularly

Describe the bug (🐛 if you encounter this issue)

As per #5183 and work with @derekbit.

If more than one pod is using the RWX volume (which is kind of the purpose of it 😄), the share-manager pod restarts every few minutes, resulting in errors in the consuming pods.

This is an existing volume, created some time back and regularly updated with the latest fixes.

The changes made last were

  • upgrade Longhorn to 1.4.0
  • upgrade AKS to 1.25.2 (using the create-new-nodepool method, which I have used successfully before)

Log or Support bundle

supportbundle_cf947ccc-9d7e-490f-8df5-f22db6f413f7_2023-01-06T14-28-35Z.zip

Logs using the latest patch from @derekbit: longhorn-manager-2.zip

Environment

  • Longhorn version: 1.4.0
  • Installation method (e.g. Rancher Catalog App/Helm/Kubectl): helm (upgraded from previously deployed 1.3.2)
  • Kubernetes distro (e.g. RKE/K3s/EKS/OpenShift) and version: AKS 1.25.2
    • Number of management node in the cluster:
    • Number of worker node in the cluster: 3
  • Node config
    • OS type and version: Ubuntu 22.04.1 LTS
    • CPU per node: 2
    • Memory per node: 8 GB
    • Disk type(e.g. SSD/NVMe): Premium SSD LRS
    • Network bandwidth between the nodes: ?
  • Underlying Infrastructure (e.g. on AWS/GCE, EKS/GKE, VMWare/KVM, Baremetal): AKS
  • Number of Longhorn volumes in the cluster: 1

About this issue

  • Original URL
  • State: closed
  • Created a year ago
  • Reactions: 3
  • Comments: 39 (26 by maintainers)

Most upvoted comments

I’ve tried to reproduce this on AKS, but have been unable to observe a share-manager restart yet.

With the following 3 scenarios, the share-manager runs stably without restarting.

Just wondering: were any extra settings customized when creating the cluster (since I only created the cluster with default settings)? Is this issue always reproducible in your Azure environment? For example, does it only happen in this specific cluster + node pool, or is it also reproducible when you create a new cluster? Moreover, if you create another new node pool and migrate the workloads to it, just like what we did when upgrading k8s (https://longhorn.github.io/longhorn-tests/manual/pre-release/managed-kubernetes-clusters/aks/upgrade-k8s/), does the restart persist? Thanks!

1.24. I have been using Longhorn since 1.1.2, I think, and upgraded as each new version became available. You can see I was using AKS 1.23 in #3873, and I was also using AKS 1.21 in #2787.

@tbertenshaw From the dmesg, I saw a ton of error messages like ganesha.nfsd[278263]: segfault at 118 ip 00007f1bb2220692 sp 00007f1ba21f3fb0 error 4 in libganesha_nfsd.so.4.2[7f1bb2191000+194000]. It means the nfs-ganesha server in the share-manager somehow crashed repeatedly.

I also tried to locate the position in nfs-ganesha:

# addr2line -e libganesha_nfsd.so.4.2 -fCi 8F692
dec_nfs4_state_ref
:?

where 8F692 is 00007f1bb2220692 - 7f1bb2191000. It looks like the issue is not from our recovery backend C code in nfs-ganesha.
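The offset arithmetic can be sanity-checked in any shell: subtract the library's load base (from the dmesg line) from the faulting instruction pointer, and feed the result to addr2line as above.

```shell
# ip = 0x7f1bb2220692, load base of libganesha_nfsd.so.4.2 = 0x7f1bb2191000.
# The difference is the offset inside the library that addr2line expects.
printf '%X\n' $((0x7f1bb2220692 - 0x7f1bb2191000))
# prints 8F692
```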

I’d report the issue to the nfs-ganesha community, but the problem is that I cannot reproduce it in our environment, which makes it hard to open a ticket…

It seems nfs-ganesha v4.2 has some issues. I can backport the recovery backend code to nfs-ganesha v3.x and hope @tbertenshaw can do us a favor by testing it. WDYT? @tbertenshaw @innobead

yeah, not sure what that is or means if I can write to the disks 😄

Adding require/qa-reproduce so QA can reproduce this first before @derekbit works on this issue. cc @longhorn/qa

You can see one share-manager’s latest pod instance was created 2 hours after the longhorn-manager instances, which were created when I edited the daemonset. That implies that share-manager (the one for the moodle workload) has restarted at least once? Certainly significantly fewer restarts if it’s just one, anyway.

Got it, thank you. I will check the restart. Now we can confirm the frequent restarts are caused by the issue in nfs-ganesha 4.2. cc @innobead

There are at least two issues in this ticket.

The first one is that the share-manager pod is deleted and created repeatedly because of the change in https://github.com/longhorn/longhorn-manager/commit/ed1f74273fb2d86d85d910ec7a08e2654f477536. Immediately checking pod.spec.nodeName (code) after creating a pod (code) would lead to an error because the pod has not been scheduled by the kubelet yet. For this issue, I added verifyCreation logic in the share-manager controller, but I suspect there are still some unresolved issues in the controller.
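The scheduling race above can be sketched in shell: pod.spec.nodeName stays empty until the scheduler binds the pod, so a one-shot check right after creation can fail, while a short poll (roughly what the verifyCreation logic amounts to) succeeds. This is a minimal illustrative sketch, not Longhorn's actual Go controller code; the function name, getter command, and retry counts are made up for the example.

```shell
# Poll a "get node name" command until it returns non-empty output,
# instead of checking pod.spec.nodeName exactly once after pod creation.
# Usage: wait_for_schedule <getter-command> [tries] [delay-seconds]
wait_for_schedule() {
  getter="$1"; tries="${2:-30}"; delay="${3:-2}"
  for _ in $(seq 1 "$tries"); do
    # In a real cluster the getter would be something like:
    #   kubectl -n longhorn-system get pod <pod> -o jsonpath='{.spec.nodeName}'
    node=$("$getter")
    if [ -n "$node" ]; then
      echo "$node"   # scheduled: print the bound node and succeed
      return 0
    fi
    sleep "$delay"   # not bound yet; wait and retry
  done
  return 1           # never scheduled within the retry budget
}
```

The one-shot equivalent (calling the getter a single time immediately after creation) is exactly the race the commit introduced: it observes an empty nodeName and treats the pod as failed.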

The second issue is the one I mentioned in https://github.com/longhorn/longhorn/issues/5224#issuecomment-1374431192. Actually, I am not sure if it is related to the first one or to the nfs-ganesha issue. So, the customized images are to clarify first whether it is introduced by nfs-ganesha.

cc @innobead

Let’s test it with @tbertenshaw first. I am curious whether this is only reproducible in a specific env 🤔.

Hello @tbertenshaw,

As I explained, I’ve built the longhorn-manager image and the share-manager image. If you’re available, you can try them.

  1. Replace the longhorn-manager daemonset image with derekbit/longhorn-manager:1.4.0-rwx
  2. Replace the longhorn-manager daemonset command’s share-manager-image with derekbit/longhorn-share-manager:v_20230107. Here is an example.
      containers:
      - command:
        - longhorn-manager
        ...
        - --share-manager-image
        - derekbit/longhorn-share-manager:v_20230107

You can revert the images back to the original ones after testing. Please also help provide us the support bundle and the output of dmesg -T. Many thanks. 😃

It is fine. I can check the latest support bundle in https://github.com/longhorn/longhorn/issues/5224#issuecomment-1373829619. Much appreciated.