longhorn: [BUG] Since 1.4.0 RWX volume failing regularly
Describe the bug (🐛 if you encounter this issue)
As per #5183 and work with @derekbit.
If more than one pod utilises the RWX volume (which is kind of the purpose of it 😄), the share-manager pod restarts every few minutes, resulting in errors from the consuming pods.
This is an existing volume, created some time back and regularly updated to the latest fixes.
The most recent changes were:
- upgrade Longhorn to 1.4.0
- upgrade AKS to 1.25.2 (using the create-new-nodepool method, which I have used successfully before)
Log or Support bundle
supportbundle_cf947ccc-9d7e-490f-8df5-f22db6f413f7_2023-01-06T14-28-35Z.zip
Logs using the latest patch from @derekbit: longhorn-manager-2.zip
Environment
- Longhorn version: 1.4.0
- Installation method (e.g. Rancher Catalog App/Helm/Kubectl): helm (upgraded from previously deployed 1.3.2)
- Kubernetes distro (e.g. RKE/K3s/EKS/OpenShift) and version: AKS 1.25.2
- Number of management node in the cluster:
- Number of worker node in the cluster: 3
- Node config
- OS type and version: Ubuntu 22.04.1 LTS
- CPU per node: 2
- Memory per node: 8 GB
- Disk type(e.g. SSD/NVMe): Premium SSD LRS
- Network bandwidth between the nodes: ?
- Underlying Infrastructure (e.g. on AWS/GCE, EKS/GKE, VMWare/KVM, Baremetal): AKS
- Number of Longhorn volumes in the cluster: 1
About this issue
- Original URL
- State: closed
- Created a year ago
- Reactions: 3
- Comments: 39 (26 by maintainers)
I’ve tried to reproduce this on AKS, but have been unable to observe a share-manager restart yet.
In the following three scenarios, the share-manager runs stably without restarting.
Scenario 1: directly running a v1.25.2 cluster
(1) create a v1.25.2 cluster
(2) helm install longhorn v1.4.0
(3) kubectl apply -f https://raw.githubusercontent.com/longhorn/longhorn/master/examples/rwx/rwx-nginx-deployment.yaml
(4) wait at least 15 mins for a share-manager restart

Scenario 2: upgrade Longhorn from v1.3.2 to v1.4.0, and upgrade k8s from v1.24.6 to v1.25.2
(1) create a v1.24.6 cluster
(2) helm install longhorn v1.3.2
(3) kubectl apply -f https://raw.githubusercontent.com/longhorn/longhorn/master/examples/rwx/rwx-nginx-deployment.yaml
(4) helm upgrade longhorn to v1.4.0
(5) upgrade the k8s version from v1.24.6 to v1.25.2 following the steps: https://longhorn.github.io/longhorn-tests/manual/pre-release/managed-kubernetes-clusters/aks/upgrade-k8s/
(6) wait at least 15 mins for a share-manager restart

Scenario 3: upgrade Longhorn from v1.3.2 to v1.4.0, and upgrade k8s from v1.22.11 to v1.23.12, then to v1.24.6, then to v1.25.2
(1) create a v1.22.11 cluster
(2) helm install longhorn v1.3.2
(3) kubectl apply -f https://raw.githubusercontent.com/longhorn/longhorn/master/examples/rwx/rwx-nginx-deployment.yaml
(4) helm upgrade longhorn to v1.4.0
(5) upgrade the k8s version from v1.22.11 to v1.23.12, then to v1.24.6, then to v1.25.2, following the steps: https://longhorn.github.io/longhorn-tests/manual/pre-release/managed-kubernetes-clusters/aks/upgrade-k8s/
(6) wait at least 15 mins for a share-manager restart
Just wondering: were any extra settings customized when creating the cluster (I only created the cluster with default settings)? Is this issue always reproducible in your Azure environment? That is, does it only happen in this specific cluster + node pool, or is it always reproducible when you create a new cluster? Moreover, if you create another new node pool and migrate the workloads to it, just like what we did when upgrading k8s (https://longhorn.github.io/longhorn-tests/manual/pre-release/managed-kubernetes-clusters/aks/upgrade-k8s/), does the restart persist? Thanks!
1.24. I have been using Longhorn since 1.1.2, I think, and upgraded as each new version became available. You can see I was using AKS 1.23 in #3873, and AKS 1.21 in #2787.
@tbertenshaw From the dmesg, I saw a ton of error messages like
ganesha.nfsd[278263]: segfault at 118 ip 00007f1bb2220692 sp 00007f1ba21f3fb0 error 4 in libganesha_nfsd.so.4.2[7f1bb2191000+194000]
It means the nfs-ganesha server in the share-manager somehow crashed repeatedly. I also tried to locate the faulting position in nfs-ganesha, where the offset 8F692 is 00007f1bb2220692 - 7f1bb2191000. It looks like the issue is not from our recovery-backend C code in nfs-ganesha.
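As a quick sanity check on the arithmetic above (a minimal sketch; the addresses come from the dmesg segfault line quoted earlier), the file-relative offset inside libganesha_nfsd.so.4.2 is the faulting instruction pointer minus the library's load base:

```python
# Addresses taken from the dmesg segfault line quoted above.
ip = 0x00007F1BB2220692      # faulting instruction pointer ("ip")
load_base = 0x7F1BB2191000   # start of libganesha_nfsd.so.4.2's mapping
mapping_len = 0x194000       # length of the mapping, from "[base+len]"

# Offset within the library file, suitable for a symbolizer.
offset = ip - load_base
print(hex(offset))  # 0x8f692

# The fault indeed falls inside the library's mapping.
assert load_base <= ip < load_base + mapping_len
```

That offset can then be fed to a symbolizer, e.g. `addr2line -e libganesha_nfsd.so.4.2 0x8f692` (assuming an unstripped copy of the library), to find the crashing source line.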
I’d report the issue to the nfs-ganesha community, but the problem is that I cannot reproduce it in our environment, which makes it hard to open a ticket…
It seems nfs-ganesha v4.2 has some issues. I can backport the recovery-backend code to nfs-ganesha v3.x, and hope @tbertenshaw can do us a favor by testing it. WDYT? @tbertenshaw @innobead
cc @PhanLe1010
yeah not sure what that is or means if i can write to the disks 😄
Adding require/qa-reproduce to reproduce this first before @derekbit works on this issue. cc @longhorn/qa

Got it, thank you. Will check the restart.

Now we can confirm the frequent restarts are caused by the issue in nfs-ganesha 4.2. cc @innobead
There are at least two issues in this ticket.
The first one is that the share-manager pod is deleted and created repeatedly because of the change in https://github.com/longhorn/longhorn-manager/commit/ed1f74273fb2d86d85d910ec7a08e2654f477536. Immediately checking pod.spec.nodeName (code) after creating a pod (code) leads to an error, because the pod has not been scheduled yet. For this issue, I added verifyCreation logic in the share-manager controller, but I suspect there are still unresolved issues in the controller.

The second issue is the one I mentioned in https://github.com/longhorn/longhorn/issues/5224#issuecomment-1374431192. Actually, I am not sure whether it is related to the first one or to the nfs-ganesha issue. So the customized images are meant to clarify first whether it is introduced by nfs-ganesha.
cc @innobead
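The scheduling race described above (checking pod.spec.nodeName immediately after creation) is typically fixed by polling until the scheduler has bound the pod. A minimal sketch in Python with hypothetical names — the real verifyCreation logic lives in the Go share-manager controller, so this only illustrates the retry idea:

```python
import time

def wait_for_scheduled(get_pod, timeout=30.0, interval=1.0):
    """Poll until the pod is scheduled, i.e. spec.nodeName is set.

    `get_pod` is a callable returning the pod as a dict. Checking
    nodeName once, right after creation, races with the scheduler,
    so we retry until the deadline instead.
    """
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        pod = get_pod()
        node = pod.get("spec", {}).get("nodeName")
        if node:
            return node  # scheduler has bound the pod to a node
        time.sleep(interval)
    raise TimeoutError("pod was never scheduled")

# Simulated API server: nodeName only appears after a few polls.
calls = {"n": 0}
def fake_get_pod():
    calls["n"] += 1
    if calls["n"] < 3:
        return {"spec": {}}  # not scheduled yet
    return {"spec": {"nodeName": "aks-node-1"}}

print(wait_for_scheduled(fake_get_pod, timeout=5, interval=0.01))  # aks-node-1
```

The single immediate check fails exactly in the window the simulated API models here; polling (or reacting to pod update events, as controllers usually do) closes that window.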
Let’s test it with @tbertenshaw first. I am curious whether this is only reproducible in a specific env 🤔.
Hello @tbertenshaw,
As I explained, I’ve built the longhorn-manager image and the share-manager image. If available, you can try them: replace the longhorn-manager image with derekbit/longhorn-manager:1.4.0-rwx and the share-manager image with derekbit/longhorn-share-manager:v_20230107. You can revert the images back to the original ones after testing. Please also help provide us the support bundle and the dmesg -T output. Many thanks. 😃

It is fine. I can check the latest support bundle in https://github.com/longhorn/longhorn/issues/5224#issuecomment-1373829619. Appreciated.
supportbundle_cf947ccc-9d7e-490f-8df5-f22db6f413f7_2023-01-06T15-55-03Z.zip