longhorn: [BUG] Volumes stuck upgrading after 1.5.3 -> 1.6.0 upgrade.
Describe the bug
Upgraded the system from version 1.5.3 to 1.6.0 today to get around the RWX bug. Got to the part where engines are upgraded through the UI and did that for all volumes. The number of replicas went sky-high, and the volumes are stuck in the upgrading state.
As one volume was stuck detaching, I thought restarting the instance manager responsible for it might break it out. It did not, and it made things so, so much worse. Now there are even more volumes stuck detaching, all the volumes are marked degraded, and replica rebuilding won't happen. I cannot roll back the volumes now either; it fails with the error: `cannot do live upgrade for a unhealthy volume`
To Reproduce
Not sure if it’s reproducible.
Expected behavior
I've used this upgrade method since Longhorn version 1.0.2 and expected it to work the same way as before, upgrading all the volumes and finishing cleanly.
Support bundle for troubleshooting
Support bundle attached.
Environment
- Longhorn version: 1.6.0
- Impacted volume (PV): All of them.
- Installation method (e.g. Rancher Catalog App/Helm/Kubectl): Kubectl
- Kubernetes distro (e.g. RKE/K3s/EKS/OpenShift) and version: Kubeadm, 1.27.4
- Number of control plane nodes in the cluster: 3
- Number of worker nodes in the cluster: 3
- Node config
  - OS type and version: Flatcar, latest
  - Kernel version: 6.1.73
  - CPU per node: 128 cores
  - Memory per node: 512 GB
  - Disk type (e.g. SSD/NVMe/HDD): NVMe
  - Network bandwidth between the nodes (Gbps): 100 Gbps
- Underlying Infrastructure (e.g. on AWS/GCE, EKS/GKE, VMWare/KVM, Baremetal): Baremetal
- Number of Longhorn volumes in the cluster: ~160
Additional context
This is a production system, so I am currently fairly worried.
About this issue
- State: closed
- Created 5 months ago
- Reactions: 3
- Comments: 27 (15 by maintainers)
I am not sure yet, but I think that may also be ineffective.
I cannot reproduce this behavior in my cluster and it is not seen by the upgrade tests in the CI either. It appears schema validation in the API server is rejecting these update requests, but I’m not sure why that would be the case in your cluster and not others. Are you aware of any cluster hardening you have in place that might affect this behavior?
The issue seems pretty similar to https://github.com/longhorn/longhorn/issues/3352, though that one is very old. The fix for it was to make some map fields nullable and to better ensure we submitted empty objects instead of nil to the Go Kubernetes client. Like the current issue, that one only seemed to be encountered in some clusters/versions.
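As a rough sketch of the nil-versus-empty distinction that fix addressed (the field name below is a placeholder for illustration, not necessarily one of the fields touched by that change):

```yaml
# A nil Go map serializes as null; against a structural CRD schema where the
# field is `type: object` and not marked nullable, the API server rejects the
# update. "replicaModeMap" here is a placeholder field name.
status:
  replicaModeMap: null
---
# An initialized-but-empty map serializes as {}, which passes validation.
status:
  replicaModeMap: {}
```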
This is my preferred attempt at a temporary fix. Unfortunately, I cannot validate it without being able to reproduce the issue.
Edit it in the following locations so that `nullable: true` is set:

I think you should leave the `nullable: true` fields in place. It sounds like the `preserveUnknownFields: false` change does not, at least, cause anything unexpectedly bad to occur, so we will likely include both changes in the next Longhorn release. @Eilyre, I am pretty sure it will not cause any attach/detach operations, but it makes sense to me to wait until the other issue is resolved, just in case. I cannot easily test it because recent versions of Kubernetes won't allow me to even set `preserveUnknownFields: true` as a means of reproducing the issue.
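For anyone following along, here is a minimal sketch of where the two changes discussed above would sit in a CRD manifest. The CRD name and the map-typed field shown are placeholders, not the exact locations from the proposed fix:

```yaml
# Illustrative sketch only; the real edit targets the specific Longhorn CRDs
# and fields referenced by the maintainer, which are not reproduced here.
apiVersion: apiextensions.k8s.io/v1
kind: CustomResourceDefinition
metadata:
  name: engines.longhorn.io          # placeholder CRD name
spec:
  group: longhorn.io
  names:
    kind: Engine
    listKind: EngineList
    plural: engines
    singular: engine
  scope: Namespaced
  preserveUnknownFields: false       # change 1: stop preserving unknown fields
  versions:
    - name: v1beta2
      served: true
      storage: true
      schema:
        openAPIV3Schema:
          type: object
          properties:
            status:
              type: object
              properties:
                replicaModeMap:      # placeholder map-typed field
                  type: object
                  nullable: true     # change 2: allow null for this map
                  additionalProperties:
                    type: string
```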
I think there's an underlying problem that caused this issue, and it is rearing its head on our cluster again, @ejweber. Attach/detach/deletion operations are getting stuck again, but only when a volume needs to attach to another node. Volume creation goes through, but the volume won't be able to attach to the pod.
The replicas stay in a weird state:
The errors in the logs are very similar to before:
`Failed to get engine proxy of pvc-58297811-5dc4-4ac3-944d-056512745d8d-e-a4e533b1 for volume pvc-58297811-5dc4-4ac3-944d-056512745d8d`, but there's also a lot of `Invalid gRPC metadata`. I also do not understand why 4 replicas are kept for a lot of volumes; my default is configured to 3. I sent a new version of the support bundle.
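On the replica-count question, a hedged sketch of where the per-volume setting lives (field names follow the Longhorn v1beta2 API; worth verifying against the cluster rather than taking this as definitive):

```yaml
# Sketch: each Longhorn Volume carries its own replica count in spec; checking
# it per volume shows whether 4 replicas are actually configured or the extra
# replica is transient (e.g. left over from a rebuild).
apiVersion: longhorn.io/v1beta2
kind: Volume
metadata:
  name: pvc-58297811-5dc4-4ac3-944d-056512745d8d   # example volume from the log above
  namespace: longhorn-system
spec:
  numberOfReplicas: 3   # expected to match the configured default of 3
```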
@ejweber, let's not forget to add these issues to the upcoming milestone (1.7.0).
I would leave it for now. I suspect that we will want to make the change official in the next version of Longhorn, but I need to investigate a bit more. If you open any additional issues against Longhorn before this is fully resolved, please remind us of these changes up front so we can consider if they have an impact.
Not yet. I am going to experiment with your specific version of Kubernetes and see if I can trigger anything. Hopefully you will be available to answer some additional questions later on if I think of any.
Thanks for opening the issue!
This one: https://github.com/longhorn/longhorn/issues/7183