longhorn: [BUG] Volumes stuck upgrading after 1.5.3 -> 1.6.0 upgrade.
Describe the bug
Upgraded the system from version 1.5.3 to 1.6.0 today to get around the RWX bug. Got to the part where engines are upgraded through the UI and did that for all volumes. The number of replicas went sky-high, and the volumes are stuck in the upgrading state.
As one volume was stuck detaching, I thought restarting the instance manager responsible for it might break it out. It did not, and it made things so, so much worse. Now there are even more volumes stuck detaching, all the volumes are marked degraded, and replica rebuilding won't happen. I cannot roll back the volumes now either; it fails with the error: `cannot do live upgrade for a unhealthy volume`
To Reproduce
Not sure if it’s reproducible.
Expected behavior
I've used this upgrade method since Longhorn version 1.0.2 and expected it to work the same way as before, upgrading all the volumes and finishing cleanly.
Support bundle for troubleshooting
Support bundle attached.
Environment
- Longhorn version: 1.6.0
- Impacted volume (PV): All of them.
- Installation method (e.g. Rancher Catalog App/Helm/Kubectl): Kubectl
- Kubernetes distro (e.g. RKE/K3s/EKS/OpenShift) and version: Kubeadm, 1.27.4
- Number of control plane nodes in the cluster: 3
- Number of worker nodes in the cluster: 3
- Node config
  - OS type and version: Flatcar, latest
  - Kernel version: 6.1.73
  - CPU per node: 128 cores
  - Memory per node: 512 GB
  - Disk type (e.g. SSD/NVMe/HDD): NVMe
  - Network bandwidth between the nodes (Gbps): 100 Gbps
- Underlying Infrastructure (e.g. on AWS/GCE, EKS/GKE, VMWare/KVM, Baremetal): Baremetal
- Number of Longhorn volumes in the cluster: ~160
Additional context
This is a production system, so I am currently fairly worried.
About this issue
- State: closed
- Created 5 months ago
- Reactions: 3
- Comments: 27 (15 by maintainers)
I am not sure yet, but I think that may also be ineffective.
I cannot reproduce this behavior in my cluster and it is not seen by the upgrade tests in the CI either. It appears schema validation in the API server is rejecting these update requests, but I’m not sure why that would be the case in your cluster and not others. Are you aware of any cluster hardening you have in place that might affect this behavior?
The issue seems pretty similar to https://github.com/longhorn/longhorn/issues/3352, though that one is very old. The fix for it was to make some map fields nullable and to better ensure we submitted empty objects instead of nil to the Go Kubernetes client. Like the current issue, that one only seemed to be encountered in some clusters/versions.
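As a rough sketch of the nil-versus-empty distinction that fix addressed (the field name below is a placeholder for illustration, not necessarily one of the fields touched by that change):

```yaml
# A nil Go map serializes as null; against a structural CRD schema where the
# field is `type: object` and not marked nullable, the API server rejects the
# update. "replicaModeMap" here is a placeholder field name.
status:
  replicaModeMap: null
---
# An initialized-but-empty map serializes as {}, which passes validation.
status:
  replicaModeMap: {}
```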
This is my preferred attempt at a temporary fix. Unfortunately, I cannot validate it without being able to reproduce the issue.
Edit it in the following locations so that `nullable: true` is set:

I think you should leave the `nullable: true` fields in place. It sounds like the `preserveUnknownFields: false` change does not, at least, cause anything unexpectedly bad to occur, so we will likely include both changes in the next Longhorn release. @Eilyre, I am pretty sure it will not cause any attach/detach operations, but it makes sense to me to wait until the other issue is resolved, just in case. I cannot easily test it because recent versions of Kubernetes won't allow me to even set `preserveUnknownFields: true` as a means of reproducing the issue.
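For anyone following along, here is a minimal sketch of where the two changes discussed above would sit in a CRD manifest. The CRD name and the map-typed field shown are placeholders, not the exact locations from the proposed fix:

```yaml
# Illustrative sketch only; the real edit targets the specific Longhorn CRDs
# and fields referenced by the maintainer, which are not reproduced here.
apiVersion: apiextensions.k8s.io/v1
kind: CustomResourceDefinition
metadata:
  name: engines.longhorn.io          # placeholder CRD name
spec:
  group: longhorn.io
  names:
    kind: Engine
    listKind: EngineList
    plural: engines
    singular: engine
  scope: Namespaced
  preserveUnknownFields: false       # change 1: stop preserving unknown fields
  versions:
    - name: v1beta2
      served: true
      storage: true
      schema:
        openAPIV3Schema:
          type: object
          properties:
            status:
              type: object
              properties:
                replicaModeMap:      # placeholder map-typed field
                  type: object
                  nullable: true     # change 2: allow null for this map
                  additionalProperties:
                    type: string
```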
I think there's an underlying problem that caused this issue, and it is rearing its head on our cluster again, @ejweber. Attach/detach/deletion operations are getting stuck again, but only when a volume needs to attach to another node. Volume creation goes through, but the volume won't be able to attach to the pod.
The replicas stay in a weird state:
The errors in the logs are very similar to before:
`Failed to get engine proxy of pvc-58297811-5dc4-4ac3-944d-056512745d8d-e-a4e533b1 for volume pvc-58297811-5dc4-4ac3-944d-056512745d8d`, but there's also a lot of `Invalid gRPC metadata`. I also do not understand why 4 replicas are kept for a lot of volumes; my default is configured to 3. I sent a new version of the support bundle.
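On the replica-count question, a hedged sketch of where the per-volume setting lives (field names follow the Longhorn v1beta2 API; worth verifying against the cluster rather than taking this as definitive):

```yaml
# Sketch: each Longhorn Volume carries its own replica count in spec; checking
# it per volume shows whether 4 replicas are actually configured or the extra
# replica is transient (e.g. left over from a rebuild).
apiVersion: longhorn.io/v1beta2
kind: Volume
metadata:
  name: pvc-58297811-5dc4-4ac3-944d-056512745d8d   # example volume from the log above
  namespace: longhorn-system
spec:
  numberOfReplicas: 3   # expected to match the configured default of 3
```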
@ejweber, let's not forget to add these issues to the upcoming milestone (1.7.0).
I would leave it for now. I suspect that we will want to make the change official in the next version of Longhorn, but I need to investigate a bit more. If you open any additional issues against Longhorn before this is fully resolved, please remind us of these changes up front so we can consider if they have an impact.
Not yet. I am going to experiment with your specific version of Kubernetes and see if I can trigger anything. Hopefully you will be available to answer some additional questions later on if I think of any.
Thanks for opening the issue!
This one: https://github.com/longhorn/longhorn/issues/7183