longhorn: [BUG] 1.5.0 Upgrade: Longhorn conversion webhook server fails
Describe the bug (🐛 if you encounter this issue)
To Reproduce
Steps to reproduce the behavior: Upgrade Longhorn from 1.4.2 to 1.5.0 via Helm.
Expected behavior
Successful upgrade to 1.5.0
Log or Support bundle
New longhorn-manager (longhornio/longhorn-manager:v1.5.0) pods are in a crash loop.
Old longhorn-admission-webhook (longhornio/longhorn-manager:v1.4.2) pods are stuck with this status:
Containers with incomplete status: [wait-longhorn-conversion-webhook]
Logs from one of the new manager pods:
time="2023-07-07T21:55:20Z" level=info msg="Starting longhorn conversion webhook server"
W0707 21:55:20.669267 1 client_config.go:618] Neither --kubeconfig nor --master was specified. Using the inClusterConfig. This might not work.
time="2023-07-07T21:55:20Z" level=info msg="Waiting for conversion webhook to become ready"
time="2023-07-07T21:55:20Z" level=warning msg="Failed to get webhook health endpoint https://localhost:9501/v1/healthz" error="Get \"https://localhost:9501/v1/healthz\": dial tcp 127.0.0.1:9501: connect: connection refused"
time="2023-07-07T21:55:20Z" level=info msg="Active TLS secret longhorn-webhook-tls (ver=405527433) (count 2): map[listener.cattle.io/cn-longhorn-admission-webhook.longhor-59584d:longhorn-admission-webhook.longhorn-system.svc listener.cattle.io/cn-longhorn-conversion-webhook.longho-6a0089:longhorn-conversion-webhook.longhorn-system.svc listener.cattle.io/fingerprint:SHA1=D26B36C25CA9B87164799F17931821667AFF374E]"
time="2023-07-07T21:55:20Z" level=info msg="Listening on :9501"
time="2023-07-07T21:55:21Z" level=info msg="Starting apiregistration.k8s.io/v1, Kind=APIService controller"
time="2023-07-07T21:55:21Z" level=info msg="Starting apiextensions.k8s.io/v1, Kind=CustomResourceDefinition controller"
time="2023-07-07T21:55:21Z" level=info msg="Starting /v1, Kind=Secret controller"
time="2023-07-07T21:55:21Z" level=info msg="Building conversion rules..."
time="2023-07-07T21:55:21Z" level=info msg="Updating TLS secret for longhorn-webhook-tls (count: 2): map[listener.cattle.io/cn-longhorn-admission-webhook.longhor-59584d:longhorn-admission-webhook.longhorn-system.svc listener.cattle.io/cn-longhorn-conversion-webhook.longho-6a0089:longhorn-conversion-webhook.longhorn-system.svc listener.cattle.io/fingerprint:SHA1=D26B36C25CA9B87164799F17931821667AFF374E]"
time="2023-07-07T21:55:22Z" level=info msg="Webhook conversion is ready"
time="2023-07-07T21:55:22Z" level=warning msg="Started longhorn conversion webhook server"
time="2023-07-07T21:55:22Z" level=info msg="Starting longhorn admission webhook server"
W0707 21:55:22.672755 1 client_config.go:618] Neither --kubeconfig nor --master was specified. Using the inClusterConfig. This might not work.
time="2023-07-07T21:55:22Z" level=info msg="Waiting for admission webhook to become ready"
time="2023-07-07T21:55:22Z" level=warning msg="Failed to get webhook health endpoint https://localhost:9502/v1/healthz" error="Get \"https://localhost:9502/v1/healthz\": dial tcp 127.0.0.1:9502: connect: connection refused"
I0707 21:55:22.674234 1 shared_informer.go:311] Waiting for caches to sync for longhorn datastore
W0707 21:55:22.679839 1 reflector.go:533] github.com/longhorn/longhorn-manager/k8s/pkg/client/informers/externalversions/factory.go:117: failed to list *v1beta2.VolumeAttachment: volumeattachments.longhorn.io is forbidden: User "system:serviceaccount:longhorn-system:longhorn-service-account" cannot list resource "volumeattachments" in API group "longhorn.io" at the cluster scope
E0707 21:55:22.679874 1 reflector.go:148] github.com/longhorn/longhorn-manager/k8s/pkg/client/informers/externalversions/factory.go:117: Failed to watch *v1beta2.VolumeAttachment: failed to list *v1beta2.VolumeAttachment: volumeattachments.longhorn.io is forbidden: User "system:serviceaccount:longhorn-system:longhorn-service-account" cannot list resource "volumeattachments" in API group "longhorn.io" at the cluster scope
I0707 21:55:23.874892 1 request.go:696] Waited for 1.199308873s due to client-side throttling, not priority and fairness, request: GET:https://10.43.0.1:443/apis/longhorn.io/v1beta2/backingimagemanagers?limit=500&resourceVersion=0
time="2023-07-07T21:55:24Z" level=warning msg="Failed to get webhook health endpoint https://localhost:9502/v1/healthz" error="Get \"https://localhost:9502/v1/healthz\": dial tcp 127.0.0.1:9502: connect: connection refused"
W0707 21:55:25.076523 1 reflector.go:533] github.com/longhorn/longhorn-manager/k8s/pkg/client/informers/externalversions/factory.go:117: failed to list *v1beta2.VolumeAttachment: volumeattachments.longhorn.io is forbidden: User "system:serviceaccount:longhorn-system:longhorn-service-account" cannot list resource "volumeattachments" in API group "longhorn.io" at the cluster scope
E0707 21:55:25.076574 1 reflector.go:148] github.com/longhorn/longhorn-manager/k8s/pkg/client/informers/externalversions/factory.go:117: Failed to watch *v1beta2.VolumeAttachment: failed to list *v1beta2.VolumeAttachment: volumeattachments.longhorn.io is forbidden: User "system:serviceaccount:longhorn-system:longhorn-service-account" cannot list resource "volumeattachments" in API group "longhorn.io" at the cluster scope
time="2023-07-07T21:55:26Z" level=warning msg="Failed to get webhook health endpoint https://localhost:9502/v1/healthz" error="Get \"https://localhost:9502/v1/healthz\": dial tcp 127.0.0.1:9502: connect: connection refused"
W0707 21:55:26.987506 1 reflector.go:533] github.com/longhorn/longhorn-manager/k8s/pkg/client/informers/externalversions/factory.go:117: failed to list *v1beta2.VolumeAttachment: volumeattachments.longhorn.io is forbidden: User "system:serviceaccount:longhorn-system:longhorn-service-account" cannot list resource "volumeattachments" in API group "longhorn.io" at the cluster scope
E0707 21:55:26.987561 1 reflector.go:148] github.com/longhorn/longhorn-manager/k8s/pkg/client/informers/externalversions/factory.go:117: Failed to watch *v1beta2.VolumeAttachment: failed to list *v1beta2.VolumeAttachment: volumeattachments.longhorn.io is forbidden: User "system:serviceaccount:longhorn-system:longhorn-service-account" cannot list resource "volumeattachments" in API group "longhorn.io" at the cluster scope
time="2023-07-07T21:55:28Z" level=warning msg="Failed to get webhook health endpoint https://localhost:9502/v1/healthz" error="Get \"https://localhost:9502/v1/healthz\": dial tcp 127.0.0.1:9502: connect: connection refused"
time="2023-07-07T21:55:30Z" level=warning msg="Failed to get webhook health endpoint https://localhost:9502/v1/healthz" error="Get \"https://localhost:9502/v1/healthz\": dial tcp 127.0.0.1:9502: connect: connection refused"
time="2023-07-07T21:55:32Z" level=warning msg="Failed to get webhook health endpoint https://localhost:9502/v1/healthz" error="Get \"https://localhost:9502/v1/healthz\": dial tcp 127.0.0.1:9502: connect: connection refused"
W0707 21:55:33.027197 1 reflector.go:533] github.com/longhorn/longhorn-manager/k8s/pkg/client/informers/externalversions/factory.go:117: failed to list *v1beta2.VolumeAttachment: volumeattachments.longhorn.io is forbidden: User "system:serviceaccount:longhorn-system:longhorn-service-account" cannot list resource "volumeattachments" in API group "longhorn.io" at the cluster scope
E0707 21:55:33.027234 1 reflector.go:148] github.com/longhorn/longhorn-manager/k8s/pkg/client/informers/externalversions/factory.go:117: Failed to watch *v1beta2.VolumeAttachment: failed to list *v1beta2.VolumeAttachment: volumeattachments.longhorn.io is forbidden: User "system:serviceaccount:longhorn-system:longhorn-service-account" cannot list resource "volumeattachments" in API group "longhorn.io" at the cluster scope
time="2023-07-07T21:55:34Z" level=warning msg="Failed to get webhook health endpoint https://localhost:9502/v1/healthz" error="Get \"https://localhost:9502/v1/healthz\": dial tcp 127.0.0.1:9502: connect: connection refused"
time="2023-07-07T21:55:36Z" level=warning msg="Failed to get webhook health endpoint https://localhost:9502/v1/healthz" error="Get \"https://localhost:9502/v1/healthz\": dial tcp 127.0.0.1:9502: connect: connection refused"
time="2023-07-07T21:55:38Z" level=warning msg="Failed to get webhook health endpoint https://localhost:9502/v1/healthz" error="Get \"https://localhost:9502/v1/healthz\": dial tcp 127.0.0.1:9502: connect: connection refused"
W0707 21:55:40.367961 1 reflector.go:533] github.com/longhorn/longhorn-manager/k8s/pkg/client/informers/externalversions/factory.go:117: failed to list *v1beta2.VolumeAttachment: volumeattachments.longhorn.io is forbidden: User "system:serviceaccount:longhorn-system:longhorn-service-account" cannot list resource "volumeattachments" in API group "longhorn.io" at the cluster scope
E0707 21:55:40.367994 1 reflector.go:148] github.com/longhorn/longhorn-manager/k8s/pkg/client/informers/externalversions/factory.go:117: Failed to watch *v1beta2.VolumeAttachment: failed to list *v1beta2.VolumeAttachment: volumeattachments.longhorn.io is forbidden: User "system:serviceaccount:longhorn-system:longhorn-service-account" cannot list resource "volumeattachments" in API group "longhorn.io" at the cluster scope
time="2023-07-07T21:55:40Z" level=warning msg="Failed to get webhook health endpoint https://localhost:9502/v1/healthz" error="Get \"https://localhost:9502/v1/healthz\": dial tcp 127.0.0.1:9502: connect: connection refused"
time="2023-07-07T21:55:42Z" level=warning msg="Failed to get webhook health endpoint https://localhost:9502/v1/healthz" error="Get \"https://localhost:9502/v1/healthz\": dial tcp 127.0.0.1:9502: connect: connection refused"
time="2023-07-07T21:55:44Z" level=warning msg="Failed to get webhook health endpoint https://localhost:9502/v1/healthz" error="Get \"https://localhost:9502/v1/healthz\": dial tcp 127.0.0.1:9502: connect: connection refused"
time="2023-07-07T21:55:46Z" level=warning msg="Failed to get webhook health endpoint https://localhost:9502/v1/healthz" error="Get \"https://localhost:9502/v1/healthz\": dial tcp 127.0.0.1:9502: connect: connection refused"
time="2023-07-07T21:55:48Z" level=warning msg="Failed to get webhook health endpoint https://localhost:9502/v1/healthz" error="Get \"https://localhost:9502/v1/healthz\": dial tcp 127.0.0.1:9502: connect: connection refused"
time="2023-07-07T21:55:50Z" level=warning msg="Failed to get webhook health endpoint https://localhost:9502/v1/healthz" error="Get \"https://localhost:9502/v1/healthz\": dial tcp 127.0.0.1:9502: connect: connection refused"
time="2023-07-07T21:55:52Z" level=warning msg="Failed to get webhook health endpoint https://localhost:9502/v1/healthz" error="Get \"https://localhost:9502/v1/healthz\": dial tcp 127.0.0.1:9502: connect: connection refused"
time="2023-07-07T21:55:54Z" level=warning msg="Failed to get webhook health endpoint https://localhost:9502/v1/healthz" error="Get \"https://localhost:9502/v1/healthz\": dial tcp 127.0.0.1:9502: connect: connection refused"
time="2023-07-07T21:55:56Z" level=warning msg="Failed to get webhook health endpoint https://localhost:9502/v1/healthz" error="Get \"https://localhost:9502/v1/healthz\": dial tcp 127.0.0.1:9502: connect: connection refused"
time="2023-07-07T21:55:58Z" level=warning msg="Failed to get webhook health endpoint https://localhost:9502/v1/healthz" error="Get \"https://localhost:9502/v1/healthz\": dial tcp 127.0.0.1:9502: connect: connection refused"
W0707 21:55:59.335977 1 reflector.go:533] github.com/longhorn/longhorn-manager/k8s/pkg/client/informers/externalversions/factory.go:117: failed to list *v1beta2.VolumeAttachment: volumeattachments.longhorn.io is forbidden: User "system:serviceaccount:longhorn-system:longhorn-service-account" cannot list resource "volumeattachments" in API group "longhorn.io" at the cluster scope
E0707 21:55:59.336017 1 reflector.go:148] github.com/longhorn/longhorn-manager/k8s/pkg/client/informers/externalversions/factory.go:117: Failed to watch *v1beta2.VolumeAttachment: failed to list *v1beta2.VolumeAttachment: volumeattachments.longhorn.io is forbidden: User "system:serviceaccount:longhorn-system:longhorn-service-account" cannot list resource "volumeattachments" in API group "longhorn.io" at the cluster scope
time="2023-07-07T21:56:00Z" level=warning msg="Failed to get webhook health endpoint https://localhost:9502/v1/healthz" error="Get \"https://localhost:9502/v1/healthz\": dial tcp 127.0.0.1:9502: connect: connection refused"
time="2023-07-07T21:56:02Z" level=warning msg="Failed to get webhook health endpoint https://localhost:9502/v1/healthz" error="Get \"https://localhost:9502/v1/healthz\": dial tcp 127.0.0.1:9502: connect: connection refused"
time="2023-07-07T21:56:04Z" level=warning msg="Failed to get webhook health endpoint https://localhost:9502/v1/healthz" error="Get \"https://localhost:9502/v1/healthz\": dial tcp 127.0.0.1:9502: connect: connection refused"
time="2023-07-07T21:56:06Z" level=warning msg="Failed to get webhook health endpoint https://localhost:9502/v1/healthz" error="Get \"https://localhost:9502/v1/healthz\": dial tcp 127.0.0.1:9502: connect: connection refused"
time="2023-07-07T21:56:08Z" level=warning msg="Failed to get webhook health endpoint https://localhost:9502/v1/healthz" error="Get \"https://localhost:9502/v1/healthz\": dial tcp 127.0.0.1:9502: connect: connection refused"
time="2023-07-07T21:56:10Z" level=warning msg="Failed to get webhook health endpoint https://localhost:9502/v1/healthz" error="Get \"https://localhost:9502/v1/healthz\": dial tcp 127.0.0.1:9502: connect: connection refused"
time="2023-07-07T21:56:12Z" level=warning msg="Failed to get webhook health endpoint https://localhost:9502/v1/healthz" error="Get \"https://localhost:9502/v1/healthz\": dial tcp 127.0.0.1:9502: connect: connection refused"
time="2023-07-07T21:56:14Z" level=warning msg="Failed to get webhook health endpoint https://localhost:9502/v1/healthz" error="Get \"https://localhost:9502/v1/healthz\": dial tcp 127.0.0.1:9502: connect: connection refused"
time="2023-07-07T21:56:16Z" level=warning msg="Failed to get webhook health endpoint https://localhost:9502/v1/healthz" error="Get \"https://localhost:9502/v1/healthz\": dial tcp 127.0.0.1:9502: connect: connection refused"
time="2023-07-07T21:56:18Z" level=warning msg="Failed to get webhook health endpoint https://localhost:9502/v1/healthz" error="Get \"https://localhost:9502/v1/healthz\": dial tcp 127.0.0.1:9502: connect: connection refused"
time="2023-07-07T21:56:20Z" level=warning msg="Failed to get webhook health endpoint https://localhost:9502/v1/healthz" error="Get \"https://localhost:9502/v1/healthz\": dial tcp 127.0.0.1:9502: connect: connection refused"
time="2023-07-07T21:56:22Z" level=fatal msg="Error starting manager: admission webhook is not ready after 1m0s sec"
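For readers triaging a similar log, here is a minimal Python sketch (not part of Longhorn) that tallies failed webhook health probes by port and collects the distinct RBAC denials. The names `summarize` and `can_i_command` and the regular expressions are assumptions written for this illustration, and the sample lines are excerpts from the log above trimmed to the relevant fields.

```python
import re
from collections import Counter

# Health-probe warning logged while the manager waits for a webhook.
HEALTHZ_RE = re.compile(
    r"Failed to get webhook health endpoint https://localhost:(\d+)/v1/healthz"
)
# Kubernetes RBAC denial text as it appears in the reflector messages.
FORBIDDEN_RE = re.compile(
    r'User "(?P<user>[^"]+)" cannot (?P<verb>\w+) resource "(?P<resource>[^"]+)" '
    r'in API group "(?P<group>[^"]*)"'
)


def summarize(lines):
    """Return (probe failures per port, distinct RBAC denials, last fatal line)."""
    probe_failures = Counter()
    denials = set()
    fatal = None
    for line in lines:
        m = HEALTHZ_RE.search(line)
        if m:
            probe_failures[int(m.group(1))] += 1
            continue
        m = FORBIDDEN_RE.search(line)
        if m:
            denials.add((m["user"], m["verb"], m["resource"], m["group"]))
            continue
        if "level=fatal" in line:
            fatal = line
    return probe_failures, denials, fatal


def can_i_command(user, verb, resource, group):
    """Build a `kubectl auth can-i` command to re-check one denial by hand."""
    target = f"{resource}.{group}" if group else resource
    return f"kubectl auth can-i {verb} {target} --as={user} --all-namespaces"


# Excerpts from the log above, trimmed for length (assumption: trimming
# does not change which patterns match).
sample = [
    'level=warning msg="Failed to get webhook health endpoint '
    'https://localhost:9501/v1/healthz"',
    'level=info msg="Webhook conversion is ready"',
    'level=warning msg="Failed to get webhook health endpoint '
    'https://localhost:9502/v1/healthz"',
    'W0707 21:55:22.679839 1 reflector.go:533] failed to list '
    '*v1beta2.VolumeAttachment: volumeattachments.longhorn.io is forbidden: '
    'User "system:serviceaccount:longhorn-system:longhorn-service-account" '
    'cannot list resource "volumeattachments" in API group "longhorn.io" '
    'at the cluster scope',
    'level=warning msg="Failed to get webhook health endpoint '
    'https://localhost:9502/v1/healthz"',
    'time="2023-07-07T21:56:22Z" level=fatal msg="Error starting manager: '
    'admission webhook is not ready after 1m0s sec"',
]

if __name__ == "__main__":
    failures, denials, fatal = summarize(sample)
    for port, count in sorted(failures.items()):
        print(f"port {port}: {count} failed health probe(s)")
    for denial in sorted(denials):
        print(can_i_command(*denial))
    print(fatal)
```

Applied to the full log above, this kind of tally shows a single failed probe on port 9501 before the conversion webhook reports ready, repeated failures on port 9502 until the fatal timeout, and one distinct denial: `list` on `volumeattachments` in `longhorn.io`. The thread does not confirm whether that denial is what keeps the admission webhook from starting, so the generated `kubectl auth can-i` command is only a starting point for checking the service account's permissions.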
Workaround
About this issue
- State: closed
- Created a year ago
- Comments: 72 (27 by maintainers)
@timothystewart6 Judging from your case, I feel the upgrade path is not resilient enough. I created a ticket, https://github.com/longhorn/longhorn/issues/6294, to check whether we can improve it further. Many thanks for raising the issue.
I needed to disable longhornRecoveryBackend and longhornConversionWebhook in the Helm chart values! Longhorn is now working without any problems.
@derekbit Thank you for checking. Nothing was different on my end for this upgrade compared with all the other upgrades over the years; this is the same cluster. I understand that this is probably tough to diagnose, and I am fine with chalking this up as a one-off. I will most likely attempt to upgrade again after the next patch is released. I appreciate all the help from everyone in this thread!
I ran into the first two points after upgrading to 1.5.1. Regarding the third point, some pods did restart multiple times, but they started successfully for me after waiting for a while. Following the multipath guide solved the first two points in my case.
A bit more detail about multipath: I did not run into the multipath issue immediately after the upgrade. It occurred later, when I was upgrading a StatefulSet that needed a volume unmount and remount. It only happened on one of my nodes, though. After I addressed multipath last week, everything has been normal so far.
Also, you guys should have published a v1.5.1 with this small fix, or even removed this version from GA as a safe upgrade altogether.
Thanks! I will let you know if / when I upgrade to 1.5.x and if I run into issues. I will most likely stay on 1.4.x until 1.5.x is marked as stable.
Our Flux-based upgrade also failed, but we were fortunate to be able to back up and redeploy v1.4.2…
@esmaeilzadehayub It’s a known issue; please refer to the workaround in https://github.com/longhorn/longhorn/issues/6252.
14 hours later, I have restored all of my volumes and reverted to 1.4.2.
1/5 would not recommend 😅
Thanks @timothystewart6! I am trying to reproduce it.
@PhanLe1010 Hi. Yes, it’s the only cluster I have Longhorn installed in. It failed during the upgrade with the errors shown in the first logs.