longhorn: [BUG] 1.5.0 Upgrade: Longhorn conversion webhook server fails

Describe the bug (🐛 if you encounter this issue)


To Reproduce

Steps to reproduce the behavior: Upgrade Longhorn from 1.4.2 to 1.5.0 via Helm.

Expected behavior

Successful upgrade to 1.5.0

Log or Support bundle

The new longhorn-manager (longhornio/longhorn-manager:v1.5.0) pods are in a crash loop.

The old longhorn-admission-webhook (longhornio/longhorn-manager:v1.4.2) pods are stuck with this status:

Containers with incomplete status: [wait-longhorn-conversion-webhook]

Logs from one of the new manager pods

time="2023-07-07T21:55:20Z" level=info msg="Starting longhorn conversion webhook server"
W0707 21:55:20.669267       1 client_config.go:618] Neither --kubeconfig nor --master was specified.  Using the inClusterConfig.  This might not work.
time="2023-07-07T21:55:20Z" level=info msg="Waiting for conversion webhook to become ready"
time="2023-07-07T21:55:20Z" level=warning msg="Failed to get webhook health endpoint https://localhost:9501/v1/healthz" error="Get \"https://localhost:9501/v1/healthz\": dial tcp 127.0.0.1:9501: connect: connection refused"
time="2023-07-07T21:55:20Z" level=info msg="Active TLS secret longhorn-webhook-tls (ver=405527433) (count 2): map[listener.cattle.io/cn-longhorn-admission-webhook.longhor-59584d:longhorn-admission-webhook.longhorn-system.svc listener.cattle.io/cn-longhorn-conversion-webhook.longho-6a0089:longhorn-conversion-webhook.longhorn-system.svc listener.cattle.io/fingerprint:SHA1=D26B36C25CA9B87164799F17931821667AFF374E]"
time="2023-07-07T21:55:20Z" level=info msg="Listening on :9501"
time="2023-07-07T21:55:21Z" level=info msg="Starting apiregistration.k8s.io/v1, Kind=APIService controller"
time="2023-07-07T21:55:21Z" level=info msg="Starting apiextensions.k8s.io/v1, Kind=CustomResourceDefinition controller"
time="2023-07-07T21:55:21Z" level=info msg="Starting /v1, Kind=Secret controller"
time="2023-07-07T21:55:21Z" level=info msg="Building conversion rules..."
time="2023-07-07T21:55:21Z" level=info msg="Updating TLS secret for longhorn-webhook-tls (count: 2): map[listener.cattle.io/cn-longhorn-admission-webhook.longhor-59584d:longhorn-admission-webhook.longhorn-system.svc listener.cattle.io/cn-longhorn-conversion-webhook.longho-6a0089:longhorn-conversion-webhook.longhorn-system.svc listener.cattle.io/fingerprint:SHA1=D26B36C25CA9B87164799F17931821667AFF374E]"
time="2023-07-07T21:55:22Z" level=info msg="Webhook conversion is ready"
time="2023-07-07T21:55:22Z" level=warning msg="Started longhorn conversion webhook server"
time="2023-07-07T21:55:22Z" level=info msg="Starting longhorn admission webhook server"
W0707 21:55:22.672755       1 client_config.go:618] Neither --kubeconfig nor --master was specified.  Using the inClusterConfig.  This might not work.
time="2023-07-07T21:55:22Z" level=info msg="Waiting for admission webhook to become ready"
time="2023-07-07T21:55:22Z" level=warning msg="Failed to get webhook health endpoint https://localhost:9502/v1/healthz" error="Get \"https://localhost:9502/v1/healthz\": dial tcp 127.0.0.1:9502: connect: connection refused"
I0707 21:55:22.674234       1 shared_informer.go:311] Waiting for caches to sync for longhorn datastore
W0707 21:55:22.679839       1 reflector.go:533] github.com/longhorn/longhorn-manager/k8s/pkg/client/informers/externalversions/factory.go:117: failed to list *v1beta2.VolumeAttachment: volumeattachments.longhorn.io is forbidden: User "system:serviceaccount:longhorn-system:longhorn-service-account" cannot list resource "volumeattachments" in API group "longhorn.io" at the cluster scope
E0707 21:55:22.679874       1 reflector.go:148] github.com/longhorn/longhorn-manager/k8s/pkg/client/informers/externalversions/factory.go:117: Failed to watch *v1beta2.VolumeAttachment: failed to list *v1beta2.VolumeAttachment: volumeattachments.longhorn.io is forbidden: User "system:serviceaccount:longhorn-system:longhorn-service-account" cannot list resource "volumeattachments" in API group "longhorn.io" at the cluster scope
I0707 21:55:23.874892       1 request.go:696] Waited for 1.199308873s due to client-side throttling, not priority and fairness, request: GET:https://10.43.0.1:443/apis/longhorn.io/v1beta2/backingimagemanagers?limit=500&resourceVersion=0
time="2023-07-07T21:55:24Z" level=warning msg="Failed to get webhook health endpoint https://localhost:9502/v1/healthz" error="Get \"https://localhost:9502/v1/healthz\": dial tcp 127.0.0.1:9502: connect: connection refused"
W0707 21:55:25.076523       1 reflector.go:533] github.com/longhorn/longhorn-manager/k8s/pkg/client/informers/externalversions/factory.go:117: failed to list *v1beta2.VolumeAttachment: volumeattachments.longhorn.io is forbidden: User "system:serviceaccount:longhorn-system:longhorn-service-account" cannot list resource "volumeattachments" in API group "longhorn.io" at the cluster scope
E0707 21:55:25.076574       1 reflector.go:148] github.com/longhorn/longhorn-manager/k8s/pkg/client/informers/externalversions/factory.go:117: Failed to watch *v1beta2.VolumeAttachment: failed to list *v1beta2.VolumeAttachment: volumeattachments.longhorn.io is forbidden: User "system:serviceaccount:longhorn-system:longhorn-service-account" cannot list resource "volumeattachments" in API group "longhorn.io" at the cluster scope
time="2023-07-07T21:55:26Z" level=warning msg="Failed to get webhook health endpoint https://localhost:9502/v1/healthz" error="Get \"https://localhost:9502/v1/healthz\": dial tcp 127.0.0.1:9502: connect: connection refused"
W0707 21:55:26.987506       1 reflector.go:533] github.com/longhorn/longhorn-manager/k8s/pkg/client/informers/externalversions/factory.go:117: failed to list *v1beta2.VolumeAttachment: volumeattachments.longhorn.io is forbidden: User "system:serviceaccount:longhorn-system:longhorn-service-account" cannot list resource "volumeattachments" in API group "longhorn.io" at the cluster scope
E0707 21:55:26.987561       1 reflector.go:148] github.com/longhorn/longhorn-manager/k8s/pkg/client/informers/externalversions/factory.go:117: Failed to watch *v1beta2.VolumeAttachment: failed to list *v1beta2.VolumeAttachment: volumeattachments.longhorn.io is forbidden: User "system:serviceaccount:longhorn-system:longhorn-service-account" cannot list resource "volumeattachments" in API group "longhorn.io" at the cluster scope
time="2023-07-07T21:55:28Z" level=warning msg="Failed to get webhook health endpoint https://localhost:9502/v1/healthz" error="Get \"https://localhost:9502/v1/healthz\": dial tcp 127.0.0.1:9502: connect: connection refused"
time="2023-07-07T21:55:30Z" level=warning msg="Failed to get webhook health endpoint https://localhost:9502/v1/healthz" error="Get \"https://localhost:9502/v1/healthz\": dial tcp 127.0.0.1:9502: connect: connection refused"
time="2023-07-07T21:55:32Z" level=warning msg="Failed to get webhook health endpoint https://localhost:9502/v1/healthz" error="Get \"https://localhost:9502/v1/healthz\": dial tcp 127.0.0.1:9502: connect: connection refused"
W0707 21:55:33.027197       1 reflector.go:533] github.com/longhorn/longhorn-manager/k8s/pkg/client/informers/externalversions/factory.go:117: failed to list *v1beta2.VolumeAttachment: volumeattachments.longhorn.io is forbidden: User "system:serviceaccount:longhorn-system:longhorn-service-account" cannot list resource "volumeattachments" in API group "longhorn.io" at the cluster scope
E0707 21:55:33.027234       1 reflector.go:148] github.com/longhorn/longhorn-manager/k8s/pkg/client/informers/externalversions/factory.go:117: Failed to watch *v1beta2.VolumeAttachment: failed to list *v1beta2.VolumeAttachment: volumeattachments.longhorn.io is forbidden: User "system:serviceaccount:longhorn-system:longhorn-service-account" cannot list resource "volumeattachments" in API group "longhorn.io" at the cluster scope
time="2023-07-07T21:55:34Z" level=warning msg="Failed to get webhook health endpoint https://localhost:9502/v1/healthz" error="Get \"https://localhost:9502/v1/healthz\": dial tcp 127.0.0.1:9502: connect: connection refused"
time="2023-07-07T21:55:36Z" level=warning msg="Failed to get webhook health endpoint https://localhost:9502/v1/healthz" error="Get \"https://localhost:9502/v1/healthz\": dial tcp 127.0.0.1:9502: connect: connection refused"
time="2023-07-07T21:55:38Z" level=warning msg="Failed to get webhook health endpoint https://localhost:9502/v1/healthz" error="Get \"https://localhost:9502/v1/healthz\": dial tcp 127.0.0.1:9502: connect: connection refused"
W0707 21:55:40.367961       1 reflector.go:533] github.com/longhorn/longhorn-manager/k8s/pkg/client/informers/externalversions/factory.go:117: failed to list *v1beta2.VolumeAttachment: volumeattachments.longhorn.io is forbidden: User "system:serviceaccount:longhorn-system:longhorn-service-account" cannot list resource "volumeattachments" in API group "longhorn.io" at the cluster scope
E0707 21:55:40.367994       1 reflector.go:148] github.com/longhorn/longhorn-manager/k8s/pkg/client/informers/externalversions/factory.go:117: Failed to watch *v1beta2.VolumeAttachment: failed to list *v1beta2.VolumeAttachment: volumeattachments.longhorn.io is forbidden: User "system:serviceaccount:longhorn-system:longhorn-service-account" cannot list resource "volumeattachments" in API group "longhorn.io" at the cluster scope
time="2023-07-07T21:55:40Z" level=warning msg="Failed to get webhook health endpoint https://localhost:9502/v1/healthz" error="Get \"https://localhost:9502/v1/healthz\": dial tcp 127.0.0.1:9502: connect: connection refused"
time="2023-07-07T21:55:42Z" level=warning msg="Failed to get webhook health endpoint https://localhost:9502/v1/healthz" error="Get \"https://localhost:9502/v1/healthz\": dial tcp 127.0.0.1:9502: connect: connection refused"
time="2023-07-07T21:55:44Z" level=warning msg="Failed to get webhook health endpoint https://localhost:9502/v1/healthz" error="Get \"https://localhost:9502/v1/healthz\": dial tcp 127.0.0.1:9502: connect: connection refused"
time="2023-07-07T21:55:46Z" level=warning msg="Failed to get webhook health endpoint https://localhost:9502/v1/healthz" error="Get \"https://localhost:9502/v1/healthz\": dial tcp 127.0.0.1:9502: connect: connection refused"
time="2023-07-07T21:55:48Z" level=warning msg="Failed to get webhook health endpoint https://localhost:9502/v1/healthz" error="Get \"https://localhost:9502/v1/healthz\": dial tcp 127.0.0.1:9502: connect: connection refused"
time="2023-07-07T21:55:50Z" level=warning msg="Failed to get webhook health endpoint https://localhost:9502/v1/healthz" error="Get \"https://localhost:9502/v1/healthz\": dial tcp 127.0.0.1:9502: connect: connection refused"
time="2023-07-07T21:55:52Z" level=warning msg="Failed to get webhook health endpoint https://localhost:9502/v1/healthz" error="Get \"https://localhost:9502/v1/healthz\": dial tcp 127.0.0.1:9502: connect: connection refused"
time="2023-07-07T21:55:54Z" level=warning msg="Failed to get webhook health endpoint https://localhost:9502/v1/healthz" error="Get \"https://localhost:9502/v1/healthz\": dial tcp 127.0.0.1:9502: connect: connection refused"
time="2023-07-07T21:55:56Z" level=warning msg="Failed to get webhook health endpoint https://localhost:9502/v1/healthz" error="Get \"https://localhost:9502/v1/healthz\": dial tcp 127.0.0.1:9502: connect: connection refused"
time="2023-07-07T21:55:58Z" level=warning msg="Failed to get webhook health endpoint https://localhost:9502/v1/healthz" error="Get \"https://localhost:9502/v1/healthz\": dial tcp 127.0.0.1:9502: connect: connection refused"
W0707 21:55:59.335977       1 reflector.go:533] github.com/longhorn/longhorn-manager/k8s/pkg/client/informers/externalversions/factory.go:117: failed to list *v1beta2.VolumeAttachment: volumeattachments.longhorn.io is forbidden: User "system:serviceaccount:longhorn-system:longhorn-service-account" cannot list resource "volumeattachments" in API group "longhorn.io" at the cluster scope
E0707 21:55:59.336017       1 reflector.go:148] github.com/longhorn/longhorn-manager/k8s/pkg/client/informers/externalversions/factory.go:117: Failed to watch *v1beta2.VolumeAttachment: failed to list *v1beta2.VolumeAttachment: volumeattachments.longhorn.io is forbidden: User "system:serviceaccount:longhorn-system:longhorn-service-account" cannot list resource "volumeattachments" in API group "longhorn.io" at the cluster scope
time="2023-07-07T21:56:00Z" level=warning msg="Failed to get webhook health endpoint https://localhost:9502/v1/healthz" error="Get \"https://localhost:9502/v1/healthz\": dial tcp 127.0.0.1:9502: connect: connection refused"
time="2023-07-07T21:56:02Z" level=warning msg="Failed to get webhook health endpoint https://localhost:9502/v1/healthz" error="Get \"https://localhost:9502/v1/healthz\": dial tcp 127.0.0.1:9502: connect: connection refused"
time="2023-07-07T21:56:04Z" level=warning msg="Failed to get webhook health endpoint https://localhost:9502/v1/healthz" error="Get \"https://localhost:9502/v1/healthz\": dial tcp 127.0.0.1:9502: connect: connection refused"
time="2023-07-07T21:56:06Z" level=warning msg="Failed to get webhook health endpoint https://localhost:9502/v1/healthz" error="Get \"https://localhost:9502/v1/healthz\": dial tcp 127.0.0.1:9502: connect: connection refused"
time="2023-07-07T21:56:08Z" level=warning msg="Failed to get webhook health endpoint https://localhost:9502/v1/healthz" error="Get \"https://localhost:9502/v1/healthz\": dial tcp 127.0.0.1:9502: connect: connection refused"
time="2023-07-07T21:56:10Z" level=warning msg="Failed to get webhook health endpoint https://localhost:9502/v1/healthz" error="Get \"https://localhost:9502/v1/healthz\": dial tcp 127.0.0.1:9502: connect: connection refused"
time="2023-07-07T21:56:12Z" level=warning msg="Failed to get webhook health endpoint https://localhost:9502/v1/healthz" error="Get \"https://localhost:9502/v1/healthz\": dial tcp 127.0.0.1:9502: connect: connection refused"
time="2023-07-07T21:56:14Z" level=warning msg="Failed to get webhook health endpoint https://localhost:9502/v1/healthz" error="Get \"https://localhost:9502/v1/healthz\": dial tcp 127.0.0.1:9502: connect: connection refused"
time="2023-07-07T21:56:16Z" level=warning msg="Failed to get webhook health endpoint https://localhost:9502/v1/healthz" error="Get \"https://localhost:9502/v1/healthz\": dial tcp 127.0.0.1:9502: connect: connection refused"
time="2023-07-07T21:56:18Z" level=warning msg="Failed to get webhook health endpoint https://localhost:9502/v1/healthz" error="Get \"https://localhost:9502/v1/healthz\": dial tcp 127.0.0.1:9502: connect: connection refused"
time="2023-07-07T21:56:20Z" level=warning msg="Failed to get webhook health endpoint https://localhost:9502/v1/healthz" error="Get \"https://localhost:9502/v1/healthz\": dial tcp 127.0.0.1:9502: connect: connection refused"
time="2023-07-07T21:56:22Z" level=fatal msg="Error starting manager: admission webhook is not ready after 1m0s sec"
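The log above repeats two patterns until the fatal exit: RBAC "forbidden" errors and refused health checks on port 9502. As a rough sketch for condensing such a log, the function below counts both and prints the fatal line. `summarize_manager_log` is a hypothetical helper name, not part of Longhorn, and it assumes the log format shown in this report.

```shell
# Summarize a longhorn-manager log read from stdin.
# summarize_manager_log is a hypothetical helper, not a Longhorn tool.
# Example (assumes kubectl access; <manager-pod> is a placeholder):
#   kubectl -n longhorn-system logs <manager-pod> | summarize_manager_log
summarize_manager_log() {
    awk '
        # RBAC denials: pull the resource name out of
        #   cannot list resource "NAME"
        /is forbidden/ {
            if (match($0, /cannot list resource "[^"]+"/)) {
                # 22 = length of the prefix up to and including the opening quote
                res = substr($0, RSTART + 22, RLENGTH - 23)
                forbidden[res]++
            }
        }
        # Health probes that could not reach the webhook endpoint
        /Failed to get webhook health endpoint/ { refused++ }
        # Remember the last fatal line, if any
        /level=fatal/ { fatal = $0 }
        END {
            for (r in forbidden)
                printf "forbidden list on %s: %d\n", r, forbidden[r]
            printf "health check failures: %d\n", refused
            if (fatal != "") print "fatal: " fatal
        }
    '
}
```

Counting the "forbidden" lines separately from the health check failures makes it easier to see that the permission errors start before the admission webhook times out.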

Workaround

https://github.com/longhorn/longhorn/issues/6252
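Before applying the linked workaround, it may help to confirm the permission error from the log directly against the API server. The sketch below wraps `kubectl auth can-i` with the service account, resource, and API group copied from the "forbidden" lines. The function name is hypothetical, and the check assumes your own user is allowed to impersonate service accounts.

```shell
# Ask the API server whether the Longhorn service account may list the
# resource named in the "forbidden" log lines, across all namespaces.
# can_list_volumeattachments is a hypothetical helper name.
# Assumption: the calling user has impersonation rights (--as).
can_list_volumeattachments() {
    kubectl auth can-i list volumeattachments.longhorn.io \
        --all-namespaces \
        --as=system:serviceaccount:longhorn-system:longhorn-service-account
}
```

Calling `can_list_volumeattachments` prints `yes` or `no`. A `no` would be consistent with the reflector errors in the manager log.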

About this issue

  • State: closed
  • Created a year ago
  • Comments: 72 (27 by maintainers)

Most upvoted comments

@timothystewart6 Judging from your case, I feel the upgrade path is not resilient enough. I created a ticket, https://github.com/longhorn/longhorn/issues/6294, to check whether we can improve it further. Many thanks for raising the issue.

@esmaeilzadehayub It’s a known issue and please refer to the workaround in #6252.

I needed to disable longhornRecoveryBackend and longhornConversionWebhook in the Helm chart values. Longhorn is now working without any problems.

@derekbit Thank you for checking. There isn’t anything that was different on my end for this upgrade vs all other upgrades over the years, this is the same cluster. I understand that this is probably tough to diagnose and I am fine with chalking this up as a one off. I will most likely attempt to upgrade again after the next patch is released. I appreciate all the help from everyone in this thread!

  • longhorn would think that a node was still attached to a volume, even though it clearly wasn’t
  • sometimes volumes would not attach to a pod when a pod was created (this was probably due to the previous point)

I ran into the first two points after upgrading to 1.5.1. Regarding the third point, some pods did restart multiple times, but they started successfully for me after waiting for a while. Following the multipath guide solved the first two points in my case.

A few more details about multipath: I did not run into the multipath issue immediately after the upgrade. It occurred later, when I was upgrading a StatefulSet that needed a volume unmount and remount. It only happened on one of my nodes, though. After addressing multipath last week, everything has been normal so far.

I couldn’t reproduce the issue because the Longhorn Helm chart has already been updated to fix it: longhorn/charts@8771f25. Could you fetch the updated repo and provide us with the rendered Helm template for verification?

helm repo update
helm template longhorn longhorn/longhorn --version 1.5.0
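One possible way to inspect the rendered template, assuming the point of the verification is the `volumeattachments` permission that the manager log reports as forbidden, is to scan the output for a ClusterRole rule that names both the `longhorn.io` API group and that resource. The function below is a hypothetical sketch. It assumes each rule begins with `- apiGroups:`, which is a common layout but not guaranteed.

```shell
# Read rendered manifests on stdin and print the name of every ClusterRole
# that has a rule naming both the longhorn.io API group and the
# volumeattachments resource. Exit status is 1 if none is found.
# rules_granting_lh_volumeattachments is a hypothetical helper name.
# Assumption: each RBAC rule starts with a "- apiGroups:" line.
rules_granting_lh_volumeattachments() {
    awk '
        # Close the current rule and record a hit if it had both markers.
        function end_rule() {
            if (has_group && has_res) hit = 1
            has_group = 0; has_res = 0
        }
        # Close the current YAML document and report a matching ClusterRole.
        function end_doc() {
            end_rule()
            if (is_role && hit) { print name; found = 1 }
            is_role = 0; hit = 0; name = ""; in_meta = 0
        }
        /^---/                              { end_doc(); next }
        /^kind:[ ]*ClusterRole[ ]*$/        { is_role = 1 }
        /^metadata:/                        { in_meta = 1; next }
        /^[^ #]/                            { in_meta = 0 }
        in_meta && name == "" && /^  name:/ { name = $2 }
        /^[ ]*- apiGroups:/                 { end_rule() }
        /longhorn\.io/                      { has_group = 1 }
        /volumeattachments/                 { has_res = 1 }
        END { end_doc(); exit(found ? 0 : 1) }
    '
}
```

For example, `helm template longhorn longhorn/longhorn --version 1.5.0 | rules_granting_lh_volumeattachments` would print the matching ClusterRole names, if any. Matching on the API group matters because `volumeattachments` is also a resource name in `storage.k8s.io`, so a plain `grep` for the word could give a false positive.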

@PhanLe1010 I’m having the same issue and, while trying to fix it, I’ve noticed that this fix has not been released in the chart repo.

The asset longhorn-1.5.0.tgz needs to be regenerated to include this fix.

For the moment, I’ve just set the replicas count to 0.

Also, you should have published a v1.5.1 with this small fix, or even removed this version from GA as a safe upgrade altogether.

Thanks! I will let you know if / when I upgrade to 1.5.x and if I run into issues. I will most likely stay on 1.4.x until 1.5.x is marked as stable.

Our Flux-based upgrade also failed, but we were fortunate to be able to back up and redeploy v1.4.2…

14 hours later and I have restored all of my volumes and reverted back to 1.4.2

1/5 would not recommend 😅

Thanks @timothystewart6! I am trying to reproduce

@PhanLe1010 Hi. Yes, it’s the only cluster I have longhorn installed in. It failed during the upgrade due to the first logs.