longhorn: [BUG] Upgrade to 1.5.0 failed: validator.longhorn.io denied the request if having orphan resources

Describe the bug (🐛 if you encounter this issue)

Upgrade to v1.5.0 failed with the error:

Error starting manager: upgrade resources failed: admission webhook \"validator.longhorn.io\" denied the request: orphan orphan-024ed10f75415525327901b184c46b279fa24fdab23c89ea80e9f6ea7be50c83 spec fields are immutable

To Reproduce

Steps to reproduce the behavior:

Upgrade Longhorn from v1.4.2 to v1.5.0 via Helm

Expected behavior

Successful upgrade to v1.5.0

Log or Support bundle

Logs from longhorn-manager:

time="2023-07-07T07:04:13Z" level=info msg="Starting longhorn conversion webhook server"
W0707 07:04:13.804064       1 client_config.go:618] Neither --kubeconfig nor --master was specified.  Using the inClusterConfig.  This might not work.
time="2023-07-07T07:04:13Z" level=info msg="Waiting for conversion webhook to become ready"
time="2023-07-07T07:04:13Z" level=warning msg="Failed to get webhook health endpoint https://localhost:9501/v1/healthz" error="Get \"https://localhost:9501/v1/healthz\": dial tcp 127.0.0.1:9501: connect: connection refused"
time="2023-07-07T07:04:13Z" level=info msg="Active TLS secret longhorn-webhook-tls (ver=2687) (count 2): map[listener.cattle.io/cn-longhorn-admission-webhook.longhorn.svc:longhorn-admission-webhook.longhorn.svc listener.cattle.io/cn-longhorn-conversion-webhook.longhorn.svc:longhorn-conversion-webhook.longhorn.svc listener.cattle.io/fingerprint:SHA1=5722A8DEA1DC17BCBDFA7D07C25EB1C0DBB6C4F3]"
time="2023-07-07T07:04:13Z" level=info msg="Listening on :9501"
time="2023-07-07T07:04:15Z" level=info msg="Starting apiextensions.k8s.io/v1, Kind=CustomResourceDefinition controller"
time="2023-07-07T07:04:15Z" level=info msg="Starting /v1, Kind=Secret controller"
time="2023-07-07T07:04:15Z" level=info msg="Starting apiregistration.k8s.io/v1, Kind=APIService controller"
time="2023-07-07T07:04:15Z" level=info msg="Building conversion rules..."
time="2023-07-07T07:04:15Z" level=info msg="Updating TLS secret for longhorn-webhook-tls (count: 2): map[listener.cattle.io/cn-longhorn-admission-webhook.longhorn.svc:longhorn-admission-webhook.longhorn.svc listener.cattle.io/cn-longhorn-conversion-webhook.longhorn.svc:longhorn-conversion-webhook.longhorn.svc listener.cattle.io/fingerprint:SHA1=5722A8DEA1DC17BCBDFA7D07C25EB1C0DBB6C4F3]"
time="2023-07-07T07:04:15Z" level=info msg="Webhook conversion is ready"
time="2023-07-07T07:04:15Z" level=warning msg="Started longhorn conversion webhook server"
time="2023-07-07T07:04:15Z" level=info msg="Starting longhorn admission webhook server"
W0707 07:04:15.815424       1 client_config.go:618] Neither --kubeconfig nor --master was specified.  Using the inClusterConfig.  This might not work.
time="2023-07-07T07:04:15Z" level=info msg="Waiting for admission webhook to become ready"
I0707 07:04:15.817401       1 shared_informer.go:311] Waiting for caches to sync for longhorn datastore
time="2023-07-07T07:04:15Z" level=warning msg="Failed to get webhook health endpoint https://localhost:9502/v1/healthz" error="Get \"https://localhost:9502/v1/healthz\": dial tcp 127.0.0.1:9502: connect: connection refused"
I0707 07:04:17.018086       1 request.go:696] Waited for 1.198095876s due to client-side throttling, not priority and fairness, request: GET:https://10.3.0.1:443/apis/longhorn.io/v1beta2/sharemanagers?limit=500&resourceVersion=0
time="2023-07-07T07:04:17Z" level=warning msg="Failed to get webhook health endpoint https://localhost:9502/v1/healthz" error="Get \"https://localhost:9502/v1/healthz\": dial tcp 127.0.0.1:9502: connect: connection refused"
I0707 07:04:18.117793       1 shared_informer.go:318] Caches are synced for longhorn datastore
time="2023-07-07T07:04:18Z" level=info msg="Add validaton handler for nodes.longhorn.io (Node)"
time="2023-07-07T07:04:18Z" level=info msg="Add validaton handler for settings.longhorn.io (Setting)"
time="2023-07-07T07:04:18Z" level=info msg="Add validaton handler for recurringjobs.longhorn.io (RecurringJob)"
time="2023-07-07T07:04:18Z" level=info msg="Add validaton handler for backingimages.longhorn.io (BackingImage)"
time="2023-07-07T07:04:18Z" level=info msg="Add validaton handler for volumes.longhorn.io (Volume)"
time="2023-07-07T07:04:18Z" level=info msg="Add validaton handler for orphans.longhorn.io (Orphan)"
time="2023-07-07T07:04:18Z" level=info msg="Add validaton handler for snapshots.longhorn.io (Snapshot)"
time="2023-07-07T07:04:18Z" level=info msg="Add validaton handler for supportbundles.longhorn.io (SupportBundle)"
time="2023-07-07T07:04:18Z" level=info msg="Add validaton handler for systembackups.longhorn.io (SystemBackup)"
time="2023-07-07T07:04:18Z" level=info msg="Add validaton handler for systemrestores.longhorn.io (SystemRestore)"
time="2023-07-07T07:04:18Z" level=info msg="Add validaton handler for volumeattachments.longhorn.io (VolumeAttachment)"
time="2023-07-07T07:04:18Z" level=info msg="Add validaton handler for engines.longhorn.io (Engine)"
time="2023-07-07T07:04:18Z" level=info msg="Add validaton handler for replicas.longhorn.io (Replica)"
time="2023-07-07T07:04:18Z" level=info msg="Add mutation handler for backups.longhorn.io (Backup)"
time="2023-07-07T07:04:18Z" level=info msg="Add mutation handler for backingimages.longhorn.io (BackingImage)"
time="2023-07-07T07:04:18Z" level=info msg="Add mutation handler for backingImageManagers.longhorn.io (BackingImageManager)"
time="2023-07-07T07:04:18Z" level=info msg="Add mutation handler for backingimagedatasources.longhorn.io (BackingImageDataSource)"
time="2023-07-07T07:04:18Z" level=info msg="Add mutation handler for nodes.longhorn.io (Node)"
time="2023-07-07T07:04:18Z" level=info msg="Add mutation handler for volumes.longhorn.io (Volume)"
time="2023-07-07T07:04:18Z" level=info msg="Add mutation handler for engines.longhorn.io (Engine)"
time="2023-07-07T07:04:18Z" level=info msg="Add mutation handler for recurringjobs.longhorn.io (RecurringJob)"
time="2023-07-07T07:04:18Z" level=info msg="Add mutation handler for engineimages.longhorn.io (EngineImage)"
time="2023-07-07T07:04:18Z" level=info msg="Add mutation handler for orphans.longhorn.io (Orphan)"
time="2023-07-07T07:04:18Z" level=info msg="Add mutation handler for sharemanagers.longhorn.io (ShareManager)"
time="2023-07-07T07:04:18Z" level=info msg="Add mutation handler for backupvolumes.longhorn.io (BackupVolume)"
time="2023-07-07T07:04:18Z" level=info msg="Add mutation handler for snapshots.longhorn.io (Snapshot)"
time="2023-07-07T07:04:18Z" level=info msg="Add mutation handler for replicas.longhorn.io (Replica)"
time="2023-07-07T07:04:18Z" level=info msg="Add mutation handler for supportbundles.longhorn.io (SupportBundle)"
time="2023-07-07T07:04:18Z" level=info msg="Add mutation handler for systembackups.longhorn.io (SystemBackup)"
time="2023-07-07T07:04:18Z" level=info msg="Add mutation handler for volumeattachments.longhorn.io (VolumeAttachment)"
time="2023-07-07T07:04:18Z" level=info msg="Active TLS secret longhorn-webhook-tls (ver=2687) (count 2): map[listener.cattle.io/cn-longhorn-admission-webhook.longhorn.svc:longhorn-admission-webhook.longhorn.svc listener.cattle.io/cn-longhorn-conversion-webhook.longhorn.svc:longhorn-conversion-webhook.longhorn.svc listener.cattle.io/fingerprint:SHA1=5722A8DEA1DC17BCBDFA7D07C25EB1C0DBB6C4F3]"
time="2023-07-07T07:04:18Z" level=info msg="Listening on :9502"
time="2023-07-07T07:04:19Z" level=info msg="Starting apiregistration.k8s.io/v1, Kind=APIService controller"
time="2023-07-07T07:04:19Z" level=info msg="Starting /v1, Kind=Secret controller"
time="2023-07-07T07:04:19Z" level=info msg="Starting apiextensions.k8s.io/v1, Kind=CustomResourceDefinition controller"
time="2023-07-07T07:04:19Z" level=info msg="Building validation rules..."
time="2023-07-07T07:04:19Z" level=debug msg="Add rule for {Name:nodes Scope:Namespaced APIGroup:longhorn.io APIVersion:v1beta2 ObjectType:0xc002a21380 OperationTypes:[CREATE UPDATE]}"
time="2023-07-07T07:04:19Z" level=debug msg="Add rule for {Name:settings Scope:Namespaced APIGroup:longhorn.io APIVersion:v1beta2 ObjectType:0xc00263ad80 OperationTypes:[CREATE UPDATE]}"
time="2023-07-07T07:04:19Z" level=debug msg="Add rule for {Name:recurringjobs Scope:Namespaced APIGroup:longhorn.io APIVersion:v1beta2 ObjectType:0xc00267e000 OperationTypes:[CREATE UPDATE]}"
time="2023-07-07T07:04:19Z" level=debug msg="Add rule for {Name:backingimages Scope:Namespaced APIGroup:longhorn.io APIVersion:v1beta2 ObjectType:0xc00267e480 OperationTypes:[CREATE DELETE]}"
time="2023-07-07T07:04:19Z" level=debug msg="Add rule for {Name:volumes Scope:Namespaced APIGroup:longhorn.io APIVersion:v1beta2 ObjectType:0xc000214900 OperationTypes:[CREATE UPDATE]}"
time="2023-07-07T07:04:19Z" level=debug msg="Add rule for {Name:orphans Scope:Namespaced APIGroup:longhorn.io APIVersion:v1beta2 ObjectType:0xc002413a20 OperationTypes:[CREATE UPDATE]}"
time="2023-07-07T07:04:19Z" level=debug msg="Add rule for {Name:snapshots Scope:Namespaced APIGroup:longhorn.io APIVersion:v1beta2 ObjectType:0xc002eecfc0 OperationTypes:[CREATE UPDATE]}"
time="2023-07-07T07:04:19Z" level=debug msg="Add rule for {Name:supportbundles Scope:Namespaced APIGroup:longhorn.io APIVersion:v1beta2 ObjectType:0xc000ef8700 OperationTypes:[CREATE]}"
time="2023-07-07T07:04:19Z" level=debug msg="Add rule for {Name:systembackups Scope:Namespaced APIGroup:longhorn.io APIVersion:v1beta2 ObjectType:0xc002022380 OperationTypes:[CREATE]}"
time="2023-07-07T07:04:19Z" level=debug msg="Add rule for {Name:systemrestores Scope:Namespaced APIGroup:longhorn.io APIVersion:v1beta2 ObjectType:0xc00240a160 OperationTypes:[CREATE]}"
time="2023-07-07T07:04:19Z" level=debug msg="Add rule for {Name:volumeattachments Scope:Namespaced APIGroup:longhorn.io APIVersion:v1beta2 ObjectType:0xc0025c9cc0 OperationTypes:[CREATE UPDATE]}"
time="2023-07-07T07:04:19Z" level=debug msg="Add rule for {Name:engines Scope:Namespaced APIGroup:longhorn.io APIVersion:v1beta2 ObjectType:0xc0026b1800 OperationTypes:[CREATE UPDATE]}"
time="2023-07-07T07:04:19Z" level=debug msg="Add rule for {Name:replicas Scope:Namespaced APIGroup:longhorn.io APIVersion:v1beta2 ObjectType:0xc000c4edc0 OperationTypes:[CREATE UPDATE]}"
time="2023-07-07T07:04:19Z" level=info msg="Building mutation rules..."
time="2023-07-07T07:04:19Z" level=debug msg="Add rule for {Name:backups Scope:Namespaced APIGroup:longhorn.io APIVersion:v1beta2 ObjectType:0xc002431400 OperationTypes:[CREATE UPDATE]}"
time="2023-07-07T07:04:19Z" level=debug msg="Add rule for {Name:backingimages Scope:Namespaced APIGroup:longhorn.io APIVersion:v1beta2 ObjectType:0xc00267e900 OperationTypes:[CREATE UPDATE]}"
time="2023-07-07T07:04:19Z" level=debug msg="Add rule for {Name:backingImageManagers Scope:Namespaced APIGroup:longhorn.io APIVersion:v1beta2 ObjectType:0xc00226bdc0 OperationTypes:[CREATE UPDATE]}"
time="2023-07-07T07:04:19Z" level=debug msg="Add rule for {Name:backingimagedatasources Scope:Namespaced APIGroup:longhorn.io APIVersion:v1beta2 ObjectType:0xc0029e0800 OperationTypes:[CREATE UPDATE]}"
time="2023-07-07T07:04:19Z" level=debug msg="Add rule for {Name:nodes Scope:Namespaced APIGroup:longhorn.io APIVersion:v1beta2 ObjectType:0xc002a21ba0 OperationTypes:[CREATE UPDATE]}"
time="2023-07-07T07:04:19Z" level=debug msg="Add rule for {Name:volumes Scope:Namespaced APIGroup:longhorn.io APIVersion:v1beta2 ObjectType:0xc00260c900 OperationTypes:[CREATE UPDATE]}"
time="2023-07-07T07:04:19Z" level=debug msg="Add rule for {Name:engines Scope:Namespaced APIGroup:longhorn.io APIVersion:v1beta2 ObjectType:0xc0023dc300 OperationTypes:[CREATE UPDATE]}"
time="2023-07-07T07:04:19Z" level=debug msg="Add rule for {Name:recurringjobs Scope:Namespaced APIGroup:longhorn.io APIVersion:v1beta2 ObjectType:0xc00267ed80 OperationTypes:[CREATE UPDATE]}"
time="2023-07-07T07:04:19Z" level=debug msg="Add rule for {Name:engineimages Scope:Namespaced APIGroup:longhorn.io APIVersion:v1beta2 ObjectType:0xc002eaaf00 OperationTypes:[CREATE]}"
time="2023-07-07T07:04:19Z" level=debug msg="Add rule for {Name:orphans Scope:Namespaced APIGroup:longhorn.io APIVersion:v1beta2 ObjectType:0xc00240a580 OperationTypes:[CREATE]}"
time="2023-07-07T07:04:19Z" level=debug msg="Add rule for {Name:sharemanagers Scope:Namespaced APIGroup:longhorn.io APIVersion:v1beta2 ObjectType:0xc00240a9a0 OperationTypes:[CREATE]}"
time="2023-07-07T07:04:19Z" level=debug msg="Add rule for {Name:backupvolumes Scope:Namespaced APIGroup:longhorn.io APIVersion:v1beta2 ObjectType:0xc0029e0e00 OperationTypes:[CREATE]}"
time="2023-07-07T07:04:19Z" level=debug msg="Add rule for {Name:snapshots Scope:Namespaced APIGroup:longhorn.io APIVersion:v1beta2 ObjectType:0xc002382540 OperationTypes:[CREATE]}"
time="2023-07-07T07:04:19Z" level=debug msg="Add rule for {Name:replicas Scope:Namespaced APIGroup:longhorn.io APIVersion:v1beta2 ObjectType:0xc000c4f600 OperationTypes:[CREATE UPDATE]}"
time="2023-07-07T07:04:19Z" level=debug msg="Add rule for {Name:supportbundles Scope:Namespaced APIGroup:longhorn.io APIVersion:v1beta2 ObjectType:0xc002383c00 OperationTypes:[CREATE]}"
time="2023-07-07T07:04:19Z" level=debug msg="Add rule for {Name:systembackups Scope:Namespaced APIGroup:longhorn.io APIVersion:v1beta2 ObjectType:0xc0023e3c00 OperationTypes:[CREATE]}"
time="2023-07-07T07:04:19Z" level=debug msg="Add rule for {Name:volumeattachments Scope:Namespaced APIGroup:longhorn.io APIVersion:v1beta2 ObjectType:0xc0026ac140 OperationTypes:[CREATE UPDATE]}"
time="2023-07-07T07:04:19Z" level=info msg="Updating TLS secret for longhorn-webhook-tls (count: 2): map[listener.cattle.io/cn-longhorn-admission-webhook.longhorn.svc:longhorn-admission-webhook.longhorn.svc listener.cattle.io/cn-longhorn-conversion-webhook.longhorn.svc:longhorn-conversion-webhook.longhorn.svc listener.cattle.io/fingerprint:SHA1=5722A8DEA1DC17BCBDFA7D07C25EB1C0DBB6C4F3]"
time="2023-07-07T07:04:19Z" level=debug msg="DesiredSet - No change(2) admissionregistration.k8s.io/v1, Kind=ValidatingWebhookConfiguration /longhorn-webhook-validator for  longhorn/longhorn-webhook-ca"
time="2023-07-07T07:04:19Z" level=debug msg="DesiredSet - No change(2) admissionregistration.k8s.io/v1, Kind=MutatingWebhookConfiguration /longhorn-webhook-mutator for  longhorn/longhorn-webhook-ca"
time="2023-07-07T07:04:19Z" level=info msg="Webhook admission is ready"
time="2023-07-07T07:04:19Z" level=warning msg="Started longhorn admission webhook server"
time="2023-07-07T07:04:19Z" level=info msg="Starting longhorn recovery-backend server"
W0707 07:04:19.829494       1 client_config.go:618] Neither --kubeconfig nor --master was specified.  Using the inClusterConfig.  This might not work.
I0707 07:04:19.831098       1 shared_informer.go:311] Waiting for caches to sync for longhorn datastore
I0707 07:04:22.132427       1 shared_informer.go:318] Caches are synced for longhorn datastore
time="2023-07-07T07:04:23Z" level=info msg="Starting apiextensions.k8s.io/v1, Kind=CustomResourceDefinition controller"
time="2023-07-07T07:04:23Z" level=info msg="Starting apiregistration.k8s.io/v1, Kind=APIService controller"
time="2023-07-07T07:04:23Z" level=info msg="Started longhorn recovery-backend server"
W0707 07:04:23.187680       1 client_config.go:618] Neither --kubeconfig nor --master was specified.  Using the inClusterConfig.  This might not work.
time="2023-07-07T07:04:23Z" level=info msg="Recovery-backend server is running at :9503"
time="2023-07-07T07:04:23Z" level=info msg="Checking if the upgrade path from v1.4.2 to v1.5.0 is supported"
I0707 07:04:23.210688       1 leaderelection.go:245] attempting to acquire leader lease longhorn/longhorn-manager-upgrade-lock...
I0707 07:04:23.261286       1 leaderelection.go:255] successfully acquired lease longhorn/longhorn-manager-upgrade-lock
time="2023-07-07T07:04:23Z" level=info msg="Start upgrading"
time="2023-07-07T07:04:23Z" level=info msg="No API version upgrade is needed"
time="2023-07-07T07:04:23Z" level=debug msg="Walking through the resource upgrade path v1.4.x to v1.5.0"
time="2023-07-07T07:04:25Z" level=warning msg="Rejected operation: Request (user: system:serviceaccount:longhorn:longhorn-service-account, longhorn.io/v1beta2, Kind=Orphan, namespace: longhorn, name: orphan-024ed10f75415525327901b184c46b279fa24fdab23c89ea80e9f6ea7be50c83, operation: UPDATE)" error="orphan orphan-024ed10f75415525327901b184c46b279fa24fdab23c89ea80e9f6ea7be50c83 spec fields are immutable" service=admissionWebhook
time="2023-07-07T07:04:25Z" level=debug msg="admit result: UPDATE longhorn.io/v1beta2, Kind=Orphan longhorn/orphan-024ed10f75415525327901b184c46b279fa24fdab23c89ea80e9f6ea7be50c83 user=system:serviceaccount:longhorn:longhorn-service-account allowed=false err=<nil>"
time="2023-07-07T07:04:25Z" level=error msg="Upgrade failed: upgrade resources failed: admission webhook \"validator.longhorn.io\" denied the request: orphan orphan-024ed10f75415525327901b184c46b279fa24fdab23c89ea80e9f6ea7be50c83 spec fields are immutable"
time="2023-07-07T07:04:25Z" level=info msg="Upgrade leader lost: worker-3.test-k8s.iamoffice.lv"
time="2023-07-07T07:04:25Z" level=fatal msg="Error starting manager: upgrade resources failed: admission webhook \"validator.longhorn.io\" denied the request: orphan orphan-024ed10f75415525327901b184c46b279fa24fdab23c89ea80e9f6ea7be50c83 spec fields are immutable"

Environment

  • Longhorn version: 1.4.2
  • Installation method (e.g. Rancher Catalog App/Helm/Kubectl): Helm
  • Kubernetes distro (e.g. RKE/K3s/EKS/OpenShift) and version: Vanilla Kubernetes v1.24.15
    • Number of management nodes in the cluster: 3
    • Number of worker nodes in the cluster: 3
  • Node config
    • OS type and version: Flatcar Container Linux by Kinvolk 3510.2.4 (Oklo)
    • CPU per node: 12 CPUs
    • Memory per node: 16 GB
    • Disk type (e.g. SSD/NVMe): SSD
    • Network bandwidth between the nodes: 10Gbit, x2 bond
  • Underlying Infrastructure (e.g. on AWS/GCE, EKS/GKE, VMWare/KVM, Baremetal): KVM
  • Number of Longhorn volumes in the cluster: 9

Workaround

We recommend workaround A before upgrading. However, if you have already run into the issue, please resolve it with workaround B.

A. Delete orphan resources before upgrade.

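A minimal sketch of workaround A, assuming the Orphan CRs live in the longhorn-system namespace (older installs may use a different namespace, such as longhorn in the logs above; adjust accordingly):

```shell
# List the Orphan CRs that would block the upgrade
kubectl -n longhorn-system get orphans.longhorn.io

# Delete all of them before running the Helm upgrade
kubectl -n longhorn-system delete orphans.longhorn.io --all
```

Note that deleting an Orphan CR triggers Longhorn's orphaned-data cleanup for the replica directory it tracks; check the orphaned-data documentation first if you are unsure whether you still need that data.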

B.

  1. Delete the crashlooping longhorn-manager pods:
kubectl -n longhorn-system delete pod -l app=longhorn-manager
  2. Edit the longhorn-webhook-validator ValidatingWebhookConfiguration:
kubectl -n longhorn-system edit validatingwebhookconfigurations longhorn-webhook-validator
  3. Remove UPDATE from the orphans resource rule:
...
  - apiGroups:
    - longhorn.io
    apiVersions:
    - v1beta2
    operations:
    - CREATE
    - UPDATE    # <- remove this line
    resources:
    - orphans
...
  4. Continue the upgrade.
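If the interactive edit keeps being reverted before it takes effect, a non-interactive JSON patch can be applied instead. This is a sketch, not a maintainer-provided command: the rule index (5 below) is an assumption and must be verified against your own configuration first, since rule order can differ between installs.

```shell
# Inspect the configuration and find the index of the rule whose
# resources list contains "orphans" (ValidatingWebhookConfiguration
# is cluster-scoped, so no namespace flag is needed)
kubectl get validatingwebhookconfigurations longhorn-webhook-validator -o yaml

# Replace that rule's operations with CREATE only
# (the index 5 here is hypothetical -- use the one you found above)
kubectl patch validatingwebhookconfigurations longhorn-webhook-validator \
  --type=json \
  -p='[{"op": "replace", "path": "/webhooks/0/rules/5/operations", "value": ["CREATE"]}]'
```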

About this issue

  • Original URL
  • State: closed
  • Created a year ago
  • Comments: 42 (18 by maintainers)

Commits related to this issue

Most upvoted comments

@DmitryMigunov

Hit a regression issue. There is a workaround.

  1. Edit the longhorn-webhook-validator ValidatingWebhookConfiguration:
kubectl -n longhorn-system edit validatingwebhookconfigurations longhorn-webhook-validator
  2. Remove UPDATE from the orphans resource rule:
...
  - apiGroups:
    - longhorn.io
    apiVersions:
    - v1beta2
    operations:
    - CREATE
    - UPDATE    # <- remove this line
    resources:
    - orphans
...
  3. Continue the upgrade.

Hi @DmitryMigunov. In v1.5.0, the webhook and recovery services were merged into longhorn-manager, but somehow the deployment templates were left in the Helm chart, so Helm installs the deployments back.

Workaround:

  • You can safely delete these deployments with: kubectl delete deployments.apps longhorn-admission-webhook longhorn-conversion-webhook longhorn-recovery-backend -n longhorn-system

Thanks!

@PhanLe1010

For issue 1, those of us who use GitOps with declarative configuration can’t delete these deployments manually: because they exist in the Helm chart, the system expects them to exist. Either the cluster reconciliation fails because the resources are missing, or they get recreated, fail to start, and reconciliation still fails.

Having other applications depend on the health of longhorn (which is failing) means they can’t start

What’s the recommended course of action?


To fix this one, I just set the following in my custom values.yaml. It stops the Deployments from scaling up, since only the pods are the issue, not the Deployments themselves.

# these shouldn't exist in the helm chart at all
# see: https://github.com/longhorn/longhorn/issues/6246#issuecomment-1625065311
longhornConversionWebhook:
  replicas: 0
longhornAdmissionWebhook:
  replicas: 0
longhornRecoveryBackend:
  replicas: 0
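For reference, a values override like the one above would typically be applied with something along these lines. The release name `longhorn`, the `longhorn/longhorn` chart reference, and the namespace are assumptions; adjust to your install:

```shell
# Re-run the upgrade with the custom values that scale the
# leftover webhook/recovery Deployments down to zero replicas
helm upgrade longhorn longhorn/longhorn \
  --namespace longhorn-system \
  --version 1.5.0 \
  -f values.yaml
```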

I don’t think it’s a related issue, but now longhorn-recovery-backend and longhorn-conversion-webhook start crashing with errors:

panic: unrecognized command: conversion-webhook

goroutine 1 [running]:
main.cmdNotFound(0xc000abea78?, {0x7ffe1b01c0e7?, 0xc000abeaa0?})
	/go/src/github.com/longhorn/longhorn-manager/main.go:15 +0x67
github.com/urfave/cli.ShowCommandHelp(0xc000642b00, {0x7ffe1b01c0e7, 0x12})
	/go/src/github.com/longhorn/longhorn-manager/vendor/github.com/urfave/cli/help.go:213 +0x404
github.com/urfave/cli.glob..func1(0x1?)
	/go/src/github.com/longhorn/longhorn-manager/vendor/github.com/urfave/cli/help.go:21 +0x34
github.com/urfave/cli.HandleAction({0x230e860?, 0x295d000?}, 0xc00062efc0?)
	/go/src/github.com/longhorn/longhorn-manager/vendor/github.com/urfave/cli/app.go:524 +0x50
github.com/urfave/cli.(*App).Run(0xc00062efc0, {0xc0001a0000, 0x4, 0x4})
	/go/src/github.com/longhorn/longhorn-manager/vendor/github.com/urfave/cli/app.go:286 +0x7db
main.main()
	/go/src/github.com/longhorn/longhorn-manager/main.go:66 +0x8db
panic: unrecognized command: recovery-backend

goroutine 1 [running]:
main.cmdNotFound(0xc000ad8a78?, {0x7ffd0d0b30c2?, 0xc000ad8aa0?})
	/go/src/github.com/longhorn/longhorn-manager/main.go:15 +0x67
github.com/urfave/cli.ShowCommandHelp(0xc00021a000, {0x7ffd0d0b30c2, 0x10})
	/go/src/github.com/longhorn/longhorn-manager/vendor/github.com/urfave/cli/help.go:213 +0x404
github.com/urfave/cli.glob..func1(0x1?)
	/go/src/github.com/longhorn/longhorn-manager/vendor/github.com/urfave/cli/help.go:21 +0x34
github.com/urfave/cli.HandleAction({0x230e860?, 0x295d000?}, 0xc0003c7500?)
	/go/src/github.com/longhorn/longhorn-manager/vendor/github.com/urfave/cli/app.go:524 +0x50
github.com/urfave/cli.(*App).Run(0xc0003c7500, {0xc000050080, 0x4, 0x4})
	/go/src/github.com/longhorn/longhorn-manager/vendor/github.com/urfave/cli/app.go:286 +0x7db
main.main()
	/go/src/github.com/longhorn/longhorn-manager/main.go:66 +0x8db

@derekbit , Longhorn has been successfully upgraded with your workaround. Thank you.


Verified on v1.5.x-head 20230710

The test steps

https://github.com/longhorn/longhorn/issues/6246#issuecomment-1624970058

Result Passed

  1. I was able to reproduce this issue by following the test steps from version 1.4.2 to 1.5.0 using Helm upgrade.
  2. After creating the orphan resource, we can upgrade Longhorn from version 1.4.2 to 1.5.x using Helm.

PSA

In this ticket, two regressions were found in v1.5.0. The workarounds are provided below, and the issues will be fixed in the upcoming v1.5.1 patch. We recommend users hold off on upgrading to v1.5.0 and wait for the v1.5.1 release. Thanks for your understanding.

Issue 1: Longhorn v1.5.0 merged the webhook and recovery-backend deployments into the longhorn-manager DaemonSet, but the deployment templates were accidentally left in the Helm chart (due to a Helm chart syncing issue between different repos). The workaround is setting Helm values to scale down these deployments: https://github.com/longhorn/longhorn/issues/6246#issuecomment-1629855815

Issue 2: We missed updating the webhook validation logic for the Orphan CR, so any update to an Orphan CR is blocked. The workaround for this one is https://github.com/longhorn/longhorn/issues/6246#issuecomment-1624944396

@Starttoaster please check the workaround in the description for the immutable-field issue.

Hi everyone, I am facing the same issue and am trying this edit: kubectl -n longhorn-system edit validatingwebhookconfigurations longhorn-webhook-validator

But the configuration seems to be reset to its initial version every time, so the edit is not effective. How can I fix this?

The step is tricky. You probably need to try multiple times. https://github.com/longhorn/longhorn/issues/6246#issue-1792948689

If you don’t mind, I would recommend using the customized image I built. The steps are at https://github.com/longhorn/longhorn/discussions/6281#discussioncomment-6414724

Thanks @absentbri ! That is a better workaround! Updated the PSA

Verified on master-head 20230710

The test steps

https://github.com/longhorn/longhorn/issues/6246#issuecomment-1624970058

Result Passed

  1. After creating the orphan resource, we can upgrade Longhorn from version 1.4.2 to master-head using Helm.

@pchang388 According to the information you provided in #6259 (comment), the upgrade looks successful. I don’t know why you rolled back to the old version…

For the old version engine image and instance manager after upgrade, you can refer to https://longhorn.io/kb/troubleshooting-some-old-instance-manager-pods-are-still-running-after-upgrade/

BTW, I suggest putting the information in the same thread so the context is easier to follow.

Thanks for the response. I originally did a rollback because someone else said they were able to fix the current topic (instance managers failing in the 1.5.0 upgrade due to orphaned resources) with that method. I was experiencing the same issue and did a rollback, as stated in the linked comment/reply. I know it’s documented as unsupported, but I was able to roll back, delete the orphan PVCs, and do an upgrade before experiencing the current issue.

But as you stated, this is a known behavior and should resolve itself once you detach the volumes, after which the old engine will be recycled. Let me try detaching/re-attaching and ensure that happens. EDIT: per the docs, https://longhorn.io/docs/1.5.0/deploy/upgrade/upgrade-engine/, the engine upgrade is done manually via the UI, or you can set up Longhorn to do it automatically. This was my first Longhorn upgrade and I missed that section. As soon as I upgraded the volumes to use the new engine version, the old engine-image DaemonSet was removed:

$ k get daemonsets.apps -n longhorn-system
NAME                       DESIRED   CURRENT   READY   UP-TO-DATE   AVAILABLE   NODE SELECTOR   AGE
engine-image-ei-d911131c   5         5         5       5            5           <none>          15h
longhorn-manager           5         5         5       5            5           <none>          154d
longhorn-csi-plugin        5         5         5       5            5           <none>          15h

Deleted the comment in the other thread and will stick to this thread/issue though, thanks

It’s not causing downtime currently, just leaving the cluster in an unreconciled state. Though I can remove the dependency link from my apps to longhorn and it should be okay until v1.5.1 is released.

I think a patch release of the chart would be useful:

✗ Helm upgrade failed: deployments.apps "longhorn-recovery-backend" not found

@DmitryMigunov Just curious, any feature attracts you to upgrade so quickly?

When something is released, it’s expected to work and to have been properly tested beforehand(?). I don’t see any automated CI testing in this repo. Installs (and/or test runs) should happen automatically on the latest 3 stable Kubernetes versions, e.g. like https://github.com/kubernetes/ingress-nginx

We have automated CI tests: https://github.com/longhorn/longhorn-tests and https://ci.longhorn.io/.

The issue is caused by a missing piece (orphan resources) in the upgrade path.

Any feedback or contribution is appreciated.

Also note that the Helm chart version does not have to equal the Longhorn version, which allows fixing the chart without releasing a new Longhorn version.


@ChanYiLin as discussed, let’s create another issue to tackle the Helm chart issue. cc @longhorn/qa

longhorn-recovery-backend and longhorn-conversion-webhook should no longer be used and should be terminated after the upgrade. cc @ChanYiLin can you take a look?

/upgrade/v14xto150/upgrade.go#L33-L35

	if err := upgradeWebhookAndRecoveryService(namespace, kubeClient); err != nil {
		return err
	}