longhorn: [BUG] Unable to finish install or upgrade when `Failed to list *v1beta2.Node: v1beta2.NodeList.Items`

Describe the bug

This happens when Backing Image’s disk status is empty.

spec:
  disks:
    1e3408a7-3318-4244-ba28-7c4cdc302add: {}
    b16a8aa2-63d8-452f-baf5-eded5a2525e7: {}
    c192b677-fd0d-42e1-9021-78ea70904daf: {}

To Reproduce

There are no definite reproduction steps, but one possible sequence is:

  1. Fresh install v1.1.3

  2. Create backing image

  3. Upgrade to master-head before the first backing image download finishes

  4. See error

Expected behavior

Not sure what better behavior to expect here, but since the longhorn-manager pods are still waiting for the longhorn-admission-webhook pods, the user won’t be able to do much at this point.

Log or Support bundle

longhorn-admission-webhook:
longhorn-admission-webhook 2022/02/19 03:44:29 proto: duplicate proto type registered: VersionResponse
longhorn-admission-webhook time="2022-02-19T03:44:29Z" level=info msg="Starting longhorn admission webhook server"
longhorn-admission-webhook W0219 03:44:29.462537       1 client_config.go:552] Neither --kubeconfig nor --master was specified.  Using the inClusterConfig.  This might not work.
longhorn-admission-webhook I0219 03:44:29.465992       1 shared_informer.go:223] Waiting for caches to sync for longhorn datastore
longhorn-admission-webhook E0219 03:44:30.478473       1 reflector.go:178] github.com/longhorn/longhorn-manager/k8s/pkg/client/informers/externalversions/factory.go:117: Failed to list *v1beta2.BackingImage: v1beta2.BackingImageList.Items: []v1beta2.BackingImage: v1beta2.BackingImage.Spec: v1beta2.BackingImageSpec.Disks: ReadString: expects " or n, but found {, error found in #10 byte of ...|c302add":{},"b16a8aa|..., bigger context ...|{"disks":{"1e3408a7-3318-4244-ba28-7c4cdc302add":{},"b16a8aa2-63d8-452f-baf5-eded5a2525e7":{},"c192b|...
longhorn-admission-webhook E0219 03:44:31.734882       1 reflector.go:178] github.com/longhorn/longhorn-manager/k8s/pkg/client/informers/externalversions/factory.go:117: Failed to list *v1beta2.BackingImage: v1beta2.BackingImageList.Items: []v1beta2.BackingImage: v1beta2.BackingImage.Spec: v1beta2.BackingImageSpec.Disks: ReadString: expects " or n, but found {, error found in #10 byte of ...|c302add":{},"b16a8aa|..., bigger context ...|{"disks":{"1e3408a7-3318-4244-ba28-7c4cdc302add":{},"b16a8aa2-63d8-452f-baf5-eded5a2525e7":{},"c192b|...
longhorn-admission-webhook E0219 03:44:34.519957       1 reflector.go:178] github.com/longhorn/longhorn-manager/k8s/pkg/client/informers/externalversions/factory.go:117: Failed to list *v1beta2.BackingImage: v1beta2.BackingImageList.Items: []v1beta2.BackingImage: v1beta2.BackingImage.Spec: v1beta2.BackingImageSpec.Disks: ReadString: expects " or n, but found {, error found in #10 byte of ...|c302add":{},"b16a8aa|..., bigger context ...|{"disks":{"1e3408a7-3318-4244-ba28-7c4cdc302add":{},"b16a8aa2-63d8-452f-baf5-eded5a2525e7":{},"c192b|...
longhorn-admission-webhook E0219 03:44:39.318892       1 reflector.go:178] github.com/longhorn/longhorn-manager/k8s/pkg/client/informers/externalversions/factory.go:117: Failed to list *v1beta2.BackingImage: v1beta2.BackingImageList.Items: []v1beta2.BackingImage: v1beta2.BackingImage.Spec: v1beta2.BackingImageSpec.Disks: ReadString: expects " or n, but found {, error found in #10 byte of ...|c302add":{},"b16a8aa|..., bigger context ...|{"disks":{"1e3408a7-3318-4244-ba28-7c4cdc302add":{},"b16a8aa2-63d8-452f-baf5-eded5a2525e7":{},"c192b|...
longhorn-admission-webhook E0219 03:44:47.517024       1 reflector.go:178] github.com/longhorn/longhorn-manager/k8s/pkg/client/informers/externalversions/factory.go:117: Failed to list *v1beta2.BackingImage: v1beta2.BackingImageList.Items: []v1beta2.BackingImage: v1beta2.BackingImage.Spec: v1beta2.BackingImageSpec.Disks: ReadString: expects " or n, but found {, error found in #10 byte of ...|c302add":{},"b16a8aa|..., bigger context ...|{"disks":{"1e3408a7-3318-4244-ba28-7c4cdc302add":{},"b16a8aa2-63d8-452f-baf5-eded5a2525e7":{},"c192b|...
longhorn-admission-webhook E0219 03:45:00.984511       1 reflector.go:178] github.com/longhorn/longhorn-manager/k8s/pkg/client/informers/externalversions/factory.go:117: Failed to list *v1beta2.BackingImage: v1beta2.BackingImageList.Items: []v1beta2.BackingImage: v1beta2.BackingImage.Spec: v1beta2.BackingImageSpec.Disks: ReadString: expects " or n, but found {, error found in #10 byte of ...|c302add":{},"b16a8aa|..., bigger context ...|{"disks":{"1e3408a7-3318-4244-ba28-7c4cdc302add":{},"b16a8aa2-63d8-452f-baf5-eded5a2525e7":{},"c192b|...
longhorn-admission-webhook E0219 03:45:47.355421       1 reflector.go:178] github.com/longhorn/longhorn-manager/k8s/pkg/client/informers/externalversions/factory.go:117: Failed to list *v1beta2.BackingImage: v1beta2.BackingImageList.Items: []v1beta2.BackingImage: v1beta2.BackingImage.Spec: v1beta2.BackingImageSpec.Disks: ReadString: expects " or n, but found {, error found in #10 byte of ...|c302add":{},"b16a8aa|..., bigger context ...|{"disks":{"1e3408a7-3318-4244-ba28-7c4cdc302add":{},"b16a8aa2-63d8-452f-baf5-eded5a2525e7":{},"c192b|...
longhorn-admission-webhook E0219 03:46:40.122238       1 reflector.go:178] github.com/longhorn/longhorn-manager/k8s/pkg/client/informers/externalversions/factory.go:117: Failed to list *v1beta2.BackingImage: v1beta2.BackingImageList.Items: []v1beta2.BackingImage: v1beta2.BackingImage.Spec: v1beta2.BackingImageSpec.Disks: ReadString: expects " or n, but found {, error found in #10 byte of ...|c302add":{},"b16a8aa|..., bigger context ...|{"disks":{"1e3408a7-3318-4244-ba28-7c4cdc302add":{},"b16a8aa2-63d8-452f-baf5-eded5a2525e7":{},"c192b|...
Stream closed EOF for longhorn-system/longhorn-admission-webhook-57c4747b5d-phh8n (wait-longhorn-conversion-webhook)
longhorn-conversion-webhook:
2022/02/19 03:44:09 proto: duplicate proto type registered: VersionResponse
time="2022-02-19T03:44:09Z" level=info msg="Starting longhorn conversion webhook server"
W0219 03:44:09.498153       1 client_config.go:552] Neither --kubeconfig nor --master was specified.  Using the inClusterConfig.  This might not work.
time="2022-02-19T03:44:09Z" level=warning msg="Failed to init Kubernetes secret: secrets \"longhorn-webhook-tls\" not found"
time="2022-02-19T03:44:09Z" level=info msg="generated self-signed CA certificate CN=dynamiclistener-ca,O=dynamiclistener-org: notBefore=2022-02-19 03:44:09.537432722 +0000 UTC notAfter=2032-02-17 03:44:09.537432722 +0000 UTC"
time="2022-02-19T03:44:09Z" level=info msg="Listening on :9443"
time="2022-02-19T03:44:09Z" level=info msg="certificate CN=dynamic,O=dynamic signed by CN=dynamiclistener-ca,O=dynamiclistener-org: notBefore=2022-02-19 03:44:09 +0000 UTC notAfter=2023-02-19 03:44:09 +0000 UTC"
time="2022-02-19T03:44:09Z" level=info msg="Creating new TLS secret for longhorn-webhook-tls (count: 1): map[listener.cattle.io/cn-longhorn-conversion-webhook.longho-6a0089:longhorn-conversion-webhook.longhorn-system.svc listener.cattle.io/fingerprint:SHA1=AD6C82905AA34BFEF2E984609A95E98FCEAAB468]"
time="2022-02-19T03:44:09Z" level=info msg="Active TLS secret longhorn-webhook-tls (ver=4924) (count 1): map[listener.cattle.io/cn-longhorn-conversion-webhook.longho-6a0089:longhorn-conversion-webhook.longhorn-system.svc listener.cattle.io/fingerprint:SHA1=AD6C82905AA34BFEF2E984609A95E98FCEAAB468]"
time="2022-02-19T03:44:10Z" level=info msg="Starting /v1, Kind=Secret controller"
time="2022-02-19T03:44:10Z" level=info msg="Starting apiextensions.k8s.io/v1, Kind=CustomResourceDefinition controller"
time="2022-02-19T03:44:10Z" level=info msg="Building conversion rules..."
time="2022-02-19T03:44:10Z" level=info msg="Starting apiregistration.k8s.io/v1, Kind=APIService controller"
time="2022-02-19T03:44:10Z" level=info msg="Updating TLS secret for longhorn-webhook-tls (count: 1): map[listener.cattle.io/cn-longhorn-conversion-webhook.longho-6a0089:longhorn-conversion-webhook.longhorn-system.svc listener.cattle.io/fingerprint:SHA1=AD6C82905AA34BFEF2E984609A95E98FCEAAB468]"
time="2022-02-19T03:44:10Z" level=info msg="Update CRD for backingimages.longhorn.io"
time="2022-02-19T03:44:10Z" level=info msg="Update CRD for backuptargets.longhorn.io"
time="2022-02-19T03:44:10Z" level=info msg="Update CRD for engineimages.longhorn.io"
time="2022-02-19T03:44:10Z" level=info msg="Update CRD for nodes.longhorn.io"
time="2022-02-19T03:44:10Z" level=info msg="Update CRD for volumes.longhorn.io"
time="2022-02-19T03:44:11Z" level=error msg="error decoding src object" error="v1beta2.BackingImage.Spec: v1beta2.BackingImageSpec.Disks: ReadString: expects \" or n, but found {, error found in #10 byte of ...|c302add\":{},\"b16a8aa|..., bigger context ...|{\"disks\":{\"1e3408a7-3318-4244-ba28-7c4cdc302add\":{},\"b16a8aa2-63d8-452f-baf5-eded5a2525e7\":{},\"c192b|..."
time="2022-02-19T03:44:12Z" level=error msg="error decoding src object" error="v1beta2.BackingImage.Spec: v1beta2.BackingImageSpec.Disks: ReadString: expects \" or n, but found {, error found in #10 byte of ...|c302add\":{},\"b16a8aa|..., bigger context ...|{\"disks\":{\"1e3408a7-3318-4244-ba28-7c4cdc302add\":{},\"b16a8aa2-63d8-452f-baf5-eded5a2525e7\":{},\"c192b|..."
time="2022-02-19T03:44:13Z" level=error msg="error decoding src object" error="v1beta2.BackingImage.Spec: v1beta2.BackingImageSpec.Disks: ReadString: expects \" or n, but found {, error found in #10 byte of ...|c302add\":{},\"b16a8aa|..., bigger context ...|{\"disks\":{\"1e3408a7-3318-4244-ba28-7c4cdc302add\":{},\"b16a8aa2-63d8-452f-baf5-eded5a2525e7\":{},\"c192b|..."
time="2022-02-19T03:44:14Z" level=error msg="error decoding src object" error="v1beta2.BackingImage.Spec: v1beta2.BackingImageSpec.Disks: ReadString: expects \" or n, but found {, error found in #10 byte of ...|c302add\":{},\"b16a8aa|..., bigger context ...|{\"disks\":{\"1e3408a7-3318-4244-ba28-7c4cdc302add\":{},\"b16a8aa2-63d8-452f-baf5-eded5a2525e7\":{},\"c192b|..."

Environment

  • Longhorn version: v1.1.3 -> master-head
  • Installation method (e.g. Rancher Catalog App/Helm/Kubectl): kubectl
  • Kubernetes distro (e.g. RKE/K3s/EKS/OpenShift) and version: 1.19 k3s
    • Number of management node in the cluster: 1
    • Number of worker node in the cluster: 3
  • Underlying Infrastructure (e.g. on AWS/GCE, EKS/GKE, VMWare/KVM, Baremetal): local vm

Additional context

Can be also reproduced on Kubernetes v1.21

About this issue

  • State: closed
  • Created 2 years ago
  • Comments: 39 (29 by maintainers)

Most upvoted comments

The minimum and maximum markers in the CRD manifest introduce the error. The workaround is to move the schema validation checks into the validating webhook.
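As a rough illustration of that workaround, the range checks could live in webhook code instead of the CRD schema. The sketch below is in Python for brevity and is not Longhorn's actual webhook code: the `MINIMUMS` table and `validate_minimums` function are hypothetical names, and the three bounds are only the `minimum` values that appear in the CRD dump quoted further down in this issue.

```python
from typing import Any, Dict, List, Tuple

# (kind, spec field) -> lower bound. Values copied from the `minimum` markers
# visible in the CRD dump in this issue; the table itself is hypothetical.
MINIMUMS: Dict[Tuple[str, str], int] = {
    ("Node", "engineManagerCPURequest"): 0,
    ("Node", "replicaManagerCPURequest"): 0,
    ("Volume", "numberOfReplicas"): 1,
}


def validate_minimums(obj: Dict[str, Any]) -> List[str]:
    """Return a list of violations; an empty list means the object passes."""
    kind = obj.get("kind")
    spec = obj.get("spec") or {}
    errors: List[str] = []
    for (rule_kind, field), minimum in MINIMUMS.items():
        # Skip rules for other kinds and fields the object does not set.
        if rule_kind != kind or field not in spec:
            continue
        value = spec[field]
        # bool is a subclass of int in Python, so reject it explicitly.
        if isinstance(value, bool) or not isinstance(value, int):
            errors.append(f"spec.{field} must be an integer, got {value!r}")
        elif value < minimum:
            errors.append(f"spec.{field} must be >= {minimum}, got {value}")
    return errors


if __name__ == "__main__":
    print(validate_minimums({"kind": "Volume", "spec": {"numberOfReplicas": 0}}))
```

A validating webhook would reject the request whenever the returned list is non-empty, which keeps the numeric markers out of the CRD manifest entirely.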

similar issues

Root cause https://github.com/kubernetes/kubernetes/issues/87675

cc @innobead

I tried to investigate this issue more

Experiment 1: Upload a backing image bbb. The value of the bbb resource in etcd:

 etcdctl --cert /var/lib/rancher/rke2/server/tls/etcd/server-client.crt --key /var/lib/rancher/rke2/server/tls/etcd/server-client.key --endpoints https://127.0.0.1:2379 --cacert /var/lib/rancher/rke2/server/tls/etcd/server-ca.crt get /registry/longhorn.io/backingimages/longhorn-system/bbb
/registry/longhorn.io/backingimages/longhorn-system/bbb
{"apiVersion":"longhorn.io/v1beta1","kind":"BackingImage","metadata":{"creationTimestamp":"2022-03-01T16:18:36Z","finalizers":["longhorn.io"],"generation":2,"labels":{"longhorn.io/component":"backing-image","longhorn.io/managed-by":"longhorn-manager"},"managedFields":[{"apiVersion":"longhorn.io/v1beta1","fieldsType":"FieldsV1","fieldsV1":{"f:metadata":{"f:finalizers":{".":{},"v:\"longhorn.io\"":{}},"f:labels":{".":{},"f:longhorn.io/component":{},"f:longhorn.io/managed-by":{}}},"f:spec":{},"f:status":{}},"manager":"longhorn-manager","operation":"Update","time":"2022-03-01T16:18:36Z"}],"name":"bbb","namespace":"longhorn-system","uid":"ea37250e-ea72-447b-873c-b9a1410ec97a"},"spec":{"checksum":"","disks":{"93e0d699-9ae4-4ff9-803a-e77ece8a17ef":{}},"imageURL":"","sourceParameters":{"url":"https://github.com/rancher/k3os/releases/download/v0.11.0/k3os-amd64.iso"},"sourceType":"download"},"status":{"checksum":"0a230fccbcf4acdd600933f94250fc6513bbf207d4e584984fca3e36ce6716a9bf3a3ce6fac3a4d0d08c56609ea8c6f02e9944d96c1080fb0abf0020ad7a268f","diskDownloadProgressMap":null,"diskDownloadStateMap":null,"diskFileStatusMap":{"93e0d699-9ae4-4ff9-803a-e77ece8a17ef":{"lastStateTransitionTime":"2022-03-01T16:20:39Z","message":"","progress":100,"state":"ready"}},"diskLastRefAtMap":{"93e0d699-9ae4-4ff9-803a-e77ece8a17ef":"2022-03-01T16:18:36Z"},"ownerID":"rancher60-worker1","size":534431744,"uuid":"2725336b"}}

After the upload finishes, update to the master-head CRD manifests:

root@rancher60-master:~/crd_webhook# etcdctl --cert /var/lib/rancher/rke2/server/tls/etcd/server-client.crt --key /var/lib/rancher/rke2/server/tls/etcd/server-client.key --endpoints https://127.0.0.1:2379 --cacert /var/lib/rancher/rke2/server/tls/etcd/server-ca.crt get /registry/longhorn.io/backingimages/longhorn-system/bbb
/registry/longhorn.io/backingimages/longhorn-system/bbb
{"apiVersion":"longhorn.io/v1beta1","kind":"BackingImage","metadata":{"creationTimestamp":"2022-03-01T16:18:36Z","finalizers":["longhorn.io"],"generation":2,"labels":{"longhorn.io/component":"backing-image","longhorn.io/managed-by":"longhorn-manager"},"managedFields":[{"apiVersion":"longhorn.io/v1beta1","fieldsType":"FieldsV1","fieldsV1":{"f:metadata":{"f:finalizers":{".":{},"v:\"longhorn.io\"":{}},"f:labels":{".":{},"f:longhorn.io/component":{},"f:longhorn.io/managed-by":{}}},"f:spec":{},"f:status":{}},"manager":"longhorn-manager","operation":"Update","time":"2022-03-01T16:18:36Z"}],"name":"bbb","namespace":"longhorn-system","uid":"ea37250e-ea72-447b-873c-b9a1410ec97a"},"spec":{"checksum":"","disks":{"93e0d699-9ae4-4ff9-803a-e77ece8a17ef":{}},"imageURL":"","sourceParameters":{"url":"https://github.com/rancher/k3os/releases/download/v0.11.0/k3os-amd64.iso"},"sourceType":"download"},"status":{"checksum":"0a230fccbcf4acdd600933f94250fc6513bbf207d4e584984fca3e36ce6716a9bf3a3ce6fac3a4d0d08c56609ea8c6f02e9944d96c1080fb0abf0020ad7a268f","diskDownloadProgressMap":null,"diskDownloadStateMap":null,"diskFileStatusMap":{"93e0d699-9ae4-4ff9-803a-e77ece8a17ef":{"lastStateTransitionTime":"2022-03-01T16:20:39Z","message":"","progress":100,"state":"ready"}},"diskLastRefAtMap":{"93e0d699-9ae4-4ff9-803a-e77ece8a17ef":"2022-03-01T16:18:36Z"},"ownerID":"rancher60-worker1","size":534431744,"uuid":"2725336b"}}

Experiment 2: Upload a backing image bbb. The value of the bbb resource in etcd:

/registry/longhorn.io/backingimages/longhorn-system/bbb
{"apiVersion":"longhorn.io/v1beta1","kind":"BackingImage","metadata":{"creationTimestamp":"2022-03-01T16:40:11Z","finalizers":["longhorn.io"],"generation":2,"labels":{"longhorn.io/component":"backing-image","longhorn.io/managed-by":"longhorn-manager"},"managedFields":[{"apiVersion":"longhorn.io/v1beta1","fieldsType":"FieldsV1","fieldsV1":{"f:metadata":{"f:finalizers":{".":{},"v:\"longhorn.io\"":{}},"f:labels":{".":{},"f:longhorn.io/component":{},"f:longhorn.io/managed-by":{}}},"f:spec":{},"f:status":{}},"manager":"longhorn-manager","operation":"Update","time":"2022-03-01T16:40:11Z"}],"name":"bbb","namespace":"longhorn-system","uid":"f2e52b5a-52e6-4881-890d-f90d377663aa"},"spec":{"checksum":"","disks":{"144171df-1157-412d-acc5-047deb1ff79d":{}},"imageURL":"","sourceParameters":{"url":"https://github.com/rancher/k3os/releases/download/v0.11.0/k3os-amd64.iso"},"sourceType":"download"},"status":{"checksum":"","diskDownloadProgressMap":null,"diskDownloadStateMap":null,"diskFileStatusMap":{"144171df-1157-412d-acc5-047deb1ff79d":{"lastStateTransitionTime":"2022-03-01T16:40:31Z","message":"","progress":1,"state":"in-progress"}},"diskLastRefAtMap":{"144171df-1157-412d-acc5-047deb1ff79d":"2022-03-01T16:40:11Z"},"ownerID":"rancher60-worker1","size":534431744,"uuid":"89f245d3"}}

Update to the master-head CRD manifests while the upload is in progress:

root@rancher60-master:~/crd_webhook/before-issue-3562-install# etcdctl --cert /var/lib/rancher/rke2/server/tls/etcd/server-client.crt --key /var/lib/rancher/rke2/server/tls/etcd/server-client.key --endpoints https://127.0.0.1:2379 --cacert /var/lib/rancher/rke2/server/tls/etcd/server-ca.crt get /registry/longhorn.io/backingimages/longhorn-system/bbb
/registry/longhorn.io/backingimages/longhorn-system/bbb
{"apiVersion":"longhorn.io/v1beta2","kind":"BackingImage","metadata":{"creationTimestamp":"2022-03-01T16:40:11Z","finalizers":["longhorn.io"],"generation":3,"labels":{"longhorn.io/component":"backing-image","longhorn.io/managed-by":"longhorn-manager"},"managedFields":[{"apiVersion":"longhorn.io/v1beta1","fieldsType":"FieldsV1","fieldsV1":{"f:metadata":{"f:finalizers":{".":{},"v:\"longhorn.io\"":{}},"f:labels":{".":{},"f:longhorn.io/component":{},"f:longhorn.io/managed-by":{}}},"f:spec":{},"f:status":{}},"manager":"longhorn-manager","operation":"Update","time":"2022-03-01T16:40:11Z"}],"name":"bbb","namespace":"longhorn-system","uid":"f2e52b5a-52e6-4881-890d-f90d377663aa"},"spec":{"checksum":"","disks":{"09d9c6de-13ef-4d7a-b8ef-f3622ddfc228":{},"144171df-1157-412d-acc5-047deb1ff79d":{}},"imageURL":"","sourceParameters":{"url":"https://github.com/rancher/k3os/releases/download/v0.11.0/k3os-amd64.iso"},"sourceType":"download"},"status":{"checksum":"","diskDownloadProgressMap":{},"diskDownloadStateMap":{},"diskFileStatusMap":{"09d9c6de-13ef-4d7a-b8ef-f3622ddfc228":{"lastStateTransitionTime":"2022-03-01T16:41:25Z","message":"pod spec node ID rancher60-master doesn't match the desired node ID rancher60-worker1","progress":51,"state":"failed"},"144171df-1157-412d-acc5-047deb1ff79d":{"lastStateTransitionTime":"2022-03-01T16:41:23Z","message":"","progress":0,"state":"unknown"}},"diskLastRefAtMap":{"09d9c6de-13ef-4d7a-b8ef-f3622ddfc228":"2022-03-01T16:41:24Z","144171df-1157-412d-acc5-047deb1ff79d":"2022-03-01T16:40:11Z"},"ownerID":"rancher60-worker1","size":534431744,"uuid":"89f245d3"}}

apiVersion field became longhorn.io/v1beta2.


In Experiment 1, the apiVersion was not changed after updating CRDs.

In Experiment 2, however, the apiVersion became longhorn.io/v1beta2 while Spec/Status still followed the v1beta1 schema. This mismatch between the apiVersion and the Spec/Status contents caused the decode failure in the conversion webhook.
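To make the mismatch concrete, here is a small Python sketch that flags a stored object whose apiVersion says v1beta2 but whose spec.disks values still have the v1beta1 shape. It rests on an assumption inferred from the error message (`ReadString: expects " or n, but found {`): that the v1beta2 decoder wants a string or null for each disk entry, while the stored v1beta1 data holds an empty object. The function name `find_v1beta1_shaped_disks` is hypothetical.

```python
import json
from typing import List


def find_v1beta1_shaped_disks(stored_json: str) -> List[str]:
    """Return the disk IDs whose spec.disks value is neither a string nor null.

    Only objects labelled longhorn.io/v1beta2 are checked, because those are
    the ones a v1beta2 decoder would try to read directly (assumption above).
    """
    obj = json.loads(stored_json)
    if obj.get("apiVersion") != "longhorn.io/v1beta2":
        return []  # still labelled v1beta1, nothing to flag here
    disks = (obj.get("spec") or {}).get("disks") or {}
    return sorted(
        disk_id
        for disk_id, value in disks.items()
        if value is not None and not isinstance(value, str)
    )


if __name__ == "__main__":
    # Trimmed-down version of the Experiment 2 object after the CRD update.
    stored = (
        '{"apiVersion":"longhorn.io/v1beta2","kind":"BackingImage",'
        '"spec":{"disks":{"09d9c6de-13ef-4d7a-b8ef-f3622ddfc228":{},'
        '"144171df-1157-412d-acc5-047deb1ff79d":{}}}}'
    )
    print(find_v1beta1_shaped_disks(stored))
```

For the Experiment 2 object both disk IDs are flagged, whereas the Experiment 1 object is skipped because its apiVersion is still longhorn.io/v1beta1.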

A question raised by the investigation is why a v1beta1 API Update/UpdateStatus operation can change the apiVersion to v1beta2 without any error.

cc @jenting @shuo-wu @innobead

@derekbit Yes, creating a new Backing Image and a Volume with a Backing Image works without problems.

@derekbit With the latest fix, the installation no longer gets stuck.

However, the backing image will fail, as there is no Image URL.

failed to process backing image file: failed to reuse file via the fetch call, then reset the work directory and exited: stat /data/backing-images/bi-v113-e075f796/backing: no such file or 
Output of describe backingimage/bi-v113:
Name:         bi-v113
Namespace:    longhorn-system
Labels:       longhorn.io/component=backing-image
              longhorn.io/managed-by=longhorn-manager
Annotations:  <none>
API Version:  longhorn.io/v1beta2
Kind:         BackingImage
Metadata:
  Creation Timestamp:  2022-04-06T08:06:00Z
  Finalizers:
    longhorn.io
  Generation:  6
  Managed Fields:
    API Version:  longhorn.io/v1beta1
    Fields Type:  FieldsV1
    fieldsV1:
      f:metadata:
        f:finalizers:
          .:
          v:"longhorn.io":
        f:labels:
          .:
          f:longhorn.io/component:
          f:longhorn.io/managed-by:
    Manager:      longhorn-manager
    Operation:    Update
    Time:         2022-04-06T08:06:00Z
    API Version:  longhorn.io/v1beta2
    Fields Type:  FieldsV1
    fieldsV1:
      f:spec:
        f:imageURL:
        f:sourceParameters:
          f:url:
      f:status:
        f:diskFileStatusMap:
          f:809baab6-0308-4b0b-9dc5-4528090c8185:
            .:
            f:lastStateTransitionTime:
            f:message:
            f:progress:
            f:state:
          f:9517cb60-9d84-4980-8fc6-916797bc1b2a:
            .:
            f:lastStateTransitionTime:
            f:message:
            f:progress:
            f:state:
          f:dcaa228c-53f1-4986-9105-b28bd3b291b1:
            .:
            f:lastStateTransitionTime:
            f:message:
            f:progress:
            f:state:
    Manager:         longhorn-manager
    Operation:       Update
    Time:            2022-04-06T08:09:24Z
  Resource Version:  6981
  Self Link:         /apis/longhorn.io/v1beta2/namespaces/longhorn-system/backingimages/bi-v113
  UID:               66b8bf0d-fbdd-4aa9-b8dd-a723a620423f
Spec:
  Checksum:  
  Disks:
    809baab6-0308-4b0b-9dc5-4528090c8185:  
    9517cb60-9d84-4980-8fc6-916797bc1b2a:  
    dcaa228c-53f1-4986-9105-b28bd3b291b1:  
  Image URL:                               
  Source Parameters:
    URL:        
  Source Type:  download
Status:
  Checksum:  
  Disk Download Progress Map:
  Disk Download State Map:
  Disk File Status Map:
    809baab6-0308-4b0b-9dc5-4528090c8185:
      Last State Transition Time:  2022-04-06T08:09:21Z
      Message:                     failed to process backing image file: failed to reuse file via the fetch call, then reset the work directory and exited: stat /data/backing-images/bi-v113-e075f796/backing: no such file or directory
      Progress:                    0
      State:                       failed
    9517cb60-9d84-4980-8fc6-916797bc1b2a:
      Last State Transition Time:  2022-04-06T08:09:24Z
      Message:                     failed to process backing image file: failed to reuse file via the fetch call, then reset the work directory and exited: stat /data/backing-images/bi-v113-e075f796/backing: no such file or directory
      Progress:                    0
      State:                       failed
    dcaa228c-53f1-4986-9105-b28bd3b291b1:
      Last State Transition Time:  2022-04-06T08:09:20Z
      Message:                     failed to process backing image file: failed to reuse file via the fetch call, then reset the work directory and exited: stat /data/backing-images/bi-v113-e075f796/backing: no such file or directory
      Progress:                    0
      State:                       failed
  Disk Last Ref At Map:
  Owner ID:  it2-node2
  Size:      534431744
  Uuid:      e075f796
Events:
  Type    Reason  Age    From                               Message
  ----    ------  ----   ----                               -------
  Normal  Update  9m32s  longhorn-backing-image-controller  Initialized UUID to e075f796
  Normal  Create  9m6s   longhorn-backing-image-controller  created default backing image manager backing-image-manager-7298-809b in disk 809baab6-0308-4b0b-9dc5-4528090c8185 on node it2-node1
  Normal  Create  9m6s   longhorn-backing-image-controller  created default backing image manager backing-image-manager-7298-dcaa in disk dcaa228c-53f1-4986-9105-b28bd3b291b1 on node it2-node3
  Normal  Create  9m6s   longhorn-backing-image-controller  created default backing image manager backing-image-manager-7298-9517 in disk 9517cb60-9d84-4980-8fc6-916797bc1b2a on node it2-node2
  Normal  Update  8m40s  longhorn-backing-image-controller  Set size to 534431744
  Normal  Delete  7m8s   longhorn-backing-image-controller  delete old backing image manager backing-image-manager-7298-809b in disk 809baab6-0308-4b0b-9dc5-4528090c8185 on node it2-node1
  Normal  Delete  7m7s   longhorn-backing-image-controller  delete old backing image manager backing-image-manager-7298-9517 in disk 9517cb60-9d84-4980-8fc6-916797bc1b2a on node it2-node2
  Normal  Delete  7m7s   longhorn-backing-image-controller  delete old backing image manager backing-image-manager-7298-dcaa in disk dcaa228c-53f1-4986-9105-b28bd3b291b1 on node it2-node3
  Normal  Create  6m53s  longhorn-backing-image-controller  created default backing image manager backing-image-manager-03a6-809b in disk 809baab6-0308-4b0b-9dc5-4528090c8185 on node it2-node1
  Normal  Create  6m53s  longhorn-backing-image-controller  created default backing image manager backing-image-manager-03a6-9517 in disk 9517cb60-9d84-4980-8fc6-916797bc1b2a on node it2-node2
  Normal  Create  6m53s  longhorn-backing-image-controller  created default backing image manager backing-image-manager-03a6-dcaa in disk dcaa228c-53f1-4986-9105-b28bd3b291b1 on node it2-node3

This also results in the volume being stuck at attaching.

NAME          STATE       ROBUSTNESS   SCHEDULED   SIZE         NODE        AGE
bi-vol-v113   attaching   unknown                  1073741824   it2-node2   10m

The BI with an empty URL was first reported in #3510. Maybe we can track this in that earlier issue? cc @shuo-wu

Yeah, another good finding from @kaxing

As @derekbit mentioned, conversion was improved after 1.18, so even if we use a recent helm v2/v3 that adopts the newer k8s API library with the fix, we will still hit the issue on clusters running 1.18 or older.

Since we still support 1.18, the temporary solution is the one @derekbit mentioned, but we need to revert it when we raise the minimum supported version above 1.18.

@derekbit there is a different set of errors coming up when upgrading with helm. (testing with today’s master-head image *)

Same testing steps, but installed with helm3 and then upgraded to master-head via the local chart (longhorn repo).

$ helm upgrade longhorn ./chart -n longhorn-system
Error: UPGRADE FAILED: cannot patch "nodes.longhorn.io" with kind CustomResourceDefinition:  "" is invalid: patch: Invalid value: "map[metadata:map[annotations:map[controller-gen.kubebuilder.io/version:v0.7.0] creationTimestamp:<nil> labels:map[app.kubernetes.io/version:v1.3.0-dev helm.sh/chart:longhorn-1.3.0-dev longhorn-manager:]] spec:map[conversion:map[strategy:Webhook webhook:map[clientConfig:map[service:map[name:longhorn-conversion-webhook namespace:longhorn-system path:/v1/webhook/conversion port:9443]] conversionReviewVersions:[v1beta2 v1beta1]]] versions:[map[additionalPrinterColumns:[map[description:Indicate whether the node is ready jsonPath:.status.conditions['Ready']['status'] name:Ready type:string] map[description:Indicate whether the user disabled/enabled replica scheduling for the node jsonPath:.spec.allowScheduling name:AllowScheduling type:boolean] map[description:Indicate whether Longhorn can schedule replicas on the node jsonPath:.status.conditions['Schedulable']['status'] name:Schedulable type:string] map[jsonPath:.metadata.creationTimestamp name:Age type:date]] name:v1beta1 schema:map[openAPIV3Schema:map[description:Node is where Longhorn stores Longhorn node object. properties:map[apiVersion:map[description:APIVersion defines the versioned schema of this representation of an object. Servers should convert recognized schemas to the latest internal value, and may reject unrecognized values. More info: https://git.k8s.io/community/contributors/devel/sig-architecture/api-conventions.md#resources type:string] kind:map[description:Kind is a string value representing the REST resource this object represents. Servers may infer this from the endpoint the client submits requests to. Cannot be updated. In CamelCase. 
More info: https://git.k8s.io/community/contributors/devel/sig-architecture/api-conventions.md#types-kinds type:string] metadata:map[type:object] spec:map[x-kubernetes-preserve-unknown-fields:true] status:map[x-kubernetes-preserve-unknown-fields:true]] type:object]] served:true storage:false subresources:map[status:map[]]] map[additionalPrinterColumns:[map[description:Indicate whether the node is ready jsonPath:.status.conditions[?(@.type=='Ready')].status name:Ready type:string] map[description:Indicate whether the user disabled/enabled replica scheduling for the node jsonPath:.spec.allowScheduling name:AllowScheduling type:boolean] map[description:Indicate whether Longhorn can schedule replicas on the node jsonPath:.status.conditions[?(@.type=='Schedulable')].status name:Schedulable type:string] map[jsonPath:.metadata.creationTimestamp name:Age type:date]] name:v1beta2 schema:map[openAPIV3Schema:map[description:Node is where Longhorn stores Longhorn node object. properties:map[apiVersion:map[description:APIVersion defines the versioned schema of this representation of an object. Servers should convert recognized schemas to the latest internal value, and may reject unrecognized values. More info: https://git.k8s.io/community/contributors/devel/sig-architecture/api-conventions.md#resources type:string] kind:map[description:Kind is a string value representing the REST resource this object represents. Servers may infer this from the endpoint the client submits requests to. Cannot be updated. In CamelCase. 
More info: https://git.k8s.io/community/contributors/devel/sig-architecture/api-conventions.md#types-kinds type:string] metadata:map[type:object] spec:map[description:NodeSpec defines the desired state of the Longhorn node properties:map[allowScheduling:map[type:boolean] disks:map[additionalProperties:map[properties:map[allowScheduling:map[type:boolean] evictionRequested:map[type:boolean] path:map[type:string] storageReserved:map[format:int64 type:integer] tags:map[items:map[type:string] type:array]] type:object] type:object] engineManagerCPURequest:map[minimum:0 type:integer] evictionRequested:map[type:boolean] name:map[type:string] replicaManagerCPURequest:map[minimum:0 type:integer] tags:map[items:map[type:string] type:array]] type:object] status:map[description:NodeStatus defines the observed state of the Longhorn node properties:map[conditions:map[items:map[properties:map[lastProbeTime:map[description:Last time we probed the condition. type:string] lastTransitionTime:map[description:Last time the condition transitioned from one status to another. type:string] message:map[description:Human-readable message indicating details about last transition. type:string] reason:map[description:Unique, one-word, CamelCase reason for the condition's last transition. type:string] status:map[description:Status is the status of the condition. Can be True, False, Unknown. type:string] type:map[description:Type is the type of the condition. type:string]] type:object] nullable:true type:array] diskStatus:map[additionalProperties:map[properties:map[conditions:map[items:map[properties:map[lastProbeTime:map[description:Last time we probed the condition. type:string] lastTransitionTime:map[description:Last time the condition transitioned from one status to another. type:string] message:map[description:Human-readable message indicating details about last transition. type:string] reason:map[description:Unique, one-word, CamelCase reason for the condition's last transition. 
type:string] status:map[description:Status is the status of the condition. Can be True, False, Unknown. type:string] type:map[description:Type is the type of the condition. type:string]] type:object] nullable:true type:array] diskUUID:map[type:string] scheduledReplica:map[additionalProperties:map[format:int64 type:integer] nullable:true type:object] storageAvailable:map[format:int64 type:integer] storageMaximum:map[format:int64 type:integer] storageScheduled:map[format:int64 type:integer]] type:object] nullable:true type:object] region:map[type:string] zone:map[type:string]] type:object]] type:object]] served:true storage:true subresources:map[status:map[]]]]] status:map[acceptedNames:map[kind: plural:] conditions:[] storedVersions:[]]]": cannot convert int64 to float64 && cannot patch "volumes.longhorn.io" with kind CustomResourceDefinition:  "" is invalid: patch: Invalid value: "map[metadata:map[annotations:map[controller-gen.kubebuilder.io/version:v0.7.0] creationTimestamp:<nil> labels:map[app.kubernetes.io/version:v1.3.0-dev helm.sh/chart:longhorn-1.3.0-dev longhorn-manager:]] spec:map[conversion:map[strategy:Webhook webhook:map[clientConfig:map[service:map[name:longhorn-conversion-webhook namespace:longhorn-system path:/v1/webhook/conversion port:9443]] conversionReviewVersions:[v1beta2 v1beta1]]] versions:[map[additionalPrinterColumns:[map[description:The state of the volume jsonPath:.status.state name:State type:string] map[description:The robustness of the volume jsonPath:.status.robustness name:Robustness type:string] map[description:The scheduled condition of the volume jsonPath:.status.conditions['scheduled']['status'] name:Scheduled type:string] map[description:The size of the volume jsonPath:.spec.size name:Size type:string] map[description:The node that the volume is currently attaching to jsonPath:.status.currentNodeID name:Node type:string] map[jsonPath:.metadata.creationTimestamp name:Age type:date]] name:v1beta1 
schema:map[openAPIV3Schema:map[description:Volume is where Longhorn stores volume object. properties:map[apiVersion:map[description:APIVersion defines the versioned schema of this representation of an object. Servers should convert recognized schemas to the latest internal value, and may reject unrecognized values. More info: https://git.k8s.io/community/contributors/devel/sig-architecture/api-conventions.md#resources type:string] kind:map[description:Kind is a string value representing the REST resource this object represents. Servers may infer this from the endpoint the client submits requests to. Cannot be updated. In CamelCase. More info: https://git.k8s.io/community/contributors/devel/sig-architecture/api-conventions.md#types-kinds type:string] metadata:map[type:object] spec:map[x-kubernetes-preserve-unknown-fields:true] status:map[x-kubernetes-preserve-unknown-fields:true]] type:object]] served:true storage:false subresources:map[status:map[]]] map[additionalPrinterColumns:[map[description:The state of the volume jsonPath:.status.state name:State type:string] map[description:The robustness of the volume jsonPath:.status.robustness name:Robustness type:string] map[description:The scheduled condition of the volume jsonPath:.status.conditions[?(@.type=='Schedulable')].status name:Scheduled type:string] map[description:The size of the volume jsonPath:.spec.size name:Size type:string] map[description:The node that the volume is currently attaching to jsonPath:.status.currentNodeID name:Node type:string] map[jsonPath:.metadata.creationTimestamp name:Age type:date]] name:v1beta2 schema:map[openAPIV3Schema:map[description:Volume is where Longhorn stores volume object. properties:map[apiVersion:map[description:APIVersion defines the versioned schema of this representation of an object. Servers should convert recognized schemas to the latest internal value, and may reject unrecognized values. 
More info: https://git.k8s.io/community/contributors/devel/sig-architecture/api-conventions.md#resources type:string] kind:map[description:Kind is a string value representing the REST resource this object represents. Servers may infer this from the endpoint the client submits requests to. Cannot be updated. In CamelCase. More info: https://git.k8s.io/community/contributors/devel/sig-architecture/api-conventions.md#types-kinds type:string] metadata:map[type:object] spec:map[description:VolumeSpec defines the desired state of the Longhorn volume properties:map[Standby:map[type:boolean] accessMode:map[enum:[rwo rwx] type:string] backingImage:map[type:string] baseImage:map[description:Deprecated. Rename to BackingImage type:string] dataLocality:map[enum:[disabled best-effort] type:string] dataSource:map[type:string] disableFrontend:map[type:boolean] diskSelector:map[items:map[type:string] type:array] encrypted:map[type:boolean] engineImage:map[type:string] fromBackup:map[type:string] frontend:map[enum:[blockdev iscsi ] type:string] lastAttachedBy:map[type:string] migratable:map[type:boolean] migrationNodeID:map[type:string] nodeID:map[type:string] nodeSelector:map[items:map[type:string] type:array] numberOfReplicas:map[minimum:1 type:integer] recurringJobs:map[description:Deprecated. Replaced by a separate resource named \"RecurringJob\" items:map[description:VolumeRecurringJobSpec is a deprecated struct. TODO: Should be removed when recurringJobs gets removed from the volume       spec. 
properties:map[concurrency:map[type:integer] cron:map[type:string] groups:map[items:map[type:string] type:array] labels:map[additionalProperties:map[type:string] type:object] name:map[type:string] retain:map[type:integer] task:map[enum:[snapshot backup] type:string]] type:object] type:array] replicaAutoBalance:map[enum:[ignored disabled least-effort best-effort] type:string] revisionCounterDisabled:map[type:boolean] size:map[format:int64 type:string] staleReplicaTimeout:map[type:integer]] type:object] status:map[description:VolumeStatus defines the observed state of the Longhorn volume properties:map[actualSize:map[format:int64 type:integer] cloneStatus:map[properties:map[snapshot:map[type:string] sourceVolume:map[type:string] state:map[type:string]] type:object] conditions:map[items:map[properties:map[lastProbeTime:map[description:Last time we probed the condition. type:string] lastTransitionTime:map[description:Last time the condition transitioned from one status to another. type:string] message:map[description:Human-readable message indicating details about last transition. type:string] reason:map[description:Unique, one-word, CamelCase reason for the condition's last transition. type:string] status:map[description:Status is the status of the condition. Can be True, False, Unknown. type:string] type:map[description:Type is the type of the condition. 
type:string]] type:object] nullable:true type:array] currentImage:map[type:string] currentNodeID:map[type:string] expansionRequired:map[type:boolean] frontendDisabled:map[type:boolean] isStandby:map[type:boolean] kubernetesStatus:map[properties:map[lastPVCRefAt:map[type:string] lastPodRefAt:map[type:string] namespace:map[description:determine if PVC/Namespace is history or not type:string] pvName:map[type:string] pvStatus:map[type:string] pvcName:map[type:string] workloadsStatus:map[description:determine if Pod/Workload is history or not items:map[properties:map[podName:map[type:string] podStatus:map[type:string] workloadName:map[type:string] workloadType:map[type:string]] type:object] nullable:true type:array]] type:object] lastBackup:map[type:string] lastBackupAt:map[type:string] lastDegradedAt:map[type:string] ownerID:map[type:string] pendingNodeID:map[type:string] remountRequestedAt:map[type:string] restoreInitiated:map[type:boolean] restoreRequired:map[type:boolean] robustness:map[type:string] shareEndpoint:map[type:string] shareState:map[type:string] state:map[type:string]] type:object]] type:object]] served:true storage:true subresources:map[status:map[]]]]] status:map[acceptedNames:map[kind: plural:] conditions:[] storedVersions:[]]]": cannot convert int64 to float64

The upgrade can be finished if Longhorn is installed from the manifest file, but the backing image that is still being processed will fail. Logs from backing-image-manager-*:

time="2022-03-31T09:18:09Z" level=info msg="Backing Image Manager listening to 0.0.0.0:8000"
time="2022-03-31T09:18:16Z" level=info msg="Backing Image Manager: prepare to start backing image update watch" component=backing-image-manager
time="2022-03-31T09:18:16Z" level=info msg="Backing Image Manager: backing image update watch started" component=backing-image-manager
time="2022-03-31T09:18:16Z" level=info msg="Backing Image Manager: prepare to fetch backing image" backingImage=bi-v113 component=backing-image-manager sourceFileName=
time="2022-03-31T09:18:16Z" level=info msg="Backing Image: start to fetch backing image" component=backing-image expectedChecksum= name=bi-v113 size=534431744 uuid=a3a257a1 workDir=/data/backing-images/bi-v113-a3a257a1
time="2022-03-31T09:18:16Z" level=error msg="Backing Image: failed to process backing image file: failed to reuse file via the fetch call, then reset the work directory and exited: stat /data/backing-images/bi-v113-a3a257a1/backing: no such file or directory" component=backing-image expectedChecksum= name=bi-v113 size=534431744 uuid=a3a257a1 workDir=/data/backing-images/bi-v113-a3a257a1
time="2022-03-31T09:18:16Z" level=error msg="Backing Image Manager: failed to start fetching backing image" backingImage=bi-v113 component=backing-image-manager error="failed to fetch backing image: failed to reuse file via the fetch call, then reset the work directory and exited: stat /data/backing-images/bi-v113-a3a257a1/backing: no such file or directory" sourceFileName=

* longhornio/longhorn-manager@sha256:121ec7cdf95f3f702f1d0a41436aab1b1150255ae36302304ea20687ebf0bf7a

On the other hand, if we move spec.conversion=Webhook and the related fields to the CRD manifest, would that help with this issue? Can the webhook services start after the CRD manifest is applied?

I set spec.conversion=Webhook and caBundle in the CRD manifest manually. The original upgrade strategy (applying 01-prerequisite and 02-components) works well.

So the problem becomes how to deal with caBundle.

cc. @jenting @shuo-wu @PhanLe1010

Summary of the discussion with @PhanLe1010

The CRD manifest does not include spec.conversion=Webhook and the related fields. Thus, after the CRD manifest is deployed, the apiserver assumes that v1beta1 and v1beta2 have the same schema and only updates the apiVersion. This is the root cause of https://github.com/longhorn/longhorn/issues/3631#issuecomment-1055640716.

Ref: https://kubernetes.io/docs/tasks/extend-kubernetes/custom-resources/custom-resource-definition-versioning/#webhook-conversion

If there are no schema changes, the default None conversion strategy may be used and only the apiVersion field will be modified when serving different versions.
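The effect of the None strategy on this bug can be illustrated with a small simulation. This is only a sketch and not Longhorn or apiserver code: `none_convert` and `decode_v1beta2_disks` are hypothetical helpers, and the assumption that v1beta2 expects string values in `spec.disks` is inferred from the `ReadString: expects " or n, but found {` error in the logs.

```python
import copy


def none_convert(obj, target_api_version):
    """Mimic the None strategy: only apiVersion changes, the body is untouched."""
    converted = copy.deepcopy(obj)
    converted["apiVersion"] = target_api_version
    return converted


def decode_v1beta2_disks(obj):
    """Strictly decode spec.disks, assuming v1beta2 expects string values."""
    disks = obj.get("spec", {}).get("disks", {})
    for disk_id, value in disks.items():
        if not isinstance(value, str):
            raise TypeError(
                f"spec.disks[{disk_id!r}]: expected string, found {type(value).__name__}"
            )
    return disks


# A backing image stored as v1beta1, with empty objects as disk values
# (the shape shown in the bug description).
stored = {
    "apiVersion": "longhorn.io/v1beta1",
    "kind": "BackingImage",
    "spec": {"disks": {"1e3408a7-3318-4244-ba28-7c4cdc302add": {}}},
}

# Served as v1beta2 without a real conversion: the body keeps its old shape.
served = none_convert(stored, "longhorn.io/v1beta2")

try:
    decode_v1beta2_disks(served)
    decode_error = None
except TypeError as exc:
    decode_error = str(exc)
```

The object is labeled v1beta2 but still carries the v1beta1 body, so a typed v1beta2 client rejects it, which matches the list failures seen in the webhook logs.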

Later, spec.conversion=Webhook and the related fields are set by the conversion webhook, but the resources that already carry the incorrect apiVersion cannot be converted after that update. https://github.com/longhorn/longhorn-manager/blob/d3df3a36542b36e9b5d73d7ff692fa083bbaeeec/webhook/server/server.go#L218

So, a feasible sequence of upgrade steps would be:

  1. Apply the manifests in 01-prerequisite except for 01-prerequisite/03-crd.yaml.
  2. Start the webhook services by applying 02-components/02-backend.yaml and 02-components/05-webhook.yaml.
  3. Once the webhook services are running, apply 01-prerequisite/03-crd.yaml and the remaining manifests in 02-components.
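The ordering in these steps can be sketched as a small planning function. This is an illustration only: `plan_apply_order` is a hypothetical helper, and any manifest names other than the three mentioned in the steps are placeholders. It groups manifests into phases and does not apply anything or wait for the webhook pods to become ready.

```python
CRD_MANIFEST = "01-prerequisite/03-crd.yaml"
WEBHOOK_MANIFESTS = (
    "02-components/02-backend.yaml",
    "02-components/05-webhook.yaml",
)


def plan_apply_order(manifests):
    """Group manifest paths into the three phases described in the steps."""
    # Phase 1: every prerequisite except the CRD manifest.
    prerequisites = [
        m for m in manifests
        if m.startswith("01-prerequisite/") and m != CRD_MANIFEST
    ]
    # Phase 2: the manifests that bring up the webhook services.
    webhooks = [m for m in manifests if m in WEBHOOK_MANIFESTS]
    # Phase 3: the CRD manifest, then the remaining components.
    crd = [m for m in manifests if m == CRD_MANIFEST]
    remaining = [
        m for m in manifests
        if m.startswith("02-components/") and m not in WEBHOOK_MANIFESTS
    ]
    return [prerequisites, webhooks, crd + remaining]


# Hypothetical file list; only 03-crd.yaml, 02-backend.yaml and
# 05-webhook.yaml are names taken from the discussion.
phases = plan_apply_order([
    "01-prerequisite/01-example.yaml",
    "01-prerequisite/03-crd.yaml",
    "02-components/01-example.yaml",
    "02-components/02-backend.yaml",
    "02-components/05-webhook.yaml",
])
```

A real upgrade would still need a readiness check on the webhook services between the second and third phase.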

The obstacle with this solution is how to guarantee the order in which the manifests are applied.


On the other hand, if we move spec.conversion=Webhook and the related fields to the CRD manifest, would that help with this issue? Can the webhook services start after the CRD manifest is applied?

cc. @PhanLe1010 @jenting @shuo-wu @innobead

Then we still need to rely on Longhorn components to update the CRD YAML deployed by users. As I mentioned above, this may cause some issues. This is similar to the case where Longhorn gave up updating tolerations & selectors for some components. For more details, you can check the related PR: https://github.com/longhorn/longhorn-manager/pull/842

The conversion webhook server updates the CRD spec when it starts up, so we could consider updating the storage version in the conversion webhook server as well.

Do you mean asking users to apply YAML files one by one, step by step, during the upgrade? I don’t think adding complexity to the upgrade is a good idea. Besides, having longhorn manager pods mutate the YAML deployed by users (the CRD) may cause other issues as well. Some users rely on tools that monitor whether the deployed YAML has been changed by someone else and raise errors if it has.

No, we don’t need to ask users to apply YAML files one by one, step by step. In the CRD manifests, the storage field of v1beta1 and v1beta2 will be set to true and false, respectively. Users can upgrade Longhorn as usual. When longhorn-manager starts, the daemon updates the storage field of v1beta1 and v1beta2 in the CRDs to false and true. Do you think that makes sense?
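The storage-flag flip proposed here can be sketched as a pure function over the `versions` list of a CRD. This is a minimal illustration and not the longhorn-manager implementation: `promote_storage_version` is a hypothetical name, and the dictionaries only model the `name`, `served` and `storage` fields that appear in the CRD dumps.

```python
def promote_storage_version(versions, target="v1beta2"):
    """Return a copy of the CRD versions with only `target` marked as storage."""
    names = [v["name"] for v in versions]
    if target not in names:
        raise ValueError(f"version {target!r} not found in {names}")
    # Exactly one version may have storage=true, so set every flag explicitly.
    return [dict(v, storage=(v["name"] == target)) for v in versions]


# State shipped in the manifest under this proposal: v1beta1 is the storage version.
manifest_versions = [
    {"name": "v1beta1", "served": True, "storage": True},
    {"name": "v1beta2", "served": True, "storage": False},
]

# State after the daemon has started and flipped the flags.
updated_versions = promote_storage_version(manifest_versions)
```

Returning a new list leaves the manifest-side data untouched, which makes the before and after states easy to compare.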

It seems like a promising solution. Have you tried it?

Yes, I simulated it by following those steps. It works as expected.

@derekbit the abnormal resource is the backing image. And yes, if the backing image download has not finished before the upgrade process starts, I can reproduce this consistently.