longhorn: [BUG] Unable to finish install or upgrade when `Failed to list *v1beta2.Node: v1beta2.NodeList.Items`
Describe the bug
This happens when Backing Image’s disk status is empty.
spec:
  disks:
    1e3408a7-3318-4244-ba28-7c4cdc302add: {}
    b16a8aa2-63d8-452f-baf5-eded5a2525e7: {}
    c192b677-fd0d-42e1-9021-78ea70904daf: {}
To Reproduce
There are no specific steps, but one possible sequence is:
- Fresh install v1.1.3
- Create backing image
- Upgrade to master-head before the first backing image download finishes
- See error
Expected behavior
Not sure what would be a better behavior to expect here, but since the longhorn-manager pods are still waiting for the longhorn-admission-webhook pods, users won't be able to do much at this point.
Log or Support bundle
Expand to see the logs.
longhorn-admission-webhook:
longhorn-admission-webhook 2022/02/19 03:44:29 proto: duplicate proto type registered: VersionResponse
longhorn-admission-webhook time="2022-02-19T03:44:29Z" level=info msg="Starting longhorn admission webhook server"
longhorn-admission-webhook W0219 03:44:29.462537 1 client_config.go:552] Neither --kubeconfig nor --master was specified. Using the inClusterConfig. This might not work.
longhorn-admission-webhook I0219 03:44:29.465992 1 shared_informer.go:223] Waiting for caches to sync for longhorn datastore
longhorn-admission-webhook E0219 03:44:30.478473 1 reflector.go:178] github.com/longhorn/longhorn-manager/k8s/pkg/client/informers/externalversions/factory.go:117: Failed to list *v1beta2.BackingImage: v1beta2.BackingImageList.Items: []v1beta2.BackingImage: v1beta2.BackingImage.Spec: v1beta2.BackingImageSpec.Disks: ReadString: expects " or n, but found {, error found in #10 byte of ...|c302add":{},"b16a8aa|..., bigger context ...|{"disks":{"1e3408a7-3318-4244-ba28-7c4cdc302add":{},"b16a8aa2-63d8-452f-baf5-eded5a2525e7":{},"c192b|...
longhorn-admission-webhook E0219 03:44:31.734882 1 reflector.go:178] github.com/longhorn/longhorn-manager/k8s/pkg/client/informers/externalversions/factory.go:117: Failed to list *v1beta2.BackingImage: v1beta2.BackingImageList.Items: []v1beta2.BackingImage: v1beta2.BackingImage.Spec: v1beta2.BackingImageSpec.Disks: ReadString: expects " or n, but found {, error found in #10 byte of ...|c302add":{},"b16a8aa|..., bigger context ...|{"disks":{"1e3408a7-3318-4244-ba28-7c4cdc302add":{},"b16a8aa2-63d8-452f-baf5-eded5a2525e7":{},"c192b|...
longhorn-admission-webhook E0219 03:44:34.519957 1 reflector.go:178] github.com/longhorn/longhorn-manager/k8s/pkg/client/informers/externalversions/factory.go:117: Failed to list *v1beta2.BackingImage: v1beta2.BackingImageList.Items: []v1beta2.BackingImage: v1beta2.BackingImage.Spec: v1beta2.BackingImageSpec.Disks: ReadString: expects " or n, but found {, error found in #10 byte of ...|c302add":{},"b16a8aa|..., bigger context ...|{"disks":{"1e3408a7-3318-4244-ba28-7c4cdc302add":{},"b16a8aa2-63d8-452f-baf5-eded5a2525e7":{},"c192b|...
longhorn-admission-webhook E0219 03:44:39.318892 1 reflector.go:178] github.com/longhorn/longhorn-manager/k8s/pkg/client/informers/externalversions/factory.go:117: Failed to list *v1beta2.BackingImage: v1beta2.BackingImageList.Items: []v1beta2.BackingImage: v1beta2.BackingImage.Spec: v1beta2.BackingImageSpec.Disks: ReadString: expects " or n, but found {, error found in #10 byte of ...|c302add":{},"b16a8aa|..., bigger context ...|{"disks":{"1e3408a7-3318-4244-ba28-7c4cdc302add":{},"b16a8aa2-63d8-452f-baf5-eded5a2525e7":{},"c192b|...
longhorn-admission-webhook E0219 03:44:47.517024 1 reflector.go:178] github.com/longhorn/longhorn-manager/k8s/pkg/client/informers/externalversions/factory.go:117: Failed to list *v1beta2.BackingImage: v1beta2.BackingImageList.Items: []v1beta2.BackingImage: v1beta2.BackingImage.Spec: v1beta2.BackingImageSpec.Disks: ReadString: expects " or n, but found {, error found in #10 byte of ...|c302add":{},"b16a8aa|..., bigger context ...|{"disks":{"1e3408a7-3318-4244-ba28-7c4cdc302add":{},"b16a8aa2-63d8-452f-baf5-eded5a2525e7":{},"c192b|...
longhorn-admission-webhook E0219 03:45:00.984511 1 reflector.go:178] github.com/longhorn/longhorn-manager/k8s/pkg/client/informers/externalversions/factory.go:117: Failed to list *v1beta2.BackingImage: v1beta2.BackingImageList.Items: []v1beta2.BackingImage: v1beta2.BackingImage.Spec: v1beta2.BackingImageSpec.Disks: ReadString: expects " or n, but found {, error found in #10 byte of ...|c302add":{},"b16a8aa|..., bigger context ...|{"disks":{"1e3408a7-3318-4244-ba28-7c4cdc302add":{},"b16a8aa2-63d8-452f-baf5-eded5a2525e7":{},"c192b|...
longhorn-admission-webhook E0219 03:45:47.355421 1 reflector.go:178] github.com/longhorn/longhorn-manager/k8s/pkg/client/informers/externalversions/factory.go:117: Failed to list *v1beta2.BackingImage: v1beta2.BackingImageList.Items: []v1beta2.BackingImage: v1beta2.BackingImage.Spec: v1beta2.BackingImageSpec.Disks: ReadString: expects " or n, but found {, error found in #10 byte of ...|c302add":{},"b16a8aa|..., bigger context ...|{"disks":{"1e3408a7-3318-4244-ba28-7c4cdc302add":{},"b16a8aa2-63d8-452f-baf5-eded5a2525e7":{},"c192b|...
longhorn-admission-webhook E0219 03:46:40.122238 1 reflector.go:178] github.com/longhorn/longhorn-manager/k8s/pkg/client/informers/externalversions/factory.go:117: Failed to list *v1beta2.BackingImage: v1beta2.BackingImageList.Items: []v1beta2.BackingImage: v1beta2.BackingImage.Spec: v1beta2.BackingImageSpec.Disks: ReadString: expects " or n, but found {, error found in #10 byte of ...|c302add":{},"b16a8aa|..., bigger context ...|{"disks":{"1e3408a7-3318-4244-ba28-7c4cdc302add":{},"b16a8aa2-63d8-452f-baf5-eded5a2525e7":{},"c192b|...
Stream closed EOF for longhorn-system/longhorn-admission-webhook-57c4747b5d-phh8n (wait-longhorn-conversion-webhook)
longhorn-conversion-webhook:
2022/02/19 03:44:09 proto: duplicate proto type registered: VersionResponse
time="2022-02-19T03:44:09Z" level=info msg="Starting longhorn conversion webhook server"
W0219 03:44:09.498153 1 client_config.go:552] Neither --kubeconfig nor --master was specified. Using the inClusterConfig. This might not work.
time="2022-02-19T03:44:09Z" level=warning msg="Failed to init Kubernetes secret: secrets \"longhorn-webhook-tls\" not found"
time="2022-02-19T03:44:09Z" level=info msg="generated self-signed CA certificate CN=dynamiclistener-ca,O=dynamiclistener-org: notBefore=2022-02-19 03:44:09.537432722 +0000 UTC notAfter=2032-02-17 03:44:09.537432722 +0000 UTC"
time="2022-02-19T03:44:09Z" level=info msg="Listening on :9443"
time="2022-02-19T03:44:09Z" level=info msg="certificate CN=dynamic,O=dynamic signed by CN=dynamiclistener-ca,O=dynamiclistener-org: notBefore=2022-02-19 03:44:09 +0000 UTC notAfter=2023-02-19 03:44:09 +0000 UTC"
time="2022-02-19T03:44:09Z" level=info msg="Creating new TLS secret for longhorn-webhook-tls (count: 1): map[listener.cattle.io/cn-longhorn-conversion-webhook.longho-6a0089:longhorn-conversion-webhook.longhorn-system.svc listener.cattle.io/fingerprint:SHA1=AD6C82905AA34BFEF2E984609A95E98FCEAAB468]"
time="2022-02-19T03:44:09Z" level=info msg="Active TLS secret longhorn-webhook-tls (ver=4924) (count 1): map[listener.cattle.io/cn-longhorn-conversion-webhook.longho-6a0089:longhorn-conversion-webhook.longhorn-system.svc listener.cattle.io/fingerprint:SHA1=AD6C82905AA34BFEF2E984609A95E98FCEAAB468]"
time="2022-02-19T03:44:10Z" level=info msg="Starting /v1, Kind=Secret controller"
time="2022-02-19T03:44:10Z" level=info msg="Starting apiextensions.k8s.io/v1, Kind=CustomResourceDefinition controller"
time="2022-02-19T03:44:10Z" level=info msg="Building conversion rules..."
time="2022-02-19T03:44:10Z" level=info msg="Starting apiregistration.k8s.io/v1, Kind=APIService controller"
time="2022-02-19T03:44:10Z" level=info msg="Updating TLS secret for longhorn-webhook-tls (count: 1): map[listener.cattle.io/cn-longhorn-conversion-webhook.longho-6a0089:longhorn-conversion-webhook.longhorn-system.svc listener.cattle.io/fingerprint:SHA1=AD6C82905AA34BFEF2E984609A95E98FCEAAB468]"
time="2022-02-19T03:44:10Z" level=info msg="Update CRD for backingimages.longhorn.io"
time="2022-02-19T03:44:10Z" level=info msg="Update CRD for backuptargets.longhorn.io"
time="2022-02-19T03:44:10Z" level=info msg="Update CRD for engineimages.longhorn.io"
time="2022-02-19T03:44:10Z" level=info msg="Update CRD for nodes.longhorn.io"
time="2022-02-19T03:44:10Z" level=info msg="Update CRD for volumes.longhorn.io"
time="2022-02-19T03:44:11Z" level=error msg="error decoding src object" error="v1beta2.BackingImage.Spec: v1beta2.BackingImageSpec.Disks: ReadString: expects \" or n, but found {, error found in #10 byte of ...|c302add\":{},\"b16a8aa|..., bigger context ...|{\"disks\":{\"1e3408a7-3318-4244-ba28-7c4cdc302add\":{},\"b16a8aa2-63d8-452f-baf5-eded5a2525e7\":{},\"c192b|..."
time="2022-02-19T03:44:12Z" level=error msg="error decoding src object" error="v1beta2.BackingImage.Spec: v1beta2.BackingImageSpec.Disks: ReadString: expects \" or n, but found {, error found in #10 byte of ...|c302add\":{},\"b16a8aa|..., bigger context ...|{\"disks\":{\"1e3408a7-3318-4244-ba28-7c4cdc302add\":{},\"b16a8aa2-63d8-452f-baf5-eded5a2525e7\":{},\"c192b|..."
time="2022-02-19T03:44:13Z" level=error msg="error decoding src object" error="v1beta2.BackingImage.Spec: v1beta2.BackingImageSpec.Disks: ReadString: expects \" or n, but found {, error found in #10 byte of ...|c302add\":{},\"b16a8aa|..., bigger context ...|{\"disks\":{\"1e3408a7-3318-4244-ba28-7c4cdc302add\":{},\"b16a8aa2-63d8-452f-baf5-eded5a2525e7\":{},\"c192b|..."
time="2022-02-19T03:44:14Z" level=error msg="error decoding src object" error="v1beta2.BackingImage.Spec: v1beta2.BackingImageSpec.Disks: ReadString: expects \" or n, but found {, error found in #10 byte of ...|c302add\":{},\"b16a8aa|..., bigger context ...|{\"disks\":{\"1e3408a7-3318-4244-ba28-7c4cdc302add\":{},\"b16a8aa2-63d8-452f-baf5-eded5a2525e7\":{},\"c192b|..."
Environment
- Longhorn version: v1.1.3 -> master-head
- Installation method (e.g. Rancher Catalog App/Helm/Kubectl): kubectl
- Kubernetes distro (e.g. RKE/K3s/EKS/OpenShift) and version: 1.19 k3s
- Number of management node in the cluster: 1
- Number of worker node in the cluster: 3
- Underlying Infrastructure (e.g. on AWS/GCE, EKS/GKE, VMWare/KVM, Baremetal): local vm
Additional context
Can be also reproduced on Kubernetes v1.21
About this issue
- State: closed
- Created 2 years ago
- Comments: 39 (29 by maintainers)
Commits related to this issue
- charts: add conversion to crds Longhorn #3631 Signed-off-by: Derek Su <derek.su@suse.com> — committed to derekbit/longhorn by derekbit 2 years ago
- charts: add conversion to crds Longhorn #3631 Signed-off-by: Derek Su <derek.su@suse.com> — committed to longhorn/longhorn by derekbit 2 years ago
The `minimum` and `maximum` markers in the CRD manifest introduce the error. The workaround is to move the schema validation checks into the validating webhook.
Similar issues
Root cause https://github.com/kubernetes/kubernetes/issues/87675
cc @innobead
I tried to investigate this issue more.
Experiment 1
- Upload a backing image `bbb` and check the value of the `bbb` resource in etcd.
- After the upload finishes, update the `master-head` CRD manifests.
Experiment 2
- Upload a backing image `bbb` and check the value of the `bbb` resource in etcd.
- Update the `master-head` CRD manifests while the upload is in progress.
- The `apiVersion` field became `longhorn.io/v1beta2`.
In Experiment 1, the `apiVersion` was not changed after updating the CRDs.
But in Experiment 2, `apiVersion` became `longhorn.io/v1beta2` while `Spec/Status` still followed the `v1beta1` schema. The change of `apiVersion` resulted in the decode failure in the "conversion webhook" because of the mismatch between `apiVersion` and `Spec/Status`.
A question raised by the investigation is why the `v1beta1` API Update/UpdateStatus operation can change the `apiVersion` to `v1beta2` without any error.
cc @jenting @shuo-wu @innobead
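The mismatch described above can be reproduced in isolation. The following is a minimal sketch, not the Longhorn code: `specV1beta1` and `specV1beta2` are hypothetical stand-ins, and the value types of `Disks` are assumptions inferred from the `ReadString: expects " or n, but found {` error in the logs. It uses the standard `encoding/json` package, so the error text differs from the one in the logs, but the failure mode is the same: a payload written in the old shape cannot be decoded as the new type.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// specV1beta1 is a hypothetical stand-in for the old schema:
// each disk UUID maps to an empty object.
type specV1beta1 struct {
	Disks map[string]struct{} `json:"disks"`
}

// specV1beta2 is a hypothetical stand-in for the new schema:
// each disk UUID maps to a string (assumed from the decode error).
type specV1beta2 struct {
	Disks map[string]string `json:"disks"`
}

// decodeAs tries to decode raw JSON into out and returns any decode error.
func decodeAs(raw string, out interface{}) error {
	return json.Unmarshal([]byte(raw), out)
}

func main() {
	// Payload stored under the old schema, as seen in the report.
	raw := `{"disks":{"1e3408a7-3318-4244-ba28-7c4cdc302add":{}}}`

	var oldSpec specV1beta1
	fmt.Println("decode as v1beta1:", decodeAs(raw, &oldSpec))

	// Same bytes, but the object is now labeled v1beta2, so the
	// decoder expects string values and rejects the empty objects.
	var newSpec specV1beta2
	fmt.Println("decode as v1beta2:", decodeAs(raw, &newSpec))
}
```

Running it should show a nil error for the first decode and a type error for the second, which mirrors why only resources whose `apiVersion` was rewritten during the upload break the webhooks.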
@derekbit Yes, creating new Backing Image and Volume with Backing Image will have no problem.
@derekbit with the latest fix the installation will no longer get stuck.
However, the backing image will fail, as there is no Image URL.
Expand to see the output of describe backingimage/bi-v113
It also results in the volume being stuck at attaching.
And the BI with empty URL was first reported here: #3510 Maybe we can track this in the previous issue? cc @shuo-wu
Yeah, another good finding from @kaxing
As @derekbit mentioned, there is some conversion enhancement after 1.18, so even if we use recent versions of helm v2/v3, which adopt a newer k8s API lib with the fix, we will still hit the issue on clusters running 1.18 or older.
Since we are still supporting 1.18, the temporary solution is the one @derekbit mentioned, but we need to revert it when we raise the minimum supported version above 1.18.
@derekbit there is a different set of errors coming up when upgrading with helm. (testing with today’s master-head image *)
The testing steps are the same, except that I install using helm3 and then upgrade to master-head via the local chart (longhorn repo).
The upgrade can finish if Longhorn is installed from the manifest file, but the backing image of the in-progress volume will fail.
Logs from backing-image-manager-*:
* longhornio/longhorn-manager@sha256:121ec7cdf95f3f702f1d0a41436aab1b1150255ae36302304ea20687ebf0bf7a
I set `spec.conversion=Webhook` and `caBundle` in the CRD manifest manually. The original upgrade strategy (apply `01-prerequisite` and `02-components`) works well.
So the problem becomes how to deal with `caBundle`.
cc. @jenting @shuo-wu @PhanLe1010
Summary of the discussion with @PhanLe1010
The CRD manifest does not include `spec.conversion=Webhook` and the related fields. Thus, after deploying the CRD manifest, the apiserver thinks `v1beta1` and `v1beta2` have the same schema and only updates the `apiVersion`. This is the root cause of https://github.com/longhorn/longhorn/issues/3631#issuecomment-1055640716.
Ref: https://kubernetes.io/docs/tasks/extend-kubernetes/custom-resources/custom-resource-definition-versioning/#webhook-conversion
Then, `spec.conversion=Webhook` and the related fields are updated in the conversion webhook, but the resources with an incorrect `apiVersion` are not convertible after the update. https://github.com/longhorn/longhorn-manager/blob/d3df3a36542b36e9b5d73d7ff692fa083bbaeeec/webhook/server/server.go#L218
So, the feasible upgrade steps can be
- Apply `01-prerequisite` except for `01-prerequisite/03-crd.yaml`
- Apply `02-components/02-backend.yaml` and `02-components/05-webhook.yaml`.
- Apply `01-prerequisite/03-crd.yaml` and the remaining manifests in `02-components`
The hindrance of this solution is how to guarantee the order of applying the manifests.
On the other hand, if we move `spec.conversion=Webhook` and the related fields to the CRD manifest, is it helpful for this issue? Can the webhook services start after applying the CRD manifest?
cc. @PhanLe1010 @jenting @shuo-wu @innobead
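To make the role of the conversion step in the summary above concrete, here is a minimal sketch of the kind of translation that has to happen before an old object can be served as `v1beta2`. It is an illustration under assumptions, not the Longhorn webhook: `convertDisks` is a hypothetical helper, the two map types are inferred from the decode error in the logs, and the empty string written for each disk is a placeholder because this thread does not show what the real conversion stores.

```go
package main

import (
	"fmt"
	"sort"
)

// convertDisks is a hypothetical helper that rewrites the old
// disk map (UUID -> empty object) into the assumed new shape
// (UUID -> string). The empty string is a placeholder value.
func convertDisks(old map[string]struct{}) map[string]string {
	converted := make(map[string]string, len(old))
	for uuid := range old {
		converted[uuid] = ""
	}
	return converted
}

func main() {
	old := map[string]struct{}{
		"1e3408a7-3318-4244-ba28-7c4cdc302add": {},
		"b16a8aa2-63d8-452f-baf5-eded5a2525e7": {},
	}

	converted := convertDisks(old)

	// Sort the keys so the printed order is stable.
	uuids := make([]string, 0, len(converted))
	for uuid := range converted {
		uuids = append(uuids, uuid)
	}
	sort.Strings(uuids)
	for _, uuid := range uuids {
		fmt.Printf("%s -> %q\n", uuid, converted[uuid])
	}
}
```

If the apiserver only rewrites `apiVersion` because no conversion strategy is registered yet, a step like this never runs, and the stored payload keeps the old shape under the new label.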
Then we still need to rely on Longhorn components to update the CRD YAML deployed by users. As I mentioned above, this may cause some issues. This is similar to the case Longhorn gives up updating tolerations & selectors for some components. For more details, you can check the related PR: https://github.com/longhorn/longhorn-manager/pull/842
The conversion webhook server updates the CRD spec when the server comes up; we could consider updating the storage version in the conversion webhook server as well.
Yes, I tried to simulate it by following the steps. It works as expected.
@derekbit the abnormal resource is the backing image. And yes, if the backing image download is not finished before the upgrade process, I can reproduce this consistently.