longhorn: [BUG] Restoring backup fails when validating volume spec

Describe the bug (🐛 if you encounter this issue)

I have periodic backups to S3. Today I thought I’d try restoring one, but it fails when validating the volume spec:

unable to create volume: unable to create volume foo: Volume.longhorn.io "foo" is invalid: [spec.snapshotDataIntegrity: Unsupported value: "": supported values: "ignored", "disabled", "enabled", "fast-check", spec.unmapMarkSnapChainRemoved: Unsupported value: "": supported values: "ignored", "disabled", "enabled", spec.dataLocality: Unsupported value: "": supported values: "disabled", "best-effort", "strict-local", spec.replicaAutoBalance: Unsupported value: "": supported values: "ignored", "disabled", "least-effort", "best-effort"]

As far as I can tell, I cannot control these values when restoring in the UI, so I’m assuming the UI passes an empty string to mean “inherit”, but the validation rejects that.
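For reference, here is a minimal sketch of what the restored volume spec would need to contain for validation to pass, with the four fields from the error set to their documented defaults instead of empty strings. This is illustrative only; the `fromBackup` URL is hypothetical and the default values are taken from the "supported values" lists in the error message:

```yaml
# Illustrative sketch, not my actual manifest: a Longhorn Volume restored
# from backup with the four rejected fields set explicitly.
apiVersion: longhorn.io/v1beta2
kind: Volume
metadata:
  name: foo
  namespace: longhorn-system
spec:
  fromBackup: "s3://backup-bucket@us-east-1/?backup=backup-xxxx&volume=foo"  # hypothetical URL
  snapshotDataIntegrity: ignored
  unmapMarkSnapChainRemoved: ignored
  dataLocality: disabled
  replicaAutoBalance: ignored
```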

To Reproduce

Steps to reproduce the behavior:

  1. Create a periodic backup job (screenshot attached)

  2. Go to the backups list for a volume and attempt to restore one (screenshot attached)

  3. Restore config I used: (screenshot attached)

  4. Observe the backup restore failure (screenshot attached)

Expected behavior

The backup to be restored.

Log or Support bundle

If applicable, add the Longhorn managers’ log or support bundle when the issue happens. You can generate a Support Bundle using the link at the footer of the Longhorn UI.

Environment

  • Longhorn version: 1.4.0
  • Installation method (e.g. Rancher Catalog App/Helm/Kubectl): Kubectl
  • Kubernetes distro (e.g. RKE/K3s/EKS/OpenShift) and version: kubeadm cluster, self-hosted
    • Number of management nodes in the cluster: 3
    • Number of worker nodes in the cluster: 4
  • Node config
    • OS type and version: Ubuntu 20.04.5 LTS
    • CPU per node: 10
    • Memory per node: 24GB
    • Disk type (e.g. SSD/NVMe): NVMe
    • Network bandwidth between the nodes: 10 Gbit
  • Underlying Infrastructure (e.g. on AWS/GCE, EKS/GKE, VMWare/KVM, Baremetal): KVM
  • Number of Longhorn volumes in the cluster: 11

Additional context

Backups are stored via Minio/S3.

About this issue

  • Original URL
  • State: closed
  • Created a year ago
  • Comments: 19 (12 by maintainers)

Most upvoted comments

Got it. It looks like the default values of the newly added spec fields in LH v1.4.0 are not set correctly.

spec.snapshotDataIntegrity
spec.unmapMarkSnapChainRemoved
spec.dataLocality

The message also complains about a wrong value for spec.replicaAutoBalance, which was added prior to v1.4.0. 😕

spec.replicaAutoBalance

Yep, I deleted the admission pods and they were recreated, and now it works. Now I feel like an idiot 😄 sorry for the bother, friends.
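For anyone hitting the same thing, the workaround described above amounts to something like the following. The label selector is an assumption for a default v1.4.0 install; check the actual pod labels in your cluster first:

```
# Restart the admission webhook pods so they pick up the v1.4.0 defaults.
# The label below is an assumption; verify with:
#   kubectl -n longhorn-system get pods --show-labels
kubectl -n longhorn-system delete pod -l app=longhorn-admission-webhook

# Confirm they were recreated before retrying the restore.
kubectl -n longhorn-system get pods -l app=longhorn-admission-webhook
```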

@derekbit I do remember we have default values covered in the mutating webhooks, so what’s missing here?

Yes, we mutate the values if they are empty strings. https://github.com/longhorn/longhorn-manager/blob/v1.4.0/webhook/resources/volume/mutator.go#L101
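The defaulting the mutator performs is essentially "replace an empty string with the field's default". A tiny shell sketch of that logic, for illustration only (the function name and defaults here are mine, not Longhorn code):

```shell
#!/bin/sh
# Illustrative sketch of webhook-style defaulting: return the default
# when the current value is an empty string, otherwise keep the value.
default_field() {
  value="$1"
  default="$2"
  if [ -z "$value" ]; then
    echo "$default"
  else
    echo "$value"
  fi
}

default_field ""            "ignored"   # -> ignored
default_field "best-effort" "ignored"   # -> best-effort
```

If the webhook pods are stale or never saw the request, this defaulting simply never runs, and the empty strings reach validation unchanged.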

Update: I created a volume and backed it up in v1.3.2, then successfully restored it in v1.4.0.

It fails even with backups created by 1.4.0, as in this case they are backups of a volume originally created by 1.3.2 (or possibly older; hard to verify).

No, the last upgrade was only the kubectl apply script as above. Not sure why the pods were not regenerated honestly. Maybe it’s time to rebuild my control plane.

Ah that I can tell you as I keep the upgrade script in version control

#!/bin/bash

# Deploy Longhorn
kubectl apply -f https://raw.githubusercontent.com/longhorn/longhorn/v1.4.0/deploy/longhorn.yaml

# Set LongHorn as default storage class (Remember to do this on each upgrade too!)
kubectl patch storageclass longhorn -p '{"metadata": {"annotations":{"storageclass.kubernetes.io/is-default-class":"true"}}}'

# Add a custom ingress to allow external prometheus to scrape metrics
kubectl apply -f metrics-ingress.yaml

Just did a quick test. The ValidatingWebhookConfiguration and MutatingWebhookConfiguration are regenerated after the upgrade. Not sure why your environment didn’t do the regeneration. 😕
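If you want to check whether your cluster regenerated them, something like this should show creation timestamps newer than the upgrade. The exact resource names are assumptions; use the first command to see what they are actually called in your install:

```
# List Longhorn's webhook configurations (names may differ per install).
kubectl get mutatingwebhookconfigurations,validatingwebhookconfigurations | grep -i longhorn

# Check when a configuration was (re)created; the name here is an assumption.
kubectl get mutatingwebhookconfiguration longhorn-webhook-mutator \
  -o jsonpath='{.metadata.creationTimestamp}'
```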


I also tried to reproduce the issue, but I couldn’t reproduce it either.

The test steps

  1. Deploy Longhorn v1.3.2
  2. Create and attach 1 volume
  3. Set up the S3 backup target
  4. Create a volume backup
  5. Create a periodic backup job
  6. Upgrade Longhorn to v1.4.0
  7. Do a live upgrade for the volume
  8. Restore the volume backup

supportbundle_4dbfe34e-2d8a-49f3-a3f1-46d63149b27b_2023-01-30T02-57-20Z.zip