longhorn: [BUG] Upgrade engine --> spec.restoreVolumeRecurringJob and spec.snapshotDataIntegrity Unsupported value

Describe the bug (🐛 if you encounter this issue)

After a migration from Longhorn 1.3.2 to Longhorn 1.4.0, I am trying to upgrade the engine of my volumes and I get the following errors:

cannot upgrade engine for volume XXXX using image rancher/mirrored-longhornio-longhorn-engine:v1.4.0:
Volume.longhorn.io "XXXX" is invalid:
spec.restoreVolumeRecurringJob: Unsupported value: "": supported values: "ignored", "enabled", "disabled",
spec.snapshotDataIntegrity: Unsupported value: "": supported values: "ignored", "disabled", "enabled", "fast-check"

When I look at the Snapshot Data Integrity and Allow snapshots removal during trim parameters, the options are empty. If I try to change the value, the error rises again.

It is the same for all volumes.

To Reproduce

Steps to reproduce the behavior:

  1. In Rancher, use Cluster Tools
  2. Edit Longhorn package
  3. Change the version from 1.3.2 to 1.4.0
  4. Click Next
  5. Click Update
  6. Wait until the update process finishes without errors
  7. Wait a while until all pods on all nodes have restarted
  8. On Longhorn UI, on each volume try to upgrade Engine Image to rancher/mirrored-longhornio-longhorn-engine:v1.4.0
  9. The following error appears: cannot upgrade engine for volume XXXX using image rancher/mirrored-longhornio-longhorn-engine:v1.4.0: Volume.longhorn.io "XXXX" is invalid: [spec.snapshotDataIntegrity: Unsupported value: "": supported values: "ignored", "disabled", "enabled", "fast-check", spec.restoreVolumeRecurringJob: Unsupported value: "": supported values: "ignored", "enabled", "disabled"]
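
A quick way to confirm which volumes carry the empty fields before attempting the engine upgrade (a sketch, assuming the default longhorn-system namespace; lhv is the short name for volumes.longhorn.io used elsewhere in this thread):

$ kubectl -n longhorn-system get lhv \
    -o custom-columns='NAME:.metadata.name,SNAPSHOT_DATA_INTEGRITY:.spec.snapshotDataIntegrity,RESTORE_VOLUME_RECURRING_JOB:.spec.restoreVolumeRecurringJob'

Volumes showing blank values in the last two columns are the ones that will fail validation.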

Expected behavior

The engine of the volume upgrades to v1.4.0.

Environment

  • Longhorn version: 1.4.0
  • Installation method (e.g. Rancher Catalog App/Helm/Kubectl): Rancher Catalog App
  • Kubernetes distro (e.g. RKE/K3s/EKS/OpenShift) and version: v1.24.9+k3s2
    • Number of management nodes in the cluster: 3
    • Number of worker nodes in the cluster: 5
  • Node config
    • OS type and version: Debian 11
    • CPU per node: AMD and ARM
    • Memory per node: 16 GB / 8 GB
    • Disk type(e.g. SSD/NVMe): SSD
    • Network bandwidth between the nodes: 10 Gbps
  • Number of Longhorn volumes in the cluster: 8

Workaround

https://github.com/longhorn/longhorn/issues/5485#issuecomment-1499639915

Additional context

After a rollback to version 1.3.2 (using the Rancher Catalog app), everything returns to a stable state.

About this issue

  • State: closed
  • Created a year ago
  • Comments: 29 (15 by maintainers)

Most upvoted comments

For example:

$ cat patch.yaml
spec:
  replicaAutoBalance: ignored
  restoreVolumeRecurringJob: ignored
  snapshotDataIntegrity: ignored
  unmapMarkSnapChainRemoved: ignored

$ kubectl patch -n longhorn-system lhv pvc-xxxxxxxxxx --patch "$(cat patch.yaml)" --type=merge
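
To apply the same patch to every volume at once (as the reporter confirms further down), a small loop over all Longhorn volumes works. This is a sketch assuming the patch.yaml above and the default longhorn-system namespace:

$ for v in $(kubectl -n longhorn-system get lhv -o name); do
    kubectl -n longhorn-system patch "$v" --patch "$(cat patch.yaml)" --type=merge
  done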

@innobead Moved. Sorry for neglecting the state change. Yeah, https://github.com/longhorn/longhorn/issues/5762 is the root cause, and that issue also makes the upgrade path more complete.

The patches are applied directly through the API server. https://github.com/longhorn/longhorn-manager/blob/v1.4.0/upgrade/v13xto140/upgrade.go#L59 It is probably caused by the listers’ delayed updates: before the volume lister is updated, an update of a volume.Spec can hit the issue.

@derekbit From the code, we use the REST API client directly, so it should not be related to the lister/delayed update?

if _, err = lhClient.LonghornV1beta2().Volumes(namespace).Update(context.TODO(), &v, metav1.UpdateOptions{}); err != nil {
    return errors.Wrapf(err, "failed to update volume %v", v.Name)
}

@innobead This is my original thought.

  1. The upgrade path updates volume.spec using the REST client API (code).
  2. The controller gets the volume from the lister (code). The lister is not updated yet, so the volume.spec is not the latest and some fields such as restoreVolumeRecurringJob are empty.
  3. The controller updates the volume via the REST client API (code). The validator complains about the empty values.

But you’re right, the issue happened in the upgrade path, so the root cause should be:

  • volume.spec is fetched and updated in v102tov110 and v111to120, but the empty values are only patched in v13xto140. Hence, the issue is hit before the later patch runs. It is still odd, though: the first user’s upgrade was from v1.3.2 to v1.4.0.

I would not close this as I think it’s a genuine bug. Those fields should either be added automatically upon upgrade, or at least the validator should pretend they’re there with the default values, instead of breaking.

Great!!! I applied the patch to all my PVCs; it works like a charm.

A big thanks

Derek means re-trying reproduce step 8 for one volume and then generating a Longhorn support bundle.

I think the webhook is enabled; how can I confirm it? (The pods are running and there are no errors in the logs.) I deleted the admission pods and they were recreated, but the issue is still there.
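
For reference, one way to confirm the webhook is registered with the API server (a sketch; the validator object name is an assumption based on a default install, while the mutator name comes from the next comment):

$ kubectl -n longhorn-system get pods | grep webhook
$ kubectl get validatingwebhookconfiguration longhorn-webhook-validator
$ kubectl get mutatingwebhookconfiguration longhorn-webhook-mutator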

Would the result of kubectl get MutatingWebhookConfiguration longhorn-webhook-mutator -o yaml help?