longhorn: [BUG] Error when upgrading from 1.1.2 to 1.2.0 - Operation cannot be fulfilled on volumes.longhorn.io \"pvc-edf41777-589d-4806-baca-b91d0a6c0d3c\": the object has been modified; please apply your changes to the latest version and try again

Describe the bug When upgrading from 1.1.2 to 1.2.0 with preexisting volumes using Helm, I receive this error in the manager pod:

time="2021-08-31T22:26:19Z" level=error msg="Upgrade failed: upgrade CRD failed: upgrade from v1.1.0 to v1.2.0: upgrade recurring jobs failed: upgrade from v1.1.0 to v1.2.0: translate volume recurringJobs to volume labels failed: failed to update pvc-edf41777-589d-4806-baca-b91d0a6c0d3c volume: Operation cannot be fulfilled on volumes.longhorn.io \"pvc-edf41777-589d-4806-baca-b91d0a6c0d3c\": the object has been modified; please apply your changes to the latest version and try again"
time="2021-08-31T22:26:19Z" level=info msg="Upgrade leader lost: showman"
time="2021-08-31T22:26:19Z" level=fatal msg="Error starting manager: upgrade CRD failed: upgrade from v1.1.0 to v1.2.0: upgrade recurring jobs failed: upgrade from v1.1.0 to v1.2.0: translate volume recurringJobs to volume labels failed: failed to update pvc-edf41777-589d-4806-baca-b91d0a6c0d3c volume: Operation cannot be fulfilled on volumes.longhorn.io \"pvc-edf41777-589d-4806-baca-b91d0a6c0d3c\": the object has been modified; please apply your changes to the latest version and try again"

It seems this is localized to a single node in the cluster (showman). Most of the volumes have migrated engines already.


I have tried deleting the offending manager pod, but have not taken any other troubleshooting steps.

To Reproduce Steps to reproduce the behavior:

  1. Create a Longhorn cluster via Helm
  2. Create PVs
  3. (assumed) Create backup and snapshot tasks for the PVs
  4. Upgrade to 1.2.0
  5. Observe the error on one of the nodes during/after upgrading

Expected behavior Upgrading from 1.1.2 to 1.2.0 when using Helm should result in backup jobs being created successfully for all volumes.

Log

longhorn-support-bundle_5be709ef-a54d-46ea-971f-a9f7c4fb4b26_2021-08-31T22-55-49Z.zip

Environment:

Longhorn version: 1.2.0 (partial upgrade from 1.1.2)
Installation method (e.g. Rancher Catalog App/Helm/Kubectl): Helm
Kubernetes distro (e.g. RKE/K3s/EKS/OpenShift) and version: kubeadm
    Kubernetes version: 1.19.9
    Number of management nodes in the cluster: 3
    Number of worker nodes in the cluster: 4
Node config
    OS type and version: Ubuntu 20.04.1 LTS
    CPU per node: 8
    Memory per node: 16GB
    Disk type (e.g. SSD/NVMe): NVMe
    Network bandwidth between the nodes: 2.5Gbit
Underlying Infrastructure (e.g. on AWS/GCE, EKS/GKE, VMWare/KVM, Baremetal): Baremetal
Number of Longhorn volumes in the cluster: 23

About this issue

  • Original URL
  • State: closed
  • Created 3 years ago
  • Comments: 25 (16 by maintainers)

Most upvoted comments

Is there a particular reason this got removed from the 1.2.1 milestone, @yasker? Right now, it takes several hours for my cluster to become fully functional again whenever I have to restart some of its nodes, and I haven’t seen any workaround mentioned. Or am I misinterpreting the information here?

It seems that we are doing an unnecessary volume CR update here. More specifically, it seems that we still update the volume even if it already has the correct label. As a result, we increase the chance of hitting the error "the object has been modified; please apply your changes to the latest version and try again".

cc @c3y1huang
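
Reusing the hypothetical Volume type and helpers from the sketch above, the fix direction described here could look roughly like the following: compare the existing labels first and only call the API when something actually changes, so unchanged volumes never enter the conflict-prone update path. This is a sketch, not the actual Longhorn implementation.

// labelsAlreadySet reports whether every desired recurring-job label is
// already present on the volume with the expected value.
func labelsAlreadySet(vol *Volume, desired map[string]string) bool {
	for k, v := range desired {
		if vol.Labels[k] != v {
			return false
		}
	}
	return true
}

// translateRecurringJobs skips the write entirely when the volume already
// carries the correct labels, reducing the chance of a conflict.
func translateRecurringJobs(name string, desired map[string]string) error {
	vol, err := getVolume(name)
	if err != nil {
		return err
	}
	if labelsAlreadySet(vol, desired) {
		return nil // nothing to do; avoid the unnecessary update
	}
	return updateVolumeLabels(name, desired)
}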

Validation: PASSED

After upgrading from v1.1.1 or v1.1.2 to the latest branch that includes the fix, the RecurringJob CRDs are created.


Please note that I could only reproduce the recurring job CRD upgrade problem as follows:

  • Set up a k3s v1.19.14+k3s1 cluster with the following spec:
  • 1 control node
    • cpus 2
    • mem 4096M
    • disk 16G
  • 3 worker nodes
    • cpus 1
    • mem 3072M
    • disk 25G
  • Install Longhorn release v1.1.1 or v1.1.2
  • Create a few volumes and set up recurring jobs for backup and snapshot
  • Upgrade Longhorn with the latest manifest files from the master branch
Recurring job does not exist. Screenshot for reference:

[screenshot]
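
For anyone repeating this validation, the presence of the RecurringJob CRD after the upgrade can also be checked programmatically. A minimal sketch using the apiextensions clientset, assuming in-cluster access and the usual <plural>.<group> naming convention for the CRD:

package main

import (
	"context"
	"fmt"
	"log"

	apiextclient "k8s.io/apiextensions-apiserver/pkg/client/clientset/clientset"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/rest"
)

func main() {
	// Assumes the check runs inside the cluster; use clientcmd for kubeconfig-based access.
	cfg, err := rest.InClusterConfig()
	if err != nil {
		log.Fatal(err)
	}
	client, err := apiextclient.NewForConfig(cfg)
	if err != nil {
		log.Fatal(err)
	}
	crd, err := client.ApiextensionsV1().CustomResourceDefinitions().
		Get(context.TODO(), "recurringjobs.longhorn.io", metav1.GetOptions{})
	if err != nil {
		log.Fatalf("RecurringJob CRD not found: %v", err)
	}
	fmt.Printf("found CRD %s (created %s)\n", crd.Name, crd.CreationTimestamp)
}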

Yes, I’m going to watch to confirm it gets restored to the right number of replicas, but it looks like everything’s nominal now.