longhorn: [TASK] Handle the CRD validation error

What’s the task? Please describe

Hit two CRD validation errors on Kubernetes v1.19.7+k3s1 while upgrading Longhorn from v1.2.2 to the v1beta2 API version. The errors are not encountered on Kubernetes v1.21.6+k3s1.

upgradeBackupTargets=failed to update for BackupTarget status default: BackupTarget.longhorn.io "default" is invalid: status.lastSyncedAt: Invalid value: "null": status.lastSyncedAt in body must be of type string: "null"
failed to update=Node.longhorn.io "ku50-master" is invalid: [spec.disks.default-disk-745ff2be8d312b52.tags: Invalid value: "null": spec.disks.default-disk-745ff2be8d312b52.tags in body must be of type array: "null", status.diskStatus: Invalid value: "null": status.diskStatus in body must be of type object: "null"]

Need to check which Kubernetes versions have the same issue and ensure backward compatibility across K8s versions.
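For context, these "null" values typically come from Go fields serialized without omitempty: a nil slice or map marshals to JSON null, which a structural CRD schema rejects unless the field is marked nullable. A minimal runnable sketch under that assumption (the field name mirrors the Node error above; it is not the actual Longhorn type):

package main

import (
	"encoding/json"
	"fmt"
)

// DiskSpec is an illustrative stand-in: without omitempty, a nil slice
// serializes as "tags": null instead of being dropped from the object.
type DiskSpec struct {
	Tags []string `json:"tags"`
}

func main() {
	b, _ := json.Marshal(DiskSpec{})
	fmt.Println(string(b)) // prints {"tags":null}
}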

About this issue

  • State: closed
  • Created 3 years ago
  • Reactions: 1
  • Comments: 26 (19 by maintainers)

Most upvoted comments

Validation - PASSED

Upgrading from v1.1.3 with 1 volume and 1 backing image in use:

v1.17.7+k3s1   longhorn-admission-webhook E0218 14:32:43.770666 1 reflector.go:178] k8s.io/client-go/informers/factory.go:135: Failed to list *v1.CSIDriver: the server could not find the requested resource
v1.18.20+k3s1  No upgrade blocker
v1.19.16+k3s1  No upgrade blocker
v1.20.14+k3s1  No upgrade blocker
v1.21.8+k3s1   No upgrade blocker

More complex resource scenarios will be covered in the release test.

Which one do you prefer/suggest? Support an empty string for this field, or add upgrade logic to convert an empty string to "disabled"? (I think either one should be backported to Longhorn v1.2.x, right?)

Since this is an upgrade case, adding it to the upgrade path should address the issue. It should be backported to v1.2.x.

I see. Then we need to consider assigning the default value to it, which can be implemented either in the upgrade path or by a defaulting webhook. We need to think further about which one is the proper way for all the Longhorn CR resources.
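A minimal sketch of the upgrade-path option, assuming a Volume type with a string-typed spec.replicaAutoBalance field as in the manifest shown later in this thread; the names are illustrative, not the actual longhorn-manager code:

package upgrade

// Minimal stand-ins mirroring the Volume manifest in this thread.
type VolumeSpec struct {
	ReplicaAutoBalance string `json:"replicaAutoBalance"`
}

type Volume struct {
	Spec VolumeSpec `json:"spec"`
}

// fixupVolumeReplicaAutoBalance converts the empty string written by older
// releases into the "disabled" default so the object passes the v1beta2 enum
// validation ("ignored", "disabled", "least-effort", "best-effort").
func fixupVolumeReplicaAutoBalance(v *Volume) {
	if v.Spec.ReplicaAutoBalance == "" {
		v.Spec.ReplicaAutoBalance = "disabled"
	}
}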

Validation - FAILED

Unable to upgrade to v1.3.0-master-head when there is a volume created by previous versions.

Upgrading along the following path via helm3: v1.1.2 → v1.1.3 → v1.2.3 → master-head

longhorn-manager log:

2022/01/07 07:25:36 proto: duplicate proto type registered: VersionResponse
W0107 07:25:36.611108       1 client_config.go:552] Neither --kubeconfig nor --master was specified.  Using the inClusterConfig.  This might not work.
time="2022-01-07T07:25:36Z" level=info msg="cannot list the content of the src directory /var/lib/rancher/longhorn/engine-binaries for the copy, will do nothing: Failed to execute: nsenter [--mount=/host/proc/1/ns/mnt --net=/host/proc/1/ns/net bash -c ls /var/lib/rancher/longhorn/engine-binaries/*], output , stderr, ls: cannot access '/var/lib/rancher/longhorn/engine-binaries/*': No such file or directory\n, error exit status 2"
I0107 07:25:36.658324       1 leaderelection.go:242] attempting to acquire leader lease  longhorn-system/longhorn-manager-upgrade-lock...
I0107 07:25:36.705577       1 leaderelection.go:252] successfully acquired lease longhorn-system/longhorn-manager-upgrade-lock
time="2022-01-07T07:25:36Z" level=info msg="Start upgrading"
time="2022-01-07T07:25:36Z" level=info msg="Upgrading from longhorn.io/v1beta1 to longhorn.io/v1beta2"
time="2022-01-07T07:25:36Z" level=error msg="Upgrade failed: upgrade API version failed: upgrade from v1beta1 to v1beta2: failed: unable to fix up volumes: Volume.longhorn.io \"bi-test-v113-2\" is invalid: spec.replicaAutoBalance: Unsupported value: \"\": supported values: \"ignored\", \"disabled\", \"least-effort\", \"best-effort\""
time="2022-01-07T07:25:36Z" level=info msg="Upgrade leader lost: c1-worker3"
time="2022-01-07T07:25:36Z" level=fatal msg="Error starting manager: upgrade API version failed: upgrade from v1beta1 to v1beta2: failed: unable to fix up volumes: Volume.longhorn.io \"bi-test-v113-2\" is invalid: spec.replicaAutoBalance: Unsupported value: \"\": supported values: \"ignored\", \"disabled\", \"least-effort\", \"best-effort\""
Manifest of the affected volume (created from a backing image):
apiVersion: longhorn.io/v1beta2
kind: Volume
metadata:
  creationTimestamp: "2022-01-07T06:32:03Z"
  deletionGracePeriodSeconds: 0
  deletionTimestamp: "2022-01-07T07:29:50Z"
  finalizers:
  - longhorn.io
  generation: 4
  labels:
    longhornvolume: bi-test-v113-2
    recurring-job-group.longhorn.io/default: enabled
  name: bi-test-v113-2
  namespace: longhorn-system
  resourceVersion: "47532"
  selfLink: /apis/longhorn.io/v1beta2/namespaces/longhorn-system/volumes/bi-test-v113-2
  uid: 751cbbcb-2730-49dc-8997-89cd6c35f432
spec:
  Standby: false
  accessMode: rwo
  backingImage: bi2-v113
  baseImage: ""
  dataLocality: disabled
  dataSource: ""
  disableFrontend: false
  diskSelector: []
  encrypted: false
  engineImage: longhornio/longhorn-engine:v1.1.3
  fromBackup: ""
  frontend: blockdev
  lastAttachedBy: ""
  migratable: false
  migrationNodeID: ""
  nodeID: c1-worker3
  nodeSelector: []
  numberOfReplicas: 3
  replicaAutoBalance: ""
  revisionCounterDisabled: false
  size: "12582912"
  staleReplicaTimeout: 20
status:
  actualSize: 0
  cloneStatus:
    snapshot: ""
    sourceVolume: ""
    state: ""
  conditions: {}
  currentImage: longhornio/longhorn-engine:v1.1.3
  currentNodeID: ""
  expansionRequired: false
  frontendDisabled: false
  isStandby: false
  kubernetesStatus:
    lastPVCRefAt: ""
    lastPodRefAt: ""
    namespace: default
    pvName: bi-test-v113-2
    pvStatus: Bound
    pvcName: bi-test-v113-2
    workloadsStatus: null
  lastBackup: ""
  lastBackupAt: ""
  lastDegradedAt: ""
  ownerID: c1-worker3
  pendingNodeID: ""
  remountRequestedAt: ""
  restoreInitiated: false
  restoreRequired: false
  robustness: faulted
  shareEndpoint: ""
  shareState: ""
  state: detached

Tried to resolve the volume error manually, but was unable to make status.conditions valid:

# volumes.longhorn.io "bi-test-v113" was not valid:
# * spec.replicaAutoBalance: Unsupported value: "": supported values: "ignored", "disabled", "least-effort", "best-effort"
# * status.conditions: Invalid value: "object": status.conditions in body must be of type array: "object"

Created an issue at: #3426
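For reference, the status.conditions error is a shape change: v1beta1 stored conditions as an object keyed by condition type, while the v1beta2 schema expects an array. A hypothetical sketch of the flattening an upgrade path would need (illustrative types, not the actual Longhorn code):

// Condition is a minimal stand-in for the Longhorn condition type.
type Condition struct {
	Type   string `json:"type"`
	Status string `json:"status"`
}

// convertConditions flattens the v1beta1 map form ({"Restore": {...}}) into
// the v1beta2 array form ([{"type": "Restore", ...}]).
func convertConditions(old map[string]Condition) []Condition {
	conds := make([]Condition, 0, len(old))
	for condType, c := range old {
		c.Type = condType
		conds = append(conds, c)
	}
	return conds
}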


A possibly related problem: while trying to upgrade v1.2.3 to the master branch, longhorn-manager is not able to come up:

longhorn-driver-deployer-79f57fd669-2gxp6   0/1     Init:0/1           0          7m25s
longhorn-manager-88cqv                      0/1     CrashLoopBackOff   6          7m25s
longhorn-manager-b5f8t                      0/1     CrashLoopBackOff   6          7m25s
longhorn-manager-r9m2r                      0/1     CrashLoopBackOff   6          7m21s
longhorn-ui-68f4dc986b-gk45s                1/1     Running            0          7m26s

And with the following identical logs across the longhorn-manager pods:

2021/12/17 04:38:02 proto: duplicate proto type registered: VersionResponse
W1217 04:38:02.507156       1 client_config.go:552] Neither --kubeconfig nor --master was specified.  Using the inClusterConfig.  This might not work.
time="2021-12-17T04:38:02Z" level=info msg="cannot list the content of the src directory /var/lib/rancher/longhorn/engine-binaries for the copy, will do nothing: Failed to execute: nsenter [--mount=/host/proc/1/ns/mnt --net=/host/proc/1/ns/net bash -c ls /var/lib/rancher/longhorn/engine-binaries/*], output , stderr, ls: cannot access '/var/lib/rancher/longhorn/engine-binaries/*': No such file or directory\n, error exit status 2"
I1217 04:38:02.525466       1 leaderelection.go:242] attempting to acquire leader lease  longhorn-system/longhorn-manager-upgrade-lock...
I1217 04:38:02.562394       1 leaderelection.go:252] successfully acquired lease longhorn-system/longhorn-manager-upgrade-lock
time="2021-12-17T04:38:02Z" level=info msg="Start upgrading"
time="2021-12-17T04:38:02Z" level=info msg="Upgrading from longhorn.io/v1beta1 to longhorn.io/v1beta2"
time="2021-12-17T04:38:02Z" level=info msg="Finished upgrading volumes"
time="2021-12-17T04:38:02Z" level=info msg="Finished upgrading engineImages"
time="2021-12-17T04:38:02Z" level=info msg="Finished upgrading backupTargets"
time="2021-12-17T04:38:02Z" level=info msg="Finished upgrading nodes"
time="2021-12-17T04:38:02Z" level=error msg="Upgrade failed: upgrade API version failed: upgrade from v1beta1 to v1beta2: failed: unable to fix up backingImages: v1beta1.BackingImageList.Items: []v1beta1.BackingImage: v1beta1.BackingImage.Spec: v1beta1.BackingImageSpec.Disks: ReadString: expects \" or n, but found {, error found in #10 byte of ...|a2a42dd\":{}},\"imageU|..., bigger context ...|,\"disks\":{\"6cb312da-bd8e-48f3-8ef0-fb635a2a42dd\":{}},\"imageURL\":\"\",\"sourceParameters\":{},\"sourceType|..."
time="2021-12-17T04:38:02Z" level=info msg="Upgrade leader lost: wk2"
time="2021-12-17T04:38:02Z" level=fatal msg="Error starting manager: upgrade API version failed: upgrade from v1beta1 to v1beta2: failed: unable to fix up backingImages: v1beta1.BackingImageList.Items: []v1beta1.BackingImage: v1beta1.BackingImage.Spec: v1beta1.BackingImageSpec.Disks: ReadString: expects \" or n, but found {, error found in #10 byte of ...|a2a42dd\":{}},\"imageU|..., bigger context ...|,\"disks\":{\"6cb312da-bd8e-48f3-8ef0-fb635a2a42dd\":{}},\"imageURL\":\"\",\"sourceParameters\":{},\"sourceType|..."
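The failure above looks like a decode-shape mismatch: the stored backing image already has object-valued disk entries ("disks":{"<uuid>":{}}), while the v1beta1 Go type being decoded into apparently expects string values. A runnable illustration under that assumption (the standard library reports an analogous, differently worded error than the one in the log):

package main

import (
	"encoding/json"
	"fmt"
)

// Assumed v1beta1 shape of the field: a string-valued map.
type BackingImageSpecV1beta1 struct {
	Disks map[string]string `json:"disks"`
}

func main() {
	// Stored data in the newer object-valued shape, as seen in the error's
	// "bigger context" excerpt.
	stored := []byte(`{"disks":{"6cb312da-bd8e-48f3-8ef0-fb635a2a42dd":{}}}`)
	var spec BackingImageSpecV1beta1
	if err := json.Unmarshal(stored, &spec); err != nil {
		fmt.Println("decode error:", err) // fails at the first '{' value
	}
}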

Steps to reproduce, starting from a fresh rke1 v1.20 cluster (1 + 3 nodes):

  1. Install v1.2.3 via kubectl with longhorn.yaml
  2. Wait until the deployed workloads are finished and healthy
  3. Upgrade to master-head images with kubectl

We need to consider the nightly E2E test on different Kubernetes versions as well (K8s 1.18+). cc @innobead @longhorn/qa

For supported OSes and K8s distro versions, more testing is always welcome, but we need to consider resources/costs as well.

Right now we have an OS matrix, so we probably need a K8s version matrix as well, covering only supported versions. cc @khushboo-rancher


Kubernetes version   Upgradable
v1.22.2+k3s1         Y
v1.21.0+k3s1         Y
v1.20.0+k3s2         Y
v1.19.16+k3s1        N

To upgrade to v1beta2 and pass the CRD validation without setting the +nullable marker on fields, the Kubernetes cluster should be at least v1.20.
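For reference, a hypothetical sketch of the +nullable marker in question, using kubebuilder markers on the field from the original errors (assumed field shape, not the actual Longhorn type definition):

// DiskSpec sketches the Node field from the errors above. +optional lets the
// field be omitted, and +nullable makes an explicit JSON null pass the
// generated structural schema.
type DiskSpec struct {
	// +optional
	// +nullable
	Tags []string `json:"tags"`
}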

cc @innobead