longhorn: [BUG] Volumes are stuck in undefined state after upgrade to v1.5.1

Describe the bug (🐛 if you encounter this issue)

We recently upgraded Longhorn from v1.3.2 to v1.5.1 (following the upgrade path 1.3.2 -> 1.4.3 -> 1.5.1). While most functionality works fine, every time we use kubectl to create a new volume from our manifests, it gets stuck in an undefined state.

To Reproduce

  1. Update your Volume definition from v1.3.2 to include the additional fields introduced in v1.5.1
  2. Try to apply the volume using kubectl apply (see the example command after this list)
  3. Volume will stay stuck in a “not ready” state
  4. See below for additional info
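For reference, this is roughly how we apply and then check the volume (the file name here is just a placeholder for our actual manifest):

kubectl apply -f mongo-vol-dev-3.yaml
kubectl -n longhorn-system get volumes.longhorn.io mongo-vol-dev-3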

Expected behavior

Volumes that are using the proper manifest definition for v1.5.1 should be correctly scheduled and show as healthy in the UI.

Support bundle for troubleshooting

supportbundle_15096343-6063-4eee-9671-dcf564e53625_2023-08-23T15-24-00Z.zip

Environment

  • Longhorn version: v1.5.1
  • Installation method (e.g. Rancher Catalog App/Helm/Kubectl): Helm
  • Kubernetes distro (e.g. RKE/K3s/EKS/OpenShift) and version: K8s v1.26.5, setup with Kubespray v2.22.1
    • Number of management node in the cluster: 3
    • Number of worker node in the cluster: 2 Longhorn dedicated ones, 2 generic worker nodes
  • Node config
    • OS type and version: Ubuntu 22.04.3 LTS
    • Kernel version: 5.15.0-79-generic
    • CPU per node: 16 vCPUs
    • Memory per node: 64GB for Longhorn dedicated nodes
    • Disk type (e.g. SSD/NVMe/HDD): one node has an SSD volume, the other has an HDD
    • Network bandwidth between the nodes:
  • Underlying Infrastructure (e.g. on AWS/GCE, EKS/GKE, VMWare/KVM, Baremetal):
  • Number of Longhorn volumes in the cluster: 12

Additional context

The attached support bundle has a mix of volumes that had to be manually edited so they could become healthy, volumes that are stuck using the v1.3.2 manifest, as well as two volumes (mysql-vol-dev-3 and mongo-vol-dev-3) that have been deployed using the new values in the Volume manifest for v1.5.1.

We tried getting more info using kubetail longhorn-manager -n longhorn-system, and we found many entries similar to the following, related to the volumes in the stuck state:

[longhorn-manager-skrtd] E0823 15:03:40.092813       1 volume_controller.go:231] failed to sync longhorn-system/mongo-vol-dev-3: Volume.longhorn.io "mongo-vol-dev-3" is invalid: [spec.backupCompressionMethod: Unsupported value: "": supported values: "none", "lz4", "gzip", spec.replicaSoftAntiAffinity: Unsupported value: "": supported values: "ignored", "enabled", "disabled", spec.restoreVolumeRecurringJob: Unsupported value: "": supported values: "ignored", "enabled", "disabled", spec.unmapMarkSnapChainRemoved: Unsupported value: "": supported values: "ignored", "disabled", "enabled", spec.backendStoreDriver: Unsupported value: "": supported values: "v1", "v2", spec.snapshotDataIntegrity: Unsupported value: "": supported values: "ignored", "disabled", "enabled", "fast-check", spec.replicaZoneSoftAntiAffinity: Unsupported value: "": supported values: "ignored", "enabled", "disabled", spec.offlineReplicaRebuilding: Unsupported value: "": supported values: "ignored", "disabled", "enabled"]
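A quick way to spot which volumes are affected (just a sketch using plain kubectl; the extra column is one of the fields from the error above, any of them works):

kubectl -n longhorn-system get volumes.longhorn.io \
  -o custom-columns=NAME:.metadata.name,BACKENDSTOREDRIVER:.spec.backendStoreDriver

The stuck volumes show an empty value in that column, matching the validation error.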

Below you can see our current volume definition, which already sets the fields listed in the error message above:

apiVersion: longhorn.io/v1beta1
kind: Volume
metadata:
  finalizers:
  - longhorn.io
  generation: 1
  labels:
    app: lab
    longhornvolume: mongo-vol-dev-3
    type: database
  managedFields:
  - apiVersion: longhorn.io/v1beta1
    fieldsType: FieldsV1
    fieldsV1:
      f:metadata:
        f:finalizers:
          .: {}
          v:"longhorn.io": {}
        f:labels:
          .: {}
          f:longhornvolume: {}
      f:spec: {}
      f:status: {}
    manager: longhorn-manager
    operation: Update
  name: mongo-vol-dev-3
  namespace: longhorn-system
spec:
  Standby: false
  accessMode: rwo
  backendStoreDriver: v1
  backingImage: ""
  backupCompressionMethod: gzip
  dataLocality: disabled
  dataSource: ""
  disableFrontend: false
  diskSelector: []
  encrypted: false
  engineImage: longhornio/longhorn-engine:v1.5.1
  fromBackup: ""
  frontend: blockdev
  lastAttachedBy: ""
  nodeID:
  nodeSelector: []
  numberOfReplicas: 2
  offlineReplicaRebuilding: disabled
  replicaAutoBalance: best-effort
  replicaSoftAntiAffinity: ignored
  replicaZoneSoftAntiAffinity: ignored
  restoreVolumeRecurringJob: ignored
  revisionCounterDisabled: false
  size: "10737418240"
  snapshotDataIntegrity: ignored
  staleReplicaTimeout: 20
  unmapMarkSnapChainRemoved: ignored

We also have PV and PVC definitions as follows:

apiVersion: v1
kind: PersistentVolume
metadata:
  name: mongo-pv-dev-3
  namespace: lab-dev-3
  labels:
    type: database
    app: lab
    volume_for: mongo-pvc-dev-3
spec:
  capacity:
    storage: 10Gi
  volumeMode: Filesystem
  accessModes:
    - ReadWriteOnce
  persistentVolumeReclaimPolicy: Retain
  storageClassName: longhorn
  csi:
    driver: driver.longhorn.io
    fsType: xfs
    volumeAttributes:
      numberOfReplicas: '2'
      staleReplicaTimeout: '30'
    volumeHandle: mongo-vol-dev-3
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: mongo-pvc-dev-3
  namespace: lab-dev-3
  labels:
    type: database
    app: lab
spec:
  selector:
    matchLabels:
      volume_for: mongo-pvc-dev-3
  accessModes:
    - ReadWriteOnce
  storageClassName: longhorn
  resources:
    requests:
      storage: 10Gi
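As a basic sanity check (names taken from the manifests above), the binding can be verified with:

kubectl -n lab-dev-3 get pvc mongo-pvc-dev-3
kubectl get pv mongo-pv-dev-3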

Things we have tried so far:

  • Deploying the v1.3.2 engine image and updating the volume to use it makes the volume become ready; upgrading it afterwards to the v1.5.1 engine then works
  • Restoring a volume from backup works, even if it was a backup from v1.3.2
  • Completely deleting the PV, PVC and volume and recreating everything from scratch does not work
  • When trying to delete a volume that is in this stuck state, you have to manually edit it to use the previous engine image (1.3.2) so it can finish deleting; the same is needed when trying to scale down workloads (a sketch of this edit is shown after this list)
  • Manually creating a volume from the UI works
  • Also, during the upgrade, we had a situation where a volume did not detach, even though we had scaled all workloads using volumes down to zero replicas, and the instance the UI showed the volume as attached to no longer existed
  • We also tried deleting the longhorn-manager pods to see if they had any weird caching issue, but the problem persisted
  • We noticed that deploying a v1.3.2 engine volume, without the new fields, and manually upgrading it to 1.5.1 adds the missing fields to the volume, but adding them right away in the Kubernetes manifest does not work, as the error persists
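For completeness, the manual engine-image edit mentioned in the list above is roughly this (just a sketch; we edit the Volume resource directly):

kubectl -n longhorn-system edit volumes.longhorn.io mongo-vol-dev-3
# in the editor, change spec.engineImage from
#   longhornio/longhorn-engine:v1.5.1
# to
#   longhornio/longhorn-engine:v1.3.2
# and save; the volume then finishes deleting / detaching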

As a note about our usage, our E2E test setup currently requires us to delete the volume, as well as the PV and PVC, in order to apply some specific database dumps. We noticed that after the upgrade, if there is no need to delete the volume, things keep working as expected, but the moment you have to delete and recreate it, the issue appears. This is all running on an internal, self-hosted Kubernetes cluster.
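The sequence our E2E setup runs is roughly the following (file names are placeholders); recreating everything after the delete is where we hit the stuck state:

kubectl -n lab-dev-3 delete pvc mongo-pvc-dev-3
kubectl delete pv mongo-pv-dev-3
kubectl -n longhorn-system delete volumes.longhorn.io mongo-vol-dev-3
kubectl apply -f mongo-volume.yaml -f mongo-pv-pvc.yaml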

Some screenshots of the UI with the issue: [screenshot] The detached volumes here are cases where we had to manually downgrade the engine, then get it back to 1.5.1 using the UI so we could scale down the workloads.

[screenshot] This is a volume that is stuck after trying to create it using the manifest.

If this is all an error in our manifest definitions after the upgrade, I would kindly ask you to point us in a general direction of what should be changed, or whether there are any examples available.

About this issue

  • Original URL
  • State: open
  • Created 10 months ago
  • Comments: 22 (10 by maintainers)

Most upvoted comments

Ah! Sorry for the misunderstanding.

I created a volume using your PVC manifest, and the volume spec fields were mutated correctly. A weird point in your support bundle is that the mutating webhook actually issued a mutation request for a problematic volume, mysql-vol-test-2, yet the fields are still empty.

2023-08-23T14:53:11.538739988Z time="2023-08-23T14:53:11Z" level=info msg="Request (user: user-c8stk, longhorn.io/v1beta2, Kind=Volume, namespace: longhorn-system, name: mysql-vol-test-2, operation: UPDATE) patchOps: [{\"op\": \"replace\", \"path\": \"/spec/unmapMarkSnapChainRemoved\", \"value\": \"ignored\"},{\"op\": \"replace\", \"path\": \"/spec/snapshotDataIntegrity\", \"value\": \"ignored\"},{\"op\": \"replace\", \"path\": \"/spec/restoreVolumeRecurringJob\", \"value\": \"ignored\"},{\"op\": \"replace\", \"path\": \"/spec/replicaSoftAntiAffinity\", \"value\": \"ignored\"},{\"op\": \"replace\", \"path\": \"/spec/replicaZoneSoftAntiAffinity\", \"value\": \"ignored\"},{\"op\": \"replace\", \"path\": \"/spec/backendStoreDriver\", \"value\": \"v1\"},{\"op\": \"replace\", \"path\": \"/spec/backupCompressionMethod\", \"value\": \"gzip\"},{\"op\": \"replace\", \"path\": \"/spec/offlineReplicaRebuilding\", \"value\": \"disabled\"},{\"op\": \"replace\", \"path\": \"/metadata/labels\", \"value\": {\"app\":\"labrador\",\"longhornvolume\":\"mysql-vol-test-2\",\"recurring-job-group.longhorn.io/default\":\"enabled\",\"setting.longhorn.io/remove-snapshots-during-filesystem-trim\":\"ignored\",\"setting.longhorn.io/snapshot-data-integrity\":\"ignored\",\"type\":\"database\"}}]" service=admissionWebhook

@deadpyxel We are fixing the root cause of the issue; please follow https://github.com/longhorn/longhorn/issues/6294.

Currently, the workaround is manually patching the resources.
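For example, a patch equivalent to the mutation in the webhook log above could look like this (just a sketch; replace the volume name and adjust the values to your intended manifest; the empty enum fields likely all need valid values in the same update to pass the CRD validation shown in the error earlier in this issue):

kubectl -n longhorn-system patch volumes.longhorn.io mysql-vol-test-2 --type=json -p '[
  {"op": "replace", "path": "/spec/unmapMarkSnapChainRemoved", "value": "ignored"},
  {"op": "replace", "path": "/spec/snapshotDataIntegrity", "value": "ignored"},
  {"op": "replace", "path": "/spec/restoreVolumeRecurringJob", "value": "ignored"},
  {"op": "replace", "path": "/spec/replicaSoftAntiAffinity", "value": "ignored"},
  {"op": "replace", "path": "/spec/replicaZoneSoftAntiAffinity", "value": "ignored"},
  {"op": "replace", "path": "/spec/backendStoreDriver", "value": "v1"},
  {"op": "replace", "path": "/spec/backupCompressionMethod", "value": "gzip"},
  {"op": "replace", "path": "/spec/offlineReplicaRebuilding", "value": "disabled"}
]'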

Would it be good to wipe out all volumes currently running on Longhorn and then create a new support bundle where we have only that simple deployment done?

@deadpyxel Do you mean remove all volumes from the cluster and create a single simple deployment for analysis? If yes, it would be good.

@deadpyxel Hmm, the mutation in this simple case worked as expected. Can you provide an example manifest (like the simple case) that fails to create a running and attached volume?

@derekbit Is the log from the support bundle? If yes, it would be interesting to see the cause.

Yes, from @deadpyxel’s support bundle.

Looks weird. I found that some volume.spec fields were not upgraded or mutated correctly. These problematic volume resources look like this:

  spec:
    Standby: false
    accessMode: rwo
    backendStoreDriver: "null"
    backingImage: "null"
    backupCompressionMethod: "null"
    dataLocality: disabled
    dataSource: "null"
    disableFrontend: false
    diskSelector: []
    encrypted: false
    engineImage: longhornio/longhorn-engine:v1.5.1
    fromBackup: "null"
    frontend: blockdev
    lastAttachedBy: "null"
    migratable: false
    migrationNodeID: "null"
    nodeID: "null"
    nodeSelector: []
    numberOfReplicas: 2
    offlineReplicaRebuilding: "null"
    replicaAutoBalance: best-effort
    replicaSoftAntiAffinity: "null"
    replicaZoneSoftAntiAffinity: "null"
    restoreVolumeRecurringJob: "null"
    revisionCounterDisabled: false
    size: "10737418240"
    snapshotDataIntegrity: "null"
    staleReplicaTimeout: 20
    unmapMarkSnapChainRemoved: "null"

The issue is tracked in https://github.com/longhorn/longhorn/issues/6294


The workaround for your case:

  1. Scale down the applications using the problematic volumes

  2. Disable the admission webhook temporarily:

kubectl -n longhorn-system edit validatingwebhookconfigurations longhorn-webhook-validator

Find

  - apiGroups:
    - longhorn.io
    apiVersions:
    - v1beta2
    operations:
    - CREATE
    - UPDATE
    resources:
    - volumes
    scope: Namespaced

Then remove the UPDATE operation from the volumes rule.
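After removing UPDATE, the rule would look like this (everything else unchanged):

  - apiGroups:
    - longhorn.io
    apiVersions:
    - v1beta2
    operations:
    - CREATE
    resources:
    - volumes
    scope: Namespaced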

  3. Mutate the volume.spec fields that currently have "null"/empty values, one by one, using the following values:
  backendStoreDriver: v1
  backupCompressionMethod: gzip
  dataLocality: disabled
  offlineReplicaRebuilding: disabled
  replicaAutoBalance: ignored
  replicaSoftAntiAffinity: ignored
  replicaZoneSoftAntiAffinity: ignored
  restoreVolumeRecurringJob: ignored
  snapshotDataIntegrity: ignored
  unmapMarkSnapChainRemoved: ignored
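A sketch of applying these values to one volume in a single update (a merge patch is just one option; kubectl edit works as well, and the values should be adjusted to match the manifest you actually want):

kubectl -n longhorn-system patch volumes.longhorn.io mongo-vol-dev-3 --type=merge -p '{
  "spec": {
    "backendStoreDriver": "v1",
    "backupCompressionMethod": "gzip",
    "dataLocality": "disabled",
    "offlineReplicaRebuilding": "disabled",
    "replicaAutoBalance": "ignored",
    "replicaSoftAntiAffinity": "ignored",
    "replicaZoneSoftAntiAffinity": "ignored",
    "restoreVolumeRecurringJob": "ignored",
    "snapshotDataIntegrity": "ignored",
    "unmapMarkSnapChainRemoved": "ignored"
  }
}'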
  4. Re-enable the admission webhook by adding back the UPDATE operation removed in step 2

You can try fixing one of the volumes first and see if it works.
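To verify, the fixed volume should report a normal state again (this assumes the Volume CRD's printer columns; otherwise inspect the full object with -o yaml):

kubectl -n longhorn-system get volumes.longhorn.io mongo-vol-dev-3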