longhorn: [BUG] Volumes are stuck in undefined state after upgrade to v1.5.1
Describe the bug (🐛 if you encounter this issue)
We recently did an upgrade of Longhorn from v1.3.2 to v1.5.1 (following the upgrade path 1.3.2 -> 1.4.3 -> 1.5.1). While we did not notice any issues with most functionality, every time we use kubectl to create a new volume using our manifests, it is stuck in an undefined state.
To Reproduce
- Update your Volume definition from v1.3.2 with the additional fields of 1.5.1
- Try to apply the volume using kubectl apply (see the command sketch below)
- The volume will stay stuck in a “not ready” state
- See below for additional info
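For reference, a rough sketch of the kind of sequence we run and how the stuck volume can be inspected (the manifest file name is just an example; the kubectl commands are standard):

kubectl apply -f mongo-volume.yaml
kubectl -n longhorn-system get volumes.longhorn.io mongo-vol-dev-3
kubectl -n longhorn-system describe volumes.longhorn.io mongo-vol-dev-3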
Expected behavior
Volumes that are using the proper manifest definition for v1.5.1 should be correctly scheduled and show as healthy in the UI.
Support bundle for troubleshooting
supportbundle_15096343-6063-4eee-9671-dcf564e53625_2023-08-23T15-24-00Z.zip
Environment
- Longhorn version: v1.5.1
- Installation method (e.g. Rancher Catalog App/Helm/Kubectl): Helm
- Kubernetes distro (e.g. RKE/K3s/EKS/OpenShift) and version: K8s v1.26.5, setup with Kubespray v2.22.1
- Number of management node in the cluster: 3
- Number of worker node in the cluster: 2 Longhorn dedicated ones, 2 generic worker nodes
- Node config
- OS type and version: Ubuntu 22.04.3 LTS
- Kernel version: 5.15.0-79-generic
- CPU per node: 16 vCPUs
- Memory per node: 64GB for Longhorn dedicated nodes
- Disk type (e.g. SSD/NVMe/HDD): one node has an SSD volume, the other has an HDD
- Network bandwidth between the nodes:
- Underlying Infrastructure (e.g. on AWS/GCE, EKS/GKE, VMWare/KVM, Baremetal):
- Number of Longhorn volumes in the cluster: 12
Additional context
The attached support bundle has a mix of volumes that had to be manually edited so they could become healthy, volumes that are stuck using the v1.3.2 manifest, as well as two volumes (mysql-vol-dev-3 and mongo-vol-dev-3) that were deployed using the new values in the Volume manifest for v1.5.1.
We tried getting more info using kubetail longhorn-manager -n longhorn-system, and we found many entries similar to the following, related to the volumes in the stuck state:
[longhorn-manager-skrtd] E0823 15:03:40.092813 1 volume_controller.go:231] failed to sync longhorn-system/mongo-vol-dev-3: Volume.longhorn.io "mongo-vol-dev-3" is invalid: [spec.backupCompressionMethod: Unsupported value: "": supported values: "none", "lz4", "gzip", spec.replicaSoftAntiAffinity: Unsupported value: "": supported values: "ignored", "enabled", "disabled", spec.restoreVolumeRecurringJob: Unsupported value: "": supported values: "ignored", "enabled", "disabled", spec.unmapMarkSnapChainRemoved: Unsupported value: "": supported values: "ignored", "disabled", "enabled", spec.backendStoreDriver: Unsupported value: "": supported values: "v1", "v2", spec.snapshotDataIntegrity: Unsupported value: "": supported values: "ignored", "disabled", "enabled", "fast-check", spec.replicaZoneSoftAntiAffinity: Unsupported value: "": supported values: "ignored", "enabled", "disabled", spec.offlineReplicaRebuilding: Unsupported value: "": supported values: "ignored", "disabled", "enabled"]
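To confirm which spec fields come up empty on a stuck volume, a check along these lines can be used (plain kubectl; the volume name is one of ours):

kubectl -n longhorn-system get volumes.longhorn.io mongo-vol-dev-3 -o yaml \
  | grep -E 'backupCompressionMethod|replicaSoftAntiAffinity|backendStoreDriver|snapshotDataIntegrity'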
Below is our current volume definition, which includes the fields listed in the error message above:
apiVersion: longhorn.io/v1beta1
kind: Volume
metadata:
  finalizers:
  - longhorn.io
  generation: 1
  labels:
    app: lab
    longhornvolume: mongo-vol-dev-3
    type: database
  managedFields:
  - apiVersion: longhorn.io/v1beta1
    fieldsType: FieldsV1
    fieldsV1:
      f:metadata:
        f:finalizers:
          .: {}
          v:"longhorn.io": {}
        f:labels:
          .: {}
          f:longhornvolume: {}
      f:spec: {}
      f:status: {}
    manager: longhorn-manager
    operation: Update
  name: mongo-vol-dev-3
  namespace: longhorn-system
spec:
  Standby: false
  accessMode: rwo
  backendStoreDriver: v1
  backingImage: ""
  backupCompressionMethod: gzip
  dataLocality: disabled
  dataSource: ""
  disableFrontend: false
  diskSelector: []
  encrypted: false
  engineImage: longhornio/longhorn-engine:v1.5.1
  fromBackup: ""
  frontend: blockdev
  lastAttachedBy: ""
  nodeID: ""
  nodeSelector: []
  numberOfReplicas: 2
  offlineReplicaRebuilding: disabled
  replicaAutoBalance: best-effort
  replicaSoftAntiAffinity: ignored
  replicaZoneSoftAntiAffinity: ignored
  restoreVolumeRecurringJob: ignored
  revisionCounterDisabled: false
  size: "10737418240"
  snapshotDataIntegrity: ignored
  staleReplicaTimeout: 20
  unmapMarkSnapChainRemoved: ignored
We also have a PV, PVC definition as follows:
apiVersion: v1
kind: PersistentVolume
metadata:
  name: mongo-pv-dev-3
  namespace: lab-dev-3
  labels:
    type: database
    app: lab
    volume_for: mongo-pvc-dev-3
spec:
  capacity:
    storage: 10Gi
  volumeMode: Filesystem
  accessModes:
  - ReadWriteOnce
  persistentVolumeReclaimPolicy: Retain
  storageClassName: longhorn
  csi:
    driver: driver.longhorn.io
    fsType: xfs
    volumeAttributes:
      numberOfReplicas: '2'
      staleReplicaTimeout: '30'
    volumeHandle: mongo-vol-dev-3
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: mongo-pvc-dev-3
  namespace: lab-dev-3
  labels:
    type: database
    app: lab
spec:
  selector:
    matchLabels:
      volume_for: mongo-pvc-dev-3
  accessModes:
  - ReadWriteOnce
  storageClassName: longhorn
  resources:
    requests:
      storage: 10Gi
Things we have tried so far:
- Deploying the v1.3.2 engine image and updating the volume to use it makes the volume become ready; upgrading it to the 1.5.1 engine afterwards works
- Restoring a volume from backup works, even if it was a backup from v1.3.2
- Completely deleting the PV, PVC, and volume and recreating everything from scratch does not work
- When trying to delete a volume in that stuck state, you have to manually edit the volume to use the previous engine image (1.3.2) so it can finish deleting; the same is needed when trying to scale down workloads (see the sketch after this list)
- Manually creating a volume from the UI works
- Also, during the upgrade, we had a situation where a volume did not detach, even though we had scaled all workloads using volumes down to zero replicas, and the instance the UI showed the volume as attached to no longer existed.
- We also tried deleting the longhorn-manager pods in case they had some odd caching issue, but the problem persisted.
- We noticed that deploying a volume with the v1.3.2 engine, without the new fields, and manually upgrading it to 1.5.1 adds the missing fields to the volume; adding them right away in the Kubernetes manifest, however, does not work, as the error persists.
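For completeness, the manual engine-image edit mentioned in the list above was done roughly like this (a sketch only; patching spec.engineImage directly is our stopgap, the UI upgrade path is the normal route):

kubectl -n longhorn-system patch volumes.longhorn.io mongo-vol-dev-3 --type=merge \
  -p '{"spec":{"engineImage":"longhornio/longhorn-engine:v1.3.2"}}'
# ...and back once the volume is ready or has finished deleting:
kubectl -n longhorn-system patch volumes.longhorn.io mongo-vol-dev-3 --type=merge \
  -p '{"spec":{"engineImage":"longhornio/longhorn-engine:v1.5.1"}}'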
As a note about our usage, our E2E test setup currently requires us to delete the volume, as well as the PV and PVC, in order to apply some specific database dumps. We noticed that after the upgrade, if there is no need to delete the volume, things keep working as expected, but the moment you have to delete and recreate it, the volume gets stuck. This is all running on an internal, self-hosted Kubernetes cluster.
Some screenshots of the UI with the issue:
The detached volumes here are cases where we had to manually downgrade the engine, then get it back to 1.5.1 using the UI so we could scale down the workloads.
If this is all an error in our manifest definition after the upgrade, I would kindly ask you to point us in a general direction of what should be changed, or to any examples if available.
About this issue
- Original URL
- State: open
- Created 10 months ago
- Comments: 22 (10 by maintainers)
Ah! Sorry for the misunderstanding.
I created a volume using your PVC manifest. The volume spec fields are mutated correctly. A weird point in your support bundle is that the mutating webhook actually issued a mutation request for a problematic volume, mysql-vol-test-2, yet the fields are still empty.
@deadpyxel We are fixing the root cause of the issue; please follow https://github.com/longhorn/longhorn/issues/6294.
Currently, the workaround is manually patching the resources.
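A rough sketch of such a patch, reusing the values from the manifest posted above (patch only the fields that are actually empty on your volume; if the validating webhook rejects the update, see the webhook workaround further down):

kubectl -n longhorn-system patch volumes.longhorn.io mongo-vol-dev-3 --type=merge -p '{
  "spec": {
    "backupCompressionMethod": "gzip",
    "replicaSoftAntiAffinity": "ignored",
    "replicaZoneSoftAntiAffinity": "ignored",
    "restoreVolumeRecurringJob": "ignored",
    "unmapMarkSnapChainRemoved": "ignored",
    "snapshotDataIntegrity": "ignored",
    "offlineReplicaRebuilding": "disabled",
    "backendStoreDriver": "v1"
  }
}'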
@deadpyxel Do you mean removing all volumes from the cluster and creating a single simple deployment for analysis? If yes, that would be good.
@deadpyxel Hmm, the mutation in this simple case worked as expected. Can you provide an example manifest (like the simple case) that fails to create a running and attached volume?
Yes, from @deadpyxel’s support bundle.
Looks weird. I found that some volume.spec fields were not upgraded or mutated correctly. These problematic volume resources are:
The issue is tracked in https://github.com/longhorn/longhorn/issues/6294
The workaround for your case:
1. Scale down the applications using the problematic volumes.
2. Disable the admission webhook temporarily: find the webhook rule that covers volumes, then remove its UPDATE policy.
3. Manually patch the problematic volume resources.
4. Add the UPDATE policy of volumes back (the one removed in step 2).
You can try to fix one of the volumes first and see if it works.
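A possible way to carry out the webhook steps (the ValidatingWebhookConfiguration name below is an assumption and may differ per install; confirm it with the first command):

kubectl get validatingwebhookconfigurations | grep -i longhorn
# Temporarily remove "UPDATE" from the operations of the rule that covers
# volumes.longhorn.io, then save:
kubectl edit validatingwebhookconfiguration longhorn-webhook-validator
# Patch the problematic volumes while UPDATE is removed, then edit the
# configuration again and add "UPDATE" back.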