longhorn: [BUG] Failing to mount encrypted volumes v1.5.2
Describe the bug (🐛 if you encounter this issue)
Encrypted volumes work perfectly in v1.5.1 but fail to mount in v1.5.2.
Error message from pod
Normal Scheduled 24m default-scheduler Successfully assigned flask/redis-6db4847b89-7xrq8 to frame2.rfed.me
Warning FailedAttachVolume 34s (x19 over 23m) attachdetach-controller AttachVolume.Attach failed for volume "pvc-584281a4-60cd-403b-a522-d12a0a15eab2" : rpc error: code = Internal desc = volume pvc-584281a4-60cd-403b-a522-d12a0a15eab2 failed to attach to node frame2.rfed.me with attachmentID csi-894ac457228fd8e53fbe692158f44ecdb522e03727fce491e8715a1c093ee7cf: Waiting for volume share to be available
Warning FailedMount 8s (x11 over 22m) kubelet Unable to attach or mount volumes: unmounted volumes=[redis-data], unattached volumes=[redis-data kube-api-access-sf9sd]: timed out waiting for the condition
To Reproduce
1. Deploy Longhorn v1.5.2
2. Add the storage class: https://github.com/clemenko/k8s_yaml/blob/master/longhorn_encryption.yml
3. Deploy an app that uses it: https://github.com/clemenko/fleet/blob/main/flask/flask.yaml
Expected behavior
Volumes mount correctly.
Support bundle for troubleshooting
attaching soon
Environment
- Longhorn version: v1.5.2
- Installation method (e.g. Rancher Catalog App/Helm/Kubectl): helm
- Kubernetes distro (e.g. RKE/K3s/EKS/OpenShift) and version: rke2
- Number of management node in the cluster: 1
- Number of worker node in the cluster: 2
- Node config
- OS type and version: Rocky 9
- Kernel version:
- CPU per node: 4
- Memory per node: 8
- Disk type(e.g. SSD/NVMe/HDD): SSD
- Network bandwidth between the nodes: 1 gigawatt
- Underlying Infrastructure (e.g. on AWS/GCE, EKS/GKE, VMWare/KVM, Baremetal): Digital Ocean
- Number of Longhorn volumes in the cluster: 1
- Impacted Longhorn resources:
- Volume names:
Additional context
This works in v1.5.1 and it broke in v1.5.2.
About this issue
- Original URL
- State: closed
- Created 8 months ago
- Reactions: 1
- Comments: 17 (9 by maintainers)
that worked!
I added this issue to the outstanding known issue of 1.5.2, https://github.com/longhorn/longhorn/wiki/Outstanding-Known-Issues-of-Releases.
cc @derekbit @khushboo-rancher
Investigating this part. This is a better solution.
The fields in persistentvolume resource are immutable. One possible workaround for an existing volume is replacing (delete and recreate) pv.
1. Copy the volume's encryption secret into the `longhorn-system` namespace.
2. `kubectl get pv <pv name> -o yaml > pv.yaml`
3. Change `spec.csi.nodePublishSecretRef.namespace` and `spec.csi.nodeStageSecretRef.namespace` to `longhorn-system` in pv.yaml.
4. `kubectl replace --cascade=false --force -f pv.yaml`
5. If the replacement gets stuck, `kubectl edit pv <pv name>`, then remove the finalizer. The replacement should succeed.

Having the secret per volume is vital for a multi-tenant situation.
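As a concrete sketch of the steps above (the secret name `longhorn-crypto` and the app namespace `flask` are examples; substitute your own):

```shell
# 1. Copy the volume's encryption secret into the longhorn-system namespace
#    (example names; adjust to your secret and namespace).
kubectl get secret longhorn-crypto -n flask -o yaml \
  | sed 's/namespace: flask/namespace: longhorn-system/' \
  | kubectl apply -f -

# 2. Dump the PV manifest.
kubectl get pv pvc-584281a4-60cd-403b-a522-d12a0a15eab2 -o yaml > pv.yaml

# 3. Edit pv.yaml: set spec.csi.nodePublishSecretRef.namespace and
#    spec.csi.nodeStageSecretRef.namespace to "longhorn-system".
#    (PV spec fields are immutable, hence replace rather than edit.)

# 4. Replace the PV without cascading deletion to dependents.
kubectl replace --cascade=false --force -f pv.yaml

# 5. If the old PV hangs in Terminating, remove its finalizer:
kubectl edit pv pvc-584281a4-60cd-403b-a522-d12a0a15eab2
```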
The issue is due to https://github.com/longhorn/longhorn/issues/6954. We only sync the secrets in the `longhorn-system` namespace, to avoid high memory consumption from caching secrets, configmaps and so on in other namespaces. However, the share-manager pod will get the secret of the `pv.Spec.CSI.NodePublishSecretRef` (https://github.com/longhorn/longhorn-manager/blob/master/controller/share_manager_controller.go#L756)… We didn't notice this.

The workaround for a new volume is creating secrets in `longhorn-system` and setting the `xxx-secret-namespace` parameters in a storageclass to `longhorn-system` rather than `${pvc.namespace}`.

However, for existing encrypted RWX volumes, there seems to be no workaround. We need to roll back the change for the secret. cc @innobead

This is fair.
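For new volumes, the storage-class workaround could look like the following sketch, based on the linked encryption storage class (the class name, replica count, and secret name `longhorn-crypto` are example values):

```shell
kubectl apply -f - <<'EOF'
kind: StorageClass
apiVersion: storage.k8s.io/v1
metadata:
  name: longhorn-crypto-global   # example name
provisioner: driver.longhorn.io
allowVolumeExpansion: true
parameters:
  numberOfReplicas: "2"
  encrypted: "true"
  # Point every secret namespace at longhorn-system instead of ${pvc.namespace}
  csi.storage.k8s.io/provisioner-secret-name: "longhorn-crypto"
  csi.storage.k8s.io/provisioner-secret-namespace: "longhorn-system"
  csi.storage.k8s.io/node-publish-secret-name: "longhorn-crypto"
  csi.storage.k8s.io/node-publish-secret-namespace: "longhorn-system"
  csi.storage.k8s.io/node-stage-secret-name: "longhorn-crypto"
  csi.storage.k8s.io/node-stage-secret-namespace: "longhorn-system"
EOF
```

Note that with a single secret in `longhorn-system`, all tenants share one encryption key, which is exactly the multi-tenancy trade-off raised earlier in the thread.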
Then, we should check whether getting the secret directly from the API server, instead of relying on caches, works in this case, rather than just rolling back the new implementation, which could reintroduce the potential performance issues/concerns if they do matter.