longhorn: [BUG] Failing to mount encrypted volumes v1.5.2

Describe the bug (🐛 if you encounter this issue)

Encrypted volumes work perfectly in v1.5.1 but no longer do in v1.5.2.

Error messages from the pod events:

Normal   Scheduled           24m                 default-scheduler        Successfully assigned flask/redis-6db4847b89-7xrq8 to frame2.rfed.me
  Warning  FailedAttachVolume  34s (x19 over 23m)  attachdetach-controller  AttachVolume.Attach failed for volume "pvc-584281a4-60cd-403b-a522-d12a0a15eab2" : rpc error: code = Internal desc = volume pvc-584281a4-60cd-403b-a522-d12a0a15eab2 failed to attach to node frame2.rfed.me with attachmentID csi-894ac457228fd8e53fbe692158f44ecdb522e03727fce491e8715a1c093ee7cf: Waiting for volume share to be available
  Warning  FailedMount         8s (x11 over 22m)   kubelet                  Unable to attach or mount volumes: unmounted volumes=[redis-data], unattached volumes=[redis-data kube-api-access-sf9sd]: timed out waiting for the condition

To Reproduce

  1. Deploy Longhorn v1.5.2.
  2. Add the storage class: https://github.com/clemenko/k8s_yaml/blob/master/longhorn_encryption.yml
  3. Deploy an app that uses it: https://github.com/clemenko/fleet/blob/main/flask/flask.yaml (a command sketch follows)
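
A minimal sketch of the reproduction, assuming the Longhorn Helm repo is already added and using the raw manifest URLs for the two files linked above:

    # install/upgrade Longhorn v1.5.2 via Helm
    helm upgrade --install longhorn longhorn/longhorn \
      --namespace longhorn-system --create-namespace --version 1.5.2
    # storage class used for encrypted volumes (from the linked repo)
    kubectl apply -f https://raw.githubusercontent.com/clemenko/k8s_yaml/master/longhorn_encryption.yml
    # sample app that consumes the encrypted volume
    kubectl apply -f https://raw.githubusercontent.com/clemenko/fleet/main/flask/flask.yaml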

Expected behavior

The volume mounts correctly.

Support bundle for troubleshooting

Attaching soon.

Environment

  • Longhorn version: v1.5.2
  • Installation method (e.g. Rancher Catalog App/Helm/Kubectl): helm
  • Kubernetes distro (e.g. RKE/K3s/EKS/OpenShift) and version: rke2
    • Number of management node in the cluster: 1
    • Number of worker node in the cluster: 2
  • Node config
    • OS type and version: Rocky 9
    • Kernel version:
    • CPU per node: 4
    • Memory per node: 8 GB
    • Disk type(e.g. SSD/NVMe/HDD): SSD
    • Network bandwidth between the nodes: 1 Gbps
  • Underlying Infrastructure (e.g. on AWS/GCE, EKS/GKE, VMWare/KVM, Baremetal): Digital Ocean
  • Number of Longhorn volumes in the cluster: 1
  • Impacted Longhorn resources:
    • Volume names:

Additional context

This works in v1.5.1 and it broke in v1.5.2.

About this issue

  • Original URL
  • State: closed
  • Created 8 months ago
  • Reactions: 1
  • Comments: 17 (9 by maintainers)

Most upvoted comments

that worked!

Can we directly query any existing secret in any non-longhorn-system namespace as a fix, instead of rolling back the implementation? Also, we need to add a note about this behavior change.

Investigating this part. This is a better solution.

BTW, as a workaround, where does the CSI node driver get the secret after the volume is provisioned? Can users make a change there as a workaround? I believe that info should be saved in a resource like the PV.

The fields in the PersistentVolume resource are immutable. One possible workaround for an existing volume is replacing (deleting and recreating) the PV, as sketched in the commands after the steps below:

  1. Scale down the workload
  2. Create a secret in longhorn-system namespace
  3. Execute kubectl get pv <pv name> -o yaml > pv.yaml
  4. Update the spec.csi.nodePublishSecretRef.namespace and spec.csi.nodeStageSecretRef.namespace to longhorn-system in pv.yaml
  5. Execute kubectl replace --cascade=false --force -f pv.yaml
  6. In another terminal, kubectl edit pv <pv name>, then remove the finalizer. The replacement should succeed.
  7. Scale up the workload
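
A command-level sketch of those steps; <app namespace>, <workload>, <secret name>, and <pv name> are placeholders for the affected application's namespace, workload, per-PVC secret, and PV:

    # 1. Scale down the workload that uses the volume
    kubectl -n <app namespace> scale deployment <workload> --replicas=0
    # 2. Re-create the per-PVC secret in the longhorn-system namespace
    kubectl -n <app namespace> get secret <secret name> -o yaml > secret.yaml
    #    (edit secret.yaml: set metadata.namespace to longhorn-system,
    #     drop uid/resourceVersion, then apply)
    kubectl apply -f secret.yaml
    # 3. Dump the PV
    kubectl get pv <pv name> -o yaml > pv.yaml
    # 4. Edit pv.yaml: set spec.csi.nodePublishSecretRef.namespace and
    #    spec.csi.nodeStageSecretRef.namespace to longhorn-system
    # 5. Replace the PV (this blocks until the finalizer is removed in step 6)
    kubectl replace --cascade=false --force -f pv.yaml
    # 6. In another terminal, remove the finalizer so the replacement can finish
    kubectl edit pv <pv name>
    # 7. Scale the workload back up
    kubectl -n <app namespace> scale deployment <workload> --replicas=1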

Having the secret per volume is vital for a multi-tenant situation.

2023-11-05T14:59:54.108520707Z time="2023-11-05T14:59:54Z" level=error msg="Failed to sync Longhorn share manager" func=controller.handleReconcileErrorLogging file="utils.go:72" ShareManager=longhorn-system/pvc-d4f17cad-5773-431f-b9f2-ecba2a7b1a46 controller=longhorn-share-manager error="failed to sync longhorn-system/pvc-d4f17cad-5773-431f-b9f2-ecba2a7b1a46: failed to create pod for share manager: secret \"redis\" not found" node=rke2

The issue is due to https://github.com/longhorn/longhorn/issues/6954. We only sync the secrets in the longhorn-system namespace to avoid high memory consumption from secrets, configmaps, and so on in other namespaces. However, the share-manager pod will get the secret from pv.Spec.CSI.NodePublishSecretRef (https://github.com/longhorn/longhorn-manager/blob/master/controller/share_manager_controller.go#L756)… We didn’t notice this.

The workaround for a new volume is to create the secret in longhorn-system and set the xxx-secret-namespace parameters in the StorageClass to longhorn-system rather than ${pvc.namespace}, as used in the current parameters below:

  parameters:
    csi.storage.k8s.io/node-publish-secret-name: ${pvc.name}
    csi.storage.k8s.io/node-publish-secret-namespace: ${pvc.namespace}
    csi.storage.k8s.io/node-stage-secret-name: ${pvc.name}
    csi.storage.k8s.io/node-stage-secret-namespace: ${pvc.namespace}
    csi.storage.k8s.io/provisioner-secret-name: ${pvc.name}
    csi.storage.k8s.io/provisioner-secret-namespace: ${pvc.namespace}
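
For comparison, a sketch of the adjusted parameters for new volumes, assuming each per-PVC secret (still named ${pvc.name}) is created in longhorn-system beforehand:

  parameters:
    csi.storage.k8s.io/node-publish-secret-name: ${pvc.name}
    csi.storage.k8s.io/node-publish-secret-namespace: longhorn-system
    csi.storage.k8s.io/node-stage-secret-name: ${pvc.name}
    csi.storage.k8s.io/node-stage-secret-namespace: longhorn-system
    csi.storage.k8s.io/provisioner-secret-name: ${pvc.name}
    csi.storage.k8s.io/provisioner-secret-namespace: longhorn-system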

However, for existing encrypted RWX volumes, there seems to be no workaround. We need to roll back the change for the secret. cc @innobead

Having the secret per volume is vital for a multi-tenant situation.

This is fair.

Then, instead of just rolling back the new implementation, we should check whether we can get the secret directly from the API server rather than relying on caches in this case, and whether that causes any potential performance issues/concerns if it matters.