longhorn: [BUG] Longhorn gives 500 error when trying to provision a volume created using a snapshot.
Describe the bug
In Harvester, use a snapshot to create a volume and, while the volume is still in the Pending state, immediately attach it to a VM. Start the VM: the VM gets stuck in the Starting state. SSH into one of the nodes and run kubectl describe pvc <volume-name>; Longhorn reports the following events:
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Warning ProvisioningFailed 23s (x4 over 37s) driver.longhorn.io_csi-provisioner-77b757f445-6gvqc_f518389e-c9b8-4d09-abd4-8e143c33965e failed to provision volume with StorageClass "longhorn-image-8rtv9": rpc error: code = Internal desc = Bad response statusCode [500]. Status [500 Internal Server Error]. Body: [message=unable to create volume: unable to create volume pvc-bf336afd-ad19-484d-bac8-60fbacedbfd6: failed to verify data source: cannot get client for volume pvc-6b2fd344-3843-48e8-9fff-bc4e5a090b3a: engine is not running, code=Server Error, detail=] from [http://longhorn-backend:9500/v1/volumes]
Normal ExternalProvisioning 8s (x5 over 37s) persistentvolume-controller waiting for a volume to be created, either by external provisioner "driver.longhorn.io" or manually created by system administrator
Normal Provisioning 8s (x8 over 37s) driver.longhorn.io_csi-provisioner-77b757f445-6gvqc_f518389e-c9b8-4d09-abd4-8e143c33965e External provisioner is provisioning volume for claim "default/restored"
Warning ProvisioningFailed 8s (x4 over 37s) driver.longhorn.io_csi-provisioner-77b757f445-6gvqc_f518389e-c9b8-4d09-abd4-8e143c33965e failed to provision volume with StorageClass "longhorn-image-8rtv9": rpc error: code = Internal desc = Bad response statusCode [500]. Status [500 Internal Server Error]. Body: [code=Server Error, detail=, message=unable to create volume: unable to create volume pvc-bf336afd-ad19-484d-bac8-60fbacedbfd6: failed to verify data source: cannot get client for volume pvc-6b2fd344-3843-48e8-9fff-bc4e5a090b3a: engine is not running] from [http://longhorn-backend:9500/v1/volumes]
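For reference, the request that hits this data-source verification path is a PVC whose dataSource points at a snapshot of an existing volume. A minimal sketch is below; the snapshot name and size are illustrative, while the claim name and StorageClass are taken from the events above:

kubectl apply -f - <<EOF
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: restored                          # claim "default/restored" from the events above
  namespace: default
spec:
  storageClassName: longhorn-image-8rtv9  # StorageClass from the error message
  dataSource:
    apiGroup: snapshot.storage.k8s.io
    kind: VolumeSnapshot
    name: source-snapshot                 # illustrative VolumeSnapshot name
  accessModes:
    - ReadWriteOnce
  resources:
    requests:
      storage: 10Gi                       # illustrative size
EOF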
To Reproduce
Reproduce steps here
Expected behavior
The volume should be attached to the VM after it is provisioned by Longhorn, and the VM should boot up without problems.
Log or Support bundle
longhorn-support-bundle_902bc133-4666-44c4-8e51-093f4093bfdf_2022-10-27T01-38-29Z.zip
Environment
- Longhorn version: 1.3.2
- Installation method (e.g. Rancher Catalog App/Helm/Kubectl): Helm in Harvester
- Kubernetes distro (e.g. RKE/K3s/EKS/OpenShift) and version: RKE2 v1.24.7+rke2r1
- Number of management node in the cluster: 3
- Number of worker node in the cluster:
- Node config
- OS type and version: Harvester v1.1.0
- CPU per node: 8
- Memory per node: 32Gi
- Disk type(e.g. SSD/NVMe): VirtIO HDD
- Network bandwidth between the nodes:
- Underlying Infrastructure (e.g. on AWS/GCE, EKS/GKE, VMWare/KVM, Baremetal): Virtualized Harvester on Proxmox VE 7.2-3
- Number of Longhorn volumes in the cluster: 4
About this issue
- Original URL
- State: closed
- Created 2 years ago
- Reactions: 1
- Comments: 19 (16 by maintainers)
I can’t reproduce the issue in Harvester v1.2.1 with LH v1.4.3 with the following steps:
It looks like the source snapshot is on a detached volume:
unable to create volume: unable to create volume pvc-bf336afd-ad19-484d-bac8-60fbacedbfd6: failed to verify data source: cannot get client for volume pvc-6b2fd344-3843-48e8-9fff-bc4e5a090b3a: engine is not running
Could we verify that we don’t have this problem when the source volume is in the attached state? @masteryyh @weizhe0422
Btw, provisioning a new volume from a snapshot of a detached volume will require the enhancement https://github.com/longhorn/longhorn-manager/pull/1541. This is a big feature, so I think it is not possible to backport it to 1.4.x. cc @innobead
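One quick way to check whether the source volume is attached before provisioning from its snapshot is to look at its Longhorn Volume CR; a sketch, where <source-volume> is a placeholder for the PV name of the snapshot's source volume:

# The state should be "attached" (engine running) for snapshot-based provisioning to succeed on current releases
kubectl -n longhorn-system get volumes.longhorn.io <source-volume> -o jsonpath='{.status.state}{"\n"}'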
cc @innobead I think we can close the issue for now since @FrankYang0529 has tested and it works as expected now.
@masteryyh Can you help answer the questions from @PhanLe1010 ? thanks.
@hunghvu your case looks different because the error is due to 'more than one engine exists', which just means the source volume could be a migrating volume. I suggest creating an issue in the Harvester GitHub repo instead to clarify the cause there.
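To confirm whether a source volume is in that migrating state, listing its engine CRs should show more than one entry. A sketch, assuming the engine objects carry the usual longhornvolume label (<volume-name> is a placeholder):

# A non-migrating volume has exactly one engine CR; two entries indicate a live migration in progress
kubectl -n longhorn-system get engines.longhorn.io -l longhornvolume=<volume-name>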
Result:
There are no issues with the attached volume; a PVC could be created from a VolumeSnapshot CR of the attached volume normally. After scaling the deployment down to 0, the volume is detached, and creating a PVC from a VolumeSnapshot CR of the detached volume then fails.
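(For the detached case, the source volume was detached by scaling its workload down; the deployment name below is a placeholder.)

# Once no pod mounts the PVC any more, Longhorn detaches the volume
kubectl scale deployment <workload> --replicas=0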
Steps:
- Create a VolumeSnapshotClass by the manifest
- Create a VolumeSnapshot by the manifest
As mentioned in this section: csi-volume-snapshot-associated-with-longhorn-snapshot/#current-limitation
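The manifests referenced in the steps are presumably along the lines of the Longhorn documentation examples; a sketch, with illustrative object names and assuming type: snap (an in-cluster Longhorn snapshot rather than a backup):

kubectl apply -f - <<EOF
apiVersion: snapshot.storage.k8s.io/v1
kind: VolumeSnapshotClass
metadata:
  name: longhorn-snapshot-vsc              # illustrative name
driver: driver.longhorn.io
deletionPolicy: Delete
parameters:
  type: snap                               # CSI snapshot backed by an in-cluster Longhorn snapshot
---
apiVersion: snapshot.storage.k8s.io/v1
kind: VolumeSnapshot
metadata:
  name: test-snapshot                      # illustrative name
spec:
  volumeSnapshotClassName: longhorn-snapshot-vsc
  source:
    persistentVolumeClaimName: test-pvc    # illustrative PVC bound to the volume under test
EOF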
@guangbochen It’s planned for 1.5.0, so it will be naturally backported to 1.4.x.
@innobead we may need to backport this issue to the LH v1.4.x milestone. For Harvester, the planned release date is April/04 with v1.2.0. Can you please help double-check if this is possible? Thanks.