longhorn: [BUG] Longhorn gives 500 error when trying to provision a volume created using a snapshot.

Describe the bug

In Harvester, create a volume from a snapshot and, while the volume is still in the Pending state, immediately attach it to a VM. Start the VM: it gets stuck in the Starting state. SSH into one of the nodes and run kubectl describe pvc <volume-name> to see the following messages from Longhorn:

Events:
  Type     Reason                Age                From                                                                                      Message
  ----     ------                ----               ----                                                                                      -------
  Warning  ProvisioningFailed    23s (x4 over 37s)  driver.longhorn.io_csi-provisioner-77b757f445-6gvqc_f518389e-c9b8-4d09-abd4-8e143c33965e  failed to provision volume with StorageClass "longhorn-image-8rtv9": rpc error: code = Internal desc = Bad response statusCode [500]. Status [500 Internal Server Error]. Body: [message=unable to create volume: unable to create volume pvc-bf336afd-ad19-484d-bac8-60fbacedbfd6: failed to verify data source: cannot get client for volume pvc-6b2fd344-3843-48e8-9fff-bc4e5a090b3a: engine is not running, code=Server Error, detail=] from [http://longhorn-backend:9500/v1/volumes]
  Normal   ExternalProvisioning  8s (x5 over 37s)   persistentvolume-controller                                                               waiting for a volume to be created, either by external provisioner "driver.longhorn.io" or manually created by system administrator
  Normal   Provisioning          8s (x8 over 37s)   driver.longhorn.io_csi-provisioner-77b757f445-6gvqc_f518389e-c9b8-4d09-abd4-8e143c33965e  External provisioner is provisioning volume for claim "default/restored"
  Warning  ProvisioningFailed    8s (x4 over 37s)   driver.longhorn.io_csi-provisioner-77b757f445-6gvqc_f518389e-c9b8-4d09-abd4-8e143c33965e  failed to provision volume with StorageClass "longhorn-image-8rtv9": rpc error: code = Internal desc = Bad response statusCode [500]. Status [500 Internal Server Error]. Body: [code=Server Error, detail=, message=unable to create volume: unable to create volume pvc-bf336afd-ad19-484d-bac8-60fbacedbfd6: failed to verify data source: cannot get client for volume pvc-6b2fd344-3843-48e8-9fff-bc4e5a090b3a: engine is not running] from [http://longhorn-backend:9500/v1/volumes]
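
The error body indicates that the engine of the source volume (the volume the snapshot belongs to) is not running. A minimal sketch for confirming this, assuming the default longhorn-system namespace (the volume name is taken from the events above):

kubectl -n longhorn-system get volumes.longhorn.io pvc-6b2fd344-3843-48e8-9fff-bc4e5a090b3a
kubectl -n longhorn-system get engines.longhorn.io | grep pvc-6b2fd344-3843-48e8-9fff-bc4e5a090b3a

The first command shows the source volume's state (e.g. detached); the second shows whether its engine is running.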

To Reproduce

See the steps in the bug description above.

Expected behavior

The volume should be attached to the VM after Longhorn provisions it, and the VM should boot up without problems.

Log or Support bundle

longhorn-support-bundle_902bc133-4666-44c4-8e51-093f4093bfdf_2022-10-27T01-38-29Z.zip

Environment

  • Longhorn version: 1.3.2
  • Installation method (e.g. Rancher Catalog App/Helm/Kubectl): Helm in Harvester
  • Kubernetes distro (e.g. RKE/K3s/EKS/OpenShift) and version: RKE2 v1.24.7+rke2r1
    • Number of management nodes in the cluster: 3
    • Number of worker nodes in the cluster:
  • Node config
    • OS type and version: Harvester v1.1.0
    • CPU per node: 8
    • Memory per node: 32Gi
    • Disk type(e.g. SSD/NVMe): VirtIO HDD
    • Network bandwidth between the nodes:
  • Underlying Infrastructure (e.g. on AWS/GCE, EKS/GKE, VMWare/KVM, Baremetal): Virtualized Harvester on Proxmox VE 7.2-3
  • Number of Longhorn volumes in the cluster: 4

About this issue

  • State: closed
  • Created 2 years ago
  • Reactions: 1
  • Comments: 19 (16 by maintainers)

Most upvoted comments

I can’t reproduce the issue in Harvester v1.2.1 with LH v1.4.3 using the following steps:

  1. Create a VM.
  2. Create a snapshot for the volume in the VM.
  3. After the snapshot finishes, stop the VM. The volume is detached automatically.
  4. Create another PVC from the snapshot and update the VM to use the new PVC. (We cannot select a pending PVC in the GUI, so I update the VM YAML directly; see the sketch after this list.)
  5. Before the new PVC is bound, start the VM.
  6. The VM boots up without error.
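
A rough sketch of the manual edit in step 4 (the VM name and namespace are assumptions here, not taken from the issue): point the VM's existing disk at the new PVC created from the snapshot.

kubectl -n default edit virtualmachine <vm-name>
# change spec.template.spec.volumes[].persistentVolumeClaim.claimName
# to the name of the new PVC created from the VolumeSnapshot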

It looks like the source snapshot is on a detached volume: unable to create volume: unable to create volume pvc-bf336afd-ad19-484d-bac8-60fbacedbfd6: failed to verify data source: cannot get client for volume pvc-6b2fd344-3843-48e8-9fff-bc4e5a090b3a: engine is not running. Could we verify that we don’t have this problem when the source volume is in the attached state? @masteryyh @weizhe0422

Btw, provisioning a new volume from a snapshot of a detached volume will require the enhancement https://github.com/longhorn/longhorn-manager/pull/1541. This is a big feature, so I think it is not possible to backport it to 1.4.x. cc @innobead

cc @innobead I think we can close the issue for now since @FrankYang0529 has tested it and it works as expected now.

It looks like the source snapshot is on a detached volume: unable to create volume: unable to create volume pvc-bf336afd-ad19-484d-bac8-60fbacedbfd6: failed to verify data source: cannot get client for volume pvc-6b2fd344-3843-48e8-9fff-bc4e5a090b3a: engine is not running. Could we verify that we don’t have this problem when the source volume is in the attached state? @masteryyh @weizhe0422

@masteryyh Can you help answer the questions from @PhanLe1010? Thanks.

@hunghvu your case looks different because the error is due to more than one engine existing, which just means the source volume could be a migrating volume. I suggest creating an issue in the Harvester GitHub repo instead to clarify the cause there.
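
A hedged way to check for that condition (the namespace and the longhornvolume label are assumptions on my side): count the engine CRs that belong to the source volume; a migrating volume has more than one.

kubectl -n longhorn-system get engines.longhorn.io -l longhornvolume=<source-volume-name> --no-headers | wc -l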

Result:

There are no issues with the attached volume: a PVC can be created normally from a VolumeSnapshot CR of the attached volume. After scaling the deployment down to 0, the volume is detached, and creating a PVC from a VolumeSnapshot CR of the detached volume then fails (see the sketch after the steps below).

Steps:

  1. Enable CSI Snapshot support
  2. Use the Longhorn deployment example to create a PVC, a PV, and a deployment.
  3. Create a VolumeSnapshotClass with the following manifest:
kind: VolumeSnapshotClass
apiVersion: snapshot.storage.k8s.io/v1
metadata:
  name: longhorn-snapshot-vsc
driver: driver.longhorn.io
deletionPolicy: Delete
parameters:
  type: snap
  4. Create the VolumeSnapshot with the following manifest:
apiVersion: snapshot.storage.k8s.io/v1
kind: VolumeSnapshot
metadata:
  name: test-csi-volume-snapshot-longhorn-snapshot
spec:
  volumeSnapshotClassName: longhorn-snapshot-vsc
  source:
    persistentVolumeClaimName: mysql-pvc
  5. Create the PVC with the following manifest:
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: restore-from-csi-snapshot-pvc
spec:
  storageClassName: longhorn
  dataSource:
    name: test-csi-volume-snapshot-longhorn-snapshot
    kind: VolumeSnapshot
    apiGroup: snapshot.storage.k8s.io
  accessModes:
    - ReadWriteOnce
  resources:
    requests:
      storage: 2Gi
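
A sketch of the detach-then-restore check from the Result above (the deployment name mysql is assumed from the Longhorn example, and the manifest file name is hypothetical):

kubectl scale deployment mysql --replicas=0                # detaches the source volume
kubectl -n longhorn-system get volumes.longhorn.io         # the source volume now reports detached
kubectl delete pvc restore-from-csi-snapshot-pvc           # remove the PVC created while the volume was attached
kubectl apply -f restore-from-csi-snapshot-pvc.yaml        # re-apply the PVC manifest from step 5
kubectl describe pvc restore-from-csi-snapshot-pvc         # ProvisioningFailed: engine is not running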

As mentioned in this section of the documentation: csi-volume-snapshot-associated-with-longhorn-snapshot/#current-limitation

@guangbochen It’s planned for 1.5.0, so it will be naturally backported to 1.4.x.

@innobead we may need to backport this issue to the LH v1.4.x milestone. For Harvester, the planned release date is April 04 with v1.2.0. Can you please help double-check whether this is possible? Thanks.